What does it take to win at Kaggle? An Introduction to Data Strategy

For our latest project, we were assigned a completed Kaggle competition and tasked with analyzing the appearance of West Nile Virus (WNV) in Chicago to see whether we could identify where the virus would start.

Having learned my lesson from my experience with Ames housing prices, and being given a completed Kaggle leaderboard to start with, I wanted to know what constituted a ‘good’ score, and how hard it would be to get there.

Hoping for a nice clean graph, I took a look at the score submissions over time:

This is obviously a mess. If you look closely, there are some red, yellow, and green data points in there. These are, respectively, scores from the groups with:

The lowest 10 scores on the board
Scores closest to the average
Scores in the top 10 overall

These are isolated here:

Rolling over will isolate the groups.
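For anyone who wants to pull out the same three groups, here is a minimal sketch in plain Python. It assumes you have already reduced the leaderboard to one best score per team; the function name `label_groups`, the team names, and the scores are all made up for illustration, and using the same group size `k` for the "closest to the average" group is my assumption.

```python
from statistics import mean

def label_groups(best_scores, k=10):
    """Label the k lowest, k closest-to-average, and k highest teams.

    best_scores: dict mapping team name -> best score (hypothetical input).
    Returns a dict mapping team name -> "low", "average", or "high".
    Teams outside all three groups are left out of the result.
    """
    ranked = sorted(best_scores, key=best_scores.get)  # lowest score first
    avg = mean(best_scores.values())
    by_distance = sorted(best_scores, key=lambda t: abs(best_scores[t] - avg))

    labels = {}
    for team in by_distance[:k]:
        labels[team] = "average"
    # On a small board the groups can overlap; "low" and "high" win here.
    for team in ranked[:k]:
        labels[team] = "low"
    for team in ranked[-k:]:
        labels[team] = "high"
    return labels

# Toy data, not real leaderboard scores.
toy_board = {"team_a": 0.52, "team_b": 0.61, "team_c": 0.70,
             "team_d": 0.79, "team_e": 0.88}
groups = label_groups(toy_board, k=1)
```

Each label can then be mapped to a color (red, yellow, green) in whatever plotting tool you use.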

There seems to be a general elbow curve up and to the right, with groups slowly getting closer to a max score around .9. It also seems that groups who got low or medium scores in early enough had time to get into the realm of the high scorers by the end of the two-month competition.

One very important statistic jumps out: entries per team. The teams that took longer to work out their answers scored higher overall:

(The high scores go up over time)

The same teams were making a ton of attempts to get it right:

(Size of bubbles is attempts per team, shade is more blue for higher high scores)

You can barely see the orange - nobody got it right on their first try, not by a long shot.

So What Is Data Strategy?

Data Strategy is the way your initial understanding of the data impacts how you approach the project.

Let's collect some facts for a strategy here:

We have 10 days to do this project
Our first few attempts got us to a score around .71 using a built-in model with little tuning
Most teams took about 14 days from the time of the first entry to break the .8 threshold
High-scoring teams had time for about 20-30 entries, on average
There is a tremendous amount of density in the .7-.8 range
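The per-team numbers in the facts above (entries per team, days from first entry to the .8 threshold) can be worked out roughly like this. This is a sketch under assumptions: the submission log is a list of (team, date, score) tuples, `team_stats` is a name I made up, "breaking" the threshold is treated as reaching or exceeding it, and the sample rows are invented.

```python
from collections import defaultdict
from datetime import date

def team_stats(submissions, threshold=0.8):
    """Summarize each team's submission history.

    submissions: iterable of (team, day, score) tuples (hypothetical layout).
    Returns {team: {"entries", "best", "days_to_threshold"}}, where
    days_to_threshold is None if the team never reached the threshold.
    """
    by_team = defaultdict(list)
    for team, day, score in submissions:
        by_team[team].append((day, score))

    stats = {}
    for team, entries in by_team.items():
        entries.sort()  # chronological order
        first_day = entries[0][0]
        days_to_threshold = None
        for day, score in entries:
            if score >= threshold:  # assumption: "break" means reach or exceed
                days_to_threshold = (day - first_day).days
                break
        stats[team] = {
            "entries": len(entries),
            "best": max(score for _, score in entries),
            "days_to_threshold": days_to_threshold,
        }
    return stats

# Invented sample rows, not real competition data.
sample = [
    ("team_a", date(2015, 5, 1), 0.71),
    ("team_a", date(2015, 5, 15), 0.81),
    ("team_b", date(2015, 5, 3), 0.60),
]
summary = team_stats(sample)
```

Averaging `entries` over the top teams, or taking the median of `days_to_threshold` over teams that have one, gives the kind of numbers listed above.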

Assessment: With a decent score on the board, it's worth trying out a few simple things to bump our score a little bit, but not worth getting too invested. The top scores level off at a clear ceiling around .9, so our value add is in the interpretation.

Strategy: Whip up a model and write a blog.