What did we learn from predicting song popularity?
A recent data science competition focused on whether one could predict the success of a Taylor Swift song from less than half a second of audio. One sample sounded like:
The objective was to predict whether that clip (and some 20k others) came from a popular song or not. As mentioned in a previous post, this is a case where there aren't descriptive elements that "make sense" - no genre, gender, demographics, contact history, or anything else that marketers typically rely on to help them understand what is going on. Just 3,300 numbers - for the audio folks among you, down-sampled to 11k.
The winning models got to above 99% accuracy, which sounds a bit too good to be true. While that figure is technically correct, there are some interesting lessons to be learned.
First, rethinking the problem as a segmentation problem rather than a set of individual estimates made the results a lot better. That is, grouping clips based on their similarity improved the accuracy - this is no different from targeting audiences as opposed to specific individuals.
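To make the segmentation idea concrete, here is a minimal sketch in Python. It is not the competition code: the distance threshold, the toy two-number "clips", the per-clip scores, and the function names group_clips and smooth_scores are all assumptions made up for illustration. Clips that sit close together are put in one segment, and each clip then takes its segment's average score instead of its own.

```python
from math import dist


def group_clips(clips, threshold):
    """Put clips in the same segment when they are closer than `threshold`.

    Uses a small union-find so that chains of similar clips end up together.
    Returns one segment id per clip.
    """
    parent = list(range(len(clips)))

    def find(i):
        # Follow parent links to the segment root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(clips)):
        for j in range(i + 1, len(clips)):
            if dist(clips[i], clips[j]) < threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(clips))]


def smooth_scores(scores, groups):
    """Replace each clip's score with the mean score of its segment."""
    totals, counts = {}, {}
    for score, group in zip(scores, groups):
        totals[group] = totals.get(group, 0.0) + score
        counts[group] = counts.get(group, 0) + 1
    return [totals[group] / counts[group] for group in groups]


# Toy data (invented): five "clips" of two numbers each, not 3,300.
clips = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0)]
# Hypothetical per-clip popularity scores from some individual model.
scores = [0.9, 0.4, 0.8, 0.2, 0.1]

groups = group_clips(clips, threshold=1.0)
segment_scores = smooth_scores(scores, groups)
```

In this toy case the second clip scores 0.4 on its own, which a 0.5 cutoff would call "not popular", but its segment averages about 0.7, so the group pulls it to the right side. That is the audience-versus-individual effect in miniature.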
Second, simple models tended to work as well as complex ones. In this case accuracy mattered, so effort was put into improving it as much as possible. But there are times when good enough is, well, good enough. It turns out that a simple model of "how similar is this clip to its nearest neighbor" worked very well. With this challenge that makes sense: pop songs are similar across their 3-4 minutes. After the competition I tried some really, really simple (and possibly stupid) ideas and did better than my official submission. Don't overthink it.
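For the curious, the nearest-neighbor idea fits in a few lines of Python. This is a sketch under assumptions, not anyone's actual submission: it uses plain Euclidean distance, a single neighbor, and invented two-number clips with made-up labels (1 for popular, 0 for not). The function name nearest_neighbor_label is hypothetical.

```python
from math import dist


def nearest_neighbor_label(clip, train_clips, train_labels):
    """Return the label of the training clip closest to `clip`."""
    # Index of the training clip with the smallest Euclidean distance.
    best = min(range(len(train_clips)), key=lambda i: dist(clip, train_clips[i]))
    return train_labels[best]


# Toy training data (invented): 1 = popular, 0 = not popular.
train_clips = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
train_labels = [1, 1, 0, 0]

# Two unseen clips, one near each cluster.
new_clips = [(0.2, 0.2), (4.8, 4.9)]
predictions = [
    nearest_neighbor_label(clip, train_clips, train_labels) for clip in new_clips
]
```

There is nothing to train here; the model is just the labeled clips themselves, which is why it works when clips from the same song look alike.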
Third, we needed to take a step back and align the technique with the problem. The same data and the same logic produced different results based simply on the approach taken. Always use two or more methods.
A big thanks to Devin Diderickson for posting his approach and thought process. (He finished second; I finished sixth.)