In our opinion, there were two main challenges in this contest:
1) The magnitude of the data: the dataset has about 260M music ratings
in the range 0-> 100. Each rating is composed of user, music item, time of rating and the numeric rating. Time of ratings is between 2249 - 6649 (this is from the top of my head - minor inaccuracies possible).
2) Special characteristics of the data: data includes some taxonomy information which relates music tracks to albums, artists and genere.
Here is some more information about the data from the contest website:
The dataset is split into three subsets:
* - Train data: in the file trainIdx1.txt
* - Validation data: in the file validationIdx1.txt
* - Test data: in the file testIdx1.txt
For each subset, user rating data is grouped by user. First line for a user is formatted as:
Each of the next <#UserRatings> lines describes a single rating by
The scores are integers lying between 0 and 100. All user id's and item id's are consecutive integers, both starting at zero. Dates are integers describing number of days elapsed since an undisclosed date.
An item has at least 20 ratings in the total dataset (including train, validation, and test sets).
Each user has at least 10 ratings in the training data, which were given by the user earlier than his validation ratings. Then, each user has exactly four ratings in the validation data, which come earlier in time than the ratings by the same user in the test data. Finally, the test data holds the last 6 ratings of each user.
For the test subset, we withheld the scores. The contestants are asked to provide predictions for these test scores. The predictions should be arranged by the given test set order. Each prediction is quantized to one of the 256 evenly spaced numbers between 0 and 100 (100*0/255, 100*1/255,...,100*254/255, 100*255/255). Thus, a single unsigned byte would be dedicated for a predicted value, and predicting the 6,005,940 ratings of the test set would require 6,005,940 bytes. (We provide C and Python programs for converting into well-formatted prediction files.)
The evaluation criterion is the root mean squared error (RMSE) between predicted ratings and true ones.
The test subset is further split into two equally sized subsets, called Test1 and Test2. Test1 is used to rank contestants on the public Leaderboard and Test2 is used for deciding the winners of the contest. The split between Test1 and Test2 is not disclosed.
The dataset statistics are as follows:
#Users #Items #Ratings #TrainRatings #ValidationRatings #TestRatings
1,000,990 624,961 262,810,175 252,800,275 4,003,960 6,005,940
Besides of the rating itself, there is taxonomy information as well. In other words:
* Each music item has an album.
* Each album has an artist.
* Each album has a genre.
Here is an excerpt from the contest website:
A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist and associated genres. Similarly we have artist and genre annotation for the albums. Both users and items (tracks, albums, artists and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed.
Our solution method:
* We used GraphLab for rapid prototyping of multiple collaborative filtering algorithms and
for fine tuning their parameters. Multiple algorithms where blended together using the ensemble method (you are welcome to read more in our paper!).
* Most linear models take into account only the information about user, music item and rating. So the user-> item matrix is decomposed to build a linear model. However, if we know that a user likes Madona, he may like different Madona albums/or songs than the one he rated. So it is good to have this relation in the model. The same with user who likes Jazz, he may well like other albums of Jazz. Since the rating is for a single song, if we do not relate the artist to the song, we miss this information. We propose a novel method called MFITR (matrix factorization item taxonomy regularization)
1) Add feature vector for artist, that mean that if a user like a certain artist it will add to the ratings of other songs of this artist.
2) Add a linear connection between the artist and the items, or else if they will be independent, thus making section 1) not useful. So we added weights that links between user, artist and album. The weights force that if user rated one album or artist higher, it will result in other items from the same artist or album to be higher as well.
3) We did not use genere in our model, although in principal genre can be added using the same technique.
We believe that Yahoo KDD CUP contest data is a great academic resource for both academic researchers as well as high-tech companies who perform collaborative filtering tasks. This is why we release a large portion of our source code as part of GraphLab's collaborative filtering library. Python/Matlab scripts are available for converting the data between KDD format into GraphLab's format.
Recently, we learned from Noam Koenigstein, a Tel-Aviv PhD candidate, about a related paper that was submitted for publication: Yahoo! Music Recommendations: Modeling Music Ratings with Temporal Dynamics and Item Taxonomy. By Gideon Dror, Noam Koenigstein and Yehuda Koren: In ACM Conference on Recommender Systems (RecSys), 2011. While obtained independently, our solution has some similarities to their proposed solution by adding a feature vectors for artist and albums. Their paper have also nice overview of KDD data characteristics in section 3.
I got another interesting resource from Yao Wu: InnerSpace group website: http://apex.sjtu.edu.cn/apex_wiki/svdfeature - they won the 3rd place this year.
And here is the workshop website including all presentations.