
Saturday, February 2, 2013

Case study: million songs dataset

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive's suggestion, we now have an implementation of Fabio Aiolli's cost function, as explained in his paper A Preliminary Study for a Recommender System for the Million Songs Dataset, which describes the winning method in this contest.

Following are detailed instructions on how to use the GraphChi CF toolkit on the million songs dataset, for computing user ratings out of item similarities.

Instructions for computing item to item similarities:

1) For obtaining the dataset, download and extract this zip file.

2) Run createTrain.sh to download the million songs dataset and prepare GraphChi compatible format.
$ sh createTrain.sh
Note: this operation may take an hour or so to prepare the data.

3) Run GraphChi item based collaborative filtering, to find out the top 500 similar items for each item:

./toolkits/collaborative_filtering/itemcf --training=train --K=500 --asym_cosine_alpha=0.15 --distance=3 --min_allowed_intersection=5
Explanation: --training points to the training file. --K=500 means we compute the top 500 similar items.
--distance=3 selects Aiolli's metric. --min_allowed_intersection=5 means we only take into account item pairs that were rated together by at least 5 users.

Note: this operation requires around 20GB of memory and may take a few hours...
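For readers curious what --distance=3 actually computes, here is a minimal Python sketch of the asymmetric cosine similarity from Aiolli's paper: sim(i,j) = |U_i ∩ U_j| / (|U_i|^α · |U_j|^(1-α)), where U_i is the set of users who rated item i. The toy data and function names below are mine for illustration, not GraphChi's:

```python
from collections import defaultdict

# Toy user -> listened-items data (hypothetical; the real input is the
# GraphChi-formatted million songs train file).
user_items = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c"},
}

# Invert to item -> users, as item-based CF effectively does.
item_users = defaultdict(set)
for u, items in user_items.items():
    for it in items:
        item_users[it].add(u)

def asym_cosine(i, j, item_users, alpha=0.15):
    """Asymmetric cosine from Aiolli's paper:
    sim(i, j) = |U_i & U_j| / (|U_i|**alpha * |U_j|**(1 - alpha))."""
    ui, uj = item_users[i], item_users[j]
    inter = len(ui & uj)
    if inter == 0:
        return 0.0
    return inter / (len(ui) ** alpha * len(uj) ** (1 - alpha))

print(asym_cosine("a", "b", item_users))
```

Note that with alpha != 0.5 the measure is asymmetric: sim(a, b) differs from sim(b, a), which is exactly the point of the --asym_cosine_alpha knob.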

Create user recommendations based on item similarities:

1) Run itemsim2rating to compute recommendations based on item similarities
$ rm -fR train.* train-topk.*
$ ./toolkits/collaborative_filtering/itemsim2rating --training=train --similarity=train-topk --K=500 membudget_mb 50000 --nshards=1 --max_iter=2 --Q=3 --clean_cache=1
Note: this operation may require 20GB of RAM and may take a couple of hours based on your computer configuration.

Output file is: train-rec
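Conceptually, itemsim2rating scores each candidate item for a user by aggregating the similarities between that candidate and the items the user has already listened to, with the --Q exponent controlling how much weight strong similarities get. This is my own simplified sketch of that aggregation (data and names are hypothetical, not the tool's internals):

```python
def score(user_items, sims, candidate, Q=3):
    """Score a candidate item for a user by summing (similarity ** Q)
    over the items the user already has -- a sketch of the aggregation
    step itemsim2rating performs."""
    return sum(sims.get((j, candidate), 0.0) ** Q for j in user_items)

# Toy similarity table: (known item, candidate item) -> similarity.
sims = {("a", "c"): 0.5, ("b", "c"): 0.25}

print(score({"a", "b"}, sims, "c"))  # 0.5**3 + 0.25**3
```

Recommendations per user are then the top-K candidates by this score, excluding items the user already has.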

Evaluating the result

1) Prepare test data:
./toolkits/parsers/topk --training=test --K=500

Output file is: test.ids

2) Prepare training recommendations: 
./toolkits/parsers/topk --training=train-rec --K=500

Output file is: train-rec.ids

3) Compute mean average precision @ 500:
./toolkits/collaborative_filtering/metric_eval --training=train-rec.ids --test=test.ids --K=500
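For reference, mean average precision at k can be sketched as follows. This is my own illustration of the metric; metric_eval's exact edge-case handling (e.g. users with empty test sets) may differ:

```python
def average_precision_at_k(recommended, relevant, k=500):
    """AP@k for one user: average of the precision values at each rank
    where a relevant item appears in the top-k recommendations."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def map_at_k(all_recs, all_relevant, k=500):
    """MAP@k: mean of AP@k over all evaluated users."""
    return sum(average_precision_at_k(all_recs[u], all_relevant[u], k)
               for u in all_recs) / len(all_recs)

# Toy example: hits at ranks 1 and 3 out of 2 relevant items.
print(map_at_k({"u1": ["a", "b", "c"]}, {"u1": {"a", "c"}}, k=3))
```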

About performance: 

With the following settings: --min_allowed_intersection=5, K=500, Q=1, alpha=0.15 we get:
INFO:     metric_eval.cpp(eval_metrics:114): 7.48179 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.151431

With --min_allowed_intersection=1, K=2500, Q=1, alpha=0.15 we get:

INFO:     metric_eval.cpp(eval_metrics:114): 6.0811 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.167994


Acknowledgements:

  • Clive Cox, RummbleLabs.com, for proposing to implement item-based recommendations in GraphChi, and for his support in the process of implementing this method.
  • Fabio Aiolli, University of Padova, winner of the million songs dataset contest, for his great support regarding the implementation of his metric.

Friday, February 1, 2013

Spotlight: Kaggle's RTA Challenge


I had an interesting talk with José P. González-Brenes, a 6th-year grad student from CMU's LTI department.
During the talk, I learned that José participated in Kaggle's RTA challenge and actually won 1st place out of more than 300 groups.

The challenge was to predict RTA highway travel times. The data consisted of recorded travel times of different cars over different highway segments. The winning solution (by José and Guido Matías Cortés) was based on a very simple method: a random forest. Unfortunately, no paper was published about it, but here is a blog post summarizing the solution method. And here is a link to their presentation. What is further interesting about the solution is that it consisted of only 90 lines of Matlab code!

The reason we actually talked is that José was recently trying out my GraphChi collaborative filtering code for his research, so I gave him some advice on which methods to use. Once he has some interesting results, I hope he will update us!


Friday, January 4, 2013

Interesting dataset: million songs dataset

As you probably all know, we are always looking for additional free, high-quality datasets to try some of our techniques on. I got the million songs dataset link from Clive Cox, Chief Scientist at Rummble Labs, our man in London.

Here is some information from their website:


The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.
The Million Song Dataset is also a cluster of complementary datasets contributed by the community:



Here is information on getting the dataset. Kaggle ran a contest for recommending music items drawn from this dataset. For evaluating performance they used the MAP@500 metric described here. Anyway, I am soon going to try out our GraphChi CF toolbox on this dataset. Keep posted for some results!

An update: as promised, here are some GraphChi runtime results on the million songs dataset and instructions on how to reproduce them.