Large Scale Machine Learning and Other Animals: Interesting dataset: million songs dataset

Friday, January 4, 2013

Interesting dataset: million songs dataset

As you probably all know, we are always looking for additional free, high-quality datasets to try our techniques on. I got the link to the million songs dataset from Clive Cox, Chief Scientist at Rammble Labs, our man in London.

Here is some information from their website:

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.
The Million Song Dataset is also a cluster of complementary datasets contributed by the community:

SecondHandSongs dataset -> cover songs

musiXmatch dataset -> lyrics

Last.fm dataset -> song-level tags and similarity

Taste Profile subset -> user data

Here is information on getting the dataset. Kaggle hosted a contest on recommending music items drawn from this dataset. To evaluate performance, they used the MAP@500 metric (mean average precision, truncated at the top 500 recommendations per user), described here. Anyway, I am soon going to try out our GraphChi CF toolbox on this dataset. Stay tuned for some results!
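To make the MAP@500 metric a bit more concrete, here is a minimal Python sketch of one common way to compute truncated mean average precision. It is not the contest's official scoring code: the function names `average_precision_at_k` and `map_at_k`, the dictionary-based input layout, and the choice to normalize by min(number of relevant songs, k) are all assumptions made for illustration.

```python
from typing import Dict, Sequence, Set


def average_precision_at_k(recommended: Sequence[str],
                           relevant: Set[str],
                           k: int = 500) -> float:
    """Average precision over the top-k recommendations for one user.

    Assumption: the score is normalized by min(len(relevant), k).
    """
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(recommended[:k], start=1):
        # Count each relevant item once, even if it is recommended twice.
        if item in relevant and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)
    return score / min(len(relevant), k)


def map_at_k(recommendations: Dict[str, Sequence[str]],
             ground_truth: Dict[str, Set[str]],
             k: int = 500) -> float:
    """Mean of the per-user average precision over all users in ground_truth."""
    if not ground_truth:
        return 0.0
    total = sum(
        average_precision_at_k(recommendations.get(user, []), items, k)
        for user, items in ground_truth.items()
    )
    return total / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical toy data: two users, a handful of made-up song ids.
    recs = {"u1": ["a", "b", "c", "d"], "u2": ["x"]}
    truth = {"u1": {"a", "c"}, "u2": {"y"}}
    print(map_at_k(recs, truth, k=500))
```

In the toy data, user u1 gets hits at ranks 1 and 3, while u2 gets none, so the mean is pulled down by the user with no correct recommendations. Ranking relevant songs earlier in the list raises the score, which is the property that makes this metric a natural fit for recommendation tasks.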

An update: as promised, here are some GraphChi runtime results on the million songs dataset, along with instructions on how to reproduce them.
