Large Scale Machine Learning and Other Animals: Spotlight: Pandora Internet Radio

A while ago I met Eric Bieschke and Tao Ye at a GeekSessions event in SF.
I was really impressed by Eric's talk presenting Pandora Internet Radio, and I am sure everyone will agree that it is one of the coolest companies, with great large scale machine learning
applications. Here is a quick interview I held with Tao:

Q: Can you give a short description of Pandora, for those few who don't know the company?
A: Pandora is the leader in internet radio in the United States, offering a personalized experience for each of our listeners. We have pioneered a new form of radio that uses intrinsic qualities of music to initially create stations and then adapts playlists in real-time based on the individual feedback of each listener.

The Music Genome Project and our playlist generating algorithms form the technology foundation that enables us to deliver personalized radio to our listeners. These proprietary technologies power our ability to predict listener music preferences and play music content suited to the tastes of
each individual listener. The extensive musicological database of the Music Genome Project has been meticulously built by a team of professional musicians and musicologists analyzing up to 480 attributes, or genes, for every song in our vast collection, to capture the fundamental musical
properties of each recording. When a listener enters a single song, artist, composer or genre to start a station (a process we call seeding), our complex mathematical algorithms combine the genes cataloged by the Music Genome Project with individual and collective feedback to suggest songs and build personalized playlists.

Q: What is the magnitude of datasets you are working on?
A: As of July 2011, we had over 100 million registered users, and more
than 37 million monthly active users. Since the launch of Pandora in 2005, our listeners
have created 1.9 billion stations and have given more than 11 billion thumbs.
Containing over 900,000 songs from over 90,000 artists, we believe the
Music Genome Project is the most comprehensive analysis of music in the
world.

Q: Are there unique properties of your data relative to other datasets
like the Yahoo! KDD Cup?
A: Compared to KDD Cup 2011, our feedback dataset has binary data only
(thumbs up or thumbs down) instead of numeric ratings. In addition, all the feedback is
in context, that is, given for a music/comedic seed. Since users can start stations
from a song, an artist or a genre, there are close to 1 million possible
"contexts" for recommendations to live in. This has both computation
implications (scale makes running complex algorithms harder) and
recommendation implications (in some cases makes the problem easier).
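
To make the idea of in-context binary feedback concrete, here is a minimal sketch (not Pandora's actual code) that aggregates thumbs per (seed, song) pair instead of per song. The function name, the event format and the smoothing pseudo-counts are all assumptions made for illustration.

```python
from collections import defaultdict

def thumbs_ratios(events, prior_up=1.0, prior_down=1.0):
    """Return a smoothed thumbs-up ratio for each (seed, song) pair.

    events: iterable of (seed_id, song_id, is_thumb_up) tuples.
    prior_up / prior_down: hypothetical pseudo-counts that pull sparse
    pairs toward 0.5 so a single thumb does not dominate.
    """
    counts = defaultdict(lambda: [0, 0])  # (seed, song) -> [ups, downs]
    for seed_id, song_id, is_up in events:
        counts[(seed_id, song_id)][0 if is_up else 1] += 1
    return {
        key: (ups + prior_up) / (ups + downs + prior_up + prior_down)
        for key, (ups, downs) in counts.items()
    }

# Made-up events: the same song gets separate statistics per seed.
events = [
    ("artist:A", "song:X", True),
    ("artist:A", "song:X", True),
    ("artist:A", "song:X", False),
    ("genre:G", "song:X", True),
]
ratios = thumbs_ratios(events)
print(ratios)
```

Keying on the pair is what makes the number of statistics grow with the number of contexts, which is the scale issue mentioned above.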

Our genome data has not only track/album/artist/genre meta data, but also
'gene' analysis for each track done by human music/comedic analysts. There are up to
450 gene values per track, capturing a track's musical (or comedic)
attributes from melody, harmony and instrumentation to rhythm, vocals and
lyrics.
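
As a rough illustration of how per-track gene values can be compared, the sketch below treats each track as a fixed-length numeric vector and finds the closest tracks by cosine similarity. This is an assumption for the sake of the example: the interview does not say which distance measure Pandora uses, and the three-value vectors and track ids are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length gene vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def nearest_tracks(seed_id, genomes, k=2):
    """Return the ids of the k tracks most similar to the seed track."""
    seed_vec = genomes[seed_id]
    scored = [
        (cosine_similarity(seed_vec, vec), track_id)
        for track_id, vec in genomes.items()
        if track_id != seed_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [track_id for _, track_id in scored[:k]]

# Toy genome table (hypothetical values, three genes instead of hundreds).
genomes = {
    "track_a": [0.9, 0.1, 0.8],
    "track_b": [0.8, 0.2, 0.7],
    "track_c": [0.1, 0.9, 0.2],
    "track_d": [0.5, 0.5, 0.5],
}
neighbors = nearest_tracks("track_a", genomes, k=2)
print(neighbors)
```

A brute-force scan like this is fine for a toy table; at the scale of hundreds of thousands of tracks one would normally precompute neighbors offline or use an index.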

Q: How does your current recommendation engine work? (maybe in general,
you probably do not want to reveal all secret recipes here)
A: We combine crowd feedback data and genome analysis data to provide
recommendations within the context of the station seed to each user. Our algorithm
recommends songs based on metrics such as thumbs ratio, genome nearest
neighbor and song novelty. It also customizes stations in real
time per user, based on instant user feedback.
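
One simple way to picture such a combination is a weighted blend of the three metrics, with the listener's own thumbs-down feedback filtering the candidates. The sketch below is only that: the weights, the novelty formula and the field names are hypothetical, and the real playlist algorithm is not described in this interview.

```python
def blended_score(thumbs_ratio, genome_similarity, play_count,
                  weights=(0.5, 0.3, 0.2)):
    """Blend three signals into one score (weights are made up)."""
    # Assumed novelty proxy: songs this listener has heard less often
    # count as more novel.
    novelty = 1.0 / (1.0 + play_count)
    w_thumbs, w_genome, w_novelty = weights
    return (w_thumbs * thumbs_ratio
            + w_genome * genome_similarity
            + w_novelty * novelty)

def rank_candidates(candidates, thumbed_down=frozenset()):
    """Rank candidate songs, skipping those the listener thumbed down."""
    allowed = [c for c in candidates if c["song_id"] not in thumbed_down]
    return sorted(
        allowed,
        key=lambda c: blended_score(
            c["thumbs_ratio"], c["genome_similarity"], c["play_count"]
        ),
        reverse=True,
    )

# Toy candidates for one station (all values invented).
candidates = [
    {"song_id": "song_x", "thumbs_ratio": 0.90, "genome_similarity": 0.80, "play_count": 4},
    {"song_id": "song_y", "thumbs_ratio": 0.60, "genome_similarity": 0.90, "play_count": 0},
    {"song_id": "song_z", "thumbs_ratio": 0.95, "genome_similarity": 0.95, "play_count": 9},
]
ranked = [c["song_id"] for c in rank_candidates(candidates)]
after_thumb_down = [
    c["song_id"] for c in rank_candidates(candidates, thumbed_down={"song_z"})
]
print(ranked, after_thumb_down)
```

The second call mimics instant feedback: once the listener thumbs down a song, it drops out of that listener's ranking without recomputing the offline metrics.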

Q: What are some future challenges you would like to solve? Specifically,
are you looking at online/real-time recommendations?
A: We're constantly improving the playlist algorithm. Many challenges lie
ahead.
* Pandora already provides online/real-time personalized playlists. We
compute the building blocks to assist in making those choices offline, but
every song on Pandora was chosen specifically for that listener at that
moment. It is still a challenge to build a more refined set of real-time
metrics and infer listener preferences, especially with limited user input
(many listeners don't thumb at all!).
* Past competitions emphasize prediction accuracy optimization; however,
at Pandora we value music variety greatly, hence understanding the
tradeoff between prediction accuracy and music variety/diversity and
striking the right balance is very important.
* We work on context relevant recommendations, from creating the best 4th
of July stations to ensuring new artist/song stations are good. These are
our cold start problems.
* Greater combination of different recommendation algorithms, including
content-based, expert-based and various crowd-based recommendation.
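
The accuracy versus variety tradeoff mentioned in the second bullet can be illustrated with maximal marginal relevance (MMR), a well-known greedy re-ranking technique. This is a swapped-in textbook method used here as an example; nothing in the interview says Pandora uses it. The `trade_off` parameter, the scores and the similarity function are hypothetical.

```python
def mmr_rerank(relevance, similarity, k, trade_off=0.7):
    """Greedy MMR: pick k songs balancing relevance against redundancy.

    relevance: dict song_id -> predicted preference score.
    similarity: function (song_a, song_b) -> similarity in [0, 1].
    trade_off: 1.0 means pure relevance, 0.0 means pure diversity.
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def mmr(song):
            # Penalize a song by its closest match among songs already picked.
            max_sim = max((similarity(song, s) for s in selected), default=0.0)
            return trade_off * relevance[song] - (1.0 - trade_off) * max_sim
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "a" and "b" are near-duplicates, "c" is different but less relevant.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}

def toy_similarity(x, y):
    return 0.95 if {x, y} == {"a", "b"} else 0.1

balanced = mmr_rerank(relevance, toy_similarity, k=2, trade_off=0.5)
accuracy_only = mmr_rerank(relevance, toy_similarity, k=2, trade_off=1.0)
print(balanced, accuracy_only)
```

With `trade_off=1.0` the two most relevant (and nearly identical) songs are chosen; lowering it swaps the near-duplicate for the more varied song, which is the balance the bullet describes.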

About Tao and Eric:

Tao Ye is a member of the Pandora playlist engineering team, currently
working on Pandora's playlist measurement and genome optimization. Most
recently, she spent 5 years as a research scientist at Sprint's IP and
wireless networking group, working on network monitoring and measurement
of a large scale IP backbone. Prior to joining Sprint, she held lead
engineer and engineer roles working on Java systems at Consilient and Sun
Microsystems. She received a Master's degree from UC Berkeley
in Computer Science and dual Bachelor's degrees from the State University of
New York at Stony Brook in Computer Science and Engineering Chemistry. She is expecting her Ph.D.
degree from the University of Melbourne in 12/2011.

Eric Bieschke runs playlist engineering for Pandora. As Pandora's second
employee he built initial prototypes for Pandora's playlist algorithms and
with his team has grown them to service more than 100M users who've
thumbed 10 billion songs while listening to billions of hours of music. He
is currently working on optimally combining content based recommendations,
collective intelligence, and human machine cooperation in order to provide
the best experience for listeners.

Thursday, September 29, 2011
