Large Scale Machine Learning and Other Animals: SpotLight: Michael Ekstrand and the LensKit Project

As part of our new collaboration with LensKit, in which the GraphLab collaborative filtering library will be used as one of LensKit's engines, here is an overview of this interesting project from Michael Ekstrand, a PhD student at GroupLens Research, University of Minnesota.

- What is the goal of the LensKit project?

The goal of LensKit is to provide a flexible, extensible platform for researching, studying, and deploying recommender systems. We are primarily concerned with providing a framework for building and integrating recommenders, high-quality implementations of canonical algorithms which perform well on research-scale data sets, and tools for running reproducible offline evaluations of algorithms.

- What other relevant projects exist in the GroupLens lab?

We have been building recommenders internally for quite some time, starting with the GroupLens recommender system, followed by MovieLens and later recommendation efforts (such as various research paper recommenders which went under the name TechLens, and our new book recommender service BookLens). Several years ago, the MultiLens project made some of our recommender code available for use outside the lab.
LensKit is a brand-new, from-scratch implementation of core recommender algorithms that we will be using internally going forward. BookLens is currently built on top of it, and we plan to move MovieLens from its current internal recommender code, which is related to the MultiLens code, to LensKit sometime in the coming months. A number of LensKit's design decisions have been driven by the needs of the BookLens project, as we have been making it suitable for both offline runs and integration into web applications. Future web-based recommender systems, both within GroupLens and externally, will be able to pick up LensKit and integrate it very easily with the complex needs of web server environments.

- Who is working on the project and for how long?

I started the project about 2 years ago, working on it off and on. Development picked up substantially in late 2010 to early 2011, and we had our first public release early this year. Michael Ludwig has been involved with the project for most of that time, helping particularly with design and requirements work as he integrates it with the BookLens code, and also contributing code directly. Jack Kolb is an undergraduate who has been working with me since late spring, and we have had other students helping from time to time as well.

- What is the status of development?

Right now we are in late beta, with stable APIs for common recommender tasks, a robust infrastructure, and good implementations of several classic algorithms. We are currently working on documentation and a refactoring of our recommender configuration infrastructure; once that is completed and tested, we should be ready to declare a stable 1.0 release to build on going forward. It is pretty safe to build against LensKit at this point, though; the main interfaces are stable, and the APIs for configuration shouldn't change too much. Some code might need to be updated for 1.0, but it should be limited to the code to configure the recommender.

- Which open source license are you using?

LensKit is licensed under the GNU GPL version 2 or later with a link exception to allow linking between it and modules under other licenses (whether proprietary or GPL-incompatible open source). LensKit can be used and deployed internally without restriction; projects distributing it externally must make its source code available to their users. The link exception is the same as that used by the GNU Classpath project.

Many of the libraries we depend on are licensed under the Apache License, version 2.0 (ALv2).

- What are the interesting algorithms currently implemented?

We provide implementations of user-user and item-item collaborative filtering (with similarity matrix pre-computation and truncation in item-item), Funk's regularized gradient descent SVD (see also Paterek's paper in KDD 2007), and Slope-One recommenders.
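For readers unfamiliar with the regularized gradient descent SVD mentioned above, here is a minimal Python sketch of the general idea. It is not LensKit code (LensKit is a Java library) and does not reflect its API: the names `train_funk_svd` and `predict`, the toy ratings, and all parameter values are assumptions made for illustration. Note also that this version updates all features together on each rating, a common simplification, whereas Funk's original description trains one feature at a time.

```python
import random


def train_funk_svd(ratings, n_features=2, n_epochs=200, lr=0.02, reg=0.02, seed=0):
    """Fit user and item feature vectors by regularized stochastic gradient descent.

    ratings: list of (user, item, rating) tuples.
    Returns (global_mean, user_features, item_features).
    """
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    # Sort the ids so that the random initialization is reproducible.
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_features)] for u in users}
    q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_features)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            pu, qi = p[u], q[i]
            # Error of the current prediction (global mean plus dot product).
            err = r - (mean + sum(a * b for a, b in zip(pu, qi)))
            for f in range(n_features):
                puf, qif = pu[f], qi[f]
                # Gradient step on the squared error with an L2 penalty.
                pu[f] += lr * (err * qif - reg * puf)
                qi[f] += lr * (err * puf - reg * qif)
    return mean, p, q


def predict(model, user, item):
    """Predict a rating; fall back to the global mean for unknown users or items."""
    mean, p, q = model
    if user not in p or item not in q:
        return mean
    return mean + sum(a * b for a, b in zip(p[user], q[item]))


# Toy data (hypothetical): u0 and u1 like i0 and i1, u2 and u3 like i2 and i3.
likes = {"u0": ("i0", "i1"), "u1": ("i0", "i1"),
         "u2": ("i2", "i3"), "u3": ("i2", "i3")}
all_ratings = [(u, i, 5.0 if i in liked else 1.0)
               for u, liked in likes.items()
               for i in ("i0", "i1", "i2", "i3")]
held_out = ("u0", "i0", 5.0)
train = [t for t in all_ratings if t != held_out]

model = train_funk_svd(train)
print(predict(model, "u0", "i0"))  # held-out pair; should land well above the mean
```

The regularization term `reg` keeps the feature vectors small, which matters on real data where most users have rated only a few items.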

- What level of parallelism do you allow? Is there support for parallel execution?

Currently, none of the algorithms are parallelized, though parallelizing some of them is on our radar. The evaluator, however, is capable of training and evaluating multiple algorithms in parallel, even sharing some of the major data structures like the rating snapshot between runs, to achieve good throughput on comparative evaluations.
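The evaluation pattern described here, several algorithms trained and scored concurrently against one shared, read-only rating snapshot, can be sketched roughly as follows. This is an illustration in Python, not LensKit's evaluator: the classes `GlobalMean` and `ItemMean`, the function `evaluate_all`, and the toy data are all hypothetical. Python threads also will not speed up CPU-bound training the way JVM threads can, so the sketch shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


class GlobalMean:
    """Baseline: predict the global mean rating for everything."""

    def fit(self, train):
        self.mean = sum(r for _, _, r in train) / len(train)
        return self

    def predict(self, user, item):
        return self.mean


class ItemMean:
    """Baseline: predict each item's mean rating, else the global mean."""

    def fit(self, train):
        self.global_mean = sum(r for _, _, r in train) / len(train)
        sums = {}
        for _, item, r in train:
            total, count = sums.get(item, (0.0, 0))
            sums[item] = (total + r, count + 1)
        self.means = {item: total / count for item, (total, count) in sums.items()}
        return self

    def predict(self, user, item):
        return self.means.get(item, self.global_mean)


def rmse(model, test):
    """Root mean squared error of the model's predictions on test ratings."""
    return sqrt(sum((r - model.predict(u, i)) ** 2 for u, i, r in test) / len(test))


def evaluate_all(algorithms, train, test, max_workers=2):
    """Train and score every algorithm concurrently on one shared snapshot.

    algorithms: dict mapping a name to a zero-argument factory.
    """
    # Tuples act as an immutable snapshot that every worker reads but never copies.
    train, test = tuple(train), tuple(test)

    def run(entry):
        name, factory = entry
        return name, rmse(factory().fit(train), test)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, algorithms.items()))


train_ratings = [("a", "x", 5.0), ("b", "x", 4.0), ("a", "y", 1.0), ("b", "y", 2.0)]
test_ratings = [("c", "x", 5.0), ("c", "y", 1.0)]
results = evaluate_all({"global-mean": GlobalMean, "item-mean": ItemMean},
                       train_ratings, test_ratings)
print(results)
```

Because every algorithm sees exactly the same train and test split, the resulting error figures are directly comparable, which is the point of a comparative evaluation.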

- Do you have some performance numbers on Netflix/KDD data or similar datasets?

We provide accuracy data from the MovieLens data sets and Yahoo! Music data sets in our RecSys 2011 paper (http://grouplens.org/node/479). Efficiency numbers are a bit rough, since we've been running parallel builds, but we can train an item-item model on the MovieLens 10M data set in about 15 minutes, and on one shard of the Y! Music data set (from WebScope, each shard has ~80M ratings) in about 20-26 hours. FunkSVD takes 30-50 minutes, depending on the model size and parameters, on ML10M and 14 hours on Y!M.

- Regarding the potential collaboration with GraphLab: what might be the benefits of this collaboration?

We are looking to integrate GraphLab to allow LensKit users to have easy access to high-quality implementations of a wider variety of matrix factorization methods in particular, and to leverage your existing work rather than re-building everything ourselves. It will also make it easy to integrate GraphLab-based algorithms with interesting recommender environments, such as web applications or comparative evaluation setups.

Overall, here at the GraphLab project we are very excited about this collaboration. We believe that LensKit has a few properties that are currently missing in GraphLab, namely output processing for finding the best recommendation, improved user interaction, and a better UI.

Friday, December 2, 2011
