Monday, November 14, 2011

SpotLight: Zeno Gantner - MyMediaLite

Here are some more details I got from Zeno Gantner, about his collaborative filtering project: MyMediaLite.
> 1) What is the focus of your library?
The focus of MyMediaLite is to provide
  1. a collection of useful recommender system algorithms,
  2. a toolkit for experimentation with such methods, and
  3. practical tools for using and deploying the methods in the library.

There is no single specific target audience - we would like to cater to both researchers and practitioners, as well as people who teach or learn about recommender systems.

Our long-term vision is to be a kind of Weka for recommender systems. We are not there yet. For example, we only have command-line tools that expose most of the library's functionality, but no fancy GUI program.

While scalability and performance are important to us, we do not sacrifice usability for them. Thus, the library offers many useful additions beyond the raw algorithm implementations:
  • incremental updates for many models
  • storing and reloading of learned models
  • support for different text-based file formats, database support

What MyMediaLite is not: an off-the-shelf package that you can plug into your online shop. But using MyMediaLite for your online shop relieves you of having to implement a recommendation algorithm. You can use an efficient, well-tested implementation from the library instead, which can save you a lot of work.

So if you are a .NET developer and are looking for a recommender system/collaborative filtering solution, MyMediaLite may be worth a look ...

> 2) Who are the contributors?
Originally, there were 3 authors in our lab who worked on MyMediaLite. Since the "MyMedia" project, which was responsible for MyMediaLite's birth, ended, I have been the main author. A guy from BBC R&D (hello, Chris!) has ported parts of the library to Java (available on our download page).

I occasionally get bug reports and patches from users, but there has not yet been a major outside contribution, e.g. a new recommender algorithm. I would be very happy to accept such contributions. If you (the readers of Danny's blog) want to contribute, there is a list of interesting methods that could be implemented for MyMediaLite in our issue tracker.

The development process is really open. I keep the source code on GitHub and Gitorious. There is also a Google Group, a public issue tracker, and plenty of documentation, so it should be rather easy to start working on MyMediaLite.

My current goal is to turn MyMediaLite into a community project. That is why I go to conferences to present MyMediaLite, write about it on Twitter, do interviews, etc.

> 3) What are the main implemented algorithms?
The library addresses two main tasks: rating prediction (very popular in the public eye because of the Netflix Prize) and item recommendation from implicit/positive-only feedback. The latter task is particularly important in practice, because you always have that kind of data (clicks, purchase actions), while you have to ask your users for ratings, and they will not always give them to you.

For both tasks we have simple baseline algorithms (average ratings, most popular item, etc.) that you can use to check whether you screwed up: a rating prediction algorithm should always be more accurate than the item average, and an item recommendation algorithm should always come up with better suggestions than the globally most popular items.
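The sanity check described above can be sketched in a few lines. This is an illustrative Python sketch, not MyMediaLite code (the library itself is C#); the function names and the `(user, item, rating)` tuple layout are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def item_average_baseline(ratings):
    """Build a predictor that returns each item's average training rating.

    ratings: list of (user, item, rating) tuples (hypothetical layout).
    Unknown items fall back to the global average.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _, item, rating in ratings:
        sums[item] += rating
        counts[item] += 1
    global_avg = sum(r for _, _, r in ratings) / len(ratings)

    def predict(user, item):
        if counts.get(item, 0) == 0:
            return global_avg
        return sums[item] / counts[item]

    return predict


def most_popular_baseline(interactions, n=10):
    """Return the n items with the most (user, item) interactions."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(n)]


def rmse(predict, test_ratings):
    """Root mean squared error of a predictor on held-out ratings."""
    squared = [(r - predict(u, i)) ** 2 for u, i, r in test_ratings]
    return math.sqrt(sum(squared) / len(squared))
```

If a new rating model does not reach a lower `rmse` on held-out data than `item_average_baseline`, or a new item recommender does not beat the `most_popular_baseline` list on a ranking measure, something is probably wrong with the model or its hyperparameters.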

Beyond that, we have several variants of k-nearest-neighbor (kNN) algorithms for both tasks, both based on interaction data (collaborative filtering) and on item attribute data (content-based filtering).

Most importantly, we have matrix factorization techniques for both tasks. If you have enough data, those are the models to use. For rating prediction, we have a straightforward matrix factorization with SGD training that also models user and item biases. For item recommendation, we have weighted regularized matrix factorization (WR-MF), which is called weighted-alternating least squares in GraphLab, and BPR-MF, which optimizes for a ranking loss (AUC).
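To make the rating prediction model concrete, here is a minimal Python sketch of matrix factorization with user and item biases trained by SGD. It is not the library's implementation (MyMediaLite is written in C#); the function name, the hyperparameter defaults, and the data layout are assumptions chosen for illustration. The model predicts a rating as global average + user bias + item bias + dot product of the user and item factor vectors.

```python
import random


def train_biased_mf(ratings, num_factors=10, epochs=30, lr=0.01, reg=0.05, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + <p_u, q_i> with SGD and L2 regularization.

    ratings: list of (user, item, rating) tuples (hypothetical layout).
    Returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global average
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    # Small random initial factors so the dot product starts near zero.
    P = {u: [rng.gauss(0.0, 0.1) for _ in range(num_factors)] for u in users}
    Q = {i: [rng.gauss(0.0, 0.1) for _ in range(num_factors)] for i in items}

    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            pred = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))
            err = r - pred
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(num_factors):
                pu, qi = P[u][f], Q[i][f]  # keep old values for both updates
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)

    def predict(u, i):
        score = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
        if u in P and i in Q:
            score += sum(p * q for p, q in zip(P[u], Q[i]))
        return score

    return predict
```

For an unknown user or item the sketch falls back to the bias terms it does have, so a completely new user-item pair gets the global average. The item recommendation models mentioned above (WR-MF, BPR-MF) use different loss functions and are not covered by this sketch.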

> 4) Out of the implemented algorithms, which 2-3 perform best (on datasets like Netflix, KDD Cup)?
I would go for the matrix factorization techniques mentioned above.

> 5) Do you have some performance results (for example on netflix data) in terms of speed and accuracy?
We have some results on our website:

For single-core matrix factorization, on an Intel(R) Xeon(R) CPU E5410 at 2.33 GHz, one iteration over the Netflix data takes, depending on the number of factors, from 2 minutes (10 factors) to 10 minutes (120 factors).

> 6) Which parts of the library are serial and which are parallel?
Most parts of the library are serial. We have parallelized some parts that are really easy to parallelize, e.g. cross-validation, where the experiment on each fold is entirely independent of the other folds.
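Cross-validation parallelizes easily because each fold needs only its own train/test split. The following Python sketch illustrates the pattern; it is not MyMediaLite code, and the helper names and the trivial "global average" model are assumptions used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def k_fold_splits(data, k):
    """Yield (train, test) pairs; fold i holds every k-th example starting at i."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def evaluate_fold(split):
    """Train a placeholder model (global average) and return its test RMSE."""
    train, test = split
    mean = sum(r for _, _, r in train) / len(train)
    mse = sum((r - mean) ** 2 for _, _, r in test) / len(test)
    return mse ** 0.5


def cross_validate(data, k=5, workers=4):
    """Evaluate all folds concurrently; no fold reads another fold's state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_fold, k_fold_splits(data, k)))
```

A thread pool keeps the sketch simple. In CPython, threads do not speed up CPU-bound work, so a process pool would be the usual swap for real training runs.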

We have one parallel SGD method for rating prediction MF. It is based on work presented at this year's KDD by Rainer Gemulla. The method basically separates the training examples into independent chunks, and then performs parallel training on those chunks.
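One way to obtain independent chunks, sketched here in Python under my own assumptions about the details, is to split users and items into d groups each. The ratings then fall into a d x d grid of blocks, and two blocks that share neither a user group nor an item group touch disjoint model parameters, so they can be trained at the same time. The function names and the round-robin group assignment are hypothetical, and this is not the library's C# code.

```python
def partition_into_blocks(ratings, d):
    """Assign each (user, item, rating) to a block (user_group, item_group)."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_group = {u: n % d for n, u in enumerate(users)}
    item_group = {i: n % d for n, i in enumerate(items)}
    blocks = {(a, b): [] for a in range(d) for b in range(d)}
    for u, i, r in ratings:
        blocks[(user_group[u], item_group[i])].append((u, i, r))
    return blocks


def strata(d):
    """Return d lists of block keys; blocks within one list share no
    user group and no item group, so they can be processed in parallel."""
    return [[(a, (a + s) % d) for a in range(d)] for s in range(d)]
```

One pass over the data then walks through the d lists from `strata(d)` one after another, running SGD on the blocks of each list in parallel and synchronizing before moving to the next list.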

I implemented the method because I saw the talk and liked the paper, and I wanted to see how parallelization in C# works - it is very, very simple.

I will try to add more parallelized algorithms, but it is not the main development focus. Most feature requests from users concern evaluation protocols and measures, and features of the surrounding framework.

> 7) What do I need to install if I want to use the library?
The byte code binaries are the same on all platforms. You need a .NET 4.0 runtime. On Windows, Microsoft .NET is already installed, so all you need to do is to download the library and tools and run them. On Mac OS X and Linux (and other Unix variants), you need Mono 2.8 or later, which you may have to install. The latest Ubuntu comes with Mono 2.10.5 installed by default, so there you can also just run it.

We have a tutorial on our website on how to run the command-line programs.

If you want to build the library, you also need MonoDevelop, which runs on all major platforms. You can download it for free. On Windows, you can use Visual Studio 2010 instead.

> 8) Some companies are limited by the GPL license. Did you consider moving to a more industry-friendly license, or do you target mainly academic usage?
While I do not rule out a license change in the future, I am quite happy with the current license. MyMediaLite was derived from a codebase with a much more restrictive license (educational/research use only).

The GPL is actually quite industry-friendly, although not as permissive as the Apache or BSD licenses. The Linux kernel is also licensed under the GNU GPL. Is there a business that does not run Linux because of its license?

I am not a lawyer, but the GPL allows companies and all other users everything they need to do with MyMediaLite:
  • run, modify, and distribute the code
  • create derived works
  • sell it
  • use it in their online shop to make more money
  • set it up as a web service, and sell recommendations as a service

If you modify the library, add new features etc., you are not required to release its source code as long as you do not distribute the software itself.

Basically there are only two things that you cannot do with MyMediaLite under the GPL:
  • redistribute (e.g. sell) the code or derived works while
    denying your users the rights that we gave to you
  • sue us over some dubious software patents you are holding
