I heard from Joey Gonzalez, about this project: http://incubator.apache.org/giraph/project-info.html :
"It is an open-source framework for large-scale graph computation that is community maintained (Apache incubator status), with contributors from LinkedIn and Yahoo!.
It runs on very large (4000 node) cloud systems, and does so efficiently (in memory using low-latency communication and fault-tolerant snapshots)."
"Giraph was recently featured at the HotCloud conference and has been going on a lecture circuit around the valley (...). Unlike Pegasus Giraph is relatively well designed with in memory storage, HadoopRPC messaging, infrequent snapshots, and map-only execution. Basically with the exception of startup and occasional snapshots, Giraph runs entirely in memory and communicates
over sockets."
It definitely sounds like an interesting project. Although no Giraph will beat our labardor! (GraphLab.. :-)
Tuesday, August 30, 2011
Big learning NIPS workshop
My friends Joey Gonzalez and Yucheng Low, are among the organizers of this year
NIPS workshop on big learning. Mark down your calendars for December 16-17 in Sierra Nevada, Spain.
Some of the interesting topics of this workshop:
Hardware Accelerated Learning: Practicality and performance of specialized high-performance hardware (e.g. GPUs, FPGAs, ASIC) for machine learning applications.
Applications of Big Learning: Practical application case studies; insights on end-users, typical data workflow patterns, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced); challenges of real-world system building.
Tools, Software, & Systems: Languages and libraries for large-scale parallel or distributed learning. Preference will be given to approaches and systems that leverage cloud computing (e.g. Hadoop, DryadLINQ, EC2, Azure), scalable storage (e.g. RDBMs, NoSQL, graph databases), and/or specialized hardware (e.g. GPU, Multicore, FPGA, ASIC).
Hopefully, if I manage to scrap enough travel money, I may even attend!
NIPS workshop on big learning. Mark down your calendars for December 16-17 in Sierra Nevada, Spain.
Some of the interesting topics of this workshop:
Hardware Accelerated Learning: Practicality and performance of specialized high-performance hardware (e.g. GPUs, FPGAs, ASIC) for machine learning applications.
Applications of Big Learning: Practical application case studies; insights on end-users, typical data workflow patterns, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced); challenges of real-world system building.
Tools, Software, & Systems: Languages and libraries for large-scale parallel or distributed learning. Preference will be given to approaches and systems that leverage cloud computing (e.g. Hadoop, DryadLINQ, EC2, Azure), scalable storage (e.g. RDBMs, NoSQL, graph databases), and/or specialized hardware (e.g. GPU, Multicore, FPGA, ASIC).
Hopefully, if I manage to scrap enough travel money, I may even attend!
Monday, August 29, 2011
Installing GNU Octave
GNU Octave is a Matlab like open source math environment. Recently, while talking to many companies, many of them have expressed their dissatisfaction with Maltlab license. In this post I will explain how to install Octave instead, for all the open source lovers.
After breaking my head most of today, I found out that the easiest way
to install Octave on Linux is found here.
Specifically for Fedora, you just need to do
On Ubuntu, you need to
For MAC OS it is also easy: you can download dmg file here
If you want to work really hard, you can try to compile the
sourcess.
However, this is really not recommended!!!
After you download the tgz file, you need to run ./configure
you may get this error:
point to the right blas/lapack libs:
http://lists.gnu.org/archive/html/octave-bug-tracker/2011-02/msg00262.html
The way to resolve it, is either update gnulib version, or add
the appropriate header file
#include
and change gnulib::floor to floor
Next you will get about 20 such erros in different files that needs to be
setup manually.
dir-ops.cc: In member function 'bool dir_entry::open(const std::string&)':
dir-ops.cc:61: error: 'strerror' is not a member of 'gnulib'
make[3]: *** [liboctave_la-dir-ops.lo] Error 1
#include<stdlib.h>
file-ops.cc: In function 'int octave_mkdir(const std::string&, mode_t, std::string&)':
file-ops.cc:378: error: 'strerror' is not a member of 'gnulib'
file-ops.cc: In function 'int octave_rename(const std::string&, const std::string&, std::string&)':
file-ops.cc:502: error: 'rename' is not a member of 'gnulib'
make[3]: *** [liboctave_la-file-ops.lo] Error 1
lo-mappers.cc: In function 'double xtrunc(double)':
lo-mappers.cc:48: error: 'trunc' is not a member of 'gnulib'
lo-mappers.cc: In function 'double xfloor(double)':
lo-mappers.cc:53: error: 'floor' is not a member of 'gnulib'
lo-mappers.cc: In function 'double xround(double)':
lo-mappers.cc:59: error: 'round' is not a member of 'gnulib'
lo-mappers.cc: In function 'float xtrunc(float)':
lo-mappers.cc:267: error: 'truncf' is not a member of 'gnulib'
lo-mappers.cc: In function 'float xround(float)':
lo-mappers.cc:273: error: 'round' is not a member of 'gnulib'
make[3]: *** [liboctave_la-lo-mappers.lo] Error 1
lo-utils.cc: In function 'void octave_putenv(const std::string&, const std::string&)':
lo-utils.cc:93: error: 'malloc' is not a member of 'gnulib'
lo-utils.cc: In function 'std::string octave_fgets(FILE*, bool&)':
lo-utils.cc:121: error: 'malloc' is not a member of 'gnulib'
lo-utils.cc:136: error: 'realloc' is not a member of 'gnulib'
oct-env.cc: In member function 'void octave_env::error(int) const':
oct-env.cc:534: error: 'strerror' is not a member of 'gnulib'
make[3]: *** [liboctave_la-oct-env.lo] Error 1
make[3]: Leaving directory `/mnt/bigbrofs/usr7/bickson/octave-3.4.2/liboctave'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/mnt/bigbrofs/usr7/bickson/octave-3.4.2/liboctave'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/mnt/bigbrofs/usr7/bickson/octave-3.4.2'
ct-md5.cc: In function 'std::string oct_md5_file(std::string)':
oct-md5.cc:73: error: 'fclose' is not a member of 'gnulib'
make[3]: *** [liboctave_la-oct-md5.lo] Error 1
oct-time.cc: In constructor 'octave_time::octave_time(const octave_base_tm&)':
oct-time.cc:62: error: 'mktime' is not a member of 'gnulib'
oct-time.cc: In member function 'void octave_strptime::init(const std::string&, const std::string&)':
oct-time.cc:269: error: 'strptime' is not a member of 'gnulib'
oct-time.cc:276: error: 'mktime' is not a member of 'gnulib'
ANOTHER ERROR I GOT:
build-aux/bootstrap: libtoolize -c -f ...
libtoolize: $pkgltdldir not a directory: `/usr0/software/libtool/share/libtool'
SOLUTION: make clean, and this error disappeared
ANOTHER ERROR:
./liboctave/.libs/liboctave.so: undefined reference to `METIS_NodeComputeSeparator'
SOLUTION:
setenv LIBS "-lmetis -L/usr/cs/lib/"
and then ./configure again and compile again
After breaking my head most of today, I found out that the easiest way
to install Octave on Linux is found here.
Specifically for Fedora, you just need to do
sudo yum install octave-forge
On Ubuntu, you need to
sudo apt-get install octave3.2
For MAC OS it is also easy: you can download dmg file here
If you want to work really hard, you can try to compile the
sourcess.
However, this is really not recommended!!!
After you download the tgz file, you need to run ./configure
you may get this error:
configure: error: You are required to have BLAS and LAPACK libraries
point to the right blas/lapack libs:
./configure --with-lapack="-llapack -L/usr/lib64/atlas/" --with-blas="-lblas -L/usr/lib64/atlas/" --prefix=/mnt/bigbrofs/usr6/bickson/octave/ --without-qhullwhen compiling I got this error:
Range.cc:440: error: 'floor' is not a member of 'gnulib'by reading on the web, it seems it is some gcc bug:
http://lists.gnu.org/archive/html/octave-bug-tracker/2011-02/msg00262.html
The way to resolve it, is either update gnulib version, or add
the appropriate header file
#include
and change gnulib::floor to floor
Next you will get about 20 such erros in different files that needs to be
setup manually.
dir-ops.cc: In member function 'bool dir_entry::open(const std::string&)':
dir-ops.cc:61: error: 'strerror' is not a member of 'gnulib'
make[3]: *** [liboctave_la-dir-ops.lo] Error 1
#include<stdlib.h>
file-ops.cc: In function 'int octave_mkdir(const std::string&, mode_t, std::string&)':
file-ops.cc:378: error: 'strerror' is not a member of 'gnulib'
file-ops.cc: In function 'int octave_rename(const std::string&, const std::string&, std::string&)':
file-ops.cc:502: error: 'rename' is not a member of 'gnulib'
make[3]: *** [liboctave_la-file-ops.lo] Error 1
lo-mappers.cc: In function 'double xtrunc(double)':
lo-mappers.cc:48: error: 'trunc' is not a member of 'gnulib'
lo-mappers.cc: In function 'double xfloor(double)':
lo-mappers.cc:53: error: 'floor' is not a member of 'gnulib'
lo-mappers.cc: In function 'double xround(double)':
lo-mappers.cc:59: error: 'round' is not a member of 'gnulib'
lo-mappers.cc: In function 'float xtrunc(float)':
lo-mappers.cc:267: error: 'truncf' is not a member of 'gnulib'
lo-mappers.cc: In function 'float xround(float)':
lo-mappers.cc:273: error: 'round' is not a member of 'gnulib'
make[3]: *** [liboctave_la-lo-mappers.lo] Error 1
lo-utils.cc: In function 'void octave_putenv(const std::string&, const std::string&)':
lo-utils.cc:93: error: 'malloc' is not a member of 'gnulib'
lo-utils.cc: In function 'std::string octave_fgets(FILE*, bool&)':
lo-utils.cc:121: error: 'malloc' is not a member of 'gnulib'
lo-utils.cc:136: error: 'realloc' is not a member of 'gnulib'
oct-env.cc: In member function 'void octave_env::error(int) const':
oct-env.cc:534: error: 'strerror' is not a member of 'gnulib'
make[3]: *** [liboctave_la-oct-env.lo] Error 1
make[3]: Leaving directory `/mnt/bigbrofs/usr7/bickson/octave-3.4.2/liboctave'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/mnt/bigbrofs/usr7/bickson/octave-3.4.2/liboctave'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/mnt/bigbrofs/usr7/bickson/octave-3.4.2'
ct-md5.cc: In function 'std::string oct_md5_file(std::string)':
oct-md5.cc:73: error: 'fclose' is not a member of 'gnulib'
make[3]: *** [liboctave_la-oct-md5.lo] Error 1
oct-time.cc: In constructor 'octave_time::octave_time(const octave_base_tm&)':
oct-time.cc:62: error: 'mktime' is not a member of 'gnulib'
oct-time.cc: In member function 'void octave_strptime::init(const std::string&, const std::string&)':
oct-time.cc:269: error: 'strptime' is not a member of 'gnulib'
oct-time.cc:276: error: 'mktime' is not a member of 'gnulib'
ANOTHER ERROR I GOT:
build-aux/bootstrap: libtoolize -c -f ...
libtoolize: $pkgltdldir not a directory: `/usr0/software/libtool/share/libtool'
SOLUTION: make clean, and this error disappeared
ANOTHER ERROR:
./liboctave/.libs/liboctave.so: undefined reference to `METIS_NodeComputeSeparator'
SOLUTION:
setenv LIBS "-lmetis -L/usr/cs/lib/"
and then ./configure again and compile again
Sunday, August 28, 2011
GraphLab and Mahout Compatability - Mahout SVD input format
In the last couple of months, we are working on increasing compatibility of GraphLab large scale machine learning project, with Apache Mahout large scale machine learning project. About 3 months ago, as a first step, we decided to change GraphLab license to Apache license to allow for better interaction between the projects.
This week I wrote some Java code, to allow translation of Mahout Hadoop sequence files using in Mahout's SVD solver into GraphLab's format and back. This will allow users of Mahout that rely on Hadoop infrastructure, to pre-process the matrix factorization problem as before, but change the actual solver used from Mahout SVD to GraphLab. The advantage, is that for problems which fit into memory, GraphLab run about x50 faster.
Here is the Java code:
This week I wrote some Java code, to allow translation of Mahout Hadoop sequence files using in Mahout's SVD solver into GraphLab's format and back. This will allow users of Mahout that rely on Hadoop infrastructure, to pre-process the matrix factorization problem as before, but change the actual solver used from Mahout SVD to GraphLab. The advantage, is that for problems which fit into memory, GraphLab run about x50 faster.
Here is the Java code:
import java.io.*; import java.util.Iterator; import org.apache.mahout.math.SequentialAccessSparseVector; import org.apache.mahout.math.Vector; import org.apache.mahout.math.VectorWritable; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.SequenceFile; /** * Code for converting Mahout SequenceFile containing SequentialAccessSparseVector to GraphLab format * @author bickson * */ public class SeqFile2GraphLab { public static int Cardinality; /** * * @param args[0] - input svd file * @param args[2] - output csv file */ static void writeFloat(DataOutputStream dos, float a) throws java.io.IOException{ int floatBits = Float.floatToIntBits(a); byte floatBytes[] = new byte[4]; floatBytes[3] = (byte)(floatBits >> 24); floatBytes[2] = (byte)(floatBits >> 16); floatBytes[1] = (byte)(floatBits >> 8); floatBytes[0] = (byte)(floatBits); dos.write(floatBytes); } static void writeInt(DataOutputStream dos, int a) throws java.io.IOException{ byte floatBytes[] = new byte[4]; floatBytes[3] = (byte)(a >> 24); floatBytes[2] = (byte)(a >> 16); floatBytes[1] = (byte)(a >> 8); floatBytes[0] = (byte)(a); dos.write(floatBytes); } public static void main(String[] args){ if (args.length < 6){ System.err.println("Usage: java SeqFile2GraphLab [input seq file name] [output graphlab file name] [M - number of users] [N - number of movies] [K - number of time bins ] [ e - number of edges] [ OPTIONAL: transpose = false]"); System.exit(1); } String inputfile = args[0]; String outputfile = args[1]; int M = Integer.parseInt(args[2]); int N = Integer.parseInt(args[3]); int K = Integer.parseInt(args[4]); int e = Integer.parseInt(args[5]); if (M < 1 || N < 1 || K< 1 || e<1){ System.err.println("wrong input. M,N,K,e should be >=1"); System.exit(1); } boolean transpose = false; if (args.length >= 7) transpose = new Boolean(args[6]).booleanValue(); try { final Configuration conf = new Configuration(); final FileSystem fs = FileSystem.get(conf); final SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(inputfile), conf); DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(outputfile))); writeInt(dos,M); writeInt(dos,N); writeInt(dos,K); writeInt(dos,e); System.out.println("Writing a matrix of size: " + M + " users, " + N + " movies, " + K + " time bins " + e + " edges."); IntWritable key = new IntWritable(); VectorWritable vec = new VectorWritable(); int edges = 0; while (reader.next(key, vec)) { SequentialAccessSparseVector vect = (SequentialAccessSparseVector)vec.get(); //System.out.println("key " + key + " value: " + vect); Iterator<Vector.Element> iter = vect.iterateNonZero(); while(iter.hasNext()){ Vector.Element element = iter.next(); if (!transpose){ assert(element.index() < M); assert(key.get() < N); writeFloat(dos,element.index()+1); writeFloat(dos,key.get()+1+M); } else { assert(element.index() < N); assert(key.get() < M); writeFloat(dos,key.get()+1); writeFloat(dos,element.index()+1+M); } writeFloat(dos,1); writeFloat(dos,(float)vect.getQuick(element.index())); //System.out.println("col: " + key.get() + " row: " + element.index() + " val: " + vect.getQuick(element.index())); edges++; if (edges % 1000000 == 0) System.out.println("edges: " + edges); } } if (edges != e){ System.err.println("Wrong number of edges in file. Should be " + e + " in practice we have : " + edges); } reader.close(); dos.close(); System.out.println("Done writing !" + e + " edges"); } catch(Exception ex){ ex.printStackTrace(); } } }Compilation:
javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.javaExample run, for converting netflix data, from Mahout's format into GraphLab's:
run2: java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/commons-collections-3.2.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/google-collections-1.0-rc2.jar SeqFile2GraphLab netflix.seq netflix 480189 17770 27 1408395
More on shutgun
I got inquiries from several companies who wanted to use shotgun, our large scale sparse logistic regression / lasso solver. I thought about writing a quick tutorial that could be useful.
PAPER: shotgun paper appears in this year ICML 2011. The paper explains the approach we took to allow running our algorithm on multicore machines. It analyzes the theory, and justifies the cases where parallel execution does not hurt accuracy.
CODE: the shotgun code is found here: http://select.cs.cmu.edu/code/
TARGET: the goal of this code, is to handle large scale problems, that from the one hand fit into a multicore machine, but from the other hand, other solvers such as GLMNET, Boy'd l1 interior point methods and liblinear fail to scale.
LICENSE: The code is licensed under Apache license.
INTERFACES: We have both a C code version, as well as Matlab interface for running the code from within Matlab. Due to Patrick Harrington (OneRiot.com) request, we added support for Matrix Market input format.
Additional R interface is found here, thanks to Steve Lianoglou, Cornell graduate student.
COST FUNCTION:
We use the following cost function formulation.
For Lasso:
argmin_x sum_i [(A_i*x - y_i)^2 + lambda * |x|_1]
For sparse logistic regression:
argmin_x sum_i [-log(1 + exp(-y_i * x* A_i) ) + lambda * |x|_1]
where |x|_1 is the first norm (sum of absolute value of the vector x).
Matlab Usage:
C Usage:
PAPER: shotgun paper appears in this year ICML 2011. The paper explains the approach we took to allow running our algorithm on multicore machines. It analyzes the theory, and justifies the cases where parallel execution does not hurt accuracy.
CODE: the shotgun code is found here: http://select.cs.cmu.edu/code/
TARGET: the goal of this code, is to handle large scale problems, that from the one hand fit into a multicore machine, but from the other hand, other solvers such as GLMNET, Boy'd l1 interior point methods and liblinear fail to scale.
LICENSE: The code is licensed under Apache license.
INTERFACES: We have both a C code version, as well as Matlab interface for running the code from within Matlab. Due to Patrick Harrington (OneRiot.com) request, we added support for Matrix Market input format.
Additional R interface is found here, thanks to Steve Lianoglou, Cornell graduate student.
COST FUNCTION:
We use the following cost function formulation.
For Lasso:
argmin_x sum_i [(A_i*x - y_i)^2 + lambda * |x|_1]
For sparse logistic regression:
argmin_x sum_i [-log(1 + exp(-y_i * x* A_i) ) + lambda * |x|_1]
where |x|_1 is the first norm (sum of absolute value of the vector x).
Matlab Usage:
x = shotgun_logreg(A,y,lambda) x = shotgun_lasso(A,y,lambda)
C Usage:
./mm_lasso [ A input matrix_file] [y input vector file] [x vector output file] [algorithm] [ threshold] [ K] [max_iter] [num_threads] [lammbda] Program inputs are: Matrix and vector files are mandaroty inputs Usage: ./mm_lasso -m matrix A in sparse matrix market format -v vector y in sparse matrix market format -o output file name (will contain solution vector x, default is x.mtx) -a algorithm (1=lasso, 2=logitic regresion, 3 = find min lambda for all zero solution) -t convergence threshold (default 1e-5) -k solution path length (for lasso) -i max_iter (default 100) -n num_threads (default 2) -l lammbda - positive weight constant (default 1) -V verbose: 1=verbose, 0=quiet (default 0)
Wednesday, August 24, 2011
More interesting ML contests
I just saw an interesting post by Zeno Gantner. A website representing multiple ML contests is found on: http://www.kaggle.com/
The highest rewarding competition is: http://www.heritagehealthprize.com/
for a total prize of 3M $!!
The highest rewarding competition is: http://www.heritagehealthprize.com/
for a total prize of 3M $!!
Wednesday, August 17, 2011
Efficient multicore collaborative filtering
Recently we got the 5th place in ACM Yahoo! KDD CUP (track1). We report out solution method on arxiv. This is a joint work with Yucheng Low from CMU, and Yao Wu, Qiang Yan and Qing Yang from the Chinese National Academy of Science.
In our opinion, there were two main challenges in this contest:
1) The magnitude of the data: the dataset has about 260M music ratings
in the range 0-> 100. Each rating is composed of user, music item, time of rating and the numeric rating. Time of ratings is between 2249 - 6649 (this is from the top of my head - minor inaccuracies possible).
2) Special characteristics of the data: data includes some taxonomy information which relates music tracks to albums, artists and genere.
Here is some more information about the data from the contest website:
The dataset is split into three subsets:
* - Train data: in the file trainIdx1.txt
* - Validation data: in the file validationIdx1.txt
* - Test data: in the file testIdx1.txt
For each subset, user rating data is grouped by user. First line for a user is formatted as:
* <usedid>|<#UserRatings>\n
Each of the next <#UserRatings> lines describes a single rating by, sorted in chronological order. Rating line format is:
* <itemid>\t<score>\t<date\t><time>\n
The scores are integers lying between 0 and 100. All user id's and item id's are consecutive integers, both starting at zero. Dates are integers describing number of days elapsed since an undisclosed date.
An item has at least 20 ratings in the total dataset (including train, validation, and test sets).
Each user has at least 10 ratings in the training data, which were given by the user earlier than his validation ratings. Then, each user has exactly four ratings in the validation data, which come earlier in time than the ratings by the same user in the test data. Finally, the test data holds the last 6 ratings of each user.
For the test subset, we withheld the scores. The contestants are asked to provide predictions for these test scores. The predictions should be arranged by the given test set order. Each prediction is quantized to one of the 256 evenly spaced numbers between 0 and 100 (100*0/255, 100*1/255,...,100*254/255, 100*255/255). Thus, a single unsigned byte would be dedicated for a predicted value, and predicting the 6,005,940 ratings of the test set would require 6,005,940 bytes. (We provide C and Python programs for converting into well-formatted prediction files.)
The evaluation criterion is the root mean squared error (RMSE) between predicted ratings and true ones.
The test subset is further split into two equally sized subsets, called Test1 and Test2. Test1 is used to rank contestants on the public Leaderboard and Test2 is used for deciding the winners of the contest. The split between Test1 and Test2 is not disclosed.
The dataset statistics are as follows:
#Users #Items #Ratings #TrainRatings #ValidationRatings #TestRatings
1,000,990 624,961 262,810,175 252,800,275 4,003,960 6,005,940
Besides of the rating itself, there is taxonomy information as well. In other words:
* Each music item has an album.
* Each album has an artist.
* Each album has a genre.
Here is an excerpt from the contest website:
A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist and associated genres. Similarly we have artist and genre annotation for the albums. Both users and items (tracks, albums, artists and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed.
Our solution method:
* We used GraphLab for rapid prototyping of multiple collaborative filtering algorithms and
for fine tuning their parameters. Multiple algorithms where blended together using the ensemble method (you are welcome to read more in our paper!).
* Most linear models take into account only the information about user, music item and rating. So the user-> item matrix is decomposed to build a linear model. However, if we know that a user likes Madona, he may like different Madona albums/or songs than the one he rated. So it is good to have this relation in the model. The same with user who likes Jazz, he may well like other albums of Jazz. Since the rating is for a single song, if we do not relate the artist to the song, we miss this information. We propose a novel method called MFITR (matrix factorization item taxonomy regularization)
1) Add feature vector for artist, that mean that if a user like a certain artist it will add to the ratings of other songs of this artist.
2) Add a linear connection between the artist and the items, or else if they will be independent, thus making section 1) not useful. So we added weights that links between user, artist and album. The weights force that if user rated one album or artist higher, it will result in other items from the same artist or album to be higher as well.
3) We did not use genere in our model, although in principal genre can be added using the same technique.
We believe that Yahoo KDD CUP contest data is a great academic resource for both academic researchers as well as high-tech companies who perform collaborative filtering tasks. This is why we release a large portion of our source code as part of GraphLab's collaborative filtering library. Python/Matlab scripts are available for converting the data between KDD format into GraphLab's format.
Recently, we learned from Noam Koenigstein, a Tel-Aviv PhD candidate, about a related paper that was submitted for publication: Yahoo! Music Recommendations: Modeling Music Ratings with Temporal Dynamics and Item Taxonomy. By Gideon Dror, Noam Koenigstein and Yehuda Koren: In ACM Conference on Recommender Systems (RecSys), 2011. While obtained independently, our solution has some similarities to their proposed solution by adding a feature vectors for artist and albums. Their paper have also nice overview of KDD data characteristics in section 3.
I got another interesting resource from Yao Wu: InnerSpace group website: http://apex.sjtu.edu.cn/apex_wiki/svdfeature - they won the 3rd place this year.
And here is the workshop website including all presentations.
In our opinion, there were two main challenges in this contest:
1) The magnitude of the data: the dataset has about 260M music ratings
in the range 0-> 100. Each rating is composed of user, music item, time of rating and the numeric rating. Time of ratings is between 2249 - 6649 (this is from the top of my head - minor inaccuracies possible).
2) Special characteristics of the data: data includes some taxonomy information which relates music tracks to albums, artists and genere.
Here is some more information about the data from the contest website:
The dataset is split into three subsets:
* - Train data: in the file trainIdx1.txt
* - Validation data: in the file validationIdx1.txt
* - Test data: in the file testIdx1.txt
For each subset, user rating data is grouped by user. First line for a user is formatted as:
* <usedid>|<#UserRatings>\n
Each of the next <#UserRatings> lines describes a single rating by
* <itemid>\t<score>\t<date\t><time>\n
The scores are integers lying between 0 and 100. All user id's and item id's are consecutive integers, both starting at zero. Dates are integers describing number of days elapsed since an undisclosed date.
An item has at least 20 ratings in the total dataset (including train, validation, and test sets).
Each user has at least 10 ratings in the training data, which were given by the user earlier than his validation ratings. Then, each user has exactly four ratings in the validation data, which come earlier in time than the ratings by the same user in the test data. Finally, the test data holds the last 6 ratings of each user.
For the test subset, we withheld the scores. The contestants are asked to provide predictions for these test scores. The predictions should be arranged by the given test set order. Each prediction is quantized to one of the 256 evenly spaced numbers between 0 and 100 (100*0/255, 100*1/255,...,100*254/255, 100*255/255). Thus, a single unsigned byte would be dedicated for a predicted value, and predicting the 6,005,940 ratings of the test set would require 6,005,940 bytes. (We provide C and Python programs for converting into well-formatted prediction files.)
The evaluation criterion is the root mean squared error (RMSE) between predicted ratings and true ones.
The test subset is further split into two equally sized subsets, called Test1 and Test2. Test1 is used to rank contestants on the public Leaderboard and Test2 is used for deciding the winners of the contest. The split between Test1 and Test2 is not disclosed.
The dataset statistics are as follows:
#Users #Items #Ratings #TrainRatings #ValidationRatings #TestRatings
1,000,990 624,961 262,810,175 252,800,275 4,003,960 6,005,940
Besides of the rating itself, there is taxonomy information as well. In other words:
* Each music item has an album.
* Each album has an artist.
* Each album has a genre.
Here is an excerpt from the contest website:
A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist and associated genres. Similarly we have artist and genre annotation for the albums. Both users and items (tracks, albums, artists and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed.
Our solution method:
* We used GraphLab for rapid prototyping of multiple collaborative filtering algorithms and
for fine tuning their parameters. Multiple algorithms where blended together using the ensemble method (you are welcome to read more in our paper!).
* Most linear models take into account only the information about user, music item and rating. So the user-> item matrix is decomposed to build a linear model. However, if we know that a user likes Madona, he may like different Madona albums/or songs than the one he rated. So it is good to have this relation in the model. The same with user who likes Jazz, he may well like other albums of Jazz. Since the rating is for a single song, if we do not relate the artist to the song, we miss this information. We propose a novel method called MFITR (matrix factorization item taxonomy regularization)
1) Add feature vector for artist, that mean that if a user like a certain artist it will add to the ratings of other songs of this artist.
2) Add a linear connection between the artist and the items, or else if they will be independent, thus making section 1) not useful. So we added weights that links between user, artist and album. The weights force that if user rated one album or artist higher, it will result in other items from the same artist or album to be higher as well.
3) We did not use genere in our model, although in principal genre can be added using the same technique.
We believe that Yahoo KDD CUP contest data is a great academic resource for both academic researchers as well as high-tech companies who perform collaborative filtering tasks. This is why we release a large portion of our source code as part of GraphLab's collaborative filtering library. Python/Matlab scripts are available for converting the data between KDD format into GraphLab's format.
Recently, we learned from Noam Koenigstein, a Tel-Aviv PhD candidate, about a related paper that was submitted for publication: Yahoo! Music Recommendations: Modeling Music Ratings with Temporal Dynamics and Item Taxonomy. By Gideon Dror, Noam Koenigstein and Yehuda Koren: In ACM Conference on Recommender Systems (RecSys), 2011. While obtained independently, our solution has some similarities to their proposed solution by adding a feature vectors for artist and albums. Their paper have also nice overview of KDD data characteristics in section 3.
I got another interesting resource from Yao Wu: InnerSpace group website: http://apex.sjtu.edu.cn/apex_wiki/svdfeature - they won the 3rd place this year.
And here is the workshop website including all presentations.
Saturday, August 6, 2011
The quiet rise of Gaussian Belief Propagation (GaBP)
Gaussian Belief Propagation is an inference method on a Gaussian graphical model which is related to solving a linear system of equations, one of the fundamental problems in computer science and engineering. I have published my PhD thesis on applications of GaBP in 2008.
When I started working on GaBP, it was absolutely useless algorithm with no documented applications.
Recently, I am getting a lot of inquiries from people who applying GaBP on real world problems. Some examples:
* Carnegie Mellon graduate student Kyung-Ah Sohn, working with Eric Xing, is working on regression problem for finding causal genetic variants of gene expressions, considered using GaBP for computing matrix inverses.
* UCSC researcher Daniel Zerbino using suing GaBP for smoothing genomic sequencing measurements with constraints.
* UCSB graduate student Yun Teng is working on implementing GaBP as part of the KDT (knowledge discovery toolbox package).
Furthermore, I was very excited to find out today from Noam Koenigstein, a Tel Aviv university graduate about Microsoft Research Cambridge project called MatchBox, which is using Gaussian BP for collaborative filtering and being actually deployed in MS. Some examples to other conversations I had are:
* Wall Street undisclosed company (that asked to remain private) who is using GaBP for parallel computation of linear regression of online stock market data.
* A gas and oil company was considering to use GaBP for computing the main diagonal
of the inverse of a sparse matrix.
I thought about writing on the common questions I am getting so my answers can be useful for a wider audience. Here are some of the questions:
Q: What is the difference between GaBP and other iterative methods for solving a linear system of equations like the Jacobi method?
A: GaBP typically converges in a fewer number of iterations relative to Jacobi, since it deploys information from the second derivative (the precision/variance of the Gaussian). However, each iteration is more computation intensive since for each non-zero entry of the linear relation matrix a Jacobi like computation is done. Thus, the algorithms works well for sparse matrices.
Q: When should I use GaBP?
A: I believe there is no silver bullet and the answer depends on the data. I recommend to always try several algorithms like Jacobi, conjugate gradients etc. and compare performance to GaBP. On some cases GaBP works very well. For example: matrices with zero, -1, 1 entries. Matrices with same number of non zeros in each row and column. CDMA. 2D Poisson problem.
Q: Can GaBP be applied to dense matrices?
A: Yes, there is an algorithm variants for dense matrices - see section 3.3 of my PhD thesis.
Q: Does GaBP always converge?
A: No, like the Jacobi method, its converges is relates to the data matrix. However, it is always possible to force converges, at the cost of higher computation. See http://arxiv.org/abs/0901.4192
Q: What is the best way to understand GaBP better?
A: I recommend downloading the Matlab code from here, and experimenting with it on some toy problems.
Q: Can GaBP converge in cases where Jacobi does not converge?
A: Yes, in some cases. See example in equation 57 in here.
Q: Can the main diagonal of the sparse inverse matrix be computed using GaBP?
A: Yes, but only an approximation of it. The main inverse if approximated by the
output precision entries.
Q: can GaBP be used on non-square/non-symmetric matrices?
A: Yes, see: section III in here.
Q: Can GaBP be used for computing the full inverse of a sparse matrix?
A: Yes. The way to compute the full inverse is by solving n systems of linear equations where
the observation vector b_i = e_i, namely the all zero vector with 1 in its i-th entry. This can be done efficiently in parallel. Note that while the linear system may be sparse, the inverse solution is typically not sparse.
Q: Is there an efficient multicore implementation of GaBP?
A: Yes! as a part of the GraphLab linear solver library.
Q: What is the relation to linear belief functions?
A: Linear belief functions describe a joint Gaussian distribution, combine them, marginalize, etc. Compared to representing a joint Gaussian by its covariance matrix, linear belief functions allows to have infinite variance in some directions, which corresponds to no belief about the value in that direction. The same can be done in GaBP, by having a zero precision value of that node. However, careful handling of the ordering of computation is needed to avoid division by zero.
Q: I need to solve Ax=b where A = UU' and U is given and not sparse.
The problem is that I would like to avoid computation of UU'.
A: First you need to solve U\b using GaBP for getting the observation
and then you need to solve U'\(U\b).
Q: Recently i found your PHD thesis about solving systems of linear equations using GaBP and then the GraphLab. I have two questions, what's the advantage of GaBP over algorithms like sparseLM in solving linear systems of equations?
A: I am not familiar with sparseLM algorithm - do you mean http://www.ics.forth.gr/~lourakis/sparseLM/ ? It seem to be a non-linear algorithm. When the cost function is linear (like in a linear system) there is no benefit in using a non-linear method.
Q: Is it possible to incrementally solve the "Ax=b" equation as new measurements are added to the matrices "A" and "b"?
A: No. The GaBP is an offline algorithm. The scenario you are describing sounds like a Kalman filter problem - where new measurements are constantly introduced into the model. You can take a look
at my paper: "Distributed Kalman filter via Gaussian Belief Propagaion". You may also want to take a look at rank one update algorithms: where matrix A is updated based on new observations.
If you have any other interesting questions - please email me I am always happy to answer!
When I started working on GaBP, it was absolutely useless algorithm with no documented applications.
Recently, I am getting a lot of inquiries from people who applying GaBP on real world problems. Some examples:
* Carnegie Mellon graduate student Kyung-Ah Sohn, working with Eric Xing, is working on regression problem for finding causal genetic variants of gene expressions, considered using GaBP for computing matrix inverses.
* UCSC researcher Daniel Zerbino using suing GaBP for smoothing genomic sequencing measurements with constraints.
* UCSB graduate student Yun Teng is working on implementing GaBP as part of the KDT (knowledge discovery toolbox package).
Furthermore, I was very excited to find out today from Noam Koenigstein, a Tel Aviv university graduate about Microsoft Research Cambridge project called MatchBox, which is using Gaussian BP for collaborative filtering and being actually deployed in MS. Some examples to other conversations I had are:
* Wall Street undisclosed company (that asked to remain private) who is using GaBP for parallel computation of linear regression of online stock market data.
* A gas and oil company was considering to use GaBP for computing the main diagonal
of the inverse of a sparse matrix.
I thought about writing on the common questions I am getting so my answers can be useful for a wider audience. Here are some of the questions:
Q: What is the difference between GaBP and other iterative methods for solving a linear system of equations like the Jacobi method?
A: GaBP typically converges in a fewer number of iterations relative to Jacobi, since it deploys information from the second derivative (the precision/variance of the Gaussian). However, each iteration is more computation intensive since for each non-zero entry of the linear relation matrix a Jacobi like computation is done. Thus, the algorithms works well for sparse matrices.
Q: When should I use GaBP?
A: I believe there is no silver bullet and the answer depends on the data. I recommend to always try several algorithms like Jacobi, conjugate gradients etc. and compare performance to GaBP. On some cases GaBP works very well. For example: matrices with zero, -1, 1 entries. Matrices with same number of non zeros in each row and column. CDMA. 2D Poisson problem.
Q: Can GaBP be applied to dense matrices?
A: Yes, there is an algorithm variants for dense matrices - see section 3.3 of my PhD thesis.
Q: Does GaBP always converge?
A: No, like the Jacobi method, its converges is relates to the data matrix. However, it is always possible to force converges, at the cost of higher computation. See http://arxiv.org/abs/0901.4192
Q: What is the best way to understand GaBP better?
A: I recommend downloading the Matlab code from here, and experimenting with it on some toy problems.
Q: Can GaBP converge in cases where Jacobi does not converge?
A: Yes, in some cases. See example in equation 57 in here.
Q: Can the main diagonal of the sparse inverse matrix be computed using GaBP?
A: Yes, but only an approximation of it. The main inverse if approximated by the
output precision entries.
Q: can GaBP be used on non-square/non-symmetric matrices?
A: Yes, see: section III in here.
Q: Can GaBP be used for computing the full inverse of a sparse matrix?
A: Yes. The way to compute the full inverse is by solving n systems of linear equations where
the observation vector b_i = e_i, namely the all zero vector with 1 in its i-th entry. This can be done efficiently in parallel. Note that while the linear system may be sparse, the inverse solution is typically not sparse.
Q: Is there an efficient multicore implementation of GaBP?
A: Yes! as a part of the GraphLab linear solver library.
Q: What is the relation to linear belief functions?
A: Linear belief functions describe a joint Gaussian distribution, combine them, marginalize, etc. Compared to representing a joint Gaussian by its covariance matrix, linear belief functions allows to have infinite variance in some directions, which corresponds to no belief about the value in that direction. The same can be done in GaBP, by having a zero precision value of that node. However, careful handling of the ordering of computation is needed to avoid division by zero.
Q: I need to solve Ax=b where A = UU' and U is given and not sparse.
The problem is that I would like to avoid computation of UU'.
A: First you need to solve U\b using GaBP for getting the observation
and then you need to solve U'\(U\b).
Q: Recently i found your PHD thesis about solving systems of linear equations using GaBP and then the GraphLab. I have two questions, what's the advantage of GaBP over algorithms like sparseLM in solving linear systems of equations?
A: I am not familiar with sparseLM algorithm - do you mean http://www.ics.forth.gr/~lourakis/sparseLM/ ? It seem to be a non-linear algorithm. When the cost function is linear (like in a linear system) there is no benefit in using a non-linear method.
Q: Is it possible to incrementally solve the "Ax=b" equation as new measurements are added to the matrices "A" and "b"?
A: No. The GaBP is an offline algorithm. The scenario you are describing sounds like a Kalman filter problem - where new measurements are constantly introduced into the model. You can take a look
at my paper: "Distributed Kalman filter via Gaussian Belief Propagaion". You may also want to take a look at rank one update algorithms: where matrix A is updated based on new observations.
If you have any other interesting questions - please email me I am always happy to answer!