Thursday, December 29, 2011

Multicore parser - part 2 - parallel perl tutorial

In the first part of this post, I described how to program a multicore parser, where the task is to translate string IDs into consecutive integers that will be used for formatting the input of many machine learning algorithms. The output of part 1 is a map between strings and unsigned ints. The map is built using a single pass over all the dataset.

Now an additional task remains, namely translating the records (in my case phone call records) into a Graph to be used in Graphlab. This is an embarrassingly parallel task - since the map is read only - multiple threads can read it in parallel and translate the record names into graph edges. For example the following records:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
YXVaVQJfYZp BZurhasRwat 050803 235959 6
BqFnHyiRwam jdJBsGbXUwu 050803 235959 242
are translated into undirected edges:
1 2 
1 3
2 4
etc. etc.
The code is part of Graphlab v2 and can be downloaded from our download page.

In the current post, I will quickly explain how to continue setting up the parser.
The task we have now is to merge multiple phone calls into a single edge. It is also useful to sort the edges by their node id. I have selected to program this part in perl, since as you are going to see in a minute it is going to be a very easy task.

INPUT: A gzipped files with phone call records, where each row has two columns: the caller and the receiver; each of them as an unsigned integer. It is possible the same row will repeat multiple times in the file (in case multiple phone calls between the same pair of people where logged in different times).
OUTPUT: A gzipped output file with sorted unique phone call records. Each unique caller receiver pair will appear only once.

Tutorial - how to execute a parallel task in Perl.
1) Download and extract Parallel Fork Manager
wget http://search.cpan.org/CPAN/authors/id/D/DL/DLUX/Parallel-ForkManager-0.7.5.tar.gz
tar xvzf Parallel-ForkManager-0.7.5.tar.gz
mv Parallel-ForkManager-0.7.5 Parallel

2) Create a file named parser.pl with the following lines in it:
#!/bin/perl -w
my $THREADS_NUM = 8;
use Parallel::ForkManager;
$pm = new Parallel::ForkManager($THREADS_NUM);

opendir(DIR, "your/path/to/files");
@FILES= readdir(DIR); 

foreach $file (@FILES) {

  # Forks and returns the pid for the child:
  my $pid = $pm->start and next;

  print "working on file " . $file;
  system("gunzip -c $file | sort -u  -T . -n -k 1,1 -k 2,2 -S 4G >" . $file . "gz" );

  $pm->finish; # Terminates the child process
}

closedir(DIR);
Explanation
1) ForkManager($THREADS_NUM) sets the number of parallel threads - in this case 8.
2) For each file, the file is unzipped using "gunzip -c", and sorted uniquely (-u command line flag). The -T flag is an optional argument in case your temp drive does not have enough space. -k 1,1 sets the sorting key to be the first column and the second -k 2,2 sets column 2 as the key in case of a match in the first column. -S flag sets the buffer size such that the full input file will fit into memory. 4G is 4 Gygabytes of memory.
3) The system() command runs any shell command line, so you can change the parallel loop execution to perform your own task easily.

Overall, we got in a few lines of code a parallel execution environment which would be much harder to setup otherwise.

Wednesday, December 28, 2011

NSF grant awarded to GraphLab

I am glad to announce, that using the great help of Joel Welling, from Pittsburgh Supercomputing Center, GraphLab project was awarded an additional NSF grant. We have obtained an additional 200,000 CPU hours on both BlackLight supercomputer as well as Kraken NSF supercomputer. Grant was obtained via XSEDE (Xtreme Science and Engineering Discovery Environment). We are going to use those our for scaling up GraphLab for systems with thousands of cores.

Sunday, December 25, 2011

How to write a multicore parser

When dealing with machine learning, one usually ignores the (usually boring!) task of preparing the data to be used in any of the machine learning algorithms. Most of the algorithms have either linear algebra or statistical foundation and thus the data has to be converted to a numeric form.

In the last couple of weeks I am working on the efficient design of multicore parser, that allows converting raw string data into a format usable by many machine learning algorithms. Specifically, I am using CDR (call data records) from a large European country. However, the dataset has several typical properties, so I believe my experience is useful for other domains.

The raw CDR data I am using looks like this:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
xtGBWvpYgYK jdJBsGbXUwu 050803 235959 242
ZpYQLFFKyTa atslZokWZRL 050803 235959 504
WMjbYygLglR BqFnCfuNgio 050803 235959 51
hcLiEoumskU RcNSJSEmidT 050803 235959 7
qBvUQlMABPv atslBvPNusB 050803 235959 3609
jdSqVjxPlBn BqFnHyiRwam 050803 235959 23
VCWyivVLzRr atslSbOOWXz 050803 235959 8411
PnLaFqLJrEV atslZokWZRL 050803 235959 8806
PnLaCsEnqei atslBvPNusB 050803 235959 590
The first column is an anonymized caller ID, the second column is an anonymized receiver ID, the third column is the date, the fourth is the time, and the last column is the duration of the call.

Now to the data magnitude. If your dataset is small, no need for any fancy parsing, you can write a python/perl/matlab script to convert it to numeric form and avoid reading further... However, this dataset is rather big: every day there are about 300M unique phone calls. So depending on how many days you aggregate together, you can get to quite a large magnitude. For a month there are about 9 billion phone calls logged.

To make the CDR useful, we need to convert the hashed string ID into a number, hopefully a consecutive increasing number. That way we can express the phone call information as a matrix.
Then we can use any of the fundamental machine learning algorithms like: SVM, Lasso, Sparse logistic regression, matrix factorization, etc. etc.

One possible approach for converting strings to integer, is taken in Vowpal Wabbit, where strings are hashed into numeric IDs during the run. However, there is a risk that two different string IDs will be mapped into the same integer. So depending on the application this may be acceptable. I have chosen to take a different approach - which is to simply assign a growing consecutive ID to each string.

I have implemented the code in GraphLab, where GraphLab is not the intuitive tool to be used for this task (although it was convenient to use). In a multicore machine, several GraphLab threads are running in parallel and parsing different portions of the input files concurrently. We have to be careful that node IDs will remain consecutive across the different files. Since stl/boost data structures are typically not thread safe, I had to use a mutex for defending against concurrent insertions to the map. (Concurrent reads from a stl/boost map are perfectly fine).

void assign_id(uint & outval, const string &name){

  //find if the string is already in the map.
  //this part is thread safe since find() is thread safe
  boost::unordered_map<string,uint>::iterator it = hash2nodeid.find(name);
  if (it != hash2nodeid.end()){
     outval = it->second;
     return;
  }

  //if not, we need to insert it to the map
  //now, we must lock since operator[] is not thread safe
  mymutex.lock();
  outval = hash2nodeid[name];
  if (outval == 0){
      hash2nodeid[name] = ++conseq_id;
      outval = conseq_id;
  }
  mymutex.unlock();
}


One should be careful here, since as I verified using gprof profiler, about 95% of the running time is wasted on this critical section of assigning strings to ints.

Initially I used std::map<string,int> but I found it to be rather slow. It seems that std::map is implemented using an underlying tree and so insertions are costly: log(N). I switched to boost::unordered_map which is actually a hash table implementation with O(1) insertions. This gave x2 speedup in runtime.

Second, since each day of input file amount to about 5GB of gzipped file, I used boost gzipped stream for avoiding the intermediate extraction of the input files. Here is an example:
char linebuf[128];
    std::ifstream in_file(filename).c_str(), std::ios::binary);
    boost::iostreams::filtering_stream<boost::iostreams::input> fin;
    fin.push(boost::iostreams::gzip_decompressor());
    fin.push(in_file); 

    while(true){
      fin.getline(linebuf, 128);
      if (fin.eof())
        break;
      //parse the line
    }
    fin.pop();
    fin.pop();
    in_file.close();
Overall, I believe the result is quite efficient: for parsing 115GB of compressed CDR data (total of 6.4 billion phone calls) it takes 75 minutes on a 4 core machine. There where about 182M unique IDs assigned. (Quad core AMD Opteron 2.6Ghz). Total of 12.8 billion map lookups (about 3M lookups a second).

Some performance graph:


Summary of lessons learned:

  1. C parsing is way more efficient than perl/python/matlab.
  2. Opening gzipped files is a waste of time and space - better work directly on the gzipped version.
  3. Parallel parsing has a good speedup up to a few (3) cores. More cores do not improve (due to heavy IO..).
  4. Better use hash table than sorted tree: boost::unordered_map is twice is fast than std::map

Wednesday, December 21, 2011

Tuesday, December 20, 2011

PETSc and SLEPc parallel scientific computing packages

A few days ago I asked a physicist friend of mine, Doron Naveh, from the Nano Dept at CMU
which is the best MPI based package for solving eigenvalue problems. He recommended using PETSc.

Here is a quick explanation on how to setup PETSc, SLEPc and run a simple SVD solver.

From PETSc webpage:

PETSc, pronounced PET-see (the S is silent), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It supports MPI, shared memory pthreads, and NVIDIA GPUs, as well as hybrid MPI-shared memory pthreads or MPI-GPU parallelism.

From Slepc homepage:
SLEPc is a software library for the solution of large scale sparse eigenvalue problems on parallel computers. It is an extension of PETSc and can be used for either standard or generalized eigenproblems, with real or complex arithmetic. It can also be used for computing a partial SVD of a large, sparse, rectangular matrix, and to solve quadratic eigenvalue problems.


Here are quick installation instructions:

Download the latest version
For example:
wget http://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-3.2-p5.tar.gz
tar xvzf petsc-3.2-p5.tar.gz
cd petsc-3.2-p5
c
./configure --with-cc=gcc --with-fc=gfortran --download-f-blas-lapack=1 --download-mpich=1
make all test

Tip: you may want to add --with-precision=single to work with floats instead of doubles (to save memory).
Tip: you may want to use --download-openmpi=1 instead.

Download SLEPc
wget http://www.grycap.upv.es/slepc/download/distrib/slepc-3.2-p3.tar.gz
tar xvzf slepc-3.2-p3/
setenv PETSC_DIR /mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5
setenv SPLEC_DIR ./slepc-3.2-p3
setenv PETSC_ARCH arch-linux2-c-debug

cd $SLEPC_DIR
./configure
make SLEPC_DIR=$PWD PETSC_DIR=/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5 PETSC_ARCH=arch-linux2-c-debug
make test

Example for running SVD.
create a directory named svd inside petsc-3.2-p5:
mkdir svd
cd svd

Get the file ex14.c from: http://www.grycap.upv.es/slepc/handson/handson4.html
wget http://www.grycap.upv.es/slepc/documentation/current/src/svd/examples/tutorials/ex14.c.html

Create the following file named Makefile, inside the directory svd and put the following in it:
SLEPC_LIB=../arch-linux2-c-debug/lib/
SLEPC_INC=../include/
SLEPC_ARCH_INC=../arch-linux2-c-debug/include/

PETSC_INC=../../petsc-3.2-p5/include/
PETSC_ARCH_INC=../../petsc-3.2-p5/arch-linux2-c-debug/include/
PETSC_LIB=../../petsc-3.2-p5/arch-linux2-c-debug/lib/

MPI_LIB=../../petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/lib

ex14:
        mpicc -o ex14 ex14.c -L${SLEPC_LIB} -L${PETSC_LIB} -L${MPI_LIB} -I${PETSC_INC} -I${SLEPC_INC} -I${SLEPC_ARCH_INC} -I${PETSC_ARCH_INC} -lslepc -lpetsc -lmpich -lm -lflapack -lfblas -lfmpich -lX11 -lmpl -lpthread -lrt -lgfortran
Note: before mpicc it should be a tab (and not spaces).

In Matlab/Octave perpare a Petsc binary file as follows:

>> addpath ../../petsc-3.2-p5/bin/matlab/
>> A=rand(10,10); 
>> PetscBinaryWrite('A',sparse(A));

Now run the example:
<25|0>bickson@biggerbro:~/petsc/slepc-3.2-p3/svd_example$ ./ex14 -file A -svd_nsv 10

Singular value problem stored in file.

 Reading REAL matrix from a binary file...
 Number of iterations of the method: 1
 Solution method: cross

 Number of requested singular values: 1
 Stopping condition: tol=1e-08, maxit=100
 Number of converged approximate singular triplets: 10

          sigma            relative error
   --------------------- ------------------
        5.389506           6.40913e-16
        1.680646            6.9377e-16
        1.364558             1.005e-15
        1.102023           9.88382e-16
        0.825067           2.23038e-15
        0.756176           1.99212e-15
        0.498011            2.9545e-15
        0.335541           1.51681e-15
        0.187449             4.207e-14
        0.124114           3.88213e-14

Useful command line options:
-help             #display help
-info             #verbose mode
-svd_type         #select type of SVD method. Legal types are: cross,cyclic,lanczos,trlanczos,lapack
-svd_view         #display more info about the solver
-svd_nsv          #set the number of singular values requested
-svd_max_it       # maximal number of iterations

Running in parallel:
mpirun -np 8 slepc_program [command line arguments]


Troubleshooting
Problem:
../arch-linux2-c-debug/lib//libslepc.a(veccomp.c.o): In function `SlepcSumNorm2_Local':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:173: undefined reference to `MPI_Abort'
../arch-linux2-c-debug/lib//libslepc.a(veccomp.c.o): In function `VecNormCompEnd':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:185: undefined reference to `MPI_Type_free'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:186: undefined reference to `MPI_Type_free'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:187: undefined reference to `MPI_Op_free'

Solution: You should link againt MPI, for example add -lmpich to the Makefile command.

Problem:
../arch-linux2-c-debug/lib//libslepc.a(veccomp.c.o): In function `GetNorm2':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:137: undefined reference to `sqrt'
Solution: You should link to math lib, namely add -lm to the Makefile command line.

problem:
../arch-linux2-c-debug/lib//libslepc.a(contiguous.c.o): In function `SlepcUpdateVectors_Noncontiguous_Inplace':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:186: undefined reference to `dgemm_'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:199: undefined reference to `dgemm_'
../arch-linux2-c-debug/lib//libslepc.a(contiguous.c.o): In function `SlepcUpdateStrideVectors':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:391: undefined reference to `dgemm_'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:401: undefined reference to `dgemm_'
../arch-linux2-c-debug/lib//libslepc.a(contiguous.c.o): In function `SlepcVecMAXPBY':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:475: undefined reference to `dgemv_'
Solution: You should link against blas and lapack, namely add -lfblas -lflapack to your Makefile command

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libpetsc.a(xops.c.o): In function `PetscDrawLine_X':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:29: undefined reference to `XSetForeground'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:33: undefined reference to `XDrawLine'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libpetsc.a(xops.c.o): In function `PetscDrawArrow_X':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:45: undefined reference to `XSetForeground'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:49: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:52: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:53: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:55: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:56: undefined reference to `XDrawLine'
Solution: Add -lX11 to the makefile compilation command line

problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libmpich.a(init.o): In function `PMPI_Init':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/init/init.c:106: undefined reference to `MPL_env2str'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/init/init.c:132: undefined reference to `MPL_env2bool'

Solution: Add -lmpl to compilation command line

problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libmpich.a(mpiu_thread.o): In function `MPIU_Thread_create':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/util/thread/mpiu_thread.c:66: undefined reference to `pthread_create'

Solution: add -lpthread to compilation command line

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libmpich.a(ad_iwrite.o): In function `ADIOI_GEN_aio_wait_fn':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/romio/adio/common/ad_iwrite.c:266: undefined reference to `aio_suspend64'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/romio/adio/common/ad_iwrite.c:276: undefined reference to `aio_error64'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/romio/adio/common/ad_iwrite.c:278: undefined reference to `aio_return64'
Solution: use mpicc as the compiler instead of gcc

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dgesvd.o): In function `dgesvd':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dgesvd.f:215: undefined reference to `_gfortran_concat_string'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dgesvd.f:379: undefined reference to `_gfortran_concat_string'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dhseqr.o): In function `dhseqr':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dhseqr.f:345: undefined reference to `_gfortran_concat_string'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dlasd0.o): In function `dlasd0':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dlasd0.f:200: undefined reference to `_gfortran_pow_i4_i4'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dlasda.o): In function `dlasda':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dlasda.f:330: undefined reference to `_gfortran_pow_i4_i4'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dlasda.f:341: undefined reference to `_gfortran_pow_i4_i4'
Solution: Add -lgfortran to the compiler command line

Problem:
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Must indicate a file name with the -file option.!
[0]PETSC ERROR: ------------------------------------------------------------------------
Solution: You did not give an input file using the -file command line option. Problem:
13|0>bickson@bigbro6:~/petsc/slepc-3.2-p3$ ./configure
Checking environment...
ERROR: PETSc is not configured for architecture arch-installed-petsc
Solution: You forgot to define PETSC_ARCH. Define it using: setenv PETSC_ARCH arch-linux2-c-debug

Friday, December 16, 2011

News from NIPS - Divide and Conquer Matrix Factorization and efficient K-means

Here are some interesting papers that caught my eyes at NIPS. While being very simple I believe they can be very effective in practice.

"Divide and Conquer Matrix Factorization" is a joint work by Lester Mackey, Ameet Talwalkar and Michael Jordan from Berkeley. The idea is pretty simple. Assume you have a large matrix you can not fit into memory you would like to factorize to two low rank matrices. There are 3 steps in the construction:

  1. The large matrix is first partitioned into several small matrices by slicing the original matrix to column groups. 
  2. Each smaller matrix is solved in parallel using a cluster of computers, using any desired matrix factorization algorithm. 
  3. Now it remains how to aggregate the result back for getting the original factorized solutions of the larger matrix. This can be done by column projection, namely computing the least square solution of the smaller matrix onto the larger one. (See details in the paper!) and averaging the different solutions together.
It seems this construction has a pretty descent speedup. I am considering implementing this method in GraphLab as well. Ping me if you are interested in using this algo.


The second paper, Fast and Accurate K-means for Large Datasets is a method for speeding up k-means computation by Michael Shindler Alex Wong andAdam Meyerson. It got the best student paper award. The idea is pretty simple and actually reminds me the power iteration clustering.

The observation at the basis of this construction is that one of the heavy computational burdens in K-means is computing distances from each data point to all K-cluster heads. A very simple trick is proposed: a random vector is selected and the data point as well of all cluster locations are projected into it. Now we got a list of scalars for each cluster and it remains to find out which is closest to the projected data point. This significantly saves work since instead of computing D-dimensional distances we compute only 1-dimensional distances. It seems that in practice accuracy of the method is not significantly worse.

Monday, December 12, 2011

MPI vs. Hadoop

Here is an interesting blog post from John Langford (Yahoo! Research NY) about some of the pros and cons of MPI vs. Hadoop. Different approaches to ML are discussed and summarized by John as follows:
The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:
  • For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
  • For smaller datasets with computationally intense models, GPU approaches seem very compelling.
  • For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug. The model needs to fit into memory. If that’s not the case, then other approaches are required.

Anyone who reads this blog and is attending the NIPS big ML workshop is encouraged to contact me since I plan to be there! Note that the asynchronous approach reference (bullet 3) actually refers to our GraphLab project.  I liked the phrasing: it's scary in practice..  But we do support also BSP execution so it is always possible to debug that way.

Wednesday, December 7, 2011

Preview of GraphLab v2 new features!

GraphLab v2 new features are going to be presented in Intel's ISTC-CC retreat. The new features are the result of the hard work for Joey Gonzalez, who spent the summer as part of our collaboration with Yahoo! Research, in Alex Smola's lab.

The main challenge identified by Joey with GraphLab v1, is the handling of natural (power-law) graphs.

The problem with iterative graph algorithms of such graphs, is that when computing the update of a high degree node one has to touch large portion of the graph, and thus is is hard to distribute computation. Furthermore, computation of a high degree node is done sequentially and thus we loose some of the parallelism. Last, whenever an algorithm use locking of neighbors state (to avoid concurrent writes to their data structures) large number of locks has to be acquired which significantly degrades performance. 


The first solution is factorized update functions (shown above). The work on a high degree nodes is divided into smaller pieces of work which can be done in parallel.

The second solution is associate/commutative update functions, a.k.a delta updates (shown below). Whenever sums over a large number of nodes have to be computed, when the state of most of them is constant, there is no much point in summing up the result again and again. In this case, only the change in the result is tracked.


Some performance results on Yahoo! webgraph (25.5M nodes, 355M edges):

GraphLab v2 experiment version can be downloaded from our mercurial code base (using the command hg update -v2). We plan a full release in a couple of months.

Tuesday, December 6, 2011

GraphLab debugging with Eclipse

Here are fresh instructions I got from Young Cha, A graduate student at UCLA on
how to setup Graphlab to work with Eclipse.

1. Installing CDT for Eclipse
(The instructions below uses Eclipse's plug-in install function. There are other ways of installing CDT for Eclipse. Please refer tohttp://eclipse.org/cdt/ for more information)
  • In the menu bar, select 'Help' -> 'Install New Software..'
  • Click 'Add'
  • Type 'CDT' in Name field and http://download.eclipse.org/tools/cdt/releases/galileo in Location field (If you use a different version of Eclipse, please replace 'galileo' with your version ('helios' or 'indigo')
  • Check 'CDT Main Features' and 'CDT Optional Features' (You may choose only the modules you need. If you don't know about it, please check all)
  • Click 'Next' and install CDT
2. Using Cmake's CDT generator to convert GraphLab into a Eclipse CDT project
  • Let's assume that your graphlabapi directory is located in ~/graphlab/graphlabapi and you want to create CDT project at ~/graphlab/cdt_build
    • cd ~/graphlab
    • mkdir cdt_build
    • cd cdt_build
    • cmake -G"Eclipse CDT4 - Unix Makefiles" -D CMAKE_BUILD_TYPE=Debug ../graphlabapi
  • Please refer to http://www.cmake.org/Wiki/Eclipse_CDT4_Generator for more information
3. Importing GraphLab to Eclipse CDT
  • In the menu bar, select 'File' -> 'Import..'
  • Select 'General' -> 'Existing Projects into Workspace'
  • Click 'Next'
  • Click ''Browse' and select the 'cdt_build' directory you created previously
  • Please refer to http://www.cmake.org/Wiki/Eclipse_CDT4_Generator for more information
Thanks so much Young! Your instructions are going to be very helpful for our other users!

Friday, December 2, 2011

SpotLight: Michael Ekstrand and the LensKit Project

As part of our new collaboration with LensKit, where Graphlab collaborative filtering library will be used as one of LensKit engines, here is an overview of this interesting project from Michael Ekstrand, a PhD student at GroupLens research, Univ. of Minnesota.

- What is the goal of the LensKit project?

The goal of LensKit is to provide a flexible, extensible platform for researching, studying, and deploying recommender systems. We are primarily concerned with providing a framework for building and integrating recommenders, high-quality implementations of canonical algorithms which perform well on research-scale data sets, and tools for running reproducible offline evaluations of algorithms.

- What other relevant projects exist in the GroupLens lab?

We have been building recommenders internally for quite some time, starting with the GroupLens recommender system, followed by MovieLens and later recommendation efforts (such as various research paper recommenders which went under the name TechLens, and our new book recommender service BookLens). Several years ago, the MultiLens project made some of our recommender code available for use outside the lab.
LensKit is a brand-new, from-scratch implementation of core recommender algorithms that we will be using internally going forward. BookLens is currently built on top of it, and we plan to move MovieLens from its current internal recommender code, related to the MultiLens code, to LensKit sometime in the coming months. A number of of LensKit's design decisions have been driven by the needs of the BookLens project, as we have been making it suitable for both offline runs and integration into web applications. Future web-based recommender systems, both within GroupLens and externally, will be able to pick up LensKit and integrate it very easily with the complex needs of web server environments.

- Who is working on the project and for how long?

I started the project about 2 years ago, working on it off-and-on. Development picked up substantially in late 2011-early 2011, and we had our first public release early this year. Michael Ludwig has been involved with the project for most of that time, helping particularly with design and requirements work as he integrates it with the BookLens code, and also contributing code directly as well. Jack Kolb is an undergraduate who has been working with me since late spring, and we have had other students helping from time to time as well.

- What is the status of development?

Right now we are in late beta, with stable APIs for common recommender tasks, a robust infrastructure, and good implementations of several classic algorithms. We are currently working on documentation and a refactoring of our recommender configuration infrastructure; once that is completed and tested, we should be ready to declare a stable 1.0 release to build on going forward. It is pretty safe to build against LensKit at this point, though; the main interfaces are stable, and the APIs for configuration shouldn't change too much. Some code might need to be updated for 1.0, but it should be limited to the code to configure the recommender.

- Which open source license are you suing?

LensKit is licensed under the GNU GPL version 2 or later with a link exception to allow linking between it and modules under other licenses (whether proprietary or GPL-incompatible open source). LensKit can be used and deployed internally without restriction; projects distributing it externally must make its source code available to their users. The link exception is the same as that used by the GNU Classpath project.
Many of the libraries we depend on are licensed under the Apache license (APLv2).

- What are the interesting algorithms currently implemented?

We provide implementations of user-user and item-item collaborative filtering (with similarity matrix pre-computation and truncation in item-item), Funk's regularized gradient descent SVD (see also Paterek paper in KDD 2007), and Slope-One recommenders.

- What is the level of parallelism you allow (is there support for parallel execution?)

Currently, none of the algorithms are parallelized. The evaluator is capable of training and evaluating multiple algorithms in parallel, even sharing some of the major data structures like the rating snapshot between runs, to achieve good throughput on comparative evaluations. Parallelizing some of them is on our radar, though.

- Do you have some performance numbers on Netflix/KDD data or similar datasets?

We provide accuracy data from the MovieLens data sets and Yahoo! Music data sets in our RecSys 2011 paper (http://grouplens.org/node/479). Efficiency numbers are a bit rough, since we've been running parallel builds, but we can train an item-item model on the MovieLens 10M data set in about 15 minutes, and on one shard of the Y! Music data set (from WebScope, each shard has ~80M ratings) in about 20-26 hours. FunkSVD takes 30-50 minutes, depending on the model size and parameters, on ML10M and 14 hours on Y!M.

- Regarding the potential collaboration with GraphLab. What may be the benefits of this collaboration?

We are looking to integrate GraphLab to allow LensKit users to have easy access to high-quality implementations of a wider variety of matrix factorization methods in particular, and to leverage your existing work rather than re-building everything ourselves. It will also make it easy to integrate GraphLab-based algorithms with interesting recommender environments, such as web applications or comparative evaluation setups.

Overall, here at the GraphLab project we are very excited about this collaboration. We believe that LensKit has a few properties that are currently missing in GraphLab: namely output processing for finding the best recommendation  as well as improved user interaction and better UI.

Thursday, December 1, 2011

Linear stable models

I just heard from Alex Ihler, UC Irvine, that he participated in early Novemeber in Workshop called "Counting, Inference, and Optimization on Graphs" in Princeton University, NJ. Yongi Mao from Otawa University Gave a nice talk title "Normal Factor Graphs, Linear Algebra and Probablistic Modeling".

In my NIPS 2010 paper, I (and my advisor Carlos Guestrin) where the first to show how to compute inference in linear models involving heavy tailed stable distributions. This was the first closed form solution to this problem. The trick was to use duality and compute inference in the Fourier (characteristic function) domain. My work is heavily based on Mao's work, since he was the first to formulate this duality in a general linear model. Here are his related papers:
Y. Mao and F. R. Kschischang. On factor graphs and the Fourier transform. In IEEE Trans. Inform. Theory, volume 51, pages 1635–1649, August 2005.
Y. Mao, F. R. Kschischang, and B. J. Frey. Convolutional factor graphs as probabilistic models. In UAI ’04: Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 374–381, Arlington, Virginia, United States, 2004. AUAI Press.

It seems that Mao was quite frustrated that his work seemed merely theoretical at that time, while it was hard for him to find an application to his construction. So he was delighted to see that I have actually used his construction towards a very useful application. That is why he titles my work "a missed opportunity" because he is a bit upset that he did not think about it at the time... But anyway this is science - a small progress at each step..

Power iteration clustering

I attended today a very interesting talk give by Prof. William Cohen from CMU (Given at HUJI). The talk describes a very simple spectral clustering method called power iteration clustering. A paper at the same name by Frank Lio and William Cohen appeared in last year's ICML 2010.

What I like about the method, is its simplicity and application to really large datasets. While some of the HUJI audience regarded the presentation quite skeptically since no theoretical results where presented to justify the great empirical performance, it seems that empirically the power iteration clustering method works just like spectral clustering. I guess that future theoretical results will soon to follow. And let me tell you a secret: some of the most successful methods deployed today in practice are simple methods like linear and (sparse) logistic regression, matrix factorization, K-means clustering etc.

In traditional spectral clustering, a square matrix describing the problem is constructed. (Typically the matrix is normalized s.t. the sum of each row equals to one). Then the K top eigenvectors  are computed. For example, if this is a social network then this matrix is the adjacency matrix. When the data is big, we can view this process of eigen decomposition as a dimensionality reduction step from sparse data in high dimension to a lower representation which is typically dense. The second step in spectral clustering is application of some clustering method like K-means to the eigenvectors. It was shown that typically data points which are close to each other has similar values in their eigenvectors.
That is why it makes sense to cluster those points together. 

Note: I am cheating a bit by not providing all the details. A great tutorial for spectral clustering is found here: http://arxiv.org/abs/0711.0189.

In contrast, the proposed power iteration clustering (PIC) computes power iteration using the data matrix. Since the sum of each row is equal to one, it can be thought of a weighted average calculation, where each node computes the average of its neighbors. Of course this process will lead to the same global average in all graph nodes. The main observation in the PIC algorithm, is that the process should be stopped early before every node in the graph has an equivalent sum.  When stopped at the right moment, each node store a cluster distinctive value. Intuitively, this is because nodes which are highly connected in a cluster push their intermediate results to be close to each other. Then the second step of K-means clustering is computed. This step is easier (relative to traditional spectral clustering), since we cluster just scalars in a single vector and not K dimensional vectors.

In terms of work, instead of computing K largest eigenvetors in the traditional spectral clsutering methods, PIC computes only a single eigenvector (which is actually a linear combination of several eigenvectors). So it is K times faster then traditional spectral clustering. 

Here is the algorithm:





The image above shows a demonstration of the power iteration method (taken from their ICML 2010 paper). (a) Shows a problem that K-means clustering has a difficulty operating on. (b) - (d) evolution of the power iteration method. Y-axis is the output scalar value of the output vector. The x-axis represent the different points. The colors are the output of the K-means clustering of scalars. While (b)-(c) are still not accurate, eventually the values of the vector converge very nicely to three clusters.
In principle,  if we would have continued to step (e) where all nodes have converged, we will get the same value in all nodes that would not have been useful.

Anyway, since the conjugate gradient, Lanczos and SVD methods implemented
in GraphLab (As well as K-meams) it is going to be very easy to implement this algorithm in GraphLab. I am planning to do it soon.