Tuesday, April 5, 2011

Yahoo! KDD Cup using GraphLab

I got the following question from an avid reader of this blog:

I was wondering whether you could give me some directions on how to set up GraphLab to run on KDDCUP, especially with regard to the data format of the input files for training, validation and testing (i.e. creation of predictions for submission).

Many thanks,

Nicholas

==========================================

Nicholas Ampazis
Assistant Professor
Director, Intelligent Data Exploration and Analysis Laboratory (IDEAL)
Department of Financial and Management Engineering,
University of the Aegean
41 Koudouriotou street, Chios, 82100, Greece

I think this may interest other readers, so I am posting the answer here. So far GraphLab has been tested with matrix factorization, but I will soon also handle tensor factorization (which divides ratings from different times into groups) and Monte Carlo sampling on top of it. So there is a wide range of algorithms you can try out once you install GraphLab.





Installation

The best way to start is to download the code from the Mercurial repository, which holds the latest version of the matrix factorization code. The source is found here: http://graphlab.org/download.html

Note that for matrix factorization you will also need to install itpp (which relies on BLAS/LAPACK).
Installation instructions for 32-bit Linux are here: http://bickson.blogspot.com/2011/06/graphlab-pmf-on-32-bit-linux.html and for 64-bit Linux here: http://bickson.blogspot.com/2011/02/installing-blaslapackitpp-on-amaon-ec2.html


It is best to install itpp before running the ./configure script; that way GraphLab detects itpp automatically and installation is simpler.


After downloading GraphLab, configure it using:
./configure --bootstrap
This should install cmake and boost if they are missing on your system.



Once GraphLab compiles successfully, the matrix factorization application has been built as well; it is found in the directory demoapps/pmf

Setting up the input files - method 1 - using Matlab
1) Download the file save_c_gl4a.m to your local directory using:
wget http://www.graphlab.ml.cmu.edu/save_c_gl4a.m
2) Download the KDD Yahoo! Cup files (track1 dataset) from: 
http://kddcup.yahoo.com/datasets.php
3) Use the following Matlab script to convert the text dataset into GraphLab input format (the script writes a Matrix Market coordinate header followed by one line per rating).
(It may take a couple of hours to finish, depending on your machine.)
Note that you need to run the script 3 times: with runmode=1 (training data),
runmode=2 (validation data) and runmode=3 (test data).

%Script for converting KDD CUP 2011 data, written by Danny Bickson, CMU
%Can be run in Matlab or Octave
nUsers=1000990;
nItems=624961;
nRatings=262810175;
nTrainRatings=252800275;
nProbeRatings=4003960;
nTestRatings=6005940;


runmode=3;

filename='';
outfile='';
ratings=0;
switch runmode
    case 1
        disp('converting kdd cup 2011 training data - track 1');
        filename='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/trainIdx1.txt';
        ratings=nTrainRatings;
        outfile='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/kddcup';
    case 2
        disp('converting kdd cup 2011 validation data - track 1');
        filename='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/validationIdx1.txt';
        ratings=nProbeRatings;
        outfile='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/kddcupe';
    case 3
        disp('converting kdd cup 2011 test data - track 1');
        filename='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/testIdx1.txt';
        ratings=nTestRatings;
        outfile='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/kddcupt';
end

ff=fopen(filename,'r');
if (ff < 0)
 error('failed to open input file for reading');
end
fout = fopen(outfile,'w');
if (fout < 0)
 error('failed to open file for writing');
end
%write output file matrix market format header
fprintf(fout, '%%%%MatrixMarket matrix coordinate real general\n');
fprintf(fout,'%d %d %d\n', nUsers, nItems, ratings);
cnt=1;
for j=1:nUsers
    [a,num]=fscanf(ff,'%d|%d',2);
    assert(num==2);

    user=a(1);
    if (mod(j,1000)==0)
        disp(['user: ', num2str(user),' ratings: ', num2str(a(2))]);
    end
    if (runmode==3)
        assert(a(2)==6);
    end


    for i=1:a(2)
        b=-100;
        if (runmode<=2)
            [b,num]=fscanf(ff,'%d %d %d %d:%d:%d',6);
            assert(num==6);
        else
            [b,num]=fscanf(ff,'%d %d %d:%d:%d',5);
            assert(num==5);
        end

        if (runmode<=2)
            fprintf(fout, '%d %d %d %d\n', user+1, b(1)+1, b(2), b(3));
        else
            fprintf(fout, '%d %d %d %d\n', user+1, b(1)+1, 1, b(2));
        end
        cnt=cnt+1;
    end
end

assert(cnt==ratings+1);

fclose(ff);
fclose(fout);
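
For readers without Matlab or Octave, the same conversion logic can be sketched in Python. This is only an illustration that mirrors the script above (it is not the Python converter from part 2). The function name convert_track1 and its parameters are made up here, and it assumes the track 1 layout that the fscanf calls above imply: a "user|count" header line followed by whitespace-separated rating lines.

```python
import io


def convert_track1(src, dst, n_users, n_items, n_ratings, has_scores=True):
    """Write the same 4-column lines as the Matlab script above.

    src and dst are open text files. has_scores=False corresponds to
    runmode=3 (test data), where rating lines carry no score.
    """
    # Same header the Matlab script writes.
    dst.write("%%MatrixMarket matrix coordinate real general\n")
    dst.write(f"{n_users} {n_items} {n_ratings}\n")
    written = 0
    for _ in range(n_users):
        # Header line per user: "<user>|<number of ratings>"
        user, count = (int(x) for x in src.readline().strip().split("|"))
        for _ in range(count):
            fields = src.readline().split()
            item = int(fields[0])
            if has_scores:
                # fields: item, score, day, hh:mm:ss (time of day is ignored)
                score, day = int(fields[1]), int(fields[2])
            else:
                # fields: item, day, hh:mm:ss; the script writes 1 as a placeholder
                score, day = 1, int(fields[1])
            # Ids are shifted by one, as in the Matlab script.
            dst.write(f"{user + 1} {item + 1} {score} {day}\n")
            written += 1
    if written != n_ratings:
        raise ValueError(f"expected {n_ratings} ratings, wrote {written}")
    return written


# Tiny made-up sample in the assumed layout (2 users, 3 ratings).
sample_train = (
    "0|2\n"
    "10\t90\t4000\t12:30:00\n"
    "11\t0\t4001\t08:00:00\n"
    "1|1\n"
    "10\t50\t4002\t23:59:59\n"
)
out = io.StringIO()
convert_track1(io.StringIO(sample_train), out, n_users=2, n_items=12, n_ratings=3)
print(out.getvalue())
```

For the real data you would pass open file handles and the counts listed at the top of the Matlab script (nUsers, nItems and the rating count for the chosen run mode).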

Setting up the input files - method 2 - Python
http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab-part-2.html

Running GraphLab

1. cd into graphlabapi/release/demoapps/pmf or graphlabapi/debug/demoapps/pmf (depending on whether you want a debug build).

2. Link the files generated in the input preparation step, named kddcup (training), kddcupe (validation) and kddcupt (test), into your working directory using
ln -s /path/to/track1/kddcup* .
3. Run GraphLab example:
<73|0>bickson@bigbro6:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf kddcup 0 --ncpus=8 --float=true --zero=true --lambda=1 --D=20 --scheduler="round_robin(max_iterations=15)"
Setting run mode ALS_MATRIX
INFO   :pmf.cpp(main:1233): ALS_MATRIX starting

loading data file kddcup
Loading kddcup TRAINING
Matrix size is: 1000990 624961 1
Creating 252800275 edges...
...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
.....loading data file kddcupe
Loading kddcupe VALIDATION
Matrix size is: 1000990 624961 1
Creating 4003960 edges...
.....................loading data file kddcupt
Loading kddcupt TEST
Matrix size is: 1000990 624961 6649
Creating 6005940 edges...
...............................setting regularization weight to 1
PTF_ALS for matrix (1000990, 624961, 6649):252800275.  D=20
pU=1, pV=1, pT=1, muT=1, D=20
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
complete. Obj=4.85456e+11, TRAIN RMSE=61.9728 TEST RMSE=75.7258.
 Entering last iter with 4
442.833) Iter ALS 4  Obj=5.91422e+10, TRAIN RMSE=21.6079 TEST RMSE=22.7737.
Entering last iter with 5
546.91) Iter ALS 5  Obj=5.71611e+10, TRAIN RMSE=21.2429 TEST RMSE=22.6440.
Entering last iter with 6
652.415) Iter ALS 6  Obj=5.63004e+10, TRAIN RMSE=21.0826 TEST RMSE=22.5745.
Entering last iter with 7
758.478) Iter ALS 7  Obj=5.58187e+10, TRAIN RMSE=20.9926 TEST RMSE=22.5299.
...
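
To follow convergence without reading the log by eye, the per-iteration lines can be pulled out with a small script. The sketch below assumes the log format is exactly as printed above ("<seconds>) Iter ALS <n>  Obj=..., TRAIN RMSE=... TEST RMSE=..."); the function name parse_iterations is made up for this example and other run modes may print differently.

```python
import re

# Matches lines such as:
# 442.833) Iter ALS 4  Obj=5.91422e+10, TRAIN RMSE=21.6079 TEST RMSE=22.7737.
ITER_LINE = re.compile(
    r"(?P<seconds>\d+(?:\.\d+)?)\) Iter ALS (?P<iteration>\d+)\s+"
    r"Obj=(?P<obj>\d+(?:\.\d+)?(?:e[+-]?\d+)?), "
    r"TRAIN RMSE=(?P<train>\d+(?:\.\d+)?) "
    r"TEST RMSE=(?P<test>\d+(?:\.\d+)?)"
)


def parse_iterations(log_text):
    """Return one dict per 'Iter ALS' line found in the log text."""
    rows = []
    for line in log_text.splitlines():
        match = ITER_LINE.search(line)
        if match:
            rows.append({
                "seconds": float(match["seconds"]),
                "iteration": int(match["iteration"]),
                "obj": float(match["obj"]),
                "train_rmse": float(match["train"]),
                "test_rmse": float(match["test"]),
            })
    return rows


# Two iterations copied from the log above.
sample_log = """Entering last iter with 4
442.833) Iter ALS 4  Obj=5.91422e+10, TRAIN RMSE=21.6079 TEST RMSE=22.7737.
Entering last iter with 5
546.91) Iter ALS 5  Obj=5.71611e+10, TRAIN RMSE=21.2429 TEST RMSE=22.6440.
"""
for row in parse_iterations(sample_log):
    print(row["iteration"], row["train_rmse"], row["test_rmse"])
```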

4. Explanation of basic runtime flags
    kddcup // input file name. The program also looks for optional inputs such as validation and test data.
                  The convention is that the validation file has the same name with e appended (kddcupe) and the test file has t appended (kddcupt).
    0 // the run mode. 0 stands for alternating least squares.
    --ncpus=XX // number of CPUs to use (should equal the number of cores you have)
    --lambda=0.1 // regularization weight for alternating least squares (this parameter should be tuned to the problem; the example run above uses --lambda=1)
    --float=true // mandatory flag, indicating that the dataset is written in float format (there is also an option for saving the dataset in double format if increased accuracy is desired)
    --zero=true // for KDD Cup this is mandatory, since some of the matrix/tensor values are zero. Without it the program will assert when it encounters a zero matrix value. (The Netflix dataset has no zero values.)
    --scheduler="round_robin(max_iterations=XX)" // the number of iterations to run
    --D=XX // the width of the factorized matrix. A larger D gives a better approximation but a slower running time.
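
When trying several values of lambda or D, it can help to build the command line from parameters instead of retyping it. The sketch below uses only the flags described above; the helper name pmf_command and its defaults (taken from the example run) are made up for illustration.

```python
import shlex


def pmf_command(train_file, run_mode=0, ncpus=8, lam=1.0, d=20, max_iterations=15):
    """Build the argument list for ./pmf using the flags described above."""
    return [
        "./pmf",
        train_file,
        str(run_mode),
        f"--ncpus={ncpus}",
        "--float=true",
        "--zero=true",
        f"--lambda={lam:g}",
        f"--D={d}",
        f"--scheduler=round_robin(max_iterations={max_iterations})",
    ]


cmd = pmf_command("kddcup")
# shlex.join quotes the scheduler argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

From the demoapps/pmf directory the list can then be passed to subprocess.run(cmd, check=True). Passing a list avoids shell quoting problems with the parentheses in the scheduler argument.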

5. More fancy runtime flags:
Other runmodes:
1 //Bayesian matrix factorization
2 //Bayesian tensor factorization
3 //Bayesian tensor factorization, with support for multiple ratings at different times.
4 //Alternating tensor factorization
--loadfactors=true // use the factors saved in a previous run as the initial guess. The factor file is named kddcupXX.out, where XX is D (e.g. kddcup20.out for D=20).
--scaling=100 // group the 6500 time units into groups of 100 (for tensor)
--truncating=2261 // remove unused time slots of ratings (for tensor)
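
To make the effect of the two tensor flags concrete, here is one plausible reading of them as arithmetic on the time index. This is an assumption made for illustration, not code taken from pmf: it treats --truncating as an offset subtracted from the raw time value and --scaling as the bin width. The function name time_bin is hypothetical.

```python
def time_bin(time_unit, scaling=100, truncating=0):
    """Map a raw time unit to a coarser bin (assumed behaviour, not pmf's code)."""
    shifted = time_unit - truncating  # drop the unused leading time slots
    if shifted < 0:
        raise ValueError("time unit falls inside the truncated range")
    return shifted // scaling  # group `scaling` consecutive units into one bin


# With 6649 time units (as reported in the log above) and scaling=100,
# ceiling division gives the number of bins under this assumption.
n_bins = -(-6649 // 100)
print(n_bins, time_bin(99), time_bin(100), time_bin(2361, truncating=2261))
```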

Reading the Output
When GraphLab detects an input file named kddcup, the output is written to the file kddcupt.kdd.out
in the same working directory. This file is in the right format to be submitted to the contest website.

An additional output file named kddcupXX.out will be generated, where XX is the width D of the approximating matrix. Instructions on how to read the output files in Matlab are found at http://www.graphlab.ml.cmu.edu/pmf.html
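
The naming conventions used in this post can be collected in a small helper that reports which files are present after a run. The file names for the base name kddcup come from the text above; the helper names expected_files and missing_files are made up, and applying the same pattern to any other base name is an assumption.

```python
from pathlib import Path


def expected_files(base="kddcup", d=20):
    """File names used in this post for a given input base name and width D."""
    return {
        "training": base,
        "validation": base + "e",
        "test": base + "t",
        "factors": f"{base}{d}.out",
        "predictions": base + "t.kdd.out",
    }


def missing_files(workdir, base="kddcup", d=20):
    """Return the roles whose files do not exist in workdir, sorted by name."""
    root = Path(workdir)
    names = expected_files(base, d)
    return sorted(role for role, name in names.items() if not (root / name).exists())


print(expected_files())
print(missing_files("."))
```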

3 comments:

  1. I got the following message and aborted:

    INFO: pmf.cpp(main:573): PMF/BPTF/ALS/SVD++/SGD/SVD Code written By Danny Bickson, CMU
    Send bug reports and comments to danny.bickson@gmail.com
    WARNING: pmf.cpp(main:575): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
    WARNING: pmf.cpp(main:578): Code compiled with GL_NO_MCMC flag - this mode does not support MCMC methods.
    Setting run mode ALS_MATRIX (Alternating least squares)
    INFO: pmf.cpp(start:374): ALS_MATRIX (Alternating least squares) starting

    loading data file kddcup
    Loading kddcup TRAINING
    Matrix size is: USERS 1000990 MOVIES 624961 TIME BINS 6649
    Creating 252800275 edges (observed ratings)...
    ..pmf: /home/wash/wash/graphlabapi2/demoapps/pmf/io.hpp:533: int read_mult_edges(FILE*, int, testtype, graph_type*, bool) [with edgedata = edge_float]: Assertion `total == (int)e' failed.
    Aborted

  2. Hi Pang,
    Your error indicates that edges are missing from the training data file (the total number of edges read from the file was smaller than the reported 252800275 edges). Please verify that you have followed the instructions at: http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab-part-2.html

    Specifically, verify that the md5sum of the resulting binary input file is:

    <34|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcupe
    aa76bb1d0e6e897e270ed65d021ed1d8 kddcupe
    <35|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcupt
    917599ce7f715890a2705dc04851ac12 kddcupt
    <36|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcup
    345b168a208757b3098c6674b2fb653a kddcup
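
    On systems without the md5sum tool, the same comparison can be sketched in Python with hashlib. The expected digests are the ones listed above; the helper names md5_of and check_inputs are made up for this example.

```python
import hashlib
from pathlib import Path

# Digests quoted above for the three input files.
EXPECTED_MD5 = {
    "kddcupe": "aa76bb1d0e6e897e270ed65d021ed1d8",
    "kddcupt": "917599ce7f715890a2705dc04851ac12",
    "kddcup": "345b168a208757b3098c6674b2fb653a",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_inputs(workdir="."):
    """Return {file name: True/False/None}; None means the file is missing."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(workdir) / name
        results[name] = md5_of(path) == expected if path.exists() else None
    return results
```

    Calling check_inputs() in the pmf working directory returns False for any file whose digest differs, which would point to an incomplete or corrupted conversion.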

  3. hello sir....i got error on running command /graphlabapi$ ./configure --bootstrap


    Could NOT find MPI (missing: MPI_LIBRARY MPI_INCLUDE_PATH)
    -- MPI Not Found! Distributed Executables will not be compiled
    CMake Error at CMakeLists.txt:197 (message):
    Kyoto Cabinet includes not found. Run bootstrap!


    -- Configuring incomplete, errors occurred!
