Large Scale Machine Learning and Other Animals: April 2011

Wednesday, April 27, 2011

Yahoo! KDD Cup using Graphlab - Track 2?

I was delighted to hear from Suhrid Balakrishnan from AT&T Labs, that he is using GraphLab pmf for factorizing a linear model for Yahoo! KDD cup - track 2. Initially I focused only on track1, but it seems that Graphlab is potentially useful for track2 as well.

Overall, this month I am aware of 20 installations of Graphlab pmf code on various research groups all over the world. Specifically I got feedback from University of Austin, University of the Aegean, University of Macedonia, Carnegie Mellon University and AT&T Labs.

I got several valuable inputs regarding his experience with GraphLab I wanted to share and ask if anyone had the same experience.

1) It is recommended to download latest version from mercurial repository. See explanation:
in my previous post http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
I am constantly improving the matrix factorization code and adapting it to the KDD dataset.

2) It is better to install itpp first, since Graphlab auto detects it and it saves much later trouble.

3) Suhrid got an interesting error when saving the factorized matrices U,V. It seems that on his Ubuntu system, factor ordering was somehow reversed.
The following matlab code solved this issue:

Ud=reshape(U(:),size(U,2),size(U,1));
Ud=Ud';

I have opened a google group for discussions and questions concerning KDD usage of GraphLab. Everyone is welcome to join:
* Group name: GraphLab KDD
* Group home page: http://groups.google.com/group/graphlab-kdd
* Group email address graphlab-kdd@googlegroups.com

Sunday, April 24, 2011

Hearst Machine Learning Challenge

One of the main problems of ML researchers is in getting hold of real data.
I got this note from Mehta Krishna, co-founder of Thinkalytics:

Hi,
Check out another datamining competition to keep you busy after the KDD CUP @ www.hearstchallenge.com . $30k in prize money and a lot of fun!
Regards,
Krishna

This is an interesting contest, where the task is to classify user reaction to emails.
Release date for the actual dataset is June first. Submission is Aug. 31.

Saturday, April 16, 2011

Yahoo! KDD CUP using GraphLab - Part 2

Preparing the input

I got from Sanmi Koyejo , a graduate student in University of Austin, Texas, a Python script for converting KDD Yahoo! Cup dataset into GraphLab format. Thanks so much!

It may be preferable to the Matlab script, since it seems that for some users the Matlab script goes out of memory.

I have attached the python script for reading the kdd dataset. The package requires numpy so it can use the file writing method.

Additional information about the conversion procedure is kindly supplied by Yoyo here.

NOTE: For running the resulting files in Graphlab, you will need to have access to a 64 bit machine. (32 bit machine can not load this dataset in its current form).

'''
Created on Apr 16, 2011
Read KDD cup data (Low Memory) to store in format suitable for graphlab pmf 

Requires numpy, uses ndarray.tofile() to write the binary file

Module uses (A LOT) less memory than readKddData.py by reading and writing one user at a time
The tradeoff is that the max user id, item id, days and number of ratings must be known beforehand
This is because pmf expects this input in the first line of the File

Known Issue: Number of test (-f3) items is 624959, although we hard-code 624961.
This restriction comes from a bug(?) in pmf (to be fixed soon)
Ignore this warning if this your only issue

usage: python readKddLM.py --help
python readKddLM.py -i trainIdx.txt -o kddcup -f 1
python readKddLM.py -i validationIdx.txt -o kddcupe -f 2
python readKddLM.py -i testIdx.txt -o kddcupt -f 3
@author: Sanmi Koyejo; sanmi.k@mail.utexas.edu
'''

from optparse import OptionParser
from numpy import array, dtype, amax, maximum, zeros, int_

def readLine(fileHandle, splitter='|'):
    ''' read single line'''
    line = fileHandle.readline()
    if not line: #EOF
        return line # return null and let caller handle it
    return line.rstrip().split(splitter) # split the line and remove newline character

def readChunk(fileHandle, chunkSize, splitter='\t'):
    '''read a pre-specified chunksize'''
    for _ in range(chunkSize):
        line = fileHandle.readline()
        if not line: #EOF
            break
        yield line.rstrip().split(splitter)
        
def readOneUser(fileHandle, testFlag=False, verbose=True):
    ''' reads data for one user and returns rating Matrix'''
    
    while 1:
        line = readLine(fileHandle)
        if not line: break # EOF
        assert(len(line)==2)
        userID = float(line[0])
        nRatings = int(line[1])
        
        rateMat = zeros( (nRatings, 4), dtype=dtype('f4'))
        rateMat[:,0] = userID+1 # user ID
        
        for count, line in enumerate(readChunk(fileHandle, nRatings)):
            # note allow last user to break nratings constraint. All other users should satisfy this
            rateMat[count, 1] = float(line[0])+1 # item ID
            
            if testFlag:
                assert(len(line)==3) # error checking
                rateMat[count, 2] = float(line[1]) # day
                rateMat[count, 3] = 1.0 # rating
            else:
                assert(len(line)==4)
                rateMat[count, 2] = float(line[2]) # day
                rateMat[count, 3] = float(line[1]) # rating
        
        if verbose and nRatings != count+1: 
            '''User had a different number of items than expected 
            will only work for last user, any difference for other users will trigger assert errors'''
            print "Warning: Expected", nRatings, "ratings from user; id:", int(userID), ", read", count+1, "ratings."
            rateMat = rateMat[:count+1,:]
        yield rateMat

def KddDataParser(infile, outfile, size, testFlag, verbose):
    ''' read data for each user and write to binary format'''
    
    # setup storage for max user, item, nratings
    readLen = 0
    readSize = zeros(3, dtype=dtype('i4'))
        
    # open reader and writer file handles
    if verbose: print "opening input file", infile
    readhandle = open(infile, 'rb')
    if verbose: print "opening output file", outfile
    writehandle = open(outfile, 'wb')
    
    # write the size information
    size.tofile(writehandle)
    
    # read for each user
    for count, rateMat in enumerate(readOneUser(readhandle, testFlag, verbose)):
        readSize = maximum(readSize, int_(amax(rateMat[:,:3], axis=0)) ) # max user, max item, max time
        readLen  += rateMat.shape[0]  
        
        rateMat[:,1]+=float(size[0]) # itemID = itemID+maxUser
        rateMat.tofile(writehandle)
        
        if verbose: 
            if count%50000 == 0: print 'read', rateMat.shape[0], 'ratings from user', int(rateMat[0,0])-1
    
    # close reader and writer file handles
    readhandle.close()
    writehandle.close()
    
    if verbose: print "data conversion completed"
    
    return readSize, readLen

def main():
    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-q", "--quiet", action="store_false", dest="verbose", default=True)
    parser.add_option("-i", "--infile", dest="infile",
                      help="input file name", default="smallTrain.txt") # fixme
    parser.add_option("-o", "--outfile", dest="outfile",
                      help="output file name", default="kddcupLM")
    parser.add_option("-f", "--filetype", dest="filetype", type='int',
                      help="training=1, validation=2, test=3", default=1)
    parser.add_option("-u", "--nuser", dest="nuser",
                      help="max ID of users, if not set, defaults to expected KDD size")
    parser.add_option("-m", "--nitem", dest="nitem",
                      help="max ID of items, if not set, defaults to expected KDD size")
    parser.add_option("-t", "--ntime", dest="ntime",
                      help="max number of days, if not set, defaults to expected KDD size")
    parser.add_option("-r", "--nrate", dest="nrate",
                      help="number of ratings, if not set, defaults to expected KDD size")
    (options, args) = parser.parse_args()
    
    # setup nUser/nitem/nTime defaults based on train/valid/test
    nuser = 1000990 if options.nuser== None else options.nuser 
    nitem = 624961 if options.nitem== None else options.nitem
    '''TODO: once pmf is modified, change definition of nitem
    nitem (train, valid)== 624961
    nitem(test)== 624959 
    Should not affect results'''

    if options.filetype==1:  
        ntime = 6645 if options.ntime== None else options.ntime
        nrate = 252800275 if options.nrate== None else options.nrate
        istest= False
    elif options.filetype==2:
        ntime = 6645 if options.ntime== None else options.ntime
        nrate = 4003960 if options.nrate== None else options.nrate
        istest= False
    elif options.filetype==3:
        ntime = 6649 if options.ntime== None else options.ntime
        nrate = 6005940 if options.nrate== None else options.nrate
        istest   = True
    else:
        errorStr = "--filetype input: "+`options.filetype`+". Allowed values are 1, 2, 3"
        raise LookupError(errorStr)
    
    size = array([nuser, nitem, ntime, nrate], dtype=dtype('i4'))
    
    [nUser, nItem, nDays], nRate = KddDataParser(options.infile, options.outfile, size, istest, options.verbose)
    
    print 'input nuser:', nuser, ', max ID of user read:', nUser
    print 'input nitem:', nitem, ', max ID of item read:', nItem
    print 'input ndays:', ntime, ', max day read', nDays
    print 'input nrate:', nrate, ', Number of ratings read:', nRate
    
    if (nuser!=nUser) or (nitem!=nItem) or (ntime!=nDays) or (nrate!=nRate):
        print "Warning: input parameters differ from output parameters,",
        print "graphlab pmf may not run correctly !!!"
    
if __name__ == '__main__':
    main()

Sanity check 0: The downloaded file size from Yahoo! should be:
-rw-r--r-- 1 bickson users 134407201 2011-01-24 07:46 testIdx1.txt
-rw-r--r-- 1 bickson users 5967164350 2011-01-24 10:23 trainIdx1.txt
-rw-r--r-- 1 bickson users 104193447 2011-01-24 10:25 validationIdx1.txt

Sanity check 1: The output file size should be:
$ ls -l kddcup*
-rw-r–r– 1 sil sil 4044804416 2011-06-27 18:18 kddcup
-rw-r–r– 1 sil sil 64063376 2011-06-28 12:51 kddcupe
-rw-r–r– 1 sil sil 96095056 2011-06-28 16:22 kddcupt
(Thanks Yoyo!)

Sanity check 2: you can use the md5sum command to verify creation of inputs.
You should get the following numbers:

<34|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcupe
aa76bb1d0e6e897e270ed65d021ed1d8  kddcupe
<35|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcupt
917599ce7f715890a2705dc04851ac12  kddcupt
<36|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcup
345b168a208757b3098c6674b2fb653a  kddcup

If you got different output, please check carefully that the command line arguments used are as instructed.

Sanity check 3: When running the third script, you should see the output:
> data conversion completed
> input nuser: 1000990 , max ID of user read: 1000990
> input nitem: 624961 , max ID of item read: 624959
> input ndays: 6649 , max day read 6649
> input nrate: 6005940 , Number of ratings read: 6005940
> Warning: input parameters differ from output parameters, graphlab pmf may
> not run correctly !!!

Thursday, April 14, 2011

TeraGrid Symposium

A quick update from Pittsburgh Supercomputing TeraGrid symposium held Today at the Holiday Inn Hotel, Pittsburgh.

At his opening lecture, Prof. Randy Bryant, Dean of Carnegie Mellon Computer Science Department gave a very nice overview of current state of big data processing: http://www.psc.edu/data-analytics/proceedings/BryantSlides.pdf

Furthermore, our system GraphLab, is mentioned as one of the viable solutions when Map-Reduce fails to scale. No need to mention that Prof. Bryant is not related anyhow to our team and this is his unbiased opinion. We did not pay him $$ to talk about GraphLab .. :-)

And on another matter: to all the geeks in the Bay Area, I am coming lecture about GraphLab at geekSessions 2.1 at SF, May 3rd. Come to say Hi!

Friday, April 8, 2011

GraphLab on BlackLight!!

I am super excited to report that GraphLab is up and running on BlackLight, the largest
shared memory computer in the world! With 32TB shared memory and 4,096 cores.

A tutorial for BlackLight is found here

I will soon post some performance results for matrix factorization algorithms, as we make more progress in testing.

Below you can find some instructions on how to install Graphlab on BlackLight, to those of you who are lucky enough to get an account.. :-)

GraphLab Installation

1) Login using ssh into tg-login1.blacklight.psc.teragrid.org

2) Follow the instructions on http://graphlab.org/download.html to obtain
GraphLab code/

3)

module load cmake boost kyotocabinet IT++
cd graphlabapi
./configure --bootstrap --itpp_include_dir=${ITPP_INC} --itpp_static_link_dir=${ITPP_LIB} -D MKL_PATH=${MKL_PATH}
cd release
make -j 8

Note: Thanks to Joel Welling from Pittsburgh Supercomputing Center, who significantly helped simplifying installation as well as improving performance.

Example GraphLab PMF job
Create a file named kddcup.job with the following content:

#!/bin/csh
#PBS -l ncpus=16
#ncpus must be a multiple of 16
#PBS -l walltime=4:00:00                  
#PBS -j oe
#PBS -q batch
#PBS -m bea
set echo

ja

#move to my $SCRATCH directory
cd $SCRATCH

#copy executable to $SCRATCH
cp $HOME/graphlabapi/release/demoapps/pmf/pmf .

#run my executable
omplace -nt $PBS_NCPUS ./pmf kddcup 0 --scheduler="round_robin(max_iterations=20)" --float=true --zero=true --lambda=1 --D=150 --ncpus=$PBS_NCPUS --aggregatevalidation=true
cp $SCRATCH/kddcupt.kdd.out $HOME/$PBS_JOBID.kdd.out

ja -chlst

Submit this job using the command

qsub kddcup.job

Check the status of the job using the command

qstat

Check remaining qouta:

bickson@tg-login1:~> xbanner


PSC Grantnumber: DMS110004P Teragrid Grantnumber: DMS110015
P.I. Name: Carlos Guestrin
 Resource        = BLACKLIGHT
 Charge ID       = ms3bdkp
 Start Date      = 02/10/2011
 Expiration Date = 02/10/2012
 Allocation      = 50000.00
 Remaining       = 48541.23
 Last Job        = 06/16/2011

Last Accounting Update: 06/16/2011

Tuesday, April 5, 2011

Yahoo! KDD Cup using Graphlab

I got the following question from an avid reader of this blog:

I was wondering whether you could give me some directions on how to
setup GraphLab to run on KDDCUP, especially with regards to the data format of the input files for training, validation and testing (i.e. creation of predictions for submission).

Many thanks,
Nicholas

==========================================

Nicholas Ampazis
Assistant Professor
Director, Intelligent Data Exploration and Analysis Laboratory (IDEAL)
Department of Financial and Management Engineering,
University of the Aegean
41 Koudouriotou street, Chios, 82100, Greece

I think this may interest some other people so I am posting the answer here. Currently GraphLab was tested with matrix factorization, but soon I will handle also tensor factorization (divides ranking of different times to groups) and also Monte Carlo Sampling on top of it. So there is a wide range of algorithms you can actually try out after you install GraphLab.

Installation

The best way to start is to download the code from mercurial repository - there is the latest version of the matrix factorization code. Source is found here: http://graphlab.org/download.html

Note that for the matrix factorization you will also need to install itpp (which relies on BLAS/LaPaCK).
Installation instructions for Linux 32 bit are here:http://bickson.blogspot.com/2011/06/graphlab-pmf-on-32-bit-linux.html and for Linux 64 bit are here: http://bickson.blogspot.com/2011/02/installing-blaslapackitpp-on-amaon-ec2.html

It is always better to install itpp first, before executing the ./configure script, that way GraphLab will automatically detect itpp and installation becomes simpler.

After downloading GraphLab you should configure using;

./configure --bootstrap

This should install cmake and boost if they are missing on your system.

Once you compile successfully, it means that the application code of the matrix factorization code is compiled as well, it will be found in the directory demoapps/pmf

Setting up the input files - method 1 - using Matlab
1) Download the file save_c_gl4a.m to your local directory using:
wget http://www.graphlab.ml.cmu.edu/save_c_gl4a.m
2) Download the KDD Yahoo! Cup files (track1 dataset) from:
http://kddcup.yahoo.com/datasets.php
3) Use the following Matlab script to convert the text dataset in binary graphlab format:
(It may take a couple of hours to finish depends on your machine..)
Note that you need to run the script 3 times - for runmode=1 (training data)
runmode= 2 (validation data), runmode=3 (test data).

%Script for converting KDD CUP 2011 data, written by Danny Bickson, CMU
%Can be round in matlab or octave
nUsers=1000990;
nItems=624961;
nRatings=262810175;
nTrainRatings=252800275;
nProbeRatings=4003960;
nTestRatings=6005940;


runmode=3;

filname='';
outfile='';
ratings=0;
switch runmode
    case 1
        disp('converting kdd cup 2011 training data - track 1');
        filename='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/trainIdx1.txt';
        ratings=nTrainRatings;
        outfile='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/kddcup';
    case 2
        disp('converting kdd cup 2011 validation data - track 1');
        filename='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/validationIdx1.txt';
        ratings=nProbeRatings;
        outfile='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/kddcupe';
    case 3
        disp('converting kdd cup 2011 test data - track 1');
        filename='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/testIdx1.txt';
        ratings=nTestRatings;
        outfile='/mnt/bigbrofs/usr7/bickson/kddcup/track1/track1/kddcupt';
end

ff=fopen(filename,'r');
if (ff < 0)
 error('failed to open input file for reading');
end
fout = fopen(outfile,'w');
if (fout < 0)
 error('failed to open file for writing');
end
%write output file matrix market format header
fprintf(fout, '%%%%MatrixMarket matrix coordinate real general\n');
fprintf(fout,'%d %d %d\n', nUsers, nItems, ratings);
cnt=1;
for j=1:nUsers
    [a,num]=fscanf(ff,'%d|%d',2);
    assert(num==2);

    user=a(1);
    if (mod(j,1000)==0)
        disp(['user: ', num2str(user),' ratings: ', num2str(a(2))]);
    end
    if (runmode==3)
        assert(a(2)==6);
    end


    for i=1:a(2)
        b=-100;
        if (runmode<=2)
            [b,num]=fscanf(ff,'%d %d %d %d:%d:%d',6);
            assert(num==6);
        else
            [b,num]=fscanf(ff,'%d %d %d:%d:%d',5);
            assert(num==5);
        end

        if (runmode<=2)
            fprintf(fout, '%d %d %d %d\n', user+1, b(1)+1, b(2), b(3));
        else
            fprintf(fout, '%d %d %d %d\n', user+1, b(1)+1, 1, b(2));
        end
        cnt=cnt+1;
    end
end

assert(cnt==ratings+1);

fclose(fout);

Setting up the input files - method 2 - Python
http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab-part-2.html

Running GraphLab

1. cd into graphlabapi/release/demoapps/pmf or graphlabapi/debug/demoapss/pmf
(depends if you want to debug or not).

2. Link the generate files from preparing input file, named kddcup (training), kddcupe (validation) and kddcupt(test) into your working directory using

ln -s /path/to/track1/kddcup* .

3. Run GraphLab example:

<73|0>bickson@bigbro6:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf kddcup 0 --ncpus=8 --float=true --zero=true --lambda=1 --D=20 --scheduler="round_robin(max_iterations=15)"
Setting run mode ALS_MATRIX
INFO   :pmf.cpp(main:1233): ALS_MATRIX starting

loading data file kddcup
Loading kddcup TRAINING
Matrix size is: 1000990 624961 1
Creating 252800275 edges...
...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
.....loading data file kddcupe
Loading kddcupe VALIDATION
Matrix size is: 1000990 624961 1
Creating 4003960 edges...
.....................loading data file kddcupt
Loading kddcupt TEST
Matrix size is: 1000990 624961 6649
Creating 6005940 edges...
...............................setting regularization weight to 1
PTF_ALS for matrix (1000990, 624961, 6649):252800275.  D=20
pU=1, pV=1, pT=1, muT=1, D=20
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
complete. Obj=4.85456e+11, TRAIN RMSE=61.9728 TEST RMSE=75.7258.
 Entering last iter with 4
442.833) Iter ALS 4  Obj=5.91422e+10, TRAIN RMSE=21.6079 TEST RMSE=22.7737.
Entering last iter with 5
546.91) Iter ALS 5  Obj=5.71611e+10, TRAIN RMSE=21.2429 TEST RMSE=22.6440.
Entering last iter with 6
652.415) Iter ALS 6  Obj=5.63004e+10, TRAIN RMSE=21.0826 TEST RMSE=22.5745.
Entering last iter with 7
758.478) Iter ALS 7  Obj=5.58187e+10, TRAIN RMSE=20.9926 TEST RMSE=22.5299.
...

4. Explanation of basic runtime flags

    kddcup // input file name. Program optionally search also for optional inputs like validation and test data.
                  The convention is that validation data has the same file name ending with e (kddcupe) and test                   data ending with t (kddcupt).
    0 // the run mode. 0 stands for alternating least squares.
    --ncpus=XX // number of CPU used (should be equal to the number of cores you have)
    --lambda=0.1 //regularization weight for alternating least squares (this prameter should be fine tuned based on the problem
    --float=true //mandatory flag, indicating dataset is written in float format (yes, there is also an option for saving the dataset in double format if increased accuracy is desired)
    --zero=true //for KDDcup, this is mandatory, since some of the matrix/tensor values are zero. Without it the program will assert when there is zero matrix value. (In Netflix dataset there are no zero values).
    --scheduler="round_robin(max_iterations=XX)" the number of iterations to run
  --D=XX  // the width of the factorized matrix. As D is larger we get a better
approximation but slower running time.

5. More fancy runtime flags:

Other runmodes:
1 //Bayesian matrix factorization
2 //Bayesian tensor factorization
3 //Bayesian tensor factorization, supports for multiple ratings in different times.
4 //Alternating tensor factorization
--loadfactors=true // start initial guess from factors saved in previous run. Factor file name will be kddcup20.out where D=20 etc.
--scaling=100 //group the 6500 time units into groups of 100 (for tensor)
--truncating=2261 // remove unused time slots of ratings (for tensor)

Reading the Output
When GraphLab detects a file name of kddcup, the output will be written to the file kddcupt.kdd.out
in the same working directory. This file name has the right format to be submitted into the contest website.

Additional output file of the name kddcupXX.out will be generated, where X is the width D of the approximating matrix. Instruction on how to read output files in Matlab are found on http://www.graphlab.ml.cmu.edu/pmf.html

Large Scale Machine Learning and Other Animals