Large Scale Machine Learning and Other Animals: KDD Cup

Showing posts with label KDD Cup. Show all posts

Monday, April 23, 2012

KDD CUP Update

As the contest is heating up I am getting more and more interesting feedback. As you can see in the above image, thanks to JustinYan, we are now at place 5 in track 2.

I have significantly extended the parser library in GraphLab version 2 to create useful tools for anyone who wants to compete. Here are some of the currently available tools:

kdd_train_data_parser - clean the training data from opposite signals, and split it into meaningful validation and training sets.
kdd_test_splitter - split the test data into two test files, as instructed by the contest
kddcup_output_builder - take two vector prediction files (output of graphlab or any other software like Vowpal Wabbit), find the highest prediction files, and merge them back into a zip submission file containing the solution.
kdd_usr_itm_feature_parsers - parse user and item features and translate them to GraphLab or VW input formats.
kdd_linear_model_builder - build a linear feature model for the training/validation that can be used for classification.

Since we don't have much time to perfect everything before the code release, I would love to get any help/feedback/suggestions from you!!

Some more detailed instructions:

GraphLab v2 setup
1) install graphlab v2, USING MERCURIAL option as instructed here: http://graphlab.org/download.html
2) After checking out, you should do:
hg pull
hg update v2
./configure
cd release
make

Example runs:
More details are found here.

Let me know how it went!

Friday, April 20, 2012

KDD CUP 2012 Track 1 using GraphLab..

As some of you may recall, last year we had some great excitement with KDD CUP 2011, where we won the 5th place in track1. Additionally, the group who won the contest was using GraphLab as well. This year I really wanted to take a look again at this interesting and high visibility contest, but I did not find the time for it up to now. I got some interesting questions from students at Georgia Tech ML course taught by Le Song, were the course task is to compete in the KDD CUP 2012.

In this blog post I will share some of my approaches of dealing with KDD CUP 2012 track 1. I must warn my readers that it is very preliminary results, but I will be glad to get any inputs along the way.
Acknowledgement: my close collaborator, JustinYan from the Chinese National Academy of Science
is to credit most of the below results. By the way he is looking for summer internship in the US. Contact him if you are looking for intern. He is based in China - he will need a US intern visa.

First try - matrix factorization using alternating least squares.

Goal: build a linear model for the user/item matrix and predict the missing values (the test data).

Preparing the training data:

Some explanations and motivation. Track1 does not have a separate validation dataset, as was for example in Netflix and last year KDD CUP. It is very useful to have a validation dataset, or else one may easily overfit over the training dataset. Following JustinYan advice, I have separated the data based on its time properties. According to the unix timestamps of the data,
ratings are distributed from day 283 to day 314. (int date is in [131834878,1321027199] , 131834878 is 2011.10.12 00:00:00 (unix time +8) and 1321027199 is 2011.11.11(unix time +8). ). We take the last 5 days (310-314) as the validation and the first days (283-309) as the training set.

And here is what I did to convert the data for Graphlab v1 matrix factorization format:
1) download and unzip the file track1.zip.
2) I am using graphlab matrix factorization version which assumes there is a single rating between each user and data item. In rec_log_train.txt there are many items which are ranked several times. (Later we can add support to multiple ratings of the same item). Currently I am also avoiding the temporal aspect
we can later add that. Filter out duplicate ratings using the command:
sort -n -k 1,1 -k 2,2 rec_log_train.txt > rec_log_train.sorted.txt
3) Use graphlab version 2 kdd train parser to filter out duplicate ratings:
<468|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kdd_train_data_parser rec_log_train.sorted.txt
WARNING: kdd_train_data_parser.cpp(main:224): Eigen detected. (This is actually good news!)
INFO: kdd_train_data_parser.cpp(main:226): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Currently implemented parsers are: Call data records, document tokens
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(gzip_in_file:800): Opening input file: rec_log_train.sorted.txt
INFO: io.hpp(gzip_out_file:831): Opening output file rec_log_train.sorted.txt.out
INFO: kdd_train_data_parser.cpp(operator():181): Finished parsing total of 73209277 lines in file rec_log_train.sorted.txt
INFO: kdd_train_data_parser.cpp(operator():197): Finished exporting a total of 39752418 ratings
Finished in 103.088

Explanation: whenever there are multiple instances of the same user-item pair, with the same ratings, they are pruned and only one is left. Whenever there are two or more instances of the same user-item pair with conflicting ratings, namely both -1 and 1, they are ignored and removed from the training data.

Preparing the test data:

The test data contains the following format:
[user ] [ item ] [ empty rating ] [ time ]
It has unique 34910937 user/item pairs. The test data is explained here:

The rec_log_test.txt file contains data for both the public leaderboard set and final evaluation set. The file is sorted temporally. Any data with a timestamp < 1321891200 is used for the public leaderboard set, and >= 1321891200 is used for the final evaluation set. The goal is to predict which items that were recommended to the user were followed by the user, within each set.

In other words, the test data can be divided into two. In each set we should recommend the highest ranked items for this set. testA has 19,349,608 unique users and testB has 15,561,329 unique users.

Here are some more detailed instructions:
5) Download the test data file: rec_log_test.txt.zip
6) Unzip the file using
# unzip rec_log_test.txt.zip
Use GraphLab v2 kdd_test_splitter to prepare the test:
./kdd_test_splitter rec_log_test.txt --ncpus=1
WARNING: kdd_test_splitter.cpp(main:159): Eigen detected. (This is actually good news!)
INFO: kdd_test_splitter.cpp(main:161): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt
Rows: 2421057
Cols: 2421057
Nonzeros: 34910937
Constructing all vertices.
Adding edges.
Graph size: 34910937
Start finalize..
INFO: kdd_test_splitter.cpp(operator():132): Finished going over 19349608 rating for 1196411 unique users
INFO: kdd_test_splitter.cpp(operator():133): Successfully created output file: testA.sorted
INFO: io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt
Rows: 2421057
Cols: 2421057
Nonzeros: 34910937
Constructing all vertices.
Adding edges.
Graph size: 34910937
Start finalize..
INFO: kdd_test_splitter.cpp(operator():132): Finished going over 15561329 rating for 1196411 unique users
INFO: kdd_test_splitter.cpp(operator():133): Successfully created output file: testB.sorted
Finished in 164.793
INFO: kdd_test_splitter.cpp(main:184): Finished exporting test data!

The output of this procedure are split test files, testA.sorted holding earlier ratings and testB.sorted holding later ratings. Ratings are sorted first by the user and then by the item numbers.

Understanding the cost function

Track one using AP@3 (average precision for 3 items) as its error measure. Here is pseudo code I got from JustinYan which computes it:

private static double ProcessList(List<Sample> labelList1)
{
//labelList is an ordered list of the top 3 computed predictions
List<Sample> labelList = labelList1.OrderByDescending(x => x.Prediction).Take(3).ToList();
//realList is a list of ratings that where equal to one
List<Sample> realList = labelList1.Where(x => x.RealValue > 0).ToList();
//realClickCount is the number of items a user clicked on
double realClickCount =realList.Count();
double ap = 0;
int count = 0;
//if user clicked on nothing, we get AP of zero.
if (realClickCount == 0)
return 0;
//for each of the recorded ratings
for (int i = 0; i < labelList.Count; i++)
{
//if one of the top 3 items we recommended, had actually been
//selected increase our precision.
if (labelList[i].RealValue > 0)
{
ap += ++count * 1.0 / (i + 1);
}
}
return ap / (realClickCount);
}

GraphLab now supports AP@3 cost function calculation using the command line switch --calc_ap=true

Running GraphLab SGD

8) Rename the test file to be recognized in graphlab using:

ln -s testA.sorted rec_log_train.sorted.out.txtt

6) Run bias-SGD using the command:

./pmf rec_log_train.sorted.txt.out 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --sgd_step_dec=0.98 --sgd_gamma=0.005 --sgd_lambda=0.005 --exporttest=true --calc_ap=true --reduce_mem_consumption=true --exportlinearmodel=false
INFO: pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO: pmf.cpp(start:291): Bias-SGD starting

loading data file rec_log_train.sorted.txt.out
Loading Matrix Market file rec_log_train.sorted.txt.out TRAINING
Loading rec_log_train.sorted.txt.out TRAINING
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556
loading data file rec_log_train.sorted.txt.oute
Loading Matrix Market file rec_log_train.sorted.txt.oute VALIDATION
Loading rec_log_train.sorted.txt.oute VALIDATION
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863
loading data file rec_log_train.sorted.txt.outt
Loading Matrix Market file rec_log_train.sorted.txt.outt TEST
Loading rec_log_train.sorted.txt.outt TEST
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 19349608
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.

Bias-SGD for matrix (2421057, 2421057, 1):33589556. D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO: pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO: asynchronous_engine.hpp(run:111): Worker 0 started.

INFO: asynchronous_engine.hpp(run:111): Worker 1 started.

INFO: asynchronous_engine.hpp(run:111): Worker 2 started.

INFO: asynchronous_engine.hpp(run:111): Worker 3 started.

INFO: asynchronous_engine.hpp(run:111): Worker 4 started.

INFO: asynchronous_engine.hpp(run:111): Worker 5 started.

INFO: asynchronous_engine.hpp(run:111): Worker 6 started.

INFO: asynchronous_engine.hpp(run:111): Worker 7 started.

Entering last iter with 1
21.4589) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.328273 AP@3 for validation: 0.292205
Entering last iter with 2
68.1817) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332562 AP@3 for validation: 0.308924
Entering last iter with 3
114.985) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332799 AP@3 for validation: 0.312879
Entering last iter with 4
162.032) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.333644 AP@3 for validation: 0.314124
Entering last iter with 5
219.16) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336898 AP@3 for validation: 0.319216
Entering last iter with 6
289.758) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340856 AP@3 for validation: 0.321952
Entering last iter with 7
360.691) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344672 AP@3 for validation: 0.322379
Entering last iter with 8
431.754) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.352534 AP@3 for validation: 0.325194
INFO: asynchronous_engine.hpp(run:119): Worker 5 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 2 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 1 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 6 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 7 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 4 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 3 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 0 finished.

Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 466.943851 seconds
INFO: io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.740713
INFO: read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out.test.predictions vector size: 19349608
INFO: read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m
Performance counters are: 0) EDGE_TRAVERSAL, 1421.59
Performance counters are: 2) CALC_RMSE_Q, 0.296477
Performance counters are: 6) CALC_OBJ, 1.17573

As you can see, each iteration of bias-SGD takes around 60 seconds using 8 cores machine. So it is enough to have a single good multicore machine. The output is the file: rec_log_train2.txt.test.predictions
Let's look at the output:

head rec_log_train2.txt.test.predictions
%%MatrixMarket matrix array real general
%%output predictions for test data
19349608 1
-1
-0.7164881229401
0.07807982712984
-0.9745061397552
-0.7039368152618
-0.6436884999275
-1
-0.6042827963829

For each user/item pair in the testA.sorted data, we get a prediction.

Now we store the results for testA:

mv rec_log_train2.txt.test.predictions testA.predictions

And again we run graphlab, this time with testB:

./pmf rec_log_train.sorted.txt.out2 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --lambda=0.01 --sgd_gamma=0.005 --sgd_lambda=0.005 --sgd_step_dec=0.98 --exporttest=true --calc_ap=true
INFO: pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO: pmf.cpp(start:291): Bias-SGD starting

loading data file rec_log_train.sorted.txt.out2
Loading Matrix Market file rec_log_train.sorted.txt.out2 TRAINING
Loading rec_log_train.sorted.txt.out2 TRAINING
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556
loading data file rec_log_train.sorted.txt.out2e
Loading Matrix Market file rec_log_train.sorted.txt.out2e VALIDATION
Loading rec_log_train.sorted.txt.out2e VALIDATION
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863
loading data file rec_log_train.sorted.txt.out2t
Loading Matrix Market file rec_log_train.sorted.txt.out2t TEST
Loading rec_log_train.sorted.txt.out2t TEST
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 15561329
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.

INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.

Bias-SGD for matrix (2421057, 2421057, 1):33589556. D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO: pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO: asynchronous_engine.hpp(run:111): Worker 1 started.

INFO: asynchronous_engine.hpp(run:111): Worker 0 started.

INFO: asynchronous_engine.hpp(run:111): Worker 2 started.

INFO: asynchronous_engine.hpp(run:111): Worker 3 started.

INFO: asynchronous_engine.hpp(run:111): Worker 4 started.

INFO: asynchronous_engine.hpp(run:111): Worker 5 started.

INFO: asynchronous_engine.hpp(run:111): Worker 6 started.

INFO: asynchronous_engine.hpp(run:111): Worker 7 started.

Entering last iter with 1
42.9776) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.327499 AP@3 for validation: 0.291468
Entering last iter with 2
113.61) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.331413 AP@3 for validation: 0.308851
Entering last iter with 3
184.123) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332523 AP@3 for validation: 0.312983
Entering last iter with 4
251.239) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.334125 AP@3 for validation: 0.314801
Entering last iter with 5
307.7) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336783 AP@3 for validation: 0.318807
Entering last iter with 6
377.682) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340606 AP@3 for validation: 0.320896
Entering last iter with 7
447.13) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344897 AP@3 for validation: 0.322647
Entering last iter with 8
517.358) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.351736 AP@3 for validation: 0.324784
INFO: asynchronous_engine.hpp(run:119): Worker 2 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 0 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 5 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 6 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 1 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 4 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 7 finished.

INFO: asynchronous_engine.hpp(run:119): Worker 3 finished.

Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 552.339237 seconds
INFO: io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.733542
INFO: read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out2.test.predictions vector size: 15561329
INFO: read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m
Performance counters are: 0) EDGE_TRAVERSAL, 2109.31
Performance counters are: 2) CALC_RMSE_Q, 0.264258
Performance counters are: 6) CALC_OBJ, 1.1756

Final parsing and compilation of submission

Use GraphLab v2, for creating the results in KDD cup format. This code go over the recommendation and builds a list of the top 3 recommendations out of the test items.

Submission file contains exactly 1,340,127 users (713,611 from testA, and 626,516 from testB).

There are four inputs to the output creation. The first one is the sorted testA we created previously using matlab. The second is the prediction file which is output from GraphLab.

<309|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kddcup_output_builder testA.sorted testA.predictions testB.sorted testB.predictions --ncpus=1
WARNING: kddcup_output_builder.cpp(main:165): Eigen detected. (This is actually good news!)
INFO: kddcup_output_builder.cpp(main:167): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(load_matrixmarket_graph:325): Reading matrix market file: testB.sorted
Rows: 2421057
Cols: 2421057
Nonzeros: 15561329
Constructing all vertices.
Adding edges.
Graph size: 15561329
INFO: io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testB.predictions
INFO: kddcup_output_builder.cpp(operator():131): Finished going over 15561329 rating for 626516 unique users
INFO: kddcup_output_builder.cpp(operator():132): Successfully created output file: testB.sorted.out
INFO: io.hpp(load_matrixmarket_graph:325): Reading matrix market file: testA.sorted
Rows: 2421057
Cols: 2421057
Nonzeros: 19349608
Constructing all vertices.
Adding edges.
Graph size: 19349608
INFO: io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testA.predictions
INFO: kddcup_output_builder.cpp(operator():131): Finished going over 19349608 rating for 713611 unique users
INFO: kddcup_output_builder.cpp(operator():132): Successfully created output file: testA.sorted.out
Finished in 82.1905
INFO: kddcup_output_builder.cpp(main:189): Merging the two files together
updating: testA.sorted.out (deflated 72%)
INFO: kddcup_output_builder.cpp(main:208): Successfully created zip file: submission.zip
INFO: kddcup_output_builder.cpp(main:214): removed temporary files
INFO: kddcup_output_builder.cpp(main:216): Created successfully zip file: submission.zip

Preliminary results

Here is a very preliminary report on the results I was able to obtain so far:

ALS - 0.254 on the leaderboard

SVD++ - 0.29 on the leaderboard (Thanks for my collaborators in Hungarian National Academy of Science, specifically Tamas Kiss)

bias-SVD 0.330 on the leaderboard (After some optimization JustinYan got 0.34 with it).
logistic regression - 0.336

Disclaimer

As the code is evolving, there may be some bugs/inaccuracies in the code. However, I did not want to wait for the last minute to get a perfect solution that no one will be able to use. Please ping me if you find any bugs or have any suggestions.

Wednesday, April 18, 2012

KDD CUP 2011 reflection

I just learned from my collaborator JustinYan from the chinese national academy of science,about the paper A Linear Ensemble of Individual and Blended Models for Music Rating Prediction" was published by the Taiwanese team from the National Taiwan University, which won KDD CUP 2011 both tracks one and two.

What is specially interesting about this paper IMHO, is that GraphLab matrix factorization library was used for computing two models out of their ensemble.

That raises the number of teams who utilized GraphLab and won high places to two in this contest.

Wednesday, December 21, 2011

GraphLab on KDD CUP - Pittsburgh Supercomputing Center News

Here is a nice article about our participation in this year's ACM KDD CUP.

Wednesday, August 17, 2011

Efficient multicore collaborative filtering

Recently we got the 5th place in ACM Yahoo! KDD CUP (track1). We report out solution method on arxiv. This is a joint work with Yucheng Low from CMU, and Yao Wu, Qiang Yan and Qing Yang from the Chinese National Academy of Science.

In our opinion, there were two main challenges in this contest:
1) The magnitude of the data: the dataset has about 260M music ratings
in the range 0-> 100. Each rating is composed of user, music item, time of rating and the numeric rating. Time of ratings is between 2249 - 6649 (this is from the top of my head - minor inaccuracies possible).
2) Special characteristics of the data: data includes some taxonomy information which relates music tracks to albums, artists and genere.

Here is some more information about the data from the contest website:
The dataset is split into three subsets:

* - Train data: in the file trainIdx1.txt
* - Validation data: in the file validationIdx1.txt
* - Test data: in the file testIdx1.txt

For each subset, user rating data is grouped by user. First line for a user is formatted as:

* <usedid>|<#UserRatings>\n

Each of the next <#UserRatings> lines describes a single rating by , sorted in chronological order. Rating line format is:

* <itemid>\t<score>\t<date\t><time>\n

The scores are integers lying between 0 and 100. All user id's and item id's are consecutive integers, both starting at zero. Dates are integers describing number of days elapsed since an undisclosed date.

An item has at least 20 ratings in the total dataset (including train, validation, and test sets).

Each user has at least 10 ratings in the training data, which were given by the user earlier than his validation ratings. Then, each user has exactly four ratings in the validation data, which come earlier in time than the ratings by the same user in the test data. Finally, the test data holds the last 6 ratings of each user.

For the test subset, we withheld the scores. The contestants are asked to provide predictions for these test scores. The predictions should be arranged by the given test set order. Each prediction is quantized to one of the 256 evenly spaced numbers between 0 and 100 (100*0/255, 100*1/255,...,100*254/255, 100*255/255). Thus, a single unsigned byte would be dedicated for a predicted value, and predicting the 6,005,940 ratings of the test set would require 6,005,940 bytes. (We provide C and Python programs for converting into well-formatted prediction files.)

The evaluation criterion is the root mean squared error (RMSE) between predicted ratings and true ones.

The test subset is further split into two equally sized subsets, called Test1 and Test2. Test1 is used to rank contestants on the public Leaderboard and Test2 is used for deciding the winners of the contest. The split between Test1 and Test2 is not disclosed.

The dataset statistics are as follows:
#Users #Items #Ratings #TrainRatings #ValidationRatings #TestRatings
1,000,990 624,961 262,810,175 252,800,275 4,003,960 6,005,940

Besides of the rating itself, there is taxonomy information as well. In other words:
* Each music item has an album.
* Each album has an artist.
* Each album has a genre.
Here is an excerpt from the contest website:
A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist and associated genres. Similarly we have artist and genre annotation for the albums. Both users and items (tracks, albums, artists and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed.

Our solution method:
* We used GraphLab for rapid prototyping of multiple collaborative filtering algorithms and
for fine tuning their parameters. Multiple algorithms where blended together using the ensemble method (you are welcome to read more in our paper!).
* Most linear models take into account only the information about user, music item and rating. So the user-> item matrix is decomposed to build a linear model. However, if we know that a user likes Madona, he may like different Madona albums/or songs than the one he rated. So it is good to have this relation in the model. The same with user who likes Jazz, he may well like other albums of Jazz. Since the rating is for a single song, if we do not relate the artist to the song, we miss this information. We propose a novel method called MFITR (matrix factorization item taxonomy regularization)
1) Add feature vector for artist, that mean that if a user like a certain artist it will add to the ratings of other songs of this artist.
2) Add a linear connection between the artist and the items, or else if they will be independent, thus making section 1) not useful. So we added weights that links between user, artist and album. The weights force that if user rated one album or artist higher, it will result in other items from the same artist or album to be higher as well.
3) We did not use genere in our model, although in principal genre can be added using the same technique.

We believe that Yahoo KDD CUP contest data is a great academic resource for both academic researchers as well as high-tech companies who perform collaborative filtering tasks. This is why we release a large portion of our source code as part of GraphLab's collaborative filtering library. Python/Matlab scripts are available for converting the data between KDD format into GraphLab's format.

Recently, we learned from Noam Koenigstein, a Tel-Aviv PhD candidate, about a related paper that was submitted for publication: Yahoo! Music Recommendations: Modeling Music Ratings with Temporal Dynamics and Item Taxonomy. By Gideon Dror, Noam Koenigstein and Yehuda Koren: In ACM Conference on Recommender Systems (RecSys), 2011. While obtained independently, our solution has some similarities to their proposed solution by adding a feature vectors for artist and albums. Their paper have also nice overview of KDD data characteristics in section 3.

I got another interesting resource from Yao Wu: InnerSpace group website: http://apex.sjtu.edu.cn/apex_wiki/svdfeature - they won the 3rd place this year.

And here is the workshop website including all presentations.

Friday, June 17, 2011

KDD Cup progress update - We are in the 5th place!

I am proud to report that yesterday JustinYan, a graduate student from the Institute of Automation, Chinese Academy of Sciences, a member of the LeBuSiShu team, suggested that I will join their team.
The reason is that LeBuSiShu is using GraphLab ALS solution as one of the components in their final result. Currently LeBuSiShu is the 5th group in the leaderboard (out of 100 groups!) with RMSE = 22.2820.

Anyway, of course I was honored to accept as well as quite excited! Although when writing the GraphLab collaborative filtering code of matrix factorization I had no competition in mind, it is always nice to know that someone finds your solution useful.

Besides of JustinYun, another nice team member in the LeBuSiShu group is: Yao Wu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academic of Sciences.

A quick update 6/23: We are actually now at the 4th place! The Chinese group is an amazing team, they simply work around the clock. I myself am running some fancy MCMC methods for slightly improving prediction, but the Chinese team is responsible of 99.9% of the progress.

Thursday, June 9, 2011

SVD (Lanczos algorithm) in GraphLab

I have now implemented the Lanczos algorithm in GraphLab, as part of
GraphLab's collaborative filtering library.

Here are some performance results using the Netflix data (100M non-zeros):

On a 16 core machine (2.6Ghz Quad-Core AMD Opteron x 2), we get a speedup of about 9. Each Iteration over Netflix data takes 10 seconds using 16 cores (iteration is composed in multiplying by the matrix A and then multiplying by A^T).

Using KDD Cup data (track1), we have 252M non-zeros, and each iteration takes 31.4 seconds on the same 16 cores machine.

To give a concrete performance comparison, I have also computed svds() in Matlab on Netflix data. For extracting 2 eigenvalues it takes 126.2 seconds in Matlab, while in GraphLab the same computation (using 16 cores) takes 21.9 seconds. So we are about x6 times faster than Matlab. (Note that there may be a potential difference in the implemented algorithm so this comparison should be taken with a grain of salt).

GraphLab PMF on 32 bit Ubuntu

NOTE: This code is deprecated. Please take a look here for GraphChi (multicore implementation): http://bickson.blogspot.co.il/2012/12/collaborative-filtering-with-graphchi.html
Or here for GraphLab (distributed implementation): http://graphlab.org/toolkits/collaborative-filtering/
As GraphLab collaborative filtering library is growing, we are adding new algorithms and getting more and more users (around 110 unique installations originated from this blog in the last couple of months). Yesterday I had an interesting email exchange with Timmy Wilson, our man in Cleveland Ohio. Timmy is working on clustering in social networks.

Timmy tried to install Graphlab on linode, but encountered some problems. Here is what he writes:

Installing Graphlab on Ubuntu 10.04 (32bit)
Or, 'I wish i had installed 64bit Ubuntu'

I followed Danny's instructions here:
http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html

At the end of this email, attached are all the commands I used.

I can't say for sure, but i assume everything would have went without
a hitch if i were running the 64bit version of Ubuntu. When trying to compile GraphLab I got the following error:

CMakeFiles/disk_graph_test.cxxtest.dir/disk_graph_test.cxx.o: In
function `graphlab::atomic::inc()':
/home/timmyt/graphlabapi/src/graphlab/parallel/atomic.hpp:39:
undefined reference to `__sync_add_and_fetch_8'

This is the output of my
gcc version:

$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.4.3-4ubuntu5'
--with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --enable-multiarch --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4
--program-suffix=-4.4 --enable-nls --enable-clocale=gnu
--enable-libstdcxx-debug --enable-plugin --enable-objc-gc
--enable-targets=all --disable-werror --with-arch-32=i486
--with-tune=generic --enable-checking=release --build=i486-linux-gnu
--host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)

Luckily, the graphlab guys take this stuff pretty seriously. "There
never was a compilation problem we failed to solve the same day...",
Danny told me.

Here Yucheng Low describes my problem:

> The problems you are encountering are somewhat complicated.
> I suspect you have a strange mismatch between what gcc thinks your system is
> and what your system architecture actually is.
>
> According to an email you sent before, gcc seems to think the target
> architecture is 486. I am going to assume that you are not actually running
> this on a 486 machine. The missing "__sync_add_and_fetch_8" instructions
> are available 586 and beyond. What architecture are you running on?
>
> This ITPP error looks like ITPP was build without SSE2 support possibly
> because it used the default architecture as defined by gcc. However, for
> performance reasons, the GraphLab compile flags force SSE/SSE2 through
> -mfpmath=sse -msse2
>
> Two possible solutions:
> 1: line 310 through line 328 of the root CMakeLists.txt
> To each set of compile flags add -march=pentium, delete -mfpmath=sse -msse2
>
> 2: More fundamentally, try to figure out why gcc thinks you are running on a
> 486.

First the more interesting problem:

> 2: More fundamentally, try to figure out why gcc thinks you are running on a
> 486.

I wrote the guys at http://www.linode.com/ and got a response in 15minutes:

> Ubuntu compiles their software optimized for the 486 architecture in its 32-bit
> iteration. There is not a significant performance increase when optimizing for
> 686 compared to optimizing for 486, so 486 optimization remains the default
> for 32-bit Ubuntu/Debian systems.

Mystery solved.

I opted for the first solution:

> 1: line 310 through line 328 of the root CMakeLists.txt
> To each set of compile flags add -march=pentium, delete -mfpmath=sse -msse2

Then build pmf:

$ cd ~/graphlabapi/release/demoapps/pmf
$ make

Now I got the following error:
PMF compilation error:

[ 97%] Built target graphlab
[100%] Built target itdiff
Linking CXX executable pmf
CMakeFiles/pmf.dir/pmf.o: In function `itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::init_gen_rand(unsigned int)':
/usr/local/include/itpp/base/random_dsfmt.h:161: undefined reference
to `itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::sse2_param_mask'
CMakeFiles/pmf.dir/pmf.o: In function `itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::do_recursion(itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*, itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*,
itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*,
itpp::DSFMT<19937 data-blogger-escaped-10376655713290109737ull="" data-blogger-escaped-1047295u="" data-blogger-escaped-1048063u="" data-blogger-escaped-117="" data-blogger-escaped-19="" data-blogger-escaped-1ull="" data-blogger-escaped-4237361149u="" data-blogger-escaped-4291106551315987578ull="" data-blogger-escaped-4294966079u="" data-blogger-escaped-4432916062321256576ull="" data-blogger-escaped-4498102069230399ull="" data-blogger-escaped-4501400546508797ull="">::W128_T*)':
/usr/local/include/itpp/base/random_dsfmt.h:375: undefined reference

It seems that libitpp.a location was not identified. I linked to the installed itpp libraries by adding the following line below line 210 of CMakeLists.txt:

link_directories(/usr/local/lib/)

Awesome -- it works!

Timmy has also provided step by step instructions:

Instructions

http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html

Installing itpp

http://bickson.blogspot.com/search/label/itpp

$ sudo apt-get install --yes --force-yes automake autoconf libtool* gfortran  
$ sudo apt-get install --yes --force-yes liblapack-dev
$ export LDFLAGS="-L/usr/lib -lgfortran"
$ cd ~/
$ wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz  
$ tar xvzf itpp-4.2.tar.gz  
$ cd itpp-4.2  
$ ./autogen.sh  
$ ./configure --without-fft --with-blas=/home/ubuntu/lapack-3.3.0/blas_LINUX.a --with-lapack=/home/ubuntu/lapack-3.3.0/lapack_LINUX.a CFLAGS=-fPIC CXXFLAGS=-fPIC CPPFLAGS=-fPIC  
$ make  
$ sudo make install

Installing graphlabapi

$ hg clone https://graphlabapi.googlecode.com/hg/ graphlabapi
$ cd graphlabapi/
$ ./configure --bootstrap

CMakeLists.txt edits

$ vim CMakeLists.txt

add following line below line 210:

link_directories(/usr/local/lib/)

replace all instances of: (be careful - only for 32 bit Linux!)

-mfpmath=sse -msse2

with

-march=pentium

Build PMF

$ cd ~/graphlabapi/release/demoapps/pmf
$ make

Test

$ ./pmf smalltest 0 --float=true --scheduler="round_robin(max_iterations=15)"

Thanks so much Timmy - we really appreciate your feedback!

Monday, May 23, 2011

SVD++ Koren's collaborative filtering algorithm implemented in GraphLab

Due to many user requests, I have implemented Yehuda Koren's SVD++ collaborative filtering algorithm in GraphLab. Thanks to Nicholas Ampazis, and to Yehuda Koren who supplied his C++ version.
Implementation is based on the paper: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model by Yehuda Koren.

Note that unlike the original paper, our implementation is parallel, thus exploiting multiple cores whenever they are available.

Here are some timing results of the multicore implemenation:

(I used 8 core machine, 5 SVD++ iterations using Netflix partial dataset with 3M ratings)

Here are some accuracy results:

It seems that additional cores improve accuracy.

The way to run SVD++ is to
0) Istall GraphLab based on the instructions: http://graphlab.org/download.html
1) Run with run mode = 5
Example:
./pmf netflix 5 --ncpus=XX --scheduler="round_robin(max_iterations=10)"

Two other options are --minval=XX and --maxval=XX
for kddcup, it should be --minval=0 and --maxval=100
(if file name is kddcup it will automatically set those values).
For Netflix data, it should be --minval=1 and --maxval=5

Example run on the full Netflix dataset (using 8 cores:)

<55|0>bickson@biggerbro:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf netflix-r 5 --ncpus=8 --scheduler=round_robin

INFO:     pmf.cpp(main:1081): PMF/ALS Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING:  pmf.cpp(main:1083): Code compiled with GL_NO_MULT_EDGES flag - this mode does not support multiple edges between user and movie in different times
WARNING:  pmf.cpp(main:1086): Code compiled with GL_NO_MCMC flag - this mode does not support MCMC methods.
WARNING:  pmf.cpp(main:1089): Code compiled with GL_SVD_PP flag - this mode only supports SVD++ run.
Setting run mode SVD_PLUS_PLUS
INFO:     pmf.cpp(main:1126): SVD_PLUS_PLUS starting

loading data file netflix-r
Loading netflix-r TRAINING
Matrix size is: USERS 480189 MOVIES 17770 TIME BINS 27
Creating 99072112 edges (observed ratings)...
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................loading data file netflix-re
Loading netflix-re VALIDATION
Matrix size is: USERS 480189 MOVIES 17770 TIME BINS 27
Creating 1408395 edges (observed ratings)...
........loading data file netflix-rt
Loading netflix-rt TEST
skipping file
setting regularization weight to 1
PTF_ALS for matrix (480189, 17770, 27):99072112.  D=20
pU=1, pV=1, pT=1, muT=1, D=20
nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
SVD++ 20 factors (rate=8.00e-03, reg=1.50e-02)
complete. Obj=6.8368e+08, TRAIN RMSE=3.7150 VALIDATION RMSE=3.7946.
max iterations = 0
step = 1
max_iterations = 0
INFO:     asynchronous_engine.hpp(run:94): Worker 0 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 2 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 1 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 3 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 4 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 5 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 6 started.

INFO:     asynchronous_engine.hpp(run:94): Worker 7 started.

Entering last iter with 1
92.7115) Iter SVD 1, TRAIN RMSE=1.0587 VALIDATION RMSE=0.9892.
Entering last iter with 2
174.441) Iter SVD 2, TRAIN RMSE=0.9096 VALIDATION RMSE=0.9536.
Entering last iter with 3
260.442) Iter SVD 3, TRAIN RMSE=0.8678 VALIDATION RMSE=0.9805.
Entering last iter with 4
321.652) Iter SVD 4, TRAIN RMSE=0.8480 VALIDATION RMSE=0.9603.
Entering last iter with 5
388.735) Iter SVD 5, TRAIN RMSE=0.8291 VALIDATION RMSE=0.9312.
Entering last iter with 6
470.291) Iter SVD 6, TRAIN RMSE=0.8106 VALIDATION RMSE=0.9264.
Entering last iter with 7
558.886) Iter SVD 7, TRAIN RMSE=0.8046 VALIDATION RMSE=0.9270.
Entering last iter with 8
628.846) Iter SVD 8, TRAIN RMSE=0.8007 VALIDATION RMSE=0.9242.
Entering last iter with 9
687.212) Iter SVD 9, TRAIN RMSE=0.7969 VALIDATION RMSE=0.9221.
Entering last iter with 10
775.021) Iter SVD 10, TRAIN RMSE=0.7926 VALIDATION RMSE=0.9215.
Entering last iter with 11
836.143) Iter SVD 11, TRAIN RMSE=0.7907 VALIDATION RMSE=0.9203.
Entering last iter with 12
919.416) Iter SVD 12, TRAIN RMSE=0.7874 VALIDATION RMSE=0.9195.
Entering last iter with 13
1000.87) Iter SVD 13, TRAIN RMSE=0.7852 VALIDATION RMSE=0.9191.
Entering last iter with 14
1081.9) Iter SVD 14, TRAIN RMSE=0.7834 VALIDATION RMSE=0.9186.
Entering last iter with 15
1169.46) Iter SVD 15, TRAIN RMSE=0.7817 VALIDATION RMSE=0.9182.
Entering last iter with 16
1236.61) Iter SVD 16, TRAIN RMSE=0.7808 VALIDATION RMSE=0.9179.
Entering last iter with 17
1304.72) Iter SVD 17, TRAIN RMSE=0.7795 VALIDATION RMSE=0.9176.
Entering last iter with 18
1366.15) Iter SVD 18, TRAIN RMSE=0.7783 VALIDATION RMSE=0.9173.
Entering last iter with 19
1453.8) Iter SVD 19, TRAIN RMSE=0.7768 VALIDATION RMSE=0.9172.
Entering last iter with 20
1521.15) Iter SVD 20, TRAIN RMSE=0.7763 VALIDATION RMSE=0.9171.
Entering last iter with 21
1588.85) Iter SVD 21, TRAIN RMSE=0.7754 VALIDATION RMSE=0.9175.
Entering last iter with 22
1654.52) Iter SVD 22, TRAIN RMSE=0.7757 VALIDATION RMSE=0.9170.
Entering last iter with 23
1722.88) Iter SVD 23, TRAIN RMSE=0.7740 VALIDATION RMSE=0.9171.
Entering last iter with 24
1783.94) Iter SVD 24, TRAIN RMSE=0.7739 VALIDATION RMSE=0.9163.

Wednesday, April 27, 2011

Yahoo! KDD Cup using Graphlab - Track 2?

I was delighted to hear from Suhrid Balakrishnan from AT&T Labs, that he is using GraphLab pmf for factorizing a linear model for Yahoo! KDD cup - track 2. Initially I focused only on track1, but it seems that Graphlab is potentially useful for track2 as well.

Overall, this month I am aware of 20 installations of Graphlab pmf code on various research groups all over the world. Specifically I got feedback from University of Austin, University of the Aegean, University of Macedonia, Carnegie Mellon University and AT&T Labs.

I got several valuable inputs regarding his experience with GraphLab I wanted to share and ask if anyone had the same experience.

1) It is recommended to download latest version from mercurial repository. See explanation:
in my previous post http://bickson.blogspot.com/2011/04/yahoo-kdd-cup-using-graphlab.html
I am constantly improving the matrix factorization code and adapting it to the KDD dataset.

2) It is better to install itpp first, since Graphlab auto detects it and it saves much later trouble.

3) Suhrid got an interesting error when saving the factorized matrices U,V. It seems that on his Ubuntu system, factor ordering was somehow reversed.
The following matlab code solved this issue:

Ud=reshape(U(:),size(U,2),size(U,1));
Ud=Ud';

I have opened a google group for discussions and questions concerning KDD usage of GraphLab. Everyone is welcome to join:
* Group name: GraphLab KDD
* Group home page: http://groups.google.com/group/graphlab-kdd
* Group email address graphlab-kdd@googlegroups.com

Saturday, April 16, 2011

Yahoo! KDD CUP using GraphLab - Part 2

Preparing the input

I got from Sanmi Koyejo , a graduate student in University of Austin, Texas, a Python script for converting KDD Yahoo! Cup dataset into GraphLab format. Thanks so much!

It may be preferable to the Matlab script, since it seems that for some users the Matlab script goes out of memory.

I have attached the python script for reading the kdd dataset. The package requires numpy so it can use the file writing method.

Additional information about the conversion procedure is kindly supplied by Yoyo here.

NOTE: For running the resulting files in Graphlab, you will need to have access to a 64 bit machine. (32 bit machine can not load this dataset in its current form).

'''
Created on Apr 16, 2011
Read KDD cup data (Low Memory) to store in format suitable for graphlab pmf 

Requires numpy, uses ndarray.tofile() to write the binary file

Module uses (A LOT) less memory than readKddData.py by reading and writing one user at a time
The tradeoff is that the max user id, item id, days and number of ratings must be known beforehand
This is because pmf expects this input in the first line of the File

Known Issue: Number of test (-f3) items is 624959, although we hard-code 624961.
This restriction comes from a bug(?) in pmf (to be fixed soon)
Ignore this warning if this your only issue

usage: python readKddLM.py --help
python readKddLM.py -i trainIdx.txt -o kddcup -f 1
python readKddLM.py -i validationIdx.txt -o kddcupe -f 2
python readKddLM.py -i testIdx.txt -o kddcupt -f 3
@author: Sanmi Koyejo; sanmi.k@mail.utexas.edu
'''

from optparse import OptionParser
from numpy import array, dtype, amax, maximum, zeros, int_

def readLine(fileHandle, splitter='|'):
    ''' read single line'''
    line = fileHandle.readline()
    if not line: #EOF
        return line # return null and let caller handle it
    return line.rstrip().split(splitter) # split the line and remove newline character

def readChunk(fileHandle, chunkSize, splitter='\t'):
    '''read a pre-specified chunksize'''
    for _ in range(chunkSize):
        line = fileHandle.readline()
        if not line: #EOF
            break
        yield line.rstrip().split(splitter)
        
def readOneUser(fileHandle, testFlag=False, verbose=True):
    ''' reads data for one user and returns rating Matrix'''
    
    while 1:
        line = readLine(fileHandle)
        if not line: break # EOF
        assert(len(line)==2)
        userID = float(line[0])
        nRatings = int(line[1])
        
        rateMat = zeros( (nRatings, 4), dtype=dtype('f4'))
        rateMat[:,0] = userID+1 # user ID
        
        for count, line in enumerate(readChunk(fileHandle, nRatings)):
            # note allow last user to break nratings constraint. All other users should satisfy this
            rateMat[count, 1] = float(line[0])+1 # item ID
            
            if testFlag:
                assert(len(line)==3) # error checking
                rateMat[count, 2] = float(line[1]) # day
                rateMat[count, 3] = 1.0 # rating
            else:
                assert(len(line)==4)
                rateMat[count, 2] = float(line[2]) # day
                rateMat[count, 3] = float(line[1]) # rating
        
        if verbose and nRatings != count+1: 
            '''User had a different number of items than expected 
            will only work for last user, any difference for other users will trigger assert errors'''
            print "Warning: Expected", nRatings, "ratings from user; id:", int(userID), ", read", count+1, "ratings."
            rateMat = rateMat[:count+1,:]
        yield rateMat

def KddDataParser(infile, outfile, size, testFlag, verbose):
    ''' read data for each user and write to binary format'''
    
    # setup storage for max user, item, nratings
    readLen = 0
    readSize = zeros(3, dtype=dtype('i4'))
        
    # open reader and writer file handles
    if verbose: print "opening input file", infile
    readhandle = open(infile, 'rb')
    if verbose: print "opening output file", outfile
    writehandle = open(outfile, 'wb')
    
    # write the size information
    size.tofile(writehandle)
    
    # read for each user
    for count, rateMat in enumerate(readOneUser(readhandle, testFlag, verbose)):
        readSize = maximum(readSize, int_(amax(rateMat[:,:3], axis=0)) ) # max user, max item, max time
        readLen  += rateMat.shape[0]  
        
        rateMat[:,1]+=float(size[0]) # itemID = itemID+maxUser
        rateMat.tofile(writehandle)
        
        if verbose: 
            if count%50000 == 0: print 'read', rateMat.shape[0], 'ratings from user', int(rateMat[0,0])-1
    
    # close reader and writer file handles
    readhandle.close()
    writehandle.close()
    
    if verbose: print "data conversion completed"
    
    return readSize, readLen

def main():
    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-q", "--quiet", action="store_false", dest="verbose", default=True)
    parser.add_option("-i", "--infile", dest="infile",
                      help="input file name", default="smallTrain.txt") # fixme
    parser.add_option("-o", "--outfile", dest="outfile",
                      help="output file name", default="kddcupLM")
    parser.add_option("-f", "--filetype", dest="filetype", type='int',
                      help="training=1, validation=2, test=3", default=1)
    parser.add_option("-u", "--nuser", dest="nuser",
                      help="max ID of users, if not set, defaults to expected KDD size")
    parser.add_option("-m", "--nitem", dest="nitem",
                      help="max ID of items, if not set, defaults to expected KDD size")
    parser.add_option("-t", "--ntime", dest="ntime",
                      help="max number of days, if not set, defaults to expected KDD size")
    parser.add_option("-r", "--nrate", dest="nrate",
                      help="number of ratings, if not set, defaults to expected KDD size")
    (options, args) = parser.parse_args()
    
    # setup nUser/nitem/nTime defaults based on train/valid/test
    nuser = 1000990 if options.nuser== None else options.nuser 
    nitem = 624961 if options.nitem== None else options.nitem
    '''TODO: once pmf is modified, change definition of nitem
    nitem (train, valid)== 624961
    nitem(test)== 624959 
    Should not affect results'''

    if options.filetype==1:  
        ntime = 6645 if options.ntime== None else options.ntime
        nrate = 252800275 if options.nrate== None else options.nrate
        istest= False
    elif options.filetype==2:
        ntime = 6645 if options.ntime== None else options.ntime
        nrate = 4003960 if options.nrate== None else options.nrate
        istest= False
    elif options.filetype==3:
        ntime = 6649 if options.ntime== None else options.ntime
        nrate = 6005940 if options.nrate== None else options.nrate
        istest   = True
    else:
        errorStr = "--filetype input: "+`options.filetype`+". Allowed values are 1, 2, 3"
        raise LookupError(errorStr)
    
    size = array([nuser, nitem, ntime, nrate], dtype=dtype('i4'))
    
    [nUser, nItem, nDays], nRate = KddDataParser(options.infile, options.outfile, size, istest, options.verbose)
    
    print 'input nuser:', nuser, ', max ID of user read:', nUser
    print 'input nitem:', nitem, ', max ID of item read:', nItem
    print 'input ndays:', ntime, ', max day read', nDays
    print 'input nrate:', nrate, ', Number of ratings read:', nRate
    
    if (nuser!=nUser) or (nitem!=nItem) or (ntime!=nDays) or (nrate!=nRate):
        print "Warning: input parameters differ from output parameters,",
        print "graphlab pmf may not run correctly !!!"
    
if __name__ == '__main__':
    main()

Sanity check 0: The downloaded file size from Yahoo! should be:
-rw-r--r-- 1 bickson users 134407201 2011-01-24 07:46 testIdx1.txt
-rw-r--r-- 1 bickson users 5967164350 2011-01-24 10:23 trainIdx1.txt
-rw-r--r-- 1 bickson users 104193447 2011-01-24 10:25 validationIdx1.txt

Sanity check 1: The output file size should be:
$ ls -l kddcup*
-rw-r–r– 1 sil sil 4044804416 2011-06-27 18:18 kddcup
-rw-r–r– 1 sil sil 64063376 2011-06-28 12:51 kddcupe
-rw-r–r– 1 sil sil 96095056 2011-06-28 16:22 kddcupt
(Thanks Yoyo!)

Sanity check 2: you can use the md5sum command to verify creation of inputs.
You should get the following numbers:

<34|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcupe
aa76bb1d0e6e897e270ed65d021ed1d8  kddcupe
<35|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcupt
917599ce7f715890a2705dc04851ac12  kddcupt
<36|0>bickson@bigbro6:~/newgraphlab/graphlabapi/debug/demoapps/pmf$ md5sum kddcup
345b168a208757b3098c6674b2fb653a  kddcup

If you got different output, please check carefully that the command line arguments used are as instructed.

Sanity check 3: When running the third script, you should see the output:
> data conversion completed
> input nuser: 1000990 , max ID of user read: 1000990
> input nitem: 624961 , max ID of item read: 624959
> input ndays: 6649 , max day read 6649
> input nrate: 6005940 , Number of ratings read: 6005940
> Warning: input parameters differ from output parameters, graphlab pmf may
> not run correctly !!!

Large Scale Machine Learning and Other Animals