## Friday, April 20, 2012

### KDD Cup 2012 Track 1 using GraphLab

As some of you may recall, last year we had great excitement with KDD Cup 2011, where we took 5th place in Track 1. Additionally, the group that won the contest was using GraphLab as well. This year I really wanted to take another look at this interesting and high-visibility contest, but I did not find the time for it until now. I got some interesting questions from students in the Georgia Tech ML course taught by Le Song, where the course task is to compete in KDD Cup 2012.

In this blog post I will share some of my approaches to dealing with KDD Cup 2012 Track 1. I must warn my readers that these are very preliminary results, but I will be glad to get any input along the way.
Acknowledgement: my close collaborator, JustinYan from the Chinese National Academy of Science, deserves credit for most of the results below. By the way, he is looking for a summer internship in the US. Contact him if you are looking for an intern. He is based in China, so he will need a US intern visa.

## First try - matrix factorization using alternating least squares

Goal: build a linear model for the user/item matrix and predict the missing values (the test data).

### Preparing the training data:

Some explanations and motivation. Track 1 does not have a separate validation dataset, as there was, for example, in the Netflix Prize and last year's KDD Cup. It is very useful to have a validation dataset, or else one may easily overfit the training dataset. Following JustinYan's advice, I separated the data based on its time properties. According to the unix timestamps, the ratings are distributed from day 283 to day 314 of the year: timestamps lie in [1318348800, 1321027199], where 1318348800 is 2011.10.12 00:00:00 (UTC+8) and 1321027199 is 2011.11.11 23:59:59 (UTC+8). We take the last 5 days (310-314) as the validation set and the first days (283-309) as the training set.
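The date-based split is just a threshold on the day of the year. Here is a minimal Python sketch; the field order (user, item, result, unix timestamp) is assumed from the rec_log_train.txt format, and note that this sketch uses UTC while the contest quotes times in UTC+8, so a precise split should shift by 8 hours:

```python
import time

CUTOFF_DAY = 310  # days 283-309 -> training, days 310-314 -> validation

def split_by_day(lines):
    """Split rating lines into training and validation sets by day of year."""
    train, valid = [], []
    for line in lines:
        user, item, result, ts = line.split()
        day = time.gmtime(int(ts)).tm_yday  # day of the year (1-366), in UTC
        (valid if day >= CUTOFF_DAY else train).append(line)
    return train, valid
```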

Here is what I did to convert the data to the GraphLab v1 matrix factorization format:

1) The GraphLab matrix factorization code assumes there is a single rating between each user and data item, while in rec_log_train.txt many items are rated several times. (Later we can add support for multiple ratings of the same item.) Currently I am also ignoring the temporal aspect; we can add that later. First, sort the training file:

sort -n -k 1,1 -k 2,2 rec_log_train.txt > rec_log_train.sorted.txt

2) Use the GraphLab v2 KDD train parser to filter out the duplicate ratings:
<468|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kdd_train_data_parser rec_log_train.sorted.txt
WARNING:  kdd_train_data_parser.cpp(main:224): Eigen detected. (This is actually good news!)
INFO:     kdd_train_data_parser.cpp(main:226): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Currently implemented parsers are: Call data records, document tokens
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO:     io.hpp(gzip_in_file:800): Opening input file: rec_log_train.sorted.txt
INFO:     io.hpp(gzip_out_file:831): Opening output file rec_log_train.sorted.txt.out
INFO:     kdd_train_data_parser.cpp(operator():181): Finished parsing total of 73209277 lines in file rec_log_train.sorted.txt
INFO:     kdd_train_data_parser.cpp(operator():197): Finished exporting a total of 39752418 ratings
Finished in 103.088

Explanation: whenever there are multiple instances of the same user-item pair with the same rating, they are pruned and only one copy is kept. Whenever there are two or more instances of the same user-item pair with conflicting ratings, namely both -1 and 1, they are removed from the training data entirely.
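The pruning rule can be sketched in Python (illustrative only, not the parser's actual implementation):

```python
def dedup_ratings(pairs):
    """Collapse duplicate (user, item, rating) triplets.

    Identical duplicates keep one copy; pairs rated both -1 and 1
    are conflicting and get dropped entirely.
    """
    seen = {}  # (user, item) -> set of ratings observed for that pair
    for user, item, rating in pairs:
        seen.setdefault((user, item), set()).add(rating)
    # keep only pairs whose ratings all agree (a single distinct value)
    return [(u, i, r.pop()) for (u, i), r in seen.items() if len(r) == 1]
```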

### Preparing the test data:

The test data has the following format:
[user] [item] [empty rating] [time]
It has 34,910,937 unique user/item pairs. The test data is explained here:
The rec_log_test.txt file contains data for both the public leaderboard set and the final evaluation set. The file is sorted temporally. Any data with a timestamp < 1321891200 is used for the public leaderboard set, and >= 1321891200 is used for the final evaluation set. The goal is to predict which items that were recommended to the user were followed by the user, within each set.
In other words, the test data can be divided in two, and within each set we should recommend the highest-ranked items. testA has 19,349,608 user/item pairs and testB has 15,561,329.
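The A/B split is a single timestamp threshold; a minimal Python sketch, assuming each row is a (user, item, rating, timestamp) tuple:

```python
LEADERBOARD_CUTOFF = 1321891200  # cutoff given in the contest description

def split_test(rows):
    """Split test rows into the public leaderboard set (testA) and the
    final evaluation set (testB) by timestamp."""
    test_a = [r for r in rows if r[3] < LEADERBOARD_CUTOFF]
    test_b = [r for r in rows if r[3] >= LEADERBOARD_CUTOFF]
    return test_a, test_b
```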

Here are some more detailed instructions:
1) Unzip the file using:
# unzip rec_log_test.txt.zip
2) Use the GraphLab v2 kdd_test_splitter to prepare the test:
./kdd_test_splitter rec_log_test.txt --ncpus=1
WARNING:  kdd_test_splitter.cpp(main:159): Eigen detected. (This is actually good news!)
INFO:     kdd_test_splitter.cpp(main:161): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
Rows:      2421057
Cols:      2421057
Nonzeros:  34910937
Constructing all vertices.
Graph size:    34910937
Start finalize..
INFO:     kdd_test_splitter.cpp(operator():132): Finished going over 19349608 rating  for 1196411 unique users
INFO:     kdd_test_splitter.cpp(operator():133): Successfully created output file: testA.sorted
Rows:      2421057
Cols:      2421057
Nonzeros:  34910937
Constructing all vertices.
Graph size:    34910937
Start finalize..
INFO:     kdd_test_splitter.cpp(operator():132): Finished going over 15561329 rating  for 1196411 unique users
INFO:     kdd_test_splitter.cpp(operator():133): Successfully created output file: testB.sorted
Finished in 164.793
INFO:     kdd_test_splitter.cpp(main:184): Finished exporting test data!

The outputs of this procedure are the split test files: testA.sorted, holding the earlier ratings, and testB.sorted, holding the later ones. Ratings are sorted first by user and then by item number.

### Understanding the cost function

Track 1 uses AP@3 (average precision at 3 items) as its error measure. Here is pseudocode I got from JustinYan which computes it:

private static double ProcessList(List<Sample> labelList1)
{
    // labelList is an ordered list of the top 3 computed predictions
    List<Sample> labelList = labelList1.OrderByDescending(x => x.Prediction).Take(3).ToList();
    // realList is the list of ratings that were equal to one
    List<Sample> realList = labelList1.Where(x => x.RealValue > 0).ToList();
    // realClickCount is the number of items the user clicked on
    double realClickCount = realList.Count();
    double ap = 0;
    int count = 0;
    // if the user clicked on nothing, the AP is zero
    if (realClickCount == 0)
        return 0;
    // for each of the top 3 recommendations
    for (int i = 0; i < labelList.Count; i++)
    {
        // if one of the top 3 items we recommended was actually
        // selected, increase our precision
        if (labelList[i].RealValue > 0)
        {
            ap += ++count * 1.0 / (i + 1);
        }
    }
    return ap / realClickCount;
}

GraphLab now supports AP@3 cost function calculation using the command line switch --calc_ap=true
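For readers who prefer Python, here is a small sketch equivalent to the pseudocode above:

```python
def ap_at_3(predictions, labels):
    """AP@3: average precision over the top 3 ranked items.

    predictions: model scores, one per recommended item
    labels: true outcomes in the same order (1 = followed, -1 = not)
    """
    # rank item indices by descending predicted score, keep the top 3
    ranked = sorted(range(len(predictions)),
                    key=lambda i: predictions[i], reverse=True)[:3]
    real_clicks = sum(1 for y in labels if y > 0)
    if real_clicks == 0:
        return 0.0  # the user clicked on nothing: AP is zero
    ap, hits = 0.0, 0
    for pos, idx in enumerate(ranked, start=1):
        if labels[idx] > 0:  # a recommended item was actually followed
            hits += 1
            ap += hits / pos
    return ap / real_clicks
```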

### Running GraphLab SGD

1) Create a symbolic link so the test file is recognized by GraphLab:

ln -s testA.sorted rec_log_train.sorted.txt.outt

2) Run bias-SGD using the command:

./pmf rec_log_train.sorted.txt.out 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --sgd_step_dec=0.98 --sgd_gamma=0.005 --sgd_lambda=0.005 --exporttest=true --calc_ap=true --reduce_mem_consumption=true --exportlinearmodel=false
INFO:     pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING:  pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO:     pmf.cpp(start:291): Bias-SGD starting

Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.

Bias-SGD for matrix (2421057, 2421057, 1):33589556.  D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO:     pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO:     asynchronous_engine.hpp(run:111): Worker 0 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 1 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 2 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 3 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 4 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 5 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 6 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 7 started.

Entering last iter with 1
21.4589) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.328273 AP@3 for validation: 0.292205
Entering last iter with 2
68.1817) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332562 AP@3 for validation: 0.308924
Entering last iter with 3
114.985) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332799 AP@3 for validation: 0.312879
Entering last iter with 4
162.032) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.333644 AP@3 for validation: 0.314124
Entering last iter with 5
219.16) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336898 AP@3 for validation: 0.319216
Entering last iter with 6
289.758) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340856 AP@3 for validation: 0.321952
Entering last iter with 7
360.691) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344672 AP@3 for validation: 0.322379
Entering last iter with 8
431.754) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.352534 AP@3 for validation: 0.325194
INFO:     asynchronous_engine.hpp(run:119): Worker 5 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 2 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 1 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 6 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 7 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 4 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 3 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 0 finished.

Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 466.943851 seconds
INFO:     io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.740713
INFO:     read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out.test.predictions vector size: 19349608
Performance counters are: 0) EDGE_TRAVERSAL, 1421.59
Performance counters are: 2) CALC_RMSE_Q, 0.296477
Performance counters are: 6) CALC_OBJ, 1.17573

As you can see, each iteration of bias-SGD takes around 60 seconds on an 8-core machine, so a single good multicore machine is enough. The output is the file rec_log_train.sorted.txt.out.test.predictions.
Let's look at the output:

%%MatrixMarket matrix array real general
%%output predictions for test data
19349608 1
-1
-0.7164881229401
0.07807982712984
-0.9745061397552
-0.7039368152618
-0.6436884999275
-1
-0.6042827963829

For each user/item pair in the testA.sorted data, we get a prediction.
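Each value is the bias-SGD estimate for that pair, clamped into the range set by --minval/--maxval. A hypothetical sketch of the prediction rule (the symbols are illustrative, not GraphLab's internals):

```python
def predict(mu, b_u, b_i, p_u, q_i, minval=-1.0, maxval=1.0):
    """Bias-SGD style estimate: global mean + user bias + item bias
    + dot product of the user and item latent factor vectors."""
    r = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    # clamp into the rating range set by --minval / --maxval
    return min(max(r, minval), maxval)
```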

Now we store the results for testA:

mv rec_log_train.sorted.txt.out.test.predictions testA.predictions

And we run GraphLab again, this time with testB:

./pmf rec_log_train.sorted.txt.out2 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --lambda=0.01 --sgd_gamma=0.005 --sgd_lambda=0.005 --sgd_step_dec=0.98 --exporttest=true --calc_ap=true
INFO:     pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING:  pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO:     pmf.cpp(start:291): Bias-SGD starting

Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.

INFO:     asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.

Bias-SGD for matrix (2421057, 2421057, 1):33589556.  D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO:     pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO:     asynchronous_engine.hpp(run:111): Worker 1 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 0 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 2 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 3 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 4 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 5 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 6 started.

INFO:     asynchronous_engine.hpp(run:111): Worker 7 started.

Entering last iter with 1
42.9776) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.327499 AP@3 for validation: 0.291468
Entering last iter with 2
113.61) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.331413 AP@3 for validation: 0.308851
Entering last iter with 3
184.123) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332523 AP@3 for validation: 0.312983
Entering last iter with 4
251.239) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.334125 AP@3 for validation: 0.314801
Entering last iter with 5
307.7) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336783 AP@3 for validation: 0.318807
Entering last iter with 6
377.682) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340606 AP@3 for validation: 0.320896
Entering last iter with 7
447.13) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344897 AP@3 for validation: 0.322647
Entering last iter with 8
517.358) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO:     biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.351736 AP@3 for validation: 0.324784
INFO:     asynchronous_engine.hpp(run:119): Worker 2 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 0 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 5 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 6 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 1 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 4 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 7 finished.

INFO:     asynchronous_engine.hpp(run:119): Worker 3 finished.

Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 552.339237 seconds
INFO:     io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.733542
INFO:     read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out2.test.predictions vector size: 15561329
Performance counters are: 0) EDGE_TRAVERSAL, 2109.31
Performance counters are: 2) CALC_RMSE_Q, 0.264258
Performance counters are: 6) CALC_OBJ, 1.1756

### Final parsing and compilation of submission

Use GraphLab v2 to create the results in KDD Cup format. This code goes over the predictions and builds a list of the top 3 recommended test items for each user.

The submission file contains exactly 1,340,127 users (713,611 from testA and 626,516 from testB).

There are four inputs to the output creation: the sorted testA and testB files we created previously, and the two matching prediction files output by GraphLab.
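The top-3 selection step the output builder performs can be sketched in Python (illustrative only; the real tool also writes the KDD Cup submission format):

```python
from collections import defaultdict
from heapq import nlargest

def build_top3(user_ids, item_ids, scores):
    """Group (score, item) pairs by user and keep each user's
    3 highest-scoring test items. Ties are broken arbitrarily."""
    by_user = defaultdict(list)
    for u, i, s in zip(user_ids, item_ids, scores):
        by_user[u].append((s, i))
    return {u: [i for _, i in nlargest(3, pairs)]
            for u, pairs in by_user.items()}
```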

<309|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kddcup_output_builder testA.sorted testA.predictions testB.sorted testB.predictions --ncpus=1
WARNING:  kddcup_output_builder.cpp(main:165): Eigen detected. (This is actually good news!)
INFO:     kddcup_output_builder.cpp(main:167): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
Rows:      2421057
Cols:      2421057
Nonzeros:  15561329
Constructing all vertices.
Graph size:    15561329
INFO:     io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testB.predictions
INFO:     kddcup_output_builder.cpp(operator():131): Finished going over 15561329 rating  for 626516 unique users
INFO:     kddcup_output_builder.cpp(operator():132): Successfully created output file: testB.sorted.out
Rows:      2421057
Cols:      2421057
Nonzeros:  19349608
Constructing all vertices.
Graph size:    19349608
INFO:     io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testA.predictions
INFO:     kddcup_output_builder.cpp(operator():131): Finished going over 19349608 rating  for 713611 unique users
INFO:     kddcup_output_builder.cpp(operator():132): Successfully created output file: testA.sorted.out
Finished in 82.1905
INFO:     kddcup_output_builder.cpp(main:189): Merging the two files together
updating: testA.sorted.out (deflated 72%)
INFO:     kddcup_output_builder.cpp(main:208): Successfully created zip file: submission.zip
INFO:     kddcup_output_builder.cpp(main:214): removed temporary files
INFO:     kddcup_output_builder.cpp(main:216): Created successfully zip file: submission.zip

### Preliminary results

Here is a very preliminary report on the results I was able to obtain so far:

- ALS: 0.254 on the leaderboard
- SVD++: 0.29 on the leaderboard (thanks to my collaborators at the Hungarian National Academy of Science, specifically Tamas Kiss)
- bias-SGD: 0.330 on the leaderboard (after some optimization JustinYan got 0.34 with it)
- logistic regression: 0.336

### Disclaimer

As the code is evolving, there may be some bugs or inaccuracies. However, I did not want to wait until the last minute to release a perfect solution that no one would be able to use. Please ping me if you find any bugs or have any suggestions.

1. Hi Danny, nice article, I have a question, how were you able to generate a prediction for each user-item pair in the test file? Because a lot of users/items in the test file do not appear in the training file.

2. I meant specifically the alternating least squares based approach, which, as I understand it, requires for prediction that you take the dot product of the corresponding user and item vectors learned in the training phase.

1. Hi!
I answered a week ago but somehow the answer did not appear.. :-(
Anyway, I suggest using bias-SGD, where in the case of a user missing from the training data the item's bias is used. In ALS we don't have a good answer for how to predict cold-start users.
Maybe try predicting the average for that item.
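That item-bias fallback can be sketched in Python (a hypothetical helper; the mu, user_bias, and item_bias values are assumed to come from training):

```python
def predict_cold_start(user, item, mu, user_bias, item_bias):
    """Predict a rating, falling back to the item bias alone
    when the user was never seen in training (cold start)."""
    if user in user_bias:
        return mu + user_bias[user] + item_bias.get(item, 0.0)
    # unseen user: global mean plus the item's bias is the best we have
    return mu + item_bias.get(item, 0.0)
```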

2. Hi. I'm using matrix factorization too. (I have not tried GraphLab Yet). My current MAP is 0.38234.
I have a question and it is very kind of you if you give your opinion about it.
As ehtsham mentioned, users in the test file who have no data in the training file (81% of users in the test file!) are a big problem for MF models. But this is not the only problem. The most prevalent items in the test file are extremely rare in the training period.
Do you think there is a way to alleviate this problem?

3. Here is the answer I got from Michael Jahrer from the commendo team to your question:

Wow 0.38234 is really good for one model.
I assume that this is more than plain SVD.
Yes the user overlap in train and test is very bad, so you have to use other user information sources in order to solve the issue.

My best factor model (SVD) has 0.38591 leaderboard score, but this integrates additional parts: asymmetric info, user sns, user action, user keywords, user tags, user genre and user birthYear info. It has 20 features, more seems to hurt. And i trained it to minimize the rank (like in our 2011 kdd papers) - ranking improves the MAP approx. 0.005. By using all this parts the train/test user coverage is much better - therefore the better score. Plain SVD gives me about 0.346 on leaderboard. An ASVD gives me 0.352, ASVD + user sns part gives me 0.371 leaderboard score.

Thanks Michael!

4. Thank you both very much for the response!

3. Hi Danny,

I tried to follow this blog post, but I keep getting out of memory error. I am running on 12GB RAM.
How big is the memory needed to run this script?

Thank you

Philips

1. Hi,
Can you be more specific? Which of the programs mentioned above gives you the error?