As some of you may recall, last year we had some great excitement with KDD CUP 2011, where we won 5th place in track 1. Additionally, the group that won the contest was using GraphLab as well. This year I really wanted to take another look at this interesting and high-visibility contest, but until now I have not found the time for it. I got some interesting questions from students in the Georgia Tech ML course taught by
Le Song, where the course task is to compete in
KDD CUP 2012.
In this blog post I will share some of my approaches to dealing with KDD CUP 2012 track 1. I must warn my readers that these are very preliminary results, but I will be glad to get any input along the way.
Acknowledgement: my close collaborator,
JustinYan from the Chinese Academy of Sciences
deserves the credit for most of the results below. By the way, he is looking for a summer internship in the US. Contact him if you are looking for an intern. He is based in China, so he will need a US intern visa.
First try - matrix factorization using alternating least squares.
Goal: build a linear model for the user/item matrix and predict the missing values (the test data).
Preparing the training data:
Some explanations and motivation: track 1 does not have a separate validation dataset, as there was, for example, in Netflix and last year's KDD CUP. It is very useful to have a validation dataset, or else one may easily overfit the training dataset. Following JustinYan's advice, I have separated the data based on its time properties. According to the unix timestamps,
ratings are distributed from day 283 to day 314 (timestamps lie in [1318348800, 1321027199], where 1318348800 is 2011.10.12 00:00:00 (unix time, +8) and 1321027199 is 2011.11.11 23:59:59 (unix time, +8)). We take the last 5 days (310-314) as the validation set and the first days (283-309) as the training set.
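One possible way to implement this time-based split is sketched below. The assumptions here are my own, not from the official tools: each line of rec_log_train.txt has the form "user item rating unix_timestamp", and day 283 starts at unix time 1318348800 (2011.10.12 00:00:00, +8).

```python
# Sketch of the time-based train/validation split. Assumption (hypothetical):
# day 283 begins at unix time 1318348800, so the origin of the day numbering is
DAY0 = 1318348800 - 283 * 86400

def split_by_day(lines, first_validation_day=310):
    """Return (train, validation) line lists, splitting on the rating day.

    Assumes the unix timestamp is the fourth whitespace-separated field."""
    train, validation = [], []
    for line in lines:
        timestamp = int(line.split()[3])
        day = (timestamp - DAY0) // 86400
        (validation if day >= first_validation_day else train).append(line)
    return train, validation
```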
And here is what I did to convert the data into the GraphLab v1 matrix factorization format:
1) download and unzip the file track1.zip.
2) I am using the GraphLab matrix factorization version, which assumes there is a single rating between each user and each item. In rec_log_train.txt there are many items which are rated several times. (Later we can add support for multiple ratings of the same item.) Currently I am also ignoring the temporal aspect;
we can add that later. First, sort the ratings by user and then by item:
sort -n -k 1,1 -k 2,2 rec_log_train.txt > rec_log_train.sorted.txt
3) Use the GraphLab version 2 KDD train parser to filter out duplicate ratings:
<468|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kdd_train_data_parser rec_log_train.sorted.txt
WARNING: kdd_train_data_parser.cpp(main:224): Eigen detected. (This is actually good news!)
INFO: kdd_train_data_parser.cpp(main:226): GraphLab parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Currently implemented parsers are: Call data records, document tokens
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(gzip_in_file:800): Opening input file: rec_log_train.sorted.txt
INFO: io.hpp(gzip_out_file:831): Opening output file rec_log_train.sorted.txt.out
INFO: kdd_train_data_parser.cpp(operator():181): Finished parsing total of 73209277 lines in file rec_log_train.sorted.txt
INFO: kdd_train_data_parser.cpp(operator():197): Finished exporting a total of 39752418 ratings
Finished in 103.088
Explanation: whenever there are multiple instances of the same user-item pair, with the same ratings, they are pruned and only one is left. Whenever there are two or more instances of the same user-item pair with conflicting ratings, namely both -1 and 1, they are ignored and removed from the training data.
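The pruning rule can be sketched in a few lines of Python (a simplified stand-in for the parser, assuming the input is a list of (user, item, rating) triplets):

```python
def prune_duplicates(ratings):
    """Keep one rating per (user, item) pair; drop pairs whose duplicate
    ratings conflict (i.e. both -1 and 1 were observed for the same pair)."""
    seen = {}  # (user, item) -> set of distinct ratings observed
    for user, item, rating in ratings:
        seen.setdefault((user, item), set()).add(rating)
    # keep only pairs with a single consistent rating value
    return [(u, i, r.pop()) for (u, i), r in seen.items() if len(r) == 1]
```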
Preparing the test data:
The test data contains the following format:
[user ] [ item ] [ empty rating ] [ time ]
It has 34,910,937 unique user/item pairs. The test data is explained
here:
The rec_log_test.txt file contains data for both the public leaderboard set and final evaluation set. The file is sorted temporally. Any data with a timestamp < 1321891200 is used for the public leaderboard set, and >= 1321891200 is used for the final evaluation set. The goal is to predict which items that were recommended to the user were followed by the user, within each set.
In other words, the test data can be divided in two; in each set we should recommend the highest-ranked items. testA has 19,349,608 ratings and testB has 15,561,329 ratings.
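The leaderboard/final split can be reproduced directly from the timestamp column. A small sketch (assuming the timestamp is the fourth field, as in the training file):

```python
CUTOFF = 1321891200  # from the contest instructions

def split_test(lines):
    """Split rec_log_test.txt lines into (public_leaderboard, final_evaluation).

    Timestamps < CUTOFF go to the public leaderboard set, >= CUTOFF to the
    final evaluation set."""
    public = [l for l in lines if int(l.split()[3]) < CUTOFF]
    final = [l for l in lines if int(l.split()[3]) >= CUTOFF]
    return public, final
```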
Here are some more detailed instructions:
5) Download the test data file:
rec_log_test.txt.zip
6) Unzip the file using
# unzip rec_log_test.txt.zip
7) Use the GraphLab v2 kdd_test_splitter to prepare the test data:
./kdd_test_splitter rec_log_test.txt --ncpus=1
WARNING: kdd_test_splitter.cpp(main:159): Eigen detected. (This is actually good news!)
INFO: kdd_test_splitter.cpp(main:161): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt
Rows: 2421057
Cols: 2421057
Nonzeros: 34910937
Constructing all vertices.
Adding edges.
Graph size: 34910937
Start finalize..
INFO: kdd_test_splitter.cpp(operator():132): Finished going over 19349608 rating for 1196411 unique users
INFO: kdd_test_splitter.cpp(operator():133): Successfully created output file: testA.sorted
INFO: io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt
Rows: 2421057
Cols: 2421057
Nonzeros: 34910937
Constructing all vertices.
Adding edges.
Graph size: 34910937
Start finalize..
INFO: kdd_test_splitter.cpp(operator():132): Finished going over 15561329 rating for 1196411 unique users
INFO: kdd_test_splitter.cpp(operator():133): Successfully created output file: testB.sorted
Finished in 164.793
INFO: kdd_test_splitter.cpp(main:184): Finished exporting test data!
The outputs of this procedure are the split test files: testA.sorted holds the earlier ratings and testB.sorted the later ones. Ratings are sorted first by user and then by item number.
Understanding the cost function
Track 1 uses AP@3 (average precision at 3 items) as its error measure. Here is pseudocode I got from JustinYan which computes it:
private static double ProcessList(List<Sample> labelList1)
{
    // labelList is an ordered list of the top 3 computed predictions
    List<Sample> labelList = labelList1.OrderByDescending(x => x.Prediction).Take(3).ToList();
    // realList is the list of ratings that were equal to one
    List<Sample> realList = labelList1.Where(x => x.RealValue > 0).ToList();
    // realClickCount is the number of items the user clicked on
    double realClickCount = realList.Count();
    double ap = 0;
    int count = 0;
    // if the user clicked on nothing, we get an AP of zero
    if (realClickCount == 0)
        return 0;
    // for each of the top 3 recommended items
    for (int i = 0; i < labelList.Count; i++)
    {
        // if one of the top 3 items we recommended had actually been
        // selected, increase our precision
        if (labelList[i].RealValue > 0)
        {
            ap += ++count * 1.0 / (i + 1);
        }
    }
    return ap / realClickCount;
}
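Here is the same computation as a small Python sketch (my own port of the pseudocode above, not the official scorer). Each sample is a (prediction, real_value) pair for one candidate item shown to a user:

```python
def ap_at_3(samples):
    """samples: list of (prediction, real_value) pairs for one user.

    Mirrors the C# pseudocode: rank by prediction, keep the top 3, and
    accumulate precision at each position where the item was clicked."""
    top3 = sorted(samples, key=lambda s: s[0], reverse=True)[:3]
    real_clicks = sum(1 for _, real in samples if real > 0)
    if real_clicks == 0:
        return 0.0  # the user clicked on nothing
    ap, hits = 0.0, 0
    for i, (_, real) in enumerate(top3):
        if real > 0:
            hits += 1
            ap += hits / (i + 1)
    return ap / real_clicks
```

For example, with clicks at ranks 1 and 3 out of three total clicks, AP@3 is (1/1 + 2/3) / 3 = 5/9.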
GraphLab now supports AP@3 cost function calculation using the command line switch --calc_ap=true
Running GraphLab SGD
8) Link the test file under the name GraphLab expects:
ln -s testA.sorted rec_log_train.sorted.txt.outt
9) Run bias-SGD using the command:
./pmf rec_log_train.sorted.txt.out 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --sgd_step_dec=0.98 --sgd_gamma=0.005 --sgd_lambda=0.005 --exporttest=true --calc_ap=true --reduce_mem_consumption=true --exportlinearmodel=false
INFO: pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO: pmf.cpp(start:291): Bias-SGD starting
loading data file rec_log_train.sorted.txt.out
Loading Matrix Market file rec_log_train.sorted.txt.out TRAINING
Loading rec_log_train.sorted.txt.out TRAINING
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556
loading data file rec_log_train.sorted.txt.oute
Loading Matrix Market file rec_log_train.sorted.txt.oute VALIDATION
Loading rec_log_train.sorted.txt.oute VALIDATION
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863
loading data file rec_log_train.sorted.txt.outt
Loading Matrix Market file rec_log_train.sorted.txt.outt TEST
Loading rec_log_train.sorted.txt.outt TEST
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 19349608
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.
Bias-SGD for matrix (2421057, 2421057, 1):33589556. D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO: pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
INFO: asynchronous_engine.hpp(run:111): Worker 2 started.
INFO: asynchronous_engine.hpp(run:111): Worker 3 started.
INFO: asynchronous_engine.hpp(run:111): Worker 4 started.
INFO: asynchronous_engine.hpp(run:111): Worker 5 started.
INFO: asynchronous_engine.hpp(run:111): Worker 6 started.
INFO: asynchronous_engine.hpp(run:111): Worker 7 started.
Entering last iter with 1
21.4589) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.328273 AP@3 for validation: 0.292205
Entering last iter with 2
68.1817) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332562 AP@3 for validation: 0.308924
Entering last iter with 3
114.985) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332799 AP@3 for validation: 0.312879
Entering last iter with 4
162.032) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.333644 AP@3 for validation: 0.314124
Entering last iter with 5
219.16) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336898 AP@3 for validation: 0.319216
Entering last iter with 6
289.758) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340856 AP@3 for validation: 0.321952
Entering last iter with 7
360.691) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344672 AP@3 for validation: 0.322379
Entering last iter with 8
431.754) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.352534 AP@3 for validation: 0.325194
INFO: asynchronous_engine.hpp(run:119): Worker 5 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 2 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 1 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 6 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 7 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 4 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 3 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 0 finished.
Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 466.943851 seconds
INFO: io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.740713
INFO: read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out.test.predictions vector size: 19349608
INFO: read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m
Performance counters are: 0) EDGE_TRAVERSAL, 1421.59
Performance counters are: 2) CALC_RMSE_Q, 0.296477
Performance counters are: 6) CALC_OBJ, 1.17573
As you can see, each iteration of bias-SGD takes around 60 seconds on an 8-core machine, so a single good multicore machine is enough. The output is the file rec_log_train.sorted.txt.out.test.predictions.
Let's look at the output:
head rec_log_train.sorted.txt.out.test.predictions
%%MatrixMarket matrix array real general
%%output predictions for test data
19349608 1
-1
-0.7164881229401
0.07807982712984
-0.9745061397552
-0.7039368152618
-0.6436884999275
-1
-0.6042827963829
For each user/item pair in the testA.sorted data, we get a prediction.
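To inspect the predictions programmatically, you can skip the comment lines (those starting with %) and the size line. A minimal reader for this dense Matrix Market array format (my own sketch, not part of GraphLab; mmread.m does the same in Matlab/Octave):

```python
def read_mm_vector(path):
    """Read a Matrix Market dense array file into a list of floats."""
    with open(path) as f:
        lines = [l for l in f if not l.startswith('%')]
    n = int(lines[0].split()[0])  # size line: "<rows> <cols>"
    values = [float(l) for l in lines[1:]]
    assert len(values) == n
    return values
```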
Now we store the results for testA:
mv rec_log_train.sorted.txt.out.test.predictions testA.predictions
And again we run GraphLab, this time with testB as the test file:
./pmf rec_log_train.sorted.txt.out2 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --lambda=0.01 --sgd_gamma=0.005 --sgd_lambda=0.005 --sgd_step_dec=0.98 --exporttest=true --calc_ap=true
INFO: pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU
Send bug reports and comments to danny.bickson@gmail.com
WARNING: pmf.cpp(do_main:448): Program compiled with it++ Support
Setting run mode Bias-SGD
INFO: pmf.cpp(start:291): Bias-SGD starting
loading data file rec_log_train.sorted.txt.out2
Loading Matrix Market file rec_log_train.sorted.txt.out2 TRAINING
Loading rec_log_train.sorted.txt.out2 TRAINING
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556
loading data file rec_log_train.sorted.txt.out2e
Loading Matrix Market file rec_log_train.sorted.txt.out2e VALIDATION
Loading rec_log_train.sorted.txt.out2e VALIDATION
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863
loading data file rec_log_train.sorted.txt.out2t
Loading Matrix Market file rec_log_train.sorted.txt.out2t TEST
Loading rec_log_train.sorted.txt.out2t TEST
Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1
INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 15561329
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 2 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 3 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 4 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 5 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 6 started.
INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 7 started.
Bias-SGD for matrix (2421057, 2421057, 1):33589556. D=50
SVD++ 50 factors
complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000.
INFO: pmf.cpp(run_graphlab:238): starting with scheduler: round_robin
max iterations = 8
step = 1
max_iterations = 8
INFO: asynchronous_engine.hpp(run:111): Worker 1 started.
INFO: asynchronous_engine.hpp(run:111): Worker 0 started.
INFO: asynchronous_engine.hpp(run:111): Worker 2 started.
INFO: asynchronous_engine.hpp(run:111): Worker 3 started.
INFO: asynchronous_engine.hpp(run:111): Worker 4 started.
INFO: asynchronous_engine.hpp(run:111): Worker 5 started.
INFO: asynchronous_engine.hpp(run:111): Worker 6 started.
INFO: asynchronous_engine.hpp(run:111): Worker 7 started.
Entering last iter with 1
42.9776) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.327499 AP@3 for validation: 0.291468
Entering last iter with 2
113.61) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.331413 AP@3 for validation: 0.308851
Entering last iter with 3
184.123) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332523 AP@3 for validation: 0.312983
Entering last iter with 4
251.239) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.334125 AP@3 for validation: 0.314801
Entering last iter with 5
307.7) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336783 AP@3 for validation: 0.318807
Entering last iter with 6
377.682) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340606 AP@3 for validation: 0.320896
Entering last iter with 7
447.13) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344897 AP@3 for validation: 0.322647
Entering last iter with 8
517.358) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671.
INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.351736 AP@3 for validation: 0.324784
INFO: asynchronous_engine.hpp(run:119): Worker 2 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 0 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 5 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 6 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 1 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 4 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 7 finished.
INFO: asynchronous_engine.hpp(run:119): Worker 3 finished.
Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000.
Finished in 552.339237 seconds
INFO: io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.733542
INFO: read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out2.test.predictions vector size: 15561329
INFO: read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m
Performance counters are: 0) EDGE_TRAVERSAL, 2109.31
Performance counters are: 2) CALC_RMSE_Q, 0.264258
Performance counters are: 6) CALC_OBJ, 1.1756
Final parsing and compilation of the submission
Use GraphLab v2 to create the results in the KDD cup format. This code goes over the recommendations and builds a list of the top 3 recommended test items for each user.
The submission file contains exactly 1,340,127 users (713,611 from testA and 626,516 from testB).
There are four inputs to the output builder: the sorted test files testA.sorted and testB.sorted we created previously with kdd_test_splitter, and the matching prediction files testA.predictions and testB.predictions output by GraphLab.
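The core of the output builder can be sketched as follows (a simplified stand-in: pair each (user, item) from a sorted test file with its prediction, and keep each user's three highest-scoring items):

```python
from collections import defaultdict

def top3_per_user(pairs, predictions):
    """pairs: list of (user, item) in the same order as predictions.

    Returns {user: [top 3 item ids, by descending predicted score]}."""
    by_user = defaultdict(list)
    for (user, item), score in zip(pairs, predictions):
        by_user[user].append((score, item))
    return {u: [item for _, item in sorted(v, reverse=True)[:3]]
            for u, v in by_user.items()}
```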
<309|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kddcup_output_builder testA.sorted testA.predictions testB.sorted testB.predictions --ncpus=1
WARNING: kddcup_output_builder.cpp(main:165): Eigen detected. (This is actually good news!)
INFO: kddcup_output_builder.cpp(main:167): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
INFO: io.hpp(load_matrixmarket_graph:325): Reading matrix market file: testB.sorted
Rows: 2421057
Cols: 2421057
Nonzeros: 15561329
Constructing all vertices.
Adding edges.
Graph size: 15561329
INFO: io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testB.predictions
INFO: kddcup_output_builder.cpp(operator():131): Finished going over 15561329 rating for 626516 unique users
INFO: kddcup_output_builder.cpp(operator():132): Successfully created output file: testB.sorted.out
INFO: io.hpp(load_matrixmarket_graph:325): Reading matrix market file: testA.sorted
Rows: 2421057
Cols: 2421057
Nonzeros: 19349608
Constructing all vertices.
Adding edges.
Graph size: 19349608
INFO: io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testA.predictions
INFO: kddcup_output_builder.cpp(operator():131): Finished going over 19349608 rating for 713611 unique users
INFO: kddcup_output_builder.cpp(operator():132): Successfully created output file: testA.sorted.out
Finished in 82.1905
INFO: kddcup_output_builder.cpp(main:189): Merging the two files together
updating: testA.sorted.out (deflated 72%)
INFO: kddcup_output_builder.cpp(main:208): Successfully created zip file: submission.zip
INFO: kddcup_output_builder.cpp(main:214): removed temporary files
INFO: kddcup_output_builder.cpp(main:216): Created successfully zip file: submission.zip
Preliminary results
Here is a very preliminary report on the results I have been able to obtain so far:
ALS - 0.254 on the leaderboard
SVD++ - 0.29 on the leaderboard (thanks to my collaborators at the Hungarian Academy of Sciences, specifically Tamas Kiss)
bias-SGD - 0.330 on the leaderboard (after some optimization JustinYan got 0.34 with it)
logistic regression - 0.336
Disclaimer
As the code is evolving, there may be some bugs or inaccuracies. However, I did not want to wait until the last minute to release a perfect solution that no one would be able to use. Please ping me if you find any bugs or have any suggestions.