This is a great opportunity to get large scale data, for testing matrix factorization algorithms. (Specifically I am using GraphLab Matrix factorization ).
In track1, the task is to predict user music recommendations. There are about 1M users, 600K music titles, and about 252M non-zero ratings in the training data. (To compare to Netflix, there where only 50K users, 500K movies and about 100M non-zero ratings). Unlike Netflix, the ratings are between 0-100 (and not 1-5). In matrix factorization context, need to think how to represent a rating of zero vs. non existing rating.
For a start, I have tested GraphLab matrix factorization data with the track1 dataset. On an 8 core machine (2x4 cores Intel Xeon 2.67 Ghz) each alternating least squares iteration takes around 150 seconds. After 10 iterations I get training RMSE of 21.7 and validation RMSE of 22.5.
For reader's requests, I have posted some more detailed instructions on how to run GraphLab on KDD Cup data here
Here is the output of the GraphLab program:
<40|0>bickson@bigbro6:~/newgraphlab/graphlabapi/release/demoapps/pmf$ ./pmf kddcup 0 --ncpus=8 --float=true --zero=true --scheduler="round_robin(max_iterations=10)" --lambda=2 --D=25 Setting run mode ALS_MATRIX INFO :pmf.cpp(main:1236): PMF starting loading data file kddcup Loading kddcup TRAINING Matrix size is: 1000990 624961 1 Creating 252800275 edges... .................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................loading data file kddcupe Loading kddcupe VALIDATION Matrix size is: 1000990 624961 1 Creating 4003960 edges... .....................loading data file kddcupt Loading kddcupt TEST Matrix size is: 1000990 624961 6649 Creating 6005940 edges... ...............................setting regularization weight to 2 PTF_ALS for matrix (1000990, 624961, 6649):252800275. D=25 pU=2, pV=2, pT=1, muT=1, D=25 nuAlpha=1, Walpha=1, mu=0, muT=1, nu=25, beta=1, W=1, WT=1 BURN_IN=10 complete. Obj=4.85306e+11, TRAIN RMSE=61.9632 TEST RMSE=75.7155. INFO :asynchronous_engine.hpp(run:79): Worker 0 started. INFO :asynchronous_engine.hpp(run:79): Worker 1 started. INFO :asynchronous_engine.hpp(run:79): Worker 2 started. INFO :asynchronous_engine.hpp(run:79): Worker 3 started. INFO :asynchronous_engine.hpp(run:79): Worker 4 started. INFO :asynchronous_engine.hpp(run:79): Worker 5 started. INFO :asynchronous_engine.hpp(run:79): Worker 6 started. INFO :asynchronous_engine.hpp(run:79): Worker 7 started. Entering last iter with 1 161.488) Iter ALS 1 Obj=4.63853e+11, TRAIN RMSE=60.5707 TEST RMSE=26.6540. Entering last iter with 2 301.86) Iter ALS 2 Obj=9.7782e+10, TRAIN RMSE=27.7971 TEST RMSE=26.0930. Entering last iter with 3 445.925) Iter ALS 3 Obj=8.91077e+10, TRAIN RMSE=26.5305 TEST RMSE=24.1940. Entering last iter with 4 585.759) Iter ALS 4 Obj=7.09672e+10, TRAIN RMSE=23.6696 TEST RMSE=23.0499. Entering last iter with 5 727.565) Iter ALS 5 Obj=6.42765e+10, TRAIN RMSE=22.5229 TEST RMSE=22.6883. Entering last iter with 6 870.943) Iter ALS 6 Obj=6.17103e+10, TRAIN RMSE=22.0672 TEST RMSE=22.5748. Entering last iter with 7 1011.14) Iter ALS 7 Obj=6.06398e+10, TRAIN RMSE=21.8743 TEST RMSE=22.5319. Entering last iter with 8 1153.29) Iter ALS 8 Obj=6.01234e+10, TRAIN RMSE=21.7807 TEST RMSE=22.5117. Entering last iter with 9 1292.27) Iter ALS 9 Obj=5.98422e+10, TRAIN RMSE=21.7295 TEST RMSE=22.5004. INFO :asynchronous_engine.hpp(run:89): Worker 5 finished. Entering last iter with 10 INFO :asynchronous_engine.hpp(run:89): Worker 0 finished. INFO :asynchronous_engine.hpp(run:89): Worker 1 finished. INFO :asynchronous_engine.hpp(run:89): Worker 4 finished. INFO :asynchronous_engine.hpp(run:89): Worker 7 finished. INFO :asynchronous_engine.hpp(run:89): Worker 2 finished. INFO :asynchronous_engine.hpp(run:89): Worker 6 finished. 1430.98) Iter ALS 10 Obj=5.96712e+10, TRAIN RMSE=21.6983 TEST RMSE=22.4935. INFO :asynchronous_engine.hpp(run:89): Worker 3 finished. Final result. Obj=5.96743e+10, TRAIN RMSE= 21.6989 TEST RMSE= 22.4935. Exporting KDD cup test graph: kddcupt.kdd.out **Completed successfully (mean prediction: 64.919045)** Finished in 1437.142338 Counters are: 0) EDGE_TRAVERSAL, 6038.63 Counters are: 1) BPTF_SAMPLE_STEP, 0 Counters are: 2) CALC_RMSE_Q, 0.044609 Counters are: 3) ALS_LEAST_SQUARES, 3258.77 Counters are: 4) NA, 0 Counters are: 5) BPTF_TIME_EDGES, 0 Counters are: 6) BPTF_LEAST_SQUARES, 0 Counters are: 7) CALC_OBJ, 3.51506 Counters are: 8) NA, 0 Counters are: 9) BPTF_MVN_RNDEX, 0 Counters are: 10) BPTF_LEAST_SQUARES2, 0 === REPORT FOR core === [Numeric] ncpus: 8 [Other] affinities: false compile_flags: -O3 -Wall -g -mfpmath=sse -msse2 -funroll-loops -fprefetch-loop-arrays -fopenmp engine: async scheduler: round_robin schedyield: true scope: edge === REPORT FOR engine === [Numeric] num_edges: 2.528e+08 num_syncs: 0 num_vertices: 1.62595e+06 updatecount: 1.62595e+07 (count: 8, min: 1.97158e+06, max: 2.11913e+06, avg: 2.03244e+06) [Timings] runtime: 1412.8 s [Other] termination_reason: task depletion (natural) [Numeric] updatecount_vector: 1.62595e+07 (count: 8, min: 1.97158e+06, max: 2.11913e+06, avg: 2.03244e+06) updatecount_vector.values: 1.99981e+06,2.01133e+06,1.97158e+06,2.08794e+06,2.00821e+06,2.11913e+06,2.02736e+06,2.03415e+06,
I wrote a simple Matlab script to parse the text input files into sparse Matlab matrix. Here it is:
%Written by Danny Bickson, CMU, March 211 %This script reads KDD cup track1 format and saves %it into a sparse Matlab matrix nUsers=1000990; nItems=624961; nRatings=262810175; nTrainRatings=252800275; nProbeRatings=4003960; nTestRatings=6005940; runmode=1; % 1 for training, 2 for probe, 3 for test switch runmode case 1 filename='/path/to/trainIdx1.txt'; ratings=nTrainRatings; outfile='/path/to/train1.mat'; case 2 filename='/path/to/validationIdx1.txt'; ratings=nProbeRatings; outfile='/path/to/probe1.mat'; case 3 filename='/path/to/testIdx1.txt'; ratings=nTestRatings; outfile='/path/to/test1.mat'; end rows=zeros(ratings,1); cols=rows; vals=rows; try ff=fopen(filename,'r'); cnt=1; for j=1:nUsers % read user id and number of ratings [a,num]=fscanf(ff,'%d|%d',2); assert(num==2); user=a(1); numofratings=a(2); if (mod(j,1000)==0) disp(['user: ', num2str(user),' ratings: ', num2str(numofratings)]); end if (runmode==3) assert(numofratings==6); % 6 ratings per each user in test file else if (runmode == 2) assert(numofratings==4); % 4 ratings per user in probe file end for i=1:numofratings % for each ratings b=-100; if (runmode<=2) [b,num]=fscanf(ff,'%d %d %d %d:%d:%d',6); assert(num==6); else [b,num]=fscanf(ff,'%d %d %d:%d:%d',5); assert(num==5); end rows(cnt)=user+1; % user assert(rows(cnt)>=1 && rows(cnt)<=nUsers); cols(cnt)=b(1)+1; % item assert(cols(cnt)>=1 && cols(cnt)<=nItems); if (runmode<=2) vals(cnt)=b(2); % rating assert(vals(cnt)>=0 && vals(cnt)<=100); else vals(cnt)=1; % placeholder for missing rating end cnt=cnt+1; end end assert(cnt==ratings+1); catch ex disp(['exception on ', num2str(i), ' ', num2str(j)]); ex end % construct and save a sparse Matlab matrix A=sparse(double(rows), double(cols), double(vals)); save(outfile,'-v7.3','A');
I wanted to comment in your kdd-cup group post but gave me an error :(
ReplyDelete- thanks for the script ;) I'm gonna try it as soon as download the data at home, can I ask u if you're using matlab only to do the whole game or are u using also mexfiles with c++ , openMP ... ?
Hi!
ReplyDeleteActually I am using Graphlab, and I am getting an RMSE of 22.3 for the validation data using alternating least squares. Write me if you like to get directions on how to setup GraphLab to run on KDDCUP.
Best,
DB
Hi,
ReplyDeleteDid you submit your prediction result for the Test data of KDDCup 2011? Is the Test result much different from Validation result?
Bests,
Thanh
Yes, the test is about +2 points higher in RMSE.
ReplyDeleteFor alternating least squares I got 23.97. Now I have implemented SVD++ and I will try it soon. I heard from people that SVD++ got RMSE 23.5.
connat 23.9727
ReplyDeleteIdealists 23.9768
TEST 23.9768
There are 3 contestants with this score. Which one is you? Or is it BiXolVer? I'm also curious about SVD++ score.
Hi,
ReplyDeleteYes, it is TEST.
I tried SVD++, with number of features = 20, without any tuning for different params, for 15 iterations, on 8 cores machine and got 24.6
but I am pretty sure this can be improved when tuning learning rate, step size, etc.
One of the groups combined my ALS score with their SVM and get -0.2 RMSE (total 23.2)