Friday, December 14, 2012

Collaborative filtering - 3rd generation - part 2

NOTE: This blog post is two years old. We have reimplemented this code as part of GraphLab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization are needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset. 
Anyone who wants to try it out should email me and I will send you the exact same code in Python.

**********************************************************************************
A couple of days ago I wrote about a new experimental piece of software I am writing, which I call 3rd generation collaborative filtering software. I got a lot of interesting feedback from my readers, which helped improve the software. Previously I examined its performance on the KDD Cup 2012 dataset. Now I have tried it on completely different datasets, and I am quite pleased with the results.

First dataset: Airline on-time performance


Below I will explain how to deploy it on a different problem domain: airline on-time performance. It is a completely different dataset from a different domain, but the gensgd software can still handle it without any modification. I hope these results, which show how flexible the software is, will encourage additional data scientists to try it out!

The airline on-time dataset has information about 10 years of flights in the US. The data for each year is a CSV file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

The fields are rather self-explanatory. Each line represents a single flight, with information about the date, carrier, airports, and so on; the interesting fields are the ones describing the varying flight durations.

And here are the first few lines:

2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

Note: you can get the dataset using the commands:
curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 -o 2008.csv.bz2
bunzip2 2008.csv.bz2


First task: can we predict the total time the flight was in the air?


Well, for a matrix factorization method it is not clear what the actual matrix is here. That is why flexible software is useful. In my experiments I chose "UniqueCarrier" and "FlightNum" as the two fields that form the matrix, since together they characterize each flight rather uniquely. Next we need to decide which field we want to predict; I chose ActualElapsedTime as the prediction target. Note that these fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case (a sketch of such a variation appears after the first run below).
(Additional information about the meaning of each field is found here).

First let's use traditional matrix factorization.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1  --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 0 : 

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum

   7.58561) Iteration:   0 Training RMSE:    67.1094
   11.7177) Iteration:   1 Training RMSE:    64.6665
   15.8441) Iteration:   2 Training RMSE:    63.2155
   19.9971) Iteration:   3 Training RMSE:    59.0044
   24.0989) Iteration:   4 Training RMSE:    53.9083
   28.1962) Iteration:   5 Training RMSE:    50.2416
...
   77.6041) Iteration:  17 Training RMSE:    35.6409
   81.7165) Iteration:  18 Training RMSE:     35.505
   85.8197) Iteration:  19 Training RMSE:    35.4046
   89.9266) Iteration:  20 Training RMSE:    35.3288


We got an RMSE of 35.3 minutes on the predicted flight time, taking into account only the carrier and flight number. That is rather bad: we are over half an hour off.
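Since the field positions are given on the command line, trying other matrix axes is just a matter of changing --from_pos/--to_pos. As a sketch only (this variation does not appear in the post, so no accuracy numbers are given for it), the same run pointed at Origin (column 16) and Dest (column 17) instead of carrier/flight number would look like:

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=16 --to_pos=17 --val_pos=11 --rehash=1 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1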

Next let's throw some temporal features into the computation: Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime. How do we do that? It is very easy! Just add the command-line flag --features=1,2,3,4,5,6,7, namely the positions of those features in the input file (Year is constant in this file, so it is left out). This is what we call temporal matrix factorization, or tensor factorization. To utilize it in one of the traditional methods, you would need to merge all the temporal fields into one integer which encodes the time, which is of course a tedious task.
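For traditional tools, that merge might look something like the following awk sketch (illustration only: the output file name and the exact time encoding are made up for this example, and gensgd itself never needs this step):

awk -F, 'NR > 1 {
  # build a crude single time index, in minutes, from Month, DayofMonth and DepTime (hhmm)
  t = $2 * 44640 + $3 * 1440 + int($5 / 100) * 60 + $5 % 100;
  # keep carrier, flight number, the merged time index and the ActualElapsedTime target
  print $9 "," $10 "," t "," $12
}' 2008.csv > 2008_time_merged.csv

With gensgd, by contrast, we just list the feature positions on the command line: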



bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --file_columns=28 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=100  --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --features=1,2,3,4,5,6,7 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 7 : 

INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum


   21.8356) Iteration:   0 Training RMSE:    50.3144
   36.6782) Iteration:   1 Training RMSE:    40.4813
    51.425) Iteration:   2 Training RMSE:    36.0579
   66.4348) Iteration:   3 Training RMSE:    33.4226
...
   272.188) Iteration:  17 Training RMSE:    20.0103
   286.887) Iteration:  18 Training RMSE:    19.7198
   301.602) Iteration:  19 Training RMSE:    19.4597
   316.305) Iteration:  20 Training RMSE:    19.2147


With temporal information we now get an RMSE of 19.2 minutes, which is again not that good.

Now let's utilize the full power of gensgd: when the going gets tough, throw in some more features! Without even understanding what the features mean, I have thrown in almost everything...

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=20 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 14 : 
INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:  12 : CRSElapsedTime
INFO:     gensgd.cpp(main:1211): Selected feature:  13 : AirTime
INFO:     gensgd.cpp(main:1211): Selected feature:  14 : ArrDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  15 : DepDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  16 : Origin
INFO:     gensgd.cpp(main:1211): Selected feature:  17 : Dest
INFO:     gensgd.cpp(main:1211): Selected feature:  18 : Distance
INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum
   36.2089) Iteration:   0 Training RMSE:    21.1476
   61.2802) Iteration:   1 Training RMSE:    10.1963
   86.3032) Iteration:   2 Training RMSE:    8.64215
   111.236) Iteration:   3 Training RMSE:    7.76054
   136.246) Iteration:   4 Training RMSE:    7.14308
   161.221) Iteration:   5 Training RMSE:     6.6629
...
   461.528) Iteration:  17 Training RMSE:    4.26991
    486.61) Iteration:  18 Training RMSE:    4.17239
   511.737) Iteration:  19 Training RMSE:    4.08084
   536.775) Iteration:  20 Training RMSE:    3.99414

Now we are down to about 4 minutes of average error. But we can continue the computation (run more iterations) and get even below 2 minutes of error. Isn't that neat? The average flight time in 2008 is 127 minutes, so a 2-minute prediction error is not that bad.
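Continuing the computation is just a matter of rerunning the same command with a larger --max_iter; the post does not say exactly how many iterations are needed to get below 2 minutes, so the 100 below is only a placeholder:

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=100 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1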

Conclusion: traditional matrix/tensor factorization has some severe limitations when dealing with complex real-world data. Additional features and techniques are needed to improve accuracy!

Second task: let's predict TaxiIn (the time the plane spends on the ground after landing).

This task is slightly more difficult since, as you may imagine, there is much larger relative variation in taxi-in time than in flight time. But is predicting it any harder to set up? No: we simply change to --val_pos=19, namely pointing the target to the TaxiIn field.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=19 --rehash=1 --file_columns=28 --gensgd_rate3=1e-3 --gensgd_mult_dec=0.9999 --max_iter=20 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --features=1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[quiet] => [1]
INFO:     gensgd.cpp(main:1155): Total selected features: 16 : 
INFO:     gensgd.cpp(main:1158): Selected feature: 1
INFO:     gensgd.cpp(main:1158): Selected feature: 2
INFO:     gensgd.cpp(main:1158): Selected feature: 3
INFO:     gensgd.cpp(main:1158): Selected feature: 4
INFO:     gensgd.cpp(main:1158): Selected feature: 5
INFO:     gensgd.cpp(main:1158): Selected feature: 6
INFO:     gensgd.cpp(main:1158): Selected feature: 7
INFO:     gensgd.cpp(main:1158): Selected feature: 10
INFO:     gensgd.cpp(main:1158): Selected feature: 11
INFO:     gensgd.cpp(main:1158): Selected feature: 12
INFO:     gensgd.cpp(main:1158): Selected feature: 13
INFO:     gensgd.cpp(main:1158): Selected feature: 14
INFO:     gensgd.cpp(main:1158): Selected feature: 15
INFO:     gensgd.cpp(main:1158): Selected feature: 16
INFO:     gensgd.cpp(main:1158): Selected feature: 17
INFO:     gensgd.cpp(main:1158): Selected feature: 18
   1.56777) Iteration:   0 Training RMSE:    3.89207
   3.01777) Iteration:   1 Training RMSE:    3.64978
    4.5159) Iteration:   2 Training RMSE:    3.46472
    5.8659) Iteration:   3 Training RMSE:    3.30712
   7.26778) Iteration:   4 Training RMSE:    3.17225
    8.7159) Iteration:   5 Training RMSE:    3.06696
...
   23.6072) Iteration:  16 Training RMSE:    2.60147
   24.9789) Iteration:  17 Training RMSE:    2.57697
   26.3267) Iteration:  18 Training RMSE:    2.55768
   27.6967) Iteration:  19 Training RMSE:    2.54186
   29.0773) Iteration:  20 Training RMSE:    2.53113
We again get to an average RMSE of about 2.5 minutes, which, given how short taxi-in times are, means that this task is actually more difficult than predicting air time.


Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 from here.
2) Unpack the downloaded file using:
bunzip2 2008.csv.bz2
3) Create a matrix market format file named 2008.csv:info with the following two lines (see the snippet after this list):
%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Run the commands as instructed above.
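If you prefer to script step 3, a minimal shell snippet (using exactly the two header lines shown in step 3) is:

cat > 2008.csv:info <<'EOF'
%%MatrixMarket matrix coordinate real general
20 7130 1000000
EOF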


Second dataset: Hearst machine learning challenge

A while ago Hearst provided data about email campaigns, and the task was to predict the user reaction to each email (clicked / not clicked). The data has several million records about emails sent, with around 273 user features for each email. Here are some of the available fields:
CLICK_FLG,OPEN_FLG,ADDR_VER_CD,AQI,ASIAN_CD,AUTO_IN_MARKET,BIRD_QTY,BUYER_DM_BOOKS,BUYER_DM_COLLECT_SPC_FOOD,BUYER_DM_CRAFTS_HOBBI,BUYER_DM_FEMALE_ORIEN,BUYER_DM_GARDEN_FARM,BUYER_DM_GENERAL,BUYER_DM_GIFT_GADGET,BUYER_DM_MALE_ORIEN,BUYER_DM_UPSCALE,BUYER_MAG_CULINARY_INTERS,BUYER_MAG_FAMILY_GENERAL,BUYER_MAG_FEMALE_ORIENTED,BUYER_MAG_GARDEN_FARMING,BUYER_MAG_HEALTH_FITNESS,BUYER_MAG_MALE_SPORT_ORIENTED,BUYER_MAG_RELIGIOUS,CATS_QTY,CEN_2000_MATCH_LEVEL,CLUB_MEMBER_CD,COUNTRY_OF_ORIGIN,DECEASED_INDICATOR,DM_RESPONDER_HH,DM_RESPONDER_INDIV,DMR_CONTRIB_CAT_GENERAL,DMR_CONTRIB_CAT_HEALTH_INST,DMR_CONTRIB_CAT_POLITICAL,DMR_CONTRIB_CAT_RELIGIOUS,DMR_DO_IT_YOURSELFERS,DMR_MISCELLANEOUS,DMR_NEWS_FINANCIAL,DMR_ODD_ENDS,DMR_PHOTOGRAPHY,DMR_SWEEPSTAKES,DOG_QTY,DWELLING_TYPE,DWELLING_UNIT_SIZE,EST_LOAN_VALUE_RATIO,ETECH_GROUP,ETHNIC_GROUP_CODE,ETHNIC_INSIGHT_MTCH_FLG,ETHNICITY_DETAIL,EXPERIAN_INCOME_CD,EXPERIAN_INCOME_CD_V4,GNDR_OF_CHLDRN_0_3,GNDR_OF_CHLDRN_10_12,GNDR_OF_CHLDRN_13_18,GNDR_OF_CHLDRN_4_6,GNDR_OF_CHLDRN_7_9,HH_INCOME,HHLD_DM_PURC_CD,HOME_BUSINESS_IND,I1_BUSINESS_OWNER_FLG,I1_EXACT_AGE,I1_GNDR_CODE,I1_INDIV_HHLD_STATUS_CODE,INDIV_EDUCATION,INDIV_EDUCATION_CONF_LVL,INDIV_MARITAL_STATUS,INDIV_MARITAL_STATUS_CONF_LVL,INS_MATCH_TYPE,LANGUAGE,LENGTH_OF_RESIDENCE,MEDIAN_HOUSING_VALUE,MEDIAN_LEN_OF_RESIDENCE,MM_INCOME_CD,MOSAIC_HH,MULTI_BUYER_INDIV,NEW_CAR_MODEL,NUM_OF_ADULTS_IN_HHLD,NUMBER_OF_CHLDRN_18_OR_LESS,OCCUP_DETAIL,OCCUP_MIX_PCT,PCT_CHLDRN,PCT_DEROG_TRADES,PCT_HOUSEHOLDS_BLACK,PCT_OWNER_OCCUPIED,PCT_RENTER_OCCUPIED,PCT_TRADES_NOT_DEROG,PCT_WHITE,PHONE_TYPE_CD,PRES_OF_CHLDRN_0_3,PRES_OF_CHLDRN_10_12,PRES_OF_CHLDRN_13_18,PRES_OF_CHLDRN_4_6,PRES_OF_CHLDRN_7_9,PRESENCE_OF_CHLDRN,PRIM_FEM_EDUC_CD,PRIM_FEM_OCC_CD,PRIM_MALE_EDUC_CD,PRIM_MALE_OCC_CD,RECIPIENT_RELIABILITY_CD,RELIGION,SCS_MATCH_TYPE,TRW_INCOME_CD,TRW_INCOME_CD_V4,USED_CAR_CD,Y_OWNS_HOME,Y_PROBABLE_HOMEOWNER,Y_PROBABLE_RENTER,Y_RENTER,YRS_SCHOOLING_CD,Z_CREDIT_CARD

The meaning and coding of the fields are described in detail here. You will need to register on the website to get access to the data.

And this is the first entry:
N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,,,,F,F,,,,,,,U,Y,,,,,,,17,69,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NORTH LAUDERDALE,330685141,FL,190815,,,,,,1036,Third Party - Merch,"Mon, 09/20/10 01:04 PM"


For this demo I used the file Modeling_1.csv, which is the first of 5 files and has 400K entries.

We would like to predict the zeroth column (the click flag). I have taken columns 9 and 10 as the matrix from/to entries. The rest of the columns, up to column 40, are features. (While there are more features available, the solution is already so accurate that the first 40 are enough.)

After about an hour of playing I got to the following formulation:

./toolkits/collaborative_filtering/gensgd --training=Modeling_1.csv --val_pos=0 --from_pos=9 --to_pos=10 --features=3,4,5,6,7,8,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 --has_header_titles=1 --rehash=1 --file_columns=200 --rehash_value=1 --calc_error=1 --cutoff=0.5
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1255): Total selected features: 36 : 
INFO:     gensgd.cpp(main:1258): Selected feature:   3 : AQI
INFO:     gensgd.cpp(main:1258): Selected feature:   4 : ASIAN_CD
INFO:     gensgd.cpp(main:1258): Selected feature:   5 : AUTO_IN_MARKET
INFO:     gensgd.cpp(main:1258): Selected feature:   6 : BIRD_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:   7 : BUYER_DM_BOOKS
INFO:     gensgd.cpp(main:1258): Selected feature:   8 : BUYER_DM_COLLECT_SPC_FOOD
INFO:     gensgd.cpp(main:1258): Selected feature:  11 : BUYER_DM_GARDEN_FARM
INFO:     gensgd.cpp(main:1258): Selected feature:  12 : BUYER_DM_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  13 : BUYER_DM_GIFT_GADGET
INFO:     gensgd.cpp(main:1258): Selected feature:  14 : BUYER_DM_MALE_ORIEN
INFO:     gensgd.cpp(main:1258): Selected feature:  15 : BUYER_DM_UPSCALE
INFO:     gensgd.cpp(main:1258): Selected feature:  16 : BUYER_MAG_CULINARY_INTERS
INFO:     gensgd.cpp(main:1258): Selected feature:  17 : BUYER_MAG_FAMILY_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  18 : BUYER_MAG_FEMALE_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  19 : BUYER_MAG_GARDEN_FARMING
INFO:     gensgd.cpp(main:1258): Selected feature:  20 : BUYER_MAG_HEALTH_FITNESS
INFO:     gensgd.cpp(main:1258): Selected feature:  21 : BUYER_MAG_MALE_SPORT_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  22 : BUYER_MAG_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  23 : CATS_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:  24 : CEN_2000_MATCH_LEVEL
INFO:     gensgd.cpp(main:1258): Selected feature:  25 : CLUB_MEMBER_CD
INFO:     gensgd.cpp(main:1258): Selected feature:  26 : COUNTRY_OF_ORIGIN
INFO:     gensgd.cpp(main:1258): Selected feature:  27 : DECEASED_INDICATOR
INFO:     gensgd.cpp(main:1258): Selected feature:  28 : DM_RESPONDER_HH
INFO:     gensgd.cpp(main:1258): Selected feature:  29 : DM_RESPONDER_INDIV
INFO:     gensgd.cpp(main:1258): Selected feature:  30 : DMR_CONTRIB_CAT_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  31 : DMR_CONTRIB_CAT_HEALTH_INST
INFO:     gensgd.cpp(main:1258): Selected feature:  32 : DMR_CONTRIB_CAT_POLITICAL
INFO:     gensgd.cpp(main:1258): Selected feature:  33 : DMR_CONTRIB_CAT_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  34 : DMR_DO_IT_YOURSELFERS
INFO:     gensgd.cpp(main:1258): Selected feature:  35 : DMR_MISCELLANEOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  36 : DMR_NEWS_FINANCIAL
INFO:     gensgd.cpp(main:1258): Selected feature:  37 : DMR_ODD_ENDS
INFO:     gensgd.cpp(main:1258): Selected feature:  38 : DMR_PHOTOGRAPHY
INFO:     gensgd.cpp(main:1258): Selected feature:  39 : DMR_SWEEPSTAKES
INFO:     gensgd.cpp(main:1258): Selected feature:  40 : DOG_QTY
INFO:     gensgd.cpp(main:1259): Target variable   0 : CLICK_FLG
INFO:     gensgd.cpp(main:1260): From              9 : BUYER_DM_CRAFTS_HOBBI
INFO:     gensgd.cpp(main:1261): To               10 : BUYER_DM_FEMALE_ORIEN
   54.8829) Iteration:   0 Training RMSE: 0.00927502  Train err:      8e-05
   99.4742) Iteration:   1 Training RMSE: 0.00120904  Train err:          0
   143.852) Iteration:   2 Training RMSE: 0.000793143 Train err:          0
   188.523) Iteration:   3 Training RMSE: 0.000604034 Train err:          0
   233.188) Iteration:   4 Training RMSE: 0.000500067 Train err:          0


We got a very good classifier: starting from the second iteration there are no classification errors on the training data.

Some explanation of additional runtime flags not used in the previous examples:
1) --rehash_value=1 - since the target value is not numeric, I used rehash_value to translate Y/N into two numeric integer bins.
2) --cutoff=0.5 - after hashing the target Y/N we get two integers, 0 and 1, so I use 0.5 as the prediction threshold to decide between Y and N.
3) --file_columns=200 - I am looking only at the first 40 columns, so there is no need to parse all 273 columns. (You can play with this parameter at run time.)
4) --has_header_titles=1 - the first line of the input file contains the column titles.

Instructions
1) Register on the Hearst website.
2) Download the first data file, Modeling_1.csv, and put it in the main graphchi folder.
3) Create a file named Modeling_1.csv:info and put the following two lines in it (see the snippet after this list):
%%MatrixMarket matrix coordinate real general
11 13 400000
4) Run as instructed.
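As with the airline dataset, step 3 can be scripted with a small shell snippet (again using exactly the two header lines shown above):

cat > Modeling_1.csv:info <<'EOF'
%%MatrixMarket matrix coordinate real general
11 13 400000
EOF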

25 comments:

  1. Hi,

    I've installed graph-chi on my macbook, and ran a few of the demo scripts without error. However, it appears I cannot load data from a .csv file. When I try to run "traditional matrix factorization" I get the following error: "FATAL: gensgd.cpp(convert_matrixmarket_N:582): Bug: can not add edge from 0 to J 0 since max is: 0x0"

    It appears that the conversion from .csv to matrix market is failing. What could be causing this?

    Thanks,

    Zach

    Replies
    1. I found the problem: the file should be named "2008.csv:info" not "csv.2008:info"

    2. Thanks for the update! I have fixed the documentation.

  2. Hello,

    I've installed graphlab on a VM with Ubuntu and ran the demo scripts from this page and got some errors:

    dataset 2008.CSV
    - traditional matrix factorization : OK
    - temporal matrix factorization :
    [Other]
    app: sharder
    gensgd: malloc.c:2451: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
    Aborted (core dumped)
    - More features : OK
    - TaxiIn : OK

    dataset Modeling_1.csv
    INFO: gensgd.cpp(convert_matrixmarket_N:559): Starting to read matrix-market input. Matrix dimensions: 11 x 13, non-zeros: 400000
    FATAL: gensgd.cpp(read_line:333): Error reading line 0 feature 115 [ N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,h ]
    terminate called after throwing an instance of 'char const*'
    Aborted (core dumped)

    The first error is strange because it works with more features, and I couldn't find what's wrong in the second file that causes the reading error (I tried to change --file_columns but it still doesn't work).

    Thanks.

    Replies
    1. Hi,
      Sorry about that. Please pull again from mercurial using "hg pull; hg update" and recompile using "make clean; make cf". A contributed Mac OS patch that was supposed to fix the missing getline() function made a mess in the Linux version.
      Let me know if it now works.

    2. Thanks Danny.

      It works perfectly now.

  3. Hello Danny,

    As I said in another post, I'm working on a one-class problem, and I tried your new software on my database; I have a few questions.

    - in your first example you set "--minval=-1 --maxval=1 --calc_error=1" but no cutoff; does it automatically set the cutoff value to 0?

    - in the sparse example you don't set --minval and --maxval but --cutoff=0.5; is there a specific reason you write the command this way in this case?

    - when you set --minval and --maxval, what kind of loss function is used?

    - you use the --validation option in the sparse example, but when I try to use it with gensgd it doesn't work; is that normal?

    - do you plan to implement the --test option?

    - as I'm dealing with a one-class problem I tried the implicit rating option and it worked, but I'm curious what is done when the features option is used: what values are put in the features associated with these additional ratings?

    Thanks.

    Regards.

    Replies
    1. Hi Alex,
      1) You are right. The default cutoff is 0.
      2) --minval and --maxval are optional arguments; they slightly improve performance in some cases, but when the result can be anywhere in the range there is no need to truncate.
      3) --minval and --maxval are independent of the loss function used, you can use them with any loss function.
      4) Please send our user mailing list (graphlab-kdd) the exact command you used and the error you got using the --validation - it should work. (Even better if you have some small dataset to show the error).
      5) The --test option should work - send me a scenario where you get an error and I will debug it.
      6) Added implicit ratings do not carry feature information, and thus I suggest not applying that option here.

      Best,

  4. I am running this algorithm and it produced a *_U.mm file after execution. To get the recommendations, I tried to run the rating command and got the following exception:

    $GRAPHCHI_ROOT/toolkits/collaborative_filtering/rating --training=*_U.mm --num_ratings=5 --quiet=1 --algorithm=sgd
    WARNING: common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
    [training] => [*_U.mm]
    [num_ratings] => [5]
    [quiet] => [1]
    [algorithm] => [sgd]
    FATAL: io.hpp(read_matrix_market_banner_and_size:61): Sorry, this application does not support complex values and requires a sparse matrix.
    terminate called after throwing an instance of 'char const*'
    Aborted

    Please help me figure out where I am going wrong.

  5. 1) which algorithm are you running - sgd?
    2) you need to give the same string as given to the sgd utility using the --training=XXXX command.

    Replies
    1. When I give gensgd as the algorithm, it gives the following exception:

      FATAL: rating.cpp(main:296): --algorithms should be one of: als, sparse_als, sgd, nmf, wals

  6. Gensgd is not supported by the rating command.
    The only option you have is to give a file with --test=FILENAME,
    and then you will get predictions for each line of features in the test data.
    (The test data should have the same format as the training data.)

    Replies
    1. Hi Danny,
      I have given the test file in the following format:
      userid productid
      The test file contains all the user ids and product ids of the training file, but I have not found any predictions in the test data. Please help me figure out whether this is the proper way of running it.

  7. The test file should be in the exact same format as the training file.
    So if you have a CSV for training, you should have a CSV with the same format for test.

    Replies
    1. Both files are in the same format; all the fields are separated by spaces. But no predictions are captured in the test file. I was able to run the gensgd command successfully.

    2. Send me an example input file and I will take a look - most likely one of the command-line arguments is wrong.

    3. Hi Danny,
      I am running it using the following command

      $GRAPHCHI_ROOT/toolkits/collaborative_filtering/gensgd --training=userproductmatrix --test=userproducttestfile --from_pos=0 --to_pos=1 --val_pos=3 --rehash=1 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=4 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --features=2

    4. The userproductmatrix you sent me has only 3 columns. In that case there is no point in using gensgd - you should use sgd. (Unless you have more columns in your version)

  8. Hello, many thanks for this post. I was able to run all of the different samples, but I get an RMSE far higher than expected even after many iterations.

    For the example which should lead to a 2-minute RMSE, I get an RMSE of 32 minutes after 19 iterations.

    I run Ubuntu; could it be a library or setup issue?

    Thanks

    Replies
    1. Hi Xavier,
      We have re-implemented this code as part of GraphLab Create. You are highly encouraged to try it out - it is free and it gets much better results. Send me an email and I will send you the IPython notebook to reproduce the exact same experiment in GLC.

    2. Hi Danny,
      Thanks for your feedback. GraphLab Create seems great but risky to me: I went into the terms & conditions and read "We grant you a limited, revocable license". I am currently testing different solutions, and it seems hard to know what the future of such an option is given those terms.

    3. Our project has open source foundations, and you can always stick to the open source if you like. GraphLab Create, while not open source, is still free for the foreseeable future. Fine-tuning the open source directly is more difficult. I am traveling now; I will be happy to take a look at the example in a few days. If you don't mind, please post a question at our user forum: http://forum.graphlab.com so I can keep track of the issue and not forget.

    4. p.s.
      I will be happy to set up a phone call to discuss your problem and give some advice regarding GraphLab Create evaluation.

    5. Hi Danny, I looked into gensgd.cpp to track down the RMSE difference. It turned out that step 3 gets gensgd_rate multiplied twice instead of once per step. Now it works. This seems to date from 2 commits made on Oct 4 and 10, 2013. I made a pull request. Regards, Xavier

    6. Great find! I just merged your pull request. Much appreciated!
