NOTE: This blog post is two years old. We have reimplemented this code as part of GraphLab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning, such as step sizes and regularization, is needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset.
Anyone who wants to try it out should email me, and I will send you the exact same code in Python.
**********************************************************************************
A couple of days ago I wrote about a new experimental software I am writing, which I call a 3rd generation collaborative filtering software. I got a lot of interesting feedback from my readers, which helps improve the software. Previously I examined its performance on the KDD CUP 2012 dataset. Now I have tried it on completely different datasets and I am quite pleased with the results.
First dataset: Airline on time
Below I will explain how to deploy it on a different problem domain: airline on time performance. It is a completely different dataset from a different domain, but the gensgd software can still deal with it without any modification. I hope that these results, which show how flexible the software is, will encourage additional data scientists to try it out!
The airline on time dataset has information about 10 years of flights in the US. The data of each year is a csv file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
The fields are rather self-explanatory. Each line represents a single flight, with information about the date, carrier, airports etc. The interesting fields are those holding the varying information about flight duration.
And here are the first few lines:
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
Note: you can get the dataset using the commands:
curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 -o 2008.csv.bz2
bunzip2 2008.csv.bz2
First task: can we predict the total time the flight was in the air?
Well, for a matrix factorization method, it is not clear what the actual matrix is here. That is why it is useful to have flexible software. In my experiments I have chosen "UniqueCarrier" and "FlightNum" as the two fields which form the matrix, because they characterize each flight rather uniquely. Next we need to decide which field we want to predict. I have chosen ActualElapsedTime as the prediction target. Note that those fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case.
(Additional information about the meaning of each field is found here).
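The positions passed to --from_pos, --to_pos and --val_pos are zero-based column positions in the header line. Here is a minimal Python sketch for looking them up; the helper name column_positions is hypothetical and not part of GraphChi.

```python
# Header line of the airline on time CSV, as shown above.
HEADER = (
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
    "ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,"
    "CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,"
    "SecurityDelay,LateAircraftDelay"
)


def column_positions(header_line):
    """Map each column title to its zero-based position in the file."""
    names = header_line.strip().split(",")
    return {name: pos for pos, name in enumerate(names)}


positions = column_positions(HEADER)
for name in ("UniqueCarrier", "FlightNum", "ActualElapsedTime"):
    print(f"{name}: {positions[name]}")
```

The printed positions (8, 9 and 11) are the ones used for the from, to and target fields in the commands in this post.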
First let's use traditional matrix factorization.
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1155): Total selected features: 0 :
INFO: gensgd.cpp(main:1212): Target variable 11 : ActualElapsedTime
INFO: gensgd.cpp(main:1213): From 8 : UniqueCarrier
INFO: gensgd.cpp(main:1214): To 9 : FlightNum
7.58561) Iteration: 0 Training RMSE: 67.1094
11.7177) Iteration: 1 Training RMSE: 64.6665
15.8441) Iteration: 2 Training RMSE: 63.2155
19.9971) Iteration: 3 Training RMSE: 59.0044
24.0989) Iteration: 4 Training RMSE: 53.9083
28.1962) Iteration: 5 Training RMSE: 50.2416
...
77.6041) Iteration: 17 Training RMSE: 35.6409
81.7165) Iteration: 18 Training RMSE: 35.505
85.8197) Iteration: 19 Training RMSE: 35.4046
89.9266) Iteration: 20 Training RMSE: 35.3288
We got an RMSE of 35.3 minutes on the predicted flight time, taking into account only the carrier and flight number. That is rather bad: we are half an hour off track.
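For reference, RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below is an illustration only and is not gensgd output: it uses the scheduled time (CRSElapsedTime) of the three sample rows as a naive prediction of ActualElapsedTime, and the function name rmse is my own.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Values taken from the three sample rows shown earlier in this post.
actual_elapsed = [128.0, 128.0, 96.0]     # ActualElapsedTime
scheduled_elapsed = [150.0, 145.0, 90.0]  # CRSElapsedTime, used as a naive guess

print(rmse(scheduled_elapsed, actual_elapsed))
```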
Next let's throw some temporal features into the computation:
Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime. How do we do that? It is very easy! Just add the command line flag
--features=0,1,2,3,4,5,6,7, namely the positions of the features in the input file. (The run below passes positions 1 to 7 only, presumably because Year is constant within a single-year file.) This is what we call
temporal matrix factorization or tensor factorization. To do the same with one of the traditional methods, you would need to merge all 8 fields into one integer which encodes the time, which is of course a tedious task.
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --file_columns=28 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=100 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --features=1,2,3,4,5,6,7 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1155): Total selected features: 7 :
INFO: gensgd.cpp(main:1211): Selected feature: 1 : Month
INFO: gensgd.cpp(main:1211): Selected feature: 2 : DayofMonth
INFO: gensgd.cpp(main:1211): Selected feature: 3 : DayOfWeek
INFO: gensgd.cpp(main:1211): Selected feature: 4 : DepTime
INFO: gensgd.cpp(main:1211): Selected feature: 5 : CRSDepTime
INFO: gensgd.cpp(main:1211): Selected feature: 6 : ArrTime
INFO: gensgd.cpp(main:1211): Selected feature: 7 : CRSArrTime
INFO: gensgd.cpp(main:1212): Target variable 11 : ActualElapsedTime
INFO: gensgd.cpp(main:1213): From 8 : UniqueCarrier
INFO: gensgd.cpp(main:1214): To 9 : FlightNum
21.8356) Iteration: 0 Training RMSE: 50.3144
36.6782) Iteration: 1 Training RMSE: 40.4813
51.425) Iteration: 2 Training RMSE: 36.0579
66.4348) Iteration: 3 Training RMSE: 33.4226
...
272.188) Iteration: 17 Training RMSE: 20.0103
286.887) Iteration: 18 Training RMSE: 19.7198
301.602) Iteration: 19 Training RMSE: 19.4597
316.305) Iteration: 20 Training RMSE: 19.2147
With temporal information we now got to an RMSE of 19.2 minutes, which is again not that good.
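For comparison, here is roughly what the manual merging step for a traditional time-aware method could look like. This is only a sketch for four of the fields; the function name encode_time and the YYYYMMDDhhmm digit packing are assumptions chosen for illustration, not something any particular library requires.

```python
def encode_time(year, month, day_of_month, dep_time):
    """Pack the date and the hhmm departure time into one integer YYYYMMDDhhmm."""
    date_part = (year * 100 + month) * 100 + day_of_month
    return date_part * 10000 + dep_time


# First two sample rows: 2008,1,3 with DepTime 2003 and 754.
print(encode_time(2008, 1, 3, 2003))
print(encode_time(2008, 1, 3, 754))
```

With gensgd none of this is needed, since each field is passed as a separate feature.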
Now let's utilize the full power of gensgd: when the going gets tough, throw in some more features! Without even understanding what the features mean, I have thrown in almost everything...
./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=20 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1155): Total selected features: 14 :
INFO: gensgd.cpp(main:1211): Selected feature: 1 : Month
INFO: gensgd.cpp(main:1211): Selected feature: 2 : DayofMonth
INFO: gensgd.cpp(main:1211): Selected feature: 3 : DayOfWeek
INFO: gensgd.cpp(main:1211): Selected feature: 4 : DepTime
INFO: gensgd.cpp(main:1211): Selected feature: 5 : CRSDepTime
INFO: gensgd.cpp(main:1211): Selected feature: 6 : ArrTime
INFO: gensgd.cpp(main:1211): Selected feature: 7 : CRSArrTime
INFO: gensgd.cpp(main:1211): Selected feature: 12 : CRSElapsedTime
INFO: gensgd.cpp(main:1211): Selected feature: 13 : AirTime
INFO: gensgd.cpp(main:1211): Selected feature: 14 : ArrDelay
INFO: gensgd.cpp(main:1211): Selected feature: 15 : DepDelay
INFO: gensgd.cpp(main:1211): Selected feature: 16 : Origin
INFO: gensgd.cpp(main:1211): Selected feature: 17 : Dest
INFO: gensgd.cpp(main:1211): Selected feature: 18 : Distance
INFO: gensgd.cpp(main:1212): Target variable 11 : ActualElapsedTime
INFO: gensgd.cpp(main:1213): From 8 : UniqueCarrier
INFO: gensgd.cpp(main:1214): To 9 : FlightNum
36.2089) Iteration: 0 Training RMSE: 21.1476
61.2802) Iteration: 1 Training RMSE: 10.1963
86.3032) Iteration: 2 Training RMSE: 8.64215
111.236) Iteration: 3 Training RMSE: 7.76054
136.246) Iteration: 4 Training RMSE: 7.14308
161.221) Iteration: 5 Training RMSE: 6.6629
...
461.528) Iteration: 17 Training RMSE: 4.26991
486.61) Iteration: 18 Training RMSE: 4.17239
511.737) Iteration: 19 Training RMSE: 4.08084
536.775) Iteration: 20 Training RMSE: 3.99414
Now we got down to an RMSE of 4 minutes. But we can continue the computation (run more iterations) and get down even below 2 minutes error. Isn't that neat? The average flight time in 2008 is 127 minutes, so a prediction error of 2 minutes is not that bad.
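The average can be checked from the CSV file itself. Below is a minimal sketch, assuming missing values are written as NA or left empty (as in the sample rows); the helper name column_mean is hypothetical. It runs here on the three sample rows only, so the printed number is not the full-year average.

```python
import csv
import io


def column_mean(lines, column):
    """Average a numeric column, skipping empty and NA entries."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        value = row[column]
        if value in ("", "NA"):
            continue
        total += float(value)
        count += 1
    if count == 0:
        raise ValueError(f"no numeric values in column {column!r}")
    return total / count


# Header plus the three sample rows shown earlier in this post.
SAMPLE = """\
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
"""

sample_mean = column_mean(io.StringIO(SAMPLE), "ActualElapsedTime")
print(sample_mean)

# On the full file (not run here):
# with open("2008.csv", newline="") as f:
#     print(column_mean(f, "ActualElapsedTime"))
```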
Conclusion: traditional matrix / tensor factorization has some severe limitations when dealing with real-world complex data. Additional techniques are needed to improve accuracy!
Second task: let's predict TaxiIn (time that the plane is on the ground when coming in)
This task is slightly more difficult since, as you may imagine, there is much larger variation in taxi-in time relative to flight time. But is setting up the prediction more difficult? No, we simply change to --val_pos=19, namely point the target to the TaxiIn field.
bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=19 --rehash=1 --file_columns=28 --gensgd_rate3=1e-3 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --features=1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18 --quiet=1 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[quiet] => [1]
INFO: gensgd.cpp(main:1155): Total selected features: 16 :
INFO: gensgd.cpp(main:1158): Selected feature: 1
INFO: gensgd.cpp(main:1158): Selected feature: 2
INFO: gensgd.cpp(main:1158): Selected feature: 3
INFO: gensgd.cpp(main:1158): Selected feature: 4
INFO: gensgd.cpp(main:1158): Selected feature: 5
INFO: gensgd.cpp(main:1158): Selected feature: 6
INFO: gensgd.cpp(main:1158): Selected feature: 7
INFO: gensgd.cpp(main:1158): Selected feature: 10
INFO: gensgd.cpp(main:1158): Selected feature: 11
INFO: gensgd.cpp(main:1158): Selected feature: 12
INFO: gensgd.cpp(main:1158): Selected feature: 13
INFO: gensgd.cpp(main:1158): Selected feature: 14
INFO: gensgd.cpp(main:1158): Selected feature: 15
INFO: gensgd.cpp(main:1158): Selected feature: 16
INFO: gensgd.cpp(main:1158): Selected feature: 17
INFO: gensgd.cpp(main:1158): Selected feature: 18
1.56777) Iteration: 0 Training RMSE: 3.89207
3.01777) Iteration: 1 Training RMSE: 3.64978
4.5159) Iteration: 2 Training RMSE: 3.46472
5.8659) Iteration: 3 Training RMSE: 3.30712
7.26778) Iteration: 4 Training RMSE: 3.17225
8.7159) Iteration: 5 Training RMSE: 3.06696
...
23.6072) Iteration: 16 Training RMSE: 2.60147
24.9789) Iteration: 17 Training RMSE: 2.57697
26.3267) Iteration: 18 Training RMSE: 2.55768
27.6967) Iteration: 19 Training RMSE: 2.54186
29.0773) Iteration: 20 Training RMSE: 2.53113
We again get to an RMSE of about 2.5 minutes. Since taxi-in times are much shorter than flight times, this is a much larger relative error, which means that this task is actually more difficult than predicting air time.
Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 data from here.
2) Unpack the bzip2 file using:
bunzip2 2008.csv.bz2
3) Create a matrix market format file named 2008.csv:info with the following two lines:
%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Run the commands as instructed above.
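Step 3 can also be scripted. Here is a minimal sketch; the helper names info_file_text and write_info_file are hypothetical, and the three numbers are simply copied from step 3. In the Matrix Market coordinate format the second line holds the number of rows, columns and entries.

```python
from pathlib import Path


def info_file_text(rows, cols, entries):
    """Return the two header lines of a Matrix Market coordinate file."""
    return (
        "%%MatrixMarket matrix coordinate real general\n"
        f"{rows} {cols} {entries}\n"
    )


def write_info_file(data_path, rows, cols, entries):
    """Write <data_path>:info next to the data file and return its path."""
    info_path = Path(str(data_path) + ":info")
    info_path.write_text(info_file_text(rows, cols, entries))
    return info_path


# Show the text that would be written for the airline file.
print(info_file_text(20, 7130, 1000000), end="")

# To actually create the file (not run here):
# write_info_file("2008.csv", 20, 7130, 1000000)
```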
Second dataset: Hearst machine learning challenge
A while ago Hearst provided data about email campaigns, and the task was to predict user reaction to emails (clicked / not clicked). The data has several million records about emails sent, with around 273 user features for each email. Here are some of the available fields:
CLICK_FLG,OPEN_FLG,ADDR_VER_CD,AQI,ASIAN_CD,AUTO_IN_MARKET,BIRD_QTY,BUYER_DM_BOOKS,BUYER_DM_COLLECT_SPC_FOOD,BUYER_DM_CRAFTS_HOBBI,BUYER_DM_FEMALE_ORIEN,BUYER_DM_GARDEN_FARM,BUYER_DM_GENERAL,BUYER_DM_GIFT_GADGET,BUYER_DM_MALE_ORIEN,BUYER_DM_UPSCALE,BUYER_MAG_CULINARY_INTERS,BUYER_MAG_FAMILY_GENERAL,BUYER_MAG_FEMALE_ORIENTED,BUYER_MAG_GARDEN_FARMING,BUYER_MAG_HEALTH_FITNESS,BUYER_MAG_MALE_SPORT_ORIENTED,BUYER_MAG_RELIGIOUS,CATS_QTY,CEN_2000_MATCH_LEVEL,CLUB_MEMBER_CD,COUNTRY_OF_ORIGIN,DECEASED_INDICATOR,DM_RESPONDER_HH,DM_RESPONDER_INDIV,DMR_CONTRIB_CAT_GENERAL,DMR_CONTRIB_CAT_HEALTH_INST,DMR_CONTRIB_CAT_POLITICAL,DMR_CONTRIB_CAT_RELIGIOUS,DMR_DO_IT_YOURSELFERS,DMR_MISCELLANEOUS,DMR_NEWS_FINANCIAL,DMR_ODD_ENDS,DMR_PHOTOGRAPHY,DMR_SWEEPSTAKES,DOG_QTY,DWELLING_TYPE,DWELLING_UNIT_SIZE,EST_LOAN_VALUE_RATIO,ETECH_GROUP,ETHNIC_GROUP_CODE,ETHNIC_INSIGHT_MTCH_FLG,ETHNICITY_DETAIL,EXPERIAN_INCOME_CD,EXPERIAN_INCOME_CD_V4,GNDR_OF_CHLDRN_0_3,GNDR_OF_CHLDRN_10_12,GNDR_OF_CHLDRN_13_18,GNDR_OF_CHLDRN_4_6,GNDR_OF_CHLDRN_7_9,HH_INCOME,HHLD_DM_PURC_CD,HOME_BUSINESS_IND,I1_BUSINESS_OWNER_FLG,I1_EXACT_AGE,I1_GNDR_CODE,I1_INDIV_HHLD_STATUS_CODE,INDIV_EDUCATION,INDIV_EDUCATION_CONF_LVL,INDIV_MARITAL_STATUS,INDIV_MARITAL_STATUS_CONF_LVL,INS_MATCH_TYPE,LANGUAGE,LENGTH_OF_RESIDENCE,MEDIAN_HOUSING_VALUE,MEDIAN_LEN_OF_RESIDENCE,MM_INCOME_CD,MOSAIC_HH,MULTI_BUYER_INDIV,NEW_CAR_MODEL,NUM_OF_ADULTS_IN_HHLD,NUMBER_OF_CHLDRN_18_OR_LESS,OCCUP_DETAIL,OCCUP_MIX_PCT,PCT_CHLDRN,PCT_DEROG_TRADES,PCT_HOUSEHOLDS_BLACK,PCT_OWNER_OCCUPIED,PCT_RENTER_OCCUPIED,PCT_TRADES_NOT_DEROG,PCT_WHITE,PHONE_TYPE_CD,PRES_OF_CHLDRN_0_3,PRES_OF_CHLDRN_10_12,PRES_OF_CHLDRN_13_18,PRES_OF_CHLDRN_4_6,PRES_OF_CHLDRN_7_9,PRESENCE_OF_CHLDRN,PRIM_FEM_EDUC_CD,PRIM_FEM_OCC_CD,PRIM_MALE_EDUC_CD,PRIM_MALE_OCC_CD,RECIPIENT_RELIABILITY_CD,RELIGION,SCS_MATCH_TYPE,TRW_INCOME_CD,TRW_INCOME_CD_V4,USED_CAR_CD,Y_OWNS_HOME,Y_PROBABLE_HOMEOWNER,Y_PROBABLE_RENTER,Y_RENTER,YRS_SCHOOLING_CD,Z_CREDIT_CARD
Field meanings and codes are described in detail here. You will need to register on the website to get access to the data.
And this is the first entry:
N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,,,,F,F,,,,,,,U,Y,,,,,,,17,69,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NORTH LAUDERDALE,330685141,FL,190815,,,,,,1036,Third Party - Merch,"Mon, 09/20/10 01:04 PM"
For this demo, I used the file Modeling_1.csv which is the first of 5 files, with 400K entries.
We would like to predict the entry at position 0 (the click flag). I have taken columns 9 and 10 as the matrix from/to entries. The rest of the columns up to column 40 are features. (While there are more features, the solution is already so accurate that the first 40 are enough.)
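The long value for --features does not have to be typed by hand. Here is a small sketch that builds it for positions 3 to 40 while skipping the from/to positions, which matches the command used in this post; the helper name feature_list is hypothetical.

```python
def feature_list(first, last, skip):
    """Comma-separated positions first..last (inclusive), leaving out `skip`."""
    return ",".join(str(pos) for pos in range(first, last + 1) if pos not in skip)


FROM_POS, TO_POS = 9, 10  # columns used as the matrix from/to entries
features = feature_list(3, 40, skip={FROM_POS, TO_POS})
print("--features=" + features)
```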
After about an hour of playing I got the following formulation:
./toolkits/collaborative_filtering/gensgd --training=Modeling_1.csv --val_pos=0 --from_pos=9 --to_pos=10 --features=3,4,5,6,7,8,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 --has_header_titles=1 --rehash=1 --file_columns=200 --rehash_value=1 --calc_error=1 --cutoff=0.5 --has_header_titles=1
WARNING: common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
INFO: gensgd.cpp(main:1255): Total selected features: 36 :
INFO: gensgd.cpp(main:1258): Selected feature: 3 : AQI
INFO: gensgd.cpp(main:1258): Selected feature: 4 : ASIAN_CD
INFO: gensgd.cpp(main:1258): Selected feature: 5 : AUTO_IN_MARKET
INFO: gensgd.cpp(main:1258): Selected feature: 6 : BIRD_QTY
INFO: gensgd.cpp(main:1258): Selected feature: 7 : BUYER_DM_BOOKS
INFO: gensgd.cpp(main:1258): Selected feature: 8 : BUYER_DM_COLLECT_SPC_FOOD
INFO: gensgd.cpp(main:1258): Selected feature: 11 : BUYER_DM_GARDEN_FARM
INFO: gensgd.cpp(main:1258): Selected feature: 12 : BUYER_DM_GENERAL
INFO: gensgd.cpp(main:1258): Selected feature: 13 : BUYER_DM_GIFT_GADGET
INFO: gensgd.cpp(main:1258): Selected feature: 14 : BUYER_DM_MALE_ORIEN
INFO: gensgd.cpp(main:1258): Selected feature: 15 : BUYER_DM_UPSCALE
INFO: gensgd.cpp(main:1258): Selected feature: 16 : BUYER_MAG_CULINARY_INTERS
INFO: gensgd.cpp(main:1258): Selected feature: 17 : BUYER_MAG_FAMILY_GENERAL
INFO: gensgd.cpp(main:1258): Selected feature: 18 : BUYER_MAG_FEMALE_ORIENTED
INFO: gensgd.cpp(main:1258): Selected feature: 19 : BUYER_MAG_GARDEN_FARMING
INFO: gensgd.cpp(main:1258): Selected feature: 20 : BUYER_MAG_HEALTH_FITNESS
INFO: gensgd.cpp(main:1258): Selected feature: 21 : BUYER_MAG_MALE_SPORT_ORIENTED
INFO: gensgd.cpp(main:1258): Selected feature: 22 : BUYER_MAG_RELIGIOUS
INFO: gensgd.cpp(main:1258): Selected feature: 23 : CATS_QTY
INFO: gensgd.cpp(main:1258): Selected feature: 24 : CEN_2000_MATCH_LEVEL
INFO: gensgd.cpp(main:1258): Selected feature: 25 : CLUB_MEMBER_CD
INFO: gensgd.cpp(main:1258): Selected feature: 26 : COUNTRY_OF_ORIGIN
INFO: gensgd.cpp(main:1258): Selected feature: 27 : DECEASED_INDICATOR
INFO: gensgd.cpp(main:1258): Selected feature: 28 : DM_RESPONDER_HH
INFO: gensgd.cpp(main:1258): Selected feature: 29 : DM_RESPONDER_INDIV
INFO: gensgd.cpp(main:1258): Selected feature: 30 : DMR_CONTRIB_CAT_GENERAL
INFO: gensgd.cpp(main:1258): Selected feature: 31 : DMR_CONTRIB_CAT_HEALTH_INST
INFO: gensgd.cpp(main:1258): Selected feature: 32 : DMR_CONTRIB_CAT_POLITICAL
INFO: gensgd.cpp(main:1258): Selected feature: 33 : DMR_CONTRIB_CAT_RELIGIOUS
INFO: gensgd.cpp(main:1258): Selected feature: 34 : DMR_DO_IT_YOURSELFERS
INFO: gensgd.cpp(main:1258): Selected feature: 35 : DMR_MISCELLANEOUS
INFO: gensgd.cpp(main:1258): Selected feature: 36 : DMR_NEWS_FINANCIAL
INFO: gensgd.cpp(main:1258): Selected feature: 37 : DMR_ODD_ENDS
INFO: gensgd.cpp(main:1258): Selected feature: 38 : DMR_PHOTOGRAPHY
INFO: gensgd.cpp(main:1258): Selected feature: 39 : DMR_SWEEPSTAKES
INFO: gensgd.cpp(main:1258): Selected feature: 40 : DOG_QTY
INFO: gensgd.cpp(main:1259): Target variable 0 : CLICK_FLG
INFO: gensgd.cpp(main:1260): From 9 : BUYER_DM_CRAFTS_HOBBI
INFO: gensgd.cpp(main:1261): To 10 : BUYER_DM_FEMALE_ORIEN
54.8829) Iteration: 0 Training RMSE: 0.00927502 Train err: 8e-05
99.4742) Iteration: 1 Training RMSE: 0.00120904 Train err: 0
143.852) Iteration: 2 Training RMSE: 0.000793143 Train err: 0
188.523) Iteration: 3 Training RMSE: 0.000604034 Train err: 0
233.188) Iteration: 4 Training RMSE: 0.000500067 Train err: 0
We got a very good classifier - starting from the second iteration there are no classification errors on the training data.
Some explanation about additional run time flags, not used in previous examples:
1) --rehash_value=1 - since the target value is not numeric, I used rehash_value to translate Y/N into two numeric integer bins.
2) --cutoff=0.5 - after hashing the target Y/N we get two integers: 0 and 1. So I use 0.5 as a prediction threshold to decide between Y and N.
3) --file_columns=200 - I am looking only at the first 40 columns, so there is no need to parse all 273 columns. (You can play with this parameter at run time.)
4) --has_header_titles=1 - the first line of the input file contains the column titles.
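To illustrate flags 1) and 2), here is a sketch of mapping a Y/N target to integer bins and thresholding a real-valued prediction at 0.5. It is an illustration only: the function names, the bin assignment by order of first appearance and the made-up predictions are all assumptions, and gensgd's internal hashing may assign the bins differently.

```python
def rehash_values(values):
    """Assign consecutive integer bins in order of first appearance."""
    bins = {}
    encoded = [bins.setdefault(value, len(bins)) for value in values]
    return encoded, bins


def classify(prediction, cutoff=0.5):
    """Turn a real-valued prediction into bin 0 or bin 1."""
    return 1 if prediction >= cutoff else 0


def error_rate(predictions, targets, cutoff=0.5):
    """Fraction of predictions that land in the wrong bin."""
    wrong = sum(classify(p, cutoff) != t for p, t in zip(predictions, targets))
    return wrong / len(targets)


targets, bins = rehash_values(["N", "N", "Y", "N"])  # N -> 0, Y -> 1
predictions = [0.01, 0.4, 0.93, 0.6]                 # made-up model outputs
print(bins)
print(error_rate(predictions, targets))
```

In this toy example only the last prediction falls on the wrong side of the cutoff, so the error rate is 0.25.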
Instructions
1) Register on the Hearst website.
2) Download the first data file Modeling_1.csv and put it in the main graphchi folder.
3) Create a file named Modeling_1.csv:info and put the following two lines in it:
%%MatrixMarket matrix coordinate real general
11 13 400000
4) Run as instructed.