Friday, December 14, 2012

Collaborative filtering - 3rd generation - part 2

NOTE: This blog post is two years old. We have reimplemented this code as part of GraphLab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization are needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset. 
Anyone who wants to try it out should email me and I will send you the exact same code in Python.

**********************************************************************************
A couple of days ago I wrote about a new experimental piece of software I am writing, which I call 3rd generation collaborative filtering software. I got a lot of interesting feedback from my readers, which helped improve the software. Previously I examined its performance on the KDD Cup 2012 dataset. Now I have tried it on completely different datasets, and I am quite pleased with the results.

First dataset: Airline on-time performance


Below I will explain how to deploy it on a different problem domain: airline on-time performance. It is a completely different dataset from a different domain, but the gensgd software can still handle it without any modification. I hope these results, which show how flexible the software is, will encourage additional data scientists to try it out!

The airline on-time dataset has information about 10 years of flights in the US. The data for each year is a CSV file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

The fields are rather self-explanatory. Each line represents a single flight, with information about the date, carrier, airports, and so on; the interesting fields are the ones describing the varying flight durations.

And here are the first few lines:

2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

Note: you can get the dataset using the commands:
curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 -o 2008.csv.bz2
bunzip2 2008.csv.bz2


First task: can we predict the total time the flight was in the air?


Well, for a matrix factorization method it is not clear what the actual matrix is here. That is why flexible software is useful. In my experiments I chose "UniqueCarrier" and "FlightNum" as the two fields that form the matrix, since together they characterize each flight rather uniquely. Next we need to decide which field we want to predict; I chose ActualElapsedTime as the prediction target. Note that these fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case (a sketch of such a variation appears after the first run below).
(Additional information about the meaning of each field is found here).

First let's use traditional matrix factorization.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1  --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 0 : 

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum

   7.58561) Iteration:   0 Training RMSE:    67.1094
   11.7177) Iteration:   1 Training RMSE:    64.6665
   15.8441) Iteration:   2 Training RMSE:    63.2155
   19.9971) Iteration:   3 Training RMSE:    59.0044
   24.0989) Iteration:   4 Training RMSE:    53.9083
   28.1962) Iteration:   5 Training RMSE:    50.2416
...
   77.6041) Iteration:  17 Training RMSE:    35.6409
   81.7165) Iteration:  18 Training RMSE:     35.505
   85.8197) Iteration:  19 Training RMSE:    35.4046
   89.9266) Iteration:  20 Training RMSE:    35.3288


We got an RMSE of 35.3 minutes on the predicted flight time, taking into account only the carrier and flight number. That is rather bad: we are over half an hour off.
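Since the field positions are given on the command line, trying other matrix axes is just a matter of changing --from_pos/--to_pos. As a sketch only (this variation does not appear in the post, so no accuracy numbers are given for it), the same run pointed at Origin (column 16) and Dest (column 17) instead of carrier/flight number would look like:

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=16 --to_pos=17 --val_pos=11 --rehash=1 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1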

Next let's throw some temporal features into the computation: Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime. How do we do that? It is very easy! Just add the command-line flag --features=1,2,3,4,5,6,7, namely the positions of those features in the input file (Year is constant in this file, so it is left out). This is what we call temporal matrix factorization, or tensor factorization. To utilize it in one of the traditional methods, you would need to merge all the temporal fields into one integer which encodes the time, which is of course a tedious task.
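For traditional tools, that merge might look something like the following awk sketch (illustration only: the output file name and the exact time encoding are made up for this example, and gensgd itself never needs this step):

awk -F, 'NR > 1 {
  # build a crude single time index, in minutes, from Month, DayofMonth and DepTime (hhmm)
  t = $2 * 44640 + $3 * 1440 + int($5 / 100) * 60 + $5 % 100;
  # keep carrier, flight number, the merged time index and the ActualElapsedTime target
  print $9 "," $10 "," t "," $12
}' 2008.csv > 2008_time_merged.csv

With gensgd, by contrast, we just list the feature positions on the command line: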



bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --file_columns=28 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=100  --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --features=1,2,3,4,5,6,7 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 7 : 

INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum


   21.8356) Iteration:   0 Training RMSE:    50.3144
   36.6782) Iteration:   1 Training RMSE:    40.4813
    51.425) Iteration:   2 Training RMSE:    36.0579
   66.4348) Iteration:   3 Training RMSE:    33.4226
...
   272.188) Iteration:  17 Training RMSE:    20.0103
   286.887) Iteration:  18 Training RMSE:    19.7198
   301.602) Iteration:  19 Training RMSE:    19.4597
   316.305) Iteration:  20 Training RMSE:    19.2147


With temporal information we now get an RMSE of 19.2 minutes, which is again not that good.

Now let's utilize the full power of gensgd: when the going gets tough, throw in some more features! Without even understanding what the features mean, I have thrown in almost everything...

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=20 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 14 : 
INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:  12 : CRSElapsedTime
INFO:     gensgd.cpp(main:1211): Selected feature:  13 : AirTime
INFO:     gensgd.cpp(main:1211): Selected feature:  14 : ArrDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  15 : DepDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  16 : Origin
INFO:     gensgd.cpp(main:1211): Selected feature:  17 : Dest
INFO:     gensgd.cpp(main:1211): Selected feature:  18 : Distance
INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum
   36.2089) Iteration:   0 Training RMSE:    21.1476
   61.2802) Iteration:   1 Training RMSE:    10.1963
   86.3032) Iteration:   2 Training RMSE:    8.64215
   111.236) Iteration:   3 Training RMSE:    7.76054
   136.246) Iteration:   4 Training RMSE:    7.14308
   161.221) Iteration:   5 Training RMSE:     6.6629
...
   461.528) Iteration:  17 Training RMSE:    4.26991
    486.61) Iteration:  18 Training RMSE:    4.17239
   511.737) Iteration:  19 Training RMSE:    4.08084
   536.775) Iteration:  20 Training RMSE:    3.99414

Now we are down to about 4 minutes of average error. But we can continue the computation (run more iterations) and get even below 2 minutes of error. Isn't that neat? The average flight time in 2008 is 127 minutes, so a 2-minute prediction error is not that bad.
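Continuing the computation is just a matter of rerunning the same command with a larger --max_iter; the post does not say exactly how many iterations are needed to get below 2 minutes, so the 100 below is only a placeholder:

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=100 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1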

Conclusion: traditional matrix/tensor factorization has some severe limitations when dealing with complex real-world data. Additional features and techniques are needed to improve accuracy!

Second task: let's predict TaxiIn (the time the plane spends on the ground after landing).

This task is slightly more difficult since, as you may imagine, there is much larger relative variation in taxi-in time than in flight time. But is predicting it any harder to set up? No: we simply change to --val_pos=19, namely pointing the target to the TaxiIn field.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=19 --rehash=1 --file_columns=28 --gensgd_rate3=1e-3 --gensgd_mult_dec=0.9999 --max_iter=20 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --features=1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[quiet] => [1]
INFO:     gensgd.cpp(main:1155): Total selected features: 16 : 
INFO:     gensgd.cpp(main:1158): Selected feature: 1
INFO:     gensgd.cpp(main:1158): Selected feature: 2
INFO:     gensgd.cpp(main:1158): Selected feature: 3
INFO:     gensgd.cpp(main:1158): Selected feature: 4
INFO:     gensgd.cpp(main:1158): Selected feature: 5
INFO:     gensgd.cpp(main:1158): Selected feature: 6
INFO:     gensgd.cpp(main:1158): Selected feature: 7
INFO:     gensgd.cpp(main:1158): Selected feature: 10
INFO:     gensgd.cpp(main:1158): Selected feature: 11
INFO:     gensgd.cpp(main:1158): Selected feature: 12
INFO:     gensgd.cpp(main:1158): Selected feature: 13
INFO:     gensgd.cpp(main:1158): Selected feature: 14
INFO:     gensgd.cpp(main:1158): Selected feature: 15
INFO:     gensgd.cpp(main:1158): Selected feature: 16
INFO:     gensgd.cpp(main:1158): Selected feature: 17
INFO:     gensgd.cpp(main:1158): Selected feature: 18
   1.56777) Iteration:   0 Training RMSE:    3.89207
   3.01777) Iteration:   1 Training RMSE:    3.64978
    4.5159) Iteration:   2 Training RMSE:    3.46472
    5.8659) Iteration:   3 Training RMSE:    3.30712
   7.26778) Iteration:   4 Training RMSE:    3.17225
    8.7159) Iteration:   5 Training RMSE:    3.06696
...
   23.6072) Iteration:  16 Training RMSE:    2.60147
   24.9789) Iteration:  17 Training RMSE:    2.57697
   26.3267) Iteration:  18 Training RMSE:    2.55768
   27.6967) Iteration:  19 Training RMSE:    2.54186
   29.0773) Iteration:  20 Training RMSE:    2.53113
We again get to an average RMSE of about 2.5 minutes, which, given how short taxi-in times are, means that this task is actually more difficult than predicting air time.


Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 from here.
2) Unpack the downloaded file using:
bunzip2 2008.csv.bz2
3) Create a matrix market format file named 2008.csv:info with the following two lines (see the snippet after this list):
%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Run the commands as instructed above.
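If you prefer to script step 3, a minimal shell snippet (using exactly the two header lines shown in step 3) is:

cat > 2008.csv:info <<'EOF'
%%MatrixMarket matrix coordinate real general
20 7130 1000000
EOF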


Second dataset: Hearst machine learning challenge

A while ago Hearst provided data about email campaigns, and the task was to predict the user reaction to each email (clicked / not clicked). The data has several million records about emails sent, with around 273 user features for each email. Here are some of the available fields:
CLICK_FLG,OPEN_FLG,ADDR_VER_CD,AQI,ASIAN_CD,AUTO_IN_MARKET,BIRD_QTY,BUYER_DM_BOOKS,BUYER_DM_COLLECT_SPC_FOOD,BUYER_DM_CRAFTS_HOBBI,BUYER_DM_FEMALE_ORIEN,BUYER_DM_GARDEN_FARM,BUYER_DM_GENERAL,BUYER_DM_GIFT_GADGET,BUYER_DM_MALE_ORIEN,BUYER_DM_UPSCALE,BUYER_MAG_CULINARY_INTERS,BUYER_MAG_FAMILY_GENERAL,BUYER_MAG_FEMALE_ORIENTED,BUYER_MAG_GARDEN_FARMING,BUYER_MAG_HEALTH_FITNESS,BUYER_MAG_MALE_SPORT_ORIENTED,BUYER_MAG_RELIGIOUS,CATS_QTY,CEN_2000_MATCH_LEVEL,CLUB_MEMBER_CD,COUNTRY_OF_ORIGIN,DECEASED_INDICATOR,DM_RESPONDER_HH,DM_RESPONDER_INDIV,DMR_CONTRIB_CAT_GENERAL,DMR_CONTRIB_CAT_HEALTH_INST,DMR_CONTRIB_CAT_POLITICAL,DMR_CONTRIB_CAT_RELIGIOUS,DMR_DO_IT_YOURSELFERS,DMR_MISCELLANEOUS,DMR_NEWS_FINANCIAL,DMR_ODD_ENDS,DMR_PHOTOGRAPHY,DMR_SWEEPSTAKES,DOG_QTY,DWELLING_TYPE,DWELLING_UNIT_SIZE,EST_LOAN_VALUE_RATIO,ETECH_GROUP,ETHNIC_GROUP_CODE,ETHNIC_INSIGHT_MTCH_FLG,ETHNICITY_DETAIL,EXPERIAN_INCOME_CD,EXPERIAN_INCOME_CD_V4,GNDR_OF_CHLDRN_0_3,GNDR_OF_CHLDRN_10_12,GNDR_OF_CHLDRN_13_18,GNDR_OF_CHLDRN_4_6,GNDR_OF_CHLDRN_7_9,HH_INCOME,HHLD_DM_PURC_CD,HOME_BUSINESS_IND,I1_BUSINESS_OWNER_FLG,I1_EXACT_AGE,I1_GNDR_CODE,I1_INDIV_HHLD_STATUS_CODE,INDIV_EDUCATION,INDIV_EDUCATION_CONF_LVL,INDIV_MARITAL_STATUS,INDIV_MARITAL_STATUS_CONF_LVL,INS_MATCH_TYPE,LANGUAGE,LENGTH_OF_RESIDENCE,MEDIAN_HOUSING_VALUE,MEDIAN_LEN_OF_RESIDENCE,MM_INCOME_CD,MOSAIC_HH,MULTI_BUYER_INDIV,NEW_CAR_MODEL,NUM_OF_ADULTS_IN_HHLD,NUMBER_OF_CHLDRN_18_OR_LESS,OCCUP_DETAIL,OCCUP_MIX_PCT,PCT_CHLDRN,PCT_DEROG_TRADES,PCT_HOUSEHOLDS_BLACK,PCT_OWNER_OCCUPIED,PCT_RENTER_OCCUPIED,PCT_TRADES_NOT_DEROG,PCT_WHITE,PHONE_TYPE_CD,PRES_OF_CHLDRN_0_3,PRES_OF_CHLDRN_10_12,PRES_OF_CHLDRN_13_18,PRES_OF_CHLDRN_4_6,PRES_OF_CHLDRN_7_9,PRESENCE_OF_CHLDRN,PRIM_FEM_EDUC_CD,PRIM_FEM_OCC_CD,PRIM_MALE_EDUC_CD,PRIM_MALE_OCC_CD,RECIPIENT_RELIABILITY_CD,RELIGION,SCS_MATCH_TYPE,TRW_INCOME_CD,TRW_INCOME_CD_V4,USED_CAR_CD,Y_OWNS_HOME,Y_PROBABLE_HOMEOWNER,Y_PROBABLE_RENTER,Y_RENTER,YRS_SCHOOLING_CD,Z_CREDIT_CARD

The meaning and coding of the fields are described in detail here. You will need to register on the website to get access to the data.

And this is the first entry:
N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,,,,F,F,,,,,,,U,Y,,,,,,,17,69,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NORTH LAUDERDALE,330685141,FL,190815,,,,,,1036,Third Party - Merch,"Mon, 09/20/10 01:04 PM"


For this demo I used the file Modeling_1.csv, which is the first of 5 files and has 400K entries.

We would like to predict the zeroth column (the click flag). I have taken columns 9 and 10 as the matrix from/to entries. The rest of the columns, up to column 40, are features. (While there are more features available, the solution is already so accurate that the first 40 are enough.)

After about an hour of playing I got to the following formulation:

./toolkits/collaborative_filtering/gensgd --training=Modeling_1.csv --val_pos=0 --from_pos=9 --to_pos=10 --features=3,4,5,6,7,8,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 --has_header_titles=1 --rehash=1 --file_columns=200 --rehash_value=1 --calc_error=1 --cutoff=0.5
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1255): Total selected features: 36 : 
INFO:     gensgd.cpp(main:1258): Selected feature:   3 : AQI
INFO:     gensgd.cpp(main:1258): Selected feature:   4 : ASIAN_CD
INFO:     gensgd.cpp(main:1258): Selected feature:   5 : AUTO_IN_MARKET
INFO:     gensgd.cpp(main:1258): Selected feature:   6 : BIRD_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:   7 : BUYER_DM_BOOKS
INFO:     gensgd.cpp(main:1258): Selected feature:   8 : BUYER_DM_COLLECT_SPC_FOOD
INFO:     gensgd.cpp(main:1258): Selected feature:  11 : BUYER_DM_GARDEN_FARM
INFO:     gensgd.cpp(main:1258): Selected feature:  12 : BUYER_DM_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  13 : BUYER_DM_GIFT_GADGET
INFO:     gensgd.cpp(main:1258): Selected feature:  14 : BUYER_DM_MALE_ORIEN
INFO:     gensgd.cpp(main:1258): Selected feature:  15 : BUYER_DM_UPSCALE
INFO:     gensgd.cpp(main:1258): Selected feature:  16 : BUYER_MAG_CULINARY_INTERS
INFO:     gensgd.cpp(main:1258): Selected feature:  17 : BUYER_MAG_FAMILY_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  18 : BUYER_MAG_FEMALE_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  19 : BUYER_MAG_GARDEN_FARMING
INFO:     gensgd.cpp(main:1258): Selected feature:  20 : BUYER_MAG_HEALTH_FITNESS
INFO:     gensgd.cpp(main:1258): Selected feature:  21 : BUYER_MAG_MALE_SPORT_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  22 : BUYER_MAG_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  23 : CATS_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:  24 : CEN_2000_MATCH_LEVEL
INFO:     gensgd.cpp(main:1258): Selected feature:  25 : CLUB_MEMBER_CD
INFO:     gensgd.cpp(main:1258): Selected feature:  26 : COUNTRY_OF_ORIGIN
INFO:     gensgd.cpp(main:1258): Selected feature:  27 : DECEASED_INDICATOR
INFO:     gensgd.cpp(main:1258): Selected feature:  28 : DM_RESPONDER_HH
INFO:     gensgd.cpp(main:1258): Selected feature:  29 : DM_RESPONDER_INDIV
INFO:     gensgd.cpp(main:1258): Selected feature:  30 : DMR_CONTRIB_CAT_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  31 : DMR_CONTRIB_CAT_HEALTH_INST
INFO:     gensgd.cpp(main:1258): Selected feature:  32 : DMR_CONTRIB_CAT_POLITICAL
INFO:     gensgd.cpp(main:1258): Selected feature:  33 : DMR_CONTRIB_CAT_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  34 : DMR_DO_IT_YOURSELFERS
INFO:     gensgd.cpp(main:1258): Selected feature:  35 : DMR_MISCELLANEOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  36 : DMR_NEWS_FINANCIAL
INFO:     gensgd.cpp(main:1258): Selected feature:  37 : DMR_ODD_ENDS
INFO:     gensgd.cpp(main:1258): Selected feature:  38 : DMR_PHOTOGRAPHY
INFO:     gensgd.cpp(main:1258): Selected feature:  39 : DMR_SWEEPSTAKES
INFO:     gensgd.cpp(main:1258): Selected feature:  40 : DOG_QTY
INFO:     gensgd.cpp(main:1259): Target variable   0 : CLICK_FLG
INFO:     gensgd.cpp(main:1260): From              9 : BUYER_DM_CRAFTS_HOBBI
INFO:     gensgd.cpp(main:1261): To               10 : BUYER_DM_FEMALE_ORIEN
   54.8829) Iteration:   0 Training RMSE: 0.00927502  Train err:      8e-05
   99.4742) Iteration:   1 Training RMSE: 0.00120904  Train err:          0
   143.852) Iteration:   2 Training RMSE: 0.000793143 Train err:          0
   188.523) Iteration:   3 Training RMSE: 0.000604034 Train err:          0
   233.188) Iteration:   4 Training RMSE: 0.000500067 Train err:          0


We got a very good classifier: starting from the second iteration there are no classification errors on the training data.

Some explanation of additional runtime flags not used in the previous examples:
1) --rehash_value=1 - since the target value is not numeric, I used rehash_value to translate Y/N into two numeric integer bins.
2) --cutoff=0.5 - after hashing the target Y/N we get two integers, 0 and 1, so I use 0.5 as the prediction threshold to decide between Y and N.
3) --file_columns=200 - I am looking only at the first 40 columns, so there is no need to parse all 273 columns. (You can play with this parameter at run time.)
4) --has_header_titles=1 - the first line of the input file contains the column titles.

Instructions
1) Register on the Hearst website.
2) Download the first data file, Modeling_1.csv, and put it in the main graphchi folder.
3) Create a file named Modeling_1.csv:info and put the following two lines in it (see the snippet after this list):
%%MatrixMarket matrix coordinate real general
11 13 400000
4) Run as instructed.
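As with the airline dataset, step 3 can be scripted with a small shell snippet (again using exactly the two header lines shown above):

cat > Modeling_1.csv:info <<'EOF'
%%MatrixMarket matrix coordinate real general
11 13 400000
EOF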

25 comments:

  1. Hi,

    I've installed graph-chi on my macbook, and ran a few of the demo scripts without error. However, it appears I cannot load data from a .csv file. When I try to run "traditional matrix factorization" I get the following error: "FATAL: gensgd.cpp(convert_matrixmarket_N:582): Bug: can not add edge from 0 to J 0 since max is: 0x0"

    It appears that the conversion from .csv to matrix market is failing. What could be causing this?

    Thanks,

    Zach

    Replies
    1. I found the problem: the file should be named "2008.csv:info" not "csv.2008:info"

    2. Thanks for the update! I have fixed the documentation.

  2. Hello,

    I've installed graphlab on a VM with Ubuntu and ran the demo scripts from this page and got some errors:

    dataset 2008.CSV
    - traditional matrix factorization : OK
    - temporal matrix factorization :
    [Other]
    app: sharder
    gensgd: malloc.c:2451: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
    Aborted (core dumped)
    - More features : OK
    - TaxiIn : OK

    dataset Modeling_1.csv
    INFO: gensgd.cpp(convert_matrixmarket_N:559): Starting to read matrix-market input. Matrix dimensions: 11 x 13, non-zeros: 400000
    FATAL: gensgd.cpp(read_line:333): Error reading line 0 feature 115 [ N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,h ]
    terminate called after throwing an instance of 'char const*'
    Aborted (core dumped)

    The first error is strange because it works with more features, and I couldn't find what's wrong in the second file that causes the reading error (I tried to change --file_columns but it still doesn't work).

    Thanks.

    Replies
    1. Hi,
      Sorry about that. Please pull again from mercurial using "hg pull; hg update" and recompile using "make clean; make cf". A contributed Mac OS patch that was supposed to fix the missing getline() function made a mess in the Linux version.
      Let me know if it now works.

    2. Thanks Danny.

      It works perfectly now.

  3. Hello Danny,

    As I said in another post, I'm working on a one-class problem, and I tried your new software on my database; I have a few questions.

    - in your first example you set "--minval=-1 --maxval=1 --calc_error=1" but no cutoff; does it automatically set the cutoff value to 0?

    - in the sparse example you don't set --minval and --maxval but --cutoff=0.5; is there a specific reason you write the command this way in this case?

    - when you set --minval and --maxval, what kind of loss function is used?

    - you use the --validation option in the sparse example, but when I try to use it with gensgd it doesn't work; is that normal?

    - do you plan to implement the --test option?

    - as I'm dealing with a one-class problem I tried the implicit rating option and it worked, but I'm curious what is done when the features option is used: what values are put in the features associated with these additional ratings?

    Thanks.

    Regards.

    Replies
    1. Hi Alex,
      1) You are right. The default cutoff is 0.
      2) --minval and --maxval are optional arguments; they slightly improve performance in some cases, but when the result can be anywhere in the range there is no need to truncate.
      3) --minval and --maxval are independent of the loss function used, you can use them with any loss function.
      4) Please send our user mailing list (graphlab-kdd) the exact command you used and the error you got using the --validation - it should work. (Even better if you have some small dataset to show the error).
      5) The --test option should work - send me a scenario where you get an error and I will debug it.
      6) Added implicit ratings do not carry feature information, and thus I suggest not applying that option here.

      Best,

  4. I am running this algorithm and it produced a *_U.mm file after execution. To get the recommendations, I tried to run the rating command and got the following exception:

    $GRAPHCHI_ROOT/toolkits/collaborative_filtering/rating --training=*_U.mm --num_ratings=5 --quiet=1 --algorithm=sgd
    WARNING: common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
    [training] => [*_U.mm]
    [num_ratings] => [5]
    [quiet] => [1]
    [algorithm] => [sgd]
    FATAL: io.hpp(read_matrix_market_banner_and_size:61): Sorry, this application does not support complex values and requires a sparse matrix.
    terminate called after throwing an instance of 'char const*'
    Aborted

    Please help me figure out where I am going wrong.

  5. 1) which algorithm are you running - sgd?
    2) you need to give the same string as given to the sgd utility using the --training=XXXX command.

    Replies
    1. When I give gensgd as the algorithm, it gives the following exception:

      FATAL: rating.cpp(main:296): --algorithms should be one of: als, sparse_als, sgd, nmf, wals

  6. Gensgd is not supported by the rating command.
    The only option you have is to give a file with --test=FILENAME,
    and then you will get predictions for each line of features in the test data.
    (The test data should have the same format as the training data.)

    Replies
    1. Hi Danny,
      I have given the test file in the following format:
      userid productid
      The test file contains all the user ids and product ids of the training file, but I have not found any predictions in the test data. Please help me figure out whether this is the proper way of running it.

  7. The test file should be in the exact same format as the training file.
    So if you have a CSV for training, you should have a CSV with the same format for test.

    Replies
    1. Both files are in the same format; all the fields are separated by spaces. But no predictions are captured in the test file. I was able to run the gensgd command successfully.

    2. Send me an example input file and I will take a look - most likely one of the command-line arguments is wrong.

    3. Hi Danny,
      I am running it using the following command

      $GRAPHCHI_ROOT/toolkits/collaborative_filtering/gensgd --training=userproductmatrix --test=userproducttestfile --from_pos=0 --to_pos=1 --val_pos=3 --rehash=1 --gensgd_rate3=1e-5 --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=4 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --features=2

    4. The userproductmatrix you sent me has only 3 columns. In that case there is no point in using gensgd - you should use sgd. (Unless you have more columns in your version)

  8. Hello, many thanks for this post. I was able to run all of the different samples, but I get an RMSE far higher than expected even after many iterations.

    For the example which should lead to a 2-minute RMSE, I get an RMSE of 32 minutes after 19 iterations.

    I run Ubuntu; could it be a library or setup issue?

    Thanks

    Replies
    1. Hi Xavier,
      We have re-implemented this code as part of GraphLab Create. You are highly encouraged to try it out - it is free and it gets much better results. Send me an email and I will send you the IPython notebook to reproduce the exact same experiment in GLC.

    2. Hi Danny,
      Thanks for your feedback. GraphLab Create seems great but risky to me: I went into the terms & conditions and read "We grant you a limited, revocable license". I am currently testing different solutions, and it seems hard to know what the future of such an option is given those terms.

    3. Our project has open source foundations, and you can always stick to the open source if you like. GraphLab Create, while not open source, is still free for the foreseeable future. Fine-tuning the open source directly is more difficult. I am traveling now; I will be happy to take a look at the example in a few days. If you don't mind, please post a question at our user forum: http://forum.graphlab.com so I can keep track of the issue and not forget.

    4. p.s.
      I will be happy to set up a phone call to discuss your problem and give some advice regarding GraphLab Create evaluation.

    5. Hi Danny, I looked into gensgd.cpp to track down the RMSE difference. It turned out that step 3 gets gensgd_rate multiplied twice instead of once per step. Now it works. This seems to date from 2 commits made on Oct 4 and 10, 2013. I made a pull request. Regards, Xavier

    6. Great find! I just merged your pull request. Much appreciated!
