Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!
If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.
When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.
The simplest distance metric I have implemented is Jaccard distance. The distance of items i and j is computed as follows:
wi = number of users who watched movie i
wj = number of users who watched movie j
wij = number of users who watched both movie i and movie j
Dij = wij / ( wi + wj - wij )
It is clear that Dij is a number between zero and one.
Additional distance functions are found here.
As always, I am looking for brave beta testers who want to try it out!
For full Netflix data, it takes around 1200 seconds to compute distances of around 130M item pairs. (Around 1000 item pairs in a second, using a 24 core machine, with only 800MB memory used).
Ping me if you are interested in other distance metric so I could add them as well.
How to try it out:
1) Follow GraphChi installation instructions steps 1-4.2) Download smallnetflix_mm
3) Run itemcf on smallnetflix data using:
bickson@thrust:~/graphchi$ > ./toolkits/collaborative_filtering/itemcf --training=smallnetflix_mm --nshards=1 --clean_cache=1 --K=10 --quiet=1
WARNING: common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [smallnetflix_mm]
[nshards] => [1]
[clean_cache] => [1]
[K] => [10]
[quiet] => [1]
=== REPORT FOR sharder() ===
[Timings]
edata_flush: 0.005163s (count: 13, min: 0.000335s, max: 0.000534, avg: 0.000397154s)
execute_sharding: 0.273818 s
finish_shard.sort: 0.065508 s
preprocessing: 1.05835 s
shard_final: 0.15078 s
[Other]
app: sharder
Total item pairs compared: 6325948 total written to file: 35551 pairs with zero distance: 124609
/Users/bickson/code/graphchi-cpp
Sorting output file smallnetflix_mm.out0
Sorting output file smallnetflix_mm.out1
Sorting output file smallnetflix_mm.out2
Sorting output file smallnetflix_mm.out3
Sorting output file smallnetflix_mm.out4
Sorting output file smallnetflix_mm.out5
Sorting output file smallnetflix_mm.out6
Sorting output file smallnetflix_mm.out7
Merging sorted files:
File written: smallnetflix_mm-topk
Total lines: 35551
Now let's examine the output file
bickson@thrust:~/graphchi$ 3561 3076 0.23296110332
3561 1990 0.270156830549
3561 1794 0.218252897263
3561 1788 0.227363973856
3561 1598 0.211793571711
3561 1536 0.233875498176
3561 1507 0.224534928799
3561 557 0.237991556525
3561 411 0.22253626585
3561 338 0.237756416202
For each item we get exactly K=10 similar items. (The item are sorted in reverse order).
The format is rather simple. In each row we have:
<item A> <item B> <similarity>
Command line arguments
--training - input file name in sparse matrix market format
--min_allowed_intersection=XX - filter out item pairs with less than XX users who rated them jointly.
--quiet=1 run with less verbose traces
--nshards=1 ## mandatory argument
FOR itemcf: --distance=XX, 0 = Jaccard index, 1=AA, 2=RA, 3=Aiolli
FOR itemcf2: --distance=XX, 3 = PEARSON, 4=COSINE, 5=CHEBYCHEV, 6=MANHATTEN, 7=TANIMOTO, 8=LOG_LIKELIHOOD, 9 = SLOPE_ONE
Useful GraphChi command line arguments:
execthreads XX - set the number of execution threads
membudget_mb XX - fix the size of used memory to XX mb
Additional metrics:
Adamic/Adar (AA) (itemcf):
Just got an email from Timmy Wilson, our man in Ohio:"There are a lot of similarity functions out there -- Adamic / Adar is another good one "
I looked it up an indeed it is a very simple cost function:
Distance_item_i_j = sum over users which rated both items i j (1 / log (user_rating) )
see equation (2) in the paper: http://arxiv.org/abs/0907.1728 "Role of Weak Ties in Link Prediction of Complex Networks". The equation gives larger weight to users which rated both items but have relatively small number of rated items.
I have also added the RA cost function, which is same like AA but without the log.
I have also added the RA cost function, which is same like AA but without the log.
Pearson Correlation (itemcf)
I got a request from additional reader "silly snail" to implement Pearson Correlation as well. Let m be a vector of size M (users) which hold the user rating mean.
Let a be a sparse vector holding user rating for item a, and the same for item b.
Pearson_correlation_i_j = ((a - mean)' * (b-mean)) / (stddev(a) * stddev(b))
And stddev(a) = sum((a-mean).^2) / (n - 1)
where n = length(a).
To run pearson correlation, use the program ./toolkits/collaborative_filtering/preason
A toy example:
Prepare the file "stam" with the content:
%%MatrixMarket matrix coordinate real general
3 4 9
1 1 3
1 2 4
2 3 1
2 4 1
2 1 2
3 1 1
3 2 2
3 3 3
3 4 4
Namely there are 3 users, 4 items, and 9 ratings.
Now run pearson correlation:
./toolkits/collaborative_filtering/pearson --training=stam execthreads 1
Lets examine the output file:
head stam.out0
2 1 0.337405
...
Namely, the pearson correlation between items 1 and 2 is 0.337405.
Now let's do the same computation in matlab:
matlab
>> a=full(mmread('stam')); % load the matrix into memory
Luckily we got the same result! namely pearson correlation between the two first items is 0.3374.
Case study: million songs dataset
Pearson_correlation_i_j = ((a - mean)' * (b-mean)) / (stddev(a) * stddev(b))
And stddev(a) = sum((a-mean).^2) / (n - 1)
where n = length(a).
To run pearson correlation, use the program ./toolkits/collaborative_filtering/preason
A toy example:
Prepare the file "stam" with the content:
%%MatrixMarket matrix coordinate real general
3 4 9
1 1 3
1 2 4
2 3 1
2 4 1
2 1 2
3 1 1
3 2 2
3 3 3
3 4 4
Namely there are 3 users, 4 items, and 9 ratings.
Now run pearson correlation:
./toolkits/collaborative_filtering/pearson --training=stam execthreads 1
Lets examine the output file:
head stam.out0
2 1 0.337405
...
Namely, the pearson correlation between items 1 and 2 is 0.337405.
Now let's do the same computation in matlab:
matlab
>> a=full(mmread('stam')); % load the matrix into memory
a =
3 4 0 0
2 0 1 1
1 2 3 4
>> means=mean(a') % compute the item vectors mean
means =
1.7500 1.0000 2.5000
>> A=a(:,1)-means' % compute item 1 vector minus mean
A =
1.2500
1.0000
-1.5000
>> B=a(:,2)-means' % compute item 2 vector minus mean
B =
2.2500
-1.0000
-0.5000
>> dist = A'*B % multiply the result
ans =
2.5625
>> stddeva=sum(A.^2)/2 % compute stddev item 1
stddeva =
2.4062
>> stddevb=sum(B.^2/2)% compute stddev item 2
stddevb =
3.1562
>> dist/(stddeva*stddevb) % compute pearson correlation
ans =
0.3374
Luckily we got the same result! namely pearson correlation between the two first items is 0.3374.
Cosine Distance (itemcf)
Manhattan Distance (itemcf2)
See http://en.wikipedia.org/wiki/Taxicab_geometryLog Similarity Distance (itemcf2)
Chebychev Distance (itemcf2)
http://en.wikipedia.org/wiki/Chebyshev_distanceTanimoto Distance (itemcf2)
See http://en.wikipedia.org/wiki/Jaccard_indexAiolli's method (itemcf)
See the paper: F. Aiolli, A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge Preference Learning: Problems and Applications in AI (PL-12), ECAI-12 Workshop, Montpellier. pdfSlope One Recommender (itemcf2)
Is described in A Programmer's Guide to Data Mining , page 18.How to compute recommendations out of similarities?
For computing top K recommendations for each user, you can use the itemsim2ratings utility
Additional reading:
Matrix factorization based collaborative filtering with GraphChi.Case study: million songs dataset
Acknowledgements
- Tao Ye, Pandora Internet Radio, for beta testing itemcf.
- Timmy Wilson, Smarttypes.org, for proposing additional metrics
- Denis Parra, Univ. of Pittsburgh, for proposing additional metrics
- Clive Cox, Rummble Labs, for implementing and contributing asym. cosine metric.
- Mohammad Burhan, Fraunhofer IAIS, for suggesting Slope One recommender implementation.
Danny, Do you mean number of users who _watched_ both movie i and j in the third line?
ReplyDeletewi = number of users who watched movie i
wj = number of users who watched movie j
wij = number of users who wanted both movie i and movie j
Dij = wij / ( wi + wj - wij )
Also at some point you mentioned Python interfaces to GraphChi, if that becomes available I would be happy to jump in and beta test, become early user etc. Do you have any plans for that. I think once you have a system where writing a new algorithm becomes kind of like writing a plugin for firefox or chrome, that is the ideal situation since we often run into situations which need slight modifications, workarounds based on ground realities.
Hi Senthil ! Thanks for your note I have fixed the typo (watched).
DeleteRegarding python we do not have a wrapper yet, but there is a graphchi version in Java. We will add your request to our wish list...
Hi Danny, nice work!
ReplyDeleteCan it calculate item similarity with Pearson's correlation?
I'm new to GraphChi and I'm wondering if it is possible that I could implement different similarity metrics myself.
Thanks.
Hi SillySnail (nice nickname by the way!)
DeleteI can implement pearson correlation rather quickly for you (maybe even today), if you like to be my beta tester and try it out.
Best,
HI again,
DeletePearson correlation is now implemented in graphchi. Please try it out by checking out from mercurial (hg pull; hg update) and then recompile (make clean; make cf). I have documented the usage here: http://bickson.blogspot.co.il/2012/09/item-based-similarity-with-graphchi.html#metrics
Hi Danny.
ReplyDeleteThanks for the work. I was wondering. Does someone have to go through each install step each time you publish a new algorithm/measure or can one just copy the new file to the toolkit folder?
Thanks!
Hi Paul,
DeleteMy advice is to use mercurial, and thus you do not need to install, just do "hg pull; hg update" and then "make clean; make cf". The pull commands gets the latest files from the repository. You can also download them manually and place them in the toolkits/collaborative_filtering folder, but this is less recommended since we sometime fix things in the graphchi engine itself or in other parts of the code.
Bestm
Hy Danny,
ReplyDeleteIn the last RecSys Conference, Anmol Bhasin from LinkedIn explained that they use LSH (http://en.wikipedia.org/wiki/Locality-sensitive_hashing) for item-similarity and clustering in high-dimensional data (makes sense, in LinkedIn they have many features per user). Do you think is a good idea implementing this in GraphLab?
On the other side, although you have implemented many recommender algorithms, many researchers require several measures to evaluate them (RMSE, Precision@n, nDCG, recall, MRR, AUC) and MyMediaLite has given an important step in this area
http://www.ismll.uni-hildesheim.de/mymedialite/examples/item_recommendation_datasets.html
is there any plan to incorporate these measures in the Recommender Algorithms of GraphLab and/or GraphChi?
Thanks,
Hi Denis,
ReplyDeleteThanks for your interesting feedback.
Regarding locality sensitive hashing - I am not sure wish variant do you mean. It seems that on the web lecture I found: http://www.slideshare.net/anmolbhasin/beyond-ratings-andfollowers-recsys-2012 they refer to cosine similarity. It is rather straightforward to add it to graphchi.
Regarding additional accuracy measures - it is possible to add support to additional measures. If you are interested in helping me beta test them I can add some of the measures you request.
Best,
DB
Danny,
ReplyDeleteYou are right, looks like the variant than Anmol refers in his presentation is cosine similarity, but not sure how he sets the weights of the vectors or if this is a special implementation. I think it good be a good contribution adding this to GraphChi.
On the other side, I would be glad to help you testing evaluation metrics. We have 4 people working on RecSys at iSchool at Pitt and at least 2 of us (Sherry Sahebi and me) could collaborate with you on this task.
Thanks again.
Danny,
ReplyDeleteI do hope you'll implement cosine similarity and the other sim. functions.
Already implemented... Let me know if you want to test it out.
DeleteThanks,
DB
Yes, I would indeed!
DeleteHi,Danny!
ReplyDeleteI am a little confused about your code when do some test about computing the similarity(Cosine similarity and Pearson coefficient) of an item set by "itemcf2".
It seems when computing Cosine Similarity, the output is the "item1 item2 Distance", for in the distance.cpp, function "calc_cosine_distance" returns "1-dotprod / denominator". But when Pearson Coefficient, the output is "item1 item2 Similarity". I think maybe these outputs are confusing...
Hi Martin!
DeleteThanks for your feedback. Based on your comments I have change the terminology from Cosine similarity to Cosine distance and gave a different explaining URL. Please take a loook and let me know if this is now clearer.
Best,
DB
Good one its works Great .... Sir !!
ReplyDeleteDear Danny,
ReplyDeleteI have a below mentioned Market Matrix, when I run itemcf command, it throws an exception...any idea what could be the issue?
%%MatrixMarket matrix coordinate real general
% Generated 28-Jan-2013
4 4 7
1 1 5
1 2 5
1 3 3
1 4 5
2 1 5
3 1 2
4 1 4
mburhan@mburhan-Vostro-3460:~/graphchi/toolkits/collaborative_filtering$ ./itemcf --training=/home/mburhan/workspace/MarketMatrix/datasource/test_mm
WARNING: common.hpp(print_copyright:144): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [/home/mburhan/workspace/MarketMatrix/datasource/test_mm]
INFO: chifilenames.hpp(find_shards:251): Detected number of shards: 2
INFO: chifilenames.hpp(find_shards:252): To specify a different number of shards, use command-line parameter 'nshards'
INFO: io.hpp(convert_matrixmarket:419): File /home/mburhan/workspace/MarketMatrix/datasource/test_mm was already preprocessed, won't do it again.
INFO: io.hpp(read_global_mean:112): Opened matrix size: 4 x 4 edges: 7 Global mean is: 4.14286 time bins: 0 Now creating shards.
FATAL: itemcf.cpp(main:452): This application currently supports only 1 shard
terminate called after throwing an instance of 'char const*'
Aborted (core dumped)
You should run with --nshards=1
DeleteHi Danny,
ReplyDeleteI exectued the algorithms (Jaccard index, AA, RA, Aiollo) with my dataset and have got great results.
I am currently trying to use cosine distance using the following command:
./toolkits/collaborative_filtering/itemcf2 --training=input_notP_5 --distance=4
This results in the following output on the terminal.
[training] => [input_notP_5]
[distance] => [4]
INFO: chifilenames.hpp(find_shards:258): Detected number of shards: 2
INFO: chifilenames.hpp(find_shards:259): To specify a different number of shards, use command-line parameter 'nshards'
INFO: io.hpp(convert_matrixmarket:425): File input_notP_5 was already preprocessed, won't do it again.
INFO: io.hpp(read_global_mean:112): Opened matrix size: 511288 x 274815 edges: 1044265 Global mean is: 4 time bins: 0 Now creating shards.
DEBUG: stripedio.hpp(stripedio:201): Start io-manager with 2 threads.
INFO: graphchi_engine.hpp(graphchi_engine:150): Initializing graphchi_engine. This engine expects 4-byte edge data.
INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 0 - 600855
INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 600856 - 786102
INFO: graphchi_engine.hpp(run:673): GraphChi starting
INFO: graphchi_engine.hpp(run:674): Licensed under the Apache License 2.0
INFO: graphchi_engine.hpp(run:675): Copyright Aapo Kyrola et al., Carnegie Mellon University (2012)
DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088552, input_notP_5.edata.e4B.0_2sizeof(ET): 4
DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088508, input_notP_5.edata.e4B.1_2sizeof(ET): 4
INFO: graphchi_engine.hpp(print_config:125): Engine configuration:
INFO: graphchi_engine.hpp(print_config:126): exec_threads = 1
INFO: graphchi_engine.hpp(print_config:127): load_threads = 4
INFO: graphchi_engine.hpp(print_config:128): membudget_mb = 800
INFO: graphchi_engine.hpp(print_config:129): blocksize = 4194304
INFO: graphchi_engine.hpp(print_config:130): scheduler = 1
INFO: graphchi_engine.hpp(run:706): Start iteration: 0
INFO: graphchi_engine.hpp(run:760): 0.029119s: Starting: 0 -- 600855
INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 0 - 600855
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates
INFO: graphchi_engine.hpp(run:809): Finished updates
INFO: graphchi_engine.hpp(run:827): Commit memshard
INFO: graphchi_engine.hpp(run:760): 0.207832s: Starting: 600856 -- 786102
INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 600856 - 786102
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485801 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates
INFO: graphchi_engine.hpp(run:809): Finished updates
INFO: graphchi_engine.hpp(run:827): Commit memshard
INFO: graphchi_engine.hpp(run:706): Start iteration: 1
INFO: graphchi_engine.hpp(run:760): 0.322075s: Starting: 0 -- 600855
INFO: graphchi_engine.hpp(run:773): Iteration 1/5, subinterval: 0 - 600855
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates
The execution does not proceeds further and stays at "Start Updates" step. Could you please let me know what I have missed.
Thanks,
Manu
HI Manu,
DeleteThis is strange. If you like send me a sample dataset where this problem happens and I will take a look at it. Which OS are you running on?
Hi Danny,
DeleteIt worked. I am using Ubuntu.. I used the following command:
./toolkits/collaborative_filtering/itemcf2 --training=input_notP_7 --distance=6 --K=3 --min_allowed_intersection=3
The execution at the following took 10 minutes:
INFO: graphchi_engine.hpp(run:799): Start updates
And then it proceeded with the comparison of pairs.
Thanks,
Manu
I am glad to hear it worked! Comparing all pairs may be a low task - depends on how many items you got and also on the sparsity pattern of the user item ratings.
DeleteHi Danny,
ReplyDeleteI am a little bit confused about the cosine similarity choice.
According to the command line argument, "FOR itemcf2: --distance=XX, 3 = PEARSON, 4=COSINE, " itemcf2: --distance=4 means cosine similarity, but in the reference page, cosine distance are in itemcf, not in itemcf2.
I tried with itemcf2, --distance=4 command, which give me similarity value greater than 1, I assume the cosine similarity value should between -1 and 1, right?
Could you help me figure out which --distance index should I choose?
Thanks,
Fen
If your edges are not binary (0 or 1) you may have values outside the range.
Delete