Large Scale Machine Learning and Other Animals: Item based similarity with GraphChi

Tuesday, October 2, 2012

Item based similarity with GraphChi

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to. Following my mega collaborator Justin Yan's advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B, is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which where watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.

The simplest distance metric I have implemented is Jaccard distance. The distance of items i and j is computed as follows:
wi = number of users who watched movie i
wj = number of users who watched movie j
wij = number of users who watched both movie i and movie j
Dij = wij / ( wi + wj - wij )

It is clear that Dij is a number between zero and one.
Additional distance functions are found here.

As always, I am looking for brave beta testers who want to try it out!
For full Netflix data, it takes around 1200 seconds to compute distances of around 130M item pairs. (Around 1000 item pairs in a second, using a 24 core machine, with only 800MB memory used).

Ping me if you are interested in other distance metric so I could add them as well.

How to try it out:

1) Follow GraphChi installation instructions steps 1-4.
2) Download smallnetflix_mm
3) Run itemcf on smallnetflix data using:

bickson@thrust:~/graphchi$ > ./toolkits/collaborative_filtering/itemcf --training=smallnetflix_mm --nshards=1 --clean_cache=1 --K=10 --quiet=1
WARNING: common.hpp(print_copyright:183): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [smallnetflix_mm]
[nshards] => [1]
[clean_cache] => [1]
[K] => [10]
[quiet] => [1]

=== REPORT FOR sharder() ===
[Timings]
edata_flush: 0.005163s (count: 13, min: 0.000335s, max: 0.000534, avg: 0.000397154s)
execute_sharding: 0.273818 s
finish_shard.sort: 0.065508 s
preprocessing: 1.05835 s
shard_final: 0.15078 s
[Other]
app: sharder
Total item pairs compared: 6325948 total written to file: 35551 pairs with zero distance: 124609
/Users/bickson/code/graphchi-cpp
Sorting output file smallnetflix_mm.out0
Sorting output file smallnetflix_mm.out1
Sorting output file smallnetflix_mm.out2
Sorting output file smallnetflix_mm.out3
Sorting output file smallnetflix_mm.out4
Sorting output file smallnetflix_mm.out5
Sorting output file smallnetflix_mm.out6
Sorting output file smallnetflix_mm.out7
Merging sorted files:
File written: smallnetflix_mm-topk
Total lines: 35551

Now let's examine the output file

bickson@thrust:~/graphchi$ 3561 3076 0.23296110332
3561 1990 0.270156830549
3561 1794 0.218252897263
3561 1788 0.227363973856
3561 1598 0.211793571711
3561 1536 0.233875498176
3561 1507 0.224534928799
3561 557 0.237991556525
3561 411 0.22253626585
3561 338 0.237756416202

For each item we get exactly K=10 similar items. (The item are sorted in reverse order).
The format is rather simple. In each row we have:
<item A> <item B> <similarity>

Command line arguments
--training - input file name in sparse matrix market format
--min_allowed_intersection=XX - filter out item pairs with less than XX users who rated them jointly.
--quiet=1 run with less verbose traces
--nshards=1 ## mandatory argument
FOR itemcf: --distance=XX, 0 = Jaccard index, 1=AA, 2=RA, 3=Aiolli
FOR itemcf2: --distance=XX, 3 = PEARSON, 4=COSINE, 5=CHEBYCHEV, 6=MANHATTEN, 7=TANIMOTO, 8=LOG_LIKELIHOOD, 9 = SLOPE_ONE

Useful GraphChi command line arguments:
execthreads XX - set the number of execution threads
membudget_mb XX - fix the size of used memory to XX mb

Additional metrics:

Adamic/Adar (AA) (itemcf):

Just got an email from Timmy Wilson, our man in Ohio:
"There are a lot of similarity functions out there -- Adamic / Adar is another good one "

I looked it up an indeed it is a very simple cost function:

Distance_item_i_j = sum over users which rated both items i j (1 / log (user_rating) )

see equation (2) in the paper: http://arxiv.org/abs/0907.1728 "Role of Weak Ties in Link Prediction of Complex Networks". The equation gives larger weight to users which rated both items but have relatively small number of rated items.

I have also added the RA cost function, which is same like AA but without the log.

Pearson Correlation (itemcf)

I got a request from additional reader "silly snail" to implement Pearson Correlation as well. Let m be a vector of size M (users) which hold the user rating mean.

Let a be a sparse vector holding user rating for item a, and the same for item b.

Pearson_correlation_i_j = ((a - mean)' * (b-mean)) / (stddev(a) * stddev(b))

And stddev(a) = sum((a-mean).^2) / (n - 1)
where n = length(a).

To run pearson correlation, use the program ./toolkits/collaborative_filtering/preason

A toy example:

Prepare the file "stam" with the content:
%%MatrixMarket matrix coordinate real general
3 4 9
1 1 3
1 2 4
2 3 1
2 4 1
2 1 2
3 1 1
3 2 2
3 3 3
3 4 4

Namely there are 3 users, 4 items, and 9 ratings.
Now run pearson correlation:
./toolkits/collaborative_filtering/pearson --training=stam execthreads 1

Lets examine the output file:
head stam.out0
2 1 0.337405
...

Namely, the pearson correlation between items 1 and 2 is 0.337405.
Now let's do the same computation in matlab:
matlab

>> a=full(mmread('stam')); % load the matrix into memory

a =

3 4 0 0

2 0 1 1

1 2 3 4

>> means=mean(a') % compute the item vectors mean

means =

1.7500 1.0000 2.5000

>> A=a(:,1)-means' % compute item 1 vector minus mean

A =

1.2500

1.0000

-1.5000

>> B=a(:,2)-means' % compute item 2 vector minus mean

B =

2.2500

-1.0000

-0.5000

>> dist = A'*B % multiply the result

ans =

2.5625

>> stddeva=sum(A.^2)/2 % compute stddev item 1

stddeva =

2.4062

>> stddevb=sum(B.^2/2)% compute stddev item 2

stddevb =

3.1562

>> dist/(stddeva*stddevb) % compute pearson correlation

ans =

0.3374

Luckily we got the same result! namely pearson correlation between the two first items is 0.3374.

Aiolli's method (itemcf)

See the paper: F. Aiolli, A Preliminary Study on a Recommender System for the Million Songs Dataset Challenge Preference Learning: Problems and Applications in AI (PL-12), ECAI-12 Workshop, Montpellier. pdf

Slope One Recommender (itemcf2)

Is described in A Programmer's Guide to Data Mining , page 18.

How to compute recommendations out of similarities?

For computing top K recommendations for each user, you can use the itemsim2ratings utility

as explained here. The input file for itemsim2rating is the similarity matrix computed by itemcf.

Additional reading:

Matrix factorization based collaborative filtering with GraphChi.
Case study: million songs dataset

Acknowledgements

Tao Ye, Pandora Internet Radio, for beta testing itemcf.
Timmy Wilson, Smarttypes.org, for proposing additional metrics
Denis Parra, Univ. of Pittsburgh, for proposing additional metrics
Clive Cox, Rummble Labs, for implementing and contributing asym. cosine metric.
Mohammad Burhan, Fraunhofer IAIS, for suggesting Slope One recommender implementation.

24 comments:

SenthilSeptember 8, 2012 at 3:40 AM
Danny, Do you mean number of users who _watched_ both movie i and j in the third line?

wi = number of users who watched movie i
wj = number of users who watched movie j
wij = number of users who wanted both movie i and movie j
Dij = wij / ( wi + wj - wij )

Also at some point you mentioned Python interfaces to GraphChi, if that becomes available I would be happy to jump in and beta test, become early user etc. Do you have any plans for that. I think once you have a system where writing a new algorithm becomes kind of like writing a plugin for firefox or chrome, that is the ideal situation since we often run into situations which need slight modifications, workarounds based on ground realities.
ReplyDelete
Replies
SillySnailSeptember 9, 2012 at 9:56 PM
Hi Danny, nice work!
Can it calculate item similarity with Pearson's correlation?
I'm new to GraphChi and I'm wondering if it is possible that I could implement different similarity metrics myself.
Thanks.
ReplyDelete
Replies
Paul LefkopoulosSeptember 11, 2012 at 3:01 AM
Hi Danny.

Thanks for the work. I was wondering. Does someone have to go through each install step each time you publish a new algorithm/measure or can one just copy the new file to the toolkit folder?

Thanks!
ReplyDelete
Replies
Denis ParraOctober 2, 2012 at 8:33 AM
Hy Danny,

In the last RecSys Conference, Anmol Bhasin from LinkedIn explained that they use LSH (http://en.wikipedia.org/wiki/Locality-sensitive_hashing) for item-similarity and clustering in high-dimensional data (makes sense, in LinkedIn they have many features per user). Do you think is a good idea implementing this in GraphLab?

On the other side, although you have implemented many recommender algorithms, many researchers require several measures to evaluate them (RMSE, Precision@n, nDCG, recall, MRR, AUC) and MyMediaLite has given an important step in this area

http://www.ismll.uni-hildesheim.de/mymedialite/examples/item_recommendation_datasets.html

is there any plan to incorporate these measures in the Recommender Algorithms of GraphLab and/or GraphChi?

Thanks,
ReplyDelete
Replies
Danny BicksonOctober 2, 2012 at 8:59 AM
Hi Denis,
Thanks for your interesting feedback.

Regarding locality sensitive hashing - I am not sure wish variant do you mean. It seems that on the web lecture I found: http://www.slideshare.net/anmolbhasin/beyond-ratings-andfollowers-recsys-2012 they refer to cosine similarity. It is rather straightforward to add it to graphchi.

Regarding additional accuracy measures - it is possible to add support to additional measures. If you are interested in helping me beta test them I can add some of the measures you request.

Best,

DB
ReplyDelete
Replies
Denis ParraOctober 2, 2012 at 10:55 AM
Danny,

You are right, looks like the variant than Anmol refers in his presentation is cosine similarity, but not sure how he sets the weights of the vectors or if this is a special implementation. I think it good be a good contribution adding this to GraphChi.

On the other side, I would be glad to help you testing evaluation metrics. We have 4 people working on RecSys at iSchool at Pitt and at least 2 of us (Sherry Sahebi and me) could collaborate with you on this task.

Thanks again.
ReplyDelete
Replies
Will FitzgeraldOctober 2, 2012 at 1:17 PM
Danny,

I do hope you'll implement cosine similarity and the other sim. functions.

ReplyDelete
Replies
MartinOctober 16, 2012 at 2:20 AM
Hi，Danny!
I am a little confused about your code when do some test about computing the similarity(Cosine similarity and Pearson coefficient) of an item set by "itemcf2".
It seems when computing Cosine Similarity, the output is the "item1 item2 Distance", for in the distance.cpp, function "calc_cosine_distance" returns "1-dotprod / denominator". But when Pearson Coefficient, the output is "item1 item2 Similarity". I think maybe these outputs are confusing...
ReplyDelete
Replies
UnknownOctober 31, 2012 at 8:23 AM
Good one its works Great .... Sir !!
ReplyDelete
Replies
BurhanJanuary 28, 2013 at 6:03 AM
Dear Danny,

I have a below mentioned Market Matrix, when I run itemcf command, it throws an exception...any idea what could be the issue?

%%MatrixMarket matrix coordinate real general
% Generated 28-Jan-2013
4 4 7
1 1 5
1 2 5
1 3 3
1 4 5
2 1 5
3 1 2
4 1 4

mburhan@mburhan-Vostro-3460:~/graphchi/toolkits/collaborative_filtering$ ./itemcf --training=/home/mburhan/workspace/MarketMatrix/datasource/test_mm
WARNING: common.hpp(print_copyright:144): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [/home/mburhan/workspace/MarketMatrix/datasource/test_mm]
INFO: chifilenames.hpp(find_shards:251): Detected number of shards: 2
INFO: chifilenames.hpp(find_shards:252): To specify a different number of shards, use command-line parameter 'nshards'
INFO: io.hpp(convert_matrixmarket:419): File /home/mburhan/workspace/MarketMatrix/datasource/test_mm was already preprocessed, won't do it again.
INFO: io.hpp(read_global_mean:112): Opened matrix size: 4 x 4 edges: 7 Global mean is: 4.14286 time bins: 0 Now creating shards.
FATAL: itemcf.cpp(main:452): This application currently supports only 1 shard
terminate called after throwing an instance of 'char const*'
Aborted (core dumped)

ReplyDelete
Replies
UnknownFebruary 13, 2013 at 3:13 AM
Hi Danny,

I exectued the algorithms (Jaccard index, AA, RA, Aiollo) with my dataset and have got great results.
I am currently trying to use cosine distance using the following command:
./toolkits/collaborative_filtering/itemcf2 --training=input_notP_5 --distance=4

This results in the following output on the terminal.

[training] => [input_notP_5]
[distance] => [4]
INFO: chifilenames.hpp(find_shards:258): Detected number of shards: 2
INFO: chifilenames.hpp(find_shards:259): To specify a different number of shards, use command-line parameter 'nshards'
INFO: io.hpp(convert_matrixmarket:425): File input_notP_5 was already preprocessed, won't do it again.
INFO: io.hpp(read_global_mean:112): Opened matrix size: 511288 x 274815 edges: 1044265 Global mean is: 4 time bins: 0 Now creating shards.
DEBUG: stripedio.hpp(stripedio:201): Start io-manager with 2 threads.
INFO: graphchi_engine.hpp(graphchi_engine:150): Initializing graphchi_engine. This engine expects 4-byte edge data.
INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 0 - 600855
INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 600856 - 786102
INFO: graphchi_engine.hpp(run:673): GraphChi starting
INFO: graphchi_engine.hpp(run:674): Licensed under the Apache License 2.0
INFO: graphchi_engine.hpp(run:675): Copyright Aapo Kyrola et al., Carnegie Mellon University (2012)
DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088552, input_notP_5.edata.e4B.0_2sizeof(ET): 4
DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088508, input_notP_5.edata.e4B.1_2sizeof(ET): 4
INFO: graphchi_engine.hpp(print_config:125): Engine configuration:
INFO: graphchi_engine.hpp(print_config:126): exec_threads = 1
INFO: graphchi_engine.hpp(print_config:127): load_threads = 4
INFO: graphchi_engine.hpp(print_config:128): membudget_mb = 800
INFO: graphchi_engine.hpp(print_config:129): blocksize = 4194304
INFO: graphchi_engine.hpp(print_config:130): scheduler = 1
INFO: graphchi_engine.hpp(run:706): Start iteration: 0
INFO: graphchi_engine.hpp(run:760): 0.029119s: Starting: 0 -- 600855
INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 0 - 600855
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates
INFO: graphchi_engine.hpp(run:809): Finished updates
INFO: graphchi_engine.hpp(run:827): Commit memshard
INFO: graphchi_engine.hpp(run:760): 0.207832s: Starting: 600856 -- 786102
INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 600856 - 786102
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485801 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates
INFO: graphchi_engine.hpp(run:809): Finished updates
INFO: graphchi_engine.hpp(run:827): Commit memshard
INFO: graphchi_engine.hpp(run:706): Start iteration: 1
INFO: graphchi_engine.hpp(run:760): 0.322075s: Starting: 0 -- 600855
INFO: graphchi_engine.hpp(run:773): Iteration 1/5, subinterval: 0 - 600855
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates

The execution does not proceeds further and stays at "Start Updates" step. Could you please let me know what I have missed.

Thanks,
Manu
ReplyDelete
Replies
UnknownOctober 15, 2017 at 9:13 PM
Hi Danny,

I am a little bit confused about the cosine similarity choice.

According to the command line argument, "FOR itemcf2: --distance=XX, 3 = PEARSON, 4=COSINE, " itemcf2: --distance=4 means cosine similarity, but in the reference page, cosine distance are in itemcf, not in itemcf2.

I tried with itemcf2, --distance=4 command, which give me similarity value greater than 1, I assume the cosine similarity value should between -1 and 1, right?

Could you help me figure out which --distance index should I choose?

Thanks,
Fen
ReplyDelete
Replies

Add comment

Large Scale Machine Learning and Other Animals

Tuesday, October 2, 2012

Item based similarity with GraphChi

How to try it out:

Additional metrics:

Adamic/Adar (AA) (itemcf):

Pearson Correlation (itemcf)

Cosine Distance (itemcf)

Manhattan Distance (itemcf2)

Log Similarity Distance (itemcf2)

Chebychev Distance (itemcf2)

Tanimoto Distance (itemcf2)

Aiolli's method (itemcf)

Slope One Recommender (itemcf2)

How to compute recommendations out of similarities?

Additional reading:

Acknowledgements

24 comments:

Labels

GraphLab Users Google Group

pagerank

google analytics

syntax