Comments on Large Scale Machine Learning and Other Animals: Item based similarity with GraphChi
Danny Bickson (24 comments, feed last updated 2024-03-21T04:14:27.443-07:00)

[2017-10-21T09:42:33.414-07:00]
If your edges are not binary (0 or 1) you may have values outside the range.
Posted by Danny Bickson

[2017-10-15T21:13:49.299-07:00]
Hi Danny,
I am a little confused about the cosine similarity choice.

According to the command-line help ("FOR itemcf2: --distance=XX, 3 = PEARSON, 4=COSINE, "), itemcf2 with --distance=4 means cosine similarity, but on the reference page cosine distance is listed under itemcf, not itemcf2.

I tried itemcf2 with --distance=4, which gave me similarity values greater than 1. I assume cosine similarity values should be between -1 and 1, right?

Could you help me figure out which --distance index I should choose?

Thanks,
Fen
Posted by Anonymous

[2013-02-13T08:57:19.939-08:00]
I am glad to hear it worked! Comparing all pairs may be a slow task; it depends on how many items you have and on the sparsity pattern of the user-item ratings.
Posted by Danny Bickson

[2013-02-13T08:55:35.133-08:00]
Hi Danny,
It worked. I am using Ubuntu. I used the following command:

./toolkits/collaborative_filtering/itemcf2 --training=input_notP_7 --distance=6 --K=3 --min_allowed_intersection=3

The execution spent 10 minutes at the following step:
INFO: graphchi_engine.hpp(run:799): Start updates

It then proceeded with the comparison of pairs.

Thanks,
Manu
Posted by Anonymous

[2013-02-13T03:59:27.297-08:00]
You should run with --nshards=1
Posted by Danny Bickson

[2013-02-13T03:55:57.031-08:00]
Hi Manu,
This is strange. If you like, send me a sample dataset where this problem happens and I will take a look at it. Which OS are you running on?
Posted by Danny Bickson

[2013-02-13T03:13:38.624-08:00]
Hi Danny,
I executed the algorithms (Jaccard index, AA, RA, Aiolli) with my dataset and got great results.
I am currently trying to use cosine distance with the following command:
./toolkits/collaborative_filtering/itemcf2 --training=input_notP_5 --distance=4

This results in the following output on the terminal.

[training] => [input_notP_5]
[distance] => [4]
INFO: chifilenames.hpp(find_shards:258): Detected number of shards: 2
INFO: chifilenames.hpp(find_shards:259): To specify a different number of shards, use command-line parameter 'nshards'
INFO: io.hpp(convert_matrixmarket:425): File input_notP_5 was already preprocessed, won't do it again.
INFO: io.hpp(read_global_mean:112): Opened matrix size: 511288 x 274815 edges: 1044265 Global mean is: 4 time bins: 0 Now creating shards.
DEBUG: stripedio.hpp(stripedio:201): Start io-manager with 2 threads.
INFO: graphchi_engine.hpp(graphchi_engine:150): Initializing graphchi_engine. This engine expects 4-byte edge data.
INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 0 - 600855
INFO: chifilenames.hpp(load_vertex_intervals:378): shard: 600856 - 786102
INFO: graphchi_engine.hpp(run:673): GraphChi starting
INFO: graphchi_engine.hpp(run:674): Licensed under the Apache License 2.0
INFO: graphchi_engine.hpp(run:675): Copyright Aapo Kyrola et al., Carnegie Mellon University (2012)
DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088552, input_notP_5.edata.e4B.0_2sizeof(ET): 4
DEBUG: slidingshard.hpp(sliding_shard:193): Total edge data size: 2088508, input_notP_5.edata.e4B.1_2sizeof(ET): 4
INFO: graphchi_engine.hpp(print_config:125): Engine configuration:
INFO: graphchi_engine.hpp(print_config:126): exec_threads = 1
INFO: graphchi_engine.hpp(print_config:127): load_threads = 4
INFO: graphchi_engine.hpp(print_config:128): membudget_mb = 800
INFO: graphchi_engine.hpp(print_config:129): blocksize = 4194304
INFO: graphchi_engine.hpp(print_config:130): scheduler = 1
INFO: graphchi_engine.hpp(run:706): Start iteration: 0
INFO: graphchi_engine.hpp(run:760): 0.029119s: Starting: 0 -- 600855
INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 0 - 600855
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates
INFO: graphchi_engine.hpp(run:809): Finished updates
INFO: graphchi_engine.hpp(run:827): Commit memshard
INFO: graphchi_engine.hpp(run:760): 0.207832s: Starting: 600856 -- 786102
INFO: graphchi_engine.hpp(run:773): Iteration 0/5, subinterval: 600856 - 786102
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485801 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates
INFO: graphchi_engine.hpp(run:809): Finished updates
INFO: graphchi_engine.hpp(run:827): Commit memshard
INFO: graphchi_engine.hpp(run:706): Start iteration: 1
INFO: graphchi_engine.hpp(run:760): 0.322075s: Starting: 0 -- 600855
INFO: graphchi_engine.hpp(run:773): Iteration 1/5, subinterval: 0 - 600855
DEBUG: memoryshard.hpp(load_edata:249): Compressed/full size: 0.00485887 number of blocks: 1
INFO: graphchi_engine.hpp(run:799): Start updates

The execution does not proceed any further and stays at the "Start Updates" step. Could you please let me know what I have missed?

Thanks,
Manu
Posted by Anonymous

[2013-01-28T06:03:19.972-08:00]
Dear Danny,
I have the Matrix Market file shown below. When I run the itemcf command, it throws an exception. Any idea what the issue could be?

%%MatrixMarket matrix coordinate real general
% Generated 28-Jan-2013
4 4 7
1 1 5
1 2 5
1 3 3
1 4 5
2 1 5
3 1 2
4 1 4

mburhan@mburhan-Vostro-3460:~/graphchi/toolkits/collaborative_filtering$ ./itemcf --training=/home/mburhan/workspace/MarketMatrix/datasource/test_mm
WARNING: common.hpp(print_copyright:144): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
[training] => [/home/mburhan/workspace/MarketMatrix/datasource/test_mm]
INFO: chifilenames.hpp(find_shards:251): Detected number of shards: 2
INFO: chifilenames.hpp(find_shards:252): To specify a different number of shards, use command-line parameter 'nshards'
INFO: io.hpp(convert_matrixmarket:419): File /home/mburhan/workspace/MarketMatrix/datasource/test_mm was already preprocessed, won't do it again.
INFO: io.hpp(read_global_mean:112): Opened matrix size: 4 x 4 edges: 7 Global mean is: 4.14286 time bins: 0 Now creating shards.
FATAL: itemcf.cpp(main:452): This application currently supports only 1 shard
terminate called after throwing an instance of 'char const*'
Aborted (core dumped)

Posted by Burhan

[2012-10-31T08:23:23.611-07:00]
Good one, it works great, Sir!!
Posted by Anonymous

[2012-10-16T05:17:24.707-07:00]
Hi Martin!
Thanks for your feedback. Based on your comments I have changed the terminology from cosine similarity to cosine distance and linked to a different explanatory URL. Please take a look and let me know if this is now clearer.

Best,

DB
Posted by Danny Bickson

[2012-10-16T02:20:14.626-07:00]
Hi, Danny!
I am a little confused about your code after running some tests that compute the similarity (cosine similarity and Pearson coefficient) of an item set with "itemcf2".
When computing cosine similarity, the output seems to be "item1 item2 Distance", because in distance.cpp the function "calc_cosine_distance" returns "1-dotprod / denominator". For the Pearson coefficient, however, the output is "item1 item2 Similarity". I think these outputs may be confusing...
Posted by Martin

[2012-10-03T13:26:46.855-07:00]
Yes, I would indeed!
Posted by Will Fitzgerald

[2012-10-02T13:22:44.111-07:00]
Already implemented... Let me know if you want to test it out.

Thanks,

DB
Posted by Danny Bickson

[2012-10-02T13:17:54.592-07:00]
Danny,
I do hope you'll implement cosine similarity and the other similarity functions.

Posted by Will Fitzgerald

[2012-10-02T10:55:16.466-07:00]
Danny,
You are right, it looks like the variant that Anmol refers to in his presentation is cosine similarity, but I am not sure how he sets the weights of the vectors or whether this is a special implementation. I think adding this to GraphChi would be a good contribution.

On the other hand, I would be glad to help you test evaluation metrics. We have 4 people working on RecSys at the iSchool at Pitt, and at least 2 of us (Sherry Sahebi and me) could collaborate with you on this task.

Thanks again.
Posted by Denis Parra

[2012-10-02T08:59:17.498-07:00]
Hi Denis,
Thanks for your interesting feedback.

Regarding locality-sensitive hashing: I am not sure which variant you mean. The web lecture I found, http://www.slideshare.net/anmolbhasin/beyond-ratings-andfollowers-recsys-2012, seems to refer to cosine similarity. It is rather straightforward to add it to GraphChi.

Regarding additional accuracy measures: it is possible to add support for more measures. If you are interested in helping me beta test them, I can add some of the measures you request.

Best,

DB
Posted by Danny Bickson

[2012-10-02T08:33:55.161-07:00]
Hi Danny,
At the last RecSys conference, Anmol Bhasin from LinkedIn explained that they use LSH (http://en.wikipedia.org/wiki/Locality-sensitive_hashing) for item similarity and clustering in high-dimensional data (which makes sense, since LinkedIn has many features per user). Do you think it is a good idea to implement this in GraphLab?

On the other hand, although you have implemented many recommender algorithms, many researchers need several measures to evaluate them (RMSE, Precision@n, nDCG, recall, MRR, AUC), and MyMediaLite has taken an important step in this area:

http://www.ismll.uni-hildesheim.de/mymedialite/examples/item_recommendation_datasets.html

Is there any plan to incorporate these measures into the recommender algorithms of GraphLab and/or GraphChi?

Thanks,
Posted by Denis Parra

[2012-09-11T03:04:53.455-07:00]
Hi Paul,
My advice is to use Mercurial, so that you do not need to reinstall: just run "hg pull; hg update" and then "make clean; make cf". The pull command gets the latest files from the repository. You can also download them manually and place them in the toolkits/collaborative_filtering folder, but this is less recommended since we sometimes fix things in the GraphChi engine itself or in other parts of the code.

Best,
Posted by Danny Bickson

[2012-09-11T03:01:02.797-07:00]
Hi Danny.
Thanks for the work. I was wondering: does one have to go through each install step every time you publish a new algorithm or measure, or can one just copy the new file to the toolkit folder?

Thanks!
Posted by Paul Lefkopoulos (http://www.fifty-five.com)

[2012-09-11T00:54:27.145-07:00]
Hi again,
Pearson correlation is now implemented in GraphChi. Please try it out by checking out from Mercurial (hg pull; hg update) and then recompiling (make clean; make cf). I have documented the usage here: http://bickson.blogspot.co.il/2012/09/item-based-similarity-with-graphchi.html#metrics
Posted by Danny Bickson

[2012-09-09T22:22:38.523-07:00]
Hi SillySnail (nice nickname by the way!)
I can implement Pearson correlation rather quickly for you (maybe even today), if you would like to be my beta tester and try it out.

Best,
Posted by Danny Bickson

[2012-09-09T21:56:19.201-07:00]
Hi Danny, nice work!
Can it calculate item similarity with Pearson's correlation?
I'm new to GraphChi and I'm wondering whether I could implement different similarity metrics myself.
Thanks.
Posted by SillySnail (http://www.sillysnail.com)

[2012-09-08T04:07:00.492-07:00]
Hi Senthil! Thanks for your note. I have fixed the typo (watched).

Regarding Python, we do not have a wrapper yet, but there is a GraphChi version in Java. We will add your request to our wish list...
Posted by Danny Bickson

[2012-09-08T03:40:53.075-07:00]
Danny, do you mean the number of users who _watched_ both movie i and movie j in the third line?

wi = number of users who watched movie i
wj = number of users who watched movie j
wij = number of users who wanted both movie i and movie j
Dij = wij / ( wi + wj - wij )

Also, at some point you mentioned Python interfaces to GraphChi. If that becomes available, I would be happy to jump in and beta test, become an early user, etc. Do you have any plans for that? I think that once you have a system where writing a new algorithm is like writing a plugin for Firefox or Chrome, that is the ideal situation, since we often run into cases that need slight modifications or workarounds based on ground realities.
Posted by Senthil (http://urbanravine.com)
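
The measure quoted in the last comment, Dij = wij / ( wi + wj - wij ), is the Jaccard index over the sets of users who watched each movie. The following is a minimal Python sketch of that formula only. It is not GraphChi's implementation, and the function name `jaccard_item_similarity` and the toy (user, item) data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_item_similarity(ratings):
    """Return {(i, j): D_ij} for every item pair with at least one common user.

    `ratings` is an iterable of (user, item) pairs. Rating values are not
    needed because the measure only counts who watched what.
    """
    # Build the set of users who watched each item.
    users_by_item = defaultdict(set)
    for user, item in ratings:
        users_by_item[item].add(user)

    similarities = {}
    for i, j in combinations(sorted(users_by_item), 2):
        w_i = len(users_by_item[i])                      # users who watched i
        w_j = len(users_by_item[j])                      # users who watched j
        w_ij = len(users_by_item[i] & users_by_item[j])  # users who watched both
        if w_ij:
            # The denominator is the size of the union of the two user sets.
            similarities[(i, j)] = w_ij / (w_i + w_j - w_ij)
    return similarities


# Hypothetical toy data: three users, three movies.
example = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
]
result = jaccard_item_similarity(example)
# A and B share 2 users out of 3 distinct ones, so D = 2/3.
# A and C share 1 user out of 3 distinct ones, so D = 1/3.
# B and C share no users, so the pair is omitted.
```

This sketch compares every pair of items directly, so its cost grows quadratically with the number of items. That is acceptable for a toy example, and it is in line with the remark earlier in the thread that comparing all pairs can be slow depending on the number of items and the sparsity of the ratings.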