In this blog post I explain how to set up a GraphChi Java development environment in Eclipse and run the alternating least squares (ALS) algorithm on a small subset of the Netflix data.
Depending on the level of user feedback I receive for this post, we will consider porting more methods to Java. So email me if you are interested in trying it out.
Preliminaries - setting up Maven
Download the Maven binary from: http://maven.apache.org/download.cgi
Extract the tgz file into /usr/local/apache-maven-3.0.4/
Set up the Maven environment:
export M2_HOME=/usr/local/apache-maven-3.0.4
export M2=$M2_HOME/bin
optional:
export MAVEN_OPTS="-Xms256m -Xmx512m"
Note: you must have a Java JDK installed.
Download and install Mercurial from:
http://mercurial.selenic.com/downloads/
Check out GraphChi-Java from:
http://code.google.com/p/graphchi-java/source/checkout
Download Eclipse Classic Juno from:
http://www.eclipse.org/downloads/index-developer.php?release=juno
Download the m2e Eclipse plugin from:
http://eclipse.org/m2e/download/
Install it by adding a new software site, as explained here: http://help.eclipse.org/juno/index.jsp?topic=//org.eclipse.platform.doc.user/tasks/tasks-127.htm
Eclipse -> Help -> Install New Software -> Work with: http://download.eclipse.org/technology/m2e/releases
Software name: m2e
Restart Eclipse.
Import the GraphChi Java project into Eclipse
Eclipse -> File -> Import -> Existing Maven Projects ->
Next -> Browse for the graphchi-java project (the path you checked out using Mercurial)
Project -> Build (remove the check mark on "Build Automatically" if present).
On the first compilation Maven will download some plugins.
Verify that the project compiler is set to Java 1.6: right-click the GraphChi Java project root -> Properties -> Java Compiler -> 1.6.
The project should now compile without errors.
Now run ALS with a subset of the Netflix data
Download the file smallnetflix_mm and put it in your project folder.
Right-click ALSMatrixFactorization.java -> Run As -> Run Configurations and add the command line arguments.
Also set the VM arguments to increase memory: GraphChi asks for at least 256 megabytes (-Xmx256m), and a higher value gives better performance.
Press the "Run" button.
A correct run should produce output like this:
9:54:25 AM ALS main - INFO: Found shards -- no need to preprocess
9:54:25 AM ALS main - INFO: Set latent factor dimension to: 5
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
9:54:26 AM engine run - INFO: :::::::: Using 4 execution threads :::::::::
9:54:26 AM ALS beginIteration - INFO: Initializing latent factors for 96576 vertices
Creating 1 blocks
9:54:26 AM engine run - INFO: 0.672s: iteration: 0, interval: 0 -- 96575
Tried to read past file: 0 --- 772608
9:54:26 AM engine run - INFO: Subinterval:: 0 -- 96575 (iteration 0)
9:54:26 AM engine run - INFO: Init vertices...
9:54:27 AM engine run - INFO: Loading...
9:54:27 AM engine run - INFO: Loading memshard started. pool-2-thread-1 id=11
9:54:27 AM engine run - INFO: Memshard: 0 -- 96575
9:54:27 AM engine run - INFO: Vertices length: 96576
9:54:27 AM memoryshard loadVertices - INFO: Load memory shard: 0 --- 96575
9:54:27 AM engine run - INFO: Loading memory-shard finished.pool-2-thread-1
9:54:27 AM engine run - INFO: Load took: 274ms
9:54:27 AM engine run - INFO: Update exec: 610 ms.
9:54:27 AM engine run - INFO: 1.793s: iteration: 1, interval: 0 -- 96575
9:54:27 AM engine run - INFO: Subinterval:: 0 -- 96575 (iteration 1)
9:54:27 AM engine run - INFO: Init vertices...
9:54:27 AM engine run - INFO: Loading...
9:54:27 AM engine run - INFO: Loading memshard started. pool-2-thread-2 id=16
9:54:27 AM engine run - INFO: Memshard: 0 -- 96575
9:54:27 AM engine run - INFO: Vertices length: 96576
9:54:27 AM memoryshard loadVertices - INFO: Load memory shard: 0 --- 96575
9:54:28 AM engine run - INFO: Loading memory-shard finished.pool-2-thread-2
9:54:28 AM engine run - INFO: Load took: 163ms
9:54:28 AM engine run - INFO: Update exec: 391 ms.
9:54:28 AM engine run - INFO: 2.422s: iteration: 2, interval: 0 -- 96575
9:54:28 AM engine run - INFO: Subinterval:: 0 -- 96575 (iteration 2)
9:54:28 AM engine run - INFO: Init vertices...
9:54:28 AM engine run - INFO: Loading...
9:54:28 AM engine run - INFO: Loading memshard started. pool-2-thread-3 id=17
9:54:28 AM engine run - INFO: Memshard: 0 -- 96575
9:54:28 AM engine run - INFO: Vertices length: 96576
9:54:28 AM memoryshard loadVertices - INFO: Load memory shard: 0 --- 96575
9:54:28 AM engine run - INFO: Loading memory-shard finished.pool-2-thread-3
9:54:28 AM engine run - INFO: Load took: 134ms
9:54:29 AM engine run - INFO: Update exec: 374 ms.
9:54:29 AM engine run - INFO: 2.997s: iteration: 3, interval: 0 -- 96575
9:54:29 AM engine run - INFO: Subinterval:: 0 -- 96575 (iteration 3)
9:54:29 AM engine run - INFO: Init vertices...
9:54:29 AM engine run - INFO: Loading...
9:54:29 AM engine run - INFO: Loading memshard started. pool-2-thread-4 id=18
9:54:29 AM engine run - INFO: Memshard: 0 -- 96575
9:54:29 AM engine run - INFO: Vertices length: 96576
9:54:29 AM memoryshard loadVertices - INFO: Load memory shard: 0 --- 96575
9:54:29 AM engine run - INFO: Loading memory-shard finished.pool-2-thread-4
9:54:29 AM engine run - INFO: Load took: 170ms
9:54:29 AM engine run - INFO: Update exec: 398 ms.
9:54:29 AM engine run - INFO: 3.5820000000000003s: iteration: 4, interval: 0 -- 96575
9:54:29 AM engine run - INFO: Subinterval:: 0 -- 96575 (iteration 4)
9:54:29 AM engine run - INFO: Init vertices...
9:54:29 AM engine run - INFO: Loading...
9:54:29 AM engine run - INFO: Loading memshard started. pool-2-thread-1 id=11
9:54:29 AM engine run - INFO: Memshard: 0 -- 96575
9:54:29 AM engine run - INFO: Vertices length: 96576
9:54:29 AM memoryshard loadVertices - INFO: Load memory shard: 0 --- 96575
9:54:29 AM engine run - INFO: Loading memory-shard finished.pool-2-thread-1
9:54:29 AM engine run - INFO: Load took: 117ms
9:54:30 AM engine run - INFO: Update exec: 505 ms.
9:54:30 AM engine run - INFO: Engine finished in: 4.2620000000000005 secs.
9:54:30 AM engine run - INFO: Updates: 482880
9:54:30 AM ALS main - INFO: Train RMSE: 0.7323246277805968, total edges:900817
9:54:31 AM ALS writeOutputMatrices - INFO: Latent factor matrices saved: /Users/bickson/Downloads/smallnetflix_mm_U.mm, /Users/bickson/Downloads/smallnetflix_mm_V.mm
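The "Train RMSE" line in the log is the root mean squared error between the observed ratings and the ratings predicted from the latent factors. As a rough illustration of what that number means, here is a minimal sketch that predicts each rating as the dot product of a user factor vector and an item factor vector and computes the RMSE over the training edges. This is not the GraphChi code: the class name, method names and array layout are hypothetical and chosen only for this example.

```java
/** Illustrative only: a generic RMSE computation, not GraphChi's implementation. */
final class RmseSketch {

    /** Dot product of two latent factor vectors of equal length. */
    static double dot(double[] u, double[] v) {
        double sum = 0.0;
        for (int k = 0; k < u.length; k++) {
            sum += u[k] * v[k];
        }
        return sum;
    }

    /**
     * Computes the training RMSE.
     * edges[i] = {userIndex, itemIndex}; ratings[i] is the observed rating on that edge.
     */
    static double rmse(double[][] userFactors, double[][] itemFactors,
                       int[][] edges, double[] ratings) {
        double squaredError = 0.0;
        for (int i = 0; i < edges.length; i++) {
            double predicted = dot(userFactors[edges[i][0]], itemFactors[edges[i][1]]);
            double error = ratings[i] - predicted;
            squaredError += error * error;
        }
        return Math.sqrt(squaredError / edges.length);
    }

    public static void main(String[] args) {
        // Two users and two items with latent dimension 2 (made-up numbers).
        double[][] users = { {1.0, 0.0}, {0.0, 2.0} };
        double[][] items = { {3.0, 1.0}, {1.0, 2.0} };
        int[][] edges = { {0, 0}, {1, 1} };
        double[] ratings = { 4.0, 4.0 };
        System.out.println("Train RMSE: " + rmse(users, items, edges, ratings));
    }
}
```

A lower value means the factors reproduce the training ratings more closely; it says nothing by itself about accuracy on unseen ratings.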
Known errors:
Exception in thread "main" java.io.FileNotFoundException: ~/Downloads/smallnetflix_mm.shovel.0 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
at edu.cmu.graphchi.preprocessing.FastSharder.<init>(FastSharder.java:113)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.createSharder(ALSMatrixFactorization.java:176)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:198)
Solution: give the full absolute path to your file, for example /home/bickson/Downloads/smallnetflix_mm. The ~ shortcut is expanded by the shell, not by Java, so it does not work in an Eclipse run configuration.
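If you would rather keep typing ~ in your arguments, the path can be expanded before it is handed to GraphChi. The following is only a sketch of that idea, using the standard user.home system property; the class and method names are hypothetical and are not part of GraphChi.

```java
import java.io.File;

/** Illustrative helper: expands a leading "~" because the JVM does not do it for you. */
final class HomePathExpander {

    /** Returns an absolute path, replacing a leading "~" or "~/" with the user's home directory. */
    static String expandHome(String path) {
        String home = System.getProperty("user.home");
        if (path.equals("~")) {
            return new File(home).getAbsolutePath();
        }
        if (path.startsWith("~/")) {
            return new File(home, path.substring(2)).getAbsolutePath();
        }
        // Anything else is resolved against the current working directory if it is relative.
        return new File(path).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println(expandHome("~/Downloads/smallnetflix_mm"));
    }
}
```

The sketch sticks to java.io.File so that it compiles at the Java 1.6 compiler level used in this post.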
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Java Virtual Machine has only 32489472bytes maximum memory. Please run the JVM with at least 256 megabytes of memory using -Xmx256m. For better performance, use higher value
at edu.cmu.graphchi.engine.GraphChiEngine.<init>(GraphChiEngine.java:120)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:215)
Solution:
Increase the virtual machine memory limit (the -Xmx VM argument) as explained above.
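To confirm that your VM arguments were actually picked up, you can print the maximum heap size the JVM reports. The sketch below uses the standard Runtime.getRuntime().maxMemory() call; the class name and the 256 MB threshold are assumptions taken from the error message above, not the exact check that GraphChi performs.

```java
/** Illustrative check of the JVM heap limit; the threshold is an assumption based on the error text. */
final class HeapCheck {

    // 256 megabytes, the minimum mentioned in the GraphChi error message.
    static final long MIN_BYTES = 256L * 1024 * 1024;

    /** Returns true if the given maximum heap size (in bytes) reaches the assumed minimum. */
    static boolean hasEnoughHeap(long maxBytes) {
        return maxBytes >= MIN_BYTES;
    }

    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("JVM maximum memory: " + (maxBytes / (1024 * 1024)) + " MB");
        if (!hasEnoughHeap(maxBytes)) {
            // maxMemory() can report somewhat less than the -Xmx value, so treat this as approximate.
            System.out.println("Heap looks small; try a larger -Xmx value in the VM arguments.");
        }
    }
}
```

Run it with the same run configuration as ALSMatrixFactorization so that it sees the same VM arguments.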
Hi Danny,
I was able to set up and run the Java version on Windows 7. It ran perfectly, although I have a slow 32-bit x86 machine with 2 GB of RAM.
Here is the final output:
12:29:15 AM engine run - INFO: Engine finished in: 70.42 secs.
12:29:15 AM engine run - INFO: Updates: 495445
12:29:15 AM ALS main - INFO: Train RMSE: 0.80984909266858, total edges:3298163
12:29:20 AM ALS writeOutputMatrices - INFO: Latent factor matrices saved: C:\ARVista01\data2012\MySoftwareProjects\DataScience\MachineLearning\CMU\GraphChi\Data\smallnetflix_mm.txt_U.mm, C:\ARVista01\data2012\MySoftwareProjects\DataScience\MachineLearning\CMU\GraphChi\Data\smallnetflix_mm.txt_V.mm
It would be nice if all the algorithms in the C++ version were ported to Java. Java has a much bigger audience.
I am willing to help with porting to Java and/or testing.
thanks
Al
Thanks, Al, for your kind note. We would love to get any help we can. Let me try to port an algorithm and have you help us test it.
Best,
Sure. I will be glad to help in any way I can. I can also participate in code review (maybe design and implantation). I have a strong S/W background (Java, C/C++, Python) on Windows (and, to a lesser degree, on Linux).
Hi,
Which Java SDK for Ubuntu do you recommend:
Oracle Java 6 (or 7) latest
OpenJDK 6 (or 7)
I am using Oracle Java 7 on Win 7
P.S.: In my last post "implantation" is a typo! I meant implementation!
Whichever works... :-)
Why does it take so much more time on my PC?
5:13:57 PM engine run - INFO: Engine finished in: 67.946 secs.
5:13:57 PM engine run - INFO: Updates: 495445
5:13:57 PM ALS main - INFO: Train RMSE: 0.804810381972627, total edges:3298163
as compared to Danny's:
9:54:30 AM engine run - INFO: Engine finished in: 4.2620000000000005 secs.
I am using Win 7 with 2 AMD cores at 1.67 GHz each and 2 GB of RAM.
I guess RAM is the determining factor?
Don't worry about it. When I ran it I did not notice that my input file was truncated, so it was about 1/4 of the right size. You should multiply my runtime by about 4 to get a number comparable to yours.
Actually, on a faster PC (Win 7 with 8 cores) using the whole file it ran in 7.115 secs, which is very close to your results!
3:18:21 PM memoryshard loadVertices - INFO: Load memory shard: 0 --- 98350
3:18:22 PM engine run - INFO: Loading memory-shard finished.pool-2-thread-1
3:18:22 PM engine run - INFO: Load took: 275ms
3:18:22 PM engine run - INFO: Update exec: 737 ms.
3:18:22 PM engine run - INFO: Subinterval:: 98351 -- 99088 (iteration 4)
3:18:22 PM engine run - INFO: Init vertices...
3:18:22 PM engine run - INFO: Loading...
3:18:22 PM engine run - INFO: Loading memshard started. pool-2-thread-2 id=19
3:18:22 PM engine run - INFO: Memshard: 98351 -- 99088
3:18:22 PM engine run - INFO: Vertices length: 738
3:18:22 PM memoryshard loadVertices - INFO: Load memory shard: 98351 --- 99088
3:18:22 PM engine run - INFO: Loading memory-shard finished.pool-2-thread-2
3:18:22 PM engine run - INFO: Load took: 85ms
3:18:22 PM engine run - INFO: Update exec: 74 ms.
3:18:22 PM engine run - INFO: Engine finished in: 7.115 secs.
3:18:22 PM engine run - INFO: Updates: 495445
3:18:22 PM ALS main - INFO: Train RMSE: 0.8100755498999634, total edges:3298163
3:18:23 PM ALS writeOutputMatrices - INFO: Latent factor matrices saved: C:\AARW701\data\AR\TOSH2013_01\SW\EWS\GraphChiJava\DataSets\smallnetflix_mm.txt_U.mm, C:\AARW701\data\AR\TOSH2013_01\SW\EWS\GraphChiJava\DataSets\smallnetflix_mm.txt_V.mm
Thanks for the update!
You're welcome. Is there any performance model as a function of the number of CPUs, the power of each, and the RAM?
More specifically, I use a PC with:
Operating System: Windows 7 Professional 64-bit (6.1, Build
Processor: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz (8 CPUs), ~2.2GHz
Memory: 8192MB RAM
Available OS Memory: 8100MB RAM
Page File: 2943MB used, 13254MB available
Thanks for the tutorial. I followed the steps but get an error when I run it.
Exception in thread "main" java.io.FileNotFoundException: /home/Data/smallnetflix_mm.txt.shovel.0 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at java.io.FileOutputStream.<init>(FileOutputStream.java:104)
at edu.cmu.graphchi.preprocessing.FastSharder.<init>(FastSharder.java:115)
at edu.cmu.graphchi.apps.SmokeTest.createSharder(SmokeTest.java:101)
at edu.cmu.graphchi.apps.SmokeTest.main(SmokeTest.java:125)
FastSharder tries to open a file that has not been created. Do you know what may be causing this error?
Can you please specify the full command line arguments you are using?
Thanks