Wednesday, September 21, 2011

Giraph Machine Learning Project - Setting up

Giraph is a relatively new large-scale machine learning project, currently at the incubation stage under Apache. It is the only open source implementation of Google's Pregel (BSP = Bulk Synchronous Parallel) framework that I am aware of.

I got the following instructions from my colleague and friend Aapo Kyrola:

1. INSTALL HADOOP: Must be version 0.20.203 or later.
- This is simple, just download and extract.

2. Set HADOOP_HOME variable to point to the hadoop directory.

3. Set the Hadoop configuration (under HADOOP_HOME/conf) according to what is explained here.

* NOTE: set the HDFS directory appropriately, via the hadoop.tmp.dir property in core-site.xml
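
Steps 2 and 3 can be sketched as follows. This is a minimal example under assumptions: the install path and the HDFS scratch directory are hypothetical, and `hadoop_tmp_property` is a made-up helper that only prints the XML element to paste inside `<configuration>` in core-site.xml.

```shell
# Hypothetical locations; adjust both to your own machine.
export HADOOP_HOME="$HOME/hadoop-0.20.203.0"
HDFS_TMP_DIR="$HOME/hadoop-tmp"

# Print the property element that sets hadoop.tmp.dir to the given directory.
# The output is meant to be pasted into core-site.xml by hand.
hadoop_tmp_property() {
  cat <<EOF
<property>
  <name>hadoop.tmp.dir</name>
  <value>$1</value>
</property>
EOF
}

hadoop_tmp_property "$HDFS_TMP_DIR"
```

Printing the element instead of editing the file keeps the sketch safe to run; the rest of core-site.xml still has to follow the linked instructions.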

3.5 Start Hadoop:
bin/start-all.sh

4. Install ZooKeeper
- just download and extract

5. Configure conf/zoo.cfg properly. (Just copy the sample config and change to sensible parameters).
- set clientPort=22181

6. Start up ZooKeeper:
bin/zkServer.sh start
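
Step 5 can be sketched as follows. This is a minimal example, assuming the ZooKeeper distribution ships a conf/zoo_sample.cfg containing a clientPort line and that only the port needs to change. The `ZOOKEEPER_HOME` variable and the `set_client_port` helper are names made up for this illustration.

```shell
# Hypothetical install location; adjust to where ZooKeeper was extracted.
ZOOKEEPER_HOME="${ZOOKEEPER_HOME:-$HOME/zookeeper}"

# Copy a sample config to a new file, rewriting the clientPort line.
# Usage: set_client_port <sample.cfg> <zoo.cfg> <port>
set_client_port() {
  sed "s/^clientPort=.*/clientPort=$3/" "$1" > "$2"
}

# Only write the config if the sample file is actually there.
if [ -f "$ZOOKEEPER_HOME/conf/zoo_sample.cfg" ]; then
  set_client_port "$ZOOKEEPER_HOME/conf/zoo_sample.cfg" \
                  "$ZOOKEEPER_HOME/conf/zoo.cfg" 22181
  # Show the resulting line as a quick check.
  grep '^clientPort=' "$ZOOKEEPER_HOME/conf/zoo.cfg"
fi
```

The other setting worth checking in zoo.cfg is dataDir, which should point to a directory you can write to.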

7. Install and build Giraph as explained at the end of this page:
http://incubator.apache.org/giraph/

8. In HADOOP_HOME, run PageRank:
bin/hadoop jar ../../GraphLab/giraph/giraph/trunk/target/giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 100 -s 5 -V 10000 -w 1 -v

If everything went OK you will get:
11/09/19 18:23:20 INFO mapred.JobClient:   Giraph Timers
11/09/19 18:23:20 INFO mapred.JobClient:     Total (milliseconds)=260128
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 3 (milliseconds)=54578
11/09/19 18:23:20 INFO mapred.JobClient:     Setup (milliseconds)=2771
11/09/19 18:23:20 INFO mapred.JobClient:     Shutdown (milliseconds)=92
11/09/19 18:23:20 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=2386
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 0 (milliseconds)=8059
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 4 (milliseconds)=70263
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 5 (milliseconds)=1879
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 2 (milliseconds)=66531
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep 1 (milliseconds)=53564
11/09/19 18:23:20 INFO mapred.JobClient:   Giraph Stats
11/09/19 18:23:20 INFO mapred.JobClient:     Aggregate edges=1000000
11/09/19 18:23:20 INFO mapred.JobClient:     Superstep=6
11/09/19 18:23:20 INFO mapred.JobClient:     Current workers=1
11/09/19 18:23:20 INFO mapred.JobClient:     Current master task partition=0
11/09/19 18:23:20 INFO mapred.JobClient:     Sent messages=0
11/09/19 18:23:20 INFO mapred.JobClient:     Aggregate finished vertices=10000
11/09/19 18:23:20 INFO mapred.JobClient:     Aggregate vertices=10000
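
The flags in step 8 line up with the statistics above. Here is a minimal sketch, assuming -V is the total number of vertices, -e the number of edges per vertex, -s the number of supersteps and -w the number of workers. This reading is an assumption, but it is consistent with the output: 10000 vertices times 100 edges gives the reported 1000000 aggregate edges. The variable and function names are made up for the example, and the script only prints the command rather than running it.

```shell
# Jar path as used in step 8; override GIRAPH_JAR to match your own build.
GIRAPH_JAR="${GIRAPH_JAR:-../../GraphLab/giraph/giraph/trunk/target/giraph-0.70-jar-with-dependencies.jar}"

VERTICES=10000        # -V: assumed to be the total number of vertices
EDGES_PER_VERTEX=100  # -e: assumed to be the number of edges per vertex
SUPERSTEPS=5          # -s: assumed to control the number of supersteps
WORKERS=1             # -w: assumed to be the number of workers

# Expected "Aggregate edges" value under the assumptions above.
expected_edges() {
  echo $(( $1 * $2 ))
}

# Build the command line from step 8 without running it.
pagerank_cmd() {
  echo "bin/hadoop jar $GIRAPH_JAR org.apache.giraph.benchmark.PageRankBenchmark" \
       "-e $EDGES_PER_VERTEX -s $SUPERSTEPS -V $VERTICES -w $WORKERS -v"
}

pagerank_cmd
echo "expected aggregate edges: $(expected_edges "$VERTICES" "$EDGES_PER_VERTEX")"
```

Once the printed command looks right, run it from HADOOP_HOME and compare the "Aggregate edges" and "Aggregate vertices" lines against the values you chose.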

Anyway, Aapo has a great Nordic sense of humor. This is what he sent me later:
For your convenience, I have pasted the documentation of Giraph to this email.

-- Begin --
-- End --



Additionally, a quick start document is available here:
https://github.com/aching/Giraph/wiki/Quick-Start-Guide

4 comments:

  1. I have a question: how do I set clientPort=22181?

    Thank you so much!

  2. HAMA is another open source implementation of BSP.

    1. Thanks for your note. I did not try it out, but I heard it is not stable yet.

  3. This is great -- are there any examples in Python? I understand that there is some way to interface Giraph through jython, but I do not know how.
