Large Scale Machine Learning and Other Animals: GraphChi visual toolkit

Thursday, November 22, 2012

GraphChi visual toolkit - or understanding your data

A few weeks ago I wrote about the Orange D4D data on cellular user behavior in Africa.
The phone call data is given as a text file that looks like this:
20000 20003
20000 20005
20000 20008
20000 20011
20000 20012
1052 20000
20001 20006
20002 20009
20002 20010
1052 20002

Each line has the format:
[calling user] [receiving user]\n
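
To make the format concrete, here is a minimal Python sketch that reads such lines into (caller, receiver) pairs and counts calls per user. It is not part of the GraphChi toolkit; the function name parse_edge_list is made up for illustration, and the sample lines are the ones shown above.

```python
from collections import Counter

def parse_edge_list(lines):
    """Parse '[calling user] [receiving user]' lines into (caller, receiver) int pairs.

    Hypothetical helper for illustration only, not a GraphChi function.
    """
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        edges.append((int(parts[0]), int(parts[1])))
    return edges

# The sample lines from the post.
sample = """20000 20003
20000 20005
20000 20008
20000 20011
20000 20012
1052 20000
20001 20006
20002 20009
20002 20010
1052 20002
"""

edges = parse_edge_list(sample.splitlines())
calls_made = Counter(caller for caller, _ in edges)
calls_received = Counter(receiver for _, receiver in edges)

print(len(edges), "calls")
print("calls made by 20000:", calls_made[20000])
print("calls received by 20000:", calls_received[20000])
```

Even on this tiny sample, user 20000 stands out as the most active caller, which is the kind of structure the visualization is meant to expose.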

Since there are hundreds of thousands of phone calls, it is very hard to understand the actual network structure. I decided to write a quick visual tool that helps users examine their graphs and better understand their structure.

Here is how you can try it out:
1. Check out GraphChi from Mercurial using the instructions here.
2. # cd graphchi; bash install.sh; make parsers; make ga
3. # cd toolkits/visual
4. Run the visual toolkit to create a subgraph representation. You need to supply the graph input file name and the number of edges to extract. It is recommended to display fewer than 1000 edges, or else the plot may be slow.
# bash make_data.csv.sh -f [input graph name] -n [number of lines]
For example, you can use the sample graph provided:
# bash make_data.csv.sh -f `pwd`/sample_graph -n 1000
5. # firefox index.html

Here are some examples of the images I got when playing with the Orange data:

As you can see, different kinds of users emerge very clearly. The red nodes are the "seed" users from which the graph was traversed. Each edge is a phone call connection. We can see different kinds of users:
1) unsocial - rarely makes phone calls
2) small network - makes a few calls to neighbors
3) nagging - often calls call centers (highly connected neighbors)
4) social - connected to a lot of friends who are themselves interconnected
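
The four patterns above were spotted by eye, but they roughly correspond to simple graph statistics. The sketch below is one assumed way to quantify them, not something the toolkit computes: for a user it reports the degree, the largest degree among the user's neighbors, and the local clustering coefficient (the fraction of neighbor pairs that are themselves connected). The graph is treated as undirected, and the toy edges and the helper names build_adjacency, local_clustering and describe are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from (caller, receiver) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-calls
            adj[a].add(b)
            adj[b].add(a)
    return adj

def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are connected to each other."""
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

def describe(adj, node):
    """Return (degree, largest neighbor degree, local clustering coefficient)."""
    nbrs = adj.get(node, set())
    max_nbr_degree = max((len(adj[n]) for n in nbrs), default=0)
    return len(nbrs), max_nbr_degree, local_clustering(adj, node)

# Toy graph (made up): user 1 is "social", user 10 is "nagging",
# user 20 has a "small network".
toy_edges = [
    (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),   # tight group of friends
    (10, 100), (100, 101), (100, 102), (100, 103),
    (100, 104), (100, 105),                            # 100 acts like a call center
    (20, 21), (20, 22),                                # two unrelated contacts
]
adj = build_adjacency(toy_edges)

for user in (1, 10, 20):
    print(user, describe(adj, user))
```

In this toy graph a high clustering coefficient marks the social user, a low degree combined with a high-degree neighbor marks the nagging user, and a low degree with no interconnected neighbors marks the small network. Any thresholds for real data would have to be chosen by looking at the data.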

Next I tried the same visualization on some Twitter data I have. Each link is a tweet or retweet directed at a certain user.

Next I looked at some phone call data from a large European country. The graph captures a time span of only several minutes. It is interesting to see that from the gray node in the middle there is a 6-hop chain of someone who called someone who called someone, all within a very short time.

And here is a sample webpage which shows the output of the visualization.

Advanced features:
1) It is possible to traverse a graph starting from a set of seed nodes.
Use the -s XXX command line option, for example: -s 12
or -s 192,31990,2312

2) When selecting seed nodes, specify the number of hops to traverse using the -h XX option. For example, -h 3 will traverse 3 hops around the set of seed nodes.

3) If your input file is not in sparse Matrix Market format but in [from] [to] format, you need to specify an upper limit on the number of graph nodes using the -o XX option.
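
To illustrate what a seeded, hop-limited traversal means, here is a small Python sketch of a breadth-first search that keeps every edge reachable within a given number of hops from the seeds. This is only a sketch of the idea and not the toolkit's actual code: the function name extract_subgraph is hypothetical, the graph is assumed to be undirected, and the optional edge cap only loosely mirrors the -n option.

```python
from collections import defaultdict, deque

def extract_subgraph(edges, seeds, max_hops, max_edges=None):
    """Collect the edges within `max_hops` hops of any seed node (undirected BFS).

    Hypothetical helper for illustration; the real toolkit may traverse differently.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    dist = {s: 0 for s in seeds}   # hop distance from the nearest seed
    queue = deque(seeds)
    kept, seen = [], set()
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue               # do not expand beyond the hop limit
        for nbr in sorted(adj[node]):
            key = (min(node, nbr), max(node, nbr))
            if key not in seen:
                if max_edges is not None and len(kept) >= max_edges:
                    return kept    # edge budget exhausted
                seen.add(key)
                kept.append(key)
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return kept

# The sample call graph from the top of the post.
edges = [
    (20000, 20003), (20000, 20005), (20000, 20008), (20000, 20011),
    (20000, 20012), (1052, 20000), (20001, 20006), (20002, 20009),
    (20002, 20010), (1052, 20002),
]

# Seeds given the way the -s option takes them: a comma-separated list.
seeds = [int(s) for s in "20000".split(",")]

for hops in (1, 2, 3):
    print(hops, "hops:", len(extract_subgraph(edges, seeds, hops)), "edges")
```

Starting from user 20000, each extra hop pulls in more of the neighborhood, while the disconnected pair 20001 and 20006 is never reached. That is why a well-chosen seed and a small hop count give a readable plot.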

What does your data look like? I would love any feedback from people who are trying to visualize their own graphs. Let me know if you have any questions about the setup.

Credits: I am using the great d3.js package for performing the visualization. Thanks to Tyler Johnson, Shingo Takamatsu and Ali Bagheri Garakani from UW for teaching me how to deploy d3.js!
