Friday, February 22, 2013

MLBase

Here is an interesting post from Ben Lorica of O'Reilly about MLbase:
http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html

It is a proof-of-concept machine learning library on top of Spark, with a custom declarative language called MQL.

Wednesday, February 20, 2013

Literature survey of graph databases

I stumbled upon this tech report: Literature Survey of Graph Databases, by Bryan Thompson, Systap. It is a good survey of the different graph database platforms, and I especially liked its extensive coverage of the GraphChi framework:

GraphChi (IO Efficient Graph Mining) GraphChi (Kyrola, 2012) is an IO efficient graph mining system that is also designed to accept topology updates based on a Parallel Sliding Window (PSW) algorithm. Each iteration over the graph requires P^2 sequential reads and P^2 sequential writes. Because all IO is sequential, GraphChi may be used with traditional disk or SSD. The system is not designed to answer ad-hoc queries and is not a database in any traditional sense – the isolation semantics of GraphChi are entirely related to the Asynchronous Parallel (ASP) versus Bulk Synchronous Parallel (BSP) processing paradigms. GraphChi does not support either vertex attributes or link attributes. The basic data layout for GraphChi is a storage model that is key-range partitioned by the link target (O) and then stores the links in sorted order (SO). This design was chosen to permit IO efficient vertex programs where the graph was larger than main memory.

 While GraphChi does not support cluster-based process, the approach could be extended to a compute cluster. Because of the IO efficient design, the approach is of interest for out-of-core processing in hybrid CPU/GPU architectures. GraphChi operates by applying a utility program to split a data set into P partitions, where the user chooses the value of P with the intention that a single partition will fit entirely into main memory. The edges are assigned to partitions in order to create partitions with an equal #of edges – this provides load balancing and compensates for data skew in the graph (high cardinality vertices). GraphChi reads one partition of P (called the memory partition) into core. This provides the in-edges for all vertices in that partition. Because the partitions are divided into target vertex key-ranges, and because partitions are internally ordered by the source vertex, out-edges for those vertices are guaranteed to lie in a contiguous range of the remaining P-1 partitions. Those key-ranges may vary in their size since the #of out-edges for a vertex is not a constant. Thus, for the current memory partition, GraphChi performs 1 full partition scan plus P-1 partial partition scans.
 In addition to the edges (network structure), GraphChi maintains the transient graph computation state for each edge and vertex. The edge and vertex each have a user assignable label consisting of some (user-defined) fixed-length data type. The vertices also have a flag indicating whether or not they are scheduled in a given iteration. The edge state is presumably read with the out-edges, though perhaps from a separate file (the paper does not specify this). The vertex state is presumably read from a separate file (again, this is not specified). After the update() function has been applied to each vertex in the current memory partition, the transient graph state is written back out to the disk. This provides one more dimension of graph computation state that is persisted on disk, presumably in a column paired with the vertex state. 

If you would like to read the rest of the overview, as well as some proposed extensions, you should read the full paper. And of course, you can read about the collaborative filtering toolkit I am writing on top of GraphChi here.

An update: I just got a note from Bryan Thompson about additional useful resources Systap released:
Large Scale Graph Algorithms on the GPU (Yangzihao Wang and John Owens, UC Davis)
Graph Pattern Matching, Search, and OLAP (Dr. Xifeng Yan, UCSB)




Tuesday, February 19, 2013

Co-EM algorithm in GraphChi

Following the previous post about label propagation, as well as a request from a US-based startup to implement this method in GraphChi, I have decided to write a quick tutorial for the Co-EM algorithm.

Co-EM is a very simple algorithm, extensively utilized by Rosie Jones in her PhD thesis, and originally proposed by Nigam and Ghani (2000). The algorithm is used for clustering text entities into categories. Here is an example dataset (NPIC500) which explains the input format. The algorithm constructs a bipartite graph: the left nodes are noun phrases, the right nodes are sentence contexts, and the edge weight is the number of times a certain noun phrase appeared within that context. The algorithm itself is very simple (described on page 43 of Jones's PhD thesis):

As seen above, the noun nodes simply compute a weighted sum of their neighbors' label values. The context nodes compute the same weighted sum (if they are not seed nodes). Seed nodes are the initial graph nodes for which we have ground-truth labels.

The output is the probability for each noun phrase to belong to each of the categories.
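To make the update rule concrete, here is a minimal NumPy sketch of the Co-EM iteration on a toy bipartite graph. This is an illustration of the idea only - the toy counts, sizes and seed are made up, and the GraphChi implementation differs in its data layout:

import numpy as np

# Toy co-occurrence counts: 4 noun phrases x 3 contexts.
# W[n, c] = number of times noun phrase n appeared in context c.
W = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 0.],
              [0., 0., 2.]])
Y = np.full((4, 2), 0.5)            # P(category) per noun phrase, 2 categories
seed = np.array([True, False, False, False])
Y[seed] = [1.0, 0.0]                # noun 0 is a positive seed (e.g. a known city)

for _ in range(10):
    # Context nodes: weighted average of the labels of incident noun phrases.
    Y_ctx = (W.T @ Y) / W.sum(axis=0)[:, None]
    # Noun nodes: weighted average of the labels of incident contexts.
    Y = (W @ Y_ctx) / W.sum(axis=1)[:, None]
    Y[seed] = [1.0, 0.0]            # seeds stay clamped to their ground truth

print(Y)   # rows are per-noun category probabilities, like matrix.txt_U.mm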

Here is a more concrete illustration of the input file format (a toy example rather than the actual NPIC500 data):
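The training file is in sparse matrix market format; each row gives a noun-phrase id, a context id, and their co-occurrence count, shown here with the same toy counts as the sketch above (presumably nps.txt and contexts.txt map the integer ids to the actual noun-phrase and context strings):

%%MatrixMarket matrix coordinate real general
4 3 6
1 1 2
1 3 1
2 2 3
3 1 1
3 2 1
4 3 2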
Additionally, ground truth is given for the positive and negative seeds. For example, assume we have two categories (city / not city). The seed lists classify certain nouns into their matching categories.


$ head city-seeds.txt 
^New York$
^Boston$
^Pittsburgh$
^Los Angeles$
^Houston$
^Atlanta$
^London$

$ head city-neg-seeds.txt 
^people$
^the world$
^time$
^life$
^God$
^children$
^students$
^work$
^a number$
^women$

And here is how to try it out in GraphChi:
0) Install GraphChi as explained here, and compile using "make ta".
1) Download the file http://graphlab.org/downloads/datasets/coem.zip and unzip it in your root GraphChi folder.
2) In the root GraphChi folder run:

$ ./toolkits/text_analysis/coem --training=matrix.txt --nouns=nps.txt --contexts=contexts.txt --pos_seeds=city-seeds.txt --neg_seeds=city-neg-seeds.txt --D=1 

The output is generated in the file matrix.txt_U.mm:
$ cat matrix.txt_U.mm
%%MatrixMarket matrix array real general
%This file contains COEM output matrix U. In each row D probabilities for the Y labels
88322 1
4.081683456898e-01
4.162331819534e-01
4.119633436203e-01

The first three noun phrases each have a probability of around 0.4 of being a city.

Sunday, February 17, 2013

A few useful things to know about machine learning

I stumbled upon the following BigML blog post:

Recently, Professor Pedro Domingos, one of the top machine learning researchers in the world, wrote a great article in the Communications of the ACM entitled “A Few Useful Things to Know about Machine Learning“.  In it, he not only summarizes the general ideas in machine learning in fairly accessible terms, but he also manages to impart most of the things we’ve come to regard as common sense or folk wisdom in the field.

One thing you should worry about when applying machine learning to high-dimensional data is spurious random correlation. I got a great example from my friend and collaborator Erik Aurell:


Chocolate Consumption, Cognitive Function, and Nobel Laureates
Franz H. Messerli, M.D., N Engl J Med 2012; 367:1562-1564
Chocolate consumption could hypothetically improve cognitive function not only in individuals but in whole populations. Could there be a correlation between a country's level of chocolate consumption and its total number of Nobel laureates per capita? 

This hilarious graph shows a great correlation between chocolate consumption and Nobel prizes.
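It is easy to reproduce this effect in simulation. A minimal sketch, assuming nothing beyond NumPy - with only a couple dozen samples (country-level studies are in this regime) and thousands of candidate features, some feature will correlate strongly with a target of pure noise:

import numpy as np

rng = np.random.default_rng(0)
n, p = 23, 10000                      # few samples, many candidate features
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)            # target independent of every feature

# Sample correlation of each feature with y (the true correlation is zero).
Xc = (X - X.mean(0)) / X.std(0)
yc = (y - y.mean()) / y.std()
corrs = Xc.T @ yc / n

print("max |corr|:", np.abs(corrs).max())   # typically around 0.7-0.8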

Monday, February 11, 2013

Label propagation in GraphChi

A few days ago I got a request from Jidong, from the Chinese company Renren, to implement label propagation in GraphChi. The algorithm is very simple and is described here:
Zhu, Xiaojin, and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

The basic idea is that we start with a group of users for whom we have some information about the categories they are interested in. Following the weights in the social network, we propagate the label probabilities from the seed nodes (the ones we have label information about) into the general social network population. After several iterations, the algorithm converges and the output is labels for the unknown nodes.

Here is pseudocode for the algorithm: iterate Y <- TY, clamping the labeled rows after each step, until convergence.

Y is the matrix with probabilities for each label, and T is the original adjacency matrix, where the weights are first normalized to sum to one. Clamping means that for the labeled data the probabilities are fixed to the input probabilities.
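Here is a minimal NumPy sketch of that iteration, using the 5-node renren example below as the toy graph. The edge handling and initialization are illustrative only, so the numbers will differ somewhat from the toolkit's output shown further down:

import numpy as np

# Edges (source, target, weight) from the renren example, zero-based ids.
edges = [(0,1,1.), (1,0,3.), (2,3,1.), (3,4,1.), (0,4,2.), (4,1,1.)]
T = np.zeros((5, 5))
for i, j, w in edges:
    T[i, j] = w
P = T / T.sum(axis=1, keepdims=True)   # normalize outgoing weights to one

D = 4
Y = np.full((5, D), 1.0 / D)           # unlabeled nodes start uniform
seed_row = np.array([0.5, 0.3, 0.2, 0.0])
Y[0] = seed_row                        # node 1 is the labeled seed

for _ in range(50):
    Y = P @ Y                          # propagate labels along the edges
    Y[0] = seed_row                    # clamp the labeled node

print(np.round(Y, 4))                  # per-node label probabilities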

Label propagation is now a part of GraphChi graph analytics toolkit, which includes the following algorithms:
kcores - the k-cores decomposition algorithm
subgraph - cuts a subgraph following X hops around seed nodes / outputs node degrees
inmemconcomp - in-memory connected components

To run label propagation, prepare two input files:
1) The --training=filename file is in sparse matrix market format and describes the social graph (an adjacency matrix of size N x N).
2) Additionally, we need to provide a seed file (with the filename filename.seeds) which holds a sparse matrix of size N x D, where D is the number of possible categories.
For example, if node 1 is a seed node with 3 categories with probabilities p_1 = 0.3, p_2 = 0.4, p_3 = 0.3, we need to add the following inputs:
1 1 0.3
1 2 0.4
1 3 0.3

Here is an example training file for a network of size 5x5 (the file name in this example is renren):

%%MatrixMarket matrix coordinate real general
5 5 6
1 2 1 
2 1 3
3 4 1
4 5 1
1 5 2
5 2 1

Here is an example seed file (filename is renren.seeds)
%%MatrixMarket matrix coordinate real general
5 4 3
1 1 0.5
1 2 0.3
1 3 0.2

Here is an example run:
bickson@thrust:~/graphchi$ ./toolkits/graph_analytics/label_propagation --training=renren --D=4 --quiet=1
WARNING:  common.hpp(print_copyright:144): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[training] => [renren]
[D] => [4]
[quiet] => [1]

And here is the output of the run:
$ cat renren_U.mm 
%%MatrixMarket matrix array real general
%This file contains LP output matrix U. In each row D probabilities for the Y labels
5 4
5.000000000000e-01
3.000000119209e-01
2.000000029802e-01
0.000000000000e+00
4.999980032444e-01
2.999970912933e-01
2.000028789043e-01
2.033766122622e-06
4.296581745148e-01
6.834248453379e-02
2.388831526041e-01
2.631161808968e-01
4.296608269215e-01
6.834241002798e-02
2.388962060213e-01
2.631005644798e-01
5.000011920929e-01
2.999966442585e-01
1.999970227480e-01
5.159994998394e-06

It is easy to verify that the seed node's (node 1) probabilities were not changed, while the other nodes now have probabilities which originate in their connections to node 1.

An update: here is a note I got from Soundar Kumara, a professor at Penn State University:

It was nice to see your blog and reference to Label Propagation. Our algorithm (Raghavan, Albert and Kumara, Phys Reviews 2007) was improved further with game theoretic approach.
This is definitely interesting work. Thanks Soundar!

Next post: Co-EM algorithm in GraphChi




Sunday, February 10, 2013

Setting up Java GraphChi development environment - and running sample ALS

As you may know, our GraphChi collaborative filtering toolkit in C++ is becoming more and more popular. Recently, Aapo Kyrola made a great effort to port GraphChi from C++ to Java and to implement more methods on top of it.

In this blog post I explain how to set up a GraphChi Java development environment in Eclipse and run the alternating least squares (ALS) algorithm on a small subset of the Netflix data.
Based on the level of user feedback I receive for this blog post, we will consider porting more methods to Java. So email me if you are interested in trying it out.
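For readers new to ALS itself: the algorithm alternates between two ridge-regression solves, holding the item factors fixed while solving for the user factors and vice versa. Here is a minimal NumPy sketch of the idea on a dense toy matrix - the dimensions, regularization value, and dense setting are made up for illustration, and the Java toolkit works on sparse data:

import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, d, lam = 100, 80, 5, 0.065
R = rng.random((n_users, n_items))        # toy ratings, fully observed
U = rng.random((n_users, d))
V = rng.random((n_items, d))

for it in range(5):
    # Fix V, solve (V'V + lam*I) U' = V'R' for the user factors.
    U = np.linalg.solve(V.T @ V + lam * np.eye(d), V.T @ R.T).T
    # Fix U, solve (U'U + lam*I) V' = U'R for the item factors.
    V = np.linalg.solve(U.T @ U + lam * np.eye(d), U.T @ R).T
    print("iter", it, "train RMSE:", np.sqrt(np.mean((R - U @ V.T) ** 2)))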

Preliminaries - setting up Maven

Download the Maven binary from:
http://maven.apache.org/download.cgi

Extract the tgz file into /usr/local/apache-maven-3.0.4/

Set up the Maven environment:
export M2_HOME=/usr/local/apache-maven-3.0.4
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

optional:

export MAVEN_OPTS="-Xms256m -Xmx512m"

Note: you must have the Java JDK installed.

Download and install Mercurial from:

http://mercurial.selenic.com/downloads/


Checkout GraphChi-Java from:


http://code.google.com/p/graphchi-java/source/checkout


Download Eclipse Classic Juno from: 

http://www.eclipse.org/downloads/index-developer.php?release=juno



Download the m2e Eclipse plugin from:

http://eclipse.org/m2e/download/

by adding a new software site, as explained here: http://help.eclipse.org/juno/index.jsp?topic=//org.eclipse.platform.doc.user/tasks/tasks-127.htm

Eclipse -> Install -> Work with: http://download.eclipse.org/technology/m2e/releases
software name: m2e
Restart Eclipse.



Import the GraphChi Java project into Eclipse

Eclipse -> File -> Import -> Existing Maven Project ->
Next -> Browse for the graphchi-java project (the path you checked out using Mercurial)


Project -> Build (remove the check mark on Build Automatically if present).
At the first compilation Maven will download some plugins.

Verify that the project compiler is pointing to Java 1.6: right-click the GraphChi Java project root -> Properties -> Compiler -> 1.6.

Hopefully the project now compiles without errors.


Now run ALS with a subset of the Netflix data

Download the file: smallnetflix_mm and put it in your project folder.

Right-click ALSMatrixFactorization.java -> Run as.. -> Run configuration, and add the command line arguments:
namely the full path to the downloaded file, and the number of shards (1 in this case).

Also set the virtual machine arguments to increase memory (for example, -Xmx1024m).

Press the "Run" button.
A correct run should look like this:

9:54:25 AM ALS main - INFO:   Found shards -- no need to preprocess
9:54:25 AM ALS main - INFO:   Set latent factor dimension to: 5
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
9:54:26 AM engine run - INFO:   :::::::: Using 4 execution threads :::::::::
9:54:26 AM ALS beginIteration - INFO:   Initializing latent factors for 96576 vertices
Creating 1 blocks
9:54:26 AM engine run - INFO:   0.672s: iteration: 0, interval: 0 -- 96575
Tried to read past file: 0 --- 772608
9:54:26 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 0)
9:54:26 AM engine run - INFO:   Init vertices...
9:54:27 AM engine run - INFO:   Loading...
9:54:27 AM engine run - INFO:   Loading memshard started. pool-2-thread-1 id=11
9:54:27 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:27 AM engine run - INFO:   Vertices length: 96576
9:54:27 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:27 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-1
9:54:27 AM engine run - INFO:   Load took: 274ms
9:54:27 AM engine run - INFO:   Update exec: 610 ms.
9:54:27 AM engine run - INFO:   1.793s: iteration: 1, interval: 0 -- 96575
9:54:27 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 1)
9:54:27 AM engine run - INFO:   Init vertices...
9:54:27 AM engine run - INFO:   Loading...
9:54:27 AM engine run - INFO:   Loading memshard started. pool-2-thread-2 id=16
9:54:27 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:27 AM engine run - INFO:   Vertices length: 96576
9:54:27 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:28 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-2
9:54:28 AM engine run - INFO:   Load took: 163ms
9:54:28 AM engine run - INFO:   Update exec: 391 ms.
9:54:28 AM engine run - INFO:   2.422s: iteration: 2, interval: 0 -- 96575
9:54:28 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 2)
9:54:28 AM engine run - INFO:   Init vertices...
9:54:28 AM engine run - INFO:   Loading...
9:54:28 AM engine run - INFO:   Loading memshard started. pool-2-thread-3 id=17
9:54:28 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:28 AM engine run - INFO:   Vertices length: 96576
9:54:28 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:28 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-3
9:54:28 AM engine run - INFO:   Load took: 134ms
9:54:29 AM engine run - INFO:   Update exec: 374 ms.
9:54:29 AM engine run - INFO:   2.997s: iteration: 3, interval: 0 -- 96575
9:54:29 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 3)
9:54:29 AM engine run - INFO:   Init vertices...
9:54:29 AM engine run - INFO:   Loading...
9:54:29 AM engine run - INFO:   Loading memshard started. pool-2-thread-4 id=18
9:54:29 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:29 AM engine run - INFO:   Vertices length: 96576
9:54:29 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:29 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-4
9:54:29 AM engine run - INFO:   Load took: 170ms
9:54:29 AM engine run - INFO:   Update exec: 398 ms.
9:54:29 AM engine run - INFO:   3.5820000000000003s: iteration: 4, interval: 0 -- 96575
9:54:29 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 4)
9:54:29 AM engine run - INFO:   Init vertices...
9:54:29 AM engine run - INFO:   Loading...
9:54:29 AM engine run - INFO:   Loading memshard started. pool-2-thread-1 id=11
9:54:29 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:29 AM engine run - INFO:   Vertices length: 96576
9:54:29 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:29 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-1
9:54:29 AM engine run - INFO:   Load took: 117ms
9:54:30 AM engine run - INFO:   Update exec: 505 ms.
9:54:30 AM engine run - INFO:   Engine finished in: 4.2620000000000005 secs.
9:54:30 AM engine run - INFO:   Updates: 482880
9:54:30 AM ALS main - INFO:   Train RMSE: 0.7323246277805968, total edges:900817
9:54:31 AM ALS writeOutputMatrices - INFO:   Latent factor matrices saved: /Users/bickson/Downloads/smallnetflix_mm_U.mm, /Users/bickson/Downloads/smallnetflix_mm_V.mm



Known errors:

in thread "main" java.io.FileNotFoundException: ~/Downloads/smallnetflix_mm.shovel.0 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
at edu.cmu.graphchi.preprocessing.FastSharder.<init>(FastSharder.java:113)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.createSharder(ALSMatrixFactorization.java:176)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:198)
Solution: Give a full absolute path pointing to the location of your file, namely /home/bickson/Downloads/smallnetflix_mm etc.

Error:
Exception in thread "main" java.lang.IllegalArgumentException: Java Virtual Machine has only 32489472bytes maximum memory. Please run the JVM with at least 256 megabytes of memory using -Xmx256m. For better performance, use higher value
at edu.cmu.graphchi.engine.GraphChiEngine.<init>(GraphChiEngine.java:120)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:215)
Solution:
Increase virtual machine memory quota as explained on top.

Wednesday, February 6, 2013

GraphLab team 3rd place in Yandex WSDM contest

My never-resting mega collaborator Justin Yan just notified me that his team got 4th place in the Yandex WSDM 2013 contest:

As you can see, the team name is "GraphLab". Great job Justin & team!

And here is the paper describing our construction.

An update: I just heard from Justin that Google's winning solution was rejected since they are not willing to publish their method. So now we are in 3rd place!! Yay!!

Saturday, February 2, 2013

Case study: million songs dataset

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested that we implement rankings based on item similarity.

Thanks to Clive's suggestion, we now have an implementation of Fabio Aiolli's cost function as explained in the paper A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to utilize the GraphChi CF toolkit on the million songs dataset, for computing user recommendations out of item similarities.

Instructions for computing item to item similarities:

1) To obtain the dataset, download and extract this zip file.

2) Run createTrain.sh to download the million songs dataset and prepare GraphChi compatible format.
$ sh createTrain.sh
Note: this operation may take an hour or so to prepare the data.

3) Run GraphChi item-based collaborative filtering to find the top 500 similar items for each item:

./toolkits/collaborative_filtering/itemcf --training=train --K=500 --asym_cosine_alpha=0.15 --distance=3 --min_allowed_intersection=5
Explanation: --training points to the training file. --K=500 means we compute the top 500 similar items for each item.
--distance=3 selects Aiolli's metric. --min_allowed_intersection=5 means we take into account only item pairs that were rated together by at least 5 users.

Note: this operation requires around 20GB of memory and may take a few hours...
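For intuition about what --distance=3 with --asym_cosine_alpha=0.15 computes: Aiolli's asymmetric cosine scores an item pair by the number of users who played both items, normalized asymmetrically by each item's popularity. A rough NumPy sketch of the idea (the toolkit's exact implementation on sharded data differs):

import numpy as np

def asym_cosine(B, alpha=0.15, min_intersection=5):
    # B: binary users x items matrix; B[u, i] = 1 if user u played item i.
    co = B.T @ B                          # co[i, j] = # users with both i and j
    counts = np.diag(co).astype(float)    # counts[i] = # users with item i
    co = np.where(co >= min_intersection, co, 0)   # --min_allowed_intersection
    denom = np.outer(counts ** alpha, counts ** (1.0 - alpha))
    S = np.divide(co, denom, out=np.zeros_like(denom), where=denom > 0)
    np.fill_diagonal(S, 0.0)
    return S                              # keep top 500 per row for --K=500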

Create user recommendations based on item similarities:

1) Run itemsim2rating to compute recommendations based on item similarities
$ rm -fR train.* train-topk.*
$ ./toolkits/collaborative_filtering/itemsim2rating --training=train --similarity=train-topk --K=500 membudget_mb 50000 --nshards=1 --max_iter=2 --Q=3 --clean_cache=1
Note: this operation may require 20GB of RAM and may take a couple of hours based on your computer configuration.

Output file is: train-rec
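Roughly, itemsim2rating then scores each candidate item for a user by summing its similarities (raised to the locality exponent --Q) to the items the user already played. A hedged sketch, reusing the similarity matrix from the sketch above:

import numpy as np

def recommend(B, S, Q=3, topk=500):
    # B: binary users x items matrix, S: item-item similarity matrix.
    scores = B @ (S ** Q)            # aggregate similarity per (user, item)
    scores[B > 0] = -np.inf          # do not recommend already-played items
    return np.argsort(-scores, axis=1)[:, :topk]   # top-K item ids per user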

Evaluating the result

1) Prepare test data:
./toolkits/parsers/topk --training=test --K=500

Output file is: test.ids

2) Prepare training recommendations: 
./toolkits/parsers/topk --training=train-rec --K=500

Output file is: train-rec.ids

3) Compute mean average precision @ 500:
./toolkits/collaborative_filtering/metric_eval --training=train-rec.ids --test=test.ids --K=500
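For reference, mean average precision at K over the recommended and held-out lists is computed roughly as follows (a sketch of the metric, not metric_eval's actual code):

import numpy as np

def ap_at_k(predicted, actual, k=500):
    actual = set(actual)
    hits, score = 0, 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in actual:
            hits += 1
            score += hits / rank            # precision at this cut-off
    return score / min(len(actual), k) if actual else 0.0

def map_at_k(all_predicted, all_actual, k=500):
    return float(np.mean([ap_at_k(p, a, k)
                          for p, a in zip(all_predicted, all_actual)]))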

About performance: 

With the following settings: --min_allowed_intersection=5, K=500, Q=1, alpha=0.15 we get:
INFO:     metric_eval.cpp(eval_metrics:114): 7.48179 Finished evaluating 100000 instances. 
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.151431

With --min_allowed_intersection=1, K=2500, Q=1, alpha=0.15 we get:

INFO:     metric_eval.cpp(eval_metrics:114): 6.0811 Finished evaluating 100000 instances.
INFO:     metric_eval.cpp(eval_metrics:117): Computed AP@500 metric: 0.167994


Acknowledgements:

  • Clive Cox, RummbleLabs.com, for proposing to implement item-based recommendations in GraphChi, and for his support in the process of implementing this method.
  • Fabio Aiolli, University of Padova, winner of the million songs dataset contest, for his great support regarding the implementation of his metric.

Friday, February 1, 2013

Spotlight: Kaggle's RTA Challenge


I had an interesting talk with José P. González-Brenes, a 6th-year grad student from CMU's LTI department.
During the talk, I learned that Jose participated in Kaggle's RTA challenge and actually won 1st place out of more than 300 groups.

The challenge was to predict RTA highway travel times. The data consisted of recorded travel times of different cars over different highway segments. The winning solution (by Jose and Guido Matías Cortés) was a very simple method - a random forest. Unfortunately, no paper was published about it, but here is a blog post summarizing the solution method. And here is a link to their presentation. What is further interesting about the solution is that it consisted of just 90 lines of Matlab code!

The reason we actually talked is that Jose was recently trying out my GraphChi collaborative filtering code for his research, so I gave him some advice on which methods to use. Once he has some interesting results I hope he will update us!