Tuesday, January 31, 2012

Image segmentation in GraphLab

I got the following question from Doug Epps:


I'm interested in implementing "Random Walks for Image Segmentation" (Leo Grady) using graphlab. The author posts a matlab implementation, and I've ported that to python happily. But I'm interested in doing it for large graphs.

Do you have any pointers on how to implement random-walks in graphlab ? I've looked a bit at the linear solver docs and they seem more how to run matlab/graphlab together, rather than how to code a linear solver in the package.

Graphlab looks very well done, thanks for making a great library avail.


Doug also supplied me a link for this paper's matlab code.

I briefly read the mentioned paper, and it seems that the image segmentation is represented as a
linear system. You may ask what is the relation between a linear system and random-walks?
A good overview of this relation is given by Doyle and Snell in their seminal paper: Random Walks and Electrical Networks.

To verify this i have also taken a look at the attached Matlab code. The Matlab code requires Graph Analysis Toolbox. And indeed the Matlab code solves a 2D Dirichlet equations using a linear system solver. (Specific solver function is called dirichletboundary).

Regarding the specific application at hand, it seems like a pretty cool application so I thought to give some more details about it. In a nutshell, the paper proposes to perform segmentation as follows.
Few seeds or cluster heads are manually selected by the user. Each seed is the starting node of a random walk. The image is segmented by the result of the random walk - each pixel is assigned to the seed which will most likely get to it using a random walk on the image graph.

The figures below are created using Leo Grady's Matlab demo code. The first image is the original picture, where the green point marks the foreground color in the area we want to segment and the blue point is marks the background color. (Total of two segments).

The second image shows the computed answer. The black pixels are more likely
to be reached from the foreground seed.

The third image shows the boundary between the segments:

Random walk probabilities:

And now back to GraphLab: GraphLab has three implemented iterative solvers: Gaussian BP, Jacobi and Conjugate Gradient. The simplest way to operate those solvers is to create a text file in sparse matrix market format. My suggestion is to first create the linear system in Matlab and export it to sparse matrix market using the mmwrite.m script. First verify that the iterative solvers are working in the desired accuracy on smaller images.  After everything is working satisfactory, then it is also possible to prepare all the problem in GraphLab. If you are working in Python you can try our Java/Jython interface to GraphLab.

Saturday, January 28, 2012

Animal breeding with GraphLab?

Believe it or not, there is unlimited supply of real life interesting problems that can be solved usign GraphLab! Here is an email I got this morning from Asst. Prof. Gregor Gorjanc from the Biotechnical Faculty of University of Ljubljana:

I work in the area of animal breeding and genetics. The task of paramount importance in my field is the inference of so called breeding (genetic) values from phenotypic values and pedigree graph. These breeding values are used for selection of animals that will become parents of better future generations. We use linear mixed models of a type:

y | b, a, \sigma^2_e ~ N(Xb + Za, I\sigma^2_e)
b | B, \sigma^2_b    ~ N(0, B\sigma^2_b)
a | A, \sigma^2_a    ~ N(0, A\sigma^2_b)

where y - is a vector of known phenotypic values (body weight, ...), b - is a vector of unknown factors affecting phenotype (sex, age, ...), and a - is a vector of breeding values, while X and Z are design matrices linking b and a with y. Matrices B and A are known covariance (kernel) matrices - the later describes the genealogy between animals. We usually infer variance components (\sigma^2_i) and use point estimates in the above model as known quantities. Therefore, the inference of b and a basically involves solving a system of linear equations, where LHS is:

| X'X+B⁻1*beta X'Z           |
| Z'Z          Z+Z+A⁻1*alpha |,
with beta=\sigma^2_e / \sigma^2_b and alpha=\sigma^2_e / \sigma^2_a. The X'X+B⁻1 can be quite dense, but is often much much smaller than the Z+Z+A⁻1 part so LHS is in general very sparse. Both B⁻1 and A⁻1 are sparse as well, but do have some structure. The RHS is:

| X'y |
| Z'y |.

We can easily solve the above system and state of the art applications use the PCG (Preconditioned Conjugate Gradiaten) method due to shear size of data sets we deal with in animal breeding - small applications involve few thousands of animals (~number of unknowns), while nation-wide applications can easily involve hundreds of thousands of animals or more for large countries. Iterative methods only deliver as solutions posterior means of b and a, while posterior variances are not knwon as the inversion of LHS is to costly. Posterior covariances between a are not of high importance. I came across GaBP algorithm in the Graphical models book by Koller and Friedman and got very interested when I read that GaBP can deliver both, posterior means and variances for my model. Do you have any suggestions regarding my applications?

And here is my answer:
The first thing that comes to my mind, is that GaBP does not require a square matrix to operate, it can work on non-square matrices using the construction described here: equation (2).
Now, if the matrix is square, it should better be symmetric or else you will get reduced performance in terms of accuracy.
It seems that your matrix is square but not symmetric - I suggest playing a little with the equation and creating a non-square matrix as in the construction above.

My second advice is first try my Matlab GaBP code (available here) on smaller matrices just to verify that the algorithms converges and you are happy with accuracy. When you need a performance boost for larger matrices then GraphLab GaBP should be the right tool to use.

Regarding variance approximation - it is only an approximation, the values may be lower than the actual variance. I suggest again trying with smaller example in Matlab and trying to characterize the approximation error magnitude.

Q) You mention that GaBP delivers underestimated posterior variances for x. What is the rational behind this? I will surely test out with the exact case, but it would be great to get some theoretical insight into this.

A) In a nutshell: The matrix (inv(I-A)), can be expanded using an infinite sum, (inv(I-A)) = A + A^2 + A^3... . (When it is first normalized to have a unit diagonal). This can be viewed as an random walk on a markov chain, when summing routes of distances of size 1,2,3 ... See the JMLR paper of Jason K. Johnson et al.

GaBP is capturing only a part of those routes - the ones which start and ends in the same node. So the variance is an estimation to the total variance.

Friday, January 27, 2012

LexisNexis ML library - Advisory Panel: to PCA or not to PCA?

I am honored to report I was invited to participate in LexisNexis advisory panel of their machine learning library. The goal of this voluntary panel is to identify the most useful machine learning algorithms in practice and the best way to implement them. I agreed to serve in this panel so I could report here the interesting topics that arise in the industry when implementing a large scale machine learning solution.

And here is the first question I got from David Bayliss:
We know we want a PCA solution. Based upon our reading around it looks like QR decomposition is the way to go and that the best result for QR decomposition comes from householder reflections.

And here is my answer:
I suggest starting with SVD and not with PCA.
PCA has a drawback that you need to subtract the the mean value from the matrix. This makes sparse matrices non sparse and limits your PCA to relatively small models. (There may be way to workaround this but they need some extra thought). In fact, this topic is frequently brought up in Mahout's mailing list. Below is one example:

Ted Dunning added a comment - 27/Nov/11 06:50
When it comes to making a scalable PCA implementation for sparse data, you can't do the mean subtraction before the SVD. This is because the subtraction will turn the sparse matrix into a dense matrix. In many cases of interest in Mahout land, this results in a million fold increase in storage costs and a million^2 increase in compute costs.
For dense data, the subtraction doesn't make things any worse, but SVD in general isn't really feasible for really large dense matrices anyway.
Most of the SVD algorithms in Mahout can be reworked to deal with the mean subtraction for PCA implicitly instead of having to actually do the subtraction. As far as I know, that is the only way that you are going to get this to scale beyond data that fits in memory and likely the only way to get it to work well even for large data that does fit in memory.
On the other hand, SVD works very nicely on sparse matrices using the Lanczos algorithm.

And here is a great feedback I got from Nick Vasiloglou, ismion:
I read the blog about SVD versus PCA and I agree with Danny ... From my experience the most successful SVD method in terms of speed for sparse data is the one discussed here. It was recently adopted by Mahout. As a note the method most of the time works but it can fail if some conditions are not satisfied. The most stable and still fast enough is the one that uses Lanczos method or its variants. It requires more iterations but it is stable and accurate.

Tuesday, January 24, 2012

Hyperspectral imaging using GraphLab - updated

Here is what I got from Josh Ash, a research scientist who is working on some pretty cool stuff at Ohio State University. He is using GraphLab's Java API for speeding up the computation.

Hyperspectral imaging (HSI) sensors record images at hundreds of wavelengths in the electromagnetic spectrum and thus provide significantly more information about a scene than broadband or standard color (tri-band) imagery.  As such, HSI is well-suited for remotely classifying materials in a scene by comparing HSI-estimated material reflectances to a spectral library of material fingerprints.

Estimating intrinsic material reflectances from HSI measurements is, however, complicated by localized illumination differences in the scene resulting from in-scene shadows, and by global distortions resulting from radiative transfer through the atmosphere.  With the goal of increasing material classification performance, this work combines recent direct, diffuse, and ambient scattering predictors with statistical models imposing spatial structure in the scene.  Multiple interacting Markov random fields are utilized to capture spatial correlation among materials in the scene and localized distortions, such as shadows.  The inference engine, which dissects a hyperspectral data cube into a material map and a shadow map, toggles between performing loopy belief propagation to generate approximate posteriors on material classifications and performing expectation maximization updates needed in characterizing and inverting the atmospheric distortion.

Computational speed-up is achieved by performing multiple EM and BP updates in parallel on different portions of the associated factor graph.  The parallelization was implemented using the GraphLab framework.

Q) Can you provide a bit more details about GraphLab implementation?
How faster is it relative to Matlab? As I recall you are using our Java API. right?

Yep, using the Java API. Each node has an update function that either performs a BP factor update, a BP variable update, or an EM update--as appropriate for that node in the graph. The update functions, various data structures for different types of nodes and edges, and the graph generation were all done in Java.

Speed-up relative to Matlab is difficult to judge because I don't have apples-to-apples implementations. Basically, at one point in the project I recoded everything from Matlab to GraphLab/Java . At that point I got 160x speedup (Matlab core2 duo 2.66Ghz vs GraphLab 8 core I7 2.93GHz); however the Matlab code was NOT optimized and could probably be improved 5-10x.

Q) Can you give some more details about problem size. How many pixels are there? How many nodes in the factor graph?

Right now I'm working with toy 100x100 = 10k pixel images. There are 30k nodes in the graph with 130k (unidirectional) edges. In the future, I will be pushing these sizes significantly as I make the model more general. These sizes could grow by a couple orders of magnitude.

Q) What is the machine you are using - which os? how many cores? how much memory?

I use two computers

1) Red Hat Enterprise Linux Workstation release 6.2
8 core I7
model name : Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
cpu MHz : 1199.000
8GB ram

2) Red Hat Enterprise Linux Server release 5.5 (Tikanga)
16 core opteron
model name : Quad-Core AMD Opteron(tm) Processor 8378
cpu MHz : 2400.141
64GB ram

In the short future, I hope to be able to get access to a 512 core shared-memory machine.

Q) What additional features would you like to have in GraphLab?

Two features that would be useful: 1) Java API for distributed (not shared-memory) environment, 2) Implement the 'sync' operation for the Java API.

Monday, January 23, 2012

ismion overview of GraphLab

Here is a quick review of GraphLab as found in the ismion website. While I don't agree with the conclusion, it is interesting to see how GraphLab is rated by an objective third party.

If you like to learn more about Nick and ismion you can read my previous blog post.

Sunday, January 22, 2012

Intel + GraphLab collaboration

I am glad to announce that the GraphLab project is starting to collaborate with Intel. Dr. Theodore Willke, lead research scientist for the Cluster Computing Architecture team in Intel Labs, will head this effort from Intel's side. The goal of this research project is to optimize the communication layer performance of a high performance computing software over commodity Intel clusters. And you guessed right! GraphLab is the demo application we are going to use.

Friday, January 20, 2012

GraphLab input via Greenplum SQL

I got some great tips from Brian Dolan, Founder and VP product of Discovix Inc, a cool California area startup specializing in building machine learning and mathematical products.

Brian asked me a few days ago if we thought about interfacing GraphLab to some commonly used databases, since usually there is where the client data is found. Since I have no green clue in databses, Brian volunteered to teach me (and you my blog reader!) how to do it.

And here is the tutorial as I got it from Brian. First we create table with fake data
to be used in the example:

-- Create a table to hold some bogus data
DROP TABLE IF EXISTS glab.bs_data;
CREATE TABLE glab.bs_data
, vl numeric DEFAULT NULL

-- Now create some bogus data.  I use a few postgres tricks.
-- Ultimately, this just creates a relatively sparse "matrix" in tall format
-- I'm trying to use best practices in formatting and stuff
INSERT INTO glab.bs_data (rw, cl, vl)
, y
, random() AS vl
    CASE WHEN random() < 0.1 THEN x ELSE NULL END AS x
  , CASE WHEN random() < 0.1 THEN y ELSE NULL END AS y
    generate_series(1, 1000) AS x
  , generate_series(1, 1000) AS y
  ) AS A
After the table is ready we can push it into a file:
-- So we have some data.  Let's create an external table
-- to push it OUT of the database
-- You need to run gpfdist to write to an external table.
-- In this example, from the directory /Users/gpadmin/glab/tables the command was
-- > gpfdist 
  i bigint
, j bigint
, v numeric
LOCATION ('gpfdist://macbuddha.local/glab_out.txt')

-- Right, now let's put the data in it.  This is two queries unioned so
-- we can create the first row.  Best to do them separately.
INSERT INTO glab.ext_out
  max(rw) AS i
, max(cl) AS j
, count(*) AS v
FROM glab.bs_data;

INSERT INTO glab.ext_out
  rw AS i
, cl AS j
, vl AS v
FROM  glab.bs_data
And here is the example output glab_out.txt:
Now it is very easy to read it in GraphLab. In Matlab/Octave do:
load glab_out.txt; % load data to memory
A = sparse(glab_out(2:end,1), glab_out(2:end,2), glab_out(2:end,3), glab_out(1,1), glab_out(1,2), glab_out(1,3)); % create a sparse matrix
mmwrite('A', A);  % save matrix to file. You may need to donload the script mmwrite.m from http://graphlab.org/mmwrite.m

Since this is potentially useful, I am going to have support for this data format soon in GraphLab, without the matlab conversion.

Next, Brian promised to teach me how to export data from GraphLab back to SQL.

Installing GraphLab on Free BSD 8.2

0) Verify that bash installed on the machine by typing the command "which bash".
If it is not installed, install it using the directions here.

1a) Install graphlab from mercurial
Go to graphlab download page, and follow the download link to the mercurial repository.
copy the command string: "hg clone..." and execute it in your ubuntu shell.

or 1b) Install graphlab from tgz file
Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

2) configure and compile
cd graphlabapi
./configure --bootstrap 
cd release/
make -j4

3) Test Graphlab
cd tests
bash ./runtests.sh

Common errors:
[daniel51@pe-00 ~/graphlabapi]$ ./configure --bootstrap
Tue Jan 17 17:50:14 IST 2012
function: not found
Usage: ./configure [--bootstrap] [--force_cmake_install]
                   [--force_boost_install] [--prefix=PREFIX]

edit the file configure, and change /bin/sh to the location of your bash command line interpreter. (You can find it using "which bash" command.)

In file included from /cs/alum/daniel51/graphlabapi/src/graphlab/extern/metis/GKlib/GKlib.h:23,
                 from /cs/alum/daniel51/graphlabapi/src/graphlab/extern/metis/GKlib/b64.c:14:
/cs/alum/daniel51/graphlabapi/src/graphlab/extern/metis/GKlib/gk_arch.h:26:22: error: execinfo.h: No such file or directory
*** Error code 1

Solution: install libexecinfo.
Add the following line:
Into CMakeLists.txt file, above (cmake_minimum_required).  Reconfigure and recompile.

Linking CXX executable anytests
../src/graphlab/libgraphlab.a(assertions.o)(.text+0x11): In function `__print_back_trace()':
/cs/alum/daniel51/graphlabapi/src/graphlab/logger/assertions.cpp:35: undefined reference to `backtrace'
../src/graphlab/libgraphlab.a(assertions.o)(.text+0x20):/cs/alum/daniel51/graphlabapi/src/graphlab/logger/assertions.cpp:36: undefined reference to `backtrace_symbols_fd'
*** Error code 1

Add the following lines:
Into the file CMakeLists.txt above (cmake_minimum_required). Reconfigure and recompile.

Wednesday, January 18, 2012

LexisNexis ML library

A couple of days ago, LexisNexis released a machine learning library on top of the HPCC platform.

I asked David Alan Bayliss who is the head of this effort, a few questions about the library.

Q) What are the main design goals of your ML library?
  • Scalability
  • Robustness
  • Ease of Use

Q) What is the target user base of your library?

The data scientist; someone that wants to be able to perform machine learning exercises with minimal programming - quickly

Q) In your list I did not see performance. Do you target scalability, robustness, and ease of use first?

Well the ML library is coded in ECL. ECL is a language I can talk about for days (it is my baby) – but essentially it is responsible for handling the mapping from the algorithm down onto the machine. Thus; in a way performance is our zeroeth priority; it trumps all – but it is really implicit in our choice of platform. Keren can give you numbers (and we’ll be producing more). Specifically the ECL->Machine optimizer is responsible for making sure all of the components of all of the available nodes are fully utilized.
Thus in the design of the libraries; our job is to come up with algorithms that are ‘scalably clean’ – in other words – ones where the algorithm does not force bottlenecks onto the optimizer – that is what we call scalable – it implicitly gives us performance.

Q) What is the licensing model?

 It is open source; there is a license.txt file – I don’t follow the vagaries of the different open licenses. If the text file doesn’t give you what you need – let me know and I will chase it down for you.

Q) In what programming language is the library written?


Q) Is there an easy way to interface to distributed databases like HBase?

Well – the HPCC (/ECL) has two processing models – one is batch (similar to hadoop) and one is low-latency ‘real time’. With the former you would need to stream the data from hdfs to us. For the real-time you can use a soap call from our side if you wish. That said – it is usually worth moving the data across at the batch level

Q) What are the algorithms that are currently implemented?

The weblink I gave you lists them – at the moment we are just starting up so pretty much ‘one in each category’ – thus Naïve Bayes, k-means, OLS linear regression, logistic regression, association mining (éclat/apriori), co-location, perceptrons. We also have a matrix library and a document conversion library.

Q) For using the library, do I have to learn a new programming language or is
it some commonly used language?

ECL – which is open source – but still emerging

I Would love to see some performance and accuracy results on standard datasets once
they are there. currently it is a work in progress. David promised to update me once they are there.

Monday, January 9, 2012

Vowpal Wabbit Tutorial

Vowpal Wabbit is a popular online machine learning implementation for solving linear models like LASSO, sparse logistic regression, etc. Library was initiated in and written by John Langford, Yahoo! Research.

Download version 6.1 from here. Compile using:
make install

Note: A newer version of VW is now found in GitHub.

Here are some tutorial slides given in the big learning workshop.

Now to a quick example on how to run logistic regression:
Prepare an input file named inputfile with the following data in it:
-1 | 1:12 2:3.5 4:1e-2
1 | 3:11 4:12
-1 | 2:4 3:1

Explanation: -1/1 are the labels. 1:12 -> means that the first feature is 12. 2:3.5 means
that the 2nd feature is 3.5 and so on. Note that feature names can be strings, as well as their values. In case feature are string they will be hashed into integers during the run.

Now run vw using:
./vw -d A --loss_function logistic --readable_model outfile
using no cache
Reading from A
num sources = 1
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
0.679009   0.679009          3      3.0    -1.0000  -0.1004        3

finished run
number of examples = 3
weighted example sum = 3
weighted label sum = -1
average loss = 0.679
best constant = -1
total feature number = 10

Explanation: -d is the input file. --loss_function is the type of loss function (can be one of: squared,logistic,hinge,quantile,classic). --readable_model speicifies the output file name in readable format.

The program output is:
bickson@thrust:~/JohnLangford-vowpal_wabbit-9c65131$ cat outfile 
Version 6.1
Min label:-100.000000 max label:100.000000
ngram:0 skips:0
index:weight pairs:

In the above example, we did a single pass on the dataset. Now assume we want to make several passes for fine tuning the solution. We can do:
./vw -d inputfile --loss_function logistic --readable_model outfile --passes 6 -c

Explanation: -c means creating a cache file, which significantly speeds execution. it is required when running multiple iterations.

When running multiple passes we get:
creating cache_file = inputfile.cache
Reading from inputfile
num sources = 1
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
0.895728   0.895728          3      3.0    -1.0000   0.2633        5
0.626871   0.358014          6      6.0    -1.0000  -0.9557        5
0.435506   0.205868         11     11.0     1.0000   1.1889        5

finished run
number of examples = 18
weighted example sum = 18
weighted label sum = -6
average loss = 0.3181
best constant = -0.4118
total feature number = 102

Now assume we want to compute predictions on test data. We use the same command as before:
./vw -d inputfile --loss_function logistic -f outfile 
But we changes the --readable_model to -f, output binary file.
Next we compute predictions on test data using:
 ./vw -d testinputfile --loss_function logistic -i outfile -t -p out_predictions
Note that we use the -i flag to indicate the input model, and -p flag to output the predictions file to.

Further reading: a more comprehensive VW tutorial by Rob Zinkov.

Wednesday, January 4, 2012

Installing GraphLab on Centos 6.0

Note: this document described how to install GraphLab 1 (multicore version).
For other versions see: http://graphlab.org/downloads/

For Keren's request..
Login into your CentOS machine, or start Amazon EC2 instance: ami-03559b6a.

1) Installing libboost
yum install boost boost-devel
2) Installing itpp - this step is optional.
yum install lapack autoconf autogen libtool
ln -s /usr/lib64/liblapack.so.3.2.1 /usr/lib64/liblapack.so
ln -s /usr/lib64/libblas.so.3.2.1 /usr/lib64/libblas.so

yum groupinstall "Development Tools"
wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz
tar xvzf itpp-4.2.tar.gz
cd itpp-4.2
./configure --without-fft --with-blas=/usr/lib64/libblas.so.3 --with-lapack=/usr/lib64/liblapack.so.3 CFLAGS=-fPIC CXXFLAGS=-fPIC CPPFLAGS=-fPIC
make && make install
Note: future version of lalpack may have different library name, check your /usr/lib64
dir for the correct lapack/blas lib names.

3) Install Mercurial
yum install mercurial
4) Install cmake
yum install cmake
5a) Install graphlab from mercurial
Go to graphlab download page, and follow the download link to the mercurial repository.
copy the command string: "hg clone..." and execute it in your ubuntu shell.
hg update v1

or 5b) Install graphlab from tgz file
Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6) configure using It++ (if you installed it in step 2)
cd graphlabapi
./configure --bootstrap --itpp_include_dir=/usr/local/include/ --itpp_dynamic_link_dir=/usr/local/lib/libitpp.so --itpp_lapack_dir=/usr/lib64/
6a) Alternatively - you can configure and compile using Eigen:
cd graphlabapi
./configure --bootstrap --eigen

7) compile
cd release/
make -j4

8) Test Graphlab
cd tests
export LD_LIBRARY_PATH=/usr/local/lib/
cd release/tests
8) Optional: install Octave.

Sunday, January 1, 2012

GraphLab project end of year summary - statistics for 2011

Since we released our open source a year ago, the end of 2011 is a nice opportunity to summarize GraphLab's project progress so far.
  • In 2011, GraphLab was downloaded 2425 times (using both source downloads and mercurial checkouts). 
  • GraphLab users mailing list has 74 members and 840 posts, and growing!
  • He is an image depicting all the universities  that are currently using GraphLab (we are aware of). If you are using it and you did not find your university logo in the list, let me know!

  • Some companies who are using GraphLab can be found in our partners page. Naturally, we are aware of many more companies who are using GraphLab but at this point we are not able to share this information. We can only say that most of the top technological companies are using Graphlab or trying it out.
  • Up to date, 27 algorithms are implemented on top of GraphLab in 4 application libraries: linear solvers, matrix factorization methods (collaborative filtering library), clustering library and inference library.
Our main asset, which is the base of our success, is our users! Thanks for everyone for helping support open source community and the GraphLab project.