Tuesday, January 31, 2012

Image segmentation in GraphLab

I got the following question from Doug Epps:

Hi,

I'm interested in implementing "Random Walks for Image Segmentation" (Leo Grady) using graphlab. The author posts a matlab implementation, and I've ported that to python happily. But I'm interested in doing it for large graphs.

Do you have any pointers on how to implement random walks in GraphLab? I've looked a bit at the linear solver docs, and they seem to be more about how to run Matlab/GraphLab together than about how to code a linear solver in the package.

Graphlab looks very well done, thanks for making a great library avail.

-doug

Doug also supplied me with a link to the paper's Matlab code.

I briefly read the paper, and it represents image segmentation as a linear system. You may ask: what is the relation between a linear system and random walks?
A good overview of this relation is given by Doyle and Snell in their seminal monograph Random Walks and Electric Networks.

To verify this I also took a look at the attached Matlab code, which requires the Graph Analysis Toolbox. Indeed, the Matlab code solves a 2D Dirichlet problem using a linear system solver (the specific solver function is called dirichletboundary).

Regarding the specific application at hand, it seems like a pretty cool one, so I thought I would give some more details about it. In a nutshell, the paper proposes to perform segmentation as follows.
A few seeds (cluster heads) are manually selected by the user. For each unlabeled pixel, one computes the probability that a random walk on the image graph starting at that pixel reaches each seed first; the pixel is then assigned to the seed it is most likely to reach.
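
To make the connection concrete, here is a minimal Python sketch of the linear-algebra step (the tiny 1-D "image", the unit edge weights, and the seed choices are all made up for illustration). Grady's formulation partitions the graph Laplacian into seeded and unseeded blocks and solves the Dirichlet problem L_u x = -B'm for the probability that a walker starting at each unseeded pixel reaches the foreground seed first:

import numpy as np

# Toy "image": 5 pixels in a line; pixel 0 is the foreground seed and
# pixel 4 is the background seed. w holds the weights between neighbors.
w = np.array([1.0, 1.0, 1.0, 1.0])
n = 5
L = np.zeros((n, n))  # combinatorial graph Laplacian
for i, wi in enumerate(w):
    L[i, i] += wi
    L[i + 1, i + 1] += wi
    L[i, i + 1] -= wi
    L[i + 1, i] -= wi

seeds, free = [0, 4], [1, 2, 3]
m = np.array([1.0, 0.0])           # seed labels: 1 = foreground, 0 = background
Lu = L[np.ix_(free, free)]         # unseeded block of the Laplacian
B = L[np.ix_(seeds, free)]         # coupling between seeded and unseeded pixels
x = np.linalg.solve(Lu, -B.T @ m)  # P(walker reaches the foreground seed first)
print(x)                           # [0.75 0.5 0.25]: decays with distance from the seed

On a real image the same system is built from pixel-similarity edge weights, and it is large and sparse, which is exactly where an iterative solver becomes attractive.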

The figures below were created using Leo Grady's Matlab demo code. The first image is the original picture, where the green point marks the foreground color in the area we want to segment and the blue point marks the background color (two segments in total).

[image]

The second image shows the computed answer. The black pixels are more likely
to be reached from the foreground seed.

[image]

The third image shows the boundary between the segments:

[image]

Random walk probabilities:

[image]

And now back to GraphLab: GraphLab has three iterative solvers implemented: Gaussian BP, Jacobi, and Conjugate Gradient. The simplest way to operate these solvers is to create a text file in sparse Matrix Market format. My suggestion is to first create the linear system in Matlab and export it to sparse Matrix Market format using the mmwrite.m script. First verify that the iterative solvers reach the desired accuracy on smaller images. After everything is working satisfactorily, it is also possible to prepare the whole problem directly in GraphLab. If you are working in Python, you can try our Java/Jython interface to GraphLab.
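
If you prefer to prepare the input in Python rather than Matlab, here is a small sketch (assuming scipy; the toy system and file names are arbitrary) that sanity-checks a sparse system with a CG solve and then writes it out in Matrix Market format:

import numpy as np
import scipy.io
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Toy sparse SPD system A x = b, a stand-in for the Dirichlet system above.
A = sp.diags([[-1.0] * 4, [2.0] * 5, [-1.0] * 4], offsets=[-1, 0, 1], format="csr")
b = np.zeros(5)
b[0] = 1.0

x, info = cg(A, b)                         # info == 0 means CG converged
print(info, x)

scipy.io.mmwrite("A", A)                   # writes A.mtx (sparse Matrix Market)
scipy.io.mmwrite("b", sp.csr_matrix(b).T)  # right-hand side as a sparse column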

Saturday, January 28, 2012

Animal breeding with GraphLab?

Believe it or not, there is an unlimited supply of interesting real-life problems that can be solved using GraphLab! Here is an email I got this morning from Asst. Prof. Gregor Gorjanc of the Biotechnical Faculty at the University of Ljubljana:

I work in the area of animal breeding and genetics. The task of paramount importance in my field is the inference of so-called breeding (genetic) values from phenotypic values and a pedigree graph. These breeding values are used to select the animals that will become the parents of better future generations. We use linear mixed models of the type:

y | b, a, \sigma^2_e ~ N(Xb + Za, I\sigma^2_e)
b | B, \sigma^2_b    ~ N(0, B\sigma^2_b)
a | A, \sigma^2_a    ~ N(0, A\sigma^2_a)

where y is a vector of known phenotypic values (body weight, ...), b is a vector of unknown factors affecting the phenotype (sex, age, ...), and a is a vector of breeding values, while X and Z are design matrices linking b and a with y. The matrices B and A are known covariance (kernel) matrices; the latter describes the genealogy between animals. We usually infer the variance components (\sigma^2_i) and use their point estimates in the above model as known quantities. Therefore, the inference of b and a basically involves solving a system of linear equations, where the LHS is:

| X'X + B⁻¹*beta   X'Z              |
| Z'X              Z'Z + A⁻¹*alpha  |,
with beta = \sigma^2_e / \sigma^2_b and alpha = \sigma^2_e / \sigma^2_a. The X'X + B⁻¹*beta block can be quite dense, but it is often much smaller than the Z'Z + A⁻¹*alpha block, so the LHS is in general very sparse. Both B⁻¹ and A⁻¹ are sparse as well, but do have some structure. The RHS is:

| X'y |
| Z'y |.

We can easily solve the above system, and state-of-the-art applications use the PCG (Preconditioned Conjugate Gradient) method due to the sheer size of the data sets we deal with in animal breeding: small applications involve a few thousand animals (roughly the number of unknowns), while nation-wide applications can easily involve hundreds of thousands of animals or more for large countries. Iterative methods deliver only the posterior means of b and a as solutions, while the posterior variances remain unknown because inverting the LHS is too costly. Posterior covariances between the entries of a are not of high importance. I came across the GaBP algorithm in the graphical models book by Koller and Friedman and got very interested when I read that GaBP can deliver both the posterior means and the variances for my model. Do you have any suggestions regarding my application?
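
Before getting to my answer, here is a minimal Python sketch (toy sizes, identity stand-ins for B⁻¹ and A⁻¹, assuming scipy) that assembles mixed model equations of the above form and solves them with Jacobi-preconditioned CG:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n, p, q = 200, 3, 50                 # records, fixed effects, animals (toy sizes)
X = sp.random(n, p, density=0.5, random_state=0, format="csr")
Z = sp.random(n, q, density=0.05, random_state=1, format="csr")
Binv = sp.identity(p, format="csr")  # stand-in for B^-1
Ainv = sp.identity(q, format="csr")  # stand-in for the pedigree-based A^-1
beta, alpha = 1.0, 2.0               # the variance ratios defined above
y = rng.standard_normal(n)

# Assemble the symmetric, sparse LHS and the RHS of the mixed model equations.
LHS = sp.bmat([[X.T @ X + beta * Binv, X.T @ Z],
               [Z.T @ X, Z.T @ Z + alpha * Ainv]], format="csr")
RHS = np.concatenate([X.T @ y, Z.T @ y])

M = sp.diags(1.0 / LHS.diagonal())   # Jacobi (diagonal) preconditioner
sol, info = cg(LHS, RHS, M=M)
b_hat, a_hat = sol[:p], sol[p:]      # posterior means of b and a
print(info, b_hat)

As Gregor notes, this delivers the posterior means only; the posterior variances are the diagonal of the inverse of the LHS, which is what GaBP tries to approximate.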

And here is my answer:
The first thing that comes to mind is that GaBP does not require a square matrix to operate; it can work on non-square matrices using the construction described here: equation (2).
Now, if the matrix is square, it had better be symmetric as well, or you will see reduced accuracy.
Note that the LHS above is symmetric (the lower-left block is the transpose of the upper-right one); if your system ever ends up square but not symmetric, I suggest playing a little with the equations and creating a non-square system as in the construction above.
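
To illustrate that kind of construction, here is a sketch of one standard symmetric augmentation for least squares (I am not claiming it is verbatim the equation (2) referenced above):

import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))  # non-square system matrix
b = rng.standard_normal(m)

# Augmented symmetric system:  [ I  A ] [z]   [b]
#                              [ A' 0 ] [x] = [0]
C = np.block([[np.eye(m), A],
              [A.T, np.zeros((n, n))]])
rhs = np.concatenate([b, np.zeros(n)])
sol = np.linalg.solve(C, rhs)    # a GaBP-style solver would be used here
x = sol[m:]

# The x block solves the normal equations A'A x = A'b:
print(np.allclose(A.T @ A @ x, A.T @ b))  # True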


My second piece of advice: first try my Matlab GaBP code (available here) on smaller matrices, just to verify that the algorithm converges and that you are happy with the accuracy. When you need a performance boost for larger matrices, GraphLab's GaBP should be the right tool to use.
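
For readers without Matlab, here is a minimal dense GaBP sketch in Python (my own illustration of the standard update rules, not the Matlab code referenced above):

import numpy as np

def gabp(A, b, iters=50):
    # Gaussian BP for A x = b, with A symmetric and walk-summable.
    # The message i -> j carries a precision P[i, j] and a potential h[i, j].
    n = len(b)
    P = np.zeros((n, n))
    h = np.zeros((n, n))
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i == j or A[i, j] == 0.0:
                    continue
                # Node potential plus all incoming messages, excluding the one from j.
                Pij = A[i, i] + P[:, i].sum() - P[j, i]
                hij = b[i] + h[:, i].sum() - h[j, i]
                P[i, j] = -A[i, j] ** 2 / Pij
                h[i, j] = -A[i, j] * hij / Pij
    Pi = np.diag(A) + P.sum(axis=0)  # posterior precisions
    hi = b + h.sum(axis=0)           # posterior potentials
    return hi / Pi, 1.0 / Pi         # approximate means and variances

A = np.array([[1.0, -0.3, -0.1],
              [-0.3, 1.0, -0.2],
              [-0.1, -0.2, 1.0]])
b = np.array([1.0, 2.0, 3.0])
mu, var = gabp(A, b)
print(mu, np.linalg.solve(A, b))       # the means match the exact solution
print(var, np.diag(np.linalg.inv(A)))  # the variances are only approximate

Running it also demonstrates the variance behavior discussed below: on this example the GaBP variances come out slightly below the exact diagonal of inv(A).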

Regarding the variance approximation: it is only an approximation, and the values may be lower than the actual variances. Again, I suggest starting with a smaller example in Matlab and trying to characterize the magnitude of the approximation error.

Q) You mention that GaBP delivers underestimated posterior variances for x. What is the rationale behind this? I will surely test against the exact case, but it would be great to get some theoretical insight into this.

A) In a nutshell: once the matrix is normalized to have a unit diagonal, write it as I - A. Then inv(I - A) can be expanded using the infinite series inv(I - A) = I + A + A^2 + A^3 + .... This can be viewed as a random walk on a Markov chain, summing walks of length 0, 1, 2, 3, .... See the JMLR paper by Jason K. Johnson et al.

GaBP captures only a part of those walks, namely the ones that start and end at the same node. So the computed variance is only an estimate of, and can underestimate, the total variance.
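
Here is a quick numerical illustration of the expansion (a toy sketch, not the GaBP code itself):

import numpy as np

# A toy matrix that is already normalized to a unit diagonal: M = I - A.
A = np.array([[0.0, 0.2, 0.1],
              [0.2, 0.0, 0.3],
              [0.1, 0.3, 0.0]])
M = np.eye(3) - A

# Neumann series: inv(I - A) = I + A + A^2 + A^3 + ...
S = np.eye(3)
P = np.eye(3)
for _ in range(200):
    P = P @ A  # walks that are one step longer
    S += P
print(np.allclose(S, np.linalg.inv(M)))  # True: the walk sum converges

# The diagonal of inv(I - A) holds the exact variances; GaBP sums only
# part of the self-return walks, hence its variance estimates can be lower.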

Friday, January 27, 2012

LexisNexis ML library - Advisory Panel: to PCA or not to PCA?


I am honored to report that I was invited to participate in LexisNexis' advisory panel for their machine learning library. The goal of this voluntary panel is to identify the machine learning algorithms most useful in practice and the best ways to implement them. I agreed to serve on this panel so I could report here on the interesting topics that arise in industry when implementing a large-scale machine learning solution.

And here is the first question I got from David Bayliss:
We know we want a PCA solution. Based upon our reading around, it looks like QR decomposition is the way to go, and that the best results for QR decomposition come from Householder reflections.

And here is my answer:
I suggest starting with SVD and not with PCA.
PCA has the drawback that you need to subtract the mean value from the matrix. This makes sparse matrices dense and limits your PCA to relatively small models. (There may be ways to work around this, but they need some extra thought.) In fact, this topic is frequently brought up on Mahout's mailing list. Below is one example:

Ted Dunning added a comment - 27/Nov/11 06:50
When it comes to making a scalable PCA implementation for sparse data, you can't do the mean subtraction before the SVD. This is because the subtraction will turn the sparse matrix into a dense matrix. In many cases of interest in Mahout land, this results in a million fold increase in storage costs and a million^2 increase in compute costs.
For dense data, the subtraction doesn't make things any worse, but SVD in general isn't really feasible for really large dense matrices anyway.
Most of the SVD algorithms in Mahout can be reworked to deal with the mean subtraction for PCA implicitly instead of having to actually do the subtraction. As far as I know, that is the only way that you are going to get this to scale beyond data that fits in memory and likely the only way to get it to work well even for large data that does fit in memory.
On the other hand, SVD works very nicely on sparse matrices using the Lanczos algorithm.
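
To make the implicit mean subtraction concrete, here is a Python sketch (my illustration using scipy's LinearOperator, not Mahout's actual code). The centered matrix-vector product (A - 1*mu')v = Av - 1*(mu'v) is applied on the fly, so the sparse matrix is never densified:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

A = sp.random(1000, 500, density=0.01, random_state=0, format="csr")
mu = np.asarray(A.mean(axis=0)).ravel()  # column means: a small dense vector
ones = np.ones(A.shape[0])

# The centered operator A - 1*mu' applied implicitly; A itself stays sparse.
op = LinearOperator(
    A.shape,
    matvec=lambda v: A @ v - ones * (mu @ v),
    rmatvec=lambda u: A.T @ u - mu * (ones @ u),
    dtype=np.float64,
)
U, s, Vt = svds(op, k=5)  # Lanczos-style SVD of the centered matrix
print(s[::-1])            # top singular values, i.e. the leading PCs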

And here is some great feedback I got from Nick Vasiloglou of ismion:
I read the blog about SVD versus PCA and I agree with Danny ... From my experience, the most successful SVD method in terms of speed for sparse data is the one discussed here. It was recently adopted by Mahout. As a note, the method works most of the time, but it can fail if some conditions are not satisfied. The most stable, and still fast enough, is the one that uses the Lanczos method or its variants. It requires more iterations, but it is stable and accurate.

Tuesday, January 24, 2012

Hyperspectral imaging using GraphLab - updated


Here is what I got from Josh Ash, a research scientist working on some pretty cool stuff at Ohio State University. He is using GraphLab's Java API to speed up the computation.



Hyperspectral imaging (HSI) sensors record images at hundreds of wavelengths in the electromagnetic spectrum and thus provide significantly more information about a scene than broadband or standard color (tri-band) imagery.  As such, HSI is well-suited for remotely classifying materials in a scene by comparing HSI-estimated material reflectances to a spectral library of material fingerprints.

Estimating intrinsic material reflectances from HSI measurements is, however, complicated by localized illumination differences in the scene resulting from in-scene shadows, and by global distortions resulting from radiative transfer through the atmosphere.  With the goal of increasing material classification performance, this work combines recent direct, diffuse, and ambient scattering predictors with statistical models imposing spatial structure in the scene.  Multiple interacting Markov random fields are utilized to capture spatial correlation among materials in the scene and localized distortions, such as shadows.  The inference engine, which dissects a hyperspectral data cube into a material map and a shadow map, toggles between performing loopy belief propagation to generate approximate posteriors on material classifications and performing expectation maximization updates needed in characterizing and inverting the atmospheric distortion.

Computational speed-up is achieved by performing multiple EM and BP updates in parallel on different portions of the associated factor graph.  The parallelization was implemented using the GraphLab framework.

Q) Can you provide a bit more detail about the GraphLab implementation?
How much faster is it relative to Matlab? As I recall, you are using our Java API, right?

Yep, using the Java API. Each node has an update function that performs either a BP factor update, a BP variable update, or an EM update, as appropriate for that node in the graph. The update functions, the various data structures for the different types of nodes and edges, and the graph generation were all done in Java.

Speed-up relative to Matlab is difficult to judge because I don't have apples-to-apples implementations. Basically, at one point in the project I recoded everything from Matlab to GraphLab/Java. At that point I got a 160x speedup (Matlab on a 2.66GHz Core 2 Duo vs. GraphLab on an 8-core 2.93GHz i7); however, the Matlab code was NOT optimized and could probably be improved 5-10x.


Q) Can you give some more details about the problem size? How many pixels are there? How many nodes are in the factor graph?

Right now I'm working with toy 100x100 = 10k pixel images. There are 30k nodes in the graph, with 130k (unidirectional) edges. In the future I will push these sizes significantly as I make the model more general; they could grow by a couple of orders of magnitude.

Q) What machine are you using? Which OS? How many cores? How much memory?

I use two computers:

1) Red Hat Enterprise Linux Workstation release 6.2
8-core Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
8GB RAM

2) Red Hat Enterprise Linux Server release 5.5 (Tikanga)
16-core Quad-Core AMD Opteron(tm) Processor 8378 @ 2.4GHz
64GB RAM


In the near future, I hope to get access to a 512-core shared-memory machine.

Q) What additional features would you like to have in GraphLab?

Two features would be useful: 1) a Java API for the distributed (not shared-memory) environment, and 2) implementing the 'sync' operation in the Java API.

Monday, January 23, 2012

ismion overview of GraphLab

Here is a quick review of GraphLab found on the ismion website. While I don't agree with the conclusion, it is interesting to see how GraphLab is rated by an objective third party.

If you would like to learn more about Nick and ismion, you can read my previous blog post.

Sunday, January 22, 2012

Intel + GraphLab collaboration


I am glad to announce that the GraphLab project is starting to collaborate with Intel. Dr. Theodore Willke, lead research scientist of the Cluster Computing Architecture team at Intel Labs, will head this effort on Intel's side. The goal of this research project is to optimize the communication-layer performance of high performance computing software on commodity Intel clusters. And you guessed right: GraphLab is the demo application we are going to use.

Friday, January 20, 2012

GraphLab input via Greenplum SQL



I got some great tips from Brian Dolan, founder and VP of Product at Discovix Inc, a cool California-area startup specializing in building machine learning and mathematical products.

Brian asked me a few days ago whether we had thought about interfacing GraphLab with some commonly used databases, since that is usually where client data is found. Since I don't have a clue about databases, Brian volunteered to teach me (and you, my blog readers!) how to do it.

And here is the tutorial as I got it from Brian. First, we create a table with fake data
to be used in the example:
DROP SCHEMA IF EXISTS glab CASCADE;
CREATE SCHEMA glab;

-- Create a table to hold some bogus data
DROP TABLE IF EXISTS glab.bs_data;
CREATE TABLE glab.bs_data
(
  rw int DEFAULT NULL
, cl int DEFAULT NULL
, vl numeric DEFAULT NULL
) DISTRIBUTED BY (rw, cl)
;

-- Now create some bogus data.  I use a few postgres tricks.
-- Ultimately, this just creates a relatively sparse "matrix" in tall format
-- I'm trying to use best practices in formatting and stuff
INSERT INTO glab.bs_data (rw, cl, vl)
SELECT
  x
, y
, random() AS vl
FROM 
  (
  SELECT
    CASE WHEN random() < 0.1 THEN x ELSE NULL END AS x
  , CASE WHEN random() < 0.1 THEN y ELSE NULL END AS y
  FROM
    generate_series(1, 1000) AS x
  , generate_series(1, 1000) AS y
  ) AS A
WHERE 1=1
  AND a.x IS NOT NULL 
  AND a.y IS NOT NULL 
;
After the table is ready, we can push it out to a file:
-- So we have some data.  Let's create an external table
-- to push it OUT of the database
-- You need to run gpfdist to write to an external table.
-- In this example, from the directory /Users/gpadmin/glab/tables the command was
-- > gpfdist 
DROP EXTERNAL TABLE IF EXISTS glab.ext_out;
CREATE WRITABLE EXTERNAL TABLE glab.ext_out (
  i bigint
, j bigint
, v numeric
) 
LOCATION ('gpfdist://macbuddha.local/glab_out.txt')
FORMAT 'TEXT' (DELIMITER ',' NULL ' ')
DISTRIBUTED BY (i, j)
;

-- Right, now let's put the data in it. This could be done as two queries
-- unioned together (so we can create the header row), but it's best to do them separately.
INSERT INTO glab.ext_out
SELECT
  max(rw) AS i
, max(cl) AS j
, count(*) AS v
FROM glab.bs_data;

INSERT INTO glab.ext_out
SELECT 
  rw AS i
, cl AS j
, vl AS v
FROM  glab.bs_data
ORDER BY 1,2
;
And here is the example output glab_out.txt:
1000,1000,10002
1,69,0.71019060537219
1,577,0.747299919836223
1,627,0.252593372948468
1,753,0.120338548440486
1,768,0.34520703041926
1,826,0.756854422856122
1,838,0.827316934708506
1,936,0.342057122848928
2,323,0.090937509201467
...
Now it is very easy to read it into GraphLab. In Matlab/Octave do:
glab_out = dlmread('glab_out.txt', ','); % load the comma-delimited data into memory (plain load expects whitespace-delimited input)
A = sparse(glab_out(2:end,1), glab_out(2:end,2), glab_out(2:end,3), glab_out(1,1), glab_out(1,2), glab_out(1,3)); % create a sparse matrix; the header row holds [rows, cols, nnz]
mmwrite('A', A);  % save the matrix to file. You may need to download the script mmwrite.m from http://graphlab.org/mmwrite.m
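
If you would rather avoid the Matlab step altogether, a rough Python equivalent (assuming numpy/scipy) is:

import numpy as np
import scipy.io
import scipy.sparse as sp

data = np.loadtxt('glab_out.txt', delimiter=',')
rows, cols, nnz = data[0].astype(int)  # header row: dimensions and nonzero count
i = data[1:, 0].astype(int) - 1        # convert 1-based indices to 0-based
j = data[1:, 1].astype(int) - 1
v = data[1:, 2]
assert len(v) == nnz                   # sanity check against the header
A = sp.coo_matrix((v, (i, j)), shape=(rows, cols))
scipy.io.mmwrite('A', A)               # writes A.mtx in Matrix Market format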

Since this is potentially useful, I am going to add support for this data format to GraphLab soon, so the Matlab conversion step will not be needed.

Next, Brian promised to teach me how to export data from GraphLab back to SQL.