Saturday, January 28, 2012

Animal breeding with GraphLab?

Believe it or not, there is unlimited supply of real life interesting problems that can be solved usign GraphLab! Here is an email I got this morning from Asst. Prof. Gregor Gorjanc from the Biotechnical Faculty of University of Ljubljana:

I work in the area of animal breeding and genetics. The task of paramount importance in my field is the inference of so called breeding (genetic) values from phenotypic values and pedigree graph. These breeding values are used for selection of animals that will become parents of better future generations. We use linear mixed models of a type:

y | b, a, \sigma^2_e ~ N(Xb + Za, I\sigma^2_e)
b | B, \sigma^2_b    ~ N(0, B\sigma^2_b)
a | A, \sigma^2_a    ~ N(0, A\sigma^2_b)

where y - is a vector of known phenotypic values (body weight, ...), b - is a vector of unknown factors affecting phenotype (sex, age, ...), and a - is a vector of breeding values, while X and Z are design matrices linking b and a with y. Matrices B and A are known covariance (kernel) matrices - the later describes the genealogy between animals. We usually infer variance components (\sigma^2_i) and use point estimates in the above model as known quantities. Therefore, the inference of b and a basically involves solving a system of linear equations, where LHS is:

| X'X+B⁻1*beta X'Z           |
| Z'Z          Z+Z+A⁻1*alpha |,
with beta=\sigma^2_e / \sigma^2_b and alpha=\sigma^2_e / \sigma^2_a. The X'X+B⁻1 can be quite dense, but is often much much smaller than the Z+Z+A⁻1 part so LHS is in general very sparse. Both B⁻1 and A⁻1 are sparse as well, but do have some structure. The RHS is:

| X'y |
| Z'y |.

We can easily solve the above system and state of the art applications use the PCG (Preconditioned Conjugate Gradiaten) method due to shear size of data sets we deal with in animal breeding - small applications involve few thousands of animals (~number of unknowns), while nation-wide applications can easily involve hundreds of thousands of animals or more for large countries. Iterative methods only deliver as solutions posterior means of b and a, while posterior variances are not knwon as the inversion of LHS is to costly. Posterior covariances between a are not of high importance. I came across GaBP algorithm in the Graphical models book by Koller and Friedman and got very interested when I read that GaBP can deliver both, posterior means and variances for my model. Do you have any suggestions regarding my applications?

And here is my answer:
The first thing that comes to my mind, is that GaBP does not require a square matrix to operate, it can work on non-square matrices using the construction described here: equation (2).
Now, if the matrix is square, it should better be symmetric or else you will get reduced performance in terms of accuracy.
It seems that your matrix is square but not symmetric - I suggest playing a little with the equation and creating a non-square matrix as in the construction above.

My second advice is first try my Matlab GaBP code (available here) on smaller matrices just to verify that the algorithms converges and you are happy with accuracy. When you need a performance boost for larger matrices then GraphLab GaBP should be the right tool to use.

Regarding variance approximation - it is only an approximation, the values may be lower than the actual variance. I suggest again trying with smaller example in Matlab and trying to characterize the approximation error magnitude.

Q) You mention that GaBP delivers underestimated posterior variances for x. What is the rational behind this? I will surely test out with the exact case, but it would be great to get some theoretical insight into this.

A) In a nutshell: The matrix (inv(I-A)), can be expanded using an infinite sum, (inv(I-A)) = A + A^2 + A^3... . (When it is first normalized to have a unit diagonal). This can be viewed as an random walk on a markov chain, when summing routes of distances of size 1,2,3 ... See the JMLR paper of Jason K. Johnson et al.

GaBP is capturing only a part of those routes - the ones which start and ends in the same node. So the variance is an estimation to the total variance.

No comments:

Post a Comment