Thursday, February 24, 2011

Some thoughts about accuracy of Mahout's SVD

I was testing Mahout's SVD code and I encountered some subtleties.
I wonder if I am missing anything or is there a bug in the code?

1) The ordering of eigenvalues was the opposite than eigenvectors. But this was hopefully fixed by now in patch-369.
2) When requesting a rank of 4, we get 3 eigenvalues... So it seems that the rank is always lower by one.
3) There are two transformations which makes comparison of results with matlab (or pen & paper) harder:

a) The scaleFactor. Defined in: ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosState.java
I quote a documentation remark in: ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java:48
" /** To avoid floating point overflow problems which arise in power-methods like Lanczos, an initial pass is made
 * through the input matrix to
generate a good starting seed vector by summing all the rows of the input matrix, and
compute the trace(inputMatrixt*matrix)
This latter value, being the sum of all of the singular values, is used to rescale the entire matrix, effectively forcing the largest singular value to be strictly less than one, and transforming floating point overflow
problems into floating point underflow (ie, very small singular values will become invisible, as they  will appear to be zero and the algorithm will terminate).*/

b) The second transformation is orthonogolization of the resulting vector. This step is optional (IMHO).
see: ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java:118
The function call is: orthoganalizeAgainstAllButLast(nextVector, state);
Again I quote from documentation:
/** 

This implementation uses {@link org.apache.mahout.math.matrix.linalg.EigenvalueDecomposition} to do the * eigenvalue extraction from the small (desiredRank x desiredRank) tridiagonal matrix. Numerical stability is * achieved via brute-force: re-orthogonalization against all previous eigenvectors is computed after every pass. * This can be made smarter if (when!) this proves to be a major bottleneck. Of course, this step can be parallelized * as well. *


If anyone wants to reproduce my test, Can can add the function testLanczosSolver2() to TestLanczosSolver.java (code below).
1) To run it, you need first to comment the line:
//nextVector.assign(new Scale(1 / scaleFactor));
in LanczosSolver.java, so it is easier to compare the results to Matlab, without the scaling.
2) You need to also comment the line:
//orthoganalizeAgainstAllButLast(nextVector, basis);
in LanczosSolver.java

The factorized matrix is:
>> A

3.1200 -3.1212 -3.0000
-3.1110 1.5000 2.1212
-7.0000 -8.0000 -4.0000

The eigenvalues are;
>> [a,b]=eig(A'*A)

a =

0.2132 -0.8010 -0.5593
-0.5785 0.3578 -0.7330
0.7873 0.4799 -0.3871

b =

0.0314 0 0
0 42.6176 0
0 0 131.2553

Now I run the unit test testLanczosSolver2 and I get:
INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 0 found with eigenvalue 131.25526355941963
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 1 found with eigenvalue 42.61761063477249
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Eigenvector 2 found with eigenvalue 0.03137295830779152
Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: LanczosSolver finished.

As you can see the eigenvalues are correct.

@Test
public void testLanczosSolver2() throws Exception {
int numRows = 3; int numCols = 3;
int numColumns = 3;
SparseRowMatrix m = new SparseRowMatrix(new int[]{numRows, numCols});
/**

    * 3.1200 -3.1212 -3.0000
      -3.1110 1.5000 2.1212
      -7.0000 -8.0000 -4.0000

*/
m.set(0,0,3.12);
m.set(0,1,-3.12121);
m.set(0,2,-3);
m.set(1,0,-3.111);
m.set(1,1,1.5);
m.set(1,2,2.12122);
m.set(2,0,-7);
m.set(2,1,-8);
m.set(2,2,-4);

int rank = 4;
Matrix eigens = new DenseMatrix(rank, numColumns);
long time = timeLanczos(m, eigens, rank, false);
assertTrue("Lanczos taking too long! Are you in the debugger? ", time < 10000);
}
Update: June 1st, 2011: Now GraphLab has also an efficient SVD Lanczos solver. Some performance benchmarks are found here:

2 comments:

  1. Hello sir,

    I am currently pursuing my research. I am a newbie to apache mahout. I have mahout installed in my system. I am interested in the Recommender part. I have 2 questions:
    1. I tried SVDRecommender with movielens(100k) data, with the parameters 10 for number of features, and 50 for number of iterations. I did not exactly understand what is to be passed for these two parameters?
    2. I want to make use of the Lanczos algorithm that is available. I want to create a recommender, that will factorize the matrix using Lanczos algorithm. Towards that I created a class extending AbstractRecommender. Inside this we need to override the method recommend, estimatepreference among others. I am not sure how to exactly go about. There are 2 lanczos solvers that I could see: DistributedLanczosSolver and LanczosSolver. The DistributedLanczossolver extends LanczosSolver, so should I make use of DistributedLanczosSolver. It has something called LanczosState, which I could not understand what it is. Any guidance is deeply thanked.

    Thanks,
    Yogalakshmi

    ReplyDelete
    Replies
    1. I suggest mailing Mahout user mailing list with this question.
      Good luck!

      Delete