I am honored to report I was invited to participate in LexisNexis advisory panel of their machine learning library. The goal of this voluntary panel is to identify the most useful machine learning algorithms in practice and the best way to implement them. I agreed to serve in this panel so I could report here the interesting topics that arise in the industry when implementing a large scale machine learning solution.
And here is the first question I got from David Bayliss:
We know we want a PCA solution. Based upon our reading around it looks like QR decomposition is the way to go and that the best result for QR decomposition comes from householder reflections.And here is my answer:
I suggest starting with SVD and not with PCA.
PCA has a drawback that you need to subtract the the mean value from the matrix. This makes sparse matrices non sparse and limits your PCA to relatively small models. (There may be way to workaround this but they need some extra thought). In fact, this topic is frequently brought up in Mahout's mailing list. Below is one example:
Ted Dunning added a comment - 27/Nov/11 06:50
When it comes to making a scalable PCA implementation for sparse data, you can't do the mean subtraction before the SVD. This is because the subtraction will turn the sparse matrix into a dense matrix. In many cases of interest in Mahout land, this results in a million fold increase in storage costs and a million^2 increase in compute costs.On the other hand, SVD works very nicely on sparse matrices using the Lanczos algorithm.
For dense data, the subtraction doesn't make things any worse, but SVD in general isn't really feasible for really large dense matrices anyway.
Most of the SVD algorithms in Mahout can be reworked to deal with the mean subtraction for PCA implicitly instead of having to actually do the subtraction. As far as I know, that is the only way that you are going to get this to scale beyond data that fits in memory and likely the only way to get it to work well even for large data that does fit in memory.
And here is a great feedback I got from Nick Vasiloglou, ismion:
I read the blog about SVD versus PCA and I agree with Danny ... From my experience the most successful SVD method in terms of speed for sparse data is the one discussed here. It was recently adopted by Mahout. As a note the method most of the time works but it can fail if some conditions are not satisfied. The most stable and still fast enough is the one that uses Lanczos method or its variants. It requires more iterations but it is stable and accurate.