Large Scale Machine Learning and Other Animals: SpotLight: Prof. Michael Mahoney, Stanford

A couple of days ago I had a delightful conversation with Prof. Mahoney from Stanford. Michael is the organizer of the successful MMDS workshop which every couple of year attracts 250 researchers interested in algorithms on big data. The coming workshop will be held July 10-13, 2012, at Stanford. For more information, including registration, how to submit poster presentations, etc., see the MMDS web page.

Michael did a lot of work in the field of random projection and I asked him some questions to learn some more about this field:

1) Can you provide some resources for novice people (like me) who likes to learn more on the technique of random projections?

Here are the links to the two reviews on randomized matrix algorithms that we discussed.

Randomized Algorithms for Matrices and Data, M. W. Mahoney,
NOW Publishers, Foundations and Trends in Machine Learning, Volume 3, Issue 2, 2011,
Technical Report, Preprint: arXiv:1104.5557 (2011).

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp
SIAM Review, Volume 53, Issue 2, pp. 217-288, 2011,
Technical Report, Preprint: arXiv:0909.4061 (2009).

The first of these reviews (mine) adopts a more data/TCS/ML approach for the least-squares and low-rank approximation problems and tries to highlight "why" the algorithms work with an eye to developing future methods more generally; while the latter review (HMT) focuses on coupling the novel randomized algorithms for low-rank matrix approximation with classical numerical methods, the goal being to get high-quality numerical code, typically for "scientific" applications. Thus, e.g., some of the assumptions that are made are more appropriate for scientific applications that "data analytics" applications---e.g., for them sparse matrices are typically structured, e.g., from pdes, and thus can be applied quickly as an operator to unstructured Gaussian random projections.

2) Are you familiar with the divide and conquer paper from last NIPS? ("Divide and Conquer Matrix Factorization" by Lester Mackey, Ameet Talwalkar and Michael Jordan)

I'm familiar with that paper on "divide and conquer," and I've talked a lot with Ameet Talwalkar about these issues. Their splitting of the data makes sense; but from the perspective of what we were discussing about analytics, this and other work of Ameet (and many others in ML) makes assumptions that the matrix has small "coherence," which is basically the largest "statistical leverage score." See my review monograph above, or see our PNAS paper on "CUR matrix decompositions for improved data analysis," or the paper below, for details. That assumption is fine in some cases, and Ameet has shown that it is true for at least a few ML data sets. But it is a very false assumption in many many other data sets, which the PNAS paper, the review monograph, and several other papers I can point you to discuss. More importantly, to develop robust large-scale analytic tools, you simply can't assume that, since it will often be false. A good discussion of these issues can also be found in our paper:

Fast approximation of matrix coherence and statistical leverage,
P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff,
Technical Report, Preprint: arXiv:1109.3843 (2011)

which also describes how to compute the coherence and statistical leverage scores "fast." Coupled with LSRN, the algorithm in this paper yields fast code to compute the statistical leverage scores and matrix coherence---very fast in the case of "rectangular" matrices, and "medium fast" in the case of large sparse matrices with a low-rank parameter, partly due to the issue mentioned above, partly due to numerical issues with iterative methods like Lanczos-based methods and partly due to the densification issue. The latter is going to be a problem in general, and there is no simple solution---we should talk in more detail about this if you like.

3) I heard about LRSN, your efficient parallel iterative least squares. Can you give us the essence of this construction?

ANSWER:

LSRN is a parallel iterative least squares solver developed by Meng,
Saunders, and Mahoney that is based on random projections and that takes
advantage of modern computer architectures is a novel way. The solver
computes the unique min-length solution to strongly over-determined or
under-determined least-squares systems of the form
$\min_{x \in \mathbb{R}^n} \|A x - b\|_2$. Thus, LSRN is similar to
recent implementations of randomized projection algorithms for the
least-squares problem; but it is more general and more robust, and it is
designed for larger-scale parallel or distributed environments. In
particular, all existing prior work assumes that $A$ has full rank; and,
for the iterative solvers, none of it comes with guarantees on the number
of iterations, independent of the spectrum of $A$. For LSRN, the matrix
$A \in \mathbb{R}^{m \times n}$ can be either full-rank or rank-deficient;
since $A$ is only involved in matrix-matrix and matrix-vector
multiplications, $A$ can be a dense matrix, a sparse matrix, or a linear
operator; and LSRN automatically speeds up on sparse matrices and with
fast linear operators. The preconditioning phase of LSRN consists of a
random normal projection, which is embarrassingly parallel, and a singular
value decomposition of a "small" matrix. Importantly, the preconditioned
system is well-conditioned, with a strong concentration result on the
extreme singular values, and hence that the number of iterations is fully
predictable (just as with direct solvers) when LSQR or the Chebyshev
semi-iterative method is applied to the preconditioned system. The latter
method is particularly efficient for solving large-scale problems on
distributed clusters with high communication cost. In addition, LSRN is
easy to implement using threads or with MPI, and it scales well in
distributed clusters and parallel environments. Numerical results
demonstrate that, on a local shared memory machine, LSRN outperforms
LAPACK's DGELSD on large-scale dense problems and MATLAB's backslash
(SuiteSparseQR) on large-scale sparse problems; and empirical results also
demonstrate that LSRN scales well on Amazon Elastic Compute Cloud (EC2)
clusters. LSRN is available for download , or here.

Large Scale Machine Learning and Other Animals

Friday, April 20, 2012

SpotLight: Prof. Michael Mahoney, Stanford

No comments:

Post a Comment

Labels

GraphLab Users Google Group

pagerank

google analytics

syntax