Large Scale Machine Learning and Other Animals: LexisNexis

Friday, January 27, 2012

LexisNexis ML library - Advisory Panel: to PCA or not to PCA?

I am honored to report that I was invited to participate in the advisory panel for LexisNexis's machine learning library. The goal of this voluntary panel is to identify the machine learning algorithms that are most useful in practice and the best way to implement them. I agreed to serve on this panel so I could report here on the interesting topics that arise in industry when implementing a large scale machine learning solution.

And here is the first question I got from David Bayliss:

We know we want a PCA solution. Based upon our reading around, it looks like QR decomposition is the way to go and that the best result for QR decomposition comes from Householder reflections.

And here is my answer:

I suggest starting with SVD and not with PCA.
PCA has the drawback that you need to subtract the mean value from the matrix. This makes sparse matrices dense and limits your PCA to relatively small models. (There may be ways to work around this, but they need some extra thought.) In fact, this topic is frequently brought up on Mahout's mailing list. Below is one example:

Ted Dunning added a comment - 27/Nov/11 06:50

When it comes to making a scalable PCA implementation for sparse data, you can't do the mean subtraction before the SVD. This is because the subtraction will turn the sparse matrix into a dense matrix. In many cases of interest in Mahout land, this results in a million fold increase in storage costs and a million^2 increase in compute costs.
For dense data, the subtraction doesn't make things any worse, but SVD in general isn't really feasible for really large dense matrices anyway.
Most of the SVD algorithms in Mahout can be reworked to deal with the mean subtraction for PCA implicitly instead of having to actually do the subtraction. As far as I know, that is the only way that you are going to get this to scale beyond data that fits in memory and likely the only way to get it to work well even for large data that does fit in memory.
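To make the idea of implicit mean subtraction concrete, here is a minimal sketch in Python with SciPy. This is not Mahout's code, and the helper names `centered_operator` and `sparse_pca` are made up for illustration. The trick is to hand the SVD solver a linear operator that applies (X - 1·muᵀ) to a vector as two cheap steps, a sparse product and a rank-one correction, so the dense centered matrix is never stored.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds


def centered_operator(X):
    """Operator for X - 1 * mu^T that never forms the dense centered matrix."""
    n, d = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (d,)

    def matvec(v):
        v = np.asarray(v).ravel()
        # (X - 1 mu^T) v = X v - (mu . v) * 1
        return X @ v - np.full(n, mu @ v)

    def rmatvec(u):
        u = np.asarray(u).ravel()
        # (X - 1 mu^T)^T u = X^T u - mu * sum(u)
        return X.T @ u - mu * u.sum()

    return LinearOperator((n, d), matvec=matvec, rmatvec=rmatvec,
                          dtype=np.float64)


def sparse_pca(X, k):
    """Top-k singular values and principal directions of the centered data."""
    _, s, vt = svds(centered_operator(X), k=k)
    order = np.argsort(s)[::-1]  # sort descending; svds order is not guaranteed
    return s[order], vt[order]


# Toy data (an assumption for the demo): 200 x 50 matrix, 5% non-zeros.
X = sp.random(200, 50, density=0.05, format="csr", random_state=0)
singular_values, components = sparse_pca(X, k=3)
```

Each operator application costs one sparse matrix-vector product plus O(n + d) extra work, so the storage stays proportional to the number of non-zeros.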

On the other hand, SVD works very nicely on sparse matrices using the Lanczos algorithm.
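As a small illustration of that last point, the sketch below runs a truncated SVD directly on a sparse matrix with SciPy's `svds`, whose default ARPACK solver is a Lanczos-type method. The matrix size and density are arbitrary choices for the demo. It also prints how many entries are actually stored versus how many a dense, mean-subtracted copy would need.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Toy sparse matrix (an assumption for the demo): 1000 x 300, 1% non-zeros.
A = sp.random(1000, 300, density=0.01, format="csr", random_state=42)

# Truncated SVD on the sparse matrix; only matrix-vector products are needed.
u, s, vt = svds(A, k=5)

# Sort the factors by descending singular value.
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order]

stored = A.nnz                           # entries kept by the sparse format
dense_entries = A.shape[0] * A.shape[1]  # entries a dense copy would hold
print(f"stored entries: {stored}, dense entries: {dense_entries}")
```

The solver only ever touches the stored entries, which is why the sparse route scales where explicit centering does not.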

And here is some great feedback I got from Nick Vasiloglou, ismion:

I read the blog about SVD versus PCA and I agree with Danny ... From my experience the most successful SVD method in terms of speed for sparse data is the one discussed here. It was recently adopted by Mahout. As a note the method most of the time works but it can fail if some conditions are not satisfied. The most stable and still fast enough is the one that uses Lanczos method or its variants. It requires more iterations but it is stable and accurate.

Wednesday, January 18, 2012

LexisNexis ML library

A couple of days ago, LexisNexis released a machine learning library on top of the HPCC platform.

I asked David Alan Bayliss, who heads this effort, a few questions about the library.

Q) What are the main design goals of your ML library?

Scalability
Robustness
Ease of Use

Q) What is the target user base of your library?

The data scientist; someone that wants to be able to perform machine learning exercises with minimal programming - quickly

Q) In your list I did not see performance. Do you target scalability, robustness, and ease of use first?

Well the ML library is coded in ECL. ECL is a language I can talk about for days (it is my baby) – but essentially it is responsible for handling the mapping from the algorithm down onto the machine. Thus, in a way performance is our zeroth priority; it trumps all – but it is really implicit in our choice of platform. Keren can give you numbers (and we’ll be producing more). Specifically the ECL->Machine optimizer is responsible for making sure all of the components of all of the available nodes are fully utilized.

Thus in the design of the libraries; our job is to come up with algorithms that are ‘scalably clean’ – in other words – ones where the algorithm does not force bottlenecks onto the optimizer – that is what we call scalable – it implicitly gives us performance.

Q) What is the licensing model?

It is open source; there is a license.txt file – I don’t follow the vagaries of the different open licenses. If the text file doesn’t give you what you need – let me know and I will chase it down for you.

Q) In what programming language is the library written?

ECL

Q) Is there an easy way to interface to distributed databases like HBase?

Well – the HPCC (/ECL) has two processing models – one is batch (similar to Hadoop) and one is low-latency ‘real time’. With the former you would need to stream the data from HDFS to us. For the real-time you can use a SOAP call from our side if you wish. That said – it is usually worth moving the data across at the batch level.

Q) What are the algorithms that are currently implemented?

The weblink I gave you lists them – at the moment we are just starting up so pretty much ‘one in each category’ – thus Naïve Bayes, k-means, OLS linear regression, logistic regression, association mining (Eclat/Apriori), co-location, perceptrons. We also have a matrix library and a document conversion library.

Q) For using the library, do I have to learn a new programming language or is it some commonly used language?

ECL – which is open source – but still emerging

I would love to see some performance and accuracy results on standard datasets. Currently the library is a work in progress, and David promised to update me once such results are available.
