Large Scale Machine Learning and Other Animals: LexisNexis ML library

Wednesday, January 18, 2012

LexisNexis ML library

A couple of days ago, LexisNexis released a machine learning library on top of the HPCC platform.

I asked David Alan Bayliss, who heads this effort, a few questions about the library.

Q) What are the main design goals of your ML library?

Scalability
Robustness
Ease of Use

Q) What is the target user base of your library?

The data scientist: someone who wants to be able to perform machine learning exercises quickly, with minimal programming.

Q) In your list I did not see performance. Do you target scalability, robustness, and ease of use first?

Well, the ML library is coded in ECL. ECL is a language I can talk about for days (it is my baby), but essentially it is responsible for handling the mapping from the algorithm down onto the machine. Thus, in a way, performance is our zeroth priority; it trumps all, but it is really implicit in our choice of platform. Keren can give you numbers (and we’ll be producing more). Specifically, the ECL->Machine optimizer is responsible for making sure all of the components of all of the available nodes are fully utilized.

Thus, in the design of the libraries, our job is to come up with algorithms that are ‘scalably clean’: in other words, ones where the algorithm does not force bottlenecks onto the optimizer. That is what we call scalable, and it implicitly gives us performance.
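To make the idea of an algorithm that does not force a bottleneck more concrete, here is a minimal Python sketch (not ECL, and not taken from the LexisNexis library). It assumes a simple one-variable least-squares fit, where each partition of the data is summarized independently and the summaries are merged by plain addition. The names `Stats`, `local_stats`, `merge`, and `fit_line` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    """Sufficient statistics for fitting y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def local_stats(rows):
    """Summarize one partition of (x, y) pairs without touching other partitions."""
    s = Stats()
    for x, y in rows:
        s = Stats(s.n + 1, s.sx + x, s.sy + y, s.sxx + x * x, s.sxy + x * y)
    return s


def merge(a, b):
    """Combine two summaries; addition is associative, so merge order is free."""
    return Stats(a.n + b.n, a.sx + b.sx, a.sy + b.sy, a.sxx + b.sxx, a.sxy + b.sxy)


def fit_line(partitions):
    """Fit a line from per-partition summaries only."""
    total = reduce(merge, (local_stats(p) for p in partitions), Stats())
    denom = total.n * total.sxx - total.sx ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept


# Two partitions standing in for data held on two different nodes.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit_line(partitions)
```

Because every partition is reduced to five numbers before anything is combined, no single step needs to see all of the rows at once, which is the property the answer above describes as scalable.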

Q) What is the licensing model?

It is open source; there is a license.txt file – I don’t follow the vagaries of the different open licenses. If the text file doesn’t give you what you need – let me know and I will chase it down for you.

Q) In what programming language is the library written?

ECL

Q) Is there an easy way to interface to distributed databases like HBase?

Well, the HPCC (/ECL) has two processing models: one is batch (similar to Hadoop) and one is low-latency ‘real time’. With the former you would need to stream the data from HDFS to us. For the real-time model you can use a SOAP call from our side if you wish. That said, it is usually worth moving the data across at the batch level.

Q) What are the algorithms that are currently implemented?

The weblink I gave you lists them. At the moment we are just starting up, so it is pretty much ‘one in each category’: Naïve Bayes, k-means, OLS linear regression, logistic regression, association mining (Eclat/Apriori), co-location, and perceptrons. We also have a matrix library and a document conversion library.
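For readers unfamiliar with the algorithms named above, here is a minimal Python sketch of k-means (Lloyd's algorithm). It is a generic textbook version, not the library's ECL implementation; the function name `kmeans`, its parameters, and the sample points are assumptions made for this illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on tuples of floats, starting from given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # A centroid with no members keeps its previous position.
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
result = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The starting centroids are passed in explicitly so the run is deterministic; a real implementation would usually pick them at random or with a seeding heuristic.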

Q) For using the library, do I have to learn a new programming language or is it some commonly used language?

ECL, which is open source but still emerging.

I would love to see some performance and accuracy results on standard datasets. Currently the library is a work in progress; David promised to update me once results are available.
