I asked David Alan Bayliss, who heads this effort, a few questions about the library.
Q) What are the main design goals of your ML library?
- Ease of Use
Q) What is the target user base of your library?
The data scientist: someone who wants to perform machine learning exercises quickly, with minimal programming.
Q) In your list I did not see performance. Do you target scalability, robustness, and ease of use first?
Well, the ML library is coded in ECL. ECL is a language I can talk about for days (it is my baby), but essentially it is responsible for handling the mapping from the algorithm down onto the machine. Thus, in a way, performance is our zeroth priority; it trumps all, but it is really implicit in our choice of platform. Keren can give you numbers (and we’ll be producing more). Specifically, the ECL->Machine optimizer is responsible for making sure all of the components of all of the available nodes are fully utilized.
Thus, in the design of the libraries, our job is to come up with algorithms that are ‘scalably clean’: in other words, ones where the algorithm does not force bottlenecks onto the optimizer. That is what we call scalable, and it implicitly gives us performance.
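To make ‘scalably clean’ a bit more concrete, here is a minimal sketch of my own in Python. It is not ECL and not code from the HPCC library, and the function names are made up for illustration. It fits a one-variable OLS line by having each data partition produce a small summary that can be added to any other partition's summary in any order, so no single step needs to see all of the rows at once. I am assuming this is the kind of structure David means by an algorithm that does not force a bottleneck.

```python
from functools import reduce

def partial_stats(rows):
    """Summarize one partition of (x, y) rows as (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Merge two partition summaries; addition is associative and commutative."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(partitions):
    """Fit y = slope * x + intercept from per-partition summaries only."""
    zero = (0.0, 0.0, 0.0, 0.0, 0.0)
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_stats, partitions), zero)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data: points on y = 2x + 1, split across three "nodes".
partitions = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
slope, intercept = fit_line(partitions)
```

Because `combine` is plain addition, the summaries could be merged pairwise on different machines; only five numbers per partition ever have to move.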
Q) What is the licensing model?
It is open source; there is a license.txt file – I don’t follow the vagaries of the different open licenses. If the text file doesn’t give you what you need – let me know and I will chase it down for you.
Q) In what programming language is the library written?
Q) Is there an easy way to interface to distributed databases like HBase?
Well, the HPCC (/ECL) has two processing models: one is batch (similar to Hadoop) and one is low-latency ‘real time’. With the former you would need to stream the data from HDFS to us. For the real-time model you can use a SOAP call from our side if you wish. That said, it is usually worth moving the data across at the batch level.
Q) What are the algorithms that are currently implemented?
The weblink I gave you lists them. At the moment we are just starting up, so it is pretty much ‘one in each category’: Naïve Bayes, k-means, OLS linear regression, logistic regression, association mining (Eclat/Apriori), co-location, perceptrons. We also have a matrix library and a document conversion library.
Q) For using the library, do I have to learn a new programming language or is it some commonly used language?
ECL, which is open source but still emerging.
I would love to see some performance and accuracy results on standard datasets once they are available. Currently it is a work in progress; David promised to update me once the results are in.