Here are some of the projects Nick is involved in, in his own words:
- LexisNexis-ML is a machine learning toolbox combining the HPCC-LexisNexis high-performance computing cluster and the PaperBoat/GraphLab library. HPCC is by far a superior alternative to Hadoop. The system uses ECL, a declarative language that allows easier expression of data problems (see http://hpccsystems.com/ ). Inlining C++ code can make it even more powerful when sequential numerical algorithms need to be blended with data manipulation. HPCC's heart is a C++ code generator, which has the advantage of generating highly optimized binaries that outperform Java Hadoop binaries.
- PaperBoat is a single-threaded machine learning library built on top of C++ Boost MPL (template metaprogramming). The library is built with several templated abstractions so that it can be integrated easily with other platforms. The integration can be either light or very deep. The library makes extensive use of multidimensional trees to improve scalability and speed. Here is the current list of implemented algorithms. All of them support both sparse and dense data:
All nearest neighbors (range, k, nearest, furthest, metric, Bregman divergence), kd-trees, ball trees
K-means (kmeans++, peleg's algorithm, online, batch, kd-tree, sparse trees)
Kernel density estimation (kd-trees, ball trees)
Regression (stepwise, VIF, nonnegative, constrained)
SVM (SMO, a faster method using trees)
Orthogonal range search
Maximum Variance Unfolding
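To give a feel for why multidimensional trees help with scalability, here is a minimal kd-tree nearest-neighbor sketch. It is written in Python for brevity and is not PaperBoat code (PaperBoat is C++ and templated); the names `KdNode`, `build_kdtree` and `nearest` are hypothetical and chosen only for illustration. The point is the pruning step: a whole subtree is skipped whenever the splitting plane is farther from the query than the best match found so far.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class KdNode:
    point: Point                     # point stored at this node
    axis: int                        # coordinate used to split at this node
    left: Optional["KdNode"] = None
    right: Optional["KdNode"] = None


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KdNode]:
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KdNode(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def nearest(node: Optional[KdNode], query: Point, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    dist = math.dist(node.point, query)
    if best is None or dist < best[0]:
        best = (dist, node.point)
    # Descend first into the side of the split that contains the query.
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the other side only if the splitting plane is closer than the
    # best distance so far; otherwise that subtree cannot hold a better point.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(points)
distance, match = nearest(tree, (9.0, 2.0))
```

Ball trees follow the same idea with spheres instead of axis-aligned splits, and range or k-nearest queries change only the pruning test.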
- Mouragio is an asynchronous version of PaperBoat in which single-threaded machine learning algorithms can exchange data asynchronously. Mouragio implements a publish-subscribe model very efficiently, which is ideal for asynchronous bootstrapping (bagging) as well as for the racing algorithm (Moore & Maron 1997). Asynchronous iterations are an old idea from the MIT optimization lab (Bertsekas and Tsitsiklis, see link).
Mouragio tries to use algorithms from the graph literature to partition data and tasks automatically, so that the user doesn't have to deal with it. The Mouragio daemon tries to schedule each task on the node where most of the required data and computational power are available. Mouragio is partially supported by LogicBlox.
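The racing algorithm mentioned above can be sketched in a few lines. This is a sequential toy version in Python, not Mouragio code: in Mouragio the candidates would presumably run as separate single-threaded learners publishing their losses asynchronously, whereas here they are simply evaluated in a loop. The function names (`hoeffding_radius`, `race`) and the `sample_loss` callback are assumptions made for the example. Each round every surviving candidate receives one more loss sample in [0, 1], and a candidate is dropped once its Hoeffding lower bound on the mean loss exceeds the upper bound of the current best.

```python
import math


def hoeffding_radius(n, delta, value_range=1.0):
    """Half-width of a Hoeffding confidence interval after n samples."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def race(candidates, sample_loss, max_rounds, delta=0.05):
    """Race candidates against each other and return the survivors.

    candidates:  dict mapping a name to a model (any object)
    sample_loss: hypothetical callback (model, round_index) -> loss in [0, 1]
    """
    totals = {name: 0.0 for name in candidates}
    alive = set(candidates)
    for n in range(1, max_rounds + 1):
        # Every surviving candidate is scored on one more sample.
        for name in alive:
            totals[name] += sample_loss(candidates[name], n)
        radius = hoeffding_radius(n, delta)
        best_upper = min(totals[name] / n for name in alive) + radius
        # Keep only candidates whose lower bound still overlaps the best one.
        alive = {name for name in alive
                 if totals[name] / n - radius <= best_upper}
        if len(alive) == 1:
            break
    return alive


# Toy usage: each "model" is just its own constant loss.
survivors = race({"good": 0.1, "bad": 0.9}, lambda model, i: model, max_rounds=100)
```

The appeal for an asynchronous setting is that the elimination test needs only running totals per candidate, so it can be applied whenever a new loss value arrives.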
- DataLog-LogicBlox scientific engine. LogicBlox has developed a database platform based on logic. The language used is an enhanced version of Datalog. Datalog is by far the most expressive and declarative language for manipulating data. At this point Datalog translates logic into run-time database engine transactions. The goal of this project is to translate Datalog to other scientific platforms such as GraphLab and Mouragio. Datalog is very good at expressing graphs, so it can very easily be translated to GraphLab. Also, since the algorithms are described as sequence-independent rules, automatic parallelization is easier to do (although not always 100%).
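As an illustration of what "sequence-independent rules" means, here is a small Python sketch that evaluates the classic two-rule Datalog program for graph reachability bottom-up until no new facts appear. The rules in the docstring use textbook Datalog notation, not LogicBlox's own syntax, and `transitive_closure` is a hypothetical helper, not part of any of the systems mentioned. Because each round derives facts from the current set as a whole, the order in which individual rule instances fire does not change the final result, which is what makes this style of program a natural candidate for automatic parallelization.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the Datalog program

        path(X, Y) :- edge(X, Y).
        path(X, Z) :- path(X, Y), edge(Y, Z).

    edges is a set of (source, target) pairs; the result is the set of
    all path facts, computed as a fixpoint.
    """
    path = set(edges)                       # first rule: every edge is a path
    while True:
        # Second rule: join path(X, Y) with edge(Y, Z) on the shared Y.
        derived = {(x, z)
                   for (x, y) in path
                   for (y2, z) in edges
                   if y == y2}
        new_facts = derived - path
        if not new_facts:                   # fixpoint reached, nothing new
            return path
        path |= new_facts


edges = {("a", "b"), ("b", "c"), ("c", "d")}
paths = transitive_closure(edges)
```

Each iteration of the join is independent per fact, so the set comprehension could be split across workers and the partial results merged with a set union.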