Large Scale Machine Learning and Other Animals: June 2014

Thursday, June 26, 2014

Alice Zheng's GraphLab- O'Reilly Webinar is now online

If you want to learn about the latest features of GraphLab, I recommend watching this useful webinar.

Sunday, June 22, 2014

Interesting taxi rides dataset

I got the following from my collaborator Zach Nation: a NY taxi ride dataset that was not properly anonymized and was reverse engineered to find interesting insights in the data.

For sport, I used GraphLab Create to load and analyze this dataset. I started with an image of some NY taxis:

Using GraphLab Create I was able to reverse engineer the anonymization and query the data based on the medallion number (for example 8J77 for the lower left taxi in the image).

I was further able to dig into personal details based on the medallion number:

And finally, I could ask questions such as how much money the taxis in the image made in a certain week.

Anyone who wants to try it out is welcome to email me; I can send you the IPython notebook to play with.
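For readers who want a feel for how this kind of de-anonymization can work, here is a minimal sketch in plain Python (not the GraphLab Create notebook). It assumes the medallion numbers were hashed with an unsalted hash such as MD5, which this post does not state, and it only covers the digit-letter-digit-digit pattern seen in 8J77. The ride records and their field layout are hypothetical.

```python
import hashlib
import string
from datetime import date
from itertools import product


def md5_hex(text):
    """Hex digest of an ASCII string (assumed hashing scheme, unsalted)."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_medallion_table():
    """Map hash -> plaintext for every digit-letter-digit-digit medallion.

    The ID space is tiny (10 * 26 * 10 * 10 candidates), so hashing
    every candidate once is enough to invert the "anonymization".
    """
    table = {}
    digits, letters = string.digits, string.ascii_uppercase
    for d1, letter, d2, d3 in product(digits, letters, digits, digits):
        medallion = d1 + letter + d2 + d3
        table[md5_hex(medallion)] = medallion
    return table


def weekly_revenue(rides, table, medallion, iso_week):
    """Sum fares of one medallion in one (ISO year, ISO week).

    rides: iterable of (hashed_medallion, ride_date, fare) tuples
    (a hypothetical layout, not the real dataset schema).
    """
    total = 0.0
    for hashed_id, ride_date, fare in rides:
        same_taxi = table.get(hashed_id) == medallion
        same_week = tuple(ride_date.isocalendar()[:2]) == tuple(iso_week)
        if same_taxi and same_week:
            total += fare
    return total


# Toy records, made up for illustration only.
toy_rides = [
    (md5_hex("8J77"), date(2013, 1, 7), 10.5),
    (md5_hex("8J77"), date(2013, 1, 8), 4.5),
    (md5_hex("8J77"), date(2013, 1, 15), 20.0),  # a different week
    (md5_hex("3B12"), date(2013, 1, 7), 99.0),   # a different taxi
]
```

The point of the sketch is that hashing a small, structured ID space hides nothing: anyone can enumerate all candidates and build the reverse lookup in well under a second.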

Monday, June 16, 2014

Be a detective with GraphLab Create! Follow bitcoin money transactions to reveal a criminal!

Just got a note from my collaborator Brian Kent, who just released a new notebook which shows how to analyze Bitcoin money transactions using GraphLab Create. Using this notebook, Brian is trying to reveal a thief who stole 25,000$ in Bitcoin. Here is a graph of some of the thief's transactions:

To learn the rest of the story you will need to read the full notebook.
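To give a flavor of what "following the money" means on a transaction graph, here is a minimal sketch in plain Python. It is not Brian's notebook and does not use GraphLab Create; the function name, the `(sender, receiver, amount)` record layout, and the toy addresses are all hypothetical. It runs a breadth-first search from a starting address and reports how many hops away each reachable address is.

```python
from collections import deque


def trace_funds(transactions, start, max_hops=3):
    """Return {address: hops} for addresses reachable from `start`.

    transactions: iterable of (sender, receiver, amount) tuples.
    Only the direction of the payments is used; amounts are ignored.
    """
    outgoing = {}
    for sender, receiver, _amount in transactions:
        outgoing.setdefault(sender, []).append(receiver)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        address = queue.popleft()
        if hops[address] == max_hops:
            continue  # do not expand beyond the hop limit
        for receiver in outgoing.get(address, []):
            if receiver not in hops:  # first visit is the shortest path
                hops[receiver] = hops[address] + 1
                queue.append(receiver)
    return hops


# Toy transactions, made up for illustration only.
toy_transactions = [
    ("thief", "a", 5.0),
    ("thief", "b", 2.0),
    ("a", "c", 4.0),
    ("c", "d", 1.0),
    ("d", "e", 1.0),
    ("x", "thief", 9.0),  # incoming payment, not followed
]
```

With `max_hops=3`, the search starting at "thief" reaches a, b, c and d but stops before e.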

Related blog posts: Graph analytics is a promising tool for fraud detection and security. Recently, Cisco announced that GraphLab is part of their security stack. PNNL is using GraphLab for its cyber security projects. Lab41 (a US government research lab) combines Titan and GraphLab into a powerful social graph analytics tool.

Saturday, June 14, 2014

Community detection survey by Lab41

Just got my hands on the community detection survey made by Lab41. It is a very comprehensive overview of the popular and useful methods to know. Some of the included methods are Girvan-Newman, Infomap, Fast Unfolding, CESNA and many more.

One of the interesting algorithms is BigClam:

Friday, June 13, 2014

Lab41 releases open source code for GraphLab + Yarn integration

Just heard from Erik Tryzlaar of Lab41 that a new open source GitHub project called Twill is live. The project allows for running GraphLab tasks on a Hadoop 2.0 cluster which supports Yarn.

As a reminder, Pivotal also has its own wrapper which allows for running GraphLab on their Hadoop cluster, as part of their HD project.

A lot of exciting activity from different parties who are helping to make GraphLab Hadoop compatible! We will also release some news from GraphLab about this direction soon.

Fantastic talk by Dafna Shahaf - Stanford

This week I attended a great talk by Dafna Shahaf. In a nutshell, she has a method for finding surprising insights in data. An open source project for sharing some of those tools is on the way.
Several application domains were covered. For example, in the medical domain, two of the system's four findings were judged to be major medical breakthroughs by an external physician who examined the output. For commerce, the system can find surprising Amazon products to recommend to people. Here is a nice example:

For people who are looking for children's toys there are a lot of related selections in the pet section. For people who need a bath mat, there are related products in the car department which are much cheaper.

Leading Cancer Hospital utilizes GraphLab LDA for HealthCare

Just heard a very interesting report from Xinghua Lou, a machine learning researcher in Microsoft Research. Xinghua utilized GraphLab topic modeling for clustering health related documents. This work was reported at the Big Data Innovation Summit 2014.

From the KDnuggets blog post about this work:
"Among various techniques for understanding text corpus, we chose LDA topic models (implemented in GraphLab) because of its previous success in understanding scientific literature as well as webpages. We followed a process roughly as follows: data cleaning and standardization, topic modeling, clinical note clustering and visualization, community finding and cancer-gene correlation analysis. This process was mainly implemented by Katherine Chan under my supervision. We had a few interesting findings, such as a community of patients who highly care about the risk of the treatment, the ability of predicting icd-9 code from topic modeling output, and some interesting correlations between patient profile and genetic mutation tests (some supported by previous published research)."

Friday, June 6, 2014

GraphChi based new partitioning method wins the best paper at SASO 2013!

Just had a great visit at KTH University in Sweden. There I learned about very interesting work on a new graph partitioning algorithm from Fatemeh Rahimian, who won the best paper award at SASO 2013. The paper uses a simulated-annealing-based local search to improve the partitioning.
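To illustrate the general idea of simulated-annealing local search for balanced edge-cut partitioning, here is a simplified, centralized sketch in plain Python. It is not the JabeJa implementation and makes no claim to match the paper: the pairwise-swap move, the acceptance rule, the cooling schedule, and all function and parameter names are assumptions chosen for the sketch. Nodes start in a balanced assignment, random pairs in different partitions try swapping, and a swap that worsens the cut is still accepted with a probability that shrinks as the temperature cools.

```python
import math
import random


def edge_cut(adj, color):
    """Count edges whose endpoints lie in different partitions."""
    return sum(1 for u in adj for v in adj[u] if u < v and color[u] != color[v])


def local_cut(adj, color, u, v):
    """Count cut edges touching u or v (a u-v edge is counted once)."""
    cut = sum(1 for w in adj[u] if color[w] != color[u])
    cut += sum(1 for w in adj[v] if w != u and color[w] != color[v])
    return cut


def anneal_partition(adj, k, rounds=200, t0=2.0, cooling=0.97, seed=0):
    """Return (best assignment, its edge cut) for an undirected graph.

    adj: {node: set of neighbours}, symmetric, with orderable node ids.
    Swapping the partitions of two nodes keeps partition sizes fixed.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    color = {u: i % k for i, u in enumerate(nodes)}  # balanced start
    current_cut = edge_cut(adj, color)
    best, best_cut = dict(color), current_cut
    t = t0
    for _ in range(rounds):
        for u in nodes:
            v = rng.choice(nodes)
            if color[u] == color[v]:
                continue  # swapping would change nothing
            before = local_cut(adj, color, u, v)
            color[u], color[v] = color[v], color[u]
            delta = local_cut(adj, color, u, v) - before
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current_cut += delta
                if current_cut < best_cut:
                    best, best_cut = dict(color), current_cut
            else:
                color[u], color[v] = color[v], color[u]  # undo the swap
        t *= cooling
    return best, best_cut
```

Because moves are swaps, the partitions stay exactly as balanced as the initial assignment, and the early high temperature lets the search escape poor local optima that a purely greedy swap rule would get stuck in.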

I got the following clarification from Fatemeh:

Please find attached the two papers that we have, one is for edge-cut partitioning (JabeJa) and the other is for vertex-cut partitioning (JabeJa-vc), which is inspired by the first algorithm, as its name suggests.
Our algorithm, JabeJa, can be executed with different data distribution models: (i) in a completely distributed environment (like a p2p network, where each peer is actually a graph node), we call this model the one-host-one-node model; or (ii) in an environment where a machine can host a part of the graph, we call this model the one-host-multiple-nodes model.

The implementation of JabeJa on GraphChi consists of the following files:
1. JabeJa.java: the main algorithm of JabeJa.
2. JabeJaWeighted.java: this is JabeJa for weighted graphs.
3. MessageRelay.java: this is the file that implements the "mail" API, i.e., "send" and "get".
4. PartitionAnalysis.java: this file writes the final partitions into different output files.

We are working on adding this code contribution to the GraphChi Java code. In the meantime, anyone who is interested is welcome to email Fatemeh directly.

Another interesting fact is that a new Data Intensive Computing course at KTH is teaching about GraphLab, among other systems. The slides are available for everyone on the web.

Sunday, June 1, 2014

3rd GraphLab Conference is getting closer!!

The GraphLab conference attracts the most interesting emerging data science projects. Join us on Monday July 21, 2014 at the Nikko Hotel in SF.

We will have oral talks from GraphLab, Spark, Datapad (a startup from the creator of python pandas), Trifacta (a startup from the creator of d3.js), Cloudera, Microsoft, Google, Pivotal, Adobe, Lab41, CMU and Pandora.

We can roughly divide the presenters into several domains: graph analytics (graphlab, pregel, petuum, grappa, stinger, grafos.ml, parameter server etc.), graph databases, graph visualization, python data science tools, and applications on graphs.

Graph databases are an emerging field. They are used to store and query graphs and are optimized for high performance on data with a graph structure. We will have demos from the leading graph database companies: Neo Technology (Neo4j), Aurelius (Titan), Franz, Objectivity (InfiniteGraph), and Sparsity Technologies.

Visualization helps data scientists deep dive into their data. In terms of visualization, we will have presentations from Trifacta, Cambridge Intelligence, Graphistry (viz using GPUs), Linkurious (a startup from the creators of the open source Gephi), Ayasdi, Tom Sawyer Software, and Plot.ly.

In terms of python / data science we will have presenters from Skytree, bigML, Zipfian Academy (python training), Continuum Analytics, IPython, Domino Data Labs, and Dataiku.

We have a very interesting presence of academic projects. Some examples are Petuum (CMU), a new system by Prof. Eric Xing; Parameter Server (CMU), a mega-scale framework for cluster implementation of ML methods by Prof. Alex Smola; Grappa (UW), a super fast graph analytics framework by Mark Oskin; and Stinger, a streaming graph system from Georgia Tech.

Graphs are everywhere! We assembled the most interesting use cases for graphs in industry. For example, Senzari, a company based in Florida, is creating the largest music graph, with 100 billion facts related to music! Ravel Law is using graphs obtained from Supreme Court rulings to deduce interesting and useful facts about law. Lumiata is compiling a healthcare graph for medical-science-based graph analytics. Crosswise is using graphs for security and entity disambiguation purposes.

The GraphLab conference started with 300 attendees in 2012, grew to 600 attendees in 2013, and we expect 900 data scientists in 2014. Secure your place today!

A limited special offer of 20% discount: Dannysblog. This offer will expire in a couple of days!
