Large Scale Machine Learning and Other Animals
Sunday, May 19, 2013
An Overview of Graph Processing Frameworks
A nicely written overview of Graph processing frameworks by Ben Lorica from O'Reilly media. Will help you master the essential buzzwords on big data analytics and graph processing.. And explains quite clearly why you should attend our 2nd GraphLab workshop!
Kaggle Titanic Contest
I got this from Sagie Davidovich a link to Kaggle's Titanic Contest. The task is to predict who survived the Titanic based on some features like age, cabin class, name, and presence of relatives on the Ship.
What I like about this task, is that you can actually explain to your gradma what you are predicting (unlike many other ML tasks which are hard to explain for the non expert..). The dataset is quite tiny in our standards. But still enables the usage of different ML methods. The best prediction on the leader-board has about 99% accuracy.
What I like about this task, is that you can actually explain to your gradma what you are predicting (unlike many other ML tasks which are hard to explain for the non expert..). The dataset is quite tiny in our standards. But still enables the usage of different ML methods. The best prediction on the leader-board has about 99% accuracy.
Tuesday, May 14, 2013
Funding for the next generation of GraphLab
The GraphLab journey began with the desire:
- to rethink the way we approach Machine Learning and Graph analytics,
- to demonstrate that with the right abstractions and system design we can achieve unprecedented levels of performance, and
- to build a community around large-scale graph computation.
We have been blown away by the excitement and growth of the GraphLab community and have been unable to keep up with the incredible interest from our amazing users.
Therefore, we are proud to announce GraphLab Inc, a company devoted to accelerating the development of the open-source GraphLab project.
Why a company?
Put simply, we need a full-time dedicated effort to take GraphLab from where it is today to where we would like to see it go in the future. With a dedicated team, you will see exciting new features, more integration with Cloud infrastructure, easier installation and deployment, along with a revamped support effort, with additional commercial-grade options and tools.
What happens to the open-source project?
The open source project will remain the flagship technology where we push the limits of graph computation and develop new ideas. With a dedicated team focused on the code, documentation, deployment process and most importantly support, you can expect a regular release schedule, faster turnaround on bug fixes and improvements, and a higher quality bar for the code going forward. We will continue to rely on our community to push us to tackle bigger, harder problems, and to contribute back to the open-source effort.
Are you going to charge for GraphLab now?
We believe the only way the best ideas in machine learning are developed is with the help of a vibrant community (it takes a village). That is why GraphLab will remain an open-source project; now with a company behind it, we can double our efforts to keeping the project healthy.
What is the roadmap for the open-source project?
GraphLab 2.2 is just around the corner, see here for more details as to what is in it. Beyond that, we are exploring a new computation engine and further enhancements to the communication layer, as well as simpler integration with existing Cloud technologies, easier installation procedures, and an exciting new graph storage system. And of course, we look forward to working with you to develop the roadmap and build the next generation of the GraphLab system.
The next 12 months will be an exciting time for us, and we hope that everyone will come along for the ride
.
Learn more at graphlab.com
Sunday, May 12, 2013
Bond Percolation in GraphLab
I was asked by Prof. Scott Kirkpatrick to help and implement bond percolation in GraphLab. It is an oldie but goldie problem which is closely related to the connected components problem.
Here is an explanation about bond percolation from Wikipedia:
A representative question (and the source of the name) is as follows. Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom? This physical question is modelled mathematically as a three-dimensional network of n × n × n vertices, usually called "sites", in which the edgeor "bonds" between each two neighbors may be open (allowing the liquid through) with probability p, or closed with probability 1 – p, and they are assumed to be independent. Therefore, for a given p, what is the probability that an open path exists from the top to the bottom? The behavior for large n is of primary interest. This problem, called now bond percolation, was introduced in the mathematics literature by Broadbent & Hammersley (1957), and has been studied intensively by mathematicians and physicists since.
Here is an explanation about bond percolation from Wikipedia:
A representative question (and the source of the name) is as follows. Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom? This physical question is modelled mathematically as a three-dimensional network of n × n × n vertices, usually called "sites", in which the edgeor "bonds" between each two neighbors may be open (allowing the liquid through) with probability p, or closed with probability 1 – p, and they are assumed to be independent. Therefore, for a given p, what is the probability that an open path exists from the top to the bottom? The behavior for large n is of primary interest. This problem, called now bond percolation, was introduced in the mathematics literature by Broadbent & Hammersley (1957), and has been studied intensively by mathematicians and physicists since.
2-d bond percolation problem. Will the water find their way
along the edges of this maze from top to bottom?
The algorithm for finding the connected edges is very simple:
In parallel
- for each edge, record the minimum id of the connected edges
- if there are no more changes in the network, break
This algorithm is frequently used in social networks to find groups of friends. And it is of course distributed and scales to very large problems. The output is the min component found for each edge. From this output we can find the edge clusters.
I have utilized the newest experimental version of GraphLab (v. 2.2) for quickly implementing the above algorithm. The resulting code is surprisingly simple. The main loop of the program:
| //create a GraphLab engine engine_type engine(dc, graph, clopts); //register the map and combine operations that will be used in the update function |
| engine.register_map_reduce(BOND_PERCOLATION_MAP_REDUCE, bond_percolation_map, bond_percolation_combine); |
| //in a loop for (int i=0; i< max_iter; i++){ //perform update function on each node |
| engine.parfor_all_local_vertices(bond_percolation_function); //wait until all nodes are done |
| engine.wait(); //count the number of unsatisfied links |
| size_t diff = graph.map_reduce_edges<size_t>(count_component); |
| //if no more links to explore we are done |
| if (diff == 0) |
| break; |
| } |
bond_perculation_function which is executed in parallel across all node. In this function, we traverse all connected edges, and compare their minimum edges id using the bond_perculation_combine. As it is easy to verify, each of those implemented function has a single line of code.
| //return the min component id found across this edges and its two connecting nodes |
| unsigned int bond_percolation_map(const graph_type::vertex_type& center, |
| graph_type::edge_type& edge, |
| const graph_type::vertex_type& other) { |
| return edge.data().comp_id = std::min(std::min(center.data().comp_id, edge.data().id), other.data().comp_id); |
| }
|
| //find min component of two connected edges |
| void bond_percolation_combine(unsigned int& v1, const unsigned int& v2) { |
| v1 = std::min(v1, v2); |
| } |
| //the main update function, go over all nodes and selects their min edge id |
| void bond_percolation_function(engine_type::context_type& context, |
| graph_type::vertex_type& vertex) { |
| vertex.data().comp_id = context.map_reduce<unsigned int>(BOND_PERCOLATION_MAP_REDUCE, graphlab::ALL_EDGES); |
| } |
| //count the number of components that are still not satisfied. When the number of components are zero, we are done |
| size_t count_components(const graph_type::edge_type & edge) { |
| return = (edge.source().data().comp_id != edge.target().data().comp_id); |
| } |
The full code is here. It is now part of the graph_analytics toolkit in v2.2. Needless to say it is working.. :-)
Saturday, May 11, 2013
Hadoop Mortar
Just got from my boss Prof. Carlos Guestrin this link: http://blog.mortardata.com/post/49934459499/recommender-systems-for-free
It looks like a nice publicity stunt - Hadoop Mortar is proposing to build recommender systems for free for selected companies. I wonder if this will actually work?
It looks like a nice publicity stunt - Hadoop Mortar is proposing to build recommender systems for free for selected companies. I wonder if this will actually work?
Thursday, May 9, 2013
Fun Readings about Startups
I got the following from my collaborator Jay Gu:
Here's a fun reading about startup: a well written lecture notes on Peter Thiel's course at Stanford, which covers topics like culture, hiring, strategy, lessons and mindset etc. I've read first 5 chapters and found it interesting and rewarding so I decided to share with all of you.
Html version
Epub version if you like to read on your iPad
Below are some fun quotes just give you a quick taste.
About hiring good engineers from "Google":
"So the way to compete against the giants is not with money. Google will outbid you. They have oil derrick that spits out $30bn in search revenue every year. To win, you need to tell a story about cogs. At Google, you’re a cog. Whereas with me, you’re an instrumental piece of this great thing that we’ll build together. Articulate the vision. Don’t even try to pay well. Meet people’s cash flow needs. Pay them so they can cover their rent and go out every once in awhile. It’s not about cash. It’s about breaking through the wall of cynicism. It’s about making 1% of this new thing way more exciting than a couple hundred grand and a cubicle at Google."
"We tend to massively underestimate the compounding returns of intelligence. As humans, we need to solve big problems. If you graduate Stanford at 22 and Google recruits you, you’ll work a 9-to-5. It’s probably more like an 11-to-3 in terms of hard work. They’ll pay well. It’s relaxing. But what they are actually doing is paying you to accept a much lower intellectual growth rate. When you recognize that intelligence is compounding, the cost of that missing long-term compounding is enormous. They’re not giving you the best opportunity of your life. Then a scary thing can happen: You might realize one day that you’ve lost your competitive edge. You won’t be the best anymore. You won’t be able to fall in love with new stuff. Things are cushy where you are. You get complacent and stall. So, run your prospective engineering hires through that narrative. Then show them the alternative: working at your startup."
The fundamental question
"The path from 0 to 1 might start with asking and answering three questions: First, what is valuable? Second, what can I do? And third, what is nobody else doing?
The intellectual rephrasing of these questions is: What important truth do very few people agree with you on?
The business version is: What valuable company is nobody building?"
About escaping competition:
"Intense competition makes things hard because you just beat heads with other people. The intensity of competition becomes a proxy for value. But value is a different question entirely."
About how to own a market:
"For a company to own its market, it must have some combination of brand, scale cost advantages, network effects, or proprietary technology."
About hiring nerds vs athletes:
"In thinking about building good company culture, it may be helpful to dichotomize two extreme personality types: nerds and athletes. Engineers and STEM people tend to be highly intelligent, good at problem solving, and naturally non zero-sum. Athletes tend to be highly motivated fighters; you only win if the other guy loses."
"The optimal spot on the matrix is monopoly capitalism with some tailored combination of zero-sum and non zero-sum oriented people. You want to pick an environment where you don’t have to fight. But you should bring along some good fighters to protect your non zero-sum people and mission, just in case."
New collaborative filtering functionality in GraphLab and GraphChi
A few days ago I wrote in this blog about parallel coordinate descent for speeding up ALS.
I have just implemented parallel ALS using coordinate descent a.k.a.
I have just implemented parallel ALS using coordinate descent a.k.a.
CCD++ algorithm. The algorithm is described in the following two papers:
H.-F. Yu, C.-J. Hsieh, S. Si, I. S. Dhillon, Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. IEEE International Conference on Data Mining(ICDM), December 2012.
Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 635-644.
In a nutshell, it speeds up ALS by avoiding the need for costly least square computation, each dimension (coordinate) is handled separately in parallel.
Documentation of the method for GraphChi is here:
Documentation of the method for GraphLab is here:
Note: For GraphLab the algorithm is implemented in version 2.2 which will be release this summer. It is still possible to checkout this version and try it out using the mercurial command "hg up v2.2".
Let me know if you try it out!
Subscribe to:
Posts (Atom)
