Large Scale Machine Learning and Other Animals: September 2013

Sunday, September 29, 2013

Spark raises $14M to compete with Hadoop

I heard the news four days late, but here it is:

http://gigaom.com/2013/09/25/databricks-raises-14m-from-andreessen-horowitz-wants-to-take-on-mapreduce-with-spark/

See some older posts about Spark:
http://bickson.blogspot.co.il/2013/09/mlbase-spark-talk.html
http://bickson.blogspot.co.il/2012/12/graphlab-vs-piccolo-vs-spark.html

Thursday, September 19, 2013

GraphLab Internship Program (Machine Learning Summer Internship)

We are glad to announce our latest internship program for the summer of 2014. We have around 15 open positions, either at GraphLab/UW or affiliated companies we work with.

Would you like to have a chance to deploy cutting edge machine learning algorithms in practice? Do you want to get your hands on the largest and most interesting datasets out there? Do you have valuable applied experience working with machine learning in the cloud? If so, you should consider our internship program.

Candidates must be US-based PhD or master students in one of the following areas: machine learning, statistics, AI, systems, high performance computing, distributed algorithms, or math. We are especially interested in those who have used GraphLab/GraphChi for a research project or have contributed to the GraphLab community.

All interested applicants should send their resume to internships@graphlab.com. If you are a company interested in having a GraphLab intern, please feel free to get in touch.
Here is a (very preliminary) list of open positions:

GraphLab is a graph-based, high performance, open source distributed computation framework written in C++. While GraphLab was originally developed for Machine Learning tasks, it has found great success at a broad range of other data-mining tasks, out-performing other abstractions by orders of magnitude. Our latest beta is a cloud-based IPython interface to GraphLab. The GraphLab office is located on the UW campus in Seattle.

Rocket Fuel is a digital-advertising technology company in Silicon Valley that has grown rocket-fast since its founding in 2008. It is the leading provider of artificial intelligence online-advertising solutions. It combines a world-class engineering team with the industry's most productive sales and delivery teams. Its customers are some of the world’s most successful brands. Rocket Fuel serves them from its offices in 19 cities around the globe. Engineering is located in Redwood City (Bay Area), CA.

Technicolor is an industry leader in the production of video content for movies, TV, advertising, games and more. The Palo Alto research lab offers internship positions in the fields of machine learning, optimization, networking, and systems, focusing on high quality academic publications. We encourage applications from highly motivated Ph.D. candidates in Computer Science, Electrical and Computer Engineering, Statistics and Applied Math, with a strong background in one of the above fields. Technicolor Labs are located in downtown Palo Alto.

CloudCV is a large-scale distributed computer vision algorithm package, running on the cloud. CloudCV is a project developed at the Machine Learning and Perception Lab, Virginia Tech, led by Prof. Dhruv Batra. CloudCV uses GraphLab for implementing distributed computer vision algorithms. Some of the functionality is derived from using OpenCV with the GraphLab SDK.

Tagged makes social discovery products that enable anyone to meet and socialize with new people. Our mission is to help everyone feel love and belonging, and we're building toward a vision where anyone can use a device to instantly connect with interesting new people anytime, anywhere. Founded in 2004 and profitable since 2008, Tagged is a market leader in social discovery with over 300 million registered members in 220 countries who make over 100 million new social connections every month.

Rackspace is the leading innovator in cloud computing and one of the founders and largest contributors to OpenStack. Rackers have the freedom to do great work and the challenge of using their skill and ambition to drive the future of the Web. Rackspace Cloud provides on-demand scalable website, application, and storage hosting. We enable developers and business decision-makers to avoid the hassles and costs of dedicated hardware while offering the infinite scalability of the Cloud. Rackspace has an inspiring culture which provides room for growth and innovation. We have multiple public and internal facing tech events each week. Rackspace is ranked on Forbes Fastest Growing Technology Companies list and is on Fortune and Glassdoor's Top 100 Companies to work for. Learn more about us: http://developer.rackspace.com/
http://rackertalent.com/sanfrancisco/
http://www.rackspace.com/blog/

Zillow, Inc. operates the largest home-related marketplaces on mobile and the Web, with a complementary portfolio of brands and products that help people find vital information about homes, and connect with the best local professionals. In addition, Zillow operates an industry-leading economics and analytics bureau led by Zillow's Chief Economist Dr. Stan Humphries. Dr. Humphries and his team of economists and data analysts produce extensive housing data and research covering more than 350 markets at Zillow Real Estate Research.

Want to work on the science behind the largest scale music service in the US?

Pandora's science team is building the next innovations in machine learning algorithms that help hundreds of millions of listeners discover music they love. We're looking for 3-4 interns (preferably 2nd or 3rd year Ph.D. Students) whose research interests are aligned with ML and are passionate about music. Potential topics include personalization, scalable algorithms, real time and effective recommendation measurement, user modeling, etc. Please apply here: http://www.pandora.com/careers/position?id=o06RXfwE, and indicate in your cover letter that you heard about the position from the GraphLab blog.

StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests. More than 30 million people turn to StumbleUpon to be informed, entertained and surprised by content and information recommended just for them. In addition, more than 100,000 brands, publishers and other marketers have used StumbleUpon’s Paid Discovery platform to tell their stories and promote their products and services. StumbleUpon offices are located in downtown San Francisco.

Intel Labs is addressing the challenges of Big Data Analytics with disruptive technologies and is focused on delivering new software and hardware technologies that will transform datacenter operations and improve user experience. It is driving innovations to better handle data both big and small, from small-scale sensor networks that enable data gathering to large-scale cloud resources for deep analysis. And, it is developing knowledge discovery platforms to support emergent applications, from personalized recommendation systems to targeted drug therapies. We are interested in talking to MS and PhD candidates with excellent programming skills and coursework in machine learning, data mining, or parallel computing systems.

Comcast Labs Washington DC is an innovative research group within Comcast's Metadata Processing and Search Services unit that does groundbreaking research to develop the video and TV search & discovery technologies that support Comcast’s approximately 25 Million subscribers. Our projects focus on Machine Learning for Recommendations, Search and Click-through Prediction, NLP for Voice-based Interfaces, Video Annotation/Segmentation of premium video, and other Big Data problems. Comcast is the largest provider of TV and Broadband Services in North America, the largest provider of TV search and discovery applications in North America, & the 6th largest provider of search on the web.

The Social Computing Group at Adobe is working with data from Behance, a social network for creatives that joined Adobe in January 2013, developing an infrastructure that leverages this social graph to extract value for applications like talent search, recommendations and graph search. It is using Machine Learning and Graph Mining techniques to extract aggregate patterns and help users find work, interesting content and people. Behance is at the core of Adobe's push to make the creative community more interconnected and organized, and to help its members leverage each other's work to be more productive and successful.

LivingSocial is the local marketplace to buy and share the best things to do in your city. With unique and diverse offerings each day, we inspire members to discover everything from weekend excursions to one-of-a-kind events and experiences to exclusive gourmet dinners to family aquarium outings and more. We are interested in interns with in-depth knowledge of machine learning and/or recommendation systems as well as excellent programming skills who would like to help us solve unique marketing problems through leveraging big data infrastructures and tools.

Tapad Inc. is the undisputed leader in cross-platform marketing technology. The company’s groundbreaking technologies address the new and ever-evolving reality of media consumption on smartphones, tablets and home computers. Through Tapad, advertisers are now able to get a unified view of consumers across all screens. Tapad has built the most robust, real-time cross-platform audience buying technology available. Tapad is backed by major venture firms and “a hell of a list of entrepreneurs who created some of the most valuable online advertising companies of the last decade” (TechCrunch). Tapad is based in New York and has offices in Atlanta, Chicago, Dallas, Detroit, Los Angeles, Miami and San Francisco.

CBS Interactive, a division of CBS Corporation, is the premier online content network for information and entertainment. With more than 250 million people visiting its properties each month, it is a top 10 Web property globally and a top 5 Web property in the U.S. in terms of unique video viewers. Its portfolio of leading brands, which include CNET, CBS.com, CBSNews.com, CBSSports.com, GameSpot, TVGuide.com, TV.com and Last.fm, span popular categories like technology, entertainment, sports, news and gaming.

Rdio is the groundbreaking digital music service that is reinventing the way people discover, listen to, and share music. With on-demand access to over 20 million songs, Rdio connects people with music and makes it easy to search for and instantly play any song, album, artist, or playlist. Launched in August 2010, Rdio is headquartered in San Francisco and was founded by Janus Friis, the co-creator of Skype. Available in countries all over the world, Rdio is funded by Janus Friis through his investment entities, Atomico, and Skype.
Stay tuned - additional companies will be added soon.

ExxonMobil’s Corporate Strategic Research is currently offering graduate-level internships for summer of 2014 in the area of large scale machine learning within our Data Analytics and Optimization Section. During the three month internship, working closely with researchers in our team, the candidate will be expected to refine the problem definition, conduct fundamental research on algorithms and theory, demonstrate results on a prototype application and prepare material for publication. The mission of Corporate Strategic Research is unique within the ExxonMobil Corporation. We are tasked with creating science-based opportunity and competitive advantage for ExxonMobil by conducting high-risk, high-reward research aligned with the Corporation’s business objectives. The laboratory is located 50 miles from New York City in scenic western New Jersey. The successful candidate will join a dynamic group of scientists performing breakthrough research for all sectors of the corporation, developing new approaches to solve our most challenging problems. ExxonMobil offers an excellent working environment and competitive compensation. Start and end dates are flexible, typically encompassing mid-May to late-August. Subsidized housing is offered.

The Bosch Research and Technology Center, with labs in Palo Alto, CA, Pittsburgh, PA, and Cambridge, MA, focuses on innovative research and development for the next generation of Bosch products. The data mining group creates new data mining and large-scale machine learning algorithms for high-performance, distributed and parallel computing environments. Our problems deal with sparse, high-dimensional heterogeneous data that have temporal correlations, missing values and asynchronous streams. Topics that we work on include time-series analysis, latent variable models, sparse and missing data problems, and association rule mining, to name a few. Our models and methods are implemented in a distributed, parallelized architecture and run on our HPC cluster in order to scale up to Big data sets. Internships are expected to be at least 10-12 weeks long during the summer months. Previous internships in our group have led to successful publications and/or patents. More information can be found at http://www.bosch.us/content/language1/html/9799.htm

First Seattle GraphLab Users Meetup - Thursday November 21 in Seattle

The meetup will take place on Thursday, November 21 at the GraphLab office (on the UW campus) in Seattle.
We will formally release our GraphLab Notebook (beta). Everyone is invited!

Space is limited. Please RSVP here.

Saturday, September 14, 2013

Predicting personal information from Facebook likes data

I got from Carlos del Cacho, a GraphLab user and a data scientist from Traity, a link to the following interesting paper:
Private traits and attributes are predictable from digital records of human behavior, by Michal Kosinski, David Stillwell and Thore Graepel, in PNAS 2013.

While this result is not surprising, and the method used is rather basic, it is still a nice demonstration of a fact we are all aware of, that machine learning is a powerful tool in predicting various user properties.

The authors use the following simple construction: create a user vs. likes matrix, decompose it using SVD with 100 dimensions, and then perform linear regression for each field of interest to find a weight for each singular vector. Once a new user is observed, those weights are applied to the user's SVD components to compute the prediction. (Logistic regression was used for binary categorical variables.)
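The construction above can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions, not the authors' code: the function names fit_trait_model and predict_trait are hypothetical, the likes matrix is taken to be a dense 0/1 array, and plain least squares stands in for whatever regression setup the paper actually used (the logistic regression variant for binary fields is not shown).

```python
import numpy as np


def fit_trait_model(likes, target, k=100):
    """Fit an SVD + linear regression model for one numeric trait.

    likes  : (n_users, n_likes) array, 1.0 if the user liked the item.
    target : (n_users,) array with the trait value of each user.
    k      : number of singular vectors to keep (capped by the matrix size).
    """
    # Thin SVD: likes = U @ diag(s) @ Vt
    _, s, Vt = np.linalg.svd(likes, full_matrices=False)
    k = min(k, len(s))
    Vt_k = Vt[:k]

    # Project every user onto the top-k right singular vectors.
    Z = likes @ Vt_k.T

    # Least-squares fit with an intercept column: one weight per component.
    A = np.hstack([Z, np.ones((Z.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return Vt_k, w


def predict_trait(new_likes, Vt_k, w):
    """Predict the trait for one or more users from their likes vectors."""
    Z = np.atleast_2d(new_likes) @ Vt_k.T
    A = np.hstack([Z, np.ones((Z.shape[0], 1))])
    return A @ w
```

A new user is handled by projecting their likes vector with the stored Vt_k and applying the fitted weights, which is the "once a new user is observed" step. For a real users-by-likes matrix, which is large and sparse, a truncated sparse SVD would be the more practical choice than the dense decomposition used here.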

And here are some of the results. Each number signifies the success in prediction.

Another related paper that comes to mind is by my friend Udi Weinsberg from Technicolor Labs:

"BlurMe: Inferring and Obfuscating User Gender Based on Ratings," Udi Weinsberg, Smriti Bhagat, Stratis Ioannidis and Nina Taft. ACM Conference on Recommender Systems (RecSys), 2012.

Wednesday, September 11, 2013

ParLearning 2014 workshop

My colleague Yinglong Xia from IBM Watson invited me to participate in the program committee of ParLearning 2014, a workshop dedicated to parallel machine learning, in conjunction with IPDPS 2014.

The workshop will be held on May 23, 2014 in Phoenix, AZ. Workshop highlights are:
* Foster collaboration between HPC community and AI community
* Applying HPC techniques for learning problems
* Identifying HPC challenges from learning and inference
* Explore a critical emerging area with strong academia and industry interest
* Great opportunity for researchers worldwide for collaborating with Academia and Industry

Submission date is December 30, 2013.

Tuesday, September 10, 2013

Winning solution for Yelp! Business Prediction Contest

Just heard from my mega collaborator Justin Yan, now working at Alibaba, that his lab mates won first place in the Yelp! business prediction contest. The team is composed of students from the Chinese Academy of Sciences, headed by Yuyu Zhang.

A brief description of the solution method is given here (see the middle post from Brick Mover).

I will try to extract some additional tips from the winning team and report them here soon.

MLConf Machine Learning Workshop- Nov 2013 in SF

Our event organizers Courtney & Shon Burton are organizing a machine learning workshop in mid-November (Nov. 15) in SF.

The preliminary agenda is here. Preliminary speakers are Xavier Amatriain (Netflix), Ted Willke (Intel Labs), Jake Mannix (Twitter), Joey Gonzalez (GraphLab), Eric Bieschke (Pandora), Ameet Talwalkar (Berkeley) and others.

All of my blog readers are welcome to use this discount code.

Wednesday, September 4, 2013

Big Learning 2013 Workshop

My collaborator Joey Gonzalez is once again organizing the NIPS Big Learning workshop this year.
The workshop will take place in early December 2013 in Lake Tahoe. The submission deadline is October 9th, 2013.

Some of this year's topics:

* Distributed algorithms for online and batch learning
* Parallel (multicore) algorithms for online and batch learning
* Theoretical analysis of distributed and parallel learning algorithms
* Implementation studies of large-scale distributed inference and learning algorithms --- challenges faced and lessons learnt
* Database systems for Big Learning --- models and algorithms implemented, properties (availability, consistency, scalability, etc.), strengths and limitations

Tuesday, September 3, 2013

MLBase + Spark talk

I got this video talk from Senthil Gandhi, a senior data scientist at Firat Retail Labs. It is a video about the MLBase project, which runs on top of Spark.

The talk is by Ameet Talwalkar and Evan Sparks from the Berkeley AMP Lab. The video, released a couple of weeks ago, is of a talk given at Twitter on Aug 6, 2013. Highly recommended!

On a related note, I just got from my collaborator Joey Gonzalez, our man at the AMP Lab, a link to the recent AMP Camp. The talk videos will be released soon.

Monday, September 2, 2013

A guide to Python frameworks for Hadoop

I got this interesting link from J D Chen, my colleague at Renren (Chinese Facebook).
It is a talk by Uri Laserson from Cloudera, who compares different high-level Python frameworks that run on top of Hadoop. Lecture slides are available here. And here is a detailed blog post.

As we demoed at the GraphLab workshop, we are also working on a Python interface to GraphLab. Stay tuned!
