I got the following interesting links from Shaul Dar, Director of Risk Data Science at PayPal. They give a brief overview of Cascading and Scalding, two higher-level abstractions for data processing on top of Hadoop's MapReduce. Both are interesting alternatives to Pig.
1. Cascading - http://www.cascading.org/
In particular, take the time to look at the “Cascading for the Impatient” series, starting with Part 1. It will give you a real feel for how Cascading works.
2. (Cascading vs Pig) http://stackoverflow.com/questions/3681494/does-anyone-find-cascading-for-hadoop-map-reduce-useful
E.g.: “The biggest single advantage I see in Cascading is that it allows you to think about your data processing workflow in terms of operations on fields, and to (mostly) avoid worrying about how to transpose this view of the world onto the key/value model that's intrinsically part of any map-reduce implementation.
The biggest challenge with Cascading is that it is a different way of thinking about data processing workflows, and there's a corresponding conceptual "hump" you need to get over before it all starts making sense.”
3. Scalding - https://github.com/twitter/
4. (Scalding examples) http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
“Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that’s simple and concise. Unlike Pig, Scalding is written in pure Scala – which means all the power of Scala and the JVM is already built-in. No more UDFs, folks!”
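To make the Stack Overflow quote under item 2 more concrete, here is a minimal sketch of the same word count written twice: once in the key/value style of raw MapReduce, and once as operations on named fields. It is plain Python and uses neither Cascading's nor Scalding's API; the helper names (map_phase, reduce_phase, split_field, group_count) and the sample input are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog"]  # made-up sample input


# Key/value view: every step has to be phrased as (key, value) pairs.
def map_phase(line):
    for word in line.split():
        yield (word, 1)


def reduce_phase(key, values):
    return (key, sum(values))


def run_key_value(lines):
    shuffled = defaultdict(list)  # stands in for the shuffle/sort step
    for line in lines:
        for key, value in map_phase(line):
            shuffled[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in shuffled.items())


# Field view: records carry named fields, and each step names the
# fields it reads and writes instead of juggling keys and values.
def split_field(records, source, target):
    for rec in records:
        for token in rec[source].split():
            yield {**rec, target: token}


def group_count(records, field, count_field):
    counts = defaultdict(int)
    for rec in records:
        counts[rec[field]] += 1
    return [{field: k, count_field: n} for k, n in counts.items()]


def run_fields(lines):
    records = [{"line": text} for text in lines]
    words = split_field(records, "line", "word")
    return group_count(words, "word", "count")


kv_result = run_key_value(lines)
field_result = run_fields(lines)
```

Both versions compute the same counts. The difference is that the second pipeline reads as "split the line field into a word field, then group by word and count", which is roughly the mental shift the quote describes.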