You have a Big Data problem. How do you go about programming a solution?
The usual suspects include Hive and Pig. The beginning is seductively easy. But the difficulties start piling up as your project stops being a cute experiment and grows in complexity and size. You may end up with large, un-debuggable queries that bomb in their third hour of execution, a myriad of user-defined functions (written in yet another language!), and shell scripts that fake the iteration missing from your Big Data language. Not to mention the lack of support for Test-Driven Development or Continuous Integration. It feels like you are in some parallel universe.
Enter Scalding. Scalding is a Scala library for Big Data processing. Its API, which resembles conventional Scala collections, immediately looks intuitive. It was developed at Twitter and is used at a number of well-known, “hot” companies.
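To give a feel for what "resembles conventional Scala collections" means, here is a minimal word-count sketch written against plain in-memory Scala collections only. It is not Scalding code: the object name `WordCountSketch`, the helpers `tokenize` and `wordCount`, and the sample lines are all illustrative assumptions. The point is the shape of the pipeline. In Scalding, a similar chain of `flatMap` and `groupBy` steps would be applied to a distributed `TypedPipe` instead of a local `Seq`.

```scala
// Word count over an in-memory Seq, using only the Scala standard library.
// All names here (WordCountSketch, tokenize, wordCount) are hypothetical.
object WordCountSketch {

  // Lowercase the text, drop punctuation, and split on whitespace.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .replaceAll("[^a-z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  // Count how many times each word occurs across all lines.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(tokenize)   // one element per word
      .groupBy(identity)   // use each word as its own key
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }

  def main(args: Array[String]): Unit = {
    // Sample input, assumed for illustration only.
    val lines = Seq("Enter Scalding.", "Scalding is a Scala library.")
    wordCount(lines).toSeq.sortBy(_._1).foreach {
      case (word, n) => println(s"$word\t$n")
    }
  }
}
```

Because the logic lives in ordinary functions such as `tokenize`, it can be unit-tested without a cluster, which is the kind of TDD workflow this approach makes possible.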
Scalding enables us to use Scala, a full-blown language, with all the regular TDD and CI stuff you would expect. We have even seen conventional data professionals fall in love with the API and start learning Scala just so they could use Scalding!
The future looks good for Scalding as a common API for Hadoop, Spark and Flink.
Come to the talk (below) to see Scalding in action! You’ll get a good feel for Scalding features and style and you will know what to do next.