Recently I was giving @trochee a tour of a project he’s been hired to work on. It’s a Hadoop pipeline that does very sophisticated things with gargantuan amounts of data. @trochee hasn’t used Hadoop before, but he is an experienced enough programmer to know how to make an asset out of ignorance, so at one point he stopped me and said, “That step looks interesting. How can I download a couple of the data files it is using to my laptop so that I can write Python scripts to analyze them?” I kind of looked at him and started sputtering about sequence files and serialization formats and, well, there’s this one tool that dumps them to the screen which is kinda messy but if you pipe it into grep…Turns out you can do it, but it isn’t easy, and it should be, and there’s no excuse for this.
There are any number of lessons to be drawn here. One is certainly: developers, remember that no matter how bruising it may be to your ego in the moment, Rude Q&A Is Your Friend and should always happen earlier and more often. A more specific lesson is this: MapReduce is a scalability tool, not a programming language. Pace Jimmy Lin, Chris Dyer, and Graeme Hirst, who claim in their book that MapReduce is a paradigm on par with, say, functional programming or object orientation, it does only one thing: it scales. It does so in a way that leverages concepts from functional programming to make its strategy clear, and a well-crafted framework will mostly stay out of your way, but ultimately it’s just a technique for taking some operation you might perform thousands of times and letting you perform it billions of times instead.
Billions is good. These days doing something a billion times is often where the action is. But in the course of getting to billions you must not lose your ability to do thousands. Or hundreds. Or two. Scalability may be a necessary means for attaining a business or scientific goal, but it is never the goal itself. If your bakery can turn out a hundred cherry pies a day that’s great, but if I ask you for a single slice of cherry pie don’t just give me a dumb look.
In exchange for Hadoop’s scalability you pay a complexity price. As with many areas of programming, once you get good at paying the price (in Hadoop’s case: banging out job setup boilerplate, combiners, raw Writable comparators, etc.) you forget that it is a price, something that in a perfect world wouldn’t exist. Stockholm Syndrome is an occupational hazard. The following are reminders I use to help me see the bars of the MapReduce cage.
- Factor your algorithm out of your Hadoop code…Write your task’s logic in Java classes that can be run on the command line or in a unit test outside of Hadoop. Strive to make your actual map and reduce jobs generic wrappers for these classes. Also: call your mother, get a full night’s sleep, and make sure to drink eight ounces of water a day. Of course anyone who’s not a total greenhorn is going to factor like this. The problem is that it’s easier said than done. If your mapper function consists of two lines of code, this factorization can turn into distracting boilerplate. Should your core algorithm’s data types be Writable? Probably not, but then you need to derive Writable subclasses, which can get cluttered if you have to do it too often. The truth is, your algorithm may not naturally break up into map and reduce steps. In a draft of a paper I’m working on I present two pieces of pseudocode for the same algorithm, one a recursive function that gets the idea across and the other the MapReduce implementation I actually run. Shoehorning may be the price of scale, but you must at least be aware of the tradeoff when you’re making it.
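The shape of this factoring is easy to sketch in Python (my real code is Java, and all names here are hypothetical): the algorithm lives in a pure function with no Hadoop anywhere in sight, and the mapper is nothing but a wrapper around it.

```python
def count_tokens(line):
    """Core logic: a pure function with no Hadoop in sight.

    Takes a line of text, returns sorted (token, count) pairs.
    Runs at the command line, in a unit test, or inside a
    mapper, unchanged.
    """
    counts = {}
    for token in line.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return sorted(counts.items())

def mapper(key, line):
    """Thin framework-facing wrapper: all it does is emit what the
    core function computed. (In Java this is also where the
    conversion to and from Writable types would live.)"""
    for token, n in count_tokens(line):
        yield token, n
```

The point is the asymmetry: `count_tokens` is where the thinking happens and where the tests go; `mapper` is boilerplate you could rewrite for a different framework in an afternoon.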
- You still have to test the whole damn thing…The interface between Hadoop and non-Hadoop code itself becomes something that requires testing. Don’t let it become a fissure into which bugs can fall and multiply. At some point you need to run the thing end-to-end. The full system driver in MRUnit is currently the best way I know to do this at the unit test level. And for projects of any significant size you are using a continuous build and integration system, right?
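What MRUnit’s full-system driver does in Java can be approximated in a few lines of Python, just to show the idea: push a tiny fixture through map, a simulated shuffle, and reduce, then assert on the final output (the mapper and reducer here are stand-ins, not any real API).

```python
from collections import defaultdict

def map_fn(line):
    # Stand-in mapper: emit (token, 1) for each token.
    for token in line.split():
        yield token, 1

def reduce_fn(key, values):
    # Stand-in reducer: sum the counts for one key.
    yield key, sum(values)

def run_end_to_end(lines):
    """Map, shuffle (group by key), reduce: the whole pipeline in
    one function, small enough to call from a unit test."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    output = {}
    for key in sorted(groups):
        for k, v in reduce_fn(key, groups[key]):
            output[k] = v
    return output
```

A unit test then asserts on `run_end_to_end` over a handful of records, which exercises the seam between framework-facing and framework-free code without ever touching a cluster.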
- Don’t let data fall down the rabbit hole…Files on HDFS are generally not human readable. They’re compressed, and they’re usually sequence files, a machine-readable format. If those files contain complex objects they might be further serialized in a format like Avro or Thrift. If you’re not careful these multiple layers of encoding can become a barrier between you and your information, so that you have to fire up the whole Hadoop machinery just to read a file. Be mindful of this pitfall, and at every point make sure it’s easy to get your hands on your data in both human- and script-readable form.
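One habit that helps: alongside whatever binary formats the pipeline requires, make it trivial to round-trip records through plain tab-separated text. A sketch, assuming simple string-and-integer records:

```python
def dump_tsv(records):
    """Render (key, count) records as one tab-separated line each:
    greppable by a human, parseable by any script."""
    return "\n".join(f"{key}\t{count}" for key, count in records)

def load_tsv(text):
    """Inverse of dump_tsv: read tab-separated lines back into
    (key, count) records."""
    records = []
    for line in text.splitlines():
        key, count = line.split("\t")
        records.append((key, int(count)))
    return records
```

If a dump like this exists at every stage of the pipeline, the @trochees of the world can pull a sample down to a laptop and start writing Python scripts in minutes instead of days.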
- Everything should run without a cluster…Your code should run on a laptop against a small quantity of data in a single thread as if distributed computing had never been imagined. As with the first of these exhortations, this is easier said than done. It takes architectural skill to keep the redundant code to a minimum while recognizing that some extra work is necessary. The Mahout machine learning package is a good example to emulate in this regard. Mahout maintains a separation between data mining algorithms, which can be run at the command line, and Hadoop “drivers” that execute the same code at scale.
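Mahout’s split can be mimicked in miniature (hypothetical names, sketched in Python): the algorithm is a plain class that imports nothing framework-related, and the laptop driver just feeds it an iterable. A Hadoop driver would wrap the very same class in a mapper or reducer.

```python
class MeanEstimator:
    """The algorithm: incrementally computes a mean. No Hadoop,
    no cluster, no framework -- just the arithmetic."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def observe(self, x):
        self.total += x
        self.count += 1

    def result(self):
        return self.total / self.count

def local_driver(values):
    """Laptop driver: single thread, plain iterable, as if
    distributed computing had never been imagined."""
    estimator = MeanEstimator()
    for v in values:
        estimator.observe(v)
    return estimator.result()
```

The redundant code is confined to the drivers, which are short; the algorithm itself is written exactly once.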
MapReduce is lovely, but it’s a tool and you’re the human, so never forget who is boss.