When to Use an Absurd Number of Reducers in Hadoop

You should endeavor to write your reducers so that they complete in a reasonable amount of time. What reasonable means depends on the size of your cluster and the patience of your co-workers, but a good rule of thumb is maybe twenty minutes. That’s long enough to amortize any spin-up overhead but fast enough not to starve out other people’s jobs.

That’s reducer time. As for reducer quantity, the general wisdom is to request something near the number of available slots. It’s fine if everyone else is doing the same thing, because fair apportioning of resources is the framework’s job. Within normal operating parameters, you should just launch ’em all and let Hadoop sort ’em out.

But what if your reducers run for longer than twenty minutes, say a few hours each? You might think about dividing up the work more finely, since reduce times that long are a sign you're entering a region of non-scalability. However, that may come at the cost of complicating your algorithm. There are times when it's best to leave your solution as it is and just find some knob you can turn to make it play nicely with the other jobs on the cluster.

One trick: set mapred.reduce.tasks to an absurdly large number, something much larger than the number of available reducer slots. Of course you don't actually get that many slots, because you can't get blood from a stone, but this forces Hadoop to make the input for each reduce task smaller. In aggregate the reducers take the same amount of time, but each individual attempt finishes quicker, and the finer time granularity keeps your job from hogging the cluster. I resisted doing this for a long time because it seemed like cheating, but it works.
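On the command line, assuming your job's driver goes through ToolRunner (so the generic `-D` options are parsed), that looks something like the following sketch; the jar name, class name, and paths are made-up placeholders:

```shell
# Hypothetical job launch: the jar, main class, and paths are placeholders.
# 2000 is deliberately much larger than the cluster's reducer slots;
# Hadoop runs the reduce tasks in waves, each over a smaller slice of input.
hadoop jar my-analysis.jar com.example.MyJob \
    -D mapred.reduce.tasks=2000 \
    /input/path /output/path
```

Newer Hadoop releases spell the property mapreduce.job.reduces, and you can set the same value in driver code with job.setNumReduceTasks(2000).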


Responses to When to Use an Absurd Number of Reducers in Hadoop

  1. Harsh says:

    I agree: when reducers take quite a lot of time, it's best to increase their number for a higher degree of parallelism. Hash-partitioning magic 🙂

    It's also better to have reducers slow-start at the 80% map-completion stage rather than the default 5%. That makes for more 'working' reducers that neither wait long for map outputs nor open too many HTTP connections.

    For some more general tips, you can look at a few points on this slide: http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
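The slow-start threshold mentioned in the comment above is an ordinary job configuration property. A sketch of setting it at launch time, again assuming a ToolRunner-based driver and with hypothetical jar and path names:

```shell
# Hypothetical job launch: start reducers only after 80% of maps have
# finished, instead of the 5% default, so reducer slots aren't tied up
# idling while they wait for map output.
hadoop jar my-analysis.jar com.example.MyJob \
    -D mapred.reduce.slowstart.completed.maps=0.80 \
    /input/path /output/path
```

In newer releases the property is spelled mapreduce.job.reduce.slowstart.completedmaps.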

