Hadoop Reducer Memory Should be O(1)

If I have one overarching lesson from my year or so of working heavily with MapReduce it is this: the memory requirements of your map and reduce functions must all be asymptotically constant. I say map and reduce, but I mostly mean reduce, because mappers tend to be fast and stateless by design. (But don’t forget to periodically flush the cache of any in-mapper combiners.) I say just memory instead of memory and runtime because the framework is already doing some non-constant (presumably logarithmic) sort for your reducers, and anyway memory overflows are what have consistently bitten me.

This issue doesn’t come up in basic Hadoop documentation, because word count is already constant memory, as are similar tasks that use the reducer to distill a list down into some summarizing statistic. If you move outside that use domain (for instance, if like me you use Hadoop to manipulate large sets of objects) it becomes an issue. Even getting the memory requirement down to O(N) may not be enough, since the whole point of distributed computing is to make N really, really big.

This entry was posted in Those that have just broken the flower vase. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s