If I have one overarching lesson from my year or so of working heavily with MapReduce, it is this: the memory requirements of your map and reduce functions must be asymptotically constant. I say map and reduce, but I mostly mean reduce, because mappers tend to be fast and stateless by design. (But don’t forget to periodically flush the cache of any in-mapper combiners.) I say just memory, rather than memory and runtime, because the framework is already doing a non-constant (O(N log N)) sort for your reducers anyway, and in practice it is memory overflows that have consistently bitten me.
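That parenthetical about in-mapper combiners deserves a concrete sketch. Here is the pattern in plain Python (Hadoop Streaming style, not the Java Mapper API); the cap `MAX_CACHED_KEYS` and the `emit` callback are illustrative names I'm inventing for the example, but the point is the flush that bounds the cache:

```python
# Sketch of an in-mapper combiner with a bounded cache. Without the
# size check in map(), the cache grows with the number of distinct
# keys -- exactly the non-constant memory this post warns about.

MAX_CACHED_KEYS = 10_000  # illustrative cap on distinct keys held in memory

class CombiningMapper:
    def __init__(self, emit):
        self.emit = emit   # callback that writes a (key, count) pair downstream
        self.cache = {}    # in-mapper combiner state

    def map(self, line):
        for word in line.split():
            self.cache[word] = self.cache.get(word, 0) + 1
        if len(self.cache) >= MAX_CACHED_KEYS:
            self.flush()   # keeps mapper memory asymptotically constant

    def flush(self):
        for key, count in self.cache.items():
            self.emit(key, count)
        self.cache.clear()

    def close(self):
        self.flush()       # the final flush people forget
```

Partial counts for the same word may now be emitted more than once, which is fine: the reducer sums them anyway, just as it would have summed raw 1s.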
This issue doesn’t come up in basic Hadoop documentation, because word count already runs in constant memory, as do similar tasks that use the reducer to distill a list down into a summarizing statistic. If you move outside that domain (for instance, if, like me, you use Hadoop to manipulate large sets of objects), it becomes an issue. Even getting the memory requirement down to O(N) may not be enough, since the whole point of distributed computing is to make N really, really big.
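The distinction is easy to see side by side. A sketch, again in plain Python rather than the Hadoop API, with `values` standing in for the iterator the framework hands a reducer for one key: a statistic you can fold into a few running numbers costs O(1) memory, while anything that must see all the values at once costs O(N).

```python
# Constant-memory reducer: a mean folds the stream into two numbers,
# no matter how many values arrive for the key.
def mean_reducer(key, values):
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return key, total / count

# O(N)-memory reducer: a median has to materialize every value for the
# key before it can answer -- this is the kind that blows up at scale.
def median_reducer(key, values):
    buffered = sorted(values)  # whole value list lives in RAM at once
    return key, buffered[len(buffered) // 2]
```

The dangerous part is that both look equally innocent on test data; only the skewed key with a hundred million values in production tells them apart.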