Hadoop Object Reuse Pitfall: All my Reducer Values are the Same

Here’s a Hadoop subtlety that has tripped me up a couple times. Say your reducer collects all its enumerated values in a buffer to process them as a group.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.Reducer;

// Key and Value stand in for your concrete key and value types.
public class MyReducer extends Reducer<Key, Value, Key, Value> {
    @Override
    public void reduce(Key key, Iterable<Value> values, Context context) {
        List<Value> buffer = new ArrayList<Value>();

        for (Value value : values)
            buffer.add(value);
        // Do something with the buffer
    }
}

You know that the following data is coming into the reducer:

Key   Value
0
1     Red
1     Blue
1     Green
2

After the loop completes, you expect the buffer to contain <Red, Blue, Green>. Instead it contains <Green, Green, Green>. What gives?

The problem is that the iterator Hadoop hands your reducer reuses a single object, changing its contents each time it advances to the next value. If you step through in the debugger you’ll see that all three Green objects in your list are stored at the same location in memory, which is also the object that context.getCurrentValue() returns. The fix is simple: make a copy of the values as they come in.

public class MyReducer extends Reducer<Key, Value, Key, Value> {
    @Override
    public void reduce(Key key, Iterable<Value> values, Context context) {
        List<Value> buffer = new ArrayList<Value>();

        for (Value value : values)
            buffer.add(new Value(value));  // Assumes Value has a copy constructor.
        // Do something with the buffer
    }
}
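
If your value type is a Writable without a copy constructor, Hadoop’s WritableUtils.clone does the same job by round-tripping the object through serialization. A minimal sketch of the loop body, assuming org.apache.hadoop.io.WritableUtils is on the classpath:

for (Value value : values)
    // Deep-copies the Writable by serializing and deserializing it.
    buffer.add(WritableUtils.clone(value, context.getConfiguration()));

Cloning through serialization is slower than a copy constructor, but it works for any Writable.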

Like a similar Hadoop pitfall, this is obvious in hindsight, but perplexing when it happens. There’s a perfect storm of elements that give rise to confusion in this particular instance: the fact that every object variable in Java is a reference and the fact that Writable objects are mutable wrappers that can change their contents without being reallocated. Also, accumulating reducer values in memory is not paradigmatic MapReduce behavior, because you can run out of memory if there are a lot of them. Still, sometimes it’s what you have to do, so if you see repeated reducer values, you know where to look.
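
You can reproduce the mechanics outside a live job. The sketch below uses Hadoop’s Text as the value type; the shared instance plays the role of the iterator’s single reused object (the class name and string values here are my own stand-ins for illustration, not from the job above):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;

public class ReuseDemo {
    public static void main(String[] args) {
        Text shared = new Text();               // one object, mutated in place
        List<Text> buffer = new ArrayList<Text>();

        for (String s : new String[] {"Red", "Blue", "Green"}) {
            shared.set(s);                      // contents change, identity doesn't
            buffer.add(shared);                 // buggy: same reference three times
        }
        System.out.println(buffer);             // [Green, Green, Green]

        buffer.clear();
        for (String s : new String[] {"Red", "Blue", "Green"}) {
            shared.set(s);
            buffer.add(new Text(shared));       // fix: copy before buffering
        }
        System.out.println(buffer);             // [Red, Blue, Green]
    }
}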


7 Responses to Hadoop Object Reuse Pitfall: All my Reducer Values are the Same

  1. Szymon says:

    Awesome tips. And yet despite this, and having read about your ‘similar Hadoop pitfall’, my reducer STILL makes no sense when iterating over its values. It’s getting (Text, Text) KV pairs, but the values are always the same as the key, as though Text were a singleton. I’ve even resorted to copying byte arrays to make sure it’s not just referencing Text’s data. Oh how I pine for the long forgotten days of explicit memory management, when we knew exactly what the hell was happening with our data. In any case, thanks for some insights that have previously escaped me.

  2. Nuno says:

    Really awesome tip. I had no idea why all my values were the same.

  3. stumpart says:

Awesome!! I’m a newbie and this issue had me puzzled. Your tip fixed my problem.

  4. Paul Gonchar says:

    I was beginning to pull my hair out when I found this post. Thanks a TON!!!

  5. Nishant says:

    Thanks for the link to the ‘similar pitfall’! That’s been very useful to me! 🙂

  6. Pingback: mapreduce, sort values - TecHub

  7. Pranay Goyal says:

Does Hadoop reuse the objects even in the combiner class?
