Serializing Complex MapReduce Keys

Hadoop supports using arbitrary Java objects as MapReduce keys, and a few hours of futzing around are all you need to come up to speed on this useful feature. So resist the temptation to shoehorn your more complex keys into Text objects; that is an act of False Laziness. Still, there are subtleties. Here’s a bug that tripped me up.

Say you have a complex Writable key composed of a collection of simple atomic Writable keys. Define the latter to be objects for which Hadoop has built-in serialization support (perhaps pairs of Text objects, as in the serialization example in chapter 4 of Hadoop: The Definitive Guide) and the former to be structures built from these atoms that must also define serialization logic of their own. A basic example of a complex Writable is a list of atoms that writes the length of the list as an integer field at the head of its serialization.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class ComplexWritable implements Writable {
    List<AtomicWritable> atoms = new ArrayList<AtomicWritable>();

    public void write(DataOutput out) throws IOException {
        // Write the number of atoms, then each atom in order.
        IntWritable size = new IntWritable(atoms.size());
        size.write(out);
        for (AtomicWritable atom : atoms)
            atom.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        // Read the count back, then deserialize that many atoms.
        IntWritable size = new IntWritable();
        size.readFields(in);
        int n = size.get();
        while (n-- > 0) {
            AtomicWritable atom = new AtomicWritable();
            atom.readFields(in);
            atoms.add(atom);
        }
    }
}
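
I won’t reproduce my actual atom class, but for concreteness here is a minimal sketch of what AtomicWritable might look like, assuming it is a pair of Text fields along the lines of the book’s TextPair example. The fields and the set helper are my own illustration, not code from a real library:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical atom: a pair of Text fields, along the lines of the
// TextPair example in Hadoop: The Definitive Guide.
public class AtomicWritable implements Writable {
    private Text first = new Text();
    private Text second = new Text();

    public void set(String first, String second) {
        this.first.set(first);
        this.second.set(second);
    }

    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        // Text.readFields overwrites any previous contents, so this
        // class carries no stale state between deserializations.
        first.readFields(in);
        second.readFields(in);
    }
}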

Say you have data whose complex key consists of lists of Text pairs, like so:

Key                                 Value
(Shape, Triangle), (Color, Blue)    1
(Shape, Triangle), (Color, Blue)    2
(Shape, Circle), (Color, Red)       3

What you want is for the reducer to be called with the following arguments:

Key                                 Values
(Shape, Triangle), (Color, Blue)    1, 2
(Shape, Circle), (Color, Red)       3

Instead the code above produces:

Key                                                                                                  Values
(Shape, Triangle), (Color, Blue)                                                                     1
(Shape, Triangle), (Color, Blue), (Shape, Triangle), (Color, Blue)                                   2
(Shape, Triangle), (Color, Blue), (Shape, Triangle), (Color, Blue), (Shape, Circle), (Color, Red)    3

The keys keep getting stacked one after another. The problem is that Hadoop allocates a single key object for the reducer and reuses it, calling readFields on the same instance for every key it deserializes, so any state retained in your custom key piles up. The fix is easy enough: just clear out the list at the start of each deserialization.

public class ComplexWritable implements Writable {
    public void readFields(DataInput in) throws IOException {
        // Discard state left over from the previous deserialization;
        // Hadoop reuses this object rather than allocating a new one.
        atoms.clear();
        IntWritable size = new IntWritable();
        size.readFields(in);
        int n = size.get();
        while (n-- > 0) {
            AtomicWritable atom = new AtomicWritable();
            atom.readFields(in);
            atoms.add(atom);
        }
    }
}
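
If you want to watch the reuse happen outside a full job, you can round-trip two keys through a single instance, much as the reduce machinery does. This is just a standalone sketch built on Hadoop’s own buffer classes; it leans on the hypothetical AtomicWritable above, and on sitting in the same package as ComplexWritable, since atoms is package-private:

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;

public class ReuseDemo {
    static AtomicWritable atom(String first, String second) {
        AtomicWritable a = new AtomicWritable();
        a.set(first, second);
        return a;
    }

    public static void main(String[] args) throws Exception {
        ComplexWritable key1 = new ComplexWritable();
        key1.atoms.add(atom("Shape", "Triangle"));
        key1.atoms.add(atom("Color", "Blue"));

        ComplexWritable key2 = new ComplexWritable();
        key2.atoms.add(atom("Shape", "Circle"));
        key2.atoms.add(atom("Color", "Red"));

        // Serialize both keys back to back, as they would sit in a spill file.
        DataOutputBuffer out = new DataOutputBuffer();
        key1.write(out);
        key2.write(out);

        // Deserialize into one reused instance, as Hadoop does between
        // reduce calls.
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        ComplexWritable reused = new ComplexWritable();
        reused.readFields(in);  // reused now holds key1's two atoms
        reused.readFields(in);  // with clear(): key2's atoms; without: all four
    }
}

With the clear() in place, the second readFields leaves exactly key2’s atoms; with the original code it leaves all four stacked up.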

This is obvious in hindsight, but I found it tricky to debug. In particular, the only example code I could find was the Text pair example from the book. There, both Text objects are overwritten by readFields, so the clearing of object state between reducer calls happens implicitly. This example shows that for more complex keys, that clearing may have to be carried out explicitly.

Also, you will probably have to write custom raw comparators for even relatively basic complex keys. Hadoop: The Definitive Guide describes this as a performance enhancement, but the need is more pressing than that. The default comparator deserializes both keys for every single comparison during the sort, allocating fresh atoms each time, and when I tried to run a key like the one described above with it, I ran out of memory on even small data sets. Even if the cost of creating those objects per comparison isn’t that great, the sheer volume of them quickly overwhelms the garbage collector. A bit of byte-fiddling is required, but it’s nothing too daunting, and the example in the book is a sufficient guide here.
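
For illustration, here is a minimal sketch of a raw comparator that orders keys by their serialized bytes without allocating anything. The name ComplexRawComparator is mine, and ordering by raw bytes is an assumption of convenience: it is a consistent total order (identical keys serialize identically, so grouping still works), but a real implementation would walk the atom boundaries so the byte order agrees with whatever compareTo your key defines:

import java.io.IOException;

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;

public class ComplexRawComparator implements RawComparator<ComplexWritable> {

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare the serialized forms directly; no ComplexWritable is
        // ever instantiated during the sort.
        return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
    }

    public int compare(ComplexWritable a, ComplexWritable b) {
        // Object fallback: serialize both sides and reuse the byte
        // comparison so the two orderings cannot disagree.
        try {
            DataOutputBuffer bufA = new DataOutputBuffer();
            a.write(bufA);
            DataOutputBuffer bufB = new DataOutputBuffer();
            b.write(bufB);
            return WritableComparator.compareBytes(
                    bufA.getData(), 0, bufA.getLength(),
                    bufB.getData(), 0, bufB.getLength());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

You would register it with job.setSortComparatorClass(ComplexRawComparator.class) in the new API, or setOutputKeyComparatorClass on JobConf in the old one.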
