My approach to this is to take the complicated bits of the MapReduce and put them in a separate class. Then I do a combination of two things, as appropriate:<p>1. Hook it up to a debug server that fetches from the same datastore as the MapReduce, then test it on some keys that I'm interested in.<p>2. Test it like any other class.<p>The only awkward part of this is abstracting out the output calls, which I usually do by passing in a "handle some data" callback that writes output in the MapReduce job and dumps some pretty HTML in the debug server.<p>The great part about this is that if the MapReduce ends up being something important, you already have the tools to introspect its internals on whatever data you're interested in.
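<p>As a rough sketch of that split (the class, method names, and word-count logic here are made up purely for illustration, not from any particular job): the complicated logic lives in its own class and takes an output callback, so the same code can feed the MapReduce job, the debug server, or a plain unit test.
<pre><code>import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// The "complicated bits", pulled out of the MapReduce job itself.
// All output goes through a callback, so this class doesn't care whether it's
// writing to the job's output, a debug server's HTML page, or a test's list.
class WordCountLogic {
    private final BiConsumer&lt;String, Integer&gt; emit;

    WordCountLogic(BiConsumer&lt;String, Integer&gt; emit) {
        this.emit = emit;
    }

    void handle(String line) {
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                emit.accept(token, 1);
            }
        }
    }
}

// "Test it like any other class": no MapReduce machinery involved.
class WordCountLogicDemo {
    public static void main(String[] args) {
        List&lt;String&gt; collected = new ArrayList&lt;&gt;();
        WordCountLogic logic =
            new WordCountLogic((word, count) -&gt; collected.add(word + "=" + count));
        logic.handle("the quick brown fox the");
        System.out.println(collected); // [the=1, quick=1, brown=1, fox=1, the=1]
    }
}
</code></pre>
<p>In the actual job the callback just forwards to the framework's output call (e.g. Hadoop's context.write), and in the debug server it renders HTML instead.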
<i>The Cascalog abstraction layer fixes this issue by separating logic from data, allowing you to play creatively at massive scale.</i><p>I just checked out Casacalog and I like what I see, although I have yet to try it out myself. Does anyone know of something similar that would work with Scala as well?
Nice.<p>Check out another similar clojure library called "MR-Kluj" that you can use to write Hadoop MapReduce jobs in Clojure: <a href="https://github.com/cheddar/mr-kluj" rel="nofollow">https://github.com/cheddar/mr-kluj</a>