This past week I started working on a Python 3 port of this, mostly to learn. No EMR support, unfortunately, but Hadoop should be possible. I just got back from a trip, so it's not very far along yet: it only runs the "local" version, but it should get a bit farther next week.<p>I can confirm that it is a <i>great</i> way to learn about MapReduce.<p>Link: <a href="http://github.com/irskep/mrjob/tree/py3k" rel="nofollow">http://github.com/irskep/mrjob/tree/py3k</a><p>I will likely restart the py3k port from scratch now that I know what I am doing a bit better. I've been writing Python 3 for about, oh, two weeks.
Amazon EMR is an amazing value proposition for virtually any research need, and it's very cool to see wrapper frameworks targeting it directly. Still, for anyone managing their own compute clusters and wanting to do MR in Python, I'd suggest checking out Disco.<p>Disco (<a href="http://discoproject.org" rel="nofollow">http://discoproject.org</a>) is a really elegant MR framework implemented in Erlang and Python, with additional support for jobs in C and Java. I've used it for a little over a year and am convinced it is the superior MR platform (Hadoop's TeraSort victories notwithstanding). New features are being integrated very quickly, the core platform is rock solid, management is simple, and it's extremely flexible.
This was a game changer for us: instead of everyone contending for the Hadoop cluster, each developer has their own personal arsenal of Hadoop clusters. Huge win.
On this note, does anyone know a good tutorial on MapReduce for experienced programmers? Basically, I want to learn how to frame advanced problems in terms of MR. I am particularly interested in expressing my discrete-event simulation in terms of MR.
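Not a tutorial, but for anyone who wants the basic shape before picking a framework, here is a minimal sketch in plain Python. It assumes nothing about mrjob, Disco, or Hadoop: `mapper`, `reducer`, and `run_local` are hypothetical names made up for illustration, and the whole thing runs in one process. The point is only that a problem fits MR when it can be split into a step that emits key/value pairs and a step that combines all the values for one key.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: collapse all values seen for one key into a total."""
    return word, sum(counts)


def run_local(lines):
    """Hypothetical in-process driver: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    # Shuffle: collect every value emitted by the mappers under its key.
    for key, value in chain.from_iterable(mapper(line) for line in lines):
        groups[key].append(value)
    # Reduce: one call per distinct key, over all of that key's values.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_local(["the quick brown fox", "the lazy dog", "The end"]))
```

A real framework distributes the map and reduce calls across machines and does the grouping for you, so framing a harder problem mostly comes down to choosing the keys, and often chaining several such passes.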