I've used and created numerous ETL stacks. So learn from my mistakes.

First, move the code, not the data. Batch processing is for mainframes. I know, I know, this has been impossible to realize anywhere I've ever worked.

Second, less is more. If command line tools work, use 'em.

Avoid IDLs, maps, schemas, visual programming, workflow engines, event sourcing, blah blah blah. They're all useless abstractions, and you can measure their cost in layers of indirection and stack-trace depth. They're all wicked hard to debug. They all end up as abandoned, unmaintained obfuscation layers.

Data processing (ETL) is just cutting and pasting strings. Input, processing, output. Sometimes with sanity checks. Sometimes with transformations, like munging date fields or mapping terms ("yes" to "true"). Very rarely with accumulators (aggregators), where you need some local persistent state. (First sketch below.)

Third, and this is pretty rare, use better APIs for data extraction. It's all just scraping. Don't overthink it. I wish I could show the world the APIs I created for HL7 (healthcare) data. For 2.x, I created "fluent" (method-chaining) data wrappers (like a DOM) which could not blow up (they used Null Objects to prevent null pointer exceptions). For 3.x, I used XPath-style path queries to drill down into those stupid XML files. This was CODE, not mappings, so working with it was practically a REPL: fast to code, fast to debug. (Second and third sketches below.)

Fourth, you control the execution. Be more like Postfix/qmail, where each task has its own executable. Be less like J2EE or BizTalk, where you ask the runtime to control the lifecycle of your code. (Last sketch below.)

Good luck.
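
Some Python sketches to make the above concrete. First, the "cutting and pasting strings" point: one filter, stdin to stdout, with a sanity check, a date munge, a term map, and a row counter as its one bit of local state. The three-column layout and the term map are made up for illustration.

    #!/usr/bin/env python3
    """Toy ETL filter: read comma-separated rows on stdin, write cleaned rows on stdout."""
    import sys
    from datetime import datetime

    TERM_MAP = {"yes": "true", "no": "false"}   # hypothetical term mapping

    def munge_date(raw):
        """Normalize MM/DD/YYYY to ISO 8601; pass through anything unparseable."""
        try:
            return datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
        except ValueError:
            return raw

    seen = 0                                    # the rare accumulator: local persistent state
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) != 3:                    # sanity check: skip malformed rows
            print("skipping bad row: %r" % line, file=sys.stderr)
            continue
        name, consent, visit_date = fields
        consent = TERM_MAP.get(consent.strip().lower(), consent)
        visit_date = munge_date(visit_date.strip())
        seen += 1
        print(",".join([name, consent, visit_date]))

    print("processed %d rows" % seen, file=sys.stderr)

That's it. No engine, no schema registry, and the stack trace is one frame deep.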
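
Second, the HL7 2.x idea as a minimal sketch, assuming the usual v2 delimiters (segments on carriage returns, fields on "|", components on "^"). The class names here are invented for illustration, not my original API. The trick is the Null Object: a missing segment or field returns a NullNode that absorbs the rest of the chain, so drilling into garbage data never throws.

    """Null-safe fluent wrapper over an HL7 v2.x message."""

    class NullNode:
        """Null Object: every drill-down returns another NullNode, never None."""
        def segment(self, name): return self
        def field(self, i): return self
        def component(self, i): return self
        def value(self, default=""): return default

    NULL = NullNode()

    class Component:
        def __init__(self, text): self._text = text
        def value(self, default=""): return self._text or default

    class Field:
        def __init__(self, text): self._parts = text.split("^")
        def component(self, i):                  # 1-based, per HL7 convention
            return Component(self._parts[i - 1]) if 0 < i <= len(self._parts) else NULL
        def value(self, default=""): return "^".join(self._parts) or default

    class Segment:
        def __init__(self, text): self._fields = text.split("|")   # _fields[0] is the segment name
        def field(self, i):
            return Field(self._fields[i]) if 0 < i < len(self._fields) else NULL

    class Message:
        def __init__(self, raw):
            self._segments = [Segment(s) for s in raw.strip().split("\r") if s]
        def segment(self, name):
            for s in self._segments:
                if s._fields[0] == name:
                    return s
            return NULL

    msg = Message("MSH|^~\\&|LAB\rPID|1||12345||DOE^JANE\r")
    print(msg.segment("PID").field(5).component(2).value())     # -> JANE
    print(msg.segment("ZZZ").field(9).component(9).value("?"))  # -> ? (no blowup)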
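
Third, the 3.x side: path queries down into the XML, which plain-Python ElementTree approximates with its limited XPath support. The document below is a toy stand-in for a v3 payload (real ones are much deeper), but the drill-down pattern is the same, and it's still code you can poke at interactively.

    """Drilling into an HL7 v3-style XML document with path queries."""
    import xml.etree.ElementTree as ET

    DOC = """
    <ClinicalDocument xmlns="urn:hl7-org:v3">
      <recordTarget>
        <patientRole>
          <id extension="12345"/>
          <patient>
            <name><given>JANE</given><family>DOE</family></name>
          </patient>
        </patientRole>
      </recordTarget>
    </ClinicalDocument>
    """

    NS = {"v3": "urn:hl7-org:v3"}
    root = ET.fromstring(DOC)

    # One path expression per fact you want. Code, not a mapping file.
    given = root.findtext(".//v3:patientRole/v3:patient/v3:name/v3:given",
                          default="", namespaces=NS)
    mrn_el = root.find(".//v3:patientRole/v3:id", NS)
    mrn = mrn_el.get("extension", "") if mrn_el is not None else ""
    print(given, mrn)   # JANE 12345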
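
Last, the Postfix/qmail point in miniature: each task is its own executable that reads stdin, writes stdout, and reports failure through its exit code. The script and file names here are hypothetical.

    #!/usr/bin/env python3
    """map_terms.py: one task, one executable."""
    import sys

    MAP = {"yes": "true", "no": "false"}        # hypothetical vocabulary map

    def main():
        for line in sys.stdin:
            word = line.strip().lower()
            sys.stdout.write(MAP.get(word, word) + "\n")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Wire tasks together with a pipe or cron, e.g. ./extract.py < dump.txt | ./map_terms.py | ./load.py. You and the shell own the lifecycle; no container asks your code to implement its interfaces.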