I second the Luigi recommendation. Republic Wireless uses it for all of our data warehouse ETL, and it's been fantastic to work with.

I also second the other comment that recommends starting with basic data extraction rather than diving into Hadoop or Spark immediately. Sure, at some point you might need to process 100 billion lines of data. But in your average business, you're far more likely to be working with thousands or millions of records on customers, sales, orders, invoices, sales leads, etc. That stuff doesn't need Hadoop/Spark; it needs a Postgres database and a DBA with a good head on their shoulders keeping everything organized.

In my experience, government data sets (particularly demographics and other geographically related data sets) are a fantastic way to get your feet wet with data processing. They're published by a bunch of different agencies, so they're not necessarily conveniently available in one place. However, they usually use standardized identifiers for geographies, which makes it easy to join the data sets together in new and interesting ways.

For instance, here at Republic, we recently used Form 477 data on wireless broadband availability from the FCC, data from Summary File 1 of the US Census, and a couple of Census geographic crosswalk files to calculate the percentage of the population in given zip codes and cities covered by various wireless carriers. That required reading the docs for several different data sources, automating some downloads, building database tables to hold all of the information, and then carefully crafting some SQL to pull it all together.

Of course, government data sets generally won't require a whole lot of automation (they're updated yearly or less often, not daily). To build your skills on that front, I'd recommend learning to extract data from various APIs, structure it in a meaningful way, and make it available in a database. For example, if you have a website, set up a free Google Analytics account for it, then build a daily ETL that extracts some meaningful information from the Google Analytics API and stuffs it into a Postgres DB. Then see if you can build some charts or something that sit on top of that database and report on the information.
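
To put a little code behind all of that: for anyone who hasn't used Luigi, a pipeline is just ordinary Python classes with requires()/output()/run() methods, and Luigi works out the dependency graph and skips any step whose output already exists. Here's a rough download-and-load sketch, not our actual pipeline; the URL, CSV layout, and connection details are all placeholders:

    import datetime

    import luigi
    import requests
    from luigi.contrib.postgres import CopyToTable


    class DownloadCensusFile(luigi.Task):
        """Pull one raw file down to local disk (the URL is a placeholder)."""
        date = luigi.DateParameter(default=datetime.date.today())

        def output(self):
            return luigi.LocalTarget("data/census_%s.csv" % self.date)

        def run(self):
            resp = requests.get("https://example.com/census.csv")  # hypothetical source
            resp.raise_for_status()
            with self.output().open("w") as f:
                f.write(resp.text)


    class LoadCensusFile(CopyToTable):
        """Bulk-copy the downloaded CSV into Postgres (needs psycopg2)."""
        date = luigi.DateParameter(default=datetime.date.today())

        host = "localhost"
        database = "warehouse"
        user = "etl"
        password = "changeme"  # read this from luigi.cfg in real life
        table = "census_raw"
        columns = [("geoid", "TEXT"), ("population", "INT")]

        def requires(self):
            return DownloadCensusFile(self.date)

        def rows(self):
            with self.input().open("r") as f:
                for line in f:
                    yield line.strip().split(",")

Run it with `luigi --module census_etl LoadCensusFile --local-scheduler` and only the missing pieces get rebuilt, which is exactly what you want when a download flakes out halfway through.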
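
And to make the Form 477 / Summary File 1 example a bit more concrete: once each source is sitting in its own table, the interesting part is one join through the crosswalk and a GROUP BY. The table and column names below are invented for illustration (the real files key on Census block GEOIDs, and the real layouts are messier), but the shape of the query is the point:

    import psycopg2

    # Hypothetical schema:
    #   fcc_477(block_geoid, carrier, has_lte)   -- one row per block per carrier
    #   census_sf1(block_geoid, population)
    #   block_to_zip(block_geoid, zip_code)      -- the crosswalk
    COVERAGE_BY_ZIP = """
        SELECT x.zip_code,
               f.carrier,
               round(100.0 * SUM(CASE WHEN f.has_lte THEN c.population ELSE 0 END)
                           / NULLIF(SUM(c.population), 0), 1) AS pct_covered
        FROM census_sf1 AS c
        JOIN block_to_zip AS x ON x.block_geoid = c.block_geoid
        JOIN fcc_477      AS f ON f.block_geoid = c.block_geoid
        GROUP BY x.zip_code, f.carrier
        ORDER BY x.zip_code, f.carrier;
    """

    def coverage_by_zip(conn):
        with conn.cursor() as cur:
            cur.execute(COVERAGE_BY_ZIP)
            return cur.fetchall()

    if __name__ == "__main__":
        conn = psycopg2.connect("dbname=warehouse user=etl")
        for zip_code, carrier, pct in coverage_by_zip(conn):
            print(zip_code, carrier, pct)

The standardized geography identifiers are what make this possible: three files from three different agencies, and they all line up on the same block-level key.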
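
Finally, the Google Analytics exercise from the last paragraph, end to end. This is a sketch using the v4 Reporting API via Google's official Python client; the view ID, key file path, and table layout are placeholders you'd swap for your own:

    """Daily pull of yesterday's Google Analytics numbers into Postgres.

    Assumes a service-account JSON key with read access to the GA view, and a
    table like: CREATE TABLE ga_daily (day date, page text, sessions int, users int);
    """
    import datetime

    import psycopg2
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    SCOPES = ["https://www.googleapis.com/auth/analytics.readonly"]

    def extract(view_id, key_file):
        creds = service_account.Credentials.from_service_account_file(
            key_file, scopes=SCOPES)
        analytics = build("analyticsreporting", "v4", credentials=creds)
        response = analytics.reports().batchGet(body={
            "reportRequests": [{
                "viewId": view_id,
                "dateRanges": [{"startDate": "yesterday", "endDate": "yesterday"}],
                "dimensions": [{"name": "ga:pagePath"}],
                "metrics": [{"expression": "ga:sessions"},
                            {"expression": "ga:users"}],
            }]
        }).execute()
        for row in response["reports"][0]["data"].get("rows", []):
            page = row["dimensions"][0]
            sessions, users = row["metrics"][0]["values"]
            yield page, int(sessions), int(users)

    def load(rows, dsn="dbname=warehouse user=etl"):
        day = datetime.date.today() - datetime.timedelta(days=1)
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for page, sessions, users in rows:
                cur.execute(
                    "INSERT INTO ga_daily (day, page, sessions, users) "
                    "VALUES (%s, %s, %s, %s)",
                    (day, page, sessions, users),
                )

    if __name__ == "__main__":
        load(extract(view_id="12345678", key_file="ga-key.json"))

Wrap that in a Luigi task like the one above, schedule it with cron, and you have a tiny but real warehouse to hang some charts off of.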