Ask HN: Simple, beginner friendly ETL / Data Engineering project ideas?

233 points by zabana almost 7 years ago
Hi HN,

I'm a seasoned Python software developer. Recently I have found a new obsession with data processing, management, engineering, etc. I'd like to (eventually) branch off into that field, but I find the lack of beginner-friendly resources is slowing me down. All I can find is Spark- and Hadoop-related articles (I know these are prominent in the field, but I want to learn to walk before I run). So if any of you have pointers, websites, or project ideas I can start with to get a good grasp of all the fundamental concepts, I'd really appreciate it.

Thanks a lot in advance

37 comments

3pt14159 almost 7 years ago
Spark, etc., are great, but honestly if you're just getting started I would forget all about existing tooling that is geared towards people working at 300-person companies, and I would read The Data Warehouse ETL Toolkit by Kimball:

https://www.amazon.com/gp/product/1118530802/

I learned from the second edition, but I've heard even better things about the third. As you're working through it, create a project with real data and re-implement a data warehouse from scratch as you go. It doesn't really matter what you tackle, but I personally like ETLing either data gathered from web crawling a single site[0] or a weekly Wikipedia dump. You'll learn many of the foundational reasons for all the tools the industry uses, which will make it very easy for you to get up to speed on them and to make the right choices about when to introduce them. I personally tend to favour tools that have an API or CLI so I can coordinate tasks without needing to click around, but many others like a giant GUI so they can see data flows graphically. Most good tools have at least some measure of both.

[0] Use something like Scrapy for Python (or Mechanize for Ruby) with CSS selectors, and use the Inspector Gadget extension to quickly generate CSS selectors.
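To make the footnote concrete, a minimal Scrapy spider using CSS selectors might look like the sketch below. The target site (the quotes.toscrape.com practice sandbox) and the selectors are stand-in examples, not anything from the comment:

    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # CSS selectors pick out each record on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination so the crawl covers the whole site.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Running "scrapy runspider spider.py -o quotes.json" dumps the scraped records to a file ready for a load step.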
drej almost 7 years ago
1) Learn to do as much in plain Python as possible, and focus on lazy evaluation (itertools, yielding, ...); you'll be able to process gigabytes with a tiny memory footprint, deployment will be a breeze, etc.

2) Get to know some of the basic Python data processing/science packages like pandas, numpy, scipy, etc.

3) Get used to writing short shell scripts. They probably won't be part of your pipeline, but data engineering, especially development, involves a lot of data prep that coreutils will help you with. Throw in a bit of jq and you'll handle a lot of prep work.

4) Only once you've gotten used to the above, look at Dask, PySpark, Airflow, etc. It really depends on your use cases, but chances are you won't have to touch these technologies at all.

Bottom line: hold off on the heavyweight tools; they might be needlessly powerful. Also, work closely with DevOps, because the deployment side of things will help you understand the consequences of your actions.
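As an illustration of point 1, here is a minimal lazy pipeline built from generators; the gzipped CSV input and the "amount" column are made-up examples. Only one row is in memory at a time, however large the file:

    import csv
    import gzip

    def read_rows(path):
        # gzip.open and csv.DictReader both stream; nothing is read eagerly.
        with gzip.open(path, "rt", newline="") as f:
            yield from csv.DictReader(f)

    def clean(rows):
        # Drop rows with a missing amount and cast the rest to float.
        for row in rows:
            if row.get("amount"):
                row["amount"] = float(row["amount"])
                yield row

    def total(path):
        # The whole chain evaluates lazily, row by row.
        return sum(row["amount"] for row in clean(read_rows(path)))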
dbatten almost 7 years ago
I second the Luigi recommendation. Republic Wireless uses it for all of our data warehouse ETL, and it's been fantastic to work with.

I also second the other comment that recommends starting with basic data extraction rather than diving into Hadoop or Spark immediately. Sure, at some point, you might need to process 100 billion lines of data. But in your average business, you're far more likely to be working with thousands or millions of records on customers, sales, orders, invoices, sales leads, etc. That stuff doesn't need Hadoop/Spark; it needs a Postgres database and a DBA with a good head on their shoulders keeping everything organized.

In my experience, government data sets (particularly demographics and other geographically related data sets) are a fantastic way to get your feet wet with data processing. They're published by a bunch of different agencies, so they're not necessarily conveniently available in one place. However, they usually use standardized identifiers for geographies, which makes it easy to join the data sets together in new and interesting ways.

For instance, here at Republic, we recently used Form 477 data on wireless broadband availability from the FCC, data from Summary File 1 of the US Census, and a couple of Census geographic crosswalk files to calculate the percentage of population in given zip codes and cities covered by various wireless carriers. That required reading the docs for several different data sources, automating some downloads, building database tables to hold all of the information, and then carefully crafting some SQL to pull it all together.

Of course, government data sets generally won't require a whole lot of automation (they're updated yearly or less often, not daily). To build your skills on that front, I'd recommend learning to extract data from various APIs, structure it in a meaningful way, and make it available in a database. For example, if you have a website, set up a free Google Analytics account for it, then build a daily ETL that extracts some meaningful information from the Google Analytics API and stuffs it in a Postgres DB. Then see if you can build some charts or something that sit on top of that database and report on the information.
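The daily API-to-Postgres ETL suggested in the last paragraph could be sketched roughly as below. The endpoint, table, and connection string are hypothetical placeholders (the real Google Analytics API needs auth and its own client library):

    import psycopg2
    import requests

    def run_daily_etl():
        # Extract: any JSON API works; swap in the real source here.
        rows = requests.get("https://api.example.com/daily-metrics").json()

        # Load: upsert into a small reporting table (assumes a unique
        # constraint on "day").
        conn = psycopg2.connect("dbname=analytics user=etl")
        with conn, conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    """
                    INSERT INTO daily_metrics (day, pageviews)
                    VALUES (%s, %s)
                    ON CONFLICT (day) DO UPDATE
                        SET pageviews = EXCLUDED.pageviews
                    """,
                    (row["day"], row["pageviews"]),
                )

    if __name__ == "__main__":
        run_daily_etl()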
ebullientocelot almost 7 years ago
Hi, data engineer here. Other comments have a lot of good suggestions. I especially agree with ideas like avoiding high-powered frameworks in the beginning and learning to write effective transform wrappers for different kinds of weird input data. One thing you'll find is that (for enterprise situations at least) your source data is going to come from extremely odd, old-fashioned, or very poorly documented sources. Be prepared to find twenty- and thirty-year-old manuals on formats you've never heard of at times.

That aside, the other side of the coin that is very important is to get very familiar with how folks talk about their data problems. Managers, analysts, etc. will often request a specific solution that they've heard of or that seems popular; sometimes it is what they need, often it isn't. To get a solid footing in this space (not that it isn't important for all types of SE), it is critical to have a very strong understanding of business requirements, to understand the work of many other roles in your org, and to be able to communicate with business folks in ways that allow you to develop a rational plan for a solution while getting them to realize what their needs really are.

Best of luck!
veritas3241 almost 7 years ago
Another commenter mentioned it as well, but "Designing Data-Intensive Applications" by Martin Kleppmann (https://dataintensive.net/) is a _fantastic_ overview of the field and, I think, more approachable and enjoyable to read than Kimball's book. But Kimball is a classic, especially for how to do warehouse design.

I'll also make a plug for the Meltano[0] project that my colleagues are working on. The idea is to have a simple tool for extraction, loading, transformation, and analysis from common business operations sources (Salesforce, Zendesk, Netsuite, etc.). It's all open source and we're tackling many of the problems you're interested in. Definitely poke around the codebase and feel free to ping me or make an issue / ask questions.

[0] https://gitlab.com/meltano/meltano/
RickJWagner almost 7 years ago
I worked in that field for a number of years. My recommendation is to start with some form of data that you are passionate about: baseball statistics, business metrics, investment figures, whatever.

Once you have the data, figure out what you're going to do with it. (Don't agonize over it; this should all take just a day or so.)

Then go after the toolkit. You'll find many interesting questions if you start with the end goal in mind.

Good luck, and have fun!
jakestein almost 7 years ago
http://Singer.io is an open source ETL project written in Python. The components are small, composable programs that you can run independently, so you should be able to walk before you run.

A good beginner project is to build an adapter to a new data source (known as a "tap" in Singer). Most taps pull data out of business tools like Salesforce or Marketo, but people also build them to pull characters from the Marvel API (https://www.stitchdata.com/blog/tapping-marvel-api/).

Check out the getting started guide (https://github.com/singer-io/getting-started), or jump into the Singer Slack and ask for help (linked from the guide).
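A toy tap, to show the shape of the thing: a Singer tap is just a program that writes SCHEMA and RECORD messages to stdout, here with hardcoded data via the singer-python package (a sketch in the spirit of the getting started guide, not taken from it):

    import singer

    schema = {
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
        }
    }

    # A tap emits a schema for the stream, then the records themselves;
    # a downstream "target" reads both from stdin and loads them.
    singer.write_schema("characters", schema, key_properties=["id"])
    singer.write_records("characters", [
        {"id": 1, "name": "Spider-Man"},
        {"id": 2, "name": "Iron Man"},
    ])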
danthelion almost 7 years ago
I would recommend looking into Python-based workflow managers: Luigi[0], then Airflow[1], to get the hang of scheduling, DAGs, etc. Both are fairly simple to get started with (especially for a seasoned Python developer) and are used in production environments as well.

[0] https://github.com/spotify/luigi

[1] https://github.com/apache/incubator-airflow
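For a feel of the Luigi half, a two-task pipeline can be this small; file names are arbitrary examples. Dependencies are declared with requires(), and a task counts as done when its output() target exists:

    import luigi

    class Download(luigi.Task):
        def output(self):
            return luigi.LocalTarget("raw.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("hello luigi hello dags\n")

    class CountWords(luigi.Task):
        def requires(self):
            return Download()  # Luigi runs Download first.

        def output(self):
            return luigi.LocalTarget("count.txt")

        def run(self):
            with self.input().open() as f:
                n = len(f.read().split())
            with self.output().open("w") as f:
                f.write(str(n))

    if __name__ == "__main__":
        luigi.run()  # e.g. python pipeline.py CountWords --local-scheduler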
diogofranco almost 7 years ago
I wrote a simple blog post recently with links to data engineering resources I found useful to get into the field; hopefully it is helpful: https://diogoalexandrefranco.github.io/data-engineering-resources/
djeebus almost 7 years ago
I've always found enjoyment in finding statistics in things that interested me, so here are a few thoughts on project ideas. The main thing is to find a data set that interests you, and use that interest for fun or for profit:

- Pull down your Facebook backup, run it through sentiment analysis, throw it in a DB, and explore yourself through Facebook's eyes!
- A while back I was looking to buy a car and used Scrapy to process cars.com, then ran the VINs through a lookup tool to find cars that were *actually* manuals (and not just manumatics; it seems few can tell the difference these days). Found reasonable national prices, average miles, etc.
- Interested in politics? Pull down the data sets for voting records, explore your local politicians' voting records, compare them to national averages or historical information, etc.
- Interested in movies? Find movie datasets (or scrape imdb.com, themoviedb.org, etc.) and find which genres pay the least per actor, have the smallest cast, etc.

Lots of datasets are available online, if not in machine-readable format, then in a format that can be easily scraped. Have fun!
collinf almost 7 years ago
If you want to make it as a career choice, I think you should start with learning Java and Scala. For better or worse, this field is tied to the JVM, and learning these languages will make picking up Spark and Hadoop (which, tbh, are a prerequisite on the resume for any Data Engineering position) a lot easier.

Also, if you are looking to stay in the Python world, PySpark is pretty intuitive for any Python developer and tons of companies are using it.
pacuna almost 7 years ago
Build a web scraper and save some raw data somewhere like S3. Then run a job on top of that data to compute aggregated measures and save them somewhere. I built a project[1] like this and learned a lot in the process. I used Airflow to run the scraping tasks, save the data in S3, use AWS Athena to run queries, and load data into Redshift. I did all of this just to learn more about Airflow and some AWS tools.

[1] https://skills.technology/data-engineer
mmaia almost 7 years ago
I find "ETL Best Practices with Airflow" a good start. Even if you don't go the Airflow route, you can benefit from the example implementations of Kimball and Data Vault ETLs.

https://gtoonstra.github.io/etl-with-airflow/index.html

https://github.com/gtoonstra/etl-with-airflow
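For orientation, the skeleton of an Airflow DAG (recent Airflow versions) looks like the sketch below: tasks wrapping plain Python callables, wired together with >>. The task bodies here are stubs, not anything from the linked guide:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull source data here")

    def transform():
        print("clean and load here")

    with DAG(
        dag_id="etl_example",
        start_date=datetime(2018, 8, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t1 >> t2  # extract runs before transform on every daily run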
lixtra almost 7 years ago
Software-wise, you could also have a look at Dask, which is more lightweight than Hadoop and Spark.

But since you are asking for a project, why not do something local? You could scrape some data over time (cinemas, crime) and structure it nicely. After a few months you can start the analysis. Bonus points if you make your data available.
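Dask keeps the pandas-like API but partitions the data and evaluates lazily, so a quick sketch (the file pattern and columns are made up) reads almost like pandas:

    import dask.dataframe as dd

    df = dd.read_csv("events-*.csv")            # lazy: matches many files
    by_day = df.groupby("day")["amount"].sum()  # still lazy
    print(by_day.compute())                     # triggers the actual work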
kfk almost 7 years ago
I have put together a (very short) post on building a dashboard using Shopify data, with the ETL pulled by Stitch. You could try to implement the Stitch part in Python and you would have a complete solution. Looking at Stitch/Blendo would give you some ideas of simple ETL workflows. Keep in mind that ETL changes depending on what you want to do. In theory you can do all ETL just with Python code. If you have an SQL-compliant database that can hold your data, ETL processes could simply be a matter of running SQL queries to transform data within the database. Then you basically load your data from any data source into your DB and run a bunch of SQL to do the transform part.

https://assemblinganalytics.com/post/airbnbapache-superset-and-shopify/
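The "transform inside the database" idea can be tried end to end with nothing but the standard library; a minimal sketch with sqlite3 and made-up order data:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT);
        INSERT INTO raw_orders VALUES
            (1, '10.50', 'us'), (2, '', 'US'), (3, '7', 'se');

        -- The T happens in SQL: casting, filtering, and normalizing.
        CREATE TABLE orders AS
        SELECT id, CAST(amount AS REAL) AS amount, UPPER(country) AS country
        FROM raw_orders
        WHERE amount != '';
    """)
    print(conn.execute("SELECT * FROM orders").fetchall())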
indogooner almost 7 years ago
Like others, I would say start with a problem that interests you.

To mimic the real world, try to have extraction from a variety of data sources: RDBMS, NoSQL, files in different formats (CSV/JSON), APIs (Salesforce). Once you have different sources, extract the data to your data lake built on S3/GCS/HDFS. Once the data is present, you need to integrate tools which can extract value from it. You can use vendor-specific tools like Athena/BigQuery or open source ones like Presto/Impala/Hive. You can do analytics where you require filtering, cleansing, and joining various datasets. You can also look at storing the results in different formats so that other tools like Tableau can use them. To orchestrate all of this you can use Azkaban or Airflow.

My suggestion is slightly biased towards the Hadoop ecosystem, but the good thing is that most tools here have open source alternatives.
cl0vn almost 7 years ago
I think you can easily learn to walk with Spark as well. There are a lot of beginner Spark tutorials online. See, for instance, https://community.cloud.databricks.com. They give you a Spark cluster where you can start, and there are several notebook-style tutorials. Check it out. Of course, you could start your ETL / Data Engineering in a more "traditional" way, learning about relational databases and the like. But I would suggest you start directly with Spark. You can use it with lots of other big data tools (such as Hadoop/Hive and also S3) and you could also find some interesting machine learning use cases.
d__k almost 7 years ago
This list has numerous pointers to resources related to data engineering: https://github.com/igorbarinov/awesome-data-engineering
Zaheer almost 7 years ago
It may not be exactly what you're looking for, but to get started without having to set up all the infrastructure you could use something like AWS Glue. There are some tutorials/examples in the official AWS docs: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html

Disclosure: I work on AWS Glue. Note: We're hiring! Email me at zaheeram <at> amazon dot com if you're interested in this space!
account2 almost 7 years ago
Do data engineering jobs require you to have knowledge of statistics, machine learning, etc.? Or can you get data engineering jobs where you are focused on connecting pipelines and data flow? I am interested in a data engineering job, but a lot of job listings appear to require machine learning expertise. I am interested in a purely data-engineering-focused role.
hhs19832 almost 7 years ago
Many posts here are focused on classic ETL. I'm working on a small project for handling data just after ETL.

It's for dealing with annoyingly large data: bigger than RAM but sitting on a personal PC. It basically performs sampling and munging for this data. There's no good solution for this right now (I know because I've been looking for more than a year).

What might be interesting to you is that there's little abstraction in the project, but it's non-trivial to execute. To me, this makes it fun. Despite its simplicity, it has high utility and could be used by others. This would be a great outcome for an initial project.

I've got a working version of it, but it would benefit from the eye of a seasoned Python dev.

Maybe it would be interesting to you to get in touch? My email is: mccain.alex@yandex.com

Cheers,
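For context, the basic chunked approach to sampling bigger-than-RAM files (which the project above presumably improves on) is only a few lines of pandas; the file name and sampling rate are arbitrary:

    import pandas as pd

    # Stream the file through memory in fixed-size chunks, keeping ~1% of each.
    parts = [chunk.sample(frac=0.01)
             for chunk in pd.read_csv("huge.csv", chunksize=100_000)]
    sample = pd.concat(parts, ignore_index=True)
    print(len(sample))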
debarshri almost 7 years ago
I think basic data warehousing concepts like creating data marts, building star or snowflake schemas, dimensional modelling, and slowly changing dimensions are quite important before you jump into why Hadoop, Hive, HBase, or Spark is relevant.
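For anyone unfamiliar with the terms: a star schema is a central fact table of measures keyed to descriptive dimension tables. A toy version in stdlib sqlite3, with invented table names:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, day TEXT, month TEXT
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY, name TEXT, category TEXT
        );
        -- The fact table holds measures plus foreign keys into the dimensions.
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date (date_key),
            product_key INTEGER REFERENCES dim_product (product_key),
            quantity INTEGER,
            revenue REAL
        );
    """)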
idiotclock almost 7 years ago
Spark is not too tricky to dive into, even though you can't really take advantage of it unless you have a big cluster to use :)

If you want to practice data manipulation and a lot of the map-reduce type stuff you can do with Spark, I find pandas useful for small datasets (there's a lot of overlap in functionality as far as DataFrames are concerned).

For pipeline stuff, definitely take a look at Luigi, but again, without a cluster it'll be less fun. Still, if you try automating tasks with a mini Luigi scheduler on your localhost, it will be good practice.
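The pandas overlap mentioned here is mostly split-apply-combine; a tiny made-up example of the word-count-style aggregation you would otherwise write as a map-reduce job:

    import pandas as pd

    df = pd.DataFrame({
        "user": ["a", "b", "a", "c", "b"],
        "bytes": [120, 300, 80, 50, 10],
    })

    # groupby splits, agg applies and combines: the same shape as map-reduce.
    print(df.groupby("user")["bytes"].agg(["count", "sum"]))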
meterplech almost 7 years ago
Google Cloud Composer, built on top of Airflow (mentioned by a number of people here), has a great getting started guide: https://cloud.google.com/composer/docs/quickstart

You can dive right into Airflow here: https://airflow.apache.org/tutorial.html
textmode almost 7 years ago
What are the industries that have the highest costs for ETL/Data Engineering?

What companies in these industries are interested in reducing their costs for this work?

"Costs" as used above includes time expenditures as well as money spent.

http://web.archive.org/web/19991023120316/http://www.dbmsmag.com:80/9509d05.html
eleijonmarck almost 7 years ago
Hey Zabana!

I noticed you are in the Europe area. We are looking for developers with an interest in going into the field.

We produce a data pipeline/analytics platform for complex data for our customers.

Get in touch if you would like to know more about our company! The position is in Stockholm. https://gist.github.com/eleijonmarck/1b384480aaa3d22ab7e6ea036d297d10
chrisweekly almost 7 years ago
You might want to take a look at lnav (https://lnav.org), a sort of mini-ETL CLI power tool. Performance is fine up to a few million rows.

Edit: of course this comment (like many others below) pertains to tooling rather than a particular project per se. FWIW, I agree with others' sentiment about doing things "by hand" and working with data that holds your interest.
jakecodes almost 7 years ago
While still alpha right now, Meltano is a project from GitLab. We are looking for contributors who have a passion for bringing the excellent ideas of software development to the data science world: https://gitlab.com/meltano/meltano/. Feel free to post some issues and give it a try.
twocats almost 7 years ago
An excellent survey of the field: https://dataintensive.net/
stadeschuldt almost 7 years ago
You could check out https://github.com/mara/mara-example-project

The project is just a demonstrator for the Mara framework, but it gives a good overview and you could take it as a starter for something you want to build.
currymj almost 7 years ago
If you have any interest in baseball, https://www.retrosheet.org/ might be of interest. They have play-by-play accounts for every Major League Baseball game going back to 1914.

This strikes me as a good testbed for personal projects, because:

- It's enough data to be inconvenient, but can still definitely fit on a single machine.
- It comes in a weird format that is sort of a pain to process, which is good practice.
- Just loading each event into a database won't be enough; you'll have to transform and reorganize it further to support answering interesting questions.
- Last I checked, there are undocumented but public APIs for more recent MLB games (on some MLB website), so automatically scraping those and incorporating them into the Retrosheet data is another interesting challenge.
zabana almost 7 years ago
OP here. Thanks for all the great links, suggestions, and ideas. One question I forgot to ask is the following: what exactly does Transform mean? It seems to me that it's a very fluid concept that can change depending on the project. I'd love to know your thoughts on this. Cheers.
danbrooks almost 7 years ago
Web scraping is a good project. LinuxAcademy has some good content on AWS and Hadoop ($50/month).
Swinx43 almost 7 years ago
Hi, I have been working in this field (and some of its previous incarnations) for quite a while. I would be happy to have a discussion to give some pointers and relay some experience if you want.

My email and keybase.io are in my profile, so feel free to get in touch.
dfsegoat almost 7 years ago
Luigi. We literally use this for anything requiring a series of more than a few steps:

https://github.com/spotify/luigi
sbussard almost 7 years ago
SQL. Most of my ETL jobs consist of setting up the overhead pieces in PySpark, then using SQL for the most important logic. It's way more portable that way.
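That pattern, roughly: a few lines of PySpark overhead, with the core logic in a SQL string. The file and column names below are invented for the sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl").getOrCreate()

    # Overhead pieces: read the source and expose it as a SQL view.
    (spark.read.csv("orders.csv", header=True, inferSchema=True)
          .createOrReplaceTempView("orders"))

    # The important logic lives in portable SQL.
    result = spark.sql("""
        SELECT country, SUM(amount) AS revenue
        FROM orders
        GROUP BY country
    """)
    result.write.mode("overwrite").parquet("revenue_by_country")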
boredmgr almost 7 years ago
Convert .mat files to JSON, CSV, and XLS formats.