I am just an amateur, but gosh I love doing ETL. There's something about the specification of the data you're supposed to be getting, and then the harsh reality of the filthy trash you will actually be fed; about building something that can carefully examine each assumption in the spec and test for the ways it won't go right; writing tests for things that will, you are told, "never happen" (only to get an email from your program three years later that this has, in fact, happened); interpolating data where you can; and in general turning a "simple load process" into one of those grinders people can feed cows and tungsten carbide rods into.

It's like Data mentioning that he just ... loves scanning for life forms.
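[Editor's aside, not from the comment above: a minimal Python sketch of the kind of "this will never happen" guard being described. The field names, addresses, and alerting helper are all invented for illustration.]

    # Minimal sketch of a "this will never happen" guard, with invented
    # field names and addresses; the alerting helper is hypothetical.
    import smtplib
    from email.message import EmailMessage

    def alert(subject: str, body: str) -> None:
        """Email a human when an 'impossible' condition actually occurs."""
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "etl@example.com"      # invented addresses
        msg["To"] = "oncall@example.com"
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    def load_row(row: dict) -> None:
        # The spec says quantity is always positive -- check anyway.
        if row.get("quantity", 0) <= 0:
            alert("ETL: 'never happens' happened",
                  f"Non-positive quantity in row: {row!r}")
            return  # skip the bad row instead of corrupting the load
        # ... normal load path would go here ...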
I've always found ETL frameworks to have their own problems. They look great on paper, but they usually don't account for your specific source systems, APIs, applications, data size, data distribution, or scheduling situation. If your project is using one, developers end up hacking around the framework instead of writing simple code that does the specific thing they need to do.

Before you know it you have super long and super inefficient code just to fit the framework. It takes about as long to read and understand an ETL framework as it does to write your own Python/bash script, and at least with your own code it's easier to see bottlenecks.
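[For illustration, a minimal sketch of the kind of small, purpose-built script this comment is describing. The endpoint, DSN, table, and column names are invented.]

    # A minimal extract-transform-load sketch: pull rows from an API,
    # reshape them, and insert into Postgres. Endpoint, DSN, and column
    # names are invented for illustration.
    import requests
    import psycopg2

    def extract(url):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.json()

    def transform(records):
        # Keep only the fields we need, coercing types explicitly.
        return [(r["id"], r["name"], float(r["amount"])) for r in records]

    def load(rows, dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO orders (id, name, amount) VALUES (%s, %s, %s)",
                rows,
            )

    if __name__ == "__main__":
        load(transform(extract("https://api.example.com/orders")),
             "dbname=warehouse user=etl")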
Airbyte Engineer here.

I think some of the points made here about ETL scripts being just 'ETL scripts' are very relevant. I have definitely been on the other side of the table arguing for a quick 3-hour script.

Having written plenty of ETL scripts - in Java with Hadoop/Spark, Python with Airflow, and pure Bash - that later morphed into tech-debt monsters, I think many people underestimate how quickly these can snowball into proper products with actual requirements.

Unless one is extremely confident an ETL script will remain a non-critical nice-to-have part of the stack, I believe evaluating and adopting a good ETL framework, especially one with pre-built integrations, is a good case of 'sharpening the axe before cutting the tree' and well worth the time.

We've been very careful to minimise Airbyte's learning curve. Starting up Airbyte is as easy as checking out the git repo and running 'docker compose up'. A UI allows users to select, configure and schedule jobs from a list of 120+ supported connectors. It's not uncommon to see users successfully using Airbyte within tens of minutes.

If a connector is not supported, we offer a Python CDK that lets anyone develop their own connectors in a matter of hours. We are committed to supporting community-contributed connectors, so there is no worry about contributions going to waste.

Everything is open source, so anyone is free to dive as deep as they need or want to.

We also build in the open and have single-digit-hour Slack response times on weekdays. Do check us out - https://github.com/airbytehq/airbyte!
ETL was actually a new acronym for me: Extract, Transform, Load.

https://en.m.wikipedia.org/wiki/Extract,_transform,_load
Their standard scenario to avoid is actually a perfectly acceptable process to grow through before wedging another bit of infrastructure into your org that needs its own provisioning, maintenance and support, redundancy planning, etc.

In that scenario the situation is so nebulous that the original implementers had no way to know where the business/use case would end up, so why would they immediately jump to a belt-and-braces solution?

It's the infrastructure version of my go-to phrase: don't code for every future.
We currently use Airflow for ELT/ETL to ingest from different Postgres databases into BigQuery/Google Cloud Storage. Airbyte looks sweet for the same task and would free us from a big effort burden, but its Postgres source only supports SELECT * statements (i.e. you can't deselect columns).

That's kind of a dealbreaker for us, because for security reasons our Postgres users' permissions are granularly configured with column-based security. I hope the Airbyte team solves this eventually, because the software is looking great.
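[To illustrate why SELECT * breaks here: a hedged sketch of the column-explicit extraction this kind of setup requires, where only the granted columns are ever named. Table, columns, DSN, and output path are invented.]

    # Extract only the columns this role has been granted, never SELECT *.
    # Table, columns, and DSN are invented for illustration; output is
    # newline-delimited JSON, which BigQuery can load from Cloud Storage.
    import json
    import psycopg2

    GRANTED_COLUMNS = ["id", "created_at", "status"]  # column-level grants

    def extract_to_ndjson(dsn, table, out_path):
        query = f"SELECT {', '.join(GRANTED_COLUMNS)} FROM {table}"
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(query)
            with open(out_path, "w") as out:
                for row in cur:
                    out.write(json.dumps(dict(zip(GRANTED_COLUMNS, row)),
                                         default=str) + "\n")

    extract_to_ndjson("dbname=app user=reporting", "public.orders",
                      "orders.ndjson")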
I had this same kind of impression writing product/business reports as an engineering manager at Oracle and elsewhere. But my way of solving it was to build a smarter IDE that helps you do this stuff without ETL.

You shouldn't need to waste time every time you build a report figuring out again how to query Postgres/MySQL/Elastic in Python, how to generate graphs, where to host it, or how to set up recurring alerts on changes. The only parts you should care about are the actual queries to write, the joins to do between datasets, and the fields to graph. The actual integration code (connecting to the database, copying a file from a remote server, etc.) should be handled for you.

The tool I'm building to solve this is open source if you're curious!

https://github.com/multiprocessio/datastation
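[For context, a sketch of the boilerplate being described - connect, query, plot - that you end up rewriting for every report. The connection string, query, and output file are invented.]

    # The kind of per-report boilerplate the comment describes: connect,
    # query, plot. Connection details, query, and column names are
    # invented for illustration.
    import psycopg2
    import matplotlib.pyplot as plt

    def report(dsn):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT date_trunc('day', created_at) AS day, count(*) "
                "FROM signups GROUP BY 1 ORDER BY 1"
            )
            days, counts = zip(*cur.fetchall())
        plt.plot(days, counts)
        plt.title("Signups per day")
        plt.savefig("signups.png")

    report("dbname=app user=reporting")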
Very, very flawed reasoning here. It's basically arguing that YAGNI as a principle is exactly backwards: you're definitely going to need all this stuff eventually, so embrace maximum complexity from day 1. Terrible, terrible strategy.

The only ETL framework you need is a general-purpose programming language. Cf. that article "You can't buy integration" from the other day.
I just got out of a job where we were working on a legacy ETL "script" in Elixir, and terrible code-architecture decisions aside, I think the pattern where you have to launch and monitor long-lasting jobs is a breeze on the BEAM. You just spawn one process per job, have each report back via its mailbox, and monitor via a Supervisor. Unfortunately, making changes to that system, where all the sources were hardcoded, was abysmal to say the least, but the job-running core was quite elegant.

Hopefully Elixir and other BEAM languages will gain enough traction; I can't imagine rewriting something Erlang gives you out of the box in OOP languages with mutable objects.
There's a despair that overcame me when the data we were processing changed date/time format in the middle of a file. It was a file for one day's data! Data on 5-minute intervals for 3 thousand data sources. Received in an email, no less. (Long story.) But the date/time format changed. The initial load was garbage. As many know, some countries use mm/dd/yy and others dd/mm/yy. I just wanted to cry when I dug in and saw the change. Instead, I hacked a little more on our lowly ETL script to check for this circumstance. This was just one example of an ongoing stream of data weirdness, as ETLers everywhere know.

In an early startup (< 4 engineers in my case), I can't imagine using anything but a skanky script at the outset. We went broke before we needed a framework. We needed more engineering time on other things.
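[A hedged sketch of the kind of check described above: flag rows whose date only parses under one of the two ambiguous formats, so a mid-file switch between mm/dd/yy and dd/mm/yy is caught instead of loaded silently. Purely illustrative.]

    # Flag a mid-file switch between mm/dd/yy and dd/mm/yy. Unambiguous
    # rows (e.g. day > 12) pin the format; a later contradiction raises.
    from datetime import datetime

    FORMATS = {"mm/dd/yy": "%m/%d/%y", "dd/mm/yy": "%d/%m/%y"}

    def possible_formats(date_str):
        """Return the names of the formats this string parses under."""
        names = []
        for name, fmt in FORMATS.items():
            try:
                datetime.strptime(date_str, fmt)
                names.append(name)
            except ValueError:
                pass
        return names

    def check_dates(date_strings):
        """date_strings: iterable of date fields, e.g. the first CSV column."""
        seen = None
        for lineno, date_str in enumerate(date_strings, start=1):
            names = possible_formats(date_str)
            if not names:
                raise ValueError(f"line {lineno}: unparseable date {date_str!r}")
            if len(names) == 1:  # unambiguous row pins the file's format
                if seen and names[0] != seen:
                    raise ValueError(
                        f"line {lineno}: format switched from {seen} to {names[0]}")
                seen = names[0]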
I was once tasked with replacing a dying set of ETLs composed in "ScribeSoft"; apparently the built-in scheduling and speed left too much to be desired, and calling other jobs from inside a job would halt the current job. I ended up replacing everything with a C# console application that ran every minute unless a previous run was still in progress. There were a lot of bugs on both ends, but they were tired of paying $5k/yr for the ETL to run.

After I wrote the initial application, they handed it off to their South African dev team to maintain it.
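[The original was a C# console app; below is a hedged Python sketch of the same "run every minute, but skip if the previous run is still going" pattern, using an OS-level file lock as the guard. Unix-only and illustrative, with an invented lock path.]

    # Run the job unless a previous run still holds the lock file
    # (fcntl is Unix-only; the lock path and job are invented).
    import fcntl
    import sys

    def run_once(lock_path, job):
        with open(lock_path, "w") as lock_file:
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                print("previous run still in progress; skipping")
                return
            job()  # lock is released when the file is closed

    if __name__ == "__main__":
        sys.exit(run_once("/tmp/etl.lock", lambda: print("doing the ETL work")))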
Of course, the flip side is that sometimes that initial step of:

> "We are doing a prototype on our new mapping product, we need to write a one-off script to pull in some map data."

... is just that - a one-off script. And it can prove a lot quicker to write a one-off script than to get involved with an ETL framework. I am not arguing against ETL frameworks (Airbyte etc.). Just that over-engineering carries its own costs, just like under-engineering does.
Write your own. You can either use a general-purpose programming language that will do exactly what you want, and your skills will be portable, or you can learn some ETL-vendor-lock-in domain-specific language. Sure, the vendors may do some things for you and make some things easier, but you have to learn their system, and in that time you could be writing your own that does exactly what you want.
The work companies put into ETL is absolutely bizarre from a business standpoint.

Say you want to build a new kind of hammer. Normally what you do is pay a company for access to a facility that has a forging process, and you put your effort into designing the hammer, working with the facility to get it forged the way you want, and selling it.

Building ETL pipelines from scratch is like building an iron ore mining facility, diggers, trucks, shipping containers, trains, boats, iron smelting and forging equipment, and warehouses, on top of designing and selling the hammer. People today are so brainwashed by the idea that they need to be writing software that they go to these extreme lengths to build everything from scratch, when they should almost always be paying somebody else to do most of it and only spend time working on their business's actual problem. Building factories and logistics chains should not be a problem your business needs to solve if all you need to do is build and sell a hammer.
I have been working a lot in this space for the last two years, but especially in the last 6 months. I believe we're about to enter a phase where much more elegant and less restrictive ETL platforms and frameworks become as commonplace as modern software CI/CD offerings. Prefect and Dagster both stand out to me as viable replacements for Airflow.
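[For a sense of what this style looks like, a minimal sketch of a decorator-based flow in Prefect 2.x, written from memory and hedged accordingly; the task and flow names are invented.]

    # A minimal Prefect 2.x-style flow: tasks declared with decorators,
    # retries handled by the framework. Names and data are invented.
    from prefect import flow, task

    @task(retries=3)
    def extract():
        return [{"id": 1, "amount": "4.20"}]

    @task
    def transform(records):
        return [(r["id"], float(r["amount"])) for r in records]

    @task
    def load(rows):
        print(f"would insert {len(rows)} rows")

    @flow
    def etl():
        load(transform(extract()))

    if __name__ == "__main__":
        etl()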
I am tasked with standardizing ETL in a small org, and Airbyte is on my list to evaluate. As I write this there are 64 comments in the thread and only a single one mentions actual experience using Airbyte. Any other actual insights from using the tool? What about the Airflow downsides mentioned in the article? Thx!
At our company, we actually built ETL-framework-agnostic wrappers, monitoring, logging, and scheduling tooling around the four different ETL product frameworks we used: Microfocus COBOL, Torrent Orchestrate, Datastage (which incorporated Torrent), and Abinitio.

The wrappers invoked the ETL command, then reformatted and consolidated the logs.

For scheduling, we relied mostly on CA Autosys instead of whatever scheduling mechanisms came with the ETL product.

We found this approach made it easier to transition from one product to another, as it was consistently faster to plug a new ETL framework into the supporting framework than to implement everything the new ETL product offered.

As we move from our on-prem environment to the cloud, we hope we can implement a similar strategy even if we have to switch the supporting frameworks.
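[A hedged sketch of the wrapper pattern described above: invoke the underlying ETL product's command line, capture its output, and emit one consolidated log record regardless of which product ran. The command, product names, and log format are invented.]

    # Wrap any ETL product's CLI: run it, capture output, and print a
    # single consolidated log record. Details are invented for illustration.
    import subprocess
    import json
    import sys
    from datetime import datetime, timezone

    def run_etl(product, command):
        started = datetime.now(timezone.utc)
        result = subprocess.run(command, capture_output=True, text=True)
        record = {
            "product": product,
            "command": command,
            "started": started.isoformat(),
            "finished": datetime.now(timezone.utc).isoformat(),
            "exit_code": result.returncode,
            "stdout": result.stdout[-2000:],   # tail to keep records bounded
            "stderr": result.stderr[-2000:],
        }
        print(json.dumps(record))              # one consolidated log format
        return result.returncode

    if __name__ == "__main__":
        sys.exit(run_etl("datastage", ["echo", "pretend ETL job"]))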
Lots of dislikes on this - but I find Airbyte to be a great idea, and really exciting for a lot of people. Just because you -can- write a SQL extract in cron doesn't mean you should: you're missing a lot of features whose absence you don't feel until you hit scale. At that point you're in trouble.
I'm interested in the concept, but I couldn't find a demo or 'hello world' example to look at. This is a very important aspect of promoting a new product.

Look at how Prefect does that (I know they are well along the path); something is missing here.
Hiya, I'm the original author. tl;dr for those deciding whether or not to read it:

If you are thinking about build versus buy for your ETL solution, in this day and age there are enough great tools out there that buy is almost always the right option. You may think you can write a "simple" little ETL script to solve your problem, but invariably it grows into a monster that becomes a reliability liability and an engineering time suck. The post goes into more depth on why that is. Enjoy!
Argo Workflows looks like the best of both worlds to me. You can easily build up complex ETL/data-processing DAGs where each step is a Docker container, so you can choose the best tool for each job. Argo has all the retry/backoff logic built in, and you can plug the workflows into Argo Events. It runs on Kubernetes, so you can scale your compute according to your workflow demands.
Please correct me if I'm doing it wrong:

Scripts collect data from external sources.

Scripts insert the data into the DB.

Other scripts query the DB and add data to other tables.

Sure, as was mentioned, data can be missing and so on. But in a mature project you basically already have some monitoring, so you could just use the existing solution instead of a new framework.

Is there something wrong with such an approach?
I think before jumping onto no/low-code tools you should read this post by Brandon Byars, just to get a complementary view on integration:

https://martinfowler.com/articles/cant-buy-integration.html
Can any of these ETL frameworks kick off an existing ETL script without a rewrite? Something that would handle scheduling and retries and emit metrics around those actions, but let me use my own tools for the actual data manipulation.
I've just been spinning up new C# console projects in VS and pulling down Dapper to do most of the nasty ad-hoc things between databases.

I never bother to save any of these to source control because the boilerplate is negligible and it's usually a one-time deal.
Reminds me of the xkcd "How Standards Proliferate".

Arguably you could use Kubernetes as a scheduler, or "ETL framework/kit": it supports cron jobs, has a RESTful API, local and remote secrets storage, native support for cloud storage, support for multiple logging solutions, distributed workloads, etc.

Years ago I worked for a financial services company and they would run their batch ETL jobs via a product called Tidal that later got bought by Cisco. I really liked using Tidal for the longest time, but 100% of what Tidal does you can replicate with the scheduler features of Kubernetes.