IMO data engineering is already a specialized form of software engineering. What people interpret as DEs being slow to adopt best practices from traditional software engineering is more about the unique difficulties of working with data (especially at scale) and less about a lack of awareness or desire to use best practices.

Speaking from my DE experience at Spotify and previously in startup land, the biggest challenge is the slow and distant feedback loop. The vast majority of data pipelines don't run on your machine and don't behave like they do on a local machine. They run as massively distributed processes, and their state is opaque to the developer.

Validating the correctness of a large-scale data pipeline can be incredibly difficult, since the successful operation of a pipeline doesn't conclusively determine whether the data is actually correct for the end user. People working seriously in this space understand that traditional practices like unit testing only go so far. And integration testing really needs to work at scale, with easily recyclable infrastructure (and data), to avoid being a massive drag on developer productivity. Even getting the right kind of data fed into a test can be very difficult if the ops/infra of the org isn't designed for it.

The best data tooling isn't going to look exactly like traditional SWE tooling. Tools that vastly shorten the feedback loop of developing (and debugging) distributed pipelines running in the cloud, and that also provide means of validating the output on meaningful data, are where tooling should be going. Shoehorning in traditional SWE best practices will only take off once that kind of developer experience is realized.
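To make the "unit tests only go so far" point concrete, here's a minimal sketch (my illustration, not the parent's code; the transformation, column names, and test data are all hypothetical): a PySpark transformation that passes a local unit test while saying nothing about skew, late-arriving data, or upstream schema drift at production scale.

    # Hypothetical example: a transformation that is easy to unit test locally.
    from pyspark.sql import SparkSession, DataFrame
    import pyspark.sql.functions as F

    def daily_active_users(events: DataFrame) -> DataFrame:
        """Count distinct users per day from an events table."""
        return (
            events
            .withColumn("day", F.to_date("event_ts"))
            .groupBy("day")
            .agg(F.countDistinct("user_id").alias("dau"))
        )

    def test_daily_active_users():
        spark = SparkSession.builder.master("local[2]").getOrCreate()
        events = spark.createDataFrame(
            [("u1", "2021-12-01 10:00:00"), ("u1", "2021-12-01 11:00:00"),
             ("u2", "2021-12-01 12:00:00")],
            ["user_id", "event_ts"],
        )
        result = daily_active_users(events).collect()
        # Passes locally, but says nothing about data skew, late-arriving
        # events, or schema drift in the real upstream table.
        assert result[0]["dau"] == 2

The test is green, the pipeline can still ship wrong data. That gap is exactly the feedback-loop problem described above.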
Great article.

> data engineers have been increasingly adopting software engineering best practices

I think the data engineering field is starting to adopt some software engineering best practices, but it's still really early days. I am the author of popular Spark testing libraries (spark-fast-tests, chispa); they definitely have a large user base, but could also grow a lot.

> The way organizations structure data teams has changed over the years. Now we see a shift towards decentralized data teams, self-serve data platforms, and ways to store data beyond the data warehouse – such as the data lake, data lakehouse, or the previously mentioned data mesh – to better serve the needs of each data consumer.

I think the Lakehouse architecture is the real future of data engineering; see the paper: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

Disclosure: I am on the Delta Lake team, but I joined because I believe in the Lakehouse architecture vision.

It will take a long time for folks to understand all the differences between data lakes, Lakehouses, data warehouses, etc. Over time, I think mass adoption of the Lakehouse architecture is inevitable (benefits of open file formats, no lock-in, separation of compute from storage, cost management, scalability, etc.).
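For the curious, this is roughly what a chispa DataFrame assertion looks like in a test; the transformation under test here is a hypothetical stand-in:

    from chispa import assert_df_equality
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    def with_greeting(df):
        # Hypothetical transformation under test.
        return df.withColumn("greeting", lit("hello"))

    def test_with_greeting():
        source = spark.createDataFrame([("alice",), ("bob",)], ["name"])
        expected = spark.createDataFrame(
            [("alice", "hello"), ("bob", "hello")], ["name", "greeting"]
        )
        # chispa compares schemas and rows and prints a readable diff on failure.
        assert_df_equality(with_greeting(source), expected)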
In some sense, data engineering today is where software engineering was a decade ago:

- Infrastructure as code is not the norm. Most tools are UI-focused. It's the equivalent of setting up your infra via the AWS UI.

- Prod/staging/dev environments are not the norm.

- Version control is not a first-class concept.

- DRY and component reuse are exceedingly difficult (how many times did you walk into a meeting where 3 people had 3 different definitions of the same metric?).

- API interfaces are rarely explicitly defined, and fickle when they are (the hot name for this nowadays is "data contracts" – see the sketch below).

- Unit/integration/acceptance testing is not nearly as ubiquitous as it is in software.

On the bright side, I think this means DE doesn't need to reinvent the wheel on a lot of these issues. We can borrow a lot from software engineering.
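As a sketch of the "data contracts" idea from the list above – pydantic is my choice for illustration, not something the comment prescribes, and the event fields are hypothetical:

    # Hypothetical sketch: a producer-side data contract enforced at the
    # boundary before rows are published downstream.
    from datetime import datetime
    from pydantic import BaseModel, ValidationError

    class OrderEvent(BaseModel):
        order_id: str
        customer_id: str
        amount_cents: int   # agreed unit: cents, not dollars
        created_at: datetime

    def validate_batch(rows: list[dict]) -> list[OrderEvent]:
        """Reject the whole batch loudly instead of silently shipping bad rows."""
        try:
            return [OrderEvent(**row) for row in rows]
        except ValidationError as err:
            raise RuntimeError(f"Batch violates the OrderEvent contract: {err}")

The point is less the library and more that the interface is explicit, versioned in code, and fails fast – the same properties an API schema gives you in software.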
> The declarative concept is highly tied to the trend of moving away from data pipelines and embracing data products –

Of course an Airbyte article would say this, because they are selling these tools, but my experience has been the opposite. People buy these tools because they claim to make it easier for non-software people to build pipelines. But these tools seem to end up being far more complicated and less reliable than pipelines built in code.

There's a reason this domain is saturated with so. many. tools. None of them do a great job. And when a company invariably hits the limits of one, they start shopping for a replacement, which will have its own set of limitations. Lather, rinse, repeat.

I built a solid career over the past 8 or so years replacing these "no code" pipeline tools with code once companies hit the ceilings of these tools. You can get surprisingly far in the data world with Airflow plus a large-scale database, and all of the major cloud providers have great tool offerings in this space. Plus, for platforms that these tools don't interface with, you're going to have to write code anyway.
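For reference, a minimal sketch of the "pipelines in code" alternative using Airflow's TaskFlow API – the task bodies, names, and data are hypothetical placeholders, not a recommended production layout:

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
    def orders_pipeline():
        @task
        def extract() -> list:
            # Placeholder: pull rows from a source API or database.
            return [{"order_id": "o1", "amount_cents": 1250}]

        @task
        def transform(rows: list) -> list:
            # Ordinary Python: testable, diffable, version-controlled.
            return [r for r in rows if r["amount_cents"] > 0]

        @task
        def load(rows: list) -> None:
            # Placeholder: write to the warehouse / large-scale database.
            print(f"loading {len(rows)} rows")

        load(transform(extract()))

    orders_pipeline()

Everything here lives in version control, runs in CI, and can be debugged like any other program – which is the ceiling the no-code tools tend to hit.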
I'm a software dev who's been bumping up against the data engineering field lately, and I've been dismayed at how many popular tools shunt you towards unmaintainable, unevolvable system design:

- A predilection for SQL, yielding "get the answer right once" big-ball-of-SQL solutions that are infeasible to debug or modify without causing regressions.

- Poor support for unit testing.

- Poor support for version control.

- Frameworks over libraries (because the vendors want to lock you in).

> data engineers have been increasingly adopting software engineering best practices

We can only hope. I think it's more likely that in the near term data engineers will get better and better at prototyping within low-code frameworks, and that transitioning from prototype to evolvable system will get harder.
It's fascinating that the conclusion/forecast is that tools will abstract away engineering problems and DE will move closer to the business, when over the last 20 years the exact opposite has happened: the toolset has actually become harder (not easier) to use but orders of magnitude more powerful, and DE has moved closer to engineering, to the point where a good data engineer basically is a specialized software engineer.

The absolute pinnacle of "easy to use" was probably the Informatica/Oracle stack of the late '90s and early '00s. It just wasn't powerful or scalable enough to meet the needs of the Big Data shift.

Of course, I guess this makes sense given that the author works for a company with a vested interest in reversing that trend.
As someone who knows nothing about this stuff, I'm looking at the "Data mart" wiki page: https://en.wikipedia.org/wiki/Data_mart. OK, so the entire diagram here is labelled "Data Warehouse", and within that there's a "Data Warehouse" block which seems to be composed solely of a "Data Vault". Do you need a special data key to get into the data vault in the data warehouse? Okay, naturally the data marts are divided into normal marts and strategic marts – seems smart. But all the arrows between everything are labelled "ETL". Seems redundant. What does it mean anyway? OK, apparently it's just... moving data.

Now I look at https://en.wikipedia.org/wiki/Online_analytical_processing. What's that? First sentence: it "is an approach to answer multi-dimensional analytical (MDA) queries". I click through to https://en.wikipedia.org/wiki/Multidimensional_analysis ... MDA "is a data analysis process that groups data into two categories: data dimensions and measurements". What the fuck? Who wrote this? Alright, back on the OLAP wiki page... "The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space." Ah yes, the intersections... why not keep math out of it if you have no idea how to talk about it? Also, there's no actual mention of why this is considered "online" in the first place. I feel like I'm in a nightmare where the pandas documentation was rewritten in MBA-speak.
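Since pandas came up: the dimensions-vs-measures idea the wiki mangles is really just a pivot table. A toy example with made-up data – dimensions are the grouping axes, measures are the aggregated numbers sitting at their intersections:

    import pandas as pd

    # Made-up sales data. "region" and "month" are dimensions;
    # "revenue" is a measure.
    df = pd.DataFrame({
        "region":  ["EU", "EU", "US", "US"],
        "month":   ["Jan", "Feb", "Jan", "Feb"],
        "revenue": [100, 150, 200, 175],
    })

    # An OLAP "cube" in miniature: measures sit at the intersections
    # of the dimension values.
    cube = pd.pivot_table(df, index="region", columns="month",
                          values="revenue", aggfunc="sum")
    print(cube)
    # month   Feb  Jan
    # region
    # EU      150  100
    # US      175  200

An OLAP cube is this idea generalized to many dimensions and precomputed for fast querying ("online" here means interactive query speed, as opposed to batch reporting).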
For those of you who are genuinely curious why this field has so many similarly-named roles, here's a sincere, non-snarky, non-ironic explanation:

A Data Analyst is distinct from a Systems Analyst or a Business Analyst. They may perform both systems and business analysis tasks, but their distinction comes from their understanding of statistics and how they apply it to other forms of analysis.

An ML specialist is not a Data Scientist. Did you successfully build and deploy an ML model to production? Great! That's still uncommon, despite the hype. However, that would land you in the former position. You can claim the latter once you've walked that model through the scientific method, complete with hypothesis verification and democratization of your methodology.

A BI Engineer and a Data Engineer are going to overlap a lot, but the former is going to lean more towards report development, while the latter will spend more time with ELTs/ETLs. As a data engineer, most of the report development that I do is to report on the state of data pipelines. BI BI, I like to call it.

A Big Data Engineer or Specialist is a subset of data engineers and architects angled towards the problems of big data. This distinction actually matters now, because I'm encountering data professionals these days who have never worked outside the cloud or with anything but small enterprise datasets (unthinkable only half a decade ago).

It doesn't help that lack of understanding often leads to misnomer positions, but anybody who has spent time in this field gets used to the subtle differences quickly.
Business Analyst, Big Data Specialist, Data Mining Engineer, Data Scientist, Data Engineer.

Why is this field so prone to hype and to repeating the same things with a new coat of paint? I mean, whatever happened to OLAP, data cubes, Big Data, and every other super big next thing of the past 2 decades?

Methinks the problem with Business Intelligence solving problems is the first part of the term and not the second.
I live in the world of data lakes and elaborate pipelines. Now and again I get to use a traditional star schema data warehouse and … it is an absolute pleasure to use in contrast to modern data access patterns.
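For readers who haven't used one, a toy sketch (entirely hypothetical tables, with pandas standing in for the warehouse) of why star schemas feel so pleasant – every question is the same join-facts-to-dimensions-then-group pattern:

    import pandas as pd

    # Hypothetical star schema: one fact table, two dimension tables.
    fact_sales = pd.DataFrame({
        "date_key":    [20211201, 20211201, 20211202],
        "product_key": [1, 2, 1],
        "amount":      [9.99, 24.50, 9.99],
    })
    dim_date = pd.DataFrame({
        "date_key": [20211201, 20211202],
        "month":    ["2021-12", "2021-12"],
        "weekday":  ["Wed", "Thu"],
    })
    dim_product = pd.DataFrame({
        "product_key": [1, 2],
        "category":    ["books", "games"],
    })

    # Every analytical question has the same shape: join facts to
    # dimensions, then group by whatever dimension attributes you need.
    report = (
        fact_sales
        .merge(dim_date, on="date_key")
        .merge(dim_product, on="product_key")
        .groupby(["month", "category"])["amount"].sum()
    )
    print(report)  # revenue per month per category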
> Titles and responsibilities will also morph, potentially deeming the "data engineer" term obsolete in favor of more specialized and specific titles.

"Analytics engineer" is mentioned, and the role also just had its first dedicated conference (dbt's Coalesce). All the talks are already up: https://coalesce.getdbt.com/agenda/keynote-the-end-of-the-road-for-the-modern-data-stack-you-know
I won't be surprised if DE ends up just falling under the "software engineering" umbrella as the jobs grow closer together. With hybrid OLAP/OLTP databases becoming more popular, the skillset delta is definitely smaller than it used to be. Data engineers are higher-leverage assets to an organization than they have ever been.
Good, readable historical overview, IMO.

The referenced "The Rise of the Data Engineer" article proved quite prescient.

The present overview seems less good on future predictions (it's hard!). Also, some mention of "data science" could have been apropos – though it's difficult to find consensus on what DS actually entails.

Overall, I think a good framework for navigating this space is to think of two overarching disciplines:

1) Engineering: scaling data & software.

2) Science: getting actionable insights from data, combined with subject matter/domain expertise.

Usually, #1 is a platform for #2 – but getting #1 right seems just as, if not more, important, and is perhaps harder.
Have you seen what is possible with Elixir and its Broadway library? You can set up a fault-tolerant, concurrent worker pool with all of the necessary feedback mechanisms involved in message processing.
A full history of DE should include some of the original low-code tools (Cognos, Informatica, SSIS). To some extent, the failure of these tools to adapt to the evolution of the DE role has led to our modern data stack.