I feel the term "Data Engineer" gets used for a lot of catch-all "we have problems that need an owner" situations.<p>There's not much consistency across job postings and interviews for this kind of thing.<p>I just interviewed for one "Data Engineer" position which consisted of nearly 100% stored procedures. No one knew what else to call it, and they didn't want to advertise for a DBA, because there were no real DBA responsibilities. So "Data Engineer" was chosen.<p>Another "Data Engineer" position was almost entirely Spark. There was no SQL involved - they expected all applicants to be Spark experts, with a deep knowledge of Scala.<p>It's hard to know what to expect out of "Data Engineer" positions until you walk into a place and start asking questions in the interview.
I would pay in solid gold for a data engineer that knew how to glue <the things the data scientists need> to <the rest of the infrastructure> in a way that fixes the impedance mismatch that seems to exist in the tooling.<p>In my experience data tools don't mesh well with "cloud"-y IAM, monitoring, or auditing frameworks. Data folks ssh to shared cloud workstations and of course use agent forwarding because that's what the tooling expects. They want to use EFS to share data sets even though NFS on machines where people have sudo is a bad idea / EFS is maybe a poor fit if you're thinking about governance / provenance. There's a mix of "notebooks" running locally (or on the shared workstations) and DAGs running in the cloud with bespoke access control that either doesn't map to IAM or else there's no access control so to get to the dashboard you forward a port with SSH.<p>It's enough to make me want to wall them off in a separate AWS account, but maybe I'm just being a grumpy old SRE. <i>edit: as I mention downthread, this is a knee-jerk reaction and is not likely to "succeed" for whatever definition of "success" your business has.</i>
"The Role of a Data Engineer on a Team is Complementary and
Defined By The Tasks That Others Don’t (Want To) Do (Well)" -self<p>From a talk I've given a few times called, "Life of a Data Engineer"<p>(Google slides link: <a href="https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6xlu4ATGEfEsU84xW3UMcIdd8/edit?usp=sharing" rel="nofollow">https://docs.google.com/presentation/d/1Oer3Z9OXPsk9H9WE5g6x...</a>)
I run a team of data engineers, and over the years there has been a lot of confusion about what a data scientist is and what a data engineer is.<p>I draw the divide like this: data scientists discover the features and the methodology, while data engineers take these insights to production. One can argue that data scientists could do that themselves, but this is constrained by domain expertise in the tools (be that the depth of Spark internals or whatever) and the number of hours in the day. It's hard enough to deal with the variance of the models without also having to deal with the variance of the system.<p>A good data engineer is a unicorn. I define three central competencies for a data engineer:
<i>be a good coder</i>: quality, maintainability, efficiency,
<i>know how to explore the data</i>: SQL, R, just eye the damn data feed,
<i>know enough data science to interface with scientists</i><p>For a data engineer it's okay not to know probability theory and stats that well, but it's a must for a data scientist (running TensorFlow out of the box with no understanding of the underlying math doesn't make a data scientist, just a common butcher).
When I tried to hire data engineers under that title I got a ton of resumes from people with very poor programming skills. It wasn’t until I swapped the job title to “software engineer” and put the data engineering details in the description that I got resumes from people with appropriate skills.<p>The main issue with good programmers is that you need to make sure that candidates know what the job entails and are on board with it. There are definitely complexities involved, but by and large it isn’t the type of work that CS programs glorify as “interesting work”.
I was under the impression that the Data Engineer role is just the market reaction to too many Data Scientists being produced without the programming skills necessary to self-enable their day-to-day work.<p>Reading the comments, maybe I was naive.
I'm interviewing for a Data Engineering position right now, and one of the questions I was told to prepare for is "What is data engineering?" I think it's far more than just the data science aspects this article talks about. Data Engineering touches more aspects of your engineering projects than most people think. Curious what this crowd has to say about my idea here. Also, I'm looking for work. If you like my thoughts, hit me up.<p>I think there are 4 buckets of data engineering problems, each with their own challenges and solutions.<p>Operational Data Engineering
This is the detritus that grows like weeds as parts of other projects and often isn't recognized as a data engineering problem. We need to pull a file off an FTP server or hit an API and do something with it. Next thing you know, there are dozens of these little things that are not individually hard, but having visibility into dependency trees and failure cases becomes difficult because they are spread out everywhere and it's not obvious where to look when things go wrong. Tools like Apache Airflow are a good solution even if you don't use them in other ways because they can centralize monitoring, logging, and graphs. Scaling isn't resource intensive for these tasks because they are discrete. You can fan out. The scaling challenge for this type of data engineering is really about tending your garden and keeping things coherently organized.<p>Business Logic Data Engineering
This is processing where the data is highly structured and sometimes even ordered or sequenced. It's hard to scale because you can't just throw things into a stream and apply multiple workers. You have to have a managed process and likely shared in-memory state that collects the worker results and applies strict rules to a process. This is the opposite problem from big data. It's small data, rigidly organized, and carefully managed.<p>Data Science Data Engineering
This is sort of classic ETL with a twist. ETL systems are typically pretty static once the E, T, and L are known quantities. But working with Data Scientists requires that your pipelines be pretty flexible, because scientists are doing experiments. They also have to be repeatable and comparable, which means your pipeline has to be versioned. This is also the area where you are most likely to encounter Big Data, so you have to be prepared to change your mental model and be able to use tools like Hadoop and Spark to bring compute to where your data is.<p>Analytics Data Engineering
These are classic ETL pipelines that move data from point A to data lakes or data warehouses. The key thing to understand here is what you are modeling at the endpoint. If it's a legit data warehouse, you are modeling business processes. If you aren't doing that, you are--by definition--pushing data to a lake. Understanding your endpoint is key to choosing the reporting and analytics tools to layer on top of your data source. Data lakes are a good use case for ad-hoc, SQL-driven reporting tools like Metabase. But if you are sitting on top of a well-structured fact/dimension type of warehouse, you will want more formal tools like Tableau, Pentaho, or Cognos.
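<p>To make the "fact/dimension type of warehouse" point concrete, here is a minimal sketch of a star schema, using Python's built-in sqlite3 purely for illustration. The table and column names (fact_order, dim_customer, dim_date) and the sample rows are hypothetical, not from any real system. The fact table records one business process (an order being placed) and the dimensions describe its context.

```python
import sqlite3

# Hypothetical star schema: one fact table for the "order placed" business
# process, plus two dimension tables. All names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    region       TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,   -- e.g. 20240115
    year     INTEGER NOT NULL,
    month    INTEGER NOT NULL
);
CREATE TABLE fact_order (
    order_id     INTEGER PRIMARY KEY,
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    amount_cents INTEGER NOT NULL
);
""")

conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Acme", "EU"), (2, "Globex", "US")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20240115, 2024, 1), (20240220, 2024, 2)],
)
conn.executemany(
    "INSERT INTO fact_order VALUES (?, ?, ?, ?)",
    [(100, 1, 20240115, 2500), (101, 1, 20240220, 1500), (102, 2, 20240220, 4000)],
)


def revenue_by_region_and_month(connection):
    """Aggregate the fact table, sliced by attributes from both dimensions."""
    return connection.execute("""
        SELECT c.region, d.year, d.month, SUM(f.amount_cents)
        FROM fact_order AS f
        JOIN dim_customer AS c ON c.customer_key = f.customer_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY c.region, d.year, d.month
        ORDER BY c.region, d.year, d.month
    """).fetchall()


report = revenue_by_region_and_month(conn)
for row in report:
    print(row)
```

<p>Formal BI tools tend to assume roughly this shape: measures live in the fact table and every slice is a join to a dimension. A lake usually has no such guarantee, which is why ad-hoc SQL tools fit it better.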
Another helpful distinction here, I think, is architect != engineer; however, you often see data architects who are also data engineers. I do feel there is a clear difference of focus, though.
Data Engineer and Information Architect terms have both been watered down and bastardized so they are ambiguous in meaning. I hate putting them on a CV anymore.<p>Next topic "HTML Programmer".
Data is a pretty major component of the programmer's craft, whether it's DBs, I/O, or blobs. Most any experienced programmer is a "Data Engineer".
So many posts in this thread are spot on. I've heard descriptions of some tech positions being equivalent to 'internet plumbers.' Well, having spent a two-week rotation shadowing plumbers in my youth, I have come to think of what I do as more akin to being an 'internet garbage man.' I deal with the shit that no one else wants to deal with, or maybe I'm more like an e-waste manager. There is gold in the shit, but no one wants to actually do the dirty work of building the system to move all the nasty sharp PCBs to somewhere the precious metals can be extracted, in a way that keeps delicate workers from cutting themselves to pieces.<p>No surprise, it is hard to find people who want to do this job and are good at it. I see the demand in the academic world ('scholarly infrastructure' is a very niche place) where it is nearly impossible to hire someone who can do this work, so hearing that it is also impossible in industry means I guess it is time to start training the undergrads :/.<p>I have an idea for a curriculum that could teach some of the principles for this kind of work (give them the Gentoo handbook for a start, and see if they can follow it to get a database up and running from a box of parts), but I suspect that mostly it would act as a way to filter out people who simply don't like the activity, and you also have to have some amount of interpersonal skills in order to understand the use cases of your colleagues ....<p>Anyone who cracks this problem will have solved a far more general one in the process.