How to Become a Data Engineer in 2021

264 points by adilkhash over 4 years ago

26 comments

(source for everything following: I recently hired entry-level data engineers)

The experience required differs dramatically between [semi]structured transactional data moving into data warehouses versus highly unstructured data that the data engineer has to do a lot of munging on.

If you're working in an environment where the data is mostly structured, you will be primarily working in SQL. A LOT of SQL. You'll also need to know a lot about a particular database stack and how to squeeze performance out of it. In this scenario, you're probably going to be thinking a lot about job-scheduling workflows, query optimization, and data quality. It is a very operations-heavy workflow. There are a lot of tools available to help make this process easier.

If you're working in a highly unstructured data environment, you're going to be munging a lot of this data yourself. The "operations" focus is still useful, but as an entry-level data engineer, you're going to be spending a lot more time thinking about writing parsers and basic jobs. If you're focusing your practice time on writing scripts that move data in Structure A in Place X to Structure B in Place Y, you're setting yourself up for success.

I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments, but there are a lot of useful concepts in Hadoop/Spark that are helpful for data engineers to be familiar with. While you might not be using those tools on a day-to-day basis, chances are your hiring manager used them when she was in your position, and it will give you an opportunity to know a few tools at a deeper level.
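As a rough illustration of the "Structure A in Place X to Structure B in Place Y" kind of practice script, here is a minimal sketch that reshapes flat CSV rows into nested JSON Lines. The column names (`id`, `name`, `country`, `amount`) and the target layout are made up for the example, and in-memory buffers stand in for real files or buckets.

```python
import csv
import io
import json


def csv_to_jsonl(src, dst):
    """Read flat CSV rows from src and write one nested JSON object per line to dst."""
    count = 0
    for row in csv.DictReader(src):
        # Hypothetical target structure: nest customer fields, store money as integer cents.
        record = {
            "id": int(row["id"]),
            "customer": {
                "name": row["name"].strip(),
                "country": row["country"].upper(),
            },
            "amount_cents": round(float(row["amount"]) * 100),
        }
        dst.write(json.dumps(record, sort_keys=True) + "\n")
        count += 1
    return count


# In-memory stand-ins for "Place X" and "Place Y".
source = io.StringIO("id,name,country,amount\n1, Ada ,gb,12.50\n2,Linus,fi,3.99\n")
target = io.StringIO()
rows_written = csv_to_jsonl(source, target)
```

Swapping the `io.StringIO` objects for opened files (or a cloud storage client) keeps the transformation itself unchanged, which is most of what such practice scripts exercise.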

laichzeit0 over 4 years ago

I think it's missing the resources for one of the hardest sections: data modelling, like Kimball and Data Vault. That, and maybe a section on modern data infrastructure. I'd put a link to [1] and [2] for a quick overview and probably [3] for more detail.

[1] https://www.holistics.io/books/setup-analytics/
[2] https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/
[3] https://awesomedataengineering.com/

prions over 4 years ago

SQL proficiency is important but I wouldn't say it supersedes programming experience. To me, Data Engineering is a specialization of software engineering, and not something like an analyst who writes SQL all day.

As DE has evolved, the role has transitioned away from traditional low-code ETL tools towards code-heavy tools: Airflow, Dagster, and DBT, to name a few.

I work on a small DE team. We don't have the human power to grind out SQL queries for analysts and other teams. Our solutions are platforms and tools we build on top of more fundamental tools that allow other people to get the data themselves. Think tables-as-a-service.

StreamBright over 4 years ago

2021? More like 2010. Hadoop is getting deprecated rapidly and more companies split their write and read workloads. Separated storage and compute is also popular. Scala is not used that much; I think it is not worth the time investment. More and more companies go for Kotlin instead of Java when they want to tap into the Java ecosystem.

ABeeSea over 4 years ago

I think learning Scala is a bit of a waste of time, but I don't know everyone's stack. Maybe it's a west coast bubble, but serverless seems to be the most popular choice for new ETL stacks even if the rest of the cloud tech stack isn't serverless: AWS tools like Kinesis, Glue (PySpark), Step Functions, pipelines, Lambdas, etc.

If you are working in that domain, being able to use the CDK in TypeScript becomes way more important than being able to build a Hadoop cluster from scratch using Scala.

wheaties over 4 years ago

...and nothing of basic statistics? Data Science people want to know about your data pipeline and have some quantification of the quality of that data. Also, monitoring data pipelines for data integrity often relies upon a statistical test. You don't need to go as far as Bayesian methods, but you do need to recognize when a median goes way off, when a distribution is bimodal, etc.
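To make the "median goes way off" check concrete, here is a minimal sketch of one possible integrity test: it flags a batch whose median is far from a baseline median, measured in robust deviations (the median absolute deviation scaled by 1.4826). The function name, the threshold of 3, and the sample numbers are all assumptions for illustration, not something from the article.

```python
import statistics


def median_drifted(baseline, batch, threshold=3.0):
    """Return True when the batch median is more than `threshold` robust
    deviations (scaled MAD) away from the baseline median."""
    base_median = statistics.median(baseline)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median(abs(x - base_median) for x in baseline)
    batch_median = statistics.median(batch)
    if mad == 0:
        # Baseline has no spread; any shift in the median counts as drift.
        return batch_median != base_median
    robust_sigma = 1.4826 * mad
    return abs(batch_median - base_median) / robust_sigma > threshold


# Hypothetical daily metric values (e.g. average order size per load).
baseline = [100, 102, 98, 101, 99, 103, 97]
ok_batch = [101, 99, 100, 102]
bad_batch = [150, 148, 152, 149]

ok_flag = median_drifted(baseline, ok_batch)
bad_flag = median_drifted(baseline, bad_batch)
```

A check like this only covers shifts in location; spotting bimodality would need a look at the shape of the distribution, for example with a histogram.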

dominotw over 4 years ago

I've been in this space for the last 6 years or so and my Scala usage has gone down to zero. Not worth learning Scala.

runT1ME over 4 years ago

I've been approached about various data engineering jobs over the last couple of years and the job descriptions have varied wildly. It has been everything from:

1. SQL/analytics wizard, capable of building out dashboards and quickly finding insights in structured data. Oracle/MSSQL/Postgres etc. Maybe even capable of FE development.

2. Pipeline expert, capable of building out data pipelines for transforming data: Flink, Spark, or Beam on top of Kafka/Kinesis/Pubsub, run from an orchestration engine like Airflow. Even this could span from using mostly pre-built tools, wiring things together with a bit of Python to move data from A to B, to the other extreme of a full-fledged Scala engineer writing complex applications that run on these pipelines.

3. Writing infrastructure software for big data pipelines, customizing Spark/Beam/Flink/Kafka and/or writing custom big data tools when out-of-the-box solutions don't work or scale. Some overlap with 2, but really distinguished by it being a full-fledged software engineer specializing in the big data ecosystem.

So, are all three of these appropriate to call Data Engineer? Is it mainly #1 and people are getting confused? I would certainly fall into #3, so I'm always surprised when people approach me about 'SQL transform' type jobs.

dibujante over 4 years ago

"In order to understand how these systems work I would recommend to know the language in which they are written. The biggest concern with Python is its poor performance hence the knowledge of a more efficient language will be a big plus to your skillset."

What? The Apache stack that's written in Scala recompiles all your code into JVM bytecode, regardless of what language you've written it in. Yes, that includes Scala. Spark isn't actually firing up a Python interpreter and running your Python code on the data.

diehunde over 4 years ago

Nice article. From experience I would say the SQL knowledge should be advanced though. Not intermediate.

zaptheimpaler over 4 years ago

Somewhat outdated view. This may be the current stack, but it's outdated now and is slowly being replaced. The new view is not big data pipelines and ETL jobs; it's lambda architecture, live aggregations/materialized views, and simple SQL queries on large data warehouses that hide the underlying details. The batch model may still apply to ML I guess, but I'm no expert there.

sseppola over 4 years ago

Great resource, thanks for sharing it! I will dig deeper into the resources linked here as there's a lot I have never seen before. The main topics are more or less exactly what I've found to be key in this space in the last 2 months trying to wrap my head around data engineering in my new job.

What I'm still trying to grasp is first how to assess the big data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Secondly, assessing the "pipeline orchestrator" for our use cases, where, like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.

Currently I'm reading Designing Data-Intensive Applications by Kleppmann, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.

freebee16 over 4 years ago

In my experience, teams operating under the "AI Hierarchy of Needs" principles are optimized for generating white papers.

darth_avocado over 4 years ago

We want all these skills, yet, we'll give you a separate title and pay you less than a software engineer. Meanwhile front end software engineers are still software engineers and get high pay.

mywittyname over 4 years ago

For GCP, our stacks tend to be Composer (Airflow), BigQuery, Cloud Functions, and Tensorflow.

There's the occasional Hadoop/Spark platform out there, but clients using those tend to have older platforms.

u678u over 4 years ago

Incidentally, does anyone have resources for SMALL data? E.g. a few MB at a time, but requiring the same ETL, scheduling, and traceability. I'd love some lite versions of big-data tools, but they need to be simple, small and cheap.

master_yoda_1 over 4 years ago

IMHO first you need to become a programmer, then you can become a data engineer. So if you need to start by learning data structures then you are doing something wrong. Also the topics suggested in "Algorithms & Data Structures" could easily be skipped; the information is drastically misleading. We should seriously have some fact checker, otherwise this kind of bullshit article keeps trending on HN and people keep wasting their time on learning LSM trees (what the fuck is that in the first place).

snidane over 4 years ago

In data engineering your goal is "standardization". You can't afford every team using their unique tech stack, their own databases, coding styles, etc. People leave the company all the time and you as a data engineer always end up with their mess, which now becomes your responsibility to maintain. You'd at least be grateful if those people had used the same methods to code stuff as you and your team so that you wouldn't have to become a Bletchley Park decoding expert any time someone leaves. Or you'd hope the tech stack was powerful and flexible enough that people other than engineer types could pick it up and maintain it themselves. They mostly cannot do that, because there is no such powerful system out there. Even when some modern ELT systems get you 80% there, you, the data engineer, are still needed to bridge the gap for the other 20% of the cases.

Data Engineering really comes down to being a set of hacks and workarounds, because there is no data processing system which data analysts, engineers, scientists and anyone else could use in a standardized, systematic way. It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.

There are of course other parts to it, such as managing multiple data products in a systematic way, which engineering minds seem to be best suited for. But the core of data engineering in 2020, I believe, is still implementing hacks and gluing several systems together so as to have a standardized processing system.

Snowflake or Databricks Spark bring you closest to the ideal unified system despite all their shortcomings. But still, you sometimes need to process unstructured JSON, extract stuff from HTML and XML files, or unzip a bunch of zip archives and put them into something that these systems recognize, and only then can you run SQL on it. It is much better than the ETL of the past, where you really had to hack and glue 50% of the system yourself, but it is still nowhere near the ideal system in which you'd simply tell your data analysts: you can do it all yourself, I'm going to show you how. And I won't have to run and maintain a preprocessing job to munge some data into something Spark can recognize for you.

It is not that difficult to imagine a world where such a system exists and data engineering is not even needed. But you can be damn sure that before this happens, this position will be here to stay and will be paying high, when 90% of ML and data science is data engineering and cleaning, and all these companies hired a shitton of data science and ML people who are now trying to justify their salaries by desperately trying to do data engineers' jobs.
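The preprocessing step described above (unzip archives, flatten nested JSON, land it somewhere SQL can reach) can be sketched in a few lines. This is only an illustration under assumptions: the document layout (`user.id`, `events[].action`), the `events` table, and the use of an in-memory SQLite database as a stand-in for a warehouse are all invented for the example.

```python
import io
import json
import sqlite3
import zipfile


def load_zipped_json(archive, conn):
    """Flatten every *.json member of a zip archive into the `events` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (source TEXT, user_id INTEGER, action TEXT)"
    )
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".json"):
                continue  # skip anything that is not a JSON document
            doc = json.loads(zf.read(name))
            # One output row per nested event, tagged with the file it came from.
            rows = [(name, doc["user"]["id"], e["action"]) for e in doc.get("events", [])]
            conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


# Build a tiny in-memory archive so the sketch runs without any files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.json", json.dumps({"user": {"id": 1}, "events": [{"action": "login"}, {"action": "view"}]}))
    zf.writestr("b.json", json.dumps({"user": {"id": 2}, "events": [{"action": "login"}]}))
    zf.writestr("notes.txt", "ignored")
buffer.seek(0)

conn = sqlite3.connect(":memory:")
load_zipped_json(buffer, conn)
counts = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall()
```

Keeping the source file name on every row is a cheap form of traceability: when a count looks wrong, the offending archive member can be found with a plain SQL filter.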

justinzollars over 4 years ago

Amazon introduced Step Functions, which are very nice to dig into and a helpful skill for Data Engineering.

airbreather over 4 years ago

Your data is only as good as your instrumentation, and you usually only get one chance to grab that data but can have many goes at processing it, so I would argue the bit not covered is the most important.

querulous over 4 years ago

i see a lot of "spark is dead" talk here. what replaces it for transforms in between something like kafka and redshift/bigquery?

Nydhal over 4 years ago

Shameless plug for my much simpler (simplistic?) view of things. In this case, I think Data Engineers are the people building systems that solely focus on the data, all the data and nothing but the data.

https://www.linkedin.com/pulse/mapping-data-science-professional-landscape-one-chart-nidhal-selmi/

somurzakov over 4 years ago

advanced proficiency in SQL and in any scripting language of your choice (C#/powershell, python) is enough to be a data engineer on any technical stack: windows/linux, on-prem/cloud, vendor specific/opensource, literally anything.

ectoplasmaboiii over 4 years ago

Is anyone here using kdb+/q for data engineering, specifically outside of finance?

rmelhem over 4 years ago

Where I work, our stack is all about GCP/Airflow/Python/BigQuery ML, for recommender systems. I'm now playing around with Turi Create (Apple) to compare with BQML.

cargoshipit over 4 years ago

I don't recommend it