MLOps is mostly data engineering

169 points by dpbrinkm about 2 years ago

23 comments

Longwelwind about 2 years ago

I've been an MLOps Engineer for around 3 years now and I mostly agree with the article. There is a big overlap between the ML-specific tools that are popping up on the market and traditional Data Engineering tools, and I think people don't always realize that:

* Prometheus/Grafana/TSDB/... can be used to set up a model monitoring platform, since you're observing metrics whether they come from an ML service or a normal service.
* Any service deployment tool can be used to deploy ML models, since they are services.
* Airflow/Dagster/... can be used to orchestrate model training, since training a model is basically a data engineering task.

With that said, I still believe that there is space for ML-specific tools to be created.

* Model monitoring tools (ArizeAI is the only one I've used) can be tailored to be easily usable by ML Engineers without requiring DE knowledge.
* Deploying models in production has some specificities: things like GPU support, adaptive batching, ... Those specificities can be implemented inside a model deployment tool.
* Training orchestration is the only domain where I think there's truly no need for new tools.
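
To make the Prometheus point above concrete: a model's metrics can be exposed like any other service's metrics. The following is only a minimal sketch using the Python standard library. It renders a few model-level gauges in the Prometheus text exposition format. The metric names and the `model` label are hypothetical, and a real service would more likely rely on an official client library than on hand-built strings.

```python
from statistics import mean


def render_model_metrics(model_name, latencies_s, predictions):
    """Render model-level gauges in the Prometheus text exposition format.

    Metric names and the `model` label are made up for illustration.
    """
    gauges = {
        "model_prediction_latency_seconds_avg": mean(latencies_s),
        "model_prediction_mean": mean(predictions),
        "model_predictions_observed": float(len(predictions)),
    }
    lines = []
    for name, value in gauges.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{model="{model_name}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical sample values; a scraper would read this text over HTTP.
print(render_model_metrics("churn_v1", [0.01, 0.03], [0.2, 0.4, 0.9]))
```

Nothing here is ML-specific apart from what is being measured, which is the commenter's point: the same scraping, storage, and dashboards apply.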

o10449366 about 2 years ago

Sure, MLOps is just Data Engineering in disguise when you ignore the complexities of hardware provisioning, GPU optimization, integration tests for model performance and quality, benchmarking, resource constraints (network, disk, memory, GPU memory), etc.

Anecdotally, I've worked in high-performance computing and machine learning for years now and the past few months I've seen a huge spike in the number of messages I get for MLOps positions. I think companies are slowly starting to realize that setting up machine learning at scale isn't as simple as deploying poorly written code by research scientists to managed platforms.

cpard about 2 years ago

Hey folks, I'm the author of the post and happy to see that it gets so much attention on HN. Thank you for the incredible comments!

I want to clarify something about my intention with this post. There is a reason I chose "mostly" in the title. I'm not dismissing the different needs of ML.

If a category withstands the tests of the market, then there's good reason for it to exist.

But we have ended up creating silos within orgs with fundamentally aligned goals because of the way we build products and companies around them.

What I'm advocating for in this article is the need to think more holistically when we design and build data infra tooling. Yes, ML has unique challenges, but these challenges won't be addressed by reinventing everything again and again.

Tooling should be built with all the practitioners involved in the lifecycle of data in mind.

It's harder to do, but at least we'll stop wasting our time building one Airflow copy after another that is doomed to fail.

jamesblonde about 2 years ago

IMO, this article misses the essence and principles of MLOps. The essence of MLOps is that it is about processes (and tooling/platforms) for creating ML assets: features/labels and models. We call them FTI pipelines. Data engineering has data pipelines that produce datasets for consumption.

In MLOps, feature pipelines produce features (from raw data). Training pipelines produce models (from features/labels). Inference pipelines produce predictions (from models + features). There is no such thing as an "ML Pipeline" in a production ML system. There is no ML pipeline that goes from raw data to predictions. We have the above FTI pipelines (feature/training/inference pipelines).

The principles of MLOps are around being able to develop faster (shorten the development lifecycle) through automated testing and versioning. You need to validate data to build features. You need tested features to build models. You need to test models for your ML systems. It's a hierarchy: data -> features -> models -> ML apps. Versioning is needed for features and models in order to safely upgrade systems and enable them to evolve over time.

I cover a lot of this in a course I developed called 'serverless ml'.
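
The feature/training/inference split described above can be sketched as three independent functions that only communicate through their outputs. This is a toy illustration, not the commenter's course material: the column names (`amount`, `churned`), the threshold "model", and the version field are all assumptions made up for the example.

```python
from statistics import mean


def feature_pipeline(raw_rows):
    """Raw data -> validated features and labels."""
    features = []
    for row in raw_rows:
        if row.get("amount") is None:  # data validation: drop incomplete rows
            continue
        features.append({"amount": float(row["amount"]), "label": int(row["churned"])})
    return features


def training_pipeline(features):
    """Features/labels -> a versioned toy model (midpoint between class means)."""
    positives = [f["amount"] for f in features if f["label"] == 1]
    negatives = [f["amount"] for f in features if f["label"] == 0]
    threshold = (mean(positives) + mean(negatives)) / 2
    return {"version": 1, "threshold": threshold}


def inference_pipeline(model, features):
    """Model + features -> predictions (toy rule: low amount predicts churn)."""
    return [int(f["amount"] < model["threshold"]) for f in features]


raw = [
    {"amount": "10", "churned": 1},
    {"amount": "20", "churned": 1},
    {"amount": "80", "churned": 0},
    {"amount": "100", "churned": 0},
    {"amount": None, "churned": 0},  # rejected by the feature pipeline
]
features = feature_pipeline(raw)
model = training_pipeline(features)
predictions = inference_pipeline(model, features)
```

Because each stage has its own input and output, each can be tested and versioned on its own schedule, which is the hierarchy the comment describes.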

ritzaco about 2 years ago

I agree there's a lot of "Agile"-like BS around MLOps, but this article doesn't really give the prior art enough attention IMO. Data Engineering is a large part of MLOps, but there are unique parts of production ML engineering, so it makes sense that it is (slowly) evolving into its own discipline.

There are some people who have expertise in building production infrastructure, writing production code, and managing production data, but they are few and far between. So finding a "system" that lets data experts, code experts, and infrastructure experts work together is important.

Many people say it's "just software engineering" or "just DevOps", but I feel like they are either not respectful enough of the challenges of whichever pillar they are ignoring, or they don't even know that those challenges exist.

Filtering out the BS and finding the smart people who are writing interesting things about MLOps is difficult, as they use the same terminology (and if the smart people switched, the BS people would follow, so they may as well stand their ground), but the BS cover doesn't mean that there's nothing substantial underneath.

mountainriver about 2 years ago

This article feels like it’s written by someone who doesn’t understand the problems faced by ML. These people come through the MLOps community every so often. They think no one has realized that DevOps and DE are similar, when in reality they just don’t yet realize how different ML is.

For one, the customer is entirely different. You are mostly serving data scientists who don’t have strong engineering skills, which dramatically skews the solutions toward things like Python and Jupyter.

This is a big reason why the tool space is different and has been successful at what it does.

Model training and serving are absolutely nothing like traditional methods. In serving, you are deploying a stateful model, not a stateless backend. That model’s state should ideally be continuously trained, and it requires different scaling and monitoring capabilities.

In training, the GPU problem is far from solved, and it is unique to ML, with things like how you shard models and fit their weights into memory.

There are extremely challenging problems in this space that simply aren’t the same as DevOps, and this is coming from a former k8s contributor.
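
As a rough illustration of the "fit their weights into memory" point above, here is a back-of-the-envelope calculation. It is only a sketch under strong assumptions: it counts weights alone (no activations, gradients, or optimizer state), assumes equal-sized shards, and the parameter count and GPU size are hypothetical numbers chosen for the example.

```python
import math


def weight_memory_gib(n_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GiB (2 bytes/param ~ fp16/bf16)."""
    return n_params * bytes_per_param / 2**30


def min_shards(n_params, gpu_memory_gib, bytes_per_param=2):
    """Smallest number of equal shards so each shard's weights fit on one GPU.

    Ignores activations, gradients, and optimizer state, so real requirements
    are higher than this lower bound.
    """
    return math.ceil(weight_memory_gib(n_params, bytes_per_param) / gpu_memory_gib)


# Hypothetical 7-billion-parameter model on hypothetical 8 GiB devices.
print(round(weight_memory_gib(7_000_000_000), 2), "GiB in 16-bit precision")
print(min_shards(7_000_000_000, gpu_memory_gib=8), "shards in 16-bit precision")
print(min_shards(7_000_000_000, gpu_memory_gib=8, bytes_per_param=4), "shards in 32-bit precision")
```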

boredumb about 2 years ago

Learning some PyTorch is truly not the bulk of the work of building a model. Having to wrangle and mangle a massive amount of data coming from less-than-ideal data sources, orchestrating the jobs, and making all of this available for your training routines to slurp up is a lot of work that evolves fairly quickly beneath you.

danthelion about 2 years ago

Yes, and Data Engineering is just Software Engineering "in disguise"

noobcoder about 2 years ago

Before entering the field of ML, I perceived MLOps as a superhero with abilities to handle and deploy ML models. However, it seems that MLOps is more or less a typical engineer who acquired skills to manage and deploy data infrastructure for ML purposes (exclusively) by exposure to data engineering.

PaulHoule about 2 years ago

(1) The killer product encompasses “all of the above”. If you really are going to buy five of them, God help you: between all the vendor mistakes you’ll have to work around and the overhead of moving data around, you’re in for it.

(2) A major difference w/ the conventional software development CI/CD pipelines is the sheer size of the data involved. When you are dealing with “tiny data” you can waste resources on Docker, but when your foundation model is 100x the size, when the training process is distributed, and takes a day, the quantity is taking on a quality of its own.

(3) The worst performance sin is moving data around, although this will be necessary insofar as the system is distributed. Avoiding excess data movement can be the difference between training a model and failing to train a model, but when you put together a patchwork of ML ops programs you will find they are moving data around internally, for good reasons sometimes and for no reason other times. Plus, the easy (and sometimes only) integration method is moving data around. Don’t be that guy!

chatmasta about 2 years ago

If a buzzword is a portmanteau of a previous buzzword (DevOps), combined with a newly hot buzzword (ML), then chances are it's something in disguise.

But that doesn't make it any less legitimate. DevOps came from Dev + SysOps, but nobody is arguing DevOps shouldn't be a thing (although you might argue it's no different from SysOps).

In general, buzzwords align pretty closely to VC funding cycles.

bobbruno about 2 years ago

On a first read, I could agree with most, if not all, of the author’s arguments. But there are two aspects that were simply left out that I have to consider when doing MLOps, and they make it too complex to just say “It’s an extension of tooling that data engineering already has”.

First, there’s the matter that ML introduces a significant stack and complexity into what was already a relatively complex framework. I mean, managing storage, quality, data processing, streaming, scheduling/orchestration, transformation logic and SLAs requires a lot of tools, whichever combination pleases you the most. Even full platforms offered by some of the players in this market can get quite complex, and it’s very hard to set everything right and handle all the cases. Specialised tooling or skills are probably a good idea, to focus on the things that matter to ML and that go beyond what DE already covers. Think of all the frameworks, the statistical libraries, the different nature of the logic for ML features when compared to regular reporting needs, the different quality requirements and structure that ML expects, managing versions of raw, labelled and test datasets, etc. (there are many more, the discussion already covered quite a few).

Which brings me to the second thing: the knowledge stack required to run ML. Besides some of the usual DE stack (developing, data manipulation, quality, etc.), you need a whole new set of skills related to several branches of math, parallelism, very complex and costly infrastructure management, research skills, experiment design, and specific algorithms and approaches (does the regular DE need to understand neural network patterns, how data- and model-parallel training works, the statistics behind setting up and running drift measures, what all the metrics behind model performance mean, etc.?). I find this a better reason for specialisation than any other: there’s only so much one person can hold in their head, and ML development and operations are just getting more complex by the day.

So, my point is, from a very simplified and abstract perspective, the author seems right. But, in practice, you won’t be able to just stack that on top of data engineers and not expect them to become specialized, and that’s where the ML Engineer and MLOps engineer roles are emerging. They’re not completely new, but they’re no longer your regular data engineer or data scientist.
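
To give a flavour of "the statistics behind setting up and running drift measures" mentioned above, here is a minimal sketch of one common drift measure, the Population Stability Index (PSI), which sums `(actual - expected) * ln(actual / expected)` over binned proportions. The equal-width binning, the bin count, and the `eps` floor for empty bins are assumptions of this sketch, not a recommendation from the comment.

```python
import math


def psi(expected, actual, n_bins=4, eps=1e-6):
    """Population Stability Index between a baseline sample and a new sample.

    Bins are equal-width over the baseline's range; values outside that range
    are clamped into the first or last bin. `eps` avoids log(0) for empty bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values at training time vs. in production.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(baseline, baseline), psi(baseline, shifted))
```

Choosing the bins, the threshold for alerting, and what to do when the index moves is exactly the kind of statistical judgement the comment argues a regular data engineer may not have.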

kfk about 2 years ago

We are experimenting with workers using the simple Python arq library and Redis, and I have yet to find an MLOps or Data Engineering use case that is not a good fit for an API+Worker on K8s. For instance, you need to manage ML artifacts? You can just offer an API endpoint so the ML models can automatically update the artifacts. You need data ingestion? You can have a worker running ingestion scripts and kick off the worker via the API. We tried pub/sub and Kafka, but it can be really wasteful: workers can process work for multiple streams, but Kafka cannot. But of course I wonder if I am missing something. I am not an ML engineer, so probably I am?
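
The API+Worker shape described above can be sketched with nothing but the standard library. To be clear about the assumptions: this does not use arq or Redis. An in-process `queue.Queue` stands in for the broker, `submit` stands in for what an API endpoint handler would do, and the `ingest` task and its names are hypothetical.

```python
import itertools
import queue
import threading

jobs = queue.Queue()        # stand-in for the Redis-backed job queue
results = {}                # stand-in for a result store
_job_ids = itertools.count(1)


def ingest(source):
    """Hypothetical ingestion task; a real one would pull data from `source`."""
    return f"ingested:{source}"


TASKS = {"ingest": ingest}  # one worker can serve several kinds of work


def submit(task_name, *args):
    """What an API endpoint handler would do: enqueue a job and return its id."""
    job_id = next(_job_ids)
    jobs.put((job_id, task_name, args))
    return job_id


def worker():
    """Pull jobs until a None sentinel arrives, storing each result by job id."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        job_id, task_name, args = job
        results[job_id] = TASKS[task_name](*args)
        jobs.task_done()


def run_demo():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    ids = [submit("ingest", "orders"), submit("ingest", "clicks")]
    jobs.join()             # wait until both jobs have been processed
    jobs.put(None)          # ask the worker to stop
    thread.join()
    return [results[i] for i in ids]


print(run_demo())
```

The `TASKS` registry is what lets a single worker handle work from several sources, which is the flexibility the comment contrasts with its Kafka experience.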

mmierz about 2 years ago

I'm currently working in an MLOps engineer role at a mid-sized company. I agree with the article that most of what I do is plain old software engineering. I don't think I'm interchangeable with any other backend dev though, because ML expertise really does come in handy here. I think the thing that makes it a bit specialized is that we are providing tools to allow our data scientists to self-serve model deployment and monitoring, but they are by and large not expert web programmers. So we need to anticipate the kind of mistakes they're likely to make and provide opinionated tools that guide them into building sane software in the specific context of our company's technology, as well as direct support as needed.

We evaluated several commercial MLOps tools and ended up going with generic tools that we already use, instead of something new that's branded for MLOps: Postgres + Snowflake instead of a commercial feature store; model deployment, monitoring, and alerting on the same platform as the rest of the company's applications; etc. When we tried "ML" tools, they took so much work to adapt to our use cases that they really added no value.

antipaul about 2 years ago

The overlap in "deployment" between MLOps and software engineering was hinted at in that well-known 2015 NeurIPS paper, "Hidden Technical Debt in Machine Learning Systems":

"It may be surprising to the academic community to know that only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction – see Figure 1"

https://papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

PDF: https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

jstx1 about 2 years ago

This is a strange article. The body of the article correctly talks about all the model work... but data engineers typically don't have to do any of that work.

So it's an okay overview of some ML engineering / ops things with a contradictory title which isn't followed up on (and which I'm sure gets more clicks).

So no, MLOps isn't just data engineering. For more information read your own article.

stuartaxelowen about 2 years ago

Feature stores are essentially materialized views (aside from any realtime feature resolution needed). I think it's a good thing that there is specialized effort being taken here, though: feature stores are an abstraction that could be useful in other domains also, and this surge in interest is an opportunity for us to make better tools.
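
The "materialized view" framing above can be sketched with the standard library's `sqlite3` module. SQLite has no native materialized views, so this sketch emulates one by rebuilding a table from a query. The `orders` table, the feature names, and the full-refresh strategy are assumptions for illustration only, and realtime feature resolution is out of scope.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 30.0), (2, 5.0);
""")


def refresh_user_features(conn):
    """Recompute the feature table from raw data, like refreshing a materialized view."""
    conn.executescript("""
        DROP TABLE IF EXISTS user_features;
        CREATE TABLE user_features AS
        SELECT user_id,
               COUNT(*)    AS order_count,
               AVG(amount) AS avg_amount
        FROM orders
        GROUP BY user_id;
    """)


def get_features(conn, user_id):
    """Look up precomputed features for one entity; returns None if unknown."""
    return conn.execute(
        "SELECT order_count, avg_amount FROM user_features WHERE user_id = ?",
        (user_id,),
    ).fetchone()


refresh_user_features(conn)
print(get_features(conn, 1))
```

Serving reads only the precomputed table, and new raw data becomes visible after the next refresh, which is the same trade-off a materialized view makes.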

nonethewiser about 2 years ago

Consider that there are different types of developers. This remains true in the context of “Devops”. Devops doesn’t mean web dev ops or something. Given that, how is MLops not a certain type of devops? It’s basically ML engineers figuring out how to deploy their systems to production, no?

achileas about 2 years ago

alwayshasbeen.jpg

I was doing this work before MLOps was the new buzzword in town, and it was always under the data engineering job title. It was only in the past few years that data engineering has become more focused, requiring new titles/job descriptions to truly cover the different specializations.

Kalanos about 2 years ago

https://docs.aiqc.io systematically orchestrates the data preprocessing and post-processing of the training loop for multi-dimensional data and various types of analysis

jldugger about 2 years ago

Side note: was this a tweetstorm originally? Literally every paragraph is a single sentence, often a long one, with no reader affordances like bolding key points.

hgsgm about 2 years ago

Is ML just statistics (data analysis) in disguise?

steveBK123 about 2 years ago

Data Engineering with a top hat & bow tie, really.