Thoughts on ML Engineering After a Year of My PhD

119 points by vavooom almost 3 years ago

17 comments

Fiahil almost 3 years ago
Nice article, however, I disagree on a couple of things:

- "Platform MLE" is just regular DevOps and software engineering. It is not fundamentally different from what DevOps used to be just because we're dealing with models and their accuracies. We don't need to make a "special" title out of everything in data science.

- I still like to introduce "MLOps" into the conversation, thus making it special, and infringing the rule above. Oops.

- Overfitting slightly on recent data, without accounting for gaps in modeling, is likely to lead to dire situations. See also: March 2020, when all forecasts went bananas. And no, retraining at that time did not improve the situation. That's what the MLE was talking about here: > "I know it's not really addressing the data drift problem"; they were right.

- Everything about data and model drift is just the tip of the iceberg: what happens when your model ends up in production? It will start affecting the behaviour of the very thing you're trying to predict. A prime example of this in the retail markdown case: did you sell more of that item because the product was rated better (as qualified by the marketer to compensate for lack of control over markdowns) and the in-store stock was higher, or did your markdown on that item have the effect you were looking for, and the rating was actually okay? Did sales go down this markdown season because your model had a perfect but unattractive markdown strategy last season? This is already very difficult to model properly, let alone trying to measure its contribution to the drift. Good luck with that.
ramraj07 almost 3 years ago
Definitely a lot of interesting insights, but I fail to see what one year of a PhD has to do with it. If anything, this undermines the article! We used to tell incoming grad students never to ask or believe anything first-year PhDs tell them. They're either trying to paint too good or too bad a picture, depending on how they want to convince themselves of their big decision a year in.
sfifs almost 3 years ago
> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain.

> It puzzles me when people say that small companies can’t retrain every day because they don’t have FAANG-style budgets. It costs a few dollars, at best, to retrain many xgboost or scikit-learn models.

If retraining every day actually gives you a significant material business benefit (excepting cases where you specifically want to heavily over-weight recent data, phone keyboard autocomplete predictions being one example), you likely don't have a model that is actually picking signal from noise or generalizing on the data. This is memorization, not learning. This is essentially how you get expensive ML disasters.
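A minimal sketch of the kind of rolling-origin check this argument implies; the DataFrame layout, column names, and the gradient-boosted model here are assumptions for illustration, not taken from the comment. Each model is trained on data up to one day and scored only on the following, unseen day, so memorization shows up as a gap between in-sample and next-day scores.

```python
# Hypothetical sketch: rolling-origin ("walk-forward") evaluation.
# A model that merely memorizes recent data looks great in-sample but
# degrades on the next day's unseen data.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def walk_forward_auc(df, feature_cols, label_col="label", date_col="date"):
    days = sorted(df[date_col].unique())
    rows = []
    for train_end, test_day in zip(days[:-1], days[1:]):
        train = df[df[date_col] <= train_end]
        test = df[df[date_col] == test_day]
        model = GradientBoostingClassifier().fit(train[feature_cols], train[label_col])
        rows.append({
            "test_day": test_day,
            "train_auc": roc_auc_score(train[label_col],
                                       model.predict_proba(train[feature_cols])[:, 1]),
            "next_day_auc": roc_auc_score(test[label_col],
                                          model.predict_proba(test[feature_cols])[:, 1]),
        })
    # A persistently widening train/next-day gap suggests memorization.
    return pd.DataFrame(rows)
```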
zomglings almost 3 years ago
This article overstates the distinction between Task MLE and Platform MLE.

From my experience managing data science teams (small teams, max 20 people), I always preferred for my Task MLE teammates to also do the Platform MLE work. I would not hire someone just to be a Platform MLE, because they would be too distant from the day-to-day Task MLE needs.

I like the way that Google SREs think about this: there's toil (the Task MLE parts of the job) and then there's automation (the Platform MLE parts of the job). Every programmer on the team should have toil and should be given enough time and freedom to address their most painful toil through automation.

Distinguishing between Task MLEs and Platform MLEs so strictly is dangerous for anyone who applies this dichotomy in practice.

I guess the article author never explicitly stated that they have to be different people, but I got the sense from reading the article that this was an unstated assumption on their part.
hwers almost 3 years ago
> It puzzles me when people say that small companies can’t retrain every day because they don’t have FAANG-style budgets. It costs a few dollars, at best, to retrain many xgboost or scikit-learn models. Most models are not large language models

So true. This is a pretty easy flag that someone hasn't actually been down and dirty trying to train any actual models and doesn't know what they're talking about.
evrydayhustling almost 3 years ago
Great insight here that a lot of performance drift is down to data and engineering, not "natural" non-stationarity. The corollary is that one of the hidden strengths of ML systems is the ability to adapt to imperfect inputs, and even to changing inputs with retraining.

Re: monitoring: when I was doing automated trading, we of course had automated alerts to stop systems that were outside expected parameters. However, as OP describes, there was also a long tail of degradations for which no flagging rule had a great precision/recall tradeoff on "real" issues. Instead, we put effort into "calibration reports" that visually surfaced as much information about training and performance as our eyes could handle. These included things like over-time plots and PCA plots of recent feature histories, where the point was more to spot patterns than to track an explicit metric. Reviewing these for 15 minutes each day with our own eyes was much more effective at detecting a long tail of unanticipated degradations than we would have been at anticipating and coding for each one explicitly, and it helped build intuition that fueled new research ideas.
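A minimal sketch of what one panel of such a calibration report might look like, assuming a tabular feature matrix; the scaling step, two-component PCA, and window choice are illustrative assumptions, not what the commenter's team actually built.

```python
# Project a reference window and the most recent window of features onto the
# same PCA axes, so a reviewer can eyeball shifts that no single metric flags.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_drift_panel(reference_features, recent_features):
    # Fit scaling and PCA on the reference window only, then project both.
    scaler = StandardScaler().fit(reference_features)
    pca = PCA(n_components=2).fit(scaler.transform(reference_features))
    ref_2d = pca.transform(scaler.transform(reference_features))
    rec_2d = pca.transform(scaler.transform(recent_features))
    plt.scatter(ref_2d[:, 0], ref_2d[:, 1], s=5, alpha=0.3, label="reference window")
    plt.scatter(rec_2d[:, 0], rec_2d[:, 1], s=5, alpha=0.6, label="recent window")
    plt.legend()
    plt.title("Feature history, PCA view")
    plt.show()
```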
unlikelymordant almost 3 years ago
I think 'becoming a historian of the craft' is very much what a first-year PhD student is supposed to do, and a large part of what doing a literature review is all about. It helps crystallise what exactly your research questions should be, and what experiments you should do to answer them. I think this article will be useful for the author to see how their thinking has evolved when they read it back in three or four years, because a lot of changes happen to a person's thinking over the course of a PhD.
cosmic_quanta almost 3 years ago
Oh no:

> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.

We all know that overfitting is bad (e.g. in time-series forecasting, the past isn't always representative of the future). Depending on your domain, more recent data may be more valuable than older data, sure. The solution is not to overfit to recent data!

In my experience, it is to design features which take recency into account. For example, for a particular quantity we wanted to forecast, we found that using ~7 days' worth of data was better than using multiple months, because the data was non-stationary (the mean of the quantity was changing over time). What we did was combine features with an exponential decay, with an appropriate decay constant, to great results.
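One way to express that idea in code, as a sketch rather than the commenter's actual implementation; the half-life values, the `date` column, and the one-row-per-day layout are assumptions.

```python
# Encode recency explicitly with exponentially weighted means instead of
# overfitting to the last few days; the half-life plays the role of the
# "decay constant" mentioned above.
import pandas as pd

def add_recency_features(df, col, halflives=(7, 30)):
    df = df.sort_values("date").copy()  # assumes one row per day, oldest first
    for hl in halflives:
        # Half-life is measured in observations, i.e. days under this layout.
        df[f"{col}_ewm_{hl}d"] = df[col].ewm(halflife=hl).mean()
    return df
```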
stfwn almost 3 years ago
Slight tangent: does anyone know a good source of useful blog posts on ML in the wild? Many that I come across are a funnel into some product, too short, too theoretical, aimed at beginners, or all of the above. This post strikes a nice balance, simply sharing some experiences and opinions, like the blog posts you see on how to do software engineering well.
seanc almost 3 years ago
"Suppose every organization is able to clearly define their data and model quality SLOs."

I laughed out loud. The rest of the essay is so on point that I'm giving the author full marks for a classic example of the dark, laconic understatement that builds so many bridges among technology professionals.
usgroup almost 3 years ago
I think that MLOps somewhat subdues the modelling and research effort in commercial settings. At least anecdotally, I only seem to see ML pipelines paired with the "feature engineering" + fit(X, y) approach to data "science". What I don't see is first-principles research leading to insight, leading to models that exploit that insight, which often requires heavily augmenting how and what data is collected: that part almost never fits neatly anywhere. In its place, DS often starts with the data and then looks for models to fit it. MLOps reinforces that pattern, and we end up with the old joke:

A man is searching for his keys under a street lamp in the dark. A policeman asks, "Is this where you dropped your keys?" The man says, "No, but this is where the light is."
bernf almost 3 years ago
In practice, the Platform MLE is nonexistent in most places, except perhaps OpenAI and maybe Google. The key thing not addressed here is that in software, the right levels of abstraction open the door to a plurality of downstream tasks. That level of abstraction has not really been found in ML (yet). Every time you start building a larger system, researchers then have to onboard onto that system, and I find they really don't like how much time it wastes.
wodenokoto almost 3 years ago
> I should have taken the hyperparameters that yielded the best model for the latest evaluation set.

> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.

I don’t understand why that is.
Kalanos almost 3 years ago
https://docs.aiqc.io solves a lot of the problems you mention. Your contributions would be welcome.
thrill almost 3 years ago
A lot of good insights and musings in this article.
spywaregorilla almost 3 years ago
I do a lot of data science work for various business problems. No deep-learning image whatevers, just understanding how processes work and influencing decisions. I can't relate to a lot of what this person is saying.

> Sometimes, I was so scientifically sound that the business lost money. I automated a hyperparameter tuning procedure that split training and validation sets into many folds based on time and picked hyperparameters that averaged best performance across all the sets. I only realized how silly this was in hindsight. I should have taken the hyperparameters that yielded the best model for the latest evaluation set.

What's silly here is thinking that the minor adjustment of hyperparameters from set to set is likely to make a difference. This might hold for some niche deep learning problems, but it sounds later on like she isn't doing that. I rarely see an optuna parameter optimization affect the AUC of a model by more than 0.02 versus arbitrary choices. Most business problems tend to be pretty simple. I just can't imagine this parameter choice makes a difference.

> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.

Nonsense. Dangerous nonsense.

> It puzzles me when people say that small companies can’t retrain every day because they don’t have FAANG-style budgets. It costs a few dollars, at best, to retrain many xgboost or scikit-learn models. Most models are not large language models. I learned this circuitously, while doing research in ML monitoring.

This is accurate. I find it weird that people feel the need to retrain models so frequently, though. Context matters, I guess. But then they talk about data drift...

> Anecdotes like this really get me in a tizzy. I think it’s because what I thought were important and interesting problems are now, sadly, only interesting. Researchers think distribution shift is very important, but model performance problems that stem from natural distribution shift suddenly vanish with retraining.

It sounds like this person is just retraining their models at high frequency until a hyperparameter search randomly produces something that looks good on an evaluation set. I don't see any evidence of them ACTUALLY EVALUATING THEIR PERFORMANCE. The missing piece in all of this is taking the predictions made by this process and seeing, in retrospect, how they compared to the unseen future outcomes. If you think you're fine because you make a new model every day that works on yesterday's data, which they openly imply is overfit to that period, you might just be making a new mistake every day. A good AUC or R2 on your evaluation set doesn't guarantee that it'll be good on new data.

> I had a hard time finding the gold nugget in data drift, the practical problem. Now it seems obvious: data and engineering issues—instances of sudden drifts—trigger model performance drops. Maybe yesterday’s job to generate content-related features failed, so we were stuck with old content-related features. Maybe a source of data was corrupted, so a bunch of features became null-valued.

What? This seems pretty crazy to me. You should have alarm bells built in that freak out if important features are turning up null or corrupted. This person seems to have no idea if their model is working! They just know it worked for yesterday. They allude to doing this later on, but even then... broken data pipelines are not what people typically mean by data drift.

I see no discussion of real performance evaluation, understanding how the models work, or how to make real use of the model outputs.
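A minimal sketch of the kind of alarm bell described above; the threshold, window choice, and DataFrame layout are illustrative assumptions, not a prescribed implementation.

```python
# Before scoring or retraining, compare today's feature null rates against a
# trailing reference window and fail loudly on large jumps.
import pandas as pd

def check_null_rates(today: pd.DataFrame, reference: pd.DataFrame, max_jump: float = 0.05):
    today_nulls = today.isna().mean()       # per-feature null rate today
    ref_nulls = reference.isna().mean()      # per-feature null rate in the reference window
    jumped = (today_nulls - ref_nulls) > max_jump
    if jumped.any():
        bad = list(today_nulls.index[jumped])
        raise ValueError(f"Null-rate spike in features: {bad}")
```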
metropolisdelaq almost 3 years ago
Interesting post, with many insightful comments, but regarding data validation I think this fella needs a bit of strongly typed language training and some functional programming skills. A lot of the problems in the current ML ecosystem come from the fact that many people use Python, simple as that.