On moving from statistics to machine learning, the final stage of grief (2019)

175 points by yoloswagins almost 5 years ago

21 comments

laichzeit0 almost 5 years ago
There's this undertone of "I should be paid as much/more than Data Science people because I'm better than them at statistics, and data science = machine learning = statistics".

My experience doing data science at small companies that can't afford to hire more than one person for the role is that it is so much more than just building models or doing statistics.

You have to:

1. Build APIs and work with developers to get predictive models integrated into the rest of the software stack.

2. Know how to add logging, auditing, monitoring, containerizing, web scrapers, cleaning data (!!), SQL scripts, dashboards, BI tools, etc.

3. Do some basic descriptive stats, some basic inferential stats, some predictive modeling, work on time-series data, sometimes apply survival analysis, etc. (Python/R/Excel, who cares).

4. Set up data pipelines and CI/CD to automate all this crap.

5. Try to unpack vague high-level requirements along the lines of "Hey, do you think we could use our data to build an 'AI' to do this instead of doing it manually?" and then come up with a combination of software / statistical models that performs at least as well as or better than humans at the task.

6. Work with non-technical business users and be able to translate this back into technical requirements.

Hey, if all you do all day is "build models", then that sounds like a very cushy DS job you have. It's definitely not been my experience. I would describe it more like a combination of software engineering, statistics, and business analysis. That's why it pays more than plain statistics. But this is just my experience..
natalyarostova almost 5 years ago
> Machine learning is genuinely over-hyped. It’s very often statistics masquerading as something more grandiose. Its most ardent supporters are incredibly annoying, easily hated techbros

This sort of fashionable disparagement of a group of people to signal that you're not part of the "bad group of tech bros" is so trashy. Why are these random people you easily hate? Who are they? Why take glee in shared hatred?

I've worked as a senior DS at FAANG for 4 years. I've recently worked through Casella-Berger, because I wasn't comfortable being one of those DS who didn't know math stats. But before I worked through it, I worked with people from PhD stats programs who were so ineffective. Despite knowing so much more stats than me, they would freeze up and fail every time they had to deal with any sort of software system or IDE. It was so weird to me that my ability to use a regression, even before I knew the theory, was more valuable than their ability to use a regression to its full power, simply because I could fight the intense battle to take that idea and put it into reliable production code.

But generally I hate *hate* this war between DS and stats. It's so stupid. Maybe not in their first year, but eventually any DS who wants to be a master of their craft ought to learn math/theoretical stats. And some don't want to be a master of their craft, and instead want to go into management or whatever, and that's fine.
CrazyStat almost 5 years ago
&gt; I’m sure you’re asking: “why allow your parameters to be biased?” Good question. The most straightforward answer is that there is a bias-variance trade-off. The Wikipedia article does a good job both illustrating and explaining it. For β-hat purposes, the notion of allowing any bias is crazy. For y-hat purposes, adding a little bias in exchange for a huge reduction in variance can improve the predictive power of your model.<p>I&#x27;m going to push back on this.<p>The author seems to understand the bias-variance tradeoff as applying primarily to y-hat, and allows that if you are primarily interested in y-hat then it can make sense to make that tradeoff (introduce bias in exchange for lower variance). But the bias-variance tradeoff is more general than that. There&#x27;s also a bias-variance tradeoff in beta-hat, and you can make a similar decision there to introduce some bias in beta-hat in exchange for lower variance, lowering the overall mean square error.<p>There&#x27;s nothing crazy about this. The entire field[1] of Bayesian statistics does this every day--Bayesian priors introduce bias in the parameters, with the benefit of decreasing variance. Bayesians use these biased parameter estimates without any problems.<p>Classical (non-Bayesian) statistics has tended to focus heavily on unbiased models. I suspect this is largely because restricting the class of models you&#x27;re looking at to unbiased models allows you to prove a lot of interesting results. For example, if you restrict yourself to linear unbiased models, you can identify one single `best` (i.e. lowest variance) estimator. As soon as you allow bias you can&#x27;t do that anymore.<p>[1] Except empirical Bayes, which is a dark art.
noelwelsh almost 5 years ago
Just a note that you can interpret regularization as placing a prior on the weights. L2 regularization is a Gaussian prior, and L1 is a Laplacian prior. I.e. this is doing Bayesian statistics rather than an arbitrary hack to improve predictions.

Elements of Statistical Learning is firmly in the frequentist world from what I recall, so this might not be discussed in that book.
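To make the correspondence concrete, here is a small numerical sketch (assuming numpy; the data is synthetic): the ridge solution with penalty lambda = sigma^2 / tau^2 is exactly the posterior mode under a N(0, tau^2 I) Gaussian prior on the weights.

    # Sketch: L2-penalized least squares == MAP estimate under a Gaussian prior.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p, sigma, tau = 50, 3, 1.0, 0.5
    X = rng.normal(size=(n, p))
    y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=sigma, size=n)

    lam = sigma**2 / tau**2                      # equivalent L2 penalty strength

    # Ridge: argmin ||y - Xb||^2 + lam * ||b||^2  (closed form)
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # MAP under y | b ~ N(Xb, sigma^2 I) and b ~ N(0, tau^2 I): posterior mode
    beta_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(p) / tau**2, X.T @ y / sigma**2)

    print(np.allclose(beta_ridge, beta_map))     # True: same estimate, two interpretations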
d_burfoot almost 5 years ago
The author makes it sound like statistics is this grand, beautiful mathematical edifice and ML is just a bunch of number crunching with computers. That contrast is just unfair; a huge portion of stats is just made up of hacks and cookbook recipes. Statistics has probably done more damage to the world than any other discipline, by giving a sheen of respectability to fake science in fields like nutrition, psychology, economics, and medicine.

I'm particularly annoyed by the implication that statisticians have a better understanding of the issue of overfitting ("why having p >> n means you can't do linear regression"). Vast segments of the scientific literature fall victim to a mistake that's fundamentally equivalent to overfitting, and the statisticians either didn't understand the mistake, or liked their cushy jobs too much to yell loudly about the problem. This is why we have fields where half of the published research findings are wrong.
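The specific point in quotes is easy to demonstrate; a minimal sketch (assuming numpy and scikit-learn, with purely synthetic noise, so there is genuinely nothing to learn) of what p >> n does to ordinary least squares:

    # Sketch: with far more features than observations, OLS fits the training data
    # perfectly even though the target is pure noise, and predicts nothing out of sample.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    n, p = 20, 200                               # p >> n
    X_train, y_train = rng.normal(size=(n, p)), rng.normal(size=n)
    X_test, y_test = rng.normal(size=(n, p)), rng.normal(size=n)

    model = LinearRegression().fit(X_train, y_train)
    print("train R^2:", model.score(X_train, y_train))   # ~1.0: perfect in-sample fit to noise
    print("test  R^2:", model.score(X_test, y_test))     # typically negative: worse than predicting the mean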
Konohamaru almost 5 years ago
> Traditionally, it’s a cardinal sin in academia to use parameters like these because you can’t say anything interesting about the parameters, but the trick in machine learning is that you don’t need to say anything about the parameters. In machine learning, your focus is on describing y-hat, not β-hat.

This kind of philosophy will cause future generations to see machine learning as something worse than a fad, almost as something in between a fad and crank science. If this encapsulates how all (generally speaking) machine learning operates, then we will be in big trouble, if we are not already.

> In machine learning, bad results are wrong if they catastrophically fail to predict the future, and nobody cares much how your crystal ball works, they only care that it works.

This has moved from Cargo Cult Science into numeromancy. It's leveraging the occult (= hidden, incomprehensible parameters) for predicting the future. Because no first principles exist, nothing can be further interpreted. Only more of the occult can be leveraged in order to make more predictions not amenable to interpretation, which will in turn require MORE occult to make MORE inscrutable predictions, until the heat death of the universe...

And appealing to 80s AI (neural networks) as precedent further harms the author's case. If ML operates like how AI neural network technology went, then this whole rigmarole will go tits up by that same precedent.
cycrutchfield almost 5 years ago
If this topic interested you, it may also be worth reading Leo Breiman's "Statistical Modeling: The Two Cultures" from 2001. https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
fxtentacle almost 5 years ago
The author appears to misunderstand the main difference between statistics and ML. Let me cite him:

> my gut reaction is to barf when someone says "teaching the model" instead of "estimating the parameters."

Typical statistics work is to use a known good model and estimate its parameters. Typical machine learning work is to think back from what task you want it to learn and then design a model that has a suitable structure for learning it.

For statistics, the parameters are your bread and butter. For machine learning, they are the afterthought to be automated away with lots of GPU power.

A well-designed ML model can have competitive performance with randomly initialized parameters, because the structure is far more important than the parameters. In statistics, random parameters are usually worthless.
scribu almost 5 years ago
This is one of the clearest explanations I've read on the difference between traditional Statistics and Machine Learning.
kelvin0 almost 5 years ago
It seems like the 'scientist' part of 'Data scientist' might cause this sort of misunderstanding.

There's a lot more 'engineering' and fiddling going on than any type of 'science-y' stuff, it seems.
em500 almost 5 years ago
This is a really nice write-up, much better than yet-another-skin-deep-sklearn-tutorial. Skimming some other posts of the author, his domain understanding looks quite impressive to me.

(Judging his writing as an ex-academic econometrician Data Scientist, about to be rebranded as Machine Learning Engineer by his megacorp employer, the author appears to have more insight into the field than many a PhD professional Data Scientist.)
Uptrenda almost 5 years ago
Data science always seemed to me to be a profoundly boring job. Can anyone shed some light on what you find the most fascinating about it?
jmount almost 5 years ago
Love the article. It inspired me to make a follow-up note on one of the memes: https://win-vector.com/2020/07/03/data-science-is-not-statistics-done-wrong/
ur-whale almost 5 years ago
From the article: "In statistics, bad results can be wrong, and being right for bad reasons isn't acceptable. In machine learning, bad results are wrong if they catastrophically fail to predict the future, and nobody cares much how your crystal ball works, they only care that it works."
rob74 almost 5 years ago
Off topic, but if someone uses "gut reaction" and "barf" in the same sentence, I'm tempted to think they really mean it literally...
xvilka almost 5 years ago
There is a big difference between ML practitioners and professional statisticians. The former are commonly unaware[1] of a rich set of statistical biases and ways to tackle or mitigate them.

[1] https://towardsdatascience.com/survey-d4f168791e57
anonymousDan almost 5 years ago
Can someone elaborate on what is meant by 'estimating a parameter with a natural experiment'? This seems to be the key difference but I don't quite get how this would work. What would be your input data and how would the process differ from an ML approach?
Ericson2314 almost 5 years ago
A pox on both their houses.

I kinda want to ban this stuff for economies like ours. Think about it: we have many entrenched, inefficient, separate actors all engaging in nonsense alchemy. Surely this ruins the convergence to economic equilibrium.
YeGoblynQueenne almost 5 years ago
Well, if you look at machine learning from the point of view of data science, it's inevitable to be confused about its relation to statistics, but machine learning is a sub-field of AI and statistical techniques are only one tool in its toolbox. Statistical techniques have dominated the field in recent-ish years, but much work in machine learning has historically used e.g. Probabilistic Graphical Models or symbolic logic as the "model" language. E.g. one of the most famous and well-studied classes of machine learning algorithms, decision tree learners, comprises algorithms and systems that learn propositional logic models, rather than statistical models.

Tom Mitchell defined machine learning as "the study of computer algorithms that improve automatically through experience"[1]. This definition does not rely on any particular technique, other of course than the use of a computer. Even the nature of "experience" doesn't necessarily need to mean "data" in the way that data scientists mean "data"; for example, "experience" could be collected by an agent interacting with its environment, etc.

Unfortunately, in very recent years, since the big success of Convolutional Neural Networks in image classification tasks in 2012, interest in machine learning has shifted from AI research to... well, let me quote the article:

>> Or you can start reading TESL and try to get some of that sweet, sweet machine learning dough from impressionable venture capitalists who hand out money like it's candy to anyone who can type a few lines of code.

I suppose that's ironic. But the truth is that "machine learning" has very much lost its meaning as industry and academia are flooded by thousands of new entrants who do not know its history and do not understand its goals. In that context, it makes sense to have questions along the lines of "what is the difference between statistics and machine learning", which otherwise have a very obvious answer.

___________

[1] https://www.cs.cmu.edu/~tom/mlbook.html

The excerpt I quote is an informal definition. The Wikipedia article on machine learning has a more formal definition:

https://en.wikipedia.org/wiki/Machine_learning#History_and_relationships_to_other_fields
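As a small illustration of the decision-tree point (a sketch assuming scikit-learn; the iris dataset is just a convenient stand-in), the fitted model prints as nested if/then rules over feature thresholds rather than as a parametric statistical model:

    # Sketch: a fitted decision tree rendered as explicit if/then rules.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
    print(export_text(tree, feature_names=list(iris.feature_names)))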
fnord77 almost 5 years ago
this reads like something 5 or 6 years old
aabajian almost 5 years ago
The author's pie chart showing data science to be 60% data manipulation is accurate. The biggest gap between good and bad data scientists is their comfort level with data wrangling. When interviewing candidates for data science positions, one of the simplest questions is to have them sort a 1 GB tab-delimited file.

1. Poor candidates will try to open the file in Excel.

2a. Marginal candidates will use R or Stata.

2b. Okay candidates will use a scripting language like Python.

3. Good candidates will use Unix sort.

To my knowledge, there are no university courses teaching the Unix toolchain, and it remains very much a skill learned through practice.
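A minimal sketch of option 3, driven from Python (the file names are hypothetical, and it assumes GNU coreutils sort is on PATH); the same sort invocation works verbatim at a shell prompt:

    # Sketch: sort a tab-delimited file on its first column with GNU sort, which
    # performs an external merge sort using temp files, so a 1 GB input never has
    # to fit in memory.
    import os
    import subprocess

    subprocess.run(
        ["sort", "-t", "\t", "-k", "1,1", "-o", "sorted.tsv", "input.tsv"],
        check=True,
        env={**os.environ, "LC_ALL": "C"},   # byte-wise collation: much faster than locale-aware sorting
    )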