I got into ML when I realized that this piece of advice is now wrong, at least in computer vision.

It was a few years ago. I had to classify pictures of open and closed hands. I thought surely I didn't need ML for something that simple: a hue filter, a blob detector, and a perimeter/area ratio should give me a first prototype faster, and given how little data I had (about a hundred images of each class), ML didn't seem worth the headache. I quickly had a simple detector with an 80% success rate. (A sketch of that kind of pipeline is below.)

Then, as I was learning a new ML framework, I tried it too, expecting it to be overengineering for a poor result. I took the VGG16 cat-or-dog sample, swapped in my poorly scaled, non-normalized training set, ran training for a few hours, and, yes, it outperformed the simple detector that had taken me much longer to write. (A sketch of that setup follows the classical one below.)

Now, in computer vision, I think it makes sense to try ML first. If you are doing a common task like classification or object localization, setting up a prototype with pre-trained models has become ridiculously easy. Try that first, then try to outperform that simple baseline. In most cases it will be hard, and your time is better spent improving the ML approach instead.
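
Roughly, the classical pipeline looked like this (a minimal OpenCV sketch; the HSV skin range and the compactness threshold are made-up placeholders, not my actual tuned values):

```python
import cv2
import numpy as np

def classify_hand(image_bgr):
    # Hue filter: keep pixels in a rough skin-tone range (HSV).
    # These bounds are illustrative and would need tuning on real data.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 255, 255]))

    # Blob detection: take the largest contour as the hand.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)
    area = cv2.contourArea(hand)
    if area == 0:
        return None

    # Perimeter/area ratio: spread fingers add a lot of perimeter for
    # little extra area, so an open hand scores higher on this ratio.
    perimeter = cv2.arcLength(hand, True)  # True = closed contour
    compactness = perimeter ** 2 / area    # squared so it is scale-invariant
    return "open" if compactness > 30.0 else "closed"  # threshold is a guess
```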
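
And the ML side was essentially this (a minimal Keras sketch in the spirit of the VGG16 cat-or-dog samples; the head architecture, image size, batch size, and epoch count are illustrative guesses, not my exact settings):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pretrained convolutional base, frozen; only the small head gets trained.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # open vs. closed hand
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# ~100 images per class, in hands/open/ and hands/closed/ subfolders
# (hypothetical layout). Mine went in poorly scaled and non-normalized,
# and it still beat the hand-written detector.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "hands/", image_size=(224, 224), batch_size=16, label_mode="binary")
model.fit(train_ds, epochs=20)
```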