I'm not a data scientist, but I've worked with a few over the past 10 years and I strongly agree with this article that the work has changed a lot over that time.<p>The first-generation machine learning experts were proper scientists with proper Ph.D. degrees, academic track records, etc., who would typically be very opinionated about which algorithms to use (and had quite possibly written a few of their own) but were not necessarily experienced engineers. I saw a lot of clumsy engineering and convoluted testing and evaluation processes.<p>This explains a lot about the current state of the art, which involves a lot of tools aimed at people who are not primarily engineers and need to be shielded from complex infrastructure and code, but who do know a lot about statistics, machine learning algorithms, and all the stuff that first-generation machine learning experts would know.<p>The second generation of machine learning experts is basically riding an ongoing commoditization boom. They use toolkits from Google, Facebook and others pretty much as is. These tools are easy for them to use, but not necessarily for non-expert engineers who know a lot about pumping data around but not much about machine learning algorithms. This is getting a lot easier. I've heard of high school kids getting ML jobs with no college training whatsoever, just high school math and a bit of online training. My impression is that you can get nice results with a little effort.<p>The next generation of machine learning engineers won't be scientists, and they'll indeed mostly work on manipulating data. All the machine learning algorithms will be provided in the form of black box libraries and tools that will mostly work in a fully automated mode. IMHO the whole point of deep learning is that the algorithms figure things out by themselves.
Even the job of picking the right algorithms and configuring them is ultimately going to be something that machine learning algorithms will be better at than a junior engineer with no relevant scientific background.<p>Or indeed an experienced software engineer with a classic computer science background, like myself. I have no clue what e.g. a tensor is. Articles on the topic seem to be very math heavy and tend to give me headaches. But should I even have to care in order to configure some black boxes that process data and produce models that I can plug into my runtime? My pet theory is that we're already past that point and that lots of companies are already getting decent results without having to care about the underlying algorithms.<p>I went to a great meetup at Soundcloud last week about how they used off-the-shelf machine learning tooling to improve their search ranking in Elasticsearch. It was all about the training data, the parameters in the search query that they wanted to machine learn, their tooling for evaluating model performance in terms of being able to rank real queries against real data, tooling for annotating training data, integrating models with their software, the devops for retraining the models, etc.<p>My experience working with the machine learning team in the search group at Nokia Maps (now Here) eight years ago was that the tools were an obstacle to getting results fast and that iterations on model improvements were measured in months. A lot of engineering went into things like feature extraction, model tuning, and other stuff that scientists do, as well as into building essentially all of the tools from the ground up so that models could actually be generated, evaluated, and integrated.
Only problem: many of these people weren't experienced engineers, so the tools were kind of clunky and there were lots of integration headaches, insanely long integration cycles, and lots of missed opportunities to fix (rather obvious) data problems, due to a bias towards endless tweaking of algorithms instead of applying pragmatic fixes to the data. It kind of worked and the search wasn't horrible, but the biggest problem was that the underlying data wasn't great to begin with (mis-categorized, full of duplicates, incomplete/stale, etc.).<p>The people at Soundcloud got it down to iterating in hours with a few months of engineering. That's from idea to proof of concept to having code in production that outperformed a manually crafted query.<p>That sounds like something I could do, but it also sounds like a greenfield for proper tools to emerge that make all of this a lot less painful than it currently is. The next generation hopefully won't have to build a lot of in-house tooling and reinvent a lot of wheels while doing so.