Beware the data science pin factory: The power of the data science generalist

195 points by ericcolson about 6 years ago

13 comments

reilly3000 about 6 years ago

I really wish hiring managers read this. I am a data generalist, and have had no traction with obtaining even an interview for a data science job. I’ve set up a private JupyterHub where I run Python ETL, interactive models, and dashboards. I deployed Metabase several times and have written hundreds of SQL queries. I’ve used Tableau with gigantic datasets. I built a front-end serverless analytics pipeline from scratch with AWS that handles 30M events/mo. I've demonstrably grown revenue and margins in multiple contexts with my data products. I’m working on making a fully dynamic frontend for content recommendations. I taught myself all of these skills in the past 3 years after a decade in sales, marketing, and entrepreneurship. What I haven’t done: a CS/math degree (mine was music), graduate work, or tech work at a household name. Lived in the Bay Area. Gotten an interview for any data job. Sigh.

jonathankoren about 6 years ago

I’m not sure this is entirely true. The author is arguing for full stack scientists, and I prefer those people, but they’re hard to find, and even then you don’t necessarily want them doing everything. Worse yet, if you put someone in a full stack position, and they’re not already full stack, you need to budget a lot of mentoring, because if you don’t, you’re going to get a big pile of unmaintainable code.

The author kind of builds a strawman of super specialized data scientists who constantly throw code over the wall to someone else. That doesn’t work, and you simply can’t do that unless your headcount is in the thousands. You have to have people who can productionize their work. At the same time, he’s arguing that scientists should be maintaining their own data infrastructure, but that’s not good either.

The best advice I was given was to hire people either to make you smarter, or to make you stronger/faster. You hire data scientists and ML experts to make you smarter. They should be working on problems that you can’t solve today. Infrastructure, on the other hand, isn’t your product. It’s overhead. It’s a tool. Comparatively, it’s easier to hire people to build and maintain your infrastructure. Hire people to do that. All the time your scientists spend dealing with infrastructure is time they could be doing useful work.

All that said, know when you should just shove the infra people aside and do it yourself.

sgt101 about 6 years ago

A very good article, but I think that there is a missing concept, which is organisational maturity. In a fully mature data-driven organisation (like... errm, Google I guess, going by Jeff Dean's papers anyway) there is a well-developed data fabric, polished processes for providing credentials and authority, right-sized resourcing pools, and also substantial diversity of specialisation coupled with experience and domain insight. Specialists can flourish and deliver value out of proportion to their costs. In other, less developed organisations there's no chance this will happen, and specialists will be left floundering, looking for the setting in which they can do their thang.

mmsimanga about 6 years ago

The article's sentiments are also true for Business Intelligence. The most effective (I deliberately used the word effective) BI developers have the following qualities: they are interested in the business, able to chat to clients (emotional intelligence), and also able to code. The best BI people end up being generalists: talkative nerds who can converse with business types. And from the business end, you get the business people who are genuinely curious and willing to learn some SQL.

Being able to communicate is key in BI because this enables you to focus on the right business problems.

opportune about 6 years ago

I agree and disagree with this post. I do think data scientists need to be better at data processing and do more of it. But I still think you do need a separation of labor between people setting up pipelines and people building models from the data. The real issue is that there are a lot of data science departments where they whittle away at their models in some notebook and then they're "done" once the notebook is showing the right metrics. Data scientists should be writing their models from the beginning so that they can productionize them once they are finished. There shouldn't be frequent handoff events requiring lots of communication between DS, pipelines, and data engineering teams; there should be an integration process set up so the flow of work continues to function without intervention.

thekhatribharat about 6 years ago

Interestingly, the article doesn’t talk about the scale of production and its effects on productivity. When you produce lots of pins, division of labor is a known way to increase productivity.

A data science generalist may work fine for a small data shop, but as you grow and expand data science in your organization, we know the next step to increase productivity involves specialization (AKA division of labor). It happens not just in data science, but in all business functions and with all business roles.

Marketing, Sales, Finance, Engineering, Operations: every business function uses specialization to get productivity gains. So while generalists may work for you if you’re a small business or a large business spinning up a new business function, specialization is a proven economic tool for productivity gains as you grow.

Interestingly, as a business function grows, the communication costs and the ensuing delays increase, and this is a known side-effect of specialization within that business function. This doesn’t mean one throws away specialization and runs to the other extreme of the spectrum with generalists. There’s a tradeoff organizations make here, and there’s been a lot of experimentation done in this space, such as Amazon's two-pizza teams (https://zurb.com/word/two-pizza-team), Spotify’s Squads, etc. These organizational structures are not universally applicable, but they’re interesting developments to look at.

Shameless plug (on the current state of the data science market): https://medium.com/open-factory/state-of-the-m-art-big-data-analytics-2396c321d7b9

natalyarostova about 6 years ago

I generally agree with this article, and I am, and continue to aspire to be, a strong generalist data scientist. However, I do still enjoy/need to have 1 or 2 really really strong quants/statistician types on my team, since they are able to solve certain problems at a level of depth I can't reach. However, if they aren't supported by generalists, they also struggle to make impact.

bpyne about 6 years ago

This sounds suspiciously like the battle software developers have been waging with people who want to run software development in a manufacturing model. The battle itself really sucks the love of making something right out of you.

mempko about 6 years ago

The author points this out at the end but I want to highlight it. Adam Smith also said that division of labor makes a person "as stupid and ignorant" as a person can become. https://www.pitt.edu/~syd/ASIND.html

lincpa about 6 years ago

I'm a financial analyst, CPA, CIA, CTA, statistician, and expert system developer.

I independently developed a financial analysis expert system, and I have a strong ability to innovate and execute.

All my expertise is entirely self-taught.

My project: https://github.com/linpengcheng/fa

My technology blog: https://github.com/linpengcheng/PurefunctionPipelineDataflow

metakermit about 6 years ago

Wow, this is a cool "interactive whitepaper" website :)

https://algorithms-tour.stitchfix.com/

tomrod about 6 years ago

This works in environments where infrastructure can support it. It can be downright blissful!

mlthoughts2018 about 6 years ago

This article is terrible. You can’t make a case by putting a bunch of unsupported assertions into section-heading fonts and then just filling in paragraphs.

This reads like a desperate business person wrote it, someone who wishes that one full-stack set of drives made sense and coexisted in a single person, to make that labor cheaper and more of a commodity, despite the reality that it’s simply not true.

The person who spent the time to master web service frameworks, query languages, and product engineering necessarily did not also master professional-level knowledge of deep learning or MCMC sampling or natural language processing.

The two types of people need to coexist and work symbiotically, but it’s just asinine wishful thinking to pretend that they are the same person, let alone to write a baseless essay full of assertions that if they aren’t the same person it somehow results in first-principles economic inefficiency.
