Data scientists shouldn’t need to know Kubernetes

181 points by vtuulos over 3 years ago

30 comments

antman over 3 years ago

There are people, mostly with an IT background, who think that for data science you don’t need to know math and can just monkey-see-monkey-do AutoML based on a tutorial, inspirational MOOCs, and libraries that appeared magically out of thin air.

There are people with a math background who think data science is just an extension of statistics, so business, knowledge of scalable information storage, and productization are irrelevant.

There are both kinds of posts here on HN. My take has been to hire math people with some CS MSc, CS people with a data science MSc, and business people who also know sales.

For me that has worked painlessly, but your mileage may vary. I haven’t seen that black swan CV capable in all three disciplines, but I have seen CVs that seem to think they can tackle every problem because they have read all the towardsds and Kaggle tutorials. Marginalization? Kubeflow? POV? Two out of three are usually foreign concepts.

void_mint over 3 years ago

Most people involved in tech, including most devs, shouldn't need to know/care about Kubernetes. The reason anyone thinks otherwise is the massive amounts of marketing money vested parties have pumped into sales (read: DevRel/Dev Evangelism, dev influencers).

Jugurtha over 3 years ago

We "do ML" for large organizations as a tiny consultancy. The way we've been able to improve the working conditions for ourselves (developers and data scientists) was by focusing on two things:

- Process: we analyzed what worked and what didn't in past projects, continuously auditing and trying to extract learnings. We made sure the people we built for at the client organization were involved. We scoped more thoroughly. We involved the parts of the client organization that could torpedo the project downstream (legal, security, etc.) upfront. We made fewer assumptions and listened more.

- Tooling: we built a machine learning platform[0] to make sure a data scientist doesn't have to tap on anyone's shoulder to troubleshoot their system, set up their computing environment, or deploy their model. They could do it themselves. Furthermore, it wasn't necessary to get people who could move across the stack.

Changing our processes and the way we do consulting had a huge impact. A badly scoped project will in some way or another create toil downstream and create a situation where you need people to do full-stack and you need "all-hands-on-deck" constantly. That's just bad, and after we ruthlessly reworked the process, we had better results, better relations with clients, better cadence, etc. I emphasize this because we were a larger team at some point, running around working on so many projects simultaneously that everyone was practically burned out.

[0]: https://news.ycombinator.com/item?id=28373127

Jensson over 3 years ago

Data scientists want salaries like software engineers, which is why they get requirements like software engineers. There are plenty of data scientist positions where all you need to know is Excel, but those don't pay nearly as well. And if you look at the typical software engineering position, there is almost always a slew of adjacent technologies; it is hard to get a position today where you only have to know one thing.

jrockway over 3 years ago

I think what's going on here is that tech leadership folks know that the models the scientists develop eventually need to feed into their live product (so they need to be "production ready"), but there isn't enough work to have two teams: one to develop the models, and one to run them in production. Thus, the ideal employee is an expert in everything! That's valuable, but not likely to be something you find when both data science and SRE are deep fields where people are very successful only knowing one of them ;)

I work on something called Pachyderm, which is a Kubernetes-based data storage and job execution system that tries to bridge this gap. We have a managed solution (https://hub.pachyderm.com) where we provision your Kubernetes cluster and do all the management (keeping the software up to date, authentication and authorization, etc.) and in fact don't even expose kubectl to you. You'll never see any of the Kubernetes stuff (though you might recognize certain error messages, I suppose). You just supply your code and a specification for how data flows around your pipelines, add your data, and we do the rest. Data scientists can interact with the versioned inputs and outputs through notebooks, but you're getting the full suite of production features behind the scenes -- a history of exactly which data inputs went into which data outputs, incremental processing, seamless autoscaling (set cpus: 8, gpus: 1 in your pipeline specification, and we find you a machine that meets that spec, add it to your cluster in less than a minute, schedule your work there, and remove the machine when the job finishes), etc.

Sorry for the sales pitch. I pretty much never use HN to shill my paid work, but it seems especially relevant to this sort of problem. Maybe you don't need the unicorn employee who is an expert in multiple fields -- focus on the data science and let us actually deal with the ugliness of computers ;)

(And if you do like Kubernetes but don't want to write your own orchestration system, Pachyderm itself is open source.)

urthor over 3 years ago

I don't think it's a particularly new feature of software development that a few highly paid employees who've got the entire stack in their brains are vastly more productive than a vast cross functional team.

tofflos over 3 years ago

It's a price data scientists have to pay in order to work in rapidly evolving business and solution spaces. Someone within the local organization has to experience all these tools before being able to reach similar conclusions. Many organizations are still struggling to get the data science infrastructure in place, so they look for full-stack people to help get the ball rolling and start making progress on some initial set of prioritized business problems.

A few organizations are further along on that journey, enabling their data scientists to focus on things other than process and tooling. Full-stack will be in demand until the solution space stabilizes and the bulk of organizations catch up.

rjzzleep over 3 years ago

This is a pretty good post. I completely agree that a data scientist should not need to know Kubernetes.

There is a section about Airflow and while the author doesn't advocate for it, I've very much like it many many times. People still recommend it, but I find it to be an absolute nightmare to deal with.

One thing I have learned dealing with different data science teams is something else though. I have gone through every single pipelining tool (including Pachyderm) and stream processing tool that was available at the time. The thing that people forget is that every single one of them has something that throws you off of what you actually want to accomplish, or has some sort of caveat for your use case.

The important thing to note is that the job of the architect, or whatever you want to call that person, is to provide an infrastructure where the data scientist can just run their code. And no matter which one of these environments you use, you still need to build glue code for your use case, even if that glue code is a Python library with a good distribution mechanism.

FpUser over 3 years ago

I am a developer and do not know much about k8s. Well, I know the theory and what it's for, and could learn to use it in practice. However, I have yet to find a single case amongst my clients where all this infrastructure overhead would provide positive ROI. I do not deal at Google scale, and for normal businesses a single instance of a properly written server deployed on dedicated hardware covers all their needs many times over. It serves as many requests as they can ever hope for without breaking a sweat.

kureikain over 3 years ago

I've had extensive Airflow experience and I generally agree that Airflow isn't a good solution. It's good when you process a single atomic "unit of work" per step. When each step processes multiple files, etc., and it gets restarted, you have to write code to skip the already-processed files, for example.

But I want to point out a few things that are wrong in the article to help others evaluate Airflow.

> Second, Airflow’s DAGs are not parameterized, which means you can’t pass parameters into your workflows. So if you want to run the same model with different learning rates, you’ll have to create different workflows.

You can pass parameters to workflows by giving them a JSON config. When triggering from the UI, you can paste the JSON with the right arguments/parameters for your DAG. So you can train a model with different arguments, etc.

> Third, Airflow’s DAGs are static, which means it can’t automatically create new steps at runtime as needed.

You can absolutely create new steps at runtime. The point of Airflow is that everything is just Python code that is evaluated to generate DAGs; as long as you generate the DAGs and write the operator, it will happily run and log. It may have trouble rendering in the UI and cause some weird issues (tasks wouldn't advance after certain steps regardless, when I last worked on them, but those are bugs).

You can write an operator; the operator in turn can initiate any other known operators and point the next steps to those operators. Here is an example: https://stackoverflow.com/questions/41517798/proper-way-to-create-dynamic-workflows-in-airflow
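
The two workarounds described in this comment (skipping files that were already processed before a restart, and driving a run from a JSON config) can be sketched in plain Python. This is only an illustration under stated assumptions, not Airflow code: `process_files`, `run_params`, `DEFAULTS`, and the `done.json` checkpoint file are hypothetical names made up for the example. In Airflow itself, the JSON supplied when triggering a run is exposed to tasks as `dag_run.conf`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical default hyperparameters for the illustration.
DEFAULTS = {"learning_rate": 0.01, "epochs": 10}


def run_params(conf_json, defaults=DEFAULTS):
    """Overlay a JSON config (as pasted when triggering a run) on defaults."""
    overrides = json.loads(conf_json) if conf_json else {}
    return {**defaults, **overrides}


def process_files(files, checkpoint_path, handler):
    """Run handler on each file, skipping files recorded in the checkpoint."""
    checkpoint = Path(checkpoint_path)
    # Load the names finished by a previous (possibly crashed) attempt.
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed_now = []
    for name in files:
        if name in done:
            continue  # already handled before the restart
        handler(name)
        done.add(name)
        processed_now.append(name)
        # Persist after every file so a crash loses at most one unit of work.
        checkpoint.write_text(json.dumps(sorted(done)))
    return processed_now


def demo():
    """Simulate a step that crashes on its third file and is then retried."""
    files = ["a.csv", "b.csv", "c.csv", "d.csv"]
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "done.json"

        def flaky(name):
            if name == "c.csv":
                raise RuntimeError("simulated crash")

        try:
            process_files(files, checkpoint, flaky)
        except RuntimeError:
            pass
        # The retry only touches the files that were not finished the first time.
        return process_files(files, checkpoint, lambda name: None)
```

Here `demo()` returns only the files left over after the simulated crash, and `run_params('{"learning_rate": 0.1}')` keeps the default `epochs` while overriding the learning rate, so one workflow definition can serve several training runs.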

dudeinjapan over 3 years ago

Waiters shouldn't need to know anything about cooking.

However, knowing a bit about cooking might make one a better waiter.

thom over 3 years ago

Full stack data scientists exist. They have certain advantages over others. Specialists exist. They have certain advantages over others. Live your life, be free.

sandGorgon over 3 years ago

I'm kind of surprised at seeing Kubeflow vs Metaflow levels of abstraction, honestly.

If you are indeed talking from a data scientist POV, then the right abstractions here are Dask and Ray Distributed.

Both can run on Kubernetes as the underlying orchestration layer, but they provide a Pythonic interface to distributed data science primitives.

falcolas over 3 years ago

My opinion is simply: You should understand the environment your code runs in. Be it bare metal, Kubernetes, or anything in-between. How that environment works determines how your code works - or doesn’t work.

Despite our best efforts, we have yet to abstract away the runtime environment. Despite Java’s best efforts.

mmarq over 3 years ago

These requests are not unreasonable in organisations that only need to run some simple (from a mathematical standpoint) operations against a complex (from an IT perspective) dataset. Quite often you don't need a full-time statistician or mathematician, but you can make it a full-time job if you hire a sysadmin or a developer who understands statistical distributions and hypothesis testing, and you put them in charge of the whole data infrastructure.

I'm not saying this is the majority of data scientist jobs, but in some organisations I worked for, the data analyst was a guy who ran `SELECT MIN(v), MAX(v), AVG(v) from TableX` against a MySQL DB, so they were also in charge of DB administration and data ingestion; otherwise it would not have been a full-time job.
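
For a sense of how small that analyst workload is, here is a minimal sketch of the same aggregate query. It assumes an in-memory SQLite database as a stand-in for the MySQL server mentioned in the comment, and the helper name `summarize` is hypothetical; only the table name `TableX` and column `v` come from the query quoted above. Note that the select list needs a comma between each aggregate.

```python
import sqlite3


def summarize(values):
    """Return (min, max, avg) of values using a single SQL aggregate query."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL DB in the comment
    try:
        conn.execute("CREATE TABLE TableX (v REAL)")
        conn.executemany("INSERT INTO TableX (v) VALUES (?)", [(v,) for v in values])
        # One pass over the table computes all three aggregates.
        return conn.execute("SELECT MIN(v), MAX(v), AVG(v) FROM TableX").fetchone()
    finally:
        conn.close()
```

Calling `summarize([1, 2, 6])` returns the minimum, maximum, and mean as one row, which is the entire "analysis" step in the role being described; the rest of the job is the administration and ingestion around it.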

alxmrs over 3 years ago

My favorite infrastructure abstraction tool in this category is Apache Beam. I like that it lets you think in Python and an explicit Map Reduce DAG. Serialization errors are a bear to deal with. But, the power and composability of the framework make it nearly addictive.
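
To make the "explicit Map Reduce DAG" idea concrete, here is a rough sketch in plain Python. It deliberately does not use the Apache Beam API; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical and only illustrate the three stages a framework of this kind would wire together and distribute.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Fold each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Chain the stages: map -> shuffle -> reduce."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Each stage is an ordinary function over plain data, which is roughly why such pipelines compose well. In a distributed runner the functions and their inputs have to be shipped to workers, which is presumably where the serialization errors mentioned above come from.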

ricklamers over 3 years ago

This post really resonates with why we created Orchest [0].

From the article: "involve two full sets of tools: one for the dev environment, and another for the prod environment"

This is what we think should change. We intend to bring dev and prod into a single cohesive environment. Initially it will be difficult to cover all types of production workloads (like the post mentioned, production is a spectrum). But what we've observed is that through container encapsulation we can create well-defined production workloads that we can run on any container orchestrator while shielding the data scientists from that complexity during pipeline development _and_ deployment.

With a container-first approach to DAGs it becomes trivial not just to mix library versions but even languages (e.g. feature extraction in Scala and model fitting in Python). In practice, this flexibility has resulted in a significant productivity increase because existing code "just works". No "one virtual environment to rule them all" necessary.

I like how the article does justice to the fact that there's a subtle yet important difference between mere workflow orchestrators and workflow orchestrators that take on meaningful responsibility when it comes to infrastructure. To really unburden the data scientist from having to be a full-stack unicorn you need to hide the underlying stack to the point where it's invisible. In that sense, the OS kernel analogy really works. Similarly, how many data analysts writing SQL have ever worried about database node sharding?

A big problem we see in the space is that there are still way too many leaky abstractions, and data scientists end up dealing with architecture & config yet again, for many a task out of their depth. We hope to contribute to a better ecosystem, one where data scientists spend their time looking at the data, relating it to the domain, shipping value-generating data pipelines/models, and communicating about results with their stakeholders. Not fighting config & infra.

[0] https://github.com/orchest/orchest

spicyramen over 3 years ago

Very limited and unfair comparison between Kubeflow and Metaflow. Metaflow is dependent on AWS (it is mentioned but not emphasized). To me this is a non-starter. It makes sense for Netflix but not for the rest of the world.

m0zg over 3 years ago

Increasingly data scientists need to know a thing or two about underlying tech. Otherwise you're limiting yourself to stuff that can be built on a single machine, and that doesn't get you very far. That said, with that list of qualifications they'll be looking for a very long time, especially if they aren't prepared to hire a $400/hr contractor to do all that stuff. Such people exist, there are just very few of them, and they're booked solid months in advance.

lvl100 over 3 years ago

This is laughable. 15 years of DL? I ran neural net models more than 15 years ago. It wasn’t even accepted back then. Heck, people looked at you weird if you mentioned Python. As far as I am concerned, if you tell me you did DL before 2013 as a “DATA SCIENTIST” you are full of shit.

As far as OP, how do you learn Docker without Kubernetes these days? To me this is like saying you don’t need to learn Windows because all you do is run the solver in Excel.

TruthWillHurt over 3 years ago

What DO they know? Their Python code is sub-par, a procedural script not suitable for production use. They can't use Git. They don't write tests. They don't understand how to deploy or use CI/CD.

Maybe they should stick to spreadsheets, or upskill a bit so they don't consume so much of the engineers' time.

EastSmith over 3 years ago

Nobody that is not in a system administrator / dev ops role needs to know about it. I do not want to know about it. I am not explaining react reconciliation in my scrum updates, so stop giving me updates about Kubernetes.

justsomeuser over 3 years ago

Sure, they don’t need to know how to schedule their computations on CPUs, as another team member can handle it, but I think the reality is that if you work in software you have to constantly be learning.

tuananh over 3 years ago

By that def, developers shouldn't need to know Kubernetes either.

However, with the rise of devops culture, everyone should know the stack so they can use the platform effectively. Everyone needs to upskill.

sgt101 over 3 years ago

I am really puzzled by "production is a spectrum". Production means that the code is run with a support team to an SLA - the support team must have accepted it into service and be confident that they can deal with what might go wrong.

That's production.

tedk-42 over 3 years ago

Kubernetes, Linux, CI/CD pipelines, unit testing...<insert tech here/>

To be honest, the landscape is constantly changing and people should learn as much as they can.

I call ignorance on these kinds of posts.

alexnewman over 3 years ago

I’ve heard a lot about people who don’t want to learn the stack they program on.

fithisux over 3 years ago

Sooner or later DSes will need to become Full-Stack. Knowing Kubernetes will be an advantage.

streetcat1 over 3 years ago

There is a reason that operating systems is a mandatory course in any respectable CS program. Kubernetes is no different.

Data scientists should know about Kubernetes as much as they should know how to program.

phendrenad2 over 3 years ago

Developers in general shouldn't need to know about Kubernetes, but it's become trendy to slash your IT/Ops teams to the bone and instead accept that your developers will just spend all of their time trying to configure GCP.
