科技回声

1 comment

mlthoughts2018, more than 4 years ago

One of the big lessons from advice like this: your organization must invest heavily in data infrastructure and unique SRE needs for machine learning if you hope to get value from ML projects.

If you don’t invest in data engineering or you yoke your ML engineers with job duties to also handle all the operations work, you’re not going to see meaningful returns, and it won’t be because ML is just hype or is inapplicable to your business. It will be because of poor leadership and overloading operations responsibilities.

For ML, the SDLC is just very different.

You need an auditable system of feature ingestion and serving to exist as a foundation, so that model training can standardize and you don’t overwhelm an ML team with duties to constantly deal with schema changes, data ingestion job delays or outages, and collating poorly maintained data across multiple teams.

You also need a standardized way to define training jobs, with an easy config interface to specify all the resources needed (GPUs, disk space, cached data sets, certain CPU properties, a docker container with the training environment prepared) that connects to an experiment management system to track models and map the entire bundle of parameters (hyperparameters, version control commit hash, resource settings, etc.) to each trained model output artifact and its evaluation metrics.

If you leave this stuff as a free-for-all for ML engineers to figure out, they will have to spend so much time fighting with infrastructure that is not their problem that either they’ll burn out and quit to get a job that doesn’t treat them this way, or they’ll resign themselves to being an “ops monkey” and you’ll see an endless churn of infra projects that are supposed to finally unlock the real potential of ML for your business but inevitably never do.

Practices of healthy companies that lead to real value from ML projects:

1. Make data producers responsible for their data delivery, data quality and data timeliness. If some team owns a search page, they also own everything about ingesting the data generated by that page. Data Platform teams might give them tools or provide infrastructure for it, but that application-specific team is ultimately responsible.

2. SRE teams do not helicopter in with parochial best practice recommendations and then leave. SRE teams must co-own the challenging components where the rubber meets the road and really be down in the implementation weeds - this is critically true for unusual or long-tail special case systems, for example GPU workflows or Spark infra for ML jobs. SRE cannot be siloed to focus only on the biggest use cases, and it cannot act like it just doles out philosophy and best practices. They have to be tied to the delivery incentives of every team they serve.

3. Clear the decks of ML engineer time. Your ability to get value from ML engineers directly relates to how much unblocked autonomy they are given. If they are overloaded with bureaucratic chores or maintenance tasks or on-call responsibilities, it directly subtracts from their comparative advantage of conducting model training and optimization. Nobody likes to admit it, but you can get other people to do the maintenance, the on-call shifts or the compliance tasks. Many engineers won’t be sacrificing a comparative advantage by being allocated to that stuff, but if you allocate ML engineers to it, you are flushing money down the toilet by wasting their comparative advantage. Remember: it’s a business - it’s not about what’s “fair” in terms of everyone sharing e.g. an on-call shift; what matters is what creates the best result for your customer.

Before you can apply the tactical steps like the 12 steps of this article, you have to resolve much bigger issues of engineering culture to reflect the 3 points above.

If internal engineering politics won’t block you from these culture changes, then you can set things up to get value from ML. Otherwise, you’ll probably just waste money and have high turnover on the ML teams because of the unworkably broken culture issues upstream of the actual dev workflow tactics.
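
To make the "standardized way to define training jobs" point a bit more concrete, here is a minimal sketch of what such a job spec could look like. It is only an illustration under stated assumptions, not any particular tool's API: the names `ResourceSpec`, `TrainingJobSpec`, `ExperimentRecord`, `record_run` and the docker image string are all hypothetical, and the "experiment management system" is stood in for by a plain dict.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ResourceSpec:
    # Hypothetical resource settings; a real platform would define its own.
    gpus: int = 0
    disk_gb: int = 50
    docker_image: str = "example/train-env:latest"  # assumed image name


@dataclass(frozen=True)
class TrainingJobSpec:
    # The "entire bundle of parameters" that should map to one trained model.
    commit_hash: str
    hyperparameters: dict
    resources: ResourceSpec = field(default_factory=ResourceSpec)
    cached_datasets: tuple = ()

    def fingerprint(self) -> str:
        # Serialize with sorted keys so equal specs always hash the same.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass
class ExperimentRecord:
    spec: TrainingJobSpec
    artifact_path: str
    metrics: dict


def record_run(registry: dict, spec: TrainingJobSpec,
               artifact_path: str, metrics: dict) -> str:
    # Key the record by the spec fingerprint so the artifact and metrics
    # can always be traced back to the exact settings that produced them.
    run_id = spec.fingerprint()
    registry[run_id] = ExperimentRecord(spec, artifact_path, metrics)
    return run_id


registry = {}
spec = TrainingJobSpec(
    commit_hash="abc1234",  # placeholder commit hash
    hyperparameters={"lr": 0.001, "batch_size": 64},
    resources=ResourceSpec(gpus=2, disk_gb=200),
    cached_datasets=("clicks_v3",),  # placeholder dataset name
)
run_id = record_run(registry, spec, "models/run.bin", {"auc": 0.91})
```

The design choice in this sketch is that the run identifier is derived from the spec itself, so two runs with identical commit, hyperparameters and resources resolve to the same record, and any change to the bundle yields a new one.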

Reproducible Machine Learning in production

1 comment
