Hi,<p>I’m doing research for a project at my company to build a framework that helps MLEs develop models. It’ll handle things like fault tolerance, checkpointing, and integration with internal service APIs, and it should be mostly framework-agnostic.<p>Companies publish a lot of literature on the system design of ML training systems, but little about the SDK layer. Does anyone know of industry publications detailing this layer, preferably from larger companies?