I'm looking into switching to MLflow or Polyaxon for experiment management and tracking. We currently use a custom-built Django app for experiment tracking and run experiments by hand on desktop workstations, but we're starting to move some of that over to GCP.<p>For people who have used either of these projects, what are your opinions, and are there any hidden issues you ran into?<p>Ideally we'd like a platform that makes it easy to schedule runs on the desktops or on GCP depending on requirements and available resources. Kubernetes seems like the best option for that, and it doesn't look like MLflow supports it out of the box yet.
As an ML engineer, I've found MLflow to be a disastrously bad way to look at the problem. It's something that managers or executives buy into without understanding it, and my team of engineers (myself included) has hated it.<p>There are many feature-specific reasons, but the biggest is that reproduction of experiments needs to be synonymous with code review and the very same version control system you use for all your other code and projects.<p>That way reproducibility is a genuine constraint on deployment: deploying an experiment, whether it's training a toy model, incorporating new data, or launching a live experiment, is conditional on reproducibility and code review of the code, settings, and runtime configs that fully embody it.<p>This is much better solved with containers, so that runtime details and software details live in the same branch / change set, and a full runtime artifact like a container can be built from them.<p>Then deployment is just whatever production deployment already is, usually some CI tool that specifies where a container (built from a PR of your experiment branch, for example) is deployed to run, along with whatever monitoring or probe-tracking tools you already use.<p>You can treat experiments just like any other deployable artifact and monitor their health or progress exactly the same way.<p>Once you think of it this way, you realize that tools like MLflow are <i>categorically</i> the wrong tool for the job, almost by definition, and that they exist mostly to foster vendor lock-in or reliance on some commercial entity, in this case Databricks.
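To make the container approach concrete, here's a minimal sketch of what "experiment as deployable artifact" might look like. This is an illustration only; the base image, file names (`train.py`, `config.yaml`, `requirements.txt`), and entry point are assumptions, not anything from a specific project:

```dockerfile
# Everything that defines the experiment lives in one reviewed branch:
# the code, the pinned dependencies, and the runtime config.
FROM python:3.11-slim

WORKDIR /experiment

# Pin dependencies so the built image is a reproducible artifact.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Experiment code and its config are versioned together in the same change set.
COPY train.py config.yaml ./

# One entry point; CI builds and ships this image like any other service,
# and the same monitoring that watches production watches the experiment.
ENTRYPOINT ["python", "train.py", "--config", "config.yaml"]
```

With something like this, rerunning an experiment is just rebuilding and rerunning the image at a given commit, and "was this reproducible and reviewed?" is answered by the PR that produced it.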