As noted in an earlier comment, I think there is a false equivalence between end-to-end MLOps platforms like MLflow and tools for experiment tracking. The project looks like a solid tracking solution for individual data scientists, but it is not designed for collaboration among teams or organizations.

> There were a few things I didn’t like: it seemed too much to have to start a web server to look at my experiments, and I found the query feature extremely limiting (if my experiments are stored in a SQL table, why not allow me to query them with SQL).

While a relational database (like SQLite) can store hyperparameters and metrics, it cannot on its own cover the many aspects of experiment tracking a team or organization needs, from visual inspection of model performance to sharing models to lineage tracking from experimentation to production. As the article itself notes, you need a GUI on top of a SQL database to make model experimentation meaningful. The MLflow web service lets you scale across teams and organizations with interactive visualizations, built-in search and ranking, shareable snapshots, etc. It can run on top of a variety of production-grade relational databases, so users can either query the backing database directly with SQL or use the UI and search API, which are easier for those not interested in writing SQL (a query sketch is included below).

> I also found comparing the experiments limited. I rarely have a project where a single (or a couple of) metric(s) is enough to evaluate a model. It’s mostly a combination of metrics and evaluation plots that I need to look at to assess a model. Furthermore, the numbers/plots themselves have no value in isolation; I need to benchmark them against a base model, and doing model comparisons at this level was pretty slow from the GUI.

The MLflow UI lets you compare thousands of models on a single page, in tabular or graphical form. It renders the performance-related artifacts associated with a model, including feature importance graphs, ROC and precision-recall curves, and any additional information that can be expressed as an image, CSV, HTML, or PDF file (the second sketch below shows how such artifacts are logged).

> If you look at the script’s source code, you’ll see that there are no extra imports or calls to log the experiments, it’s a vanilla Python script.

MLflow already provides low-code solutions for MLOps, including autologging. After a single call to mlflow.autolog(), every model you train with the most prominent ML frameworks (scikit-learn, XGBoost, TensorFlow and Keras, PySpark, LightGBM, statsmodels, and more) is automatically tracked with MLflow, including the relevant hyperparameters, performance metrics, model files, software dependencies, etc. All of this information is immediately available in the MLflow UI (see the third sketch below).
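On the SQL point, here is a minimal sketch of querying a SQLite-backed tracking store programmatically, assuming a recent MLflow version; the database file name, experiment name, and metric/parameter names are illustrative. mlflow.search_runs returns a pandas DataFrame, so the results can be sliced like any other table:

```python
import mlflow

# Point the client at a SQLite-backed tracking store
# (the file name is illustrative).
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# search_runs returns a pandas DataFrame, so runs can be filtered,
# joined, and plotted like any other table.
runs = mlflow.search_runs(
    experiment_names=["churn-model"],       # hypothetical experiment
    filter_string="metrics.val_auc > 0.8",  # hypothetical metric
    order_by=["metrics.val_auc DESC"],
)
print(runs[["run_id", "params.max_depth", "metrics.val_auc"]].head())
```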
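On comparing models beyond scalar metrics, this is a rough sketch of attaching evaluation plots and reports to a run so they render in the UI next to the metrics; the run name, metric value, and file paths are made up for illustration:

```python
import matplotlib.pyplot as plt
import mlflow

with mlflow.start_run(run_name="candidate-vs-baseline"):
    # Scalar metrics show up in the run comparison table.
    mlflow.log_metric("val_auc", 0.87)

    # Figures are rendered in the run's artifact view in the UI.
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
    ax.set_title("ROC curve (placeholder data)")
    ax.legend()
    mlflow.log_figure(fig, "plots/roc_curve.png")

    # Any image/CSV/HTML/PDF file can be attached the same way, e.g.:
    # mlflow.log_artifact("reports/evaluation.html")
```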
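And for autologging, a minimal example with scikit-learn; the dataset and model choice are arbitrary, and the only MLflow-specific line is the mlflow.autolog() call:

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

mlflow.autolog()  # the only tracking-related line in the script

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fitting the model creates an MLflow run with the hyperparameters,
# training metrics, the serialized model, and environment details.
model = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```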
Addendum:

As noted, there is a false equivalence between an end-to-end MLOps lifecycle platform like MLflow and tools for experiment tracking. To succeed with end-to-end MLOps, teams and organizations also need projects that package code reproducibly on any platform across many package versions, a way to deploy models to multiple environments, and a registry to store and manage those models - all of which MLflow provides (a registration sketch follows below).

It is battle-tested, with hundreds of developers and thousands of organizations using it on top of widely adopted open-source standards. I encourage you to chime in on the MLflow GitHub with any issues and PRs, too!
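As a rough sketch of the registry piece, assuming a run has already logged a model (e.g. via autologging) and the tracking server uses a database-backed store, registration is a single call; the run ID placeholder and model name here are hypothetical:

```python
import mlflow

# The Model Registry requires a database-backed tracking store.
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# "<run_id>" and the model name are placeholders; model_uri points at the
# "model" artifact logged by an earlier run.
version = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="churn-classifier",
)
print(version.name, version.version)
```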