We've been hard at work for a few weeks and thought it's time for another update.<p>In case you missed our first post, PostgresML is an end-to-end machine learning solution, running alongside your favorite database.<p>This time we have more of a suite offering: project management, visibility into the datasets and the deployment pipeline decision making.<p>Let us know what you think!<p>Demo link is on the page, and also here: <a href="https://demo.postgresml.org" rel="nofollow">https://demo.postgresml.org</a>
Seems like a great idea. When you look at many ML frameworks, half the code and learning overhead is data-schlepping code and table-like structures that "reinvent" the schema that already exists inside a database. Not to mention, there can be security concerns from dumping large amounts of data out of the primary store (how are you going to GDPR delete that stuff later on?). So why not use it natively where the data already is?<p>For anything substantive it seems like a bad idea to run this on your primary store, since the last thing you want to do is eat up precious CPU and RAM needed by your OLTP database. But in a data warehouse or similar replicated setup, it seems like a really neat idea.
This is really cool, running ML workloads on top of SQL is a very practical way of doing ML for a lot of businesses. Many companies don't have the fancy ML workloads like you see at OpenAI, they just have a SQL database with some data that could greatly help their business with some simple ML models trained on it. This looks like a nice way to do it. A slightly different approach that I've been working on involves hooking data warehouses up to Pachyderm [0] so you can do offline training on it. Not as good for online stuff as this, but for longer running batch style jobs it works really well.<p>[0] <a href="http://github.com/pachyderm/pachyderm" rel="nofollow">http://github.com/pachyderm/pachyderm</a>
How do you deal with different dataset train/validation/test? How do you measure the degradation of the model? Is there any way to select the metric you target (accuracy, f1-score or any other)?
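For readers less familiar with the terms in this question, here is a minimal Python sketch of what a held-out split and a choice of target metric mean. It is not PostgresML's API: the function names `split_holdout`, `accuracy`, and `f1_score`, and the toy labels, are assumptions made purely for illustration.

```python
import random


def split_holdout(rows, test_fraction=0.25, seed=0):
    """Shuffle rows deterministically and split off a held-out test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up): on imbalanced data the two metrics can disagree.
y_true = [1, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]
train, test = split_holdout(range(8))
```

Computing the same metric on freshly held-out rows over time is one simple way to notice a model degrading; which metric to watch depends on how costly false positives and false negatives are for the use case.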
This is great! FYI for those who haven't seen, BigQuery can also run statistical learning methods directly on your data as part of the query. Really cool to see ML going this direction.
Hello
Really nice!<p>Can you explain the differences from <a href="https://madlib.apache.org/" rel="nofollow">https://madlib.apache.org/</a>?
Wouldn't an OLAP DB be better suited than PG for this kind of workload?<p>Does being a PostgreSQL module make it compatible with Citus, Greenplum or Timescale?
Congratulations on the launch!<p>This is the most exciting ML-related project I've seen in a while, mainly because the barrier to entry seems low: anyone with a PG database could apply a model to their data using PostgresML, if I understood the premise correctly.<p>Most of the comments here seem to be about separating the compute from the database machine, which, it seems, isn't possible right now with PostgresML. But the GitHub page reads at the start:<p>> The system runs Postgres with the pgml-extension installed on port 5433 by default, *just in case you happen to be running Postgres already*:<p><pre><code> $ psql -U postgres -h 127.0.0.1 -p 5433 -d pgml_development
</code></pre>
I think the second part needs to be clarified: does it mean installing the PGML extension on a machine running an existing PG database and connecting to it, or does it just mean starting the Postgres session of the PGML Docker package?
Great idea! I see this is implemented using the Python language interface supported by PostgreSQL and importing sklearn models. I always wonder how scalable this is considering the serialization-deserialization overhead between Postgres' core and Python. Do you see any significant performance difference between this and training the sklearn models directly on something like Dataframes?
Interesting concept, but I think BigQuery ML [1] has been providing similar features for years now. Curious to learn what the differences are, other than offering this as a Postgres plugin.<p>[1] <a href="https://cloud.google.com/bigquery-ml/docs/introduction" rel="nofollow">https://cloud.google.com/bigquery-ml/docs/introduction</a>
Reminds me of <a href="https://riverml.xyz/latest/" rel="nofollow">https://riverml.xyz/latest/</a> (which is awesome) but the idea is even better because it skips all the copying and preprocessing yak shaving. Can't wait to kick the tires!
Cool approach. This nicely fits in the trend of SQL-as-much-as-possible because that makes it just a tiny bit more accessible. Definitely going to play with this in the next few days.
(edit:)
Being able to get training data from a SQL view is by far the nicest. Keep it up!
I feel like a lot of issues with ML systems come from the fact that some person got a CSV dump of the data and then iterated for a month to build a fantastic model, which nobody knows how to integrate with the DB.<p>So, this is why I really like this idea, and about 3 years ago I seriously thought about starting this thing as well. I went ahead and built a specific data company (so not a tooling one) and now I don't like this idea anymore.<p>To me this is a lot like proposing: "let's get rid of REST APIs and GraphQL and connect the frontend directly to the DB" (ignoring security issues for a bit).<p>On the frontend, the view in which you'd like to display your data is different from how it should be saved.
It's exactly the same in ML: the view your data can be trained or predicted on is very different from how it should be stored.<p>They are connected, but IMO there always has to be a transformation layer. (And Python is just a much better way to do that transformation, but that's another story.)
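To make the "transformation layer" point concrete, here is a minimal Python sketch of turning a row shaped for storage into a row shaped for training. Everything in it is hypothetical: the column names, the `PLANS` vocabulary, and the `to_features` helper are invented for illustration and are not taken from PostgresML or any real schema.

```python
from datetime import datetime

# Hypothetical stored row: shaped for storage, not for training.
stored_row = {
    "user_id": 42,
    "plan": "pro",
    "signed_up_at": "2022-01-15T09:30:00",
    "last_login_at": "2022-03-01T18:00:00",
}

PLANS = ["free", "pro", "enterprise"]  # assumed category vocabulary


def to_features(row):
    """Turn one stored row into a flat numeric feature vector."""
    signed_up = datetime.fromisoformat(row["signed_up_at"])
    last_login = datetime.fromisoformat(row["last_login_at"])
    # Derived feature: whole days between signup and last login.
    days_active = (last_login - signed_up).days
    # One-hot encode the plan; identifiers like user_id are dropped.
    plan_one_hot = [1.0 if row["plan"] == p else 0.0 for p in PLANS]
    return [float(days_active)] + plan_one_hot


features = to_features(stored_row)
```

The same reshaping could in principle be expressed as a SQL view instead; which side of the database boundary it belongs on is exactly the trade-off being debated in this thread.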
This is awesome. I’m guessing the models are executed on the database server and not a separate cluster? What about GPU training? How is that handled? I’d love to see more docs.
This looks awesome! I’m not an expert, but wouldn’t typical database hardware be less than optimal for running ML? Is this meant to run on a replica (which is quite straightforward to set up) that has ML-optimised hardware?