> A machine learning algorithm which can be trained using SQL opens a world of possibilities. The model and the data live in the same space. This is as simple as it gets in terms of architecture. Basically, you only need a database which runs SQL.

That's the first paragraph of the conclusion, and it very much fits with the mindset that's been growing on me in the data world over the past few years. Databases are much more powerful than we think, they're not going away, only getting better, and having the data and the logic in the same space removes tons of headaches. ML models, data transformations, and generating JSON for an API can all be done within the database rather than in an external scripting language.

Are others seeing this? Or is the current tradeoff just that more people know Python than SQL or database-specific languages, so moving logic into Postgres or Snowflake is looked down on?
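To make the "generating JSON for an API" point concrete, here's the kind of thing I mean: a minimal PostgreSQL-flavored sketch (the orders/customers schema is invented for illustration):

    -- Build an API-ready JSON payload entirely inside the database.
    SELECT json_agg(
               json_build_object(
                   'order_id', o.id,
                   'customer', c.name,
                   'total',    o.total
               )
           ) AS payload
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > now() - interval '1 day';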
Title here is wrong. The title and headings in the article itself are right: ONLINE gradient descent.

It's specifically *not* stochastic. From the article:

*Online gradient descent*

*Finally, we have enough experience to implement online gradient descent. To keep things simple, we will use a very vanilla version:*

*- Constant learning rate, as opposed to a schedule.*
*- Single epoch, we only do one pass on the data.*
*- Not stochastic: the rows are not shuffled.* ⇠ ⇠ ⇠ ⇠
*- Squared loss, which is the standard loss for regression.*
*- No gradient clipping.*
*- No weight regularisation.*
*- No intercept term.*
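Spelled out, that "very vanilla" recipe is just one weight update per row, in stored order. A minimal sketch of that spec as a recursive CTE (not the article's exact query; it assumes a table obs(step, x, y) where step is a gapless 1-based row number, and a hand-picked learning rate):

    -- Online gradient descent for y ≈ w*x: squared loss, constant learning
    -- rate, single pass, unshuffled rows, no intercept, no clipping.
    WITH RECURSIVE state(step, w) AS (
        SELECT 0, 0.0                    -- initial weight, before any row
        UNION ALL
        SELECT
            obs.step,
            -- w <- w + 2 * lr * (y - w*x) * x, with lr fixed at 0.01
            state.w + 2 * 0.01 * (obs.y - state.w * obs.x) * obs.x
        FROM state
        JOIN obs ON obs.step = state.step + 1
    )
    SELECT w FROM state ORDER BY step DESC LIMIT 1;  -- final weight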
Just don't.

SQL:

- does not allow for easy and clean importing of modules/libraries

- is not easy to write tests for

- has limited support for a debugger

- lacks a consistent style for such large queries (plus most textbooks cover fairly simple stuff), which means it's harder for a developer to start reading someone else's code than in other languages

- clearly indicates in its name that it is a Query language.

Save yourself the trouble, and all your collaborators the pain, of working with this code in the future, of trying to add new features, of trying to reuse it in another project.

If you want to operate near the data, use PL/Python for PostgreSQL.

EDIT: Fixed formatting.
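For reference, a PL/Python function is declared from SQL but its body is ordinary Python, so the logic can be unit-tested and debugged like any other Python code. A rough sketch (assumes the plpython3u extension is available and you have the superuser rights it requires):

    -- Enable the (untrusted) Python language once per database.
    CREATE EXTENSION IF NOT EXISTS plpython3u;

    -- The function body between the $$ markers is plain Python.
    CREATE FUNCTION gradient_step(w double precision,
                                  x double precision,
                                  y double precision,
                                  lr double precision)
    RETURNS double precision AS $$
        # Squared-loss update for y ~ w*x; the logic lives in Python, near the data.
        return w + 2 * lr * (y - w * x) * x
    $$ LANGUAGE plpython3u;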
This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.

Has anyone converted stuff like gradient descent to set theory?

https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_GettingStartedSQL.03.04.html

https://www.sqlshack.com/mathematics-sql-server-fast-introduction-set-theory/

https://www.sqlshack.com/learn-sql-set-theory/

Right now AI algorithms kind of look imperative and stateful to me, like state machines. But there should be a functional or relational representation, where pattern matching would be used to derive the current state instead.

It's trivial to go from a functional to an imperative representation, but often nearly impossible to go the other way. That's why monadic (sorry if I'm mincing terms) logic is so troublesome. Mutable variables, futures/promises, async, etc. can't be statically analyzed, which is why most code today is difficult to parallelize and why things like C++ optimizers don't even increase speed to the next order of magnitude. But functional languages have nearly infinite optimization potential through divide-and-conquer approaches like sharding and scatter-gather arrays that can run internally without side effects. In other words, we can throw hardware at SQL for linear speedup since it's embarrassingly parallel, but we might have limited success optimizing something like Python.
This is really interesting, but there's a basic part I don't understand: what would it actually look like to run this on a live dataset?

If I understand correctly: you'd run the recursive query, it produces results for every step, effectively showing you the progression of output over time, and then once it hits "present day", it completes and stops?

How would you generate results going forward? I.e., a minute elapses after the results return; do you have to re-run the whole query for all time?
From the start I assumed this was a nice, playful "Hexing the technical interview" kind of joke. But given the tone of the article, and some of the comments here… Uh, this cannot be serious, right?
Incredible post. I laughed when I saw the title, snickered at the first paragraph, and then proceeded to be blown away by the rest of it. Thought I was in for a joke and instead I'm thinking about the nature of ML Ops and what it's become.
In the comments here so far, we see a pattern we've seen before: when someone suggests doing something in SQL, there's a lot of concern about SQL being a very limited programming language in which it's hard to do proper engineering.

Here's what I would really love to know: why is it that SQL is, to first order, the only language used to interact with databases, and why does SQL have about the same features as it did in the 70s? It seems analogous to the general-purpose programming world having stopped with C.
This is genius, I love it.

On-stream inference with something like "continuous" data in the same structure is, I think, the final material form for "AI", so this is a great step towards that.

Thanks for the writeup.
You might want to consider checking out ClickHouse, which supports many ML functions natively:

- stochasticLinearRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/stochasticlinearregression/)

- stochasticLogisticRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/stochasticlogisticregression/#agg_functions-stochasticlogisticregression)
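From the linked docs, the usage pattern is to fit the model as an aggregate state and then apply it with evalMLMethod. Roughly like this (the train_data/test_data tables and feat1/feat2 columns are placeholders; the four parameters are learning rate, L2 regularization, mini-batch size, and update method):

    -- Fit: store the aggregate state of the fitted model.
    CREATE TABLE your_model ENGINE = Memory AS
    SELECT stochasticLinearRegressionState(0.1, 0.0, 5, 'SGD')(target, feat1, feat2) AS state
    FROM train_data;

    -- Predict: apply the stored state to new rows.
    WITH (SELECT state FROM your_model) AS model
    SELECT evalMLMethod(model, feat1, feat2) AS prediction
    FROM test_data;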
This is great. The only thing I dislike is using these variables to try to predict Adj Close when they are not at all correlated.

There are countless meaningful correlations in financial data that would have been just as easy to play around with. One truly valuable example would be to look at trading multiples of comparable companies. Sticking to P/E would be easiest, as P is easily observable and forward-looking EPS estimates are generally always available. This would limit the exercise to more mature companies than the ones commonly discussed on HN, but it would make it actually meaningful.
Just to disambiguate: here, machine learning = linear regression. So no autograd is needed, as the gradients are explicit and simple.
Still interesting though.
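Concretely, for the article's setting (squared loss, one weight, no intercept) the gradient is one line of algebra, so each online update is just:

    % Squared loss for a one-weight, no-intercept linear model
    L(w) = (y - w x)^2
    % Explicit gradient, no autograd needed
    \frac{\partial L}{\partial w} = -2\,(y - w x)\,x
    % Online update with constant learning rate \eta
    w \leftarrow w - \eta \frac{\partial L}{\partial w} = w + 2\eta\,(y - w x)\,x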
From a comment by the author:

> Also, I can't justify why, but my gut feeling is that the database should be an order of magnitude faster than Python, provided the implementation were to be improved.

Would be curious how that could end up being the case. Perhaps if NumPy wasn't used at all? That would mean no vectorization or optimized storage.

Would be interesting to see how it scaled with length and dimensionality.
We've been doing this type of gradient descent in production for a couple of years now, and we've been very happy with it. A few insights from my experience:

One of the main benefits of doing differentiable programming over a relational database is that you can use joins to traverse the normalized data schema, and this acts as a very effective prior for your model. For example, you can learn parameters associated with shirt colors by having a parameter in the color table and joining it with the shirt sales table (through the shirts table). And knowledge of the table structure also lets you improve convergence, for example by adding a factor to compensate for the fact that not all colors have the same number of sales (and therefore have been updated fewer times during each epoch). Here's a paper from last year: https://arxiv.org/abs/2209.03771v1

Automatic differentiation is an absolute must-have. Once you start having a few complex joins and aggregations, it's too hard to derive the gradient by hand. Doing automatic differentiation on relational queries requires some adjustments (in particular, because you cannot use a tape), so you end up having to define a subset of relational algebra that is closed under automatic differentiation. The general ideas were presented in https://ceur-ws.org/Vol-2971/paper07.pdf

On the other hand, while it's possible to do SQL-to-SQL automatic differentiation, the resulting queries have poor performance, so it's better to go one step lower (at the level of the executor for an already-planned query) and perform the automatic differentiation there. And we had an excellent intern work with us on dedicated parallelization for gradient descent queries: https://blog.lokad.com/pdf/reproducible-parallel-sgd-ziyad-benomar-2022-09.pdf

Finally, in practice, it usually doesn't matter whether the gradient descent is stochastic, so long as you run several passes over the data.

I'm very interested in differentiable programming on relational languages, as mainstream research concentrates on unstructured, low-information-density data (images/sound/text) when there are many domains where the data is denser and more structured, and (because of that structure) is stored in a database.
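To make the shirt-color example concrete, one gradient step over such a schema could look roughly like this (a simplified sketch, not our actual system; I'm inventing tables color(color_id, param), shirt(shirt_id, color_id), and sale(shirt_id, qty), and using the per-color parameter directly as the predicted sale quantity):

    -- One squared-loss gradient step on per-color parameters; the joins
    -- traverse the normalized schema so every sale updates its color.
    UPDATE color
    SET param = param + 0.01 * grad.g   -- constant learning rate 0.01
    FROM (
        SELECT
            shirt.color_id,
            -- Averaging (rather than summing) the per-row gradients
            -- compensates for colors having different numbers of sales.
            AVG(2 * (sale.qty - color.param)) AS g
        FROM sale
        JOIN shirt ON shirt.shirt_id = sale.shirt_id
        JOIN color ON color.color_id = shirt.color_id
        GROUP BY shirt.color_id
    ) AS grad
    WHERE color.color_id = grad.color_id;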
There was also this neural network module in Redis, if you want to do training and inference in a Redis DB. (Quite old, btw.)
https://github.com/antirez/neural-redis
I tried to replicate this in SQLite. The first few steps worked OK, e.g.

https://lite.datasette.io/?json=https://gist.github.com/simonw/f750e08b33851a5ac8786d0b717a0152#/data?sql=WITH+RECURSIVE%0A++++stream+AS+%28%0A++++++++SELECT+ROW_NUMBER%28%29+OVER+%28%29+AS+step%2C+%22Adj+Close%22+AS+x%0A++++++++FROM+raw%0A++++++++ORDER+BY+step%0A++++%29%2C%0A++++state%28step%2C+x%2C+avg%29+AS+%28%0A++++++++--+Initialize%0A++++++++SELECT+step%2C+x%2C+x+AS+avg%0A++++++++FROM+stream%0A++++++++WHERE+step+%3D+1%0A++++++++UNION+ALL%0A++++++++--+Update%0A++++++++SELECT%0A++++++++++++stream.step%2C%0A++++++++++++stream.x%2C%0A++++++++++++state.avg+%2B+%28stream.x+-+state.avg%29+%2F+stream.step+AS+avg%0A++++++++FROM+stream%0A++++++++INNER+JOIN+state+ON+state.step+%2B+1+%3D+stream.step%0A++++%29%0A%0ASELECT+*%0AFROM+state%0AORDER+BY+step+DESC%0ALIMIT+5

(I replaced "figures" with "raw" due to the way Datasette Lite assigns a default table name to the imported JSON.)

But the more complex recursive queries gave me this error, and I'm not sure how to work around it:

    recursive reference in a subquery: state
E.g. https://lite.datasette.io/?json=https://gist.github.com/simonw/f750e08b33851a5ac8786d0b717a0152#/data?sql=WITH+RECURSIVE%0A++++stream+AS+%28%0A++++++++SELECT%0A++++++++++++ROW_NUMBER%28%29+OVER+%28%29+AS+step%2C%0A++++++++++++%22Adj+Close%22+AS+x%2C%0A++++++++++++%22Close%22+AS+y%0A++++++++FROM+raw%0A++++%29%2C%0A++++state%28step%2C+x%2C+x_avg%2C+y%2C+y_avg%2C+cov%29+AS+%28%0A++++++++--+Initialize%0A++++++++SELECT%0A++++++++++++step%2C%0A++++++++++++x%2C%0A++++++++++++x+AS+x_avg%2C%0A++++++++++++y%2C%0A++++++++++++y+AS+y_avg%2C%0A++++++++++++0.0+AS+cov%0A++++++++FROM+stream%0A++++++++WHERE+step+%3D+1%0A++++++++UNION+ALL%0A++++++++--+Update%0A++++++++SELECT%0A++++++++++++step%2C%0A++++++++++++x%2C%0A++++++++++++x_new_avg+AS+x_avg%2C%0A++++++++++++y%2C%0A++++++++++++y_new_avg+AS+y_avg%2C%0A++++++++++++cov+%2B+%28%28x+-+x_prev_avg%29+*+%28y+-+y_new_avg%29+-+cov%29+%2F+step+AS+cov%0A++++++++FROM+%28%0A++++++++++++SELECT%0A++++++++++++++++stream.step%2C%0A++++++++++++++++stream.x%2C%0A++++++++++++++++stream.y%2C%0A++++++++++++++++state.x_avg+AS+x_prev_avg%2C%0A++++++++++++++++state.x_avg+%2B+%28stream.x+-+state.x_avg%29+%2F+stream.step+AS+x_new_avg%2C%0A++++++++++++++++state.y_avg+AS+y_prev_avg%2C%0A++++++++++++++++state.y_avg+%2B+%28stream.y+-+state.y_avg%29+%2F+stream.step+AS+y_new_avg%2C%0A++++++++++++++++state.cov%0A++++++++++++FROM+stream%0A++++++++++++INNER+JOIN+state+ON+state.step+%2B+1+%3D+stream.step%0A++++++++%29%0A++++%29%0A%0ASELECT+step%2C+cov%0AFROM+state%0AORDER+BY+step+DESC%0ALIMIT+5&DOUBLE=
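Not tested, but SQLite appears to reject any reference to the recursive table from inside a subquery; it has to appear exactly once, directly in the recursive SELECT's FROM clause. One possible workaround sketch is to flatten the inner subquery by inlining the x_new_avg/y_new_avg expressions, so the recursive step becomes something like:

    -- Recursive step with the subquery flattened: "state" now appears
    -- only once, directly in the FROM clause.
    SELECT
        stream.step,
        stream.x,
        state.x_avg + (stream.x - state.x_avg) / stream.step AS x_avg,
        stream.y,
        state.y_avg + (stream.y - state.y_avg) / stream.step AS y_avg,
        state.cov + (
            (stream.x - state.x_avg)  -- x - x_prev_avg
            * (stream.y - (state.y_avg + (stream.y - state.y_avg) / stream.step))  -- y - y_new_avg
            - state.cov
        ) / stream.step AS cov
    FROM stream
    INNER JOIN state ON state.step + 1 = stream.step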