Modern ML systems in production spend a surprising amount of their end-to-end latency shuffling data between services. In our experience these network and serialization costs dominate user-facing latency, rather than the model inference or ANN algorithms themselves, which unfortunately limits what is achievable for interactive applications.<p>We've extended Postgres w/ open source models from Hugging Face, as well as vector search and classical ML algorithms, so that everything can happen in the same process. It's significantly faster and cheaper, which leaves a large latency budget available for more complex models and algorithms. In addition, open source models have already surpassed OpenAI's text-embedding-ada-002 in quality, not just speed. [1]<p>Here is a series of posts explaining how to implement a typical ML-powered application as a single SQL query that runs in a single process, with memory shared between models and feature indexes, including learned embeddings and reranking models:<p>- Generating LLM embeddings with open source models in the database [2]<p>- Tuning vector recall [3]<p>- Personalize embedding results with application data [4]<p>This allows a single SQL query to accomplish what would normally be an entire application with several model services and databases.<p>e.g. for a modern chatbot built across various services and databases:<p><pre><code> -> application sends user input data to embedding service
<- embedding model generates a vector to send back to application
-> application sends vector to vector database
<- vector database returns associated metadata found via ANN
-> application sends metadata for reranking
<- reranking model prunes less helpful context
-> application sends finished prompt w/ context to generative model
<- model produces final output
-> application streams response to user
</code></pre>
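<p>As a rough sketch of what that pipeline collapses into, a single query can embed the user input, retrieve context via ANN, and generate a response in one process. The table, model names, and prompt format below are illustrative assumptions, and the exact pgml function signatures may differ by PostgresML version — check the docs before copying:<p><pre><code> -- Assumes a `documents(body text, embedding vector)` table with
 -- precomputed embeddings; model names here are placeholders.
 WITH query AS (
     -- embed the user input in-database (no embedding service hop)
     SELECT pgml.embed('intfloat/e5-small', 'user input text')::vector AS embedding
 )
 SELECT pgml.transform(
     task   => '{"task": "text-generation",
                 "model": "tiiuae/falcon-7b-instruct"}'::jsonb,
     inputs => ARRAY[
         'Context: ' || string_agg(d.body, E'\n')
         || E'\nQuestion: user input text'
     ]
 )
 FROM (
     -- ANN recall over pgvector's cosine-distance operator
     SELECT body
     FROM documents, query
     ORDER BY documents.embedding <=> query.embedding
     LIMIT 5
 ) d;
</code></pre>A reranking step could be added as another pgml call between the recall subquery and the prompt assembly; the point is that no intermediate result ever leaves the database process.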
[1]: https://huggingface.co/spaces/mteb/leaderboard<p>[2]: https://postgresml.org/blog/generating-llm-embeddings-with-open-source-models-in-postgresml<p>[3]: https://postgresml.org/blog/tuning-vector-recall-while-generating-query-embeddings-in-the-database<p>[4]: https://postgresml.org/blog/personalize-embedding-vector-search-results-with-huggingface-and-pgvector<p>GitHub: https://github.com/postgresml/postgresml