
Show HN: We unified LLMs, vector memory, and ranking/pruning models in one process

4 points | by levkk | about 2 years ago
There is a lot of latency involved in shuffling data around modern, complex ML systems in production. In our experience these data-movement costs dominate end-to-end user latency, rather than the model inference or ANN algorithms themselves, which unfortunately limits what is achievable for interactive applications.

We've extended Postgres with open source models from Hugging Face, as well as vector search and classical ML algorithms, so that everything can happen in the same process. It's significantly faster and cheaper, which leaves a large latency budget available for expanding model and algorithm complexity. In addition, open source models have already surpassed OpenAI's text-embedding-ada-002 in quality, not just speed. [1]

Here is a series of posts explaining how to collapse the complexity of a typical ML-powered application into a single SQL query that runs in a single process, with memory shared between models and feature indexes, including learned embeddings and reranking models:

- Generating LLM embeddings with open source models in the database [2]

- Tuning vector recall [3]

- Personalizing embedding results with application data [4]

This allows a single SQL query to accomplish what would normally be an entire application with several model services and databases. For example, a modern chatbot built across separate services and databases looks like this:

    -> application sends user input data to embedding service
    <- embedding model generates a vector to send back to application
    -> application sends vector to vector database
    <- vector database returns associated metadata found via ANN
    -> application sends metadata for reranking
    <- reranking model prunes less helpful context
    -> application sends finished prompt w/ context to generative model
    <- model produces final output
    -> application streams response to user

[1]: https://huggingface.co/spaces/mteb/leaderboard

[2]: https://postgresml.org/blog/generating-llm-embeddings-with-open-source-models-in-postgresml

[3]: https://postgresml.org/blog/tuning-vector-recall-while-generating-query-embeddings-in-the-database

[4]: https://postgresml.org/blog/personalize-embedding-vector-search-results-with-huggingface-and-pgvector

GitHub: https://github.com/postgresml/postgresml
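To make the comparison concrete, here is a minimal sketch of the in-database version of the first two steps above (embedding generation plus ANN recall) as one query. The documents table, its pgvector embedding column, and the model name are illustrative assumptions rather than a fixed schema; pgml.embed() is the extension function described in [2]:

    -- Hypothetical schema: documents(id, body, embedding vector(384)).
    -- Embed the raw user input and run ANN recall against stored
    -- document embeddings, all inside the same Postgres process.
    WITH query AS (
        SELECT pgml.embed(
            'intfloat/e5-small',            -- open source embedding model
            'How do I tune vector recall?'  -- raw user input
        )::vector AS embedding
    )
    SELECT documents.id, documents.body
    FROM documents, query
    ORDER BY documents.embedding <=> query.embedding  -- pgvector cosine distance
    LIMIT 5;

No vectors cross a network boundary until the final rows are returned, which is where most of the latency savings described above come from.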

1 comment

levkk | about 2 years ago
Links:

[1]: https://huggingface.co/spaces/mteb/leaderboard

[2]: https://postgresml.org/blog/generating-llm-embeddings-with-open-source-models-in-postgresml

[3]: https://postgresml.org/blog/tuning-vector-recall-while-generating-query-embeddings-in-the-database

[4]: https://postgresml.org/blog/personalize-embedding-vector-search-results-with-huggingface-and-pgvector

GitHub: https://github.com/postgresml/postgresml