
科技回声

A technology news platform built with Next.js, providing global tech news and discussion.


© 2025 科技回声. All rights reserved.

Better Search Using AI

1 point | by AmuVarma | almost 3 years ago
Semantic search and answering are not widely adopted because of cost, complexity, and a lack of accuracy. At Semantic.app we've built the best hosted semantic search engine, and here's how we've done it.

Embedding: the process of converting text into a tensor (read: vector). This can be done on documents, sentences, or words, depending on what you are trying to accomplish. When you upload data, we create a tiered embedding system: word embeddings on key words using T5 (Google) to create keywords, and dense vector embeddings for each document using GPT-3 (the model depends on document size).

Search: keywords are important for search, and how you choose to index can affect search speeds, especially when the embeddings are computationally complex. Rather than dotting potentially millions of vectors with an embedded query (using GPT-3 asymmetric search), we've found that running a keyword-matching algorithm first, then a nearest-neighbor search (e.g. FAISS), produces the best results.

Next we cluster the top n results using a Bayesian classifier; realistically the answer can be in any of the top few documents, and naively choosing the top one is "lucky", since vector embeddings, no matter how good, are at best approximate. We take our top cluster and break it down into semantic chunks (a step we've coined "chunkify", using a custom GPT-3 model) and use a sentence-level embedder to find the most relevant chunk, which we return for search.

For answering, we use a classifier to detect what length of answer to expect (e.g. "when..." may anticipate a short date, while "why..." might indicate a longer response) and pick the right model. We then use word embeddings to find the most relevant terms in the answer, as well as the sentence embeddings from the previous paragraph, to weight certain words and phrases the model (usually GPT or BERT) should emit.
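The two-stage retrieval described above (cheap keyword filtering before a nearest-neighbor search over dense vectors) can be sketched roughly as follows. This is a toy illustration, not Semantic.app's actual code: the `embed` function here is a deterministic random stand-in for a real GPT-3 embedding, and the brute-force cosine ranking stands in for a FAISS index.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding (stand-in for a GPT-3 document embedding)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def keyword_filter(query_terms, docs):
    """Stage 1: keep only documents sharing at least one keyword with the query."""
    terms = set(query_terms)
    return [d for d in docs if terms & set(d["keywords"])]

def nearest_neighbors(query_vec, candidates, k=3):
    """Stage 2: rank the surviving candidates by cosine similarity (brute force)."""
    return sorted(candidates, key=lambda d: -float(query_vec @ d["vec"]))[:k]

docs = [
    {"id": 0, "keywords": ["search", "faiss"], "vec": embed("vector search with faiss")},
    {"id": 1, "keywords": ["embedding", "gpt"], "vec": embed("gpt embeddings for documents")},
    {"id": 2, "keywords": ["cooking"], "vec": embed("a recipe for soup")},
]
cands = keyword_filter(["search"], docs)  # drops unrelated docs before any vector math
top = nearest_neighbors(embed("how does vector search work"), cands, k=2)
```

The point of the ordering is cost: keyword matching over an inverted index is far cheaper than dotting the query against millions of vectors, so the expensive stage only sees a small candidate set.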
We also use a custom model to ensure that the answer is contained only within the source text (large models can leak answers), and we filter or re-ask the question to the model.
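The "chunkify" plus sentence-level re-ranking step could be sketched as below. This is a loose illustration under stated assumptions: the post uses a custom GPT-3 model for semantic chunking and a real sentence-level embedder, whereas here chunking is naive sentence splitting and the similarity score is a simple bag-of-words cosine.

```python
import math
import re
from collections import Counter

def bow_vec(text: str) -> Counter:
    """Bag-of-words vector (toy stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunkify(doc: str) -> list[str]:
    """Naive stand-in for semantic chunking: split on sentence boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

def best_chunk(query: str, doc: str) -> str:
    """Return the chunk most similar to the query."""
    qv = bow_vec(query)
    return max(chunkify(doc), key=lambda c: cosine(qv, bow_vec(c)))

doc = ("FAISS builds an index over dense vectors. "
       "Keyword matching narrows candidates first. "
       "Soup recipes are unrelated.")
print(best_chunk("how are dense vectors indexed", doc))
# → FAISS builds an index over dense vectors.
```

Returning the best chunk rather than the whole document is what makes the final answering stage tractable: the answer model only sees the most relevant span, which also limits the opportunity for it to leak an answer from elsewhere.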

No comments yet.