Infinite Context LLMs: Going Beyond RAG with Extended Minds

145 points by telotortium, over 1 year ago

12 comments

mike_hearn, over 1 year ago
The blog post format is so much nicer than a PDF paper!

More seriously, this does feel like a real advance. Vector search + context stuffing (RAG) is clearly a hack that doesn't resemble how we actually think or do things in reality. I've been wondering for the last year whether it's possible to extend the attention mechanism to more naturally connect to a bigger set of per-session weights or activations. The moment you encounter the query/key/value analogy it's an obvious idea, the problem being that you need a very strong grip on the low-level details of neural architecture to actually do it. Now apparently it is possible! And the way the topk knob actually maps to abstraction is quite amazing.

Still, this doesn't eliminate context window constraints. The memories themselves have a form of context window in this technique. Context size (what they call sequence length) still matters.

Additionally, the memories have to actually fit in GPU memory (at least in their implementation). And the memories appear to be nearly full snapshots of the network, so they will get quite large (much larger than text grabbed using RAG). So there's going to be a painful tradeoff here for the foreseeable future: you'll have to decide whether you want a bigger, smarter base model with a bigger context window but less space for memory, or a smaller model with a smaller context window but bigger memory.

This is the first I've heard of Normal Computing, who are these guys/gals exactly?

> Normal is a deep-tech startup founded by former Google Brain & X engineers

Ah. That explains it. "X" here also refers to Google, not Twitter/Musk.
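A rough back-of-envelope on that memory-size point, assuming the cached "memories" are per-layer key/value states and Llama-2-7B-like dimensions (both are assumptions for illustration, not figures from the post):

```python
# Rough estimate: size of cached key/value "memories" vs. raw text per token.
# Assumes Llama-2-7B-like dimensions and fp16 storage; real numbers depend on
# the model and on exactly which internal representations get cached.
n_layers = 32
d_model = 4096
bytes_per_value = 2                                        # fp16
kv_per_token = 2 * n_layers * d_model * bytes_per_value    # keys + values, every layer
text_per_token = 4                                         # ~4 bytes of UTF-8 text per token

print(f"cached KV per token: {kv_per_token / 1024:.0f} KiB")   # ~512 KiB
print(f"raw text per token:  {text_per_token} bytes")
print(f"blow-up factor:      ~{kv_per_token // text_per_token:,}x")
```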
isoprophlex, over 1 year ago
The most important bits about what they do to make this happen:

> In addition to the causal self-attention integral to transformers, we also allow each query token to attend to a fixed number of "external memories". These memories are stored in a non-differentiable cache. The choice of which memories to attend to is made using cosine similarity within each decoder layer and attention head.

[...]

> We create our external memories (at each layer) by passing those external contexts through our model, just like inference. Then we save the internal representations the model generated, and attend to them later.

What an extremely clever approach!

If, in a chatbot setting, you update the external memory cache during inference, the model immediately retains memory of the discussion.

Maybe this is an alternative (quicker? more exact?) to LoRA fine-tuning, for giving a foundation model a specific personality and experiential history?
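A rough single-head sketch of the quoted mechanism, not Normal Computing's code: the shapes, the single-head layout, and the topk default are simplifying assumptions, and the real method does this within every decoder layer and attention head while keeping the cache non-differentiable.

```python
import torch
import torch.nn.functional as F

def attend_with_external_memories(q, k, v, mem_k, mem_v, topk=4):
    """q, k, v: (T, d) queries/keys/values for the current sequence;
    mem_k, mem_v: (M, d) cached key/value "memories" from earlier passes."""
    T, d = q.shape

    # Each query picks its topk memories by cosine similarity.
    sim = F.normalize(q, dim=-1) @ F.normalize(mem_k, dim=-1).T      # (T, M)
    idx = sim.topk(topk, dim=-1).indices                             # (T, topk)
    sel_k, sel_v = mem_k[idx], mem_v[idx]                            # (T, topk, d)

    # Ordinary causal attention over the local tokens...
    local = (q @ k.T) / d ** 0.5                                     # (T, T)
    mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
    local = local.masked_fill(mask, float("-inf"))
    # ...plus attention to each query's selected memories.
    mem = torch.einsum("td,tkd->tk", q, sel_k) / d ** 0.5            # (T, topk)

    w = torch.softmax(torch.cat([local, mem], dim=-1), dim=-1)
    w_local, w_mem = w[:, :T], w[:, T:]
    return w_local @ v + torch.einsum("tk,tkd->td", w_mem, sel_v)
```

Updating the chatbot-style memory the comment describes would then just mean appending newly computed key/value states to `mem_k` and `mem_v` between turns.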
kristiandupont, over 1 year ago
With almost all of these papers, "RAG" is mentioned as a well-defined, absolute strategy. Is there an agreed-upon implementation for it? Because as I have been building my own implementation, I have found that finding the right things to retrieve and augment the prompt with is incredibly challenging.

It seems to me that a *great* RAG would almost always outperform other strategies, because it's like giving the student a note with the answer right before the exam, compared to letting them read the whole curriculum. But I am very much still learning.
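There is indeed no single canonical RAG implementation; what most papers mean is roughly the recipe below. This is a minimal sketch only: the embedding model and the prompt format are arbitrary choices, and real systems differ mainly in how the chunking and retrieval steps are tuned.

```python
# Minimal RAG sketch: embed chunks, retrieve the nearest ones to the query by
# cosine similarity, and stuff them into the prompt ahead of the question.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary choice of embedder

def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query by cosine similarity."""
    chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.cos_sim(query_emb, chunk_emb)[0].topk(k)
    return [chunks[int(i)] for i in hits.indices]

def build_prompt(query, chunks, k=3):
    """Augment the prompt with the retrieved chunks."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return f"Use the context below to answer.\n\n{context}\n\nQuestion: {query}"
```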
itissid, over 1 year ago
This is great. I've been using GPT-based tools extensively for my side project, which I have a demo for (http://drophere.co, code at https://github.com/itissid/drop_webdemo). One thing I think is missing is a couple of agents that can pick out the right context from my code depending on what needs to get done. Let me explain:

There are a couple of patterns I used to design chunks of the web app. One is to do high-level design brainstorming with an agent and write some pseudocode, which is fine with regular GPTs and their limited context lengths.

As I got into the implementation details, there are several subcomponents which *themselves* require a dedicated agent, because they become one-week sprints to complete with hundreds of lines of code. But I need them to pull context from other agents of the project, which is missing.

This latter part could be done along the lines of this research.

I think the projects trying to do this with traditional RAG, like GPTEngineer, could benefit from it.
cs702, over 1 year ago
Very nice... but this still has quadratic compute cost, so in practice you can't achieve infinite context with it.
ilaksh, over 1 year ago
Their experiment seems to handle less than 100,000 tokens of text. I wonder if this method could scale to more than 1,000,000 or 10,000,000 tokens.

It seems like it's still limited by context length? So with 512-token chunks at a 4096 context length, that's about 2 million tokens, which could be good for some things but is not going to help with a really large knowledge base.

Edit: I see the source code is in the Files section on the Hugging Face page.
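Spelling out that arithmetic (the 4096-slot cap is the commenter's assumption about the memory cache, not a figure stated in the post):

```python
# If each memory slot stands in for a 512-token chunk and the cache is capped
# at a 4096-entry context, the effective coverage would be:
chunk_tokens = 512
memory_slots = 4096
print(chunk_tokens * memory_slots)   # 2,097,152, i.e. roughly 2 million tokens
```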
runnedrun, over 1 year ago
Oh interesting. From what I understand, then, this is great for small-context models compared to RAG? Is there research into how to make this more effective for large-context models (since the context size of the major models seems to be 4x-ing every 6 months at this point)?
superb-owl, over 1 year ago
> Finetuning seeks to extend the length of the context window itself.

Does it? I'd thought fine-tuning was more like transfer learning, adding highly specific data to the training set (e.g. your internal codebase) rather than actually modifying the architecture.
tj-teej, over 1 year ago
Wouldn't this require retraining the model each time your "extended mind" needs more information? One benefit of RAG, in my mind, is that the "database" can be updated in real time without needing any model retraining.
moelf, over 1 year ago
For the example of replacing "Lee Hazlewood" with "Terry Allen" in the Wikipedia entry, how is it possible that the baseline has an accuracy above 0%?
itissid, over 1 year ago
Hi, is the benchmarking code available anywhere? I could not see it.
inciampati, over 1 year ago
Is there an implementation of this anywhere?