The blog post format is so much nicer than a PDF paper!<p>More seriously, this does feel like a real advance. Vector search + context stuffing (RAG) is clearly a hack that doesn't resemble how we actually think or do things in reality. I've been wondering for the last year whether it's possible to extend the attention mechanism to more naturally connect to a bigger set of per-session weights or activations. The moment you encounter the query/key/value analogy it's an obvious idea, the problem being that you need a very strong grip on the low-level details of neural architecture to actually do it. Now apparently it is possible! And the way the topk knob actually maps to abstraction is quite amazing.<p>Still, this doesn't eliminate context window constraints. The memories themselves have a form of context window in this technique. Context size (what they call sequence length) does still matter.<p>Additionally, the memories have to actually fit in GPU memory (at least in their implementation). And the memories appear to be nearly full snapshots of the network's internal activations, so they will get quite large (much larger than the text retrieved by RAG). So there's going to be a painful tradeoff here for the foreseeable future where you'll have to decide whether you want a bigger, smarter base model with a bigger context window but less space for memory, or a smaller model with a smaller context window but a bigger memory.<p>This is the first I've heard of Normal Computing; who are these guys/gals exactly?<p><i>> Normal is a deep-tech startup founded by former Google Brain & X engineers</i><p>Ah. That explains it. ("X" here refers to Google's X, the moonshot factory, not Twitter/Musk's X.)
The most important bits about what they do to make this happen:<p>> In addition to the causal self-attention integral to transformers, we also allow each query token to attend to a fixed number of “external memories”. These memories are stored in a non-differentiable cache. The choice of which memories to attend to is made using cosine similarity within each decoder layer and attention head.<p>[...]<p>> We create our external memories (at each layer) by passing those external contexts through our model, just like inference. Then we save the internal representations the model generated, and attend to them later.<p>What an extremely clever approach!<p>If, in a chatbot setting, you update the external memory cache during inference, the model immediately retains memory of the discussion.<p>Maybe this is an alternative (quicker? more exact?) to LoRA finetuning for giving a foundation model a specific personality and experiential history?
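A minimal single-head sketch of how that retrieval-into-attention step could look. This is my own reconstruction from the quoted description, not the authors' code; the shapes, the per-token top-k selection, and the single shared softmax over local and external scores are all assumptions:
<pre><code>
import torch
import torch.nn.functional as F

def attention_with_external_memories(q, k, v, mem_k, mem_v, top_k=4):
    """
    q, k, v:      (seq_len, d_head) local queries/keys/values for one head
    mem_k, mem_v: (n_memories, d_head) representations cached from external
                  contexts by an ordinary forward pass, per the quote above
    """
    d = q.shape[-1]

    # Pick top_k external memories per query token by cosine similarity.
    sim = F.normalize(q, dim=-1) @ F.normalize(mem_k, dim=-1).T   # (seq, n_mem)
    idx = sim.topk(top_k, dim=-1).indices                         # (seq, top_k)
    sel_k, sel_v = mem_k[idx], mem_v[idx]                         # (seq, top_k, d)

    # Ordinary causal self-attention scores over the local context...
    local_scores = (q @ k.T) / d**0.5
    causal = torch.triu(torch.ones_like(local_scores), diagonal=1).bool()
    local_scores = local_scores.masked_fill(causal, float("-inf"))

    # ...plus scores against each token's selected external memories.
    mem_scores = torch.einsum("sd,skd->sk", q, sel_k) / d**0.5

    # One softmax over both, then mix the corresponding values.
    weights = torch.cat([local_scores, mem_scores], dim=-1).softmax(dim=-1)
    local_out = weights[:, : k.shape[0]] @ v
    mem_out = torch.einsum("sk,skd->sd", weights[:, k.shape[0]:], sel_v)
    return local_out + mem_out
</code></pre>
Since mem_k/mem_v live in a plain non-differentiable cache, appending the representations of the ongoing conversation during inference (the chatbot scenario above) would need no gradient updates at all.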
With almost all of these papers, "RAG" is mentioned as if it were a well-defined, settled strategy. Is there an agreed-upon implementation? Because as I have been building my own, I have found that finding the right things to retrieve and augment the prompt with is incredibly challenging.<p>It seems to me that a <i>great</i> RAG setup would almost always outperform other strategies, because it's like handing the student a note with the answer right before the exam rather than making them read the whole curriculum. But I am very much still learning.
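For concreteness, the baseline most papers seem to mean by "RAG" is roughly the sketch below: embed the chunks, retrieve the nearest by cosine similarity, stuff them into the prompt. The embed() and generate() functions are placeholders for whatever embedding model and LLM are used (not from this paper), and the genuinely hard parts (chunking, ranking, filtering) are hidden inside these few lines:
<pre><code>
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError  # placeholder: a sentence-transformer or embedding API

def generate(prompt: str) -> str:
    raise NotImplementedError  # placeholder: your LLM of choice

def answer(query: str, chunks: list[str], top_k: int = 3) -> str:
    chunk_vecs = embed(chunks)      # (n_chunks, d)
    q_vec = embed([query])[0]       # (d,)

    # Cosine similarity between the query and every chunk.
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8
    )
    best = [chunks[i] for i in sims.argsort()[::-1][:top_k]]

    # "Augment": stuff the retrieved chunks into the prompt.
    prompt = "Context:\n" + "\n---\n".join(best) + f"\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)
</code></pre>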
This is great. I've been using GPT-based tools extensively for my side project, which I have a demo for (<a href="http://drophere.co" rel="nofollow noreferrer">http://drophere.co</a>; see code at <a href="https://github.com/itissid/drop_webdemo">https://github.com/itissid/drop_webdemo</a>). One thing I think is missing is a couple of agents that can pick out the right context from my code depending on what needs to get done. Let me explain:<p>There are a couple of patterns I used to design chunks of the web app:
One is to do high-level design brainstorming with an agent and write some pseudocode, which is fine with regular GPTs with limited context lengths.<p>As I got into the implementation details, there are several subcomponents which <i>themselves</i> require a dedicated agent, because they become one-week sprints to complete with hundreds of lines of code. But I need them to pull context from the other agents on the project, which is missing.<p>This latter part could be done along the lines of this research.<p>I think a couple of projects that are trying to do this with traditional RAG, like GPTEngineer, could benefit from it.
Their experiment seems to handle less than 100,000 tokens of text. I wonder if this method could scale to more than 1,000,000 or 10,000,000 tokens.<p>It seems like it's limited by context length? So with 512-token chunks at a 4096 context length, that is about 2 million tokens, which could be good for some things but is not going to help with a really large knowledge base.<p>Edit: I see the source code is in the Files section on the Hugging Face page.
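A back-of-envelope version of that estimate, assuming 512-token chunks and a budget of one chunk per position of a 4096-token context (the numbers above, not figures from the paper):
<pre><code>
chunk_tokens = 512          # assumed size of each external-memory chunk
max_chunks = 4096           # assumed budget: one chunk per context position
total = chunk_tokens * max_chunks
print(f"{total:,} tokens")  # 2,097,152, i.e. roughly 2 million tokens of external memory
</code></pre>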
Oh interesting. From what I understand, then, this is great for small-context models compared to RAG? Is there research into how to make this more effective for large-context models (since the context size of major models seems to be 4x-ing every six months at this point)?
> Finetuning seeks to extend the length of the context window itself.<p>Does it? I'd thought fine-tuning was more like transfer learning, adding highly specific data to the training set (e.g. your internal codebase) rather than actually modifying the architecture.
Wouldn't this require retraining the model each time your "extended mind" needs more information? One benefit of RAG, in my mind, is that the "database" can be updated in real time without any model retraining.