Wow, a very creative and interesting idea. I understand it as: train a bunch of fact-based LoRAs to zero loss (they mention 100k different ones), then use RAG to pick the appropriate LoRAs for a query.

So cool. The only moat I can think of for such a company would be proprietary fact LoRAs: basically licensing a modern AI encyclopedia.

Anyway, really nice idea.
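My mental model of the pipeline, with every name below being my own placeholder rather than anything from their docs, is roughly: embed the query, find the closest fact-LoRAs by key similarity, then merge those adapters' low-rank updates into the base weights before generating.

    import numpy as np

    # Hypothetical sketch of "RAG over LoRAs"; embed(), the index layout,
    # and the merge step are my guesses, not the product's actual API.

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def pick_loras(query_vec, lora_index, k=4):
        """lora_index: list of (lora_id, key_vector) pairs describing each fact-LoRA."""
        scored = [(cosine(query_vec, key), lora_id) for lora_id, key in lora_index]
        return [lora_id for _, lora_id in sorted(scored, reverse=True)[:k]]

    def apply_loras(base_weight, adapters):
        """Merge low-rank updates: W + sum(B @ A), the standard LoRA merge."""
        merged = base_weight.copy()
        for A, B in adapters:          # A: (r, d_in), B: (d_out, r)
            merged += B @ A
        return merged

    # e.g. pick_loras(embed("Who discovered penicillin?"), lora_index)
    # (embed and lora_index are stand-ins for whatever encoder/index they use)

A real system would presumably do the selection and merge per layer rather than once globally, but the shape of the idea should be something like this.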
Doesn't this make the "AI" even less creative and more like full-text search? What makes some data a "fact"? If everything in the training data ends up being treated as a fact, won't the LLM end up with 100% accuracy and 0% creativity?
Am I the only one that cringes at "10x fewer"?

How do I multiply positive numbers and get something smaller?

Is "1/10th" or "90% less" not better arithmetic?

Maybe I should have done more gooder at math but it hurts my ears (eyes).
Has anyone here used this, or anything similar? This sounds phenomenal if it really works. It looks like "contact us" is the only way to try it or buy it right now, and the purported benefit (memorizing facts at up to the scale of the training set, basically trillions of tokens of facts) is wild. I'd love to try a system running this way to understand the failure modes: for instance, how does it reliably infer which "facts" to use?
The website says:

> At inference time, the model retrieves the most relevant experts at each layer and merges back into the base model to respond to the user query.

The paper says:

> At inference time, only the relevant experts are retrieved from the index, allowing the LLM to store a large number of facts while maintaining low inference latency. We use specialized GPU kernels written in Triton Tillet et al. (2019) to accelerate the lookup of experts.

...but darned if I can understand from either what they're *actually doing* when they say that.

Why do you need a custom GPU kernel for this outside of the normal NN layers?

Can anyone see an explanation of *how* they pick *which* expert to use?
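My best guess, and it is only a guess (none of these names come from the paper), is a keyed memory lookup per layer: project the hidden state to a query, score it against a table of expert keys, keep the top-k, and add those experts' low-rank updates to the layer output.

    import numpy as np

    # Speculative sketch of per-layer expert retrieval; shapes and names are mine.
    # keys: (num_experts, d)  one routing key per fact-expert
    # A:    (num_experts, r, d),  B: (num_experts, d, r)  low-rank expert weights

    def expert_layer(h, keys, A, B, k=8):
        """h: (d,) hidden state for one token at one layer."""
        scores = keys @ h                        # similarity of h to every expert key
        top = np.argsort(scores)[-k:]            # indices of the k best-matching experts
        weights = np.exp(scores[top] - scores[top].max())
        weights /= weights.sum()                 # softmax over just the selected experts
        out = np.zeros_like(h)
        for w, i in zip(weights, top):
            out += w * (B[i] @ (A[i] @ h))       # apply that expert's low-rank update
        return h + out

If that's roughly right, the custom Triton kernel would presumably be for the key lookup plus the gather and batched low-rank matmul over a very large expert table, since top-k selection and scattered weight reads don't map cleanly onto ordinary dense layers. Again, a guess, not something the paper spells out.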
nit: I hate trying to work out what "10x fewer" or "10x less" mean.

Let's say I have a counter value, X = 100. I reduce that to 10. How can I phrase that in English?

"Value reduced to 10% of original value"

"Value reduced by 90%"

"New value is one tenth the original value"

Using multiplication with a positive integer and saying "less" just seems incomprehensible when it flies by during a sentence, and I can't stop myself from mentally saying "No", like Neo at the end of the Matrix when the three agents fire a volley of bullets at him down the corridor. "No, this sentence stops right here while I pick it apart."
Here I was hoping that there would be some kind of regulatory framework or protections put in place for AI before it became smart enough to actually take over the world.

Being able to say "you are wrong, taking over the world is a bad idea" and have the model respond with "oh you are completely right, I am very sorry for that" was our first line of defense.

I wonder if this model will argue insistently that you are wrong if you try to tell it that 1+1=3, and if so, whether that extends to philosophical issues such as the model arguing back at you based on its formed opinions on ethics and history.
It feels like the two dumb ways to customize an open LLM are fine-tuning and RAG. The former is expensive and complicated; the latter adds complexity to your queries but doesn't require up-front compute for retraining.

I couldn't tell how expensive this is up front, or what complexity it adds to the setup. Anyone know?

It's definitely an interesting idea, but if you have to pay $100k for all that LoRA training, what margins are left over?
"Hallucinations" are the creative aspect of LLMs, which is what they are more useful for- if anything we want more of them. We already have much simpler systems that search and regurgitate facts. We need more intelligent hallucinations that are consistent with and extend rather than conflict with the data.
If fewer hallucinations are the goal, surely this is a bit over the top?
Surely, if you have the ground-truth facts available, fine-tuning for EVERY subject area is much more work than using those facts with retrieval-augmented generation and making sure the facts line up?
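To be concrete about what I mean by the RAG side, here is a toy sketch (the fact list and the keyword scoring are stand-ins; a real system would use an embedding index):

    # Plain RAG needs no per-subject training, just retrieval at query time.
    FACTS = [
        "The Eiffel Tower is 330 metres tall.",
        "Water boils at 100 degrees Celsius at sea level.",
    ]

    def retrieve(query, facts, k=2):
        # Trivial keyword-overlap scoring, purely for illustration.
        q = set(query.lower().split())
        scored = sorted(facts, key=lambda f: -len(q & set(f.lower().split())))
        return scored[:k]

    def build_prompt(query):
        context = "\n".join(retrieve(query, FACTS))
        return f"Use only these facts:\n{context}\n\nQuestion: {query}\nAnswer:"

    print(build_prompt("How tall is the Eiffel Tower?"))

No training run per subject, just an index you can update whenever a fact changes.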