
Transformer^2: Self-Adaptive LLMs

160 points | by hardmaru | 4 months ago

14 comments

verdverm 4 months ago
This sounds like MoE and maybe a bit of chain-of-thought. Curious what someone with more domain expertise thinks about this.

If they can test against Llama 70B and Mistral 7B, they ought to compare against Mixtral 8x7B, imho.
RevEng 4 months ago
Does anyone else find that their results don't match their claims? In many cases the base model or a simple LoRA beats their proposed method. The few times theirs wins, the difference is very small. I feel like some of these "wins" are more sampling error than any significant improvement.

I'm always happy to see negative results published, but it seems like they are presenting what are effectively negative results as positive ones.
wildermuthn 4 months ago
Great research here. Contextual real-time weight modification is definitely one of the breakthroughs required for AGI. Why create a LoRA ahead of time when you can generate one on the fly, suited to the task at hand?
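For intuition, here is a hypothetical sketch of "generating a LoRA on the fly": a small hypernetwork maps a task embedding to low-rank update matrices. The hypernetwork, its shapes, and all names are illustrative assumptions, not something described in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical illustration: emit LoRA-style low-rank factors A, B from a task
# embedding, instead of training a fixed adapter per task.
d_model, rank, d_task = 1024, 8, 64
hyper = nn.Linear(d_task, 2 * d_model * rank)   # task embedding -> flattened A and B

def generate_lora(task_emb: torch.Tensor):
    flat = hyper(task_emb)
    A = flat[: d_model * rank].view(d_model, rank)
    B = flat[d_model * rank :].view(rank, d_model)
    return A, B                                  # delta_W = A @ B, added to a frozen W

task_emb = torch.randn(d_task)                   # e.g. produced by a task classifier
A, B = generate_lora(task_emb)
delta_W = A @ B                                  # (d_model, d_model) update
```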
liuliu 4 months ago
One weakness of this method is the storage of the decomposed U and V from W. My linear algebra is rusty, but keeping them seems required if you want to scale in that U-projected subspace, which roughly doubles your weight memory footprint (that said, U / V should be easier to quantize from an information-theory perspective). I also think MoE is more principled if you want expert activations. But I understand that Sakana's research focus is mostly on adapting existing pretrained models, not training them from scratch.
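The storage trade-off described here comes from keeping the SVD factors around so the singular values can be rescaled per task. A minimal PyTorch sketch of that operation (illustrative sizes and scale values, not the authors' code):

```python
import torch

# Illustrative sketch only: adapt a weight matrix by rescaling its singular values.
W = torch.randn(512, 512)                      # stand-in for a pretrained weight
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

z = torch.ones_like(S)                         # per-singular-value scales (learned in SVF)
z[:16] = 1.1                                   # e.g. emphasize a few directions

W_adapted = U @ torch.diag(S * z) @ Vh         # re-folded adapted weight

# The storage concern: keeping U, S, Vh alongside W (so different z vectors can
# be applied at inference) roughly doubles the per-matrix memory, unless U/Vh
# are quantized or the product is folded back into a single matrix as above.
```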
E_Bfx 4 months ago
> Transformer² represents a significant milestone in the evolution of AI systems.

Coming from a math background, it always amazes me to see how people in AI/ML brag about their papers. If someone wrote:

> My paper represents a significant milestone in the evolution of algebraic geometry/ergodic theory/combinatorics

it would be a laughing stock for the math community.
verdverm 4 months ago
The code: https://github.com/SakanaAI/self-adaptive-llms
ghc 4 months ago
Can someone please enlighten me as to how this is any different from Mixture of Experts? Because I don't see any difference at all.
mdp2021 4 months ago
It is discomforting to read, in the first paragraph, that "dynamical adjustment of weights" is justified as "adaptation". Clearly it is a sought-after milestone to have «a future where AI models are no longer static», but the chief reason remains that intelligent systems reprocess their body of knowledge and change it to improve it. That comes before "adaptation to the environment": it is maintenance of the body of knowledge (of the world model), the continuous practice of "thinking about things", "pondering", "reflecting", "using judgement"...

This is not just simple «lifelong learning»: the whole of past experience remains productive and in need of analysis; it is never "solved".

Anyway: the directions seem good.

Edit: equally interesting, in another direction, is the automated analysis of the internal subagents, «break[ing] down the vast, complex knowledge stored in the LLM into smaller, meaningful, and independent pieces (e.g., the different pathways or components for math, language understanding, etc)». Shouldn't there be a general study of the dissection of systems with seemingly emergent intelligence, doing on LLMs what we do on C. elegans?
qoez 4 months ago
Worth noting: one of the original inventors of the transformer is part of this team.
qrsjutsu 4 months ago
> https://sakana.ai/

I like that background animation. Seems like there's an opportunity for tiny logic gates and some punny swarm behavior.
justanotherjoe 4 months ago
Is this real? Or is this a hustler-type paper/company?
anticensor 4 months ago
Obvious next step: use this kind of model in AI-Scientist, Sakana AI's AI-powered automated researcher project.
Vampiero 4 months ago
It's all very interesting, but those pictures look pretty bad: clearly visible artifacts, awful shapes.
tzury 4 months ago
The ideas in the paper have been implemented and tested. The authors conducted experiments on several tasks (math, coding, reasoning, and visual question answering) and showed that their approach works better than previous methods like LoRA.

Key ideas (in simple terms):

1. What's the problem?
   - Fine-tuning LLMs for every new task is slow, expensive, and often doesn't generalize well.
   - Models trained on one task may perform poorly on others, especially unseen ones.
   - Current methods (like LoRA) can add new capabilities but aren't efficient enough.

2. The solution:
   - Transformer² uses a new fine-tuning method called Singular Value Fine-tuning (SVF). This focuses on adjusting only certain parts of the model's weight matrices rather than changing everything.
   - By tweaking specific components (called "singular values"), it trains smaller, efficient "expert" modules that specialize in particular types of tasks.

3. How it works (sketched below):
   - Training phase: Train these smaller expert modules offline using reinforcement learning (RL) to specialize in tasks like coding, math, or reasoning.
   - Inference phase: When a new input is given, the system analyzes the task (e.g., "Is this a math or coding problem?") in the first pass. Based on this, it combines the right expert modules and adapts the model's behavior in the second pass.

4. Three adaptation strategies:
   - Prompt-based: Use a cleverly designed text prompt to figure out the task type and pick the right expert module.
   - Classifier-based: Train a separate model to classify tasks and match them to experts.
   - Few-shot adaptation: Look at a small number of examples (few-shot learning) to dynamically combine expert modules for the best results.

5. Efficiency:
   - The system uses fewer parameters than traditional fine-tuning methods like LoRA.
   - Adaptation works even on small datasets without overfitting or forgetting older tasks.
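A rough Python sketch of the two-pass flow described in point 3; `classify_task`, the expert vectors, and their values are hypothetical placeholders, not the authors' implementation.

```python
import torch

# Offline: one "expert" vector of singular-value scales per task (learned via SVF).
experts = {
    "math":      torch.ones(512) * 1.05,
    "code":      torch.ones(512) * 0.95,
    "reasoning": torch.ones(512),
}

def classify_task(prompt: str) -> str:
    # First pass: identify the task (prompt-based here; a classifier or
    # few-shot combination could be used instead).
    if any(k in prompt for k in ("solve", "integral", "equation")):
        return "math"
    if any(k in prompt for k in ("def ", "function", "bug")):
        return "code"
    return "reasoning"

def adapt_weight(U, S, Vh, prompt: str) -> torch.Tensor:
    # Second pass: rescale the singular values with the selected expert vector
    # and rebuild the weight actually used to answer the prompt.
    z = experts[classify_task(prompt)]
    return U @ torch.diag(S * z) @ Vh

# Usage sketch:
# W = torch.randn(512, 512)
# U, S, Vh = torch.linalg.svd(W, full_matrices=False)
# W_task = adapt_weight(U, S, Vh, "solve this integral ...")
```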