
State-space models can learn in-context by gradient descent

86 points by dsalaj 7 months ago

7 comments

quantadev 7 months ago
> can reproduce the outputs of an implicit linear model with least squares loss after one step of gradient descent.

Makes you wonder if we're training LLMs the hard way. For example, if computers had been invented before Calculus, we'd have been using "Numerical Integration" (iterating the differential squares to sum up areas, etc.) and "Numerical Differentiation" (ditto for calculating slopes).

So I wonder if we're simply in a pre-Calculus-like phase of NNs/Perceptrons, where we haven't yet realized there's a mathematical way to "solve" a bunch of equations simultaneously and arrive at the best (or some locally optimal) model weights for a given NN architecture and set of training data.

From a theoretical standpoint it *is* a black-box problem like this, where the set of training data goes in and an array of model weights comes out. If I were to guess, I'd bet there'll be some kind of "random seed" we can add as input, and for each seed we'll get a different set of model weights (a different local minimum/maximum).

But I'm not a mathematician, and there may be some sort of PROOF that what I just said can definitely never be done?
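A minimal numpy sketch of the contrast being drawn here (all names and numbers are illustrative, not from the paper): for a plain linear model with squared loss, one gradient step only moves part of the way toward the answer, while the closed-form normal-equations solution "solves the equations simultaneously" in the sense the comment is reaching for.

    import numpy as np

    # Toy least-squares problem: y = X @ w_true + noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.01 * rng.normal(size=100)

    # One step of gradient descent on the mean squared loss, starting from zero weights.
    w = np.zeros(5)
    lr = 0.01
    grad = -2.0 * X.T @ (y - X @ w) / len(y)
    w_one_step = w - lr * grad

    # Closed-form answer via the normal equations: solve (X^T X) w = X^T y directly.
    w_closed = np.linalg.solve(X.T @ X, X.T @ y)

    print(np.linalg.norm(w_one_step - w_true))  # still far from w_true after one step
    print(np.linalg.norm(w_closed - w_true))    # essentially recovers w_true

The catch, of course, is that the closed form exists only because the model is linear in its weights; for deep networks no analogous formula is known.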
billconan 7 months ago
> We show that SSMs with local self-attention, a form of input-dependent input processing, can perform in-context learning analogously to transformers, i.e. through gradient descent steps on an implicit linear regression problem.

I don't understand. The benefit of SSMs is better scalability than self-attention. Now this adds self-attention back?
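For what it's worth, a back-of-envelope sketch of why "local" matters here (the window size and sequence lengths are made up): full self-attention compares every token pair, which grows quadratically with sequence length, while a fixed local window keeps the attention cost linear, so the SSM's scaling advantage isn't given back.

    # Rough pair counts, purely illustrative: full attention is O(T^2),
    # a fixed local window of size w is O(T * w), i.e. linear in sequence length T.
    def full_attention_pairs(T: int) -> int:
        return T * T

    def local_attention_pairs(T: int, w: int = 64) -> int:
        return T * min(w, T)

    for T in (1_000, 10_000, 100_000):
        print(T, full_attention_pairs(T), local_attention_pairs(T))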
eli_gottlieb 7 months ago
> Our key insight is that the diagonal linear recurrent layer can act as a gradient accumulator

So they're sort of reinventing the discrete-time differentiator from signal processing, but parameterized neurally?
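A toy sketch of the "gradient accumulator" reading (shapes and values are illustrative, not from the paper): a diagonal linear recurrence h_t = a * h_{t-1} + x_t with a = 1 is just a running element-wise sum, so if each x_t carried a per-token gradient contribution, the recurrent state would hold the accumulated gradient; with a < 1 it becomes a leaky, discounted accumulator.

    import numpy as np

    # Diagonal linear recurrence: h_t = a * h_{t-1} + x_t (element-wise).
    def diagonal_recurrence(xs, a):
        h = np.zeros_like(xs[0])
        for x in xs:
            h = a * h + x
        return h

    # Pretend each x_t is a per-token gradient contribution.
    xs = [np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([2.0, 0.0])]
    print(diagonal_recurrence(xs, a=np.ones(2)))        # [3.5, 1.0]: plain accumulation (sum)
    print(diagonal_recurrence(xs, a=0.9 * np.ones(2)))  # leaky / discounted accumulation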
agnosticmantis 7 months ago
These papers don't explain how pretrained LLMs learn in-context, because the simplified models in these papers are either pretrained for the same task that's tested in-context, or the weights are handpicked by humans to do GD at inference time.

See this video for a good discussion: https://youtu.be/-yo2672UikU
roger_ 7 months ago
I'd love to see SSMs replace transformers, but adapting them to non-causal, 2D+ inputs doesn't seem that straightforward.

Is there a non-autoregressive future?
derefr 7 months ago
So, I'm just a layman when it comes to AI/ML, but I do understand computability — what's possible to do with a given machine, and how we can build higher-computational-power primitives out of lower-computational-power primitives by plugging those primitives together with "glue" like parallel feed-forward chains (e.g. an ALU adder's carry bits) and loops over static sub-states of execution.

My own mental model for what Transformers *must necessarily* be doing, in order to be able to compute what they compute, given:

1. the primitives they're made of (for Transformers: matmul a learned matrix; vector-add a learned bias vector; normalize; softmax)

2. what those primitives can compute over a single layer

3. the low-ish total number of layers in a Transformer model

...is that they were already effectively "state space models" in practice. So this doesn't really surprise me!

(To be explicit, my assertion is that, for a given latent space between layers N and N+1 in a Transformer model, that latent space encodes a set of state variables [think CPU registers] used by the Nth serial computation steps of an arbitrary set of learned algorithms — where these algorithms are limited to those where every computation step is possible to encode in the form of a fused-matmul-plus-vadd, such that the algorithm itself can be learned as a depthwise-extruded sequence of weights across the layers; and where the learned algorithms can and do share state variables, both as inputs and as outputs; and where these state variables are all attenuated by an activation probability [in a Transformer: attention] such that the algorithms' outputs form a pre-multiplied *conditional probability* of the output given the confidence of the inputs — in turn such that the same state variable can be a low-confidence output for one algorithm, and a high-confidence output for another algorithm, and the high-confidence component of the output will swamp the low-confidence output.)
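A deliberately tiny sketch of the "per-layer state update" picture in that comment, using only the primitives it lists (learned matmul, learned bias add, normalize, softmax). The shapes and names are made up, and this illustrates the commenter's mental model rather than anything from the paper.

    import numpy as np

    # One "layer step": the T x d latent matrix H plays the role of a bank of state
    # variables (the CPU-register analogy), updated by softmax-weighted mixing plus a
    # fused matmul + bias, then normalized for the next layer.
    def layer_step(H, Wq, Wk, Wv, W_out, b_out):
        Q, K, V = H @ Wq, H @ Wk, H @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)       # softmax: the "activation probability"
        mixed = A @ V                            # confidence-weighted mixing of state variables
        H_new = mixed @ W_out + b_out            # fused matmul + learned bias
        return (H_new - H_new.mean(-1, keepdims=True)) / (H_new.std(-1, keepdims=True) + 1e-6)

    rng = np.random.default_rng(1)
    T, d = 4, 8
    H = rng.normal(size=(T, d))
    Wq, Wk, Wv, W_out = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
    print(layer_step(H, Wq, Wk, Wv, W_out, np.zeros(d)).shape)  # (4, 8): same-shaped state for layer N+1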
dsalaj 7 months ago
Deep state-space models (Deep SSMs) have shown capabilities for in-context learning on autoregressive tasks, similar to transformers. However, the architectural requirements and mechanisms enabling this in recurrent networks remain unclear. This study demonstrates that state-space model architectures can perform gradient-based learning and use it for in-context learning.
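Read against the abstract above, the core claim is roughly the following equivalence, sketched here with made-up shapes (this is the standard "implicit linear regression" construction from the in-context-learning literature, not the paper's exact architecture): one gradient-descent step over the context pairs can be computed by a recurrent state that simply accumulates y_i * x_i token by token.

    import numpy as np

    # Context = (x_i, y_i) pairs from an unseen linear task; query = x_q.
    rng = np.random.default_rng(0)
    d, n = 4, 32
    w_task = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_task
    x_q = rng.normal(size=d)
    lr = 0.1

    # (a) One explicit gradient-descent step on the squared loss, starting from w = 0.
    w_one_step = lr * (X.T @ y) / n
    pred_gd = w_one_step @ x_q

    # (b) The same prediction via a recurrent state that accumulates y_i * x_i per token.
    state = np.zeros(d)
    for x_i, y_i in zip(X, y):
        state = state + y_i * x_i
    pred_recurrent = lr * (state / n) @ x_q

    print(np.isclose(pred_gd, pred_recurrent))  # True: identical predictions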