
Scalable MatMul-Free Language Modeling

205 points by lykahb 11 months ago

12 comments

cpldcpu 11 months ago
The quantization approach is basically identical to the 1.58-bit LLM paper: https://arxiv.org/abs/2402.17764

The main addition of the new paper seems to be the implementation of optimized and fused kernels using Triton, as seen here: https://github.com/ridgerchu/matmulfreellm/blob/master/mmfreelm/ops/fusedbitnet.py

This is quite useful, as it should make training this type of LLM much more efficient.

So this is a ternary-weight LLM using quantization-aware training (QAT). The activations are quantized to 8 bits. The matmul is still there, but it is multiplying the 8-bit activations by one-bit values.

Quantization-aware training with low-bit weights seems to reduce overfitting through an intrinsic tendency to regularize. However, the model capacity should also be reduced compared to a model with the same number of weights and more bits per weight. It's quite possible that this only becomes apparent after the models have been trained on a significant number of tokens, as LLMs seem to be quite sparse.

Edit: In addition to the QAT, they also changed the model architecture to use a linear transformer to reduce reliance on multiplications in the attention mechanism. Thanks to logicchains for pointing this out.
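To make that concrete, here is a minimal sketch of absmean-style ternary weight quantization with per-token 8-bit activations, in the spirit of the BitNet b1.58 scheme. This is an illustration only, not the paper's Triton kernels, and all function names are made up:

```python
import torch
import torch.nn.functional as F

def quantize_weights_ternary(w: torch.Tensor) -> torch.Tensor:
    """Snap weights to {-1, 0, +1} times a per-tensor scale (absmean-style)."""
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

def quantize_activations_int8(x: torch.Tensor) -> torch.Tensor:
    """Symmetric per-token 8-bit quantization of activations (absmax-style)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    return (x / scale).round().clamp(-128, 127) * scale

def ternary_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """The matmul that remains: 8-bit activations times ternary weights."""
    return F.linear(quantize_activations_int8(x), quantize_weights_ternary(w))

# Example: a batch of 4 tokens through a 64 -> 128 projection.
x, w = torch.randn(4, 64), torch.randn(128, 64)
print(ternary_linear(x, w).shape)  # torch.Size([4, 128])
```

In QAT the full-precision weights remain the trainable parameters and are quantized on the fly like this during each forward pass.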
buildbot 11 months ago
Wow, this seems at first read to be really impressive work. They got scaling laws up to a reasonable size, 2.7B, and also ran a few downstream tasks. It would be interesting to see how a comparable model trained by someone else does, to check their scores against those.

They get real (61%!?) memory savings during training, and at inference too.

On top of all that, they then go and build an FPGA core programmed with a custom assembler. And their code is posted and works seamlessly with Hugging Face Transformers?! Absolutely going to test this out.
jph00 11 months ago
FYI, there was another matmul-free language model paper released a year ago: https://arxiv.org/abs/2305.17190
naasking 11 months ago
I feel like all of these transformer reductions to binary or ternary bits are basically constructing an implicit decision tree, where any stage of the process is basically answering a question with yes/no/I-don't-know answers, and "I don't know" basically invokes a continuation for further processing with more context.
WithinReason 11 months ago
Not sure if it's fair to call binary multiplication "multiplication free": you can express any multiplication as a sequence of additions/subtractions.
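As an aside, here is a tiny illustration (my own, with made-up names) of why a ternary matrix-vector product needs no multiply instructions at all: with every weight in {-1, 0, +1}, each output element is just a signed running sum.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product where every weight is -1, 0, or +1.

    Add the input where the weight is +1, subtract it where the weight is -1,
    skip zeros. No multiply instruction is needed.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
        out.append(acc)
    return out

# [[1, -1, 0], [0, 1, 1]] @ [2.0, 3.0, 5.0] -> [-1.0, 8.0]
print(ternary_matvec([[1, -1, 0], [0, 1, 1]], [2.0, 3.0, 5.0]))
```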
throwaway71271 11 months ago
The GitHub link in the paper: https://github.com/ridgerchu/matmulfreellm

It is super easy to try out: the 2.7B, 1.3B, and 0.37B models are on Hugging Face, and the generate.py example just works if you have Triton 2.2 installed.
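For anyone who wants to try it, loading a released checkpoint should look roughly like the sketch below. The exact model ID ("ridgerchu/MMfreeLM-2.7B") and the need to import mmfreelm first are assumptions based on the repository; the repo's generate.py is the authoritative example.

```python
# Rough sketch, not the repo's official example.
import mmfreelm  # assumed: registers the custom architecture with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "ridgerchu/MMfreeLM-2.7B"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).cuda().half()

inputs = tokenizer("In a shocking finding, ", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```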
amluto 11 months ago
One thing I didn't figure out from just the paper: how does one train these parameters that are not even approximately real numbers? Specifically, most of the parameters are ternary (i.e. -1, 0, or 1). The approximate gradient discussed in the paper will (I think) give some *real* gradient on each parameter, and that can be further processed by the learning rate schedule, but the result is still a real number g_i for each parameter a_i. Normally one would update a_i to a_i + g_i, but with these ternary parameters, a_i + g_i isn't ternary!

So what's the extra trick to make the model stay quantized? Does one evaluate the gradients on a whole bunch of training inputs, add them up, apply some randomness, and then re-quantize the model? Or is it something else?
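For what it's worth, the usual quantization-aware-training answer (not necessarily this paper's exact recipe) is that the ternary values are never the stored parameters: the optimizer applies a_i + g_i to a full-precision latent copy, and the ternary weights are re-derived from it on every forward pass via a straight-through estimator. A minimal sketch, with made-up names:

```python
import torch

# Latent full-precision "master" weights: these are what the optimizer updates.
latent = torch.randn(8, 8, requires_grad=True)
opt = torch.optim.SGD([latent], lr=0.1)

for _ in range(3):
    scale = latent.abs().mean().clamp(min=1e-5)
    ternary = (latent / scale).round().clamp(-1, 1)      # values in {-1, 0, +1}
    w = latent + (ternary * scale - latent).detach()     # straight-through estimator
    loss = (torch.randn(4, 8) @ w.t()).pow(2).mean()     # stand-in for the real loss
    loss.backward()                                      # real-valued gradient g_i on latent
    opt.step()                                           # a_i + g_i applied to the latent copy
    opt.zero_grad()

# Only at export/inference time are the weights snapped to ternary for good.
```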
sva_ 11 months ago
The FPGA in question, the Intel FPGA PAC D5005, seems to cost around $8k.
PaulHoule 11 months ago
This is why the NPU built into your processor could quickly become a liability instead of a benefit.
hisoka44 11 months ago
Has someone tried to do binary Hopfield networks like this, at an LLM-like massive scale?
nuz 11 months ago
Oh this is by the inventor of RWKV, cool
gabesullice 11 months ago
Reminds me of geohot's interview: https://youtu.be/wE1ZoMGIZHM