
The Era of 1-bit LLMs: ternary parameters for cost-effective computing

1040 points · by fgfm · about 1 year ago

72 comments

cs702 · about 1 year ago
There are two findings I find *shocking* in this work:

* In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).

* In matrix multiplications (e.g., weights by vectors), we can replace the elementwise products in each dot product (a₁b₁ + a₂b₂ ...) with sign-dependent *additions*: each term becomes +bᵢ, −bᵢ, or 0, according to the ternary weight aᵢ. See the paper for exact details.

On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).

If the proposed methods are implemented in hardware, we will see *even greater gains* in compute and memory efficiency.

Wow.
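To make that concrete, here is a minimal, hypothetical sketch (plain Python, my own illustration, not the paper's kernel) of a dot product against ternary weights, where every multiply collapses into an add, a subtract, or a skip:

    # Illustration only: a dot product where each weight is -1, 0, or +1,
    # so no multiplications are needed -- only sign-dependent additions.
    def ternary_dot(weights, activations):
        acc = 0.0
        for w, a in zip(weights, activations):
            if w == 1:
                acc += a      # weight +1: add the activation
            elif w == -1:
                acc -= a      # weight -1: subtract the activation
            # weight 0: skip the term entirely
        return acc

    # [1, -1, 0] . [0.5, 2.0, 3.0] = 0.5 - 2.0 = -1.5
    print(ternary_dot([1, -1, 0], [0.5, 2.0, 3.0]))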
anon373839 · about 1 year ago
> BitNet b1.58 can match the performance of the full precision baseline starting from a 3B size. ... This demonstrates that BitNet b1.58 is a Pareto improvement over the state-of-the-art LLM models.

> BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.

> • 13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.

> • 30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.

> • 70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.

This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.

Does it seem at all likely that existing models could be converted?
osigurdson · about 1 year ago
I have often mused that, in some ways, it seems like the transistor is really being wasted in AI applications. We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range? Basically, re-think the role of the transistor and re-design from the ground up - maybe NAND gates are not the ideal fundamental building block here?
w-m · about 1 year ago
I was reading *Exposing Floating Point* today (as Airfoil is on the HN front page and I was perusing the archive of the author). It's a blog explaining the inner workings of floating point representations. About zero values it says [0]:

> Yes, the floating point standard specifies both +0.0 and −0.0. This concept is actually useful because it tells us from which "direction" the 0 was approached as a result of storing a value too small to be represented in a float. For instance -10e-30f / 10e30f won't fit in a float, however, it will produce the value of -0.0.

The authors of the LLM paper use the values {-1, 0, 1}. Connecting the two ideas, I'm now wondering whether having a 2-bit {-1, -0, 0, 1} representation might have any benefit over the proposed 1.58 bits. Could the additional -0 carry some pseudo-gradient information ("the 0 leaning towards the negative side")?

Also, I've seen 2-bit quantizations being proposed in other LLM quantization papers. What values are they using?

[0] https://ciechanow.ski/exposing-floating-point/#zero
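For anyone who wants to poke at the signed-zero behaviour the quote describes, a tiny illustration (Python floats are 64-bit doubles rather than 32-bit floats, so the exponents below are chosen to underflow a double; the principle is the same):

    # -1e-200 / 1e200 is about -1e-400: far too small for a double, so the
    # result underflows to zero, but the sign of the computation is kept.
    import math

    x = -1e-200 / 1e200
    print(x)                      # -0.0
    print(x == 0.0)               # True: -0.0 and +0.0 compare equal
    print(math.copysign(1.0, x))  # -1.0: the "direction" is still recoverable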
lucubratory · about 1 year ago
After reading the results I skipped back to the comment section to ask if this was real, because it looks a little too good to be true, but figured I should check the authors, and it's Microsoft Research and UCAS, so yeah, real. This is going to change a lot of things: obviously the edge computing applications they point out, but also this is going to bottom out the cost of providing high-performance LLMs in the cloud. I don't know what that means for the economics long term; naively, way lower costs maybe mean new entrants without an entire cloud available can compete more easily? I do wonder if something like this has already been found and implemented by either OpenAI or Google.
gojomo · about 1 year ago
That's not a 'bit' ("Binary digIT"). It's closer to a 'trit' ("TeRnary digIT"). Specifically, ternary digits spanning {-1, 0, 1} (rather than the usual {0, 1, 2} in a base-3 numbering system) are 'balanced ternary'.

A great intro to the theoretical reasons ternary might have some promise in computing is this 2001 article from 'American Scientist', "Third Base", which quotes Knuth calling balanced ternary "perhaps the prettiest numbering system of all" and also discusses an abortive Soviet effort in the direction of ternary computing:

http://web.archive.org/web/20011205185830/http://americanscientist.org/Issues/Comsci01/Compsci2001-11.html

In an aside, the article hints that *e*-nary digits (base 2.718…), if somehow made practical/meaningful, might actually be better than ternary (or perhaps even optimal?).

So maybe this paper's observation that ~"1.58 bits" (log2(3) binary digits) is a sweet spot could be further refined into some method for representing the state of an e-nary-modeled algorithm in log2(e) binary digits (~"1.44 bits") per underlying e-it.

(As it may be of renewed interest, I've also put this 2001 "American Scientist" base-3 intro up as a new HN submission for discussion: https://news.ycombinator.com/item?id=39541756)
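The arithmetic behind those figures, plus a toy balanced-ternary encoder, in a short illustrative snippet (my own sketch, not from the article or the paper):

    import math

    print(math.log2(3))        # ~1.585 bits of information per ternary digit
    print(math.log2(math.e))   # ~1.443, the analogous figure for a base-e digit

    def to_balanced_ternary(n):
        """Balanced-ternary digits {-1, 0, 1} of an integer, least significant first."""
        digits = []
        while n != 0:
            r = n % 3
            n //= 3
            if r == 2:      # a digit of 2 becomes -1 with a carry into the next place
                r = -1
                n += 1
            digits.append(r)
        return digits or [0]

    # 11 = 9 + 3 - 1, so the digits are [-1, 1, 1] (least significant first)
    print(to_balanced_ternary(11))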
ulnarkressty · about 1 year ago
Take this with a grain of salt until someone reproduces it. Improvements such as these require extraordinary evidence. Not to mention extreme quantization has been tried before.
tuananh · about 1 year ago
Major breakthrough in the LLM scene. It achieves performance and perplexity equivalent to full FP16 models of the same parameter size.

And you can fit a 120B model on a single card with 24GB of VRAM. This is mind blowing.
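A rough back-of-the-envelope check of that VRAM claim (weights only; this ignores activations, the KV cache, and packing overhead, so treat the numbers as indicative):

    import math

    params = 120e9
    fp16_gb    = params * 16 / 8 / 1e9               # 16 bits per weight
    ternary_gb = params * math.log2(3) / 8 / 1e9     # ~1.58 bits, ideal packing
    two_bit_gb = params * 2 / 8 / 1e9                # 2 bits, simple packing

    print(f"FP16:     {fp16_gb:.0f} GB")     # ~240 GB
    print(f"1.58-bit: {ternary_gb:.1f} GB")  # ~23.8 GB
    print(f"2-bit:    {two_bit_gb:.1f} GB")  # ~30 GB

Note that the 24GB figure only works with near-ideal trit packing; a naive 2-bits-per-weight layout would already overshoot a 24GB card.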
Klipper3 · about 1 year ago
The theoretical capacity of a binary network is 69% of the capacity of a full-weight network, so it makes sense that LLMs would converge to 1-bit networks in the long term.

It's nice to finally see practical networks reach the theoretical limits found in the statistical mechanics of Ising models. A good pointer to efficient 1-bit training, from the statistical mechanics point of view, is here:

https://www.pnas.org/doi/full/10.1073/pnas.0700324104
esha_manideep · about 1 year ago
These models will be compatible with llama.cpp out of the box. We (GigaML - https://gigaml.com) are planning to train a small model (3-4B, 1-bit, open source) with the latest stack-v2 dataset released today. Let me know if anyone is interested in collaborating with us.
fgfm · about 1 year ago
It's funny how discoveries in NLP & computer vision complement each other. The replacement of multiplications by additions made me think of the AdderNet paper (https://arxiv.org/abs/1912.13200), which concluded that you suffer almost no performance drop.

Perhaps the accumulators in current hardware cannot leverage this to its full potential, but combined with such strict quantization, this would open LLMs to the wider ML community much earlier than expected (when consumer hardware allows you to train near-SOTA LLMs from scratch on your own machine).
oxxoxoxooo · about 1 year ago
Prior art:

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
https://arxiv.org/abs/1602.02830

Ternary Neural Networks for Resource-Efficient AI Applications
https://arxiv.org/abs/1609.00222
alexey-salmin · about 1 year ago
Also from Microsoft, in 2021: Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance [1]

[1] https://www.microsoft.com/en-us/research/blog/make-every-feature-binary-a-135b-parameter-sparse-neural-network-for-massively-improved-search-relevance/
imjonse · about 1 year ago
Too bad there seem to be no pretrained models to download. This is not a quantization method to apply to existing models, so having the pretrained weights is needed if one wants to test it.
rapatel0 · about 1 year ago
The mathematics of BNNs are sound. The Shannon entropy of a word is really small (I vaguely remember ~2 bits). Also, all neural networks are ridiculously over-provisioned.

I worked on this 7 years ago, trying to efficiently binarize CNNs from existing models. The difficulty was getting training running without the losses going too high. I think that vision models will be much more difficult to binarize, but you might not need to with CLIP if the vision encoder stays in regular math (fp16, int8).
londons_explore · about 1 year ago
Powers of 3 don't pack well into binary memory...

A 1-bit multiplier in silicon is a single logic gate, but a ternary decoder to decode a packed tri-state 'weight' is bigger.

I therefore suspect that this method will be extended to make all weights simply 1 or 0 (i.e. binary). Perhaps that will be done by having half the weights take the values 1 or 0, while the other half are -1 or 0.
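Concretely, the standard packing trick gets you to 1.6 bits per weight, since five trits fit in a byte (3^5 = 243 ≤ 256); the snippet below is a toy sketch of what such a decoder has to do (illustration only; a real kernel would unpack whole vectors at once):

    # Pack/unpack five balanced-ternary digits {-1, 0, +1} per byte.
    def pack5(trits):
        assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
        value = 0
        for t in reversed(trits):
            value = value * 3 + (t + 1)   # map {-1, 0, 1} -> {0, 1, 2}
        return value                       # result is in 0..242

    def unpack5(byte):
        trits = []
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
        return trits

    w = [1, -1, 0, 0, 1]
    assert unpack5(pack5(w)) == w
    print(pack5(w))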
jdthedisciple · about 1 year ago
People were doing this 6 years ago:

    https://github.com/yashkant/quantized-nets
    https://github.com/TropComplique/trained-ternary-quantization
    https://github.com/buaabai/Ternary-Weights-Network

I too find it very interesting.

But why this sudden, renewed fuss?
dindobre · about 1 year ago
A refreshing paper, as machine learning papers go: simple explanation, easy to replicate, no alchemy-tier interpretations. Can't wait to see it replicated or disproved on real-life production tasks.
stormfather · about 1 year ago
How does backprop work here? I can't imagine flipping bits of everything upstream of an error is effective.
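Edit: as far as I can tell, the usual answer for low-bit networks (and my reading of how the BitNet line of work trains) is a straight-through estimator: keep a latent full-precision copy of every weight, quantize it on the forward pass, and let gradients update the latent copy as if the quantizer weren't there. A hypothetical NumPy sketch of that idea, not the paper's code:

    import numpy as np

    rng = np.random.default_rng(0)
    latent_w = rng.normal(size=4)      # full-precision "shadow" weights
    x = rng.normal(size=4)
    target, lr = 1.0, 0.1

    def ternarize(w):
        scale = np.abs(w).mean() + 1e-8           # absmean-style scale
        return np.clip(np.round(w / scale), -1, 1)

    for _ in range(100):
        w_q = ternarize(latent_w)      # forward pass sees only {-1, 0, +1}
        y = w_q @ x
        grad_y = 2 * (y - target)      # d(squared error)/dy
        grad_w = grad_y * x            # STE: treat d(w_q)/d(latent_w) as 1
        latent_w -= lr * grad_w        # update the latent floats, not w_q

    print(ternarize(latent_w), ternarize(latent_w) @ x)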
bilsbie · about 1 year ago
This really just sounds absurd. How can ternary possibly encode enough information?

Anyone willing to explain it like I'm a Django developer who watched half a Karpathy video?
naasking · about 1 year ago
Interesting return to ternary. Effectively, each weight says only whether it's correlated (+1), uncorrelated (0), or anti-correlated (-1) with the input, and the structure of the network is the actual computation over that information.
eigenvalue · about 1 year ago
Is it really so surprising that something like this works, given how human brain neurons work? My admittedly basic understanding is that these operate on an all-or-nothing principle for their action potentials (firing): they either fire or they don't, based on whether the input signals reach a certain threshold. So the output is already sort of binary in biological neurons. The inputs are more like continuous values, since they are the sum of many different neurons sending signals into each neuron, but in this paper the activations are 8-bit, not binary/ternary. Can any neuroscientists here comment?
joelthelion · about 1 year ago
Assuming this is confirmed, what's the impact on training?

Inference is definitely an issue for LLMs right now. But if training were suddenly possible for lone hackers (or maybe smaller companies), it would open up a lot of new possibilities as well.
nutate · about 1 year ago
Triggered by the use of 1-bit to describe a trit.
sp332 · about 1 year ago
1-bit LLMs remind me of a random forum post I read about SACD and the limitations of the 1-bit DSD audio format: https://www.audiosciencereview.com/forum/index.php?threads/dac-types-and-their-sonic-signature.7959/page-10#post-198394 Accumulating approximate values in one bit leads to being "constantly overloaded", with any error correction overwriting all of your real signal from the next step. I think this ternary system might leave enough room to avoid that problem.
smaddox · about 1 year ago
Damn. Well, I guess I'd better hurry up and write and publish a paper on the Ternary Neural Network research that I've been doing (part-time) for the last several months, before it all gets scooped.
raghavtoshniwal · about 1 year ago
Sooo, short Nvidia?
the8472 · about 1 year ago
What does it mean for future hardware if it's not using floating-point matrix multiplication units?
elromulous · about 1 year ago
So for the uninitiated (me), does this mean the input is not a float (i.e. it is quantized on input), such that all the math can be done with integer operations?

This seems almost too good to be true.

Edit: Answering my own question, yes. The details are in the original BitNet paper: https://arxiv.org/abs/2310.11453
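Roughly, as I read the BitNet papers, the two quantizers look like this: weights are ternarized with an "absmean" scale and activations go to 8-bit integers with an "absmax" scale, so the matmul itself can stay in integer arithmetic. A hedged sketch of that reading (my own code, not the authors'):

    import numpy as np

    def quantize_weights(w, eps=1e-8):
        gamma = np.abs(w).mean() + eps               # absmean scale
        w_q = np.clip(np.round(w / gamma), -1, 1)    # ternary weights
        return w_q.astype(np.int8), gamma

    def quantize_activations(x, bits=8, eps=1e-8):
        q = 2 ** (bits - 1) - 1                      # 127 for 8-bit
        scale = q / (np.abs(x).max() + eps)          # absmax scale
        x_q = np.clip(np.round(x * scale), -q, q)
        return x_q.astype(np.int8), scale

    w = np.random.randn(4, 8)
    x = np.random.randn(8)
    w_q, gamma = quantize_weights(w)
    x_q, s = quantize_activations(x)

    # Integer-friendly matmul, rescaled back to floats at the end
    y = (w_q.astype(np.int32) @ x_q.astype(np.int32)) * gamma / s
    print(y)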
ein0p · about 1 year ago
How is it a 1-bit LLM if 2 bits are required for each weight (and one of the 4 possible states is wasted in order to represent 0)?
Animats · about 1 year ago
Well, that's 2 bits, but still...

LLMs have gone from 32-bit floating point numbers down to 16- and 8-bit values. Now 2 bits. It's a hint as to how evolution did it. The basic component is simple and has very wide tolerances. There are just a lot of them. That's something biology can evolve.
rafaelero · about 1 year ago
Looks like we have finally rediscovered a biological neuron.
fl0ki · about 1 year ago
Would there be value in distinguishing -0 and +0? If a 0 was quantized from a small negative or a small positive, it seems like retaining the sign is better than forgetting it.<p>The question remains whether the benefit and the simpler design are worth the loss of density.
transfire · about 1 year ago
Shouldn’t that be “1-trit”?
BenoitEssiambre · about 1 year ago
Low-bit parameters are always talked about in terms of performance benefits, but I wonder whether allowing the LLM to combine parameters to represent values means it can select the resolution of each value, that is, use a kind of internal scientific notation to track the uncertainty of values. More low-bit parameters combined together can mean more precision and resolution; fewer can mean more uncertainty. This might allow the LLM to better calibrate the uncertainty of its knowledge in a Bayesian way, to prevent the hallucinations that come from the overconfidence you get from overfitting on too many bits.
bilsbie · about 1 year ago
How would you use this in something like PyTorch? There's no ternary data type.
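Edit: one workaround, sketched below. PyTorch has no trit dtype, so you can keep the ternary values in an int8 tensor (or pack them yourself) and cast for the matmul. This is only an illustration of the representation question, not an official API or the paper's kernel:

    import torch

    w_float = torch.randn(16, 32)
    gamma = w_float.abs().mean()
    # Ternarize and store compactly as int8 values in {-1, 0, 1}
    w_ternary = torch.clamp(torch.round(w_float / gamma), -1, 1).to(torch.int8)

    x = torch.randn(32)
    # int8 matmul support varies by backend, so cast up for the sketch:
    y = (w_ternary.to(torch.float32) @ x) * gamma
    print(w_ternary.unique())   # tensor([-1, 0, 1], dtype=torch.int8)
    print(y.shape)              # torch.Size([16])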
modeless · about 1 year ago
Maybe a silly question, but nonlinearity is important for neural nets. Wouldn't it make more sense for the three values to be e.g. (2, 0, -1), so they are not collinear?

Also, what are the prospects for FPGA implementations of this?
hoseja · about 1 year ago
Balanced ternary, my beloved.
Avisite · about 1 year ago
Does quantization need to be all or nothing? With the kind of low-bit models we have seen, my assumption would be that only certain weights would benefit from extra precision. A mixture of precisions, with 2-bit, 3-bit, up to 8-bit weights, might perform well, but I am unsure whether any training process could identify the weights that need the extra precision.
anon291 · about 1 year ago
This is something that's been tried many times before. 1-bit to 2-bit models and binary NNs have a long history.
ryeguy_24 · about 1 year ago
How does gradient descent work with these discrete ternary parameters? If you compute the partial derivative for a parameter, how do you determine how much to nudge the parameter when updating on backpropagation? Do you only update if the "nudging amount" meets a threshold?
jcarrano · about 1 year ago
Strictly speaking it should say "1-trit LLM", or, as they later mention, 1.58-bit.
karmasimida · about 1 year ago
This is exciting news. If the 8B numbers are true, we can already run a model like Mixtral 8x7B, even with a single GPU?

But further into the development, we need comparisons to larger model sizes. 70B might be too much to ask, but 13B should be there at least.
elijahbenizzy · about 1 year ago
There's an interesting mental model I've been toying with. At what point do LLMs just become circuit-shaped NNs with stochastic gradient descent backing them?

E.g. are we just determining the best program by rearranging 1s and 0s?
nborwankar · about 1 year ago
"Integer arithmetic is all you need"? NVIDIA stock: arrow up or down?
farhanhubble · about 1 year ago
What's the benefit of using a ternary encoding over just a binary representation? And if we have come this far, is there potential for a more efficient algorithm than gradient descent?
TriangleEdge · about 1 year ago
How do you train these? Or is it only for already-trained models?
simonvc · about 1 year ago
The paper talks about LLMs a lot, but would this result hold for all Transformers? Are ternary Transformers going to make things like Whisper faster/better?
bilsbie · about 1 year ago
Could there be some value in recognizing areas where the model needs finer-grained weights and somehow using a different data type just in those areas?
Blackthorn · about 1 year ago
Is there any rigorous way to answer the question of how much information (be it entropy or some other measure) is contained in a model's weights?
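Edit: one crude, hedged way to put a number on it for ternary weights is the Shannon entropy of the empirical {-1, 0, +1} distribution, which is at most log2(3) ≈ 1.58 bits per weight (less if the distribution is skewed). Treating weights as independent symbols ignores all the structure between them, so this is an upper bound on a narrow notion of information, not a full answer. A toy estimate:

    import numpy as np

    def entropy_bits(weights):
        _, counts = np.unique(weights, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    # Hypothetical weight distribution: half zeros, the rest split evenly
    w = np.random.choice([-1, 0, 1], size=100_000, p=[0.25, 0.5, 0.25])
    print(entropy_bits(w))   # ~1.5 bits/weight here; a uniform split gives ~1.585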
kouru225 · about 1 year ago
OK, can someone catch me up to speed on LLM hardware requirements? Last I looked, I needed a 20 GB VRAM card to run a good one. Is that not true anymore?
llm_trw · about 1 year ago
So are there any details on the algorithms they used for backprop? I'm not seeing any in the paper, other than "we used a lot of tokens".
superdisk · about 1 year ago
Is there anything about this specific to LLMs, or could you use it for any transformer-based model? It seems like they made a modified Transformer.
Mizza · about 1 year ago
I hope somebody gives this team access to the good data and a lot of crunch. I'd love to see what happens when you train the big fella.
wenyuanyu · about 1 year ago
If this turns out to be true, it could indeed be a game changer... Given the advanced AI chip shortage... Also, for the chip ban on China...
rossjudson · about 1 year ago
I predict Daniel Lemire will build the most efficient training and inferencing systems, close to theoretical performance limits.
lavp · about 1 year ago
What does "perform slightly better than Llama" mean exactly? A model like this needs to be trained from scratch, right?
dr_dshiv · about 1 year ago
Wondering if this might have any impact on the use of quantum computers in LLM training/distillation…
brunooliv · about 1 year ago
Do the implications at a practical level mean that the size of GGUF files will become smaller?
Havoc · about 1 year ago
If true, then I'm guessing this would make ASICs for this far simpler too, right?
K0IN · about 1 year ago
When can we expect the first ~100+ million parameter models to run on a Raspberry Pi Pico?
Alifatisk · about 1 year ago
If this paper (especially the results in Table 4) is true, then this is a game changer!
checker659 · about 1 year ago
If all the weights are either 1, 0 or -1, isn't this what biological neurons do?
yieldcrv · about 1 year ago
This is great. My employer just gave me an M1 laptop with only 16GB of RAM, and I had to downgrade my 7B-parameter local LLMs to 3-bit quantization; they've been surprisingly okay!

On my personal machine with 64GB of RAM, I usually use 8x7B at Q5 or 70B at Q4.

It's Mistral all the way down! Imagining a Q1.58 that's doing well makes me happy.
yousif_123123 · about 1 year ago
Any models published as well?
1ba9115454 · about 1 year ago
Ternary is all you need.
singularity2001 · about 1 year ago
So have we almost come full circle back to the binary spikes of human (and animal) brains?
klysm · about 1 year ago
Does this mean we can compile LLMs to run on FPGAs directly?
m3kw9 · about 1 year ago
How much of a waste is using Nvidia hardware for this?
leroman · about 1 year ago
Can someone versed in the ways of math explain how this is different from previous quantization methods?

And specifically, seeing how going from FP16 to 8-bit mostly gives the same perplexity while anything further seems to lose quality / dumb down the model, how is this even less precise method able to achieve this?
wenyuanyu · about 1 year ago
I wonder how the training process works...
arunk47 · about 1 year ago
Okay wait, can I train my own LLM yet?