
Exponentially faster language modelling

301 points by born-jre over 1 year ago

21 comments

WithinReason over 1 year ago
Link to previous paper:

https://arxiv.org/abs/2308.14711

An attempt at a summary: they use a sigmoid function to make differentiable "soft" branches, and stack them to construct a binary tree, with the goal of only taking one branch at inference time (but training the whole tree), leading to log(W) instead of W inference cost. They gradually harden the branches so they become hard branches at the end of training.

A branch is computed as branch(input, N), with a neural network N computing a scalar c = N(input), then using a sigmoid to do a soft branch by returning the weighted sum of the recursive calls s(c) * branch(input, N_left) + (1 - s(c)) * branch(input, N_right) (the two weights s(c) and 1 - s(c) sum to 1). They only do "proper processing" at the leaf nodes.

Then they add a new loss term that encourages hard decisions by minimising the entropy of the Bernoulli distribution, making the two weights converge to 0 and 1, at which point only one branch needs to be taken at inference. They also state that this hardening often happens automatically, though.

It's a simple idea, but the loss formulation is nice: you usually want your loss terms to be a measure of information.
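A minimal sketch of the soft-branch recursion described above, in PyTorch. The node layout (dicts with 'w', 'left', 'right', 'leaf') and the bernoulli_entropy helper are my own illustrative assumptions, not the paper's actual code:

```python
import torch

def soft_branch(x, node):
    """Differentiable soft branch: training-time forward pass over the whole tree."""
    if 'leaf' in node:
        return node['leaf'](x)                 # "proper processing" happens only at leaves
    c = x @ node['w']                          # scalar decision value c = N(input)
    s = torch.sigmoid(c)                       # soft gate in (0, 1)
    # Weighted sum of both subtrees; the weights s and (1 - s) sum to 1.
    return s * soft_branch(x, node['left']) + (1 - s) * soft_branch(x, node['right'])

def bernoulli_entropy(s, eps=1e-8):
    """Extra loss term: minimising this pushes each gate toward a hard 0/1 decision."""
    return -(s * torch.log(s + eps) + (1 - s) * torch.log(1 - s + eps))
```

Once the gates have hardened, only one root-to-leaf path needs to be evaluated at inference, which is where the log(W) cost comes from.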
sdrg822 over 1 year ago
Cool. Important note:

"One may ask whether the conditionality introduced by the use of CMM does not make FFFs incompatible with the processes and hardware already in place for dense matrix multiplication and deep learning more broadly. In short, the answer is 'No, it does not, save for some increased caching complexity.'"

It's hard to beat the hardware lottery!
fgfm over 1 year ago
This approach feels like pruning, but the speedup is considerably higher. I'm curious how this will play out on more recent transformer architectures, though: I'd guess the speedup will matter most for the largest architectures, but even a 2x or 10x speedup on Mistral/Zephyr, Orca 2 or OpenChat 3.5 would be a tremendous achievement!
rsolva over 1 year ago
I find running 7B models on my 6-year-old small form factor HP EliteDesk to be fast enough for casual everyday use. If this speedup can be applied generally to commonly used models, I can serve a local ChatGPT experience for both friends and family from my tiny homelab in my basement.

*mind blown*
vorticalbox over 1 year ago
Hugging Face model:

https://huggingface.co/pbelcak/UltraFastBERT-1x11-long
baq over 1 year ago
Mix this with yesterday's matmul approximation (maddness) in HW for a casual... three orders of magnitude speed increase?
tokai over 1 year ago
Why not use the real title? It's short and precise.
millisecond over 1 year ago
Could this be applied to other models like Llama2 or Mistral?
itissid over 1 year ago
Another noob question: so a ~50% size reduction in BERT? Let's see if I am getting these numbers right. At inference time you need only a fraction of the neurons in the FF layer, chosen based on the input data and the previous dot product. Here is some quick math for BERT-Base, which has 110M params according to the original paper:

L (number of layers): 12 transformer blocks. H (hidden size): 768 units. A (number of attention heads): 12.

Embedding layers:
- WordPiece embeddings: 768 (hidden size) * 30,522 (vocab size) = 23,440,896 parameters
- Positional embeddings: 768 * 512 (max sequence length) = 393,216 parameters
- Segment embeddings: 768 * 2 (number of segments) = 1,536 parameters
- Total embedding parameters: 23,440,896 + 393,216 + 1,536 = 23,835,648

Each transformer block:
- Self-attention: each of the 12 heads has 768 / 12 = 64 units; the Query (Q), Key (K), Value (V) matrices are 3 * (64 * 768) = 147,456 parameters per head, so 147,456 * 12 = 1,769,472 across all heads
- Attention output layer: 768 * 768 = 589,824 parameters
- Feed-forward network (FFN): first layer 768 (input) * 3,072 (intermediate size) = 2,359,296; second layer 3,072 * 768 = 2,359,296; total 4,718,592 per block ---> this is the number to keep in mind
- Total per block: 1,769,472 (self-attention) + 589,824 (attention output) + 4,718,592 (FFN) = 7,077,888 parameters
- Total for 12 blocks: 7,077,888 * 12 = 84,934,656 parameters
- Layer norms and other small components add a relatively small number of parameters

Totals:
- Embeddings: 23,835,648
- Transformer blocks: 84,934,656
- Layer norms and others: a small number, bringing the total to around 110 million

So: 4.718M FFN params per block * 12 ≈ 56.6M out of ~110M params, which is a staggering ~50% of the model that is mostly skipped at inference time if you use only 0.3% of the FF neurons via FFF??
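A quick sanity check of that arithmetic (my own throwaway script; the 0.3% figure comes from the abstract, and layer norms/biases are ignored):

```python
# BERT-Base configuration from the numbers above.
H, L, VOCAB, MAX_LEN, FFN_DIM = 768, 12, 30_522, 512, 3_072

embeddings = H * VOCAB + H * MAX_LEN + H * 2   # 23,835,648
attn_qkv   = 3 * H * H                         # 1,769,472 (all 12 heads together)
attn_out   = H * H                             # 589,824
ffn        = 2 * H * FFN_DIM                   # 4,718,592 per block
per_block  = attn_qkv + attn_out + ffn         # 7,077,888
total      = embeddings + L * per_block        # ~108.8M (plus small layer-norm/bias terms)

ffn_total = L * ffn                            # ~56.6M
print(f"FFN share of all parameters: {ffn_total / total:.1%}")   # ~52%

# If only ~0.3% of FFN neurons are touched per token, the parameters
# actually used for one inference would be roughly:
used = embeddings + L * (attn_qkv + attn_out) + 0.003 * ffn_total
print(f"Params touched per token: ~{used / 1e6:.1f}M of {total / 1e6:.1f}M")
```

So yes: the FFN weights are roughly half of BERT-Base, and that is the half FFF mostly avoids touching at inference.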
itissid over 1 year ago
Noob question: so the idea is to load only specific branches (and, by extension, on the order of log(n) neurons), based on the input data, right? Would this be something a compiler would do with a JIT trick (because the input needs to be known to pick the right branch), issuing a call that loads the right neurons into memory (SIMD?) to do the feed-forward?
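One way the hard-branch inference could work in principle (a hypothetical sketch; the shapes and names are my own, not the repo's): descend the tree with cheap dot products, then index only the chosen leaf's weights:

```python
import torch

def fff_infer(x, node_w, leaf_w_in, leaf_w_out, depth):
    """Hard-branch fast-feedforward inference for a single token vector x of shape (H,).

    node_w:     (2**depth - 1, H)   decision weights, one per internal node
    leaf_w_in:  (2**depth, H, F)    per-leaf input projections
    leaf_w_out: (2**depth, F, H)    per-leaf output projections
    """
    idx = 0                                   # start at the root
    for _ in range(depth):
        c = node_w[idx] @ x                   # scalar decision value
        go_right = int(c > 0)                 # hardened branch: sigmoid(c) > 0.5 iff c > 0
        idx = 2 * idx + 1 + go_right          # descend to the left/right child
    leaf = idx - (2 ** depth - 1)             # convert node index to leaf index
    h = torch.relu(x @ leaf_w_in[leaf])       # only this leaf's weights are ever read
    return h @ leaf_w_out[leaf]
```

Under this reading, no recompilation per input is required: the "right neurons" are simply rows gathered from tensors that are already resident in memory.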
itissid over 1 year ago
For those not familiar with the BERT transformer arch: you can read their torch benchmark code to measure the speedup in just the FFF layer: https://github.com/pbelcak/UltraFastBERT/blob/main/benchmark_pytorch/main.py

Some of it is adapted from the PyTorch docs here: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html, e.g. the `timed` function and how they generate data.

Also, it's not just the same 12 neurons every time; it's the 12 neurons chosen based on the previous dot product. So some kind of JIT is needed to load the right ones?
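For reference, a timing helper in the spirit of the `timed` function from that torch.compile tutorial (reconstructed from memory as a sketch, not copied from either source):

```python
import time
import torch

def timed(fn):
    """Run fn once and return (result, seconds), synchronising CUDA when available."""
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = fn()
        end.record()
        torch.cuda.synchronize()
        return result, start.elapsed_time(end) / 1000.0   # elapsed_time is in ms
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0
```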
Klaster_1 over 1 year ago
What are the potential consequences? Does this open doors to faster edge inference or improved capabilities?
measured_step over 1 year ago
How would this scale for a use case like writing code? I could imagine that some inputs would require a large number of neurons. Would this architecture be able to handle that if it were scaled up?

I'm also curious whether this model architecture would achieve grokking of more complex concepts at scale.
jasonjmcghee over 1 year ago
Does anyone understand why they are using B x H instead of B x S x H?

Why are the context size and batch size represented as a single parameter?
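One possible explanation (an assumption on my part, not confirmed by the paper): the feedforward/FFF block acts on each token independently, so the batch and sequence dimensions can be flattened into one before it runs:

```python
import torch

B, S, H = 8, 128, 768
x = torch.randn(B, S, H)

tokens = x.reshape(B * S, H)     # the "B" in the paper's shapes would then really be B*S
y = feedforward(tokens)          # hypothetical per-token FFN/FFF module
y = y.reshape(B, S, H)           # restore the (batch, sequence, hidden) layout
```

Under that reading, context size and batch size collapse into a single "number of tokens" dimension.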
quickthrower2 over 1 year ago
If anyone is on the ball enough to turn this into a Colab or notebook, that would be appreciated! Would love to see the code.
dartos over 1 year ago
AFAIK GPU cores are quite slow with branching logic.

I wonder if the conditional in this would hurt performance at scale.
ndr over 1 year ago
Abstract:

> Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.

Conclusions:

> We present UltraFastBERT, a modified version of the (crammed)BERT architecture that uses fast feedforward instead of feedforward networks in its intermediate layers. UltraFastBERT serves as proof that large language models only really need to engage an exponential fraction of their parameters to perform individual inferences. UltraFastBERT-1x11, our deepest model with the highest promise of acceleration, uses only 0.3% of its neurons during inference and already achieves a 78x CPU speedup over the inference time of the corresponding feedforward layer. With a theoretical speedup promise of 341x at the scale of BERT-base models, we hope that our work will inspire an effort to implement primitives for conditional neural execution as a part of device programming interfaces.
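For anyone checking the quoted figures: the "1x11" naming suggests a binary tree of depth 11, and under that reading the 0.3% and 341x numbers fall out of simple counting (my own back-of-the-envelope arithmetic, not taken from the excerpt):

```python
depth = 11                       # the "1x11" in UltraFastBERT-1x11
neurons = 2 ** (depth + 1) - 1   # 4095 nodes in the full binary tree
used = depth + 1                 # 12 nodes on a single root-to-leaf path
print(used / neurons)            # ~0.003  -> the "0.3% of neurons" figure
print(neurons / used)            # ~341.25 -> the theoretical 341x speedup
```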
qntty over 1 year ago
According to scientists, we only use 0.3% of our neural networks. Imagine if we could use 100%.
OneOffAsk over 1 year ago
Is this similar to what iOS 17 uses for its new autocomplete?
matmulbro over 1 year ago
Timewaster.

All valuable AI research is secret now; they just churn out papers to waste time.
vouaobrasil over 1 year ago
This is rather scary. I feel we are witnessing the evolution of language models and artificial intelligence, which seems intellectually laudable until you realize that the underlying evolutionary framework for this evolution is the global capitalistic system, whose only criterion for selection is short-term monetary gain.

We are creating a monster.