Continuous batching to increase LLM inference throughput and reduce p50 latency

110 points by michellezzz, almost 2 years ago

5 comments

okalldal, almost 2 years ago
As this article is from some weeks ago and Huggingface has now implemented Paged Attention in text-generation-inference [1], I would assume the benchmark results would be quite different if done today. Would be very interesting to see more recent benchmarks if anyone has done any!

[1] https://github.com/huggingface/text-generation-inference/issues/478
sakex, almost 2 years ago
I was a bit surprised by this:

> For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]

From the content I have been reading trying to understand LLMs, I thought that the output was a token and not a string of chars. What am I missing here?
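That intuition is right for standard autoregressive decoding: each forward pass emits one token, and tokens are usually multi-character subword pieces, so the article's character-per-pass framing is a simplification. A minimal sketch with the tiktoken library (the GPT-2 encoding here is just an example vocabulary, not necessarily what the article's model uses):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("gpt2")

    # Tokens are subword pieces, not single characters: "Sacramento"
    # encodes to a small number of pieces, so a model needs that many
    # forward pass iterations to produce it -- not ten.
    ids = enc.encode("Sacramento")
    print(len(ids), [enc.decode([i]) for i in ids])

One forward pass per token is also part of why batching across requests pays off: each decode step reads the full model weights regardless of how many sequences share the step.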
pptr, almost 2 years ago
If you achieve 23x throughput without reducing the amount of work being done, does that mean GPU utilization was ~4% before?

Kinda hard to believe, but I have no intuition about this.
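Roughly, yes: the ~4% is just the reciprocal of the speedup. If the optimized system pushes 23x the requests through identical hardware with the same per-request compute, the baseline could have been keeping the GPU busy at most 1/23 of the time. A quick sanity check (this assumes throughput maps directly to compute utilization, ignoring memory-bandwidth limits and KV-cache overhead):

    # Implied upper bound on baseline GPU utilization, assuming the 23x
    # speedup comes purely from filling previously idle compute.
    speedup = 23.0
    print(f"implied baseline utilization <= {1.0 / speedup:.1%}")  # ~4.3%

In practice, small-batch decode is memory-bandwidth-bound, so very low arithmetic utilization at batch size 1 is common rather than surprising.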
hashtag-til, almost 2 years ago
It's very hard to filter blatant self-promotion and marketing jargon from actual innovation (which moves at a much slower pace) these days.

The speed at which people are creating wrappers or minor changes and branding them "disruptive", "game-changing", or "democratising <something>" just feels inflated and boring at this point.

I hope this gets into the next phase of the hype cycle [1] very soon, so we can talk about something else.

[1] https://en.wikipedia.org/wiki/Gartner_hype_cycle
gvd, almost 2 years ago
So what? TGI also supports this.