As this article is from some weeks ago and Hugging Face has since implemented PagedAttention in text-generation-inference [1], I would assume the benchmark results would be quite different if done today. It would be very interesting to see more recent benchmarks if anyone has run any!

[1] https://github.com/huggingface/text-generation-inference/issues/478
I was a bit surprised by this:

> For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]

From the content I have been reading trying to understand LLMs, I thought that the output was a token and not a string of chars. What am I missing here?
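For what it's worth, here is a minimal sketch of what I'd expect (assuming the `transformers` library; "gpt2" is just an illustrative tokenizer choice, not necessarily what the article used):

    from transformers import AutoTokenizer

    # Illustrative sketch: GPT-2's BPE tokenizer, not the article's model.
    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("Sacramento"))
    # Prints a handful of subword pieces rather than ten single characters,
    # so each forward pass emits one token, which may span several chars.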
If you achieve 23x throughput without reducing the amount of work being done, does that mean GPU utilization was ~4% before?

Kinda hard to believe, but I have no intuition about this.
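The back-of-envelope version (a sketch, assuming the speedup comes entirely from previously idle capacity):

    # If the same work now finishes 23x faster on the same hardware,
    # and the gain comes purely from capacity that was sitting idle,
    # the implied prior utilization is at most 1/23 of the new level.
    speedup = 23
    print(f"{1 / speedup:.1%}")  # ~4.3%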
It's very hard to filter blatant self-promotion and marketing jargon from actual innovation (which moves at a much slower pace) these days.

The speed at which people are creating wrappers or making minor changes while reaching for words like "disruptive", "game-changing", and "democratising <something>" just feels inflated and boring at this point.

I hope this moves very soon into the next phase of the hype cycle [1], so that we can talk about something else.
[1] https://en.wikipedia.org/wiki/Gartner_hype_cycle