As this article is from some weeks ago and Hugging Face has since implemented PagedAttention in text-generation-inference [1], I would assume the benchmark results would be quite different if done today. It would be very interesting to see more recent benchmarks if anyone has run any!

[1] https://github.com/huggingface/text-generation-inference/issues/478
I was a bit surprised by this:

> For example, suppose you prompt with a sentence "What is the capital of California: ", it would take ten forward pass iterations to get back the full response of ["S", "a", "c", "r", "a", "m", "e", "n", "t", "o"]

From the content I have been reading trying to understand LLMs, I thought that the output was a token and not a string of chars. What am I missing here?
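For what it's worth, here is a minimal sketch of what I'd expect (assuming the `transformers` library; "gpt2" is just an illustrative tokenizer choice, not necessarily what the article used):

    from transformers import AutoTokenizer

    # Illustrative sketch: GPT-2's BPE tokenizer, not the article's model.
    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("Sacramento"))
    # Prints a handful of subword pieces rather than ten single characters,
    # so each forward pass emits one token, which may span several chars.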
If you achieve 23x throughput without reducing the amount of work being done, does that mean GPU utilization was ~4% before?

Kinda hard to believe, but I have no intuition about this.
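The back-of-envelope version (a sketch, assuming the speedup comes entirely from previously idle capacity):

    # If the same work now finishes 23x faster on the same hardware,
    # and the gain comes purely from capacity that was sitting idle,
    # the implied prior utilization is at most 1/23 of the new level.
    speedup = 23
    print(f"{1 / speedup:.1%}")  # ~4.3%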
It's very hard to filter blatant self-promotion and marketing jargon from actual innovation (which moves at a much slower pace) these days.

The speed at which people are creating wrappers or making minor changes while reaching for words like "disruptive", "game-changing", and "democratising <something>" just feels inflated and boring at this point.

I hope this moves very soon into the next phase of the hype cycle [1], so that we can talk about something else.
[1] https://en.wikipedia.org/wiki/Gartner_hype_cycle