Accelerating Generative AI with PyTorch II: GPT, Fast

306 points by polyrand, over 1 year ago

16 comments

chillee, over 1 year ago
Hey, author of the blog post here. It's mentioned in the blog post, but one of the intentions of this repo is that it's more of a "tutorial" than it is a library/framework. My hope is that people will copy-paste and modify it for their own needs :)

Code can also be found here: https://github.com/pytorch-labs/gpt-fast

And a twitter thread summary here: https://twitter.com/cHHillee/status/1730293330213531844

deepGem, over 1 year ago
"This may sound implausible to many of you, considering how hard it is to write efficient matrix multiplication/attention kernels, and how much manpower has been put into CuBLAS and FlashAttention. The key here, however, is that transformer decoding has very unusual computational properties. In particular, because of the KV-cache, for BS=1 every single matrix multiplication in a transformer is actually a matrix vector multiplication."

How did they unlock this key? In retrospect it seems so simple, but without the KV-cache this possibility would not have emerged at all. Hats off!

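To make the quoted point concrete, here is a minimal PyTorch sketch (an illustration written for this page, not code from the post): with a KV cache and batch size 1, each decode step pushes a single token's hidden state through the weights, so every projection below is a matrix-vector product.

```python
import torch

# Minimal sketch: with a KV cache and BS=1, the new token's hidden state is a
# single vector, so every weight multiplication below is a matrix-vector product.
d_model, n_heads, head_dim = 4096, 32, 128
wq = torch.randn(d_model, n_heads * head_dim)
wk = torch.randn(d_model, n_heads * head_dim)
wv = torch.randn(d_model, n_heads * head_dim)

# The KV cache holds keys/values for all previously generated positions.
k_cache = torch.zeros(0, n_heads, head_dim)
v_cache = torch.zeros(0, n_heads, head_dim)

def decode_step(x, k_cache, v_cache):
    # x: hidden state of the single new token, shape (d_model,)
    q = (x @ wq).view(n_heads, head_dim)        # matrix-vector product
    k = (x @ wk).view(1, n_heads, head_dim)     # matrix-vector product
    v = (x @ wv).view(1, n_heads, head_dim)     # matrix-vector product
    k_cache = torch.cat([k_cache, k], dim=0)    # append the new position to the cache
    v_cache = torch.cat([v_cache, v], dim=0)
    # Attend over the cached sequence: per-head scores of shape (seq, heads).
    scores = torch.einsum("hd,shd->sh", q, k_cache) / head_dim**0.5
    probs = scores.softmax(dim=0)
    out = torch.einsum("sh,shd->hd", probs, v_cache).reshape(-1)
    return out, k_cache, v_cache

x = torch.randn(d_model)
out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
```
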
andy99, over 1 year ago
This is a great article. Regarding

> While these projects are performant, they often come with tradeoffs in ease of use, such as requiring model conversion to specific formats or building and shipping new dependencies.

I think it should be acknowledged that (at least IMO) pytorch model formats are not very portable and this is a big part of the problem. It would be nice to see industry move towards a better format (gguf?) that can easily be ported between frameworks and not leave you stuck using torch to load it. Likewise, pytorch is a massive dependency to include with a project, especially for simple inference, so while other projects have new dependencies, they can often be a lot lighter than for a pytorch model, again particularly for inference code.

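As a hedged illustration of the "model conversion" tradeoff the quote mentions: the comment suggests gguf, but ONNX export is another common route to a framework-neutral artifact that inference runtimes can load without PyTorch. The TinyMLP model and file name below are placeholders, not anything from the post.

```python
import torch
import torch.nn as nn

# Illustration only: export a PyTorch model to ONNX so it can be served by
# runtimes (e.g. onnxruntime) without shipping PyTorch as a dependency.
class TinyMLP(nn.Module):
    def __init__(self, d_in=16, d_out=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

    def forward(self, x):
        return self.net(x)

model = TinyMLP().eval()
example_input = torch.randn(1, 16)

torch.onnx.export(
    model,
    example_input,
    "tiny_mlp.onnx",          # portable artifact; placeholder file name
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},
)
```
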
maxloo1976, over 1 year ago
I'm wondering if gpt-fast has a version that can be run from the Windows Command Prompt or PowerShell?

https://github.com/pytorch-labs/gpt-fast/issues/45

kosolam, over 1 year ago
Hey, we are dealing right now with exactly this: the performance of running models from HF. Yesterday we ran vicuna-13b-q8-gguf using llama.cpp on a vast.ai A40 with 45GB VRAM, and it gave us a generation rate of 4 tokens/s. That seems a bit slow for that GPU and a 13B model. Does anyone know where the problem could be: llama.cpp, the GPU, the model, something else?

Also… where do all the people like us who work on applications on top of HF models congregate?

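A rough way to sanity-check that number, in the spirit of the post's memory-bandwidth framing (the bandwidth and model-size figures below are assumptions, not measurements from this thread):

```python
# Rough sanity check: single-stream decoding is roughly memory-bandwidth bound,
# so an upper bound on tokens/s is bandwidth / bytes touched per token
# (approximately the size of the weights).
a40_bandwidth_gb_s = 696        # approximate NVIDIA A40 memory bandwidth
model_params_b = 13e9           # 13B-parameter model
bytes_per_param = 1             # q8 quantization ~ 1 byte per weight

weight_bytes_gb = model_params_b * bytes_per_param / 1e9
ceiling_tok_s = a40_bandwidth_gb_s / weight_bytes_gb
print(f"rough upper bound: {ceiling_tok_s:.0f} tokens/s")  # ~50 tokens/s

# Seeing only 4 tokens/s would then suggest the weights are not fully resident
# on the GPU (e.g. some layers left on CPU) or another bottleneck is in play.
```
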
sva_, over 1 year ago
Offtopic, but what software do they use to create the benchmark flamegraphs? I've been using cProfile with snakeviz, but am curious to try alternatives.

claytonjy, over 1 year ago
One of the notable tricks the various LLM serving frameworks provide is a special approach to batching: continuous, persistent, or in-flight batching, depending on the inference framework. At some level they each allow you to start a new generation while in the middle of one or more previous generations.

Is that possible with "just" pytorch? Could it be added to gpt-fast?

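As a toy illustration of the scheduling idea behind those batching schemes (this is not any framework's API, just a simulation of the admit-and-retire loop):

```python
import random
from collections import deque

# Toy illustration (no real model): the serving loop works one decode step at
# a time, so finished sequences can leave the batch and queued requests can
# join it between steps, instead of waiting for the whole batch to finish.
random.seed(0)

class Request:
    def __init__(self, rid, target_len):
        self.rid = rid
        self.generated = 0
        self.target_len = target_len   # stand-in for hitting an EOS token

waiting = deque(Request(i, random.randint(2, 6)) for i in range(8))
active, max_batch = [], 4

step = 0
while waiting or active:
    # Admit new requests into any free batch slots.
    while waiting and len(active) < max_batch:
        active.append(waiting.popleft())

    # One decode step for every active sequence (the batched forward pass).
    for req in active:
        req.generated += 1

    # Retire sequences that just finished; their slots free up immediately.
    finished = [r for r in active if r.generated >= r.target_len]
    active = [r for r in active if r.generated < r.target_len]

    step += 1
    if finished:
        print(f"step {step}: finished {[r.rid for r in finished]}, "
              f"active {[r.rid for r in active]}, waiting {len(waiting)}")
```
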
syrusakbary, over 1 year ago
I'd love to see how this compares against Llama.cpp on speed. Does anyone have any benchmarks that compare torch.compile vs Llama.cpp?

dnnssl2, over 1 year ago
What are some of the better use cases of fast inference? From my experience using ChatGPT, I don't need it to generate faster than I can read, but waiting for code generation is painful because I'm waiting for the whole code block to format correctly and become available to copy or execute (in the case of the code interpreter). Does anything else fall under this pattern?

dnnssl2, over 1 year ago
If you were to serve this from a datacenter, is the client-to-server network round trip the slowest part of the inference? Curious whether it would be faster to run this on cloud GPUs with better hardware but more distant compute, or locally with worse hardware.

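A back-of-the-envelope comparison, with every number below an assumption rather than a measurement, suggests the round trip is usually small next to generation time once responses are streamed:

```python
# All figures are assumptions for illustration: the network round trip is paid
# roughly once per request when tokens are streamed, while generation time
# grows with output length, so the GPU side dominates for longer completions.
rtt_ms = 50                           # assumed client<->datacenter round trip
remote_tok_s, local_tok_s = 100, 20   # assumed decode speeds
output_tokens = 500

remote_ms = rtt_ms + output_tokens / remote_tok_s * 1000
local_ms = output_tokens / local_tok_s * 1000
print(f"remote: {remote_ms:.0f} ms, local: {local_ms:.0f} ms")
# remote: 5050 ms, local: 25000 ms -> the 50 ms round trip is noise here
```
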
dnnssl2, over 1 year ago
How does one select a good candidate for the draft model in speculative decoding? I imagine that there's some better intuition than just selecting the next parameter count down (i.e. 70B -> 13B, 13B -> 7B).

Also, how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?

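For readers unfamiliar with the mechanics behind the question, here is a toy sketch of the propose-then-verify loop, simplified to greedy acceptance (real implementations, including the approach described in the post, use a rejection-sampling rule that preserves the target distribution). The acceptance rate in step 3 is what a good draft-model choice maximizes; the "models" below are stand-ins, not anything from gpt-fast.

```python
import torch

# Toy sketch of speculative decoding with greedy verification. Any callable
# mapping a token sequence to per-position next-token logits would work here.
VOCAB = 100

def toy_logits(tokens, seed_offset):
    # Deterministic per-sequence logits so the example is reproducible.
    g = torch.Generator().manual_seed(int(tokens.sum()) + seed_offset)
    return torch.randn(len(tokens), VOCAB, generator=g)

draft_model  = lambda toks: toy_logits(toks, seed_offset=1)  # stand-in "small" model
target_model = lambda toks: toy_logits(toks, seed_offset=2)  # stand-in "large" model

def speculative_step(prefix, k=4):
    # 1) The draft model proposes k tokens autoregressively (k cheap passes).
    draft = prefix.clone()
    for _ in range(k):
        next_tok = draft_model(draft)[-1].argmax().view(1)
        draft = torch.cat([draft, next_tok])

    # 2) The target model scores every proposed position in ONE forward pass.
    target_logits = target_model(draft)

    # 3) Accept proposals while they match the target's own choice; stop at
    #    the first mismatch and keep the target's token there instead.
    accepted = prefix.clone()
    for i in range(len(prefix), len(draft)):
        target_choice = target_logits[i - 1].argmax().view(1)
        accepted = torch.cat([accepted, target_choice])
        if target_choice.item() != draft[i].item():
            break
    return accepted

prefix = torch.tensor([1, 2, 3])
print(speculative_step(prefix))
```
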
two_in_one, over 1 year ago
Great work. Actually it's a sort of breakthrough, because it makes interesting things possible. (if it's not too late...)

xmichael909, over 1 year ago
Holy hotdogs, this looks amazing. So ahh, I'll jump right to it: where can I run this online without having to do a bunch of work setting it up? I have several python projects that could take advantage of this! (;

AmazingTurtle, over 1 year ago
240 tok/s is crazy

brucethemoose2, over 1 year ago
This is similar to exllamav2, and exllamav2's quantization is also excellent.

bicepjai, over 1 year ago
Can we get the same performance on an Apple GPU?