Hey, author of the blog post here. It's mentioned in the blog post, but one of the intentions of this repo is that it's more of a "tutorial" than it is a library/framework. My hope is that people will copy-paste and modify it for their own needs :)

Code can also be found here: https://github.com/pytorch-labs/gpt-fast

And a Twitter thread summary here: https://twitter.com/cHHillee/status/1730293330213531844
"This may sound implausible to many of you, considering how hard it is to write efficient matrix multiplication/attention kernels, and how much manpower has been put into CuBLAS and FlashAttention. The key here, however, is that transformer decoding has very unusual computational properties. In particular, because of the KV-cache, for BS=1 every single matrix multiplication in a transformer is actually a matrix vector multiplication."<p>How did they unlock this key ? In retrospect it seems so simple, but without the KV-cache this possibility would not have emerged at all. Hats off !
This is a great article. Regarding

> While these projects are performant, they often come with tradeoffs in ease of use, such as requiring model conversion to specific formats or building and shipping new dependencies.

I think it should be acknowledged that (at least IMO) PyTorch model formats are not very portable, and that's a big part of the problem. It would be nice to see the industry move towards a better format (gguf?) that can easily be ported between frameworks and doesn't leave you stuck using torch to load it. Likewise, PyTorch is a massive dependency to ship with a project, especially for simple inference, so while other projects do introduce new dependencies, those are often far lighter than PyTorch itself, particularly for inference code.
I'm wondering if gpt-fast has a version that can be run from Windows Command Prompt or PowerShell?

https://github.com/pytorch-labs/gpt-fast/issues/45
Hey, we're dealing with exactly this right now: the performance of running models from HF.
Yesterday we ran vicuna-13b-q8-gguf using llama.cpp on a vast.ai A40 (45GB VRAM).
It gave us 4 tokens/s generation rate.
This seems a bit slow for that GPU and a 13b model.
Does anyone know where the problem could be: llama.cpp, the GPU, the model, something else?

Also… where do all the people like us who build applications on top of HF models congregate?
Off-topic, but what software do they use to create the benchmark flamegraphs? I've been using cProfile with snakeviz, but I'm curious to try alternatives.
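For reference, the cProfile-to-snakeviz workflow mentioned above is roughly this generic recipe (nothing specific to the post's benchmarks):

    import cProfile

    def work():
        return sum(i * i for i in range(1_000_000))

    # Dump stats to a file, then visualize with:  snakeviz profile.out
    cProfile.run("work()", "profile.out")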
One of the notable tricks the various LLM serving frameworks provide is a special approach to batching: continuous, persistent, or in-flight batching, depending on the framework. At some level they each allow you to start a new generation while in the middle of one or more previous generations.

Is that possible with "just" PyTorch? Could it be added to gpt-fast?
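For what it's worth, the scheduling idea itself is framework-agnostic and small enough to sketch in plain Python (the names and structure below are made up for illustration, not any framework's actual API): the active batch is rebuilt between decode steps, so a new request can join while older ones are still mid-generation.

    from collections import deque

    def decode_step(active):
        # Stand-in for one forward pass over the active batch; each request
        # just counts down how many tokens it still wants.
        for req in active:
            req["remaining"] -= 1

    waiting = deque({"id": i, "remaining": n} for i, n in enumerate([3, 5, 2, 4]))
    active, max_batch = [], 2

    while waiting or active:
        # Admit new requests whenever a slot frees up, even mid-generation.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        decode_step(active)
        done = [r for r in active if r["remaining"] == 0]
        active = [r for r in active if r["remaining"] > 0]
        for r in done:
            print(f"request {r['id']} finished")

The hard parts in a real implementation are batching attention over sequences of different lengths and managing KV-cache memory per request, not the loop itself.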
What are some of the better use cases for fast inference? From my experience using ChatGPT, I don't need it to generate faster than I can read, but waiting for code generation is painful because I'm waiting for the whole code block to finish before it formats correctly and becomes available to copy or execute (in the case of Code Interpreter). Does anything else fall under this pattern?
If you were to serve this from a datacenter, is the client-to-server network roundtrip the slowest part of inference? Curious whether it would be faster to run this on cloud GPUs (better hardware, but farther away) or locally on worse hardware.
How does one select a good candidate for the draft model in speculative decoding? I imagine there's some better intuition than just selecting the next parameter count down (e.g. 70B -> 13B, 13B -> 7B).

Also, how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?
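On the mechanics rather than the draft-selection question: a greedy-only sketch of the speculative decoding loop is below, assuming `draft_model` and `target_model` are callables that map a token sequence to next-token logits (the full algorithm accepts/rejects via a rejection-sampling rule on probability ratios rather than exact argmax agreement).

    import torch

    @torch.no_grad()
    def speculative_step(target_model, draft_model, tokens, k=4):
        # 1. The small draft model proposes k tokens autoregressively; these
        #    k forward passes are cheap because the draft model is small.
        draft = tokens.clone()
        for _ in range(k):
            nxt = draft_model(draft).argmax(dim=-1)[..., -1:]
            draft = torch.cat([draft, nxt], dim=-1)

        # 2. The big target model scores all k proposals in a single call,
        #    so verifying k tokens costs roughly one regular decode step.
        target_choice = target_model(draft[..., :-1]).argmax(dim=-1)

        # 3. Accept the longest prefix where the target agrees with the draft,
        #    then take the target's own token at the first disagreement.
        n_prompt = tokens.shape[-1]
        accepted = tokens
        for i in range(k):
            proposed = draft[..., n_prompt + i]
            verified = target_choice[..., n_prompt + i - 1]
            accepted = torch.cat([accepted, verified[..., None]], dim=-1)
            if not torch.equal(proposed, verified):
                break
        return accepted

The draft-model choice then becomes a trade-off: a bigger draft agrees with the target more often (more tokens accepted per step), but each of its k proposal passes costs more, which is presumably why "next size down" is only a starting heuristic.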
Holy hotdogs, this looks amazing. So, ahh, I'll jump right to it: where can I run this online without having to do a bunch of work setting it up? I have several Python projects that could take advantage of this! (;