Hey, author of the blog post here. It's mentioned in the blog post, but one of the intentions of this repo is that it's more of a "tutorial" than it is a library/framework. My hope is that people will copy-paste and modify it for their own needs :)

Code can also be found here: https://github.com/pytorch-labs/gpt-fast

And a Twitter thread summary here: https://twitter.com/cHHillee/status/1730293330213531844
"This may sound implausible to many of you, considering how hard it is to write efficient matrix multiplication/attention kernels, and how much manpower has been put into CuBLAS and FlashAttention. The key here, however, is that transformer decoding has very unusual computational properties. In particular, because of the KV-cache, for BS=1 every single matrix multiplication in a transformer is actually a matrix vector multiplication."<p>How did they unlock this key ? In retrospect it seems so simple, but without the KV-cache this possibility would not have emerged at all. Hats off !
This is a great article. Regarding

> While these projects are performant, they often come with tradeoffs in ease of use, such as requiring model conversion to specific formats or building and shipping new dependencies.

I think it should be acknowledged that (at least IMO) PyTorch model formats are not very portable, and that's a big part of the problem. It would be nice to see the industry move towards a better format (gguf?) that can easily be ported between frameworks and doesn't leave you stuck using torch to load it. Likewise, PyTorch is a massive dependency to ship with a project, especially for simple inference, so while other projects do introduce new dependencies, those are often far lighter than PyTorch itself, particularly for inference code.
I'm wondering if gpt-fast has a version that can be run from Windows Command Prompt or PowerShell?

https://github.com/pytorch-labs/gpt-fast/issues/45
Hey, we're dealing with exactly this right now: the performance of running models from HF.
Yesterday we ran vicuna-13b-q8-gguf using llama.cpp on a vast.ai A40 (45GB VRAM).
It gave us 4 tokens/s generation rate.
This seems a bit slow for that GPU and a 13b model.
Does anyone know where the problem could be: llama.cpp, the GPU, the model, something else?

Also… where do all the people like us who build applications on top of HF models congregate?
Off-topic, but what software do they use to create the benchmark flamegraphs? I've been using cProfile with snakeviz, but I'm curious to try alternatives.
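For reference, the cProfile-to-snakeviz workflow mentioned above is roughly this generic recipe (nothing specific to the post's benchmarks):

    import cProfile

    def work():
        return sum(i * i for i in range(1_000_000))

    # Dump stats to a file, then visualize with:  snakeviz profile.out
    cProfile.run("work()", "profile.out")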
One of the notable tricks the various LLM serving frameworks provide is a special approach to batching: continuous, persistent, or in-flight batching, depending on the framework. At some level they each allow you to start a new generation while in the middle of one or more previous generations.

Is that possible with "just" PyTorch? Could it be added to gpt-fast?
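For what it's worth, the scheduling idea itself is framework-agnostic and small enough to sketch in plain Python (the names and structure below are made up for illustration, not any framework's actual API): the active batch is rebuilt between decode steps, so a new request can join while older ones are still mid-generation.

    from collections import deque

    def decode_step(active):
        # Stand-in for one forward pass over the active batch; each request
        # just counts down how many tokens it still wants.
        for req in active:
            req["remaining"] -= 1

    waiting = deque({"id": i, "remaining": n} for i, n in enumerate([3, 5, 2, 4]))
    active, max_batch = [], 2

    while waiting or active:
        # Admit new requests whenever a slot frees up, even mid-generation.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        decode_step(active)
        done = [r for r in active if r["remaining"] == 0]
        active = [r for r in active if r["remaining"] > 0]
        for r in done:
            print(f"request {r['id']} finished")

The hard parts in a real implementation are batching attention over sequences of different lengths and managing KV-cache memory per request, not the loop itself.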
What are some of the better use cases for fast inference? From my experience using ChatGPT, I don't need it to generate faster than I can read, but waiting for code generation is painful because I'm waiting for the whole code block to finish before it formats correctly and becomes available to copy or execute (in the case of Code Interpreter). Does anything else fall under this pattern?
If you were to serve this from a datacenter, is the client-to-server network roundtrip the slowest part of inference? Curious whether it would be faster to run this on cloud GPUs (better hardware, but farther away) or locally on worse hardware.
How does one select a good candidate for the draft model in speculative decoding? I imagine there's some better intuition than just selecting the next parameter count down (e.g. 70B -> 13B, 13B -> 7B).

Also, how does that interact with MoE models? Do you have a mini version of the MoE, with smaller experts?
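On the mechanics rather than the draft-selection question: a greedy-only sketch of the speculative decoding loop is below, assuming `draft_model` and `target_model` are callables that map a token sequence to next-token logits (the full algorithm accepts/rejects via a rejection-sampling rule on probability ratios rather than exact argmax agreement).

    import torch

    @torch.no_grad()
    def speculative_step(target_model, draft_model, tokens, k=4):
        # 1. The small draft model proposes k tokens autoregressively; these
        #    k forward passes are cheap because the draft model is small.
        draft = tokens.clone()
        for _ in range(k):
            nxt = draft_model(draft).argmax(dim=-1)[..., -1:]
            draft = torch.cat([draft, nxt], dim=-1)

        # 2. The big target model scores all k proposals in a single call,
        #    so verifying k tokens costs roughly one regular decode step.
        target_choice = target_model(draft[..., :-1]).argmax(dim=-1)

        # 3. Accept the longest prefix where the target agrees with the draft,
        #    then take the target's own token at the first disagreement.
        n_prompt = tokens.shape[-1]
        accepted = tokens
        for i in range(k):
            proposed = draft[..., n_prompt + i]
            verified = target_choice[..., n_prompt + i - 1]
            accepted = torch.cat([accepted, verified[..., None]], dim=-1)
            if not torch.equal(proposed, verified):
                break
        return accepted

The draft-model choice then becomes a trade-off: a bigger draft agrees with the target more often (more tokens accepted per step), but each of its k proposal passes costs more, which is presumably why "next size down" is only a starting heuristic.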
Holy hotdogs, this looks amazing. So, ahh, I'll jump right to it: where can I run this online without having to do a bunch of work setting it up? I have several Python projects that could take advantage of this! (;