
Look Ma, No Bubbles: Designing a Low-Latency Megakernel for Llama-1B

232 points | by ljosifov | 7 days ago

11 comments

tbalsam | 7 days ago
This is (and was) the dream of Cerebras, and I am very glad to see it embraced, even if only in small part, on a GPU. Wild to see how much performance is left on the table for these things; it's crazy to think how much can be done by a few bold individuals when it comes to pushing the SOTA of these kinds of things (not just in kernels either -- in other areas as well!)

My experience has been that getting over the daunting factor of feeling afraid of a big wide world with a lot of noise and marketing, and simply committing to a problem, learning it, and slowly bootstrapping it over time, tends to yield phenomenal results in the long run for most applications. And if not, then there's often an applicable side field that can be pivoted to for still making immense/incredible progress.

The big players may have the advantage of scale, but there is so, so much that can be done still if you look around and keep a good feel for it. <3 :)
hardwaresofton | 7 days ago
Meta note but this paper is wonderfully written and incredibly approachable — excellent work by the authors.
ryao | 7 days ago
After presenting their numbers, they mention that CUDA graphs also do much of this, but then say that the launch time is higher for them. It would have been more interesting if they had included comparison numbers.

Without numbers, I am left wondering whether they omitted CUDA graph benchmarks due to a lack of effort, or because they actually did the benchmarks and did not want to admit that their approach was not as much of a performance advance as they portray it to be.
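For reference, this is roughly what the CUDA graph alternative looks like: a sequence of kernel launches is captured once and then replayed with a single launch per step. The kernel, loop count, and sizes below are placeholders rather than the paper's code; only the graph-capture API itself is standard CUDA.

```cpp
#include <cuda_runtime.h>

// Placeholder stand-in for one decoder layer's kernel.
__global__ void layer_kernel(float* x) { x[threadIdx.x] += 1.0f; }

int main() {
    const int num_layers = 16;
    float* x;
    cudaMalloc(&x, 256 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the whole forward pass once into a graph...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int layer = 0; layer < num_layers; ++layer)
        layer_kernel<<<1, 256, 0, stream>>>(x);   // recorded, not executed
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay it with one launch per decode step, paying a single
    // graph-launch overhead instead of num_layers separate kernel launches.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    cudaFree(x);
    return 0;
}
```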
kcorbitt | 7 days ago
It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice.

And of course the effect on throughput at larger batch sizes, which they allude to at the end.

Overall a very interesting result!
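To make the ratio argument concrete, here is a rough back-of-envelope sketch; every number in it is an illustrative assumption, not a measurement from the paper.

```cpp
#include <cstdio>

int main() {
    // Illustrative assumptions only: the idle "bubble" between consecutive
    // kernels, and per-kernel compute time for a small vs. a large model.
    const double gap_us          = 5.0;    // assumed launch/teardown gap
    const double small_kernel_us = 15.0;   // assumed compute per kernel, ~1B model
    const double large_kernel_us = 300.0;  // assumed compute per kernel, ~70B model

    // Fraction of wall-clock time lost to gaps between kernels.
    printf("small model overhead: %.0f%%\n",
           100.0 * gap_us / (gap_us + small_kernel_us));   // ~25%
    printf("large model overhead: %.0f%%\n",
           100.0 * gap_us / (gap_us + large_kernel_us));   // ~2%
    return 0;
}
```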
xixihaha | 5 days ago
Very bold direction and I love it. It looks like a lot of CUDA engineering expertise went into this. I'm wondering why the batch size was set to 1; I'd hope to see a comparison with real production workloads at larger batch sizes. I'm also wondering how to extend it to other models, like MoE with expert parallelism, given that a single CUDA kernel can't span multiple GPUs?
saagarjha | 7 days ago
The thing I find really disappointing about CUDA is that Nvidia could provide the synchronization primitives needed to do this easily, but they don't. Scheduling on their cores remains really dumb, even though I know there is a bunch of work being done behind the scenes to service whatever async warp-specialized matrix multiplication instruction they added in this generation. It's just that there's no way to access it directly and you have to use the little bespoke bits that get exposed in each generation :(
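For context, the closest thing CUDA currently exposes for cross-block coordination is grid-wide synchronization through cooperative groups. The sketch below is a generic illustration of that primitive, not the paper's megakernel, and it assumes the whole grid can be co-resident on the device (a requirement of cooperative launch).

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Two dependent "steps" fused into one kernel, separated by a grid-wide
// barrier instead of two separate kernel launches.
__global__ void fused_steps(float* data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] *= 2.0f;                 // step 1
    grid.sync();                                // every block reaches here first
    if (i < n) data[i] += data[(i + 1) % n];    // step 2 reads a neighbor's step-1 result
}

int main() {
    int n = 1 << 15;                            // small enough that all blocks fit at once
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    dim3 block(256), grid_dim((n + 255) / 256);
    void* args[] = { &d, &n };
    // Cooperative launch is required for grid.sync() to be valid.
    cudaLaunchCooperativeKernel((void*)fused_steps, grid_dim, block, args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```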
rudedogg | 7 days ago
On a similar note:

I wonder if we'll see OS-level services/daemons to try and lower the time to first token as these things get used more. And the interface for application developers will be a simple system prompt.

In some ways the idea sounds nice, but there would be a lot of downsides:

- Memory eaten up by potentially unused models

- Less compute available to software running specialized models for specific tasks
terhechte | 7 days ago
Would this also be possible with other LLM engines / GPUs? E.g. Llama / Apple Silicon or Radeon?
WhitneyLand | 7 days ago
Why all the trouble to speed things up while at the same time using bfloat16?
motomoto5188 | 5 days ago
Wondering how much this would improve prefill?
Stem0037 | 7 days ago
I wonder how much of this overhead (like the 250µs for activations/consistency on B200) could be further chipped away with even finer-grained control or different sync primitives.