Actually, the only numbers every LLM developer should know are their accelerator specs.
For example:

A100 specs:

- 312e12 BF16 FLOPS
- 1555 GB/s (1555e9 B/s) HBM bandwidth

H100 specs:

- 1000e12 / 2000e12 BF16/INT8 FLOPS (apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle extremely quickly)
- 3000 GB/s HBM bandwidth

---

For a 13B model on an A100, this nets:

13e9 params * 2 bytes per param = 26 GB HBM required (at bf16)

26e9 / 1555e9 ≈ 17 ms/token small-batch latency (~60 tokens/second)
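Same arithmetic as a minimal Python sketch (constants are the A100 numbers above; the only modeled cost is streaming every weight from HBM once per token):

```python
# Memory-bound decode: each bf16 weight is read from HBM once per token,
# so the floor on per-token latency is weights_bytes / bandwidth.
PARAMS = 13e9          # 13B model
A100_HBM_BW = 1555e9   # bytes/s, from the spec above

weights_bytes = PARAMS * 2                  # 2 bytes/param at bf16
latency = weights_bytes / A100_HBM_BW       # seconds per token
print(f"{weights_bytes / 1e9:.0f} GB HBM")  # -> 26 GB
print(f"{latency * 1e3:.1f} ms/token")      # -> ~16.7 ms (rounded to 17 ms above)
print(f"{1 / latency:.0f} tokens/s")        # -> ~60 tokens/s
```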
What about large batches?

Latency for some batch size B is 13e9 params * 2 FLOPs per param * B / 312e12. We want the B at which we're just about no longer HBM-bound:

26e9/312e12 * B = 17 ms
<=> B = 17e-3 / (26e9/312e12)

giving a batch size of ~204.

At that batch size (and all larger batch sizes), the A100 delivers a throughput of B * 1/17ms ≈ 12,000 tokens/second.
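The crossover and the saturated throughput, in the same toy model (the exact numbers give ~201; the 204 above comes from rounding 16.7 ms up to 17 ms):

```python
# Compute-bound vs. memory-bound crossover on an A100, same toy model.
PARAMS, A100_FLOPS, A100_HBM_BW = 13e9, 312e12, 1555e9

stream_time = PARAMS * 2 / A100_HBM_BW  # ~16.7 ms: read every bf16 weight once
flops_time = PARAMS * 2 / A100_FLOPS    # compute time per sequence (2 FLOPs/param)

critical_batch = stream_time / flops_time
print(f"critical batch size ~{critical_batch:.0f}")  # -> ~201 (~204 with 17 ms)

# Past the crossover the GPU is compute-bound, so throughput saturates at
# FLOPS / (2 * params) tokens/s, independent of memory bandwidth:
print(f"~{A100_FLOPS / (2 * PARAMS):.0f} tokens/s")  # -> 12000
```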
---

KV caching, multi-GPU and multi-node comms, and matmul efficiencies are left as an exercise for the reader :)