I would add the following two numbers if you're generating real-time text or speech for human consumption:<p>- Human Reading Speed (English): ~250 words per minute<p>- Human Speaking Speed (English): ~150 words per minute<p>These should be treated like the Doherty Threshold [1] for generative content.<p>[1] <a href="https://lawsofux.com/doherty-threshold/" rel="nofollow">https://lawsofux.com/doherty-threshold/</a>
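A rough tokens-per-second budget implied by those numbers, assuming the common ~1.3 tokens-per-word rule of thumb (the exact ratio depends on the tokenizer): roughly five tokens per second to keep up with reading and roughly three for speech.

    # Streaming-rate budget implied by human reading/speaking speed,
    # assuming ~1.3 tokens per English word (tokenizer dependent).
    tokens_per_word = 1.3
    for label, wpm in [("reading", 250), ("speaking", 150)]:
        print(f"{label}: ~{wpm * tokens_per_word / 60:.2f} tokens/s")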
> There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy there is too much loss of resolution.<p>I'm not sure this is accurate. From what I have seen, 8-bit quantization is usually fine, and even 4-bit is a viable tradeoff. Here are some benchmarks from TextSynth showing no significant degradation between 16 and 8 bit:<p><a href="https://textsynth.com/technology.html" rel="nofollow">https://textsynth.com/technology.html</a><p>8-bit uses half as much memory and doubles the throughput for limited quality loss.
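If anyone wants to try this, a minimal sketch of 8-bit loading via the transformers + bitsandbytes integration (the model name is a placeholder; assumes a CUDA GPU with bitsandbytes installed):

    # Minimal sketch: load a causal LM with 8-bit weights via bitsandbytes.
    # Requires `pip install transformers accelerate bitsandbytes` and a CUDA GPU.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model_id = "some-org/some-13b-model"  # placeholder
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    # Weight memory drops roughly 2x vs fp16; the quality/throughput tradeoff
    # itself is implementation dependent.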
> Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.<p>No, 4-bit quantization is the typical case.<p>At 4-bit you can fit twice the parameters of 8-bit in the same space, for far better performance/perplexity/quality.<p>Running LLMs higher than 4-bit is atypical and almost always sub-optimal (compared to running a model twice the size at 4-bit in the same memory).<p>Even pretraining and finetuning in 4-bit is likely to become the norm soon as fp4 becomes better understood.
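The weight-memory arithmetic behind that, as a quick sketch (weights only; the KV cache and activations are extra, and real 4-bit formats carry per-group scales, so actual files are a bit larger):

    # Back-of-the-envelope weight memory for a 13B-parameter model.
    params = 13e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")  # 26.0 / 13.0 / 6.5 GB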
> <i>~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens</i><p>MosaicML claims they trained a 7 billion parameter model on 1 trillion tokens with a budget of $200k.<p><a href="https://www.mosaicml.com/blog/mpt-7b" rel="nofollow">https://www.mosaicml.com/blog/mpt-7b</a><p>Does training cost scale linearly with model size and token count? If so, scaling MosaicML's figure up suggests roughly $200k × (13/7) × (1.4/1.0) ≈ $520k to train the 13 billion parameter model. (Still roughly the same order of magnitude.)
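A back-of-the-envelope version of that scaling, assuming cost is proportional to params × tokens (training compute ≈ 6·N·D FLOPs):

    # Naive linear scaling of MosaicML's reported ~$200k for 7B params on 1T tokens
    # up to 13B params on 1.4T tokens, assuming cost ~ params * tokens.
    base_cost, base_params, base_tokens = 200_000, 7e9, 1e12
    est = base_cost * (13e9 / base_params) * (1.4e12 / base_tokens)
    print(f"~${est:,.0f}")  # ~$520,000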
RANDOM THOUGHT:<p>I wonder when we're getting Docker for LLMs ... a Modelfile?<p>FROM "PAAMA/16b"<p>APPLY "MNO/DATASET"<p>Each layer could be a LoRA-adapter-like thing, maybe.<p>Maybe when AI chips are finally here.
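(For what it's worth, the closest thing today to applying a dataset/adapter as a layer is probably stacking a LoRA adapter on a base checkpoint with the peft library; a rough sketch, where the model and adapter names are placeholders:)

    # Rough sketch of the "adapter as a layer" idea with current tooling (peft + transformers).
    # Both repo names below are placeholders, echoing the hypothetical Modelfile above.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("PAAMA/16b")       # hypothetical base model
    model = PeftModel.from_pretrained(base, "MNO/dataset-lora")    # hypothetical LoRA adapter
    # A Modelfile would essentially declare these two steps plus the fine-tuning recipe.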
I think parts of the write-up are great.<p>There are some strong assumptions being made in parts of the gist:<p>> 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding<p>> 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries<p>I don't know how useful these numbers are if you take away the assumption that self-hosted will work as well as the API.<p>> 10x: Throughput improvement from batching LLM requests<p>I see that the write-up mentions memory being a caveat to this, but it also depends on the card specs: the memory bandwidth and TFLOPs offered by, say, a 4090 are superior to a 3090's while having the same amount of VRAM. Together with the token-length caveat mentioned in the gist itself, that makes the 10x claim not a useful rule of thumb.
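For reference on the batching number, the measurement is usually just running several prompts through one generate call; a minimal transformers sketch (the model name is only illustrative):

    # Minimal batched-generation sketch; the speedup comes from amortizing weight
    # reads across the batch, until memory and sequence length become the bottleneck.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "gpt2"  # illustrative; any causal LM works
    tok = AutoTokenizer.from_pretrained(model_id)
    tok.pad_token = tok.eos_token
    tok.padding_side = "left"  # left-pad so generation continues from the prompt end
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompts = ["The capital of France is", "2 + 2 =", "Once upon a time"]
    inputs = tok(prompts, return_tensors="pt", padding=True)
    out = model.generate(**inputs, max_new_tokens=16, pad_token_id=tok.eos_token_id)
    print(tok.batch_decode(out, skip_special_tokens=True))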
I think it would be helpful to add fine-tuning costs for an open-source model (think LLaMA to Alpaca).<p>From the phrasing around fine-tuning right now, it seems like it's using OpenAI's fine-tuning API to determine that cost, but it's not very clear.<p>This would also be helpful for other foundation models if it doesn't already exist - how much VRAM to run Stable Diffusion v2.1 at different resolutions, running Whisper or Bark for audio, etc.
How come the token to word ratio is smaller than 1 if tokens are either words or part of words? Shouldn't you expect <i>more</i> tokens than words?
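It depends which way the ratio is stated; easy to check empirically with tiktoken, where typical English prose comes out around 1.3 tokens per word (~0.75 words per token):

    # Quick empirical check of tokens vs. words with OpenAI's tiktoken library.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Numbers every LLM developer should know, in one handy list."
    n_tokens = len(enc.encode(text))
    n_words = len(text.split())
    print(n_tokens, n_words, round(n_words / n_tokens, 2))  # words per token, usually below 1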
I'm surprised not to see anything about data-to-parameter ratios for optimal scaling. My superficial understanding per the Chinchilla paper is to target 20 to 1.<p>I'm also confused about this:<p>> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens<p>This is apparently related to the LLaMA paper, but that paper seems to cite 1.0T tokens (rather than 1.4T tokens) for the 13B model. Also, if 20 to 1 is in fact the optimal data-to-parameter ratio, then using a roughly 100 to 1 ratio doesn't seem like an appropriate way to arrive at a magic number for training costs. The magic number should really be based on an optimal configuration. Or perhaps my superficial understanding here leads me to miss some important distinctions.
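For concreteness, the 20:1 rule of thumb applied to a 13B model, as a quick sketch:

    # Chinchilla-style back-of-the-envelope using the ~20 tokens-per-parameter rule of thumb.
    params = 13e9
    print(f"Chinchilla-optimal: ~{20 * params / 1e9:.0f}B tokens")        # ~260B tokens
    print(f"1.4T tokens is ~{1.4e12 / params:.0f} tokens per parameter")  # ~108:1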
> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens<p>The LLaMA paper mentions 135,168 A100-hours for training the 13 billion parameter model on 1 trillion tokens, which works out to ~$150k on Lambda Labs on-demand instances.
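Reproducing that estimate (the ~$1.10 per A100-hour on-demand price is my assumption for Lambda at the time):

    # LLaMA reports 135,168 A100-hours for the 13B model on 1T tokens;
    # assuming roughly $1.10 per A100-hour on demand:
    gpu_hours = 135_168
    print(f"~${gpu_hours * 1.10:,.0f}")  # ~$148,685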
I’m confused. If I am an LLM <i>developer</i>, why do I need to know the cost per token? That’s not the GPU cost; that’s a business decision from a company.<p>If I am an LLM <i>user</i>, maybe that’s relevant, but it's prone to going out of date, and I’m not going to use this page as the source of truth on it anyway.<p>Since the article seems to be targeted at developers who <i>use</i> LLMs to e.g. generate embeddings for semantic search, the title is about as accurate as calling a software engineer a “keyboard developer” because they use a keyboard.
> LLM developer<p>This is the first time I've heard this term, and when I Google "LLM developer" in an incognito tab on a different device, this article is one of the first results.<p>Seems like we should first establish what exactly an LLM developer is.<p>> When I was at Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know.<p>The personal plug and appeal to authority of "When I was at Google" is unnecessary. "Numbers every Engineer should know" is public and literally linked there. It's a weird way to start an engineering blog post and makes it feel like marketing of one's resume. Then again, I guess that's what most of these engineering blog posts are nowadays.<p>Indeed, Jeff Dean is a legend, and needing to add the "legendary engineer" qualifier detracts from that point. Let these things speak for themselves.