I would add the following two numbers if you're generating real-time text or speech for human consumption:<p>- Human Reading Speed (English): ~250 words per minute<p>- Human Speaking Speed (English): ~150 words per minute<p>These should be treated like the Doherty Threshold [1] for generative content.<p>[1] <a href="https://lawsofux.com/doherty-threshold/" rel="nofollow">https://lawsofux.com/doherty-threshold/</a>
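A rough tokens-per-second budget implied by those numbers, assuming the common ~1.3 tokens-per-word rule of thumb (the exact ratio depends on the tokenizer): roughly five tokens per second to keep up with reading and roughly three for speech.

    # Streaming-rate budget implied by human reading/speaking speed,
    # assuming ~1.3 tokens per English word (tokenizer dependent).
    tokens_per_word = 1.3
    for label, wpm in [("reading", 250), ("speaking", 150)]:
        print(f"{label}: ~{wpm * tokens_per_word / 60:.2f} tokens/s")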
> There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy there is too much loss of resolution.<p>I'm not sure this is accurate. From what I have seen, 8-bit quantization is usually fine, and even 4-bit is a viable tradeoff. Here are some benchmarks from TextSynth showing no significant degradation between 16 and 8 bit:<p><a href="https://textsynth.com/technology.html" rel="nofollow">https://textsynth.com/technology.html</a><p>8-bit uses half as much memory and doubles the throughput for limited quality loss.
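If anyone wants to try this, a minimal sketch of 8-bit loading via the transformers + bitsandbytes integration (the model name is a placeholder; assumes a CUDA GPU with bitsandbytes installed):

    # Minimal sketch: load a causal LM with 8-bit weights via bitsandbytes.
    # Requires `pip install transformers accelerate bitsandbytes` and a CUDA GPU.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model_id = "some-org/some-13b-model"  # placeholder
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    # Weight memory drops roughly 2x vs fp16; the quality/throughput tradeoff
    # itself is implementation dependent.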
> Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.<p>No, 4-bit quantization is the typical case.<p>At 4-bit you can fit twice the parameters of 8-bit in the same space, for far better performance/perplexity/quality.<p>Running LLMs higher than 4-bit is atypical and almost always sub-optimal (compared to running a model twice the size at 4-bit in the same memory).<p>Even pretraining and finetuning in 4-bit is likely to become the norm soon as fp4 becomes better understood.
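The weight-memory arithmetic behind that, as a quick sketch (weights only; the KV cache and activations are extra, and real 4-bit formats carry per-group scales, so actual files are a bit larger):

    # Back-of-the-envelope weight memory for a 13B-parameter model.
    params = 13e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")  # 26.0 / 13.0 / 6.5 GB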
> <i>~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens</i><p>MosaicML claims they trained a 7 billion parameter model on 1 trillion tokens with a budget of $200k.<p><a href="https://www.mosaicml.com/blog/mpt-7b" rel="nofollow">https://www.mosaicml.com/blog/mpt-7b</a><p>Does training cost scale linearly with model size and token count? If so, scaling MosaicML's figure up suggests roughly $200k × (13/7) × (1.4/1.0) ≈ $520k to train the 13 billion parameter model. (Still roughly the same order of magnitude.)
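A back-of-the-envelope version of that scaling, assuming cost is proportional to params × tokens (training compute ≈ 6·N·D FLOPs):

    # Naive linear scaling of MosaicML's reported ~$200k for 7B params on 1T tokens
    # up to 13B params on 1.4T tokens, assuming cost ~ params * tokens.
    base_cost, base_params, base_tokens = 200_000, 7e9, 1e12
    est = base_cost * (13e9 / base_params) * (1.4e12 / base_tokens)
    print(f"~${est:,.0f}")  # ~$520,000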
RANDOM THOUGHT:<p>I wonder when we're getting Docker for LLMs ... a Modelfile?<p>FROM "PAAMA/16b"<p>APPLY "MNO/DATASET"<p>Each layer could be a LoRA-adapter-like thing, maybe.<p>Maybe when AI chips are finally here.
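(For what it's worth, the closest thing today to applying a dataset/adapter as a layer is probably stacking a LoRA adapter on a base checkpoint with the peft library; a rough sketch, where the model and adapter names are placeholders:)

    # Rough sketch of the "adapter as a layer" idea with current tooling (peft + transformers).
    # Both repo names below are placeholders, echoing the hypothetical Modelfile above.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("PAAMA/16b")       # hypothetical base model
    model = PeftModel.from_pretrained(base, "MNO/dataset-lora")    # hypothetical LoRA adapter
    # A Modelfile would essentially declare these two steps plus the fine-tuning recipe.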
I think parts of the write-up are great.<p>There are some strong assumptions being made in parts of the gist:<p>> 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding<p>> 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries<p>I don't know how useful these numbers are if you take away the assumption that self-hosted will work as well as the API.<p>> 10x: Throughput improvement from batching LLM requests<p>I see that the write-up mentions memory being a caveat to this, but it also depends on the card specs: the memory bandwidth and TFLOPs offered by, say, a 4090 are superior to a 3090's while having the same amount of VRAM. Together with the token-length caveat mentioned in the gist itself, that makes the 10x claim not a useful rule of thumb.
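For reference on the batching number, the measurement is usually just running several prompts through one generate call; a minimal transformers sketch (the model name is only illustrative):

    # Minimal batched-generation sketch; the speedup comes from amortizing weight
    # reads across the batch, until memory and sequence length become the bottleneck.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "gpt2"  # illustrative; any causal LM works
    tok = AutoTokenizer.from_pretrained(model_id)
    tok.pad_token = tok.eos_token
    tok.padding_side = "left"  # left-pad so generation continues from the prompt end
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompts = ["The capital of France is", "2 + 2 =", "Once upon a time"]
    inputs = tok(prompts, return_tensors="pt", padding=True)
    out = model.generate(**inputs, max_new_tokens=16, pad_token_id=tok.eos_token_id)
    print(tok.batch_decode(out, skip_special_tokens=True))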
I think it would be helpful to add fine-tuning costs for an open-source model (think LLaMA to Alpaca).<p>From the phrasing around fine-tuning right now, it seems like it's using OpenAI's fine-tuning API to determine that cost, but it's not very clear.<p>This would also be helpful for other foundation models if it doesn't already exist - how much VRAM to run Stable Diffusion v2.1 at different resolutions, running Whisper or Bark for audio, etc.
How come the token to word ratio is smaller than 1 if tokens are either words or part of words? Shouldn't you expect <i>more</i> tokens than words?
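It depends which way the ratio is stated; easy to check empirically with tiktoken, where typical English prose comes out around 1.3 tokens per word (~0.75 words per token):

    # Quick empirical check of tokens vs. words with OpenAI's tiktoken library.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Numbers every LLM developer should know, in one handy list."
    n_tokens = len(enc.encode(text))
    n_words = len(text.split())
    print(n_tokens, n_words, round(n_words / n_tokens, 2))  # words per token, usually below 1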
I'm surprised not to see anything about data-to-parameter ratios for optimal scaling. My superficial understanding per the Chinchilla paper is to target 20 to 1.<p>I'm also confused about this:<p>> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens<p>This is apparently related to the LLaMA paper, but that paper seems to cite 1.0T tokens (rather than 1.4T tokens) for the 13B model. Also, if 20 to 1 is in fact the optimal data-to-parameter ratio, then using a roughly 100 to 1 ratio doesn't seem like an appropriate way to arrive at a magic number for training costs. The magic number should really be based on an optimal configuration. Or perhaps my superficial understanding here leads me to miss some important distinctions.
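For concreteness, the 20:1 rule of thumb applied to a 13B model, as a quick sketch:

    # Chinchilla-style back-of-the-envelope using the ~20 tokens-per-parameter rule of thumb.
    params = 13e9
    print(f"Chinchilla-optimal: ~{20 * params / 1e9:.0f}B tokens")        # ~260B tokens
    print(f"1.4T tokens is ~{1.4e12 / params:.0f} tokens per parameter")  # ~108:1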
> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens<p>The LLaMA paper mentions 135,168 A100-hours for training the 13 billion parameter model on 1 trillion tokens, which works out to ~$150k on Lambda Labs on-demand instances.
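Reproducing that estimate (the ~$1.10 per A100-hour on-demand price is my assumption for Lambda at the time):

    # LLaMA reports 135,168 A100-hours for the 13B model on 1T tokens;
    # assuming roughly $1.10 per A100-hour on demand:
    gpu_hours = 135_168
    print(f"~${gpu_hours * 1.10:,.0f}")  # ~$148,685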
I’m confused. If I am an LLM <i>developer</i>, why do I need to know the cost per token? That’s not the GPU cost; that’s a business decision from a company.<p>If I am an LLM <i>user</i>, maybe that’s relevant, but it's prone to going out of date, and I’m not going to use this page as the source of truth on it anyway.<p>Since the article seems to be targeted at developers who <i>use</i> LLMs to e.g. generate embeddings for semantic search, the title is about as accurate as calling a software engineer a “keyboard developer” because they use a keyboard.
> LLM developer<p>This is the first time I've heard this term, and when I Google "LLM developer" in an incognito tab on a different device, this article is one of the first results.<p>Seems like we should first establish what exactly an LLM developer is.<p>> When I was at Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know.<p>The personal plug and appeal to authority of "When I was at Google" is unnecessary. "Numbers every Engineer should know" is public and literally linked there. It's a weird way to start an engineering blog post and makes it feel like marketing of one's resume. Then again, I guess that's what most of these engineering blog posts are nowadays.<p>Indeed, Jeff Dean is a legend, and needing to add the "legendary engineer" qualifier detracts from that point. Let these things speak for themselves.