
Numbers every LLM Developer should know

95 points by davidwu, almost 2 years ago

8 comments

hiddencost · almost 2 years ago
This is honestly a bit gross, as it's just a marketing piece.

The original "numbers every programmer should know" is a profound piece of pedagogy, aimed at helping programmers be better at their craft.

This is just an excerpt from a pitch deck.
crashocaster · almost 2 years ago
Actually, the only numbers every LLM developer should know are their accelerator specs. For example:

A100 specs:

- 312e12 BF16 FLOPS

- 1555e9 B/s HBM bandwidth (i.e. 1555 GB/s)

H100:

- 1000e12/2000e12 BF16/INT8 FLOPS (apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle extremely quickly)

- 3000 GB/s HBM bandwidth

---

For a 13B model on an A100, this nets:

13e9 params * 2 bytes per param = 26 GB HBM required (at BF16)

26e9 / 1555e9 = 17 ms/token small-batch latency (~60 tokens/second)

What about large batches?

Latency at some batch size B is 13e9 params * 2 FLOPs per param * B / 312e12.

We want the B at which we're just about no longer HBM-bound: 26e9/312e12 * B = 17 ms, i.e. B = 17e-3 / (26e9/312e12), giving a batch size of 204.

At that batch size (and all larger batch sizes), the A100 delivers a throughput of B * 1/17ms = 12,000 tokens/second.

---

KV caching, multi-GPU and multi-node comms, and matmul efficiencies are left as an exercise for the reader :)
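Those roofline estimates are easy to reproduce in a few lines. Below is a minimal sketch of the arithmetic above, using the quoted A100 numbers; it assumes dense BF16 inference at 2 bytes and 2 FLOPs per parameter, and ignores the KV cache and communication, exactly as the comment does:

    # Roofline sketch for single-GPU LLM inference (A100, BF16).
    # Assumes 2 bytes/param for weights and 2 FLOPs/param/token;
    # ignores KV cache, activations, and multi-GPU communication.

    FLOPS = 312e12        # A100 dense BF16 FLOP/s
    HBM_BW = 1555e9       # A100 HBM bandwidth, bytes/s
    N_PARAMS = 13e9       # 13B-parameter model

    weight_bytes = N_PARAMS * 2              # 26 GB at BF16
    mem_latency = weight_bytes / HBM_BW      # time to stream weights once: ~17 ms

    def token_latency(batch_size: int) -> float:
        """Per-step latency: the max of memory-bound and compute-bound time."""
        compute = N_PARAMS * 2 * batch_size / FLOPS
        return max(mem_latency, compute)

    def throughput(batch_size: int) -> float:
        """Tokens/second across the whole batch."""
        return batch_size / token_latency(batch_size)

    # Crossover batch size where compute time equals memory time (~200):
    b_star = mem_latency / (N_PARAMS * 2 / FLOPS)

    print(f"weights: {weight_bytes/1e9:.0f} GB")
    print(f"small-batch latency: {mem_latency*1e3:.1f} ms/token "
          f"(~{1/mem_latency:.0f} tok/s)")
    print(f"crossover batch size: {b_star:.0f}")
    print(f"throughput at B=256: {throughput(256):.0f} tok/s")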
vikp · almost 2 years ago
I clicked because I thought they were defining "LLM developer" as "someone training LLMs", but instead they define it as "someone integrating LLMs into their application".

If you also had the same initial thought as me, this is an excellent article: https://blog.eleuther.ai/transformer-math/
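For readers who wanted the training-side math: the headline rule of thumb from that post is C ≈ 6PD training FLOPs for P parameters and D tokens. A small sketch of what it implies, with the GPU rate and utilization as illustrative assumptions:

    # Training-compute estimate via the C ≈ 6·P·D rule of thumb
    # (P = parameters, D = training tokens), from the transformer-math post.
    # Hardware numbers are assumptions: A100 BF16 peak with ~50% utilization.

    P = 13e9            # 13B-parameter model
    D = 260e9           # 260B training tokens (~20 tokens/param, assumed)
    PEAK = 312e12       # A100 dense BF16 FLOP/s
    MFU = 0.5           # assumed model FLOPs utilization

    flops = 6 * P * D                     # ≈ 2e22 FLOPs
    gpu_days = flops / (PEAK * MFU) / 86_400

    print(f"total compute: {flops:.2e} FLOPs")
    print(f"A100-days at {MFU:.0%} MFU: {gpu_days:,.0f}")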
marban · almost 2 years ago
This will not age well.
regecks · almost 2 years ago
What's this "neural information retrieval system" thing about?

I'm just hacking away and presenting the LLM with some JSON data from our metrics database and making it answer user questions as a completion.

Is this embedding thing relevant for what I'm doing? Where should I start reading?
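For later readers with the same question: "neural information retrieval" generally means embedding-based lookup — embed the documents once, embed each query, and put only the nearest matches into the prompt instead of the whole JSON dump. A minimal sketch of the idea, using sentence-transformers as one assumed embedding library (any embedding API follows the same shape):

    # Minimal embedding-retrieval sketch: find the metrics records most
    # relevant to a user question before handing them to the LLM.
    # Assumes the sentence-transformers package; any embedding API works.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Pretend these are rows serialized from the metrics database.
    docs = [
        "cpu_usage host=web-1 avg=91% window=5m",
        "disk_free host=db-2 value=12GB",
        "latency_p99 service=checkout value=840ms",
    ]

    doc_vecs = model.encode(docs, normalize_embeddings=True)  # embed once, offline

    def top_k(question: str, k: int = 2) -> list[str]:
        """Return the k documents whose embeddings are closest to the question."""
        q = model.encode([question], normalize_embeddings=True)[0]
        scores = doc_vecs @ q            # cosine similarity (vectors normalized)
        return [docs[i] for i in np.argsort(-scores)[:k]]

    # Only the retrieved rows go into the LLM prompt, not the whole database:
    context = "\n".join(top_k("Why is checkout slow?"))
    prompt = f"Answer using these metrics:\n{context}\n\nQuestion: Why is checkout slow?"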
fullstackchris · almost 2 years ago
I'm curious about the point on the embedding lookup cost... in my experience, for an embedding lookup to be accurate you have to include your entire document dataset to be queried against... obviously this can be just as expensive as querying a full cloud model if your dataset is very large. Interested if anyone has thoughts about this.
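One way to frame the cost question: embedding the corpus is a one-time indexing cost, and each query afterwards only pays for embedding the question plus a (locally cheap) vector search, whereas stuffing the corpus into the prompt pays per query. A back-of-the-envelope sketch with placeholder prices (assumptions, not any vendor's real rates):

    # Hypothetical cost comparison: embedding lookup vs. stuffing the corpus
    # into an LLM prompt. All prices are illustrative placeholders.
    EMBED_PRICE = 1e-7      # $/token to embed (assumed)
    LLM_PRICE = 2e-6        # $/input token for the generation model (assumed)

    corpus_tokens = 50_000_000   # 50M-token document set
    query_tokens = 50            # size of one user question

    # One-time: embed the whole corpus.
    index_cost = corpus_tokens * EMBED_PRICE

    # Per query: embed the question; the vector search itself runs on your own box.
    lookup_cost_per_query = query_tokens * EMBED_PRICE

    # Naive alternative: send the entire corpus as prompt context every query
    # (impossible in practice given context limits, but it sets the upper bound).
    naive_cost_per_query = corpus_tokens * LLM_PRICE

    print(f"one-time index: ${index_cost:.2f}")
    print(f"per-query lookup: ${lookup_cost_per_query:.6f}")
    print(f"per-query naive prompt: ${naive_cost_per_query:.2f}")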
Roark66 · almost 2 years ago
The main thing every LLM developer should know is that ARM will eat x86_64's lunch in ML. Why? Because of the shared/unified memory model. Apple's M2 Ultra can use up to 192 GB of RAM. Even your smartphone, thanks to this model, can run networks a lot bigger than you would expect.
Havoc · almost 2 years ago
Don't think I've ever heard anyone call it "GRAM" instead of VRAM.

Another cost-saving tip: on the API, do combo calls where possible to dual-use the input tokens, e.g.:

    """You are an AI assistant that summarizes text given.

    After the summarized text, add the word END.

    After that, answer the following questions with Yes or No:

    Is the text about Donald Trump?

    Is the text about Space?"""

Downside is you now need code to parse the output pieces and error handling around that.
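A minimal sketch of that parsing code, assuming the model honored the format (END as sentinel, then exactly two Yes/No lines); real outputs drift, so the errors are worth raising loudly:

    # Parse a combo-call response of the form:
    #   <summary text> END <Yes/No> <Yes/No>
    # Raises ValueError when the model drifted from the format.

    def parse_combo(response: str) -> tuple[str, list[bool]]:
        head, sep, tail = response.partition("END")
        if not sep:
            raise ValueError("sentinel END missing; model ignored the format")
        summary = head.strip()
        answers = []
        for line in tail.strip().splitlines():
            word = line.strip().rstrip(".").lower()
            if word in ("yes", "no"):
                answers.append(word == "yes")
        if len(answers) != 2:
            raise ValueError(f"expected 2 yes/no answers, got {len(answers)}")
        return summary, answers

    summary, (about_trump, about_space) = parse_combo(
        "The article covers a new rocket launch. END\nNo\nYes"
    )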