Llama Is Expensive

14 points by razcle · almost 2 years ago

3 comments

rvz · almost 2 years ago
> As a massive disclaimer, a reason to use LLama over gpt-3.5 is finetuning. In this post, we only explore cost and latency. I don't compare LLama-2 to GPT-4, as it is closer to a 3.5-level model. Given the discourse on twitter, it seems Llama-2 still trails behind gpt-3.5-turbo. Benchmark performance also supports this claim:

Well, one other massive disclaimer is that the author is backed by OpenAI's Startup Fund, which they failed to disclose in the post.

So of course they would speculate that. This post is essentially a paid marketing piece by OpenAI, who is the lead investor in Anysphere (creators of Cursor).
brucethemoose2 · almost 2 years ago
> We serve Llama on 2 80-GB A100 GPUs, as that is the minimum required to fit Llama in memory (with 16-bit precision)

Well, there is your problem.

LLaMA quantized to 4 bits fits in 40GB. And it gets similar throughput split between dual consumer GPUs, which likely means much better throughput on a single 40GB A100 (or a cheaper 48GB Pro GPU).

https://github.com/turboderp/exllama#dual-gpu-results

And this is without any consideration of batching (which I am not familiar with, TBH).

Also, I'm not sure which model was tested, but Llama 70B chat should have better performance than the base model if the prompting syntax is right. That was only reverse engineered from the Meta demo implementation recently.

There are other "perks" from Llama too, like manually adjusting various generation parameters, custom grammar during generation, and extended context.
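For context on the memory claim: 4-bit weights work out to roughly 70B parameters × 0.5 bytes ≈ 35 GB, which is why the model fits on a single 40 GB card. The sketch below shows one common way to do this with Hugging Face transformers and bitsandbytes (not the exllama loader linked above); the model id, access to the gated repo, and the generation settings are assumptions for illustration.

    # Minimal sketch: load Llama-2 70B with 4-bit weights via transformers + bitsandbytes.
    # Model id and generation settings are illustrative; the gated repo requires access.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-70b-chat-hf"

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # ~0.5 bytes per weight -> roughly 35 GB for 70B params
        bnb_4bit_compute_dtype=torch.float16,  # keep matmuls in fp16 even though weights are stored in 4-bit
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",                     # place layers on whatever GPUs are available
    )

    prompt = "[INST] Why does 4-bit quantization shrink the memory footprint? [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

The [INST] ... [/INST] wrapper is the chat-tuned prompt format the comment alludes to; getting that syntax wrong is one way the 70B chat model can end up looking weaker than it is.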
joefourier · almost 2 years ago
You don't have to run Llama 70B on a rented 2x A100 80GB, which is of course going to be quite pricey. Quantising it to 4-bit as brucethemoose2 mentioned allows you to run it on far cheaper hardware - it'll fit on a single A6000, which can be rented for as low as $0.44/h, 10x cheaper than the $4.42/h they mentioned for their 2x A100 80GB (speed might be impacted, but it shouldn't be 10x slower).

And if you're running it on your own machine, then the cost of using Llama is just your electricity bill - you can theoretically run it on 2x 3090, which are now quite cheap to buy, or on a CPU with enough RAM (but it will be very, very slow).
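A quick back-of-the-envelope check on those rental numbers (the throughput figures below are assumptions for illustration, not measurements from the thread): even if the single quantized GPU were considerably slower, the 10x difference in hourly rate still dominates the cost per generated token.

    # Rough cost-per-token comparison for the rental prices quoted above.
    # The tokens/second figures are assumptions, not benchmarks.
    def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
        """Dollars to generate one million tokens at a given rental price and throughput."""
        tokens_per_hour = tokens_per_second * 3600
        return hourly_rate_usd / tokens_per_hour * 1_000_000

    setups = {
        "2x A100 80GB, fp16  ($4.42/h)": (4.42, 20.0),  # assumed ~20 tok/s
        "1x A6000, 4-bit     ($0.44/h)": (0.44, 10.0),  # assumed ~10 tok/s (half as fast)
    }

    for name, (rate, tok_per_s) in setups.items():
        print(f"{name}: ${cost_per_million_tokens(rate, tok_per_s):.2f} per million tokens")

Under these assumptions the quantized A6000 comes out around 5x cheaper per token despite being half as fast, which is the commenter's point about the rental price gap outweighing any slowdown.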