
Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s

147 points by campers · 7 months ago

13 comments

simonw · 7 months ago
It turns out someone has written a plugin for my LLM CLI tool already: https://github.com/irthomasthomas/llm-cerebras

You need an API key - I got one from https://cloud.cerebras.ai/ but I'm not sure if there's a waiting list at the moment - then you can do this:

    pipx install llm   # or: brew install llm, or: uv tool install llm
    llm install llm-cerebras
    llm keys set cerebras
    # paste key here

Then you can run lightning fast prompts like this:

    llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'

Here's a video of that running, it's very speedy: https://static.simonwillison.net/static/2024/cerebras-is-fast.mp4
Comment #41943149 not loaded
Comment #41942605 not loaded
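The same plugin can also be driven from a script rather than the CLI, via llm's Python library API. A minimal sketch, assuming llm and llm-cerebras are installed and the key has been stored with `llm keys set cerebras` as above; the model id is the one from the CLI command, and the get_model/prompt calls follow llm's documented library interface:

    import llm  # pip install llm; the llm-cerebras plugin must also be installed

    # Uses the key stored earlier via `llm keys set cerebras`.
    model = llm.get_model("cerebras-llama3.1-70b")
    response = model.prompt("an epic tale of a walrus pirate")
    print(response.text())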
obviyus · 7 months ago
Wonder if they'll eventually release Whisper support. Groq has been great for transcribing 1hr+ calls at a significantly lower price compared to OpenAI ($0.36/hr on OpenAI vs. $0.04/hr on Groq).
Comment #41943124 not loaded
Comment #41943268 not loaded
maz1b · 7 months ago
Cerebras really has impressed me with their technicality and their approach in the modern LLM era. I hope they do well, as I've heard they are en route to an IPO. It will be interesting to see if they can make a dent vs NVIDIA and other players in this space.
Comment #41943174 not loaded
GavCo · 7 months ago
When Meta releases the quantized 70B it will give another >2X speedup with similar accuracy: https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/
Comment #41942745 not loaded
Comment #41942773 not loaded
asabla · 7 months ago
Damn, that's some impressive speed.

At that rate it doesn't matter if the first try resulted in an unwanted answer; you'll be able to run once or twice more in fast succession.

I hope their hardware stays relevant as this field continues to evolve.
Comment #41942503 not loaded
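To put a rough number on that point: a back-of-the-envelope sketch using the headline 2,100 tokens/s from the title, with an assumed answer length and retry count (both illustration values, not figures from the announcement):

    # Rough wall-clock cost of regenerating an answer a few times.
    throughput_tok_per_s = 2100   # headline Llama3.1-70B figure from the title
    answer_tokens = 300           # assumed typical answer length
    attempts = 3                  # first try plus two quick retries

    per_answer_s = answer_tokens / throughput_tok_per_s
    print(f"one answer: ~{per_answer_s:.2f}s; {attempts} attempts: ~{attempts * per_answer_s:.2f}s")

At those assumed sizes, even three full regenerations stay under half a second of generation time, ignoring network and prompt-processing overhead.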
d4rkp4ttern · 7 months ago
For those looking to easily build on top of this or other OpenAI-compatible LLM APIs -- you can have a look at Langroid[1] (I am the lead dev): you can easily switch to Cerebras (or Groq, or other LLMs/providers). E.g. after installing langroid in your virtual env, and setting up CEREBRAS_API_KEY in your env or .env file, you can run a simple chat example[2] like this:

    python3 examples/basic/chat.py -m cerebras/llama3.1-70b

Specifying the model and setting up a basic chat is simple (and there are numerous other examples in the examples folder in the repo):

    import langroid.language_models as lm
    import langroid as lr

    llm_config = lm.OpenAIGPTConfig(chat_model="cerebras/llama3.1-70b")
    agent = lr.ChatAgent(
        lr.ChatAgentConfig(llm=llm_config, system_message="Be helpful but concise")
    )
    task = lr.Task(agent)
    task.run()

[1] https://github.com/langroid/langroid
[2] https://github.com/langroid/langroid/blob/main/examples/basic/chat.py
[3] Guide to using Langroid with non-OpenAI LLM APIs: https://langroid.github.io/langroid/tutorials/local-llm-setup/
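For comparison, the same kind of call can be made directly with the plain openai Python client against an OpenAI-compatible endpoint. A minimal sketch, assuming the Cerebras endpoint is https://api.cerebras.ai/v1 and accepts the model id llama3.1-70b (both the base URL and model id are assumptions here, not taken from the comment above; check the provider docs for the exact values):

    import os
    from openai import OpenAI  # pip install openai

    # Assumed OpenAI-compatible endpoint and model id.
    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",
        api_key=os.environ["CEREBRAS_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="llama3.1-70b",
        messages=[
            {"role": "system", "content": "Be helpful but concise"},
            {"role": "user", "content": "an epic tale of a walrus pirate"},
        ],
    )
    print(resp.choices[0].message.content)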
fancyfredbot · 7 months ago
Wow, software is hard! Imagine an entire company working to build an insanely huge and expensive wafer-scale chip, and your super smart and highly motivated machine learning engineers get 1/3 of peak performance on their first attempt. When people say NVIDIA has no moat I'm going to remember this - partly because it does show that they do, and partly because it shows that with time the moat can probably be crossed...
Comment #41947401 not loaded
a2128 · 7 months ago
I wonder at what point increasing LLM throughput starts to serve only negative uses of AI. This is already 2 orders of magnitude faster than humans can read. Are there any significant legitimate uses beyond just spamming AI-generated SEO articles and fake Amazon books more quickly and cheaply?
Comment #41945793 not loaded
Comment #41943572 not loaded
odo1242 · 7 months ago
What made it so much faster based on just a software update?
Comment #41942673 not loaded
Comment #41942822 not loaded
Comment #41942732 not loaded
majke · 7 months ago
I wonder if there is a tokens/watt metric. AFAIU Cerebras uses plenty of power/cooling.
Comment #41943304 not loaded
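One way to frame such a metric is tokens per joule (or per watt-hour): headline throughput divided by system power draw. A back-of-the-envelope sketch; the 23 kW power figure below is a made-up placeholder, not a number from the announcement:

    # Rough energy-efficiency estimate; the power draw is a hypothetical value.
    throughput_tok_per_s = 2100   # headline Llama3.1-70B figure from the title
    system_power_w = 23_000       # assumed wafer-scale system draw (placeholder)

    tokens_per_joule = throughput_tok_per_s / system_power_w
    tokens_per_wh = tokens_per_joule * 3600
    print(f"~{tokens_per_joule:.2f} tokens/J (~{tokens_per_wh:.0f} tokens/Wh)")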
neals · 7 months ago
So what is inference?
Comment #41943366 not loaded
anonzzzies · 7 months ago
Demo, API?
Comment #41942516 not loaded
andrewstuart · 7 months ago
Could someone please bring Microsoft's BitNet into the discussion and explain how its performance relates to this announcement, if at all?

https://github.com/microsoft/BitNet

"bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices."
Comment #41942498 not loaded
Comment #41942496 not loaded