I just want to say that this is one of the most impressive tech demos I’ve ever seen in my life, and I love that it’s truly an open demo that anyone can try without even signing up for an account or anything like that. It’s surreal to see the thing spitting out tokens at such a crazy rate when you’re used to watching them generate at less than one fifth that speed. I’m surprised you guys haven’t been swallowed up by Microsoft, Apple, or Google already for a huge premium.
The main problem with the Groq LPUs is that they don't have any HBM on them at all, just a minuscule (230 MiB) [0] amount of ultra-fast SRAM (20x faster than HBM3, to be clear). Which means you need ~256 LPUs (4 full server racks of compute, with 8x LPUs per unit and 8x units per rack) just to serve a single model [1], whereas you can get a single H200 (1/256th of that rack density) and serve these models reasonably well.<p>It might work well if you have a single model with lots of customers, but as soon as you need more than one model, plus lots of fine-tunes/high-rank LoRAs etc., these won't be usable. The same goes for any on-prem deployment, since the main advantage is consolidating people onto the same model, together.<p>[0]: <a href="https://wow.groq.com/groqcard-accelerator/" rel="nofollow">https://wow.groq.com/groqcard-accelerator/</a><p>[1]: <a href="https://twitter.com/tomjaguarpaw/status/1759615563586744334" rel="nofollow">https://twitter.com/tomjaguarpaw/status/1759615563586744334</a>
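For intuition, a rough back-of-envelope sketch in Python (my own assumptions, not Groq's published sizing: ~46.7B total parameters for Mixtral 8x7B, 8-bit weights, 230 MiB of SRAM per card, ignoring activations, KV cache and any duplication for pipelining):<p><pre><code>import math

# Hypothetical sizing: how many 230 MiB SRAM cards are needed just to hold
# the weights of a Mixtral-8x7B-class model entirely on-chip?
params = 46.7e9               # ~46.7B parameters (Mixtral 8x7B total)
bytes_per_param = 1           # assume 8-bit weights; FP16 would double this
sram_per_card = 230 * 2**20   # 230 MiB of SRAM per LPU card

weight_bytes = params * bytes_per_param
cards = math.ceil(weight_bytes / sram_per_card)
print(f"{weight_bytes / 2**30:.0f} GiB of weights -> at least {cards} cards")
# Roughly 190+ cards before any overhead, which is in the same ballpark as
# the ~256-LPU figure cited above.
</code></pre>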
Very impressive looking! Just wanted to caution that it's worth being a bit skeptical without benchmarks, as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds up the model at a cost to model quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.
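To make the quantization-for-speed concern concrete, here's a rough single-stream roofline sketch of my own (assumptions: decode is memory-bandwidth-bound at batch size 1, Mixtral 8x7B activates ~12.9B parameters per token, and an H100-class GPU has roughly 3.35 TB/s of HBM bandwidth):<p><pre><code># Rough upper bound: at batch size 1, every generated token streams the active
# weights from memory, so tok/s <= bandwidth / bytes read per token.
active_params = 12.9e9        # Mixtral 8x7B activates ~12.9B params per token
bandwidth = 3.35e12           # bytes/s, H100-class HBM (rough)
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    bytes_per_token = active_params * bits / 8
    print(f"{name}: <= ~{bandwidth / bytes_per_token:.0f} tok/s per stream")
# Roughly 130 / 260 / 520 tok/s, ignoring KV cache and overheads. That's why
# aggressive quantization is one way to hit these speeds, and why quality
# benchmarks matter alongside tok/s.
</code></pre>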
Hi folks, I work for Groq. Feel free to ask me any questions.<p>(If you check my HN post history you'll see I post a lot about Haskell. That's right, part of Groq's compilation pipeline is written in Haskell!)
Impressive demo!<p>However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?<p>Also, while the CNN Vapi demo was impressive as well, a few weeks ago here[1] someone shared <a href="https://smarterchild.chat/" rel="nofollow">https://smarterchild.chat/</a>. That also has _very_ low audio latency, making natural conversation possible. From that discussion it seems that <a href="https://www.sindarin.tech/" rel="nofollow">https://www.sindarin.tech/</a> is behind it. Do we know if they use Groq LPUs or something else?<p>I think that once you reach ~50 t/s, real-time interaction is possible. Anything higher than that is useful for generating large volumes of data quickly, but there are diminishing returns as it's far beyond what humans can process. Maybe such speeds would be useful for AI-AI communication, transferring knowledge/context, etc.<p>So an LPU product that's only focused on AI-human interaction could have much lower capabilities, and thus much lower cost, no?<p>[1]: <a href="https://news.ycombinator.com/item?id=39180237">https://news.ycombinator.com/item?id=39180237</a>
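As a rough sanity check on that ~50 t/s threshold (my own numbers: ~250 words per minute reading speed, ~1.3 tokens per word for English):<p><pre><code># Back-of-envelope: how many tokens/s can a human actually read?
words_per_minute = 250        # typical adult reading speed (rough)
tokens_per_word = 1.3         # common rule of thumb for English text
human_tok_per_s = words_per_minute * tokens_per_word / 60
print(f"~{human_tok_per_s:.1f} tokens/s")   # ~5.4 tok/s
# 50 t/s is already ~10x reading speed; beyond that the benefit shifts to
# skimming, multi-step agent pipelines, and machine-to-machine use.
</code></pre>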
It's only faster sometimes: when you ask it a complicated question or give it any kind of pre-prompt to speak in a different way, it still takes a while to respond. Interesting, but ultimately probably going to be a flop.
If the page can't access certain fonts, it fails to work and keeps retrying requests like these:<p><pre><code> https://fonts.gstatic.com/s/notosansarabic/[...]
https://fonts.gstatic.com/s/notosanshebrew/[...]
https://fonts.gstatic.com/s/notosanssc/[...]
</code></pre>
(I noticed this because my browser blocks these de facto trackers by default.)
Sorry, I'm a bit naïve about all of this.<p>Why is this impressive? Can this result not be achieved by throwing more compute at the problem to speed up responses? Isn't the fact that there is a queue when under load just indicative that there's a trade-off between "# of request to process per unit of time" and "amount of compute to put into a response to respond quicker"?<p><a href="https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/docs/source/blogs/media/TRT_LLM_v0-5-0_H100vA100_tps.png" rel="nofollow">https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/do...</a><p>This chart from NVIDIA implies their H100 runs llama v2 70B at >500 tok/s.
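One thing I may be getting wrong: numbers like that NVIDIA chart are presumably aggregate throughput across a large batch of concurrent requests, whereas this demo's figure is tokens/s delivered to a single user. A toy illustration (batch size is made up):<p><pre><code># Hypothetical: aggregate vs per-stream tokens/s.
aggregate_tok_per_s = 500   # e.g. an H100 serving Llama 2 70B across a batch
batch_size = 64             # hypothetical number of concurrent requests
per_user = aggregate_tok_per_s / batch_size
print(f"each user sees ~{per_user:.1f} tok/s")   # ~7.8 tok/s per stream
# Delivering 400-500 tok/s to a single stream (low latency, not just high
# aggregate throughput) is the harder problem.
</code></pre>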
Jaw dropping. Both Groq and Mixtral.<p>I used the following prompt:<p><pre><code> Generate a gitlab ci yaml file for a hybrid front-end/backend project. The frontend is under /frontend and is a node project, packaged with yarn, built with vite to the /backend/public folder. The backend is a python flask server.
</code></pre>
Never mind, I stand corrected. Blown tf away after trying the demo MYSELF. It's instantaneous; the last time I used an LLM this fast, it was a proprietary model with a small dataset, lightning fast but not smart enough. This is wild. But I don't understand why the demo was so bad, and why it took so long to respond to his questions.
Groq's API reaches close to this level of performance as well. We've benchmarked it over time and >400 tokens/s has been sustained; you can see it here <a href="https://artificialanalysis.ai/models/mixtral-8x7b-instruct" rel="nofollow">https://artificialanalysis.ai/models/mixtral-8x7b-instruct</a> (bottom of the page for the over-time view)
This is pretty sweet. The speed is nice, but what I really care about is you bringing the per-token cost down compared with models on the level of Mistral Medium/GPT-4. GPT-3.5 is pretty close in terms of cost/token, but the quality isn't there, and GPT-4 is overpriced. Having GPT-4 quality at sub-GPT-3.5 prices would enable a lot of things, though.
Hey y'all, we have a discord now for more discussion and announcements: <a href="https://discord.com/invite/TQcy5EBdCP" rel="nofollow">https://discord.com/invite/TQcy5EBdCP</a>
Incredible job. Feels dumb or obvious to say this, but this really changes the way I think of using it. The slow autoregression really sucks because it inhibits your ability to skim sections. For me, that creates an unnatural reading environment. This makes ChatGPT feel antiquated.
I have this hooked up experimentally to my universal Dungeon Master simulator DungeonGod and it seems to work quite well.<p>I had been using Together AI Mixtral (which is serving the Hermes Mixtrals) and it is pretty snappy, but nothing close to Groq. I think the next closest I've tested is Perplexity Labs Mixtral.<p>A key blocker in just hanging out a shingle for an open source AI project is the fear that anything that might scale will bankrupt you (or just go offline if you get any significant traction). I think we're nearing the phase where we could just turn these things "on" and eat the reasonable inference fees to see what people engage with, with a pretty decently cool free tier available.<p>I'd add that the simulator makes multiple calls to the API for one response, to do analysis and function selection in the underlying Python game engine, which Groq makes less of a problem as it's close to instant. This adds a pretty significant pause in the OpenAI version. Also, since this simulator runs on Discord with multiple users, I've had problems in the past with 'user response storms' where the AI couldn't keep up. That's also less of a problem with Groq.
Relevant thread from 5 months ago: <a href="https://news.ycombinator.com/item?id=37469434">https://news.ycombinator.com/item?id=37469434</a><p>I'm achieving consistent 450+ tokens/sec for Mixtral 8x7b 32k and ~200 tps for Llama 2 70B-4k.<p>As an aside, seeing that this is built with flutter Web, perhaps a mobile app is coming soon?
@tome
Cannot sign up with sneakemail.com, snkml.com, snkmail, liamekaens.com, etc. I pay for these services so my email is a bit more protected. Why do you insist on well-known email providers instead: data mining, or something else?
Is it normal that I have asked two networks (llama/mixtral) the same question ("tell me about most popular audio pitch detection algorithms") and they gave almost the same answer? Both answers start with "Sure, here are some of the most popular pitch detection algorithms used in audio signal processing" and end with "Each of these algorithms has its own strengths and weaknesses, and the choice of algorithm depends on the specific application and the characteristics of the input signal.". And the content is 95% the same. How can it be?
I always ask LLMs this:<p>> If I initially set a timer for 45 minutes but decided to make the total timer time 60 minutes when there's 5 minutes left in the initial 45, how much should I add to make it 60?<p>And they never get it correct.
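For reference, the answer I expect (the 5-minutes-left detail is a red herring, since only the totals matter):<p><pre><code>initial_total = 45    # minutes originally on the timer
desired_total = 60    # minutes the timer should total in the end
print(desired_total - initial_total)   # 15 minutes to add
# Adding 15 to the remaining 5 leaves 20 minutes on the clock:
# 40 already elapsed + 20 remaining = 60 minutes total.
</code></pre>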
To what extent is the API compatible with OpenAI's? Does it offer logprobs[0] and top_logprobs[1]?<p>0. <a href="https://platform.openai.com/docs/api-reference/chat/create#chat-create-logprobs" rel="nofollow">https://platform.openai.com/docs/api-reference/chat/create#c...</a><p>1. <a href="https://platform.openai.com/docs/api-reference/chat/create#chat-create-top_logprobs" rel="nofollow">https://platform.openai.com/docs/api-reference/chat/create#c...</a>
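For context, what I'd try first (the base URL and model name below are my guesses, not documented values) is pointing the official OpenAI Python client at the alternative endpoint and passing the same fields:<p><pre><code>from openai import OpenAI

# Hypothetical OpenAI-compatible usage; whether logprobs/top_logprobs are
# honoured server-side is exactly the open question.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="mixtral-8x7b-32768",   # assumed model identifier
    messages=[{"role": "user", "content": "Say hello"}],
    logprobs=True,
    top_logprobs=5,
)
print(resp.choices[0].logprobs)
</code></pre>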
Oh hell yes, this is the first "fast" one, superhuman fast.<p>I know you gave suggestions of what to ask, but I threw a few curveballs and it was really good! Well done, this is a big step forward.
As a virtual reality geek, this is super exciting because although there are numerous people experimenting with voicing NPCs with LLMs, they all have horrible latency and are unusable in practice. This looks like the first one that can actually potentially work for an application like that. I can see it won't be long before we can have open ended realistic conversations with "real" simulated people!
This is unreal. I have never seen anything this fast. How? I mean, how can you physically ship the bits this fast, let alone run an LLM?<p>Something about the UI doesn't work for me, though. Maybe I like the OpenAI chat interface too much. Can someone bring their own data and train? That would be crazy!
Congrats on the great demo, been a fan of Groq since I learned about TSP.
I'm surprised the LPU runs Mixtral this fast, because MoE's dynamic routing seems orthogonal to Groq's deterministic execution paradigm. Did Groq implement MegaBlocks-like kernels or other methods tailored for LPUs?
Where is the data center located? The fastest response time I could get from some quick testing from the northeast US, having it output just one letter, was 670ms. Just wondering if that's an expected result, as it's on par with or slower than GPT-3.5 via the API.
I think we’re kind of past the point where we post prompts because it’s interesting, but this one still had me thinking.<p>Obviously it doesn’t have memory, but it’s the first time I’ve seen a model actually respond instead of hedge (having mostly used ChatGPT).<p>> what is the longest prompt you have ever received?<p>> The length of a prompt can vary greatly, and it's not uncommon for me to receive prompts that are several sentences long. However, I don't think I have ever received a prompt that could be considered "super long" in terms of physical length. The majority of prompts I receive are concise and to the point, typically consisting of a single sentence or a short paragraph.
GPUs are notoriously bad at exploiting sparsity. I wonder if this architecture can do a better job. For the Groq engineers in this thread: if a neural network had, say, 60% of its weights set to 0, what would that do to cost and speed on your hardware?
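For a baseline, the hardware-independent arithmetic is straightforward; the interesting part is how much of it an LPU's static schedule can actually capture:<p><pre><code># Theoretical best case for skipping zero weights.
sparsity = 0.60                   # 60% of weights are zero
flops_fraction = 1.0 - sparsity   # only 40% of multiply-adds are strictly needed
print(f"ideal speedup from skipping zeros: {1 / flops_fraction:.2f}x")   # 2.5x
# Dense hardware typically realizes little of this for unstructured sparsity;
# structured sparsity (pruning whole blocks) is usually needed in practice.
</code></pre>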
Does this make it practical to run LLMs on mobile devices? I wonder about the power consumption and whether it could make sense to integrate it into some future mobile devices. Or maybe dedicated storage, RAM and processing cores as a USB-C add-on? A case with an integrated battery and this chip?<p>I'm dreaming of having LLMs on anything. Unlike the "bluetooth on everything" craze, this could be practical, as every device could become smart. Remember how some British researchers made a self-driving car using an LLM? A toaster anticipating how to cook when you describe what you want would actually be an improvement.
Very impressive! I am even more impressed by the API pricing though: $0.27/1M tokens seems like an order of magnitude cheaper than the GPT-3.5 API, and two orders of magnitude cheaper than GPT-4? Am I missing something here?
The Groq demo was indeed impressive. I work with LLMs a lot at work, and a generation speed of 500+ tokens/s would definitely change how we use these products (especially considering it's an early-stage product).<p>But the "completely novel silicon architecture" and the "self-developed LPU" (claiming not to use GPUs)... make me a bit skeptical. After all, pure speed might be achievable through stacking computational power and model quantization. Shouldn't innovation at the silicon level be quite challenging, especially to achieve such groundbreaking speeds?
This is very impressive, but whilst it was very fast with Mixtral yesterday, today I waited 59.44s for a response. If I were to use your API, end-to-end time would be much more important to me than the Output Tokens Throughput and Time to First Token metrics. Will you also publish average/minimum/maximum end-to-end times?
The demo is pretty cool, but the mobile interface could be a parody of bad interface design. The text box at the top is hard to reach, whether you want to open the keyboard (which closes automatically) or press the button to send the question, and the chat history is out of chronological order for no logical reason.<p>Edit: Text selection is also broken.
How come the answers for Mixtral 8x7B-32k and Llama 2 70B-4k are identical?<p>After asking via Mixtral a couple of questions I switched to Llama, and while it shows Llama as the Model used for the response, the answer is identical.<p>See first and last question:<p><a href="https://pastebin.com/ZQV10C8Q" rel="nofollow">https://pastebin.com/ZQV10C8Q</a>
How is the tokens/second figure calculated? I asked it a simple prompt and the model generated a 150-word (about 300 tokens?) answer in 17 seconds, yet it reported a speed of 408 T/s.<p>Also, I guess this demo would feel real-time if you could stream the outputs to the UI? Can this be done in your current setup?
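My guess at the discrepancy (not an official explanation): the reported number likely covers only the model's generation time, excluding queueing and network latency, and 150 English words is usually closer to ~200 tokens than 300. Roughly:<p><pre><code># End-to-end rate seen by the user vs. the server's generation-only rate.
tokens = 150 * 1.3       # ~195 tokens for a 150-word answer (rule of thumb)
wall_clock_s = 17        # total time waited, including any queueing
print(f"end-to-end: ~{tokens / wall_clock_s:.0f} tok/s")          # ~11 tok/s
print(f"implied generation time at 408 T/s: {tokens / 408:.2f} s")
# i.e. ~0.5 s of actual generation, with the rest spent queued or in transit.
</code></pre>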
It is fast, but if it spits out useless garbage, it's still useless. I don't mind waiting for ChatGPT; the quality of what it produces is quite remarkable, and I am excited to see it get better. I think this has more to do with the Mistral models vs. GPT-4 than with Groq. If Groq can host GPT-4, wow, that would be amazing.
This is extremely impressive - no login, extremely fast, and Mixtral quality is very good. It's already more useful than my (paid) GPT4 for many things due to the speed.
Just a minor gripe: the bullet option doesn't seem to be logical.<p>When I asked about Marco Polo's travels and used Modify to add bullets, it added China, Pakistan etc. as children of Iran, and the same for the other paragraphs.
I asked it "What is the oldest known written recipe" and then asked for that recipe. It seemed to repeat the same response a bunch of times (very quickly!) before stopping. Is this abnormal behavior?
It's impressive, but I have one problem with all of those models. I wanted them to answer what Mixtral or Llama2 are, but with no luck. It would be great if models could at least describe themselves.
This is <i>super</i> impressive. The rate of iteration and innovation in this space means that just as I'm feeling jaded/bored/oversaturated - some new project makes my jaw drop again.
Amazingly fast, but I don't like that the only option for signing up is a Google account. Are you planning to implement some simpler authentication, maybe using just an email address?
O M G<p>It is fast, like instant. It is straight to the point compared to the others. It answered a few of my programming questions asking for particular code and passed with flying colors.<p>Conclusion: shut up and take my money
I wish there were an option to export the whole chat session, say as plain text or a link to some pastebin; the chat I just had with Groq would really impress some people I know.
Really impressive work! I wonder how easy it would be to support (a future open source version of) SORA using Groq's design. Will there be a Video Processing Unit (VPU)?
I (only) ran a couple of prompts but I am impressed. It has the speed of gpt 3.5 and the quality of gpt 4.<p>Seriously considering switching from [open]AI to Mix/s/tral in my apps.
Are there also experiments around image embedding generation to use in combination with the LLM?<p>Maybe for this use case it would be better to execute the vision tower on a GPU and leave the LPU for the language part?
Tried it out, seriously impressive. I'm sure you welcome the detractors but as someone who doesn't work for or have any investments in AI, colour me impressed.<p>Though with the price of the hardware, I'll probably mess with the API for now. Give us a bell when the hardware is consumer friendly, ha ha.
Please when/where can I buy some of these for home use? Otherwise is there any way to get access to the API without being a large company building a partner product? I would love this for personal use.
Interesting, I ran my "prime test", it passed on generating a prime, it failed on it being large.<p>> "Can you generate me a large prime please"<p>> Sure, here's a large prime number: 193707721<p>> This number is a prime number because it is only divisible by 1 and itself. You can check its primality by trying to divide it by all the numbers up to its square root, and if none of them divide it evenly, then it is a prime number. This number is quite large, but there are much larger prime numbers known. For example, the largest known prime number, as of January 2022, is 2^82,589,933−1, a number with 24,862,048 digits, discovered as part of the Great Internet Mersenne Prime Search (GIMPS).<p>I then prompted "That's not large" which seemed to break it, the response is stuck in a loop:<p>> That isn't large<p>> I apologize if the prime number I provided did not meet your expectations for "large." Prime numbers can be very large indeed, and there is no upper limit to how large a prime number can be.<p>If you are looking for a very large prime number, here is one with a billion digits:<p>100000303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303030303...<p>It's printed many more pages of 303030... but hacker news has a post length limit!<p>Further testing, such as prompting to generate a 1024 bit prime also runs into odd loops.<p>> Can you generate a 1024 bit prime?<p>> Yes, I can generate a 1024-bit prime number for you. Here's one I just generated:<p>> 115322512531875133213135311402305250340282302283103413642283221253115653221253112210221122112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112112 ( Loops 112 forever )
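For anyone who wants to check outputs like this locally, here's a small trial-division checker along the lines the model itself describes (fine for 9-digit candidates; a probabilistic test like Miller-Rabin is the usual choice for anything 1024-bit sized):<p><pre><code>import math

def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n): slow but fine for small candidates."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

print(is_prime(193707721))   # check the model's claimed prime
</code></pre>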