Hi HN,
ArtificialAnalysis.ai provides objective benchmarks and analysis of LLM AI models and API hosting providers so you can compare which to use in your next (or current) project.

The site consolidates different quality benchmarks, pricing information, and our own technical benchmarking data. Technical benchmarking (throughput, latency) is conducted by sending API requests every 3 hours.

Check out the site at https://artificialanalysis.ai and our Twitter at https://twitter.com/ArtificialAnlys

Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/1747264832439734353

All feedback is welcome, and we're happy to discuss methodology, etc.
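For anyone curious what a probe like this can look like, here is a minimal sketch of a periodic latency/throughput check against an OpenAI-compatible chat endpoint. The URL, key, and model name are placeholders, and this is not ArtificialAnalysis's actual harness.

```python
# Illustrative benchmarking probe; endpoint, key, and model are placeholders.
import time
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "sk-..."  # placeholder

def probe(model: str, prompt: str) -> dict:
    """Send one request and record total latency and rough output throughput."""
    start = time.monotonic()
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    elapsed = time.monotonic() - start
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return {
        "latency_s": elapsed,                           # total request time
        "throughput_tps": completion_tokens / elapsed,  # tokens per second
    }

if __name__ == "__main__":
    # In practice a scheduler (cron etc.) would fire this every 3 hours.
    while True:
        print(probe("gpt-4-turbo", "Summarize the plot of Hamlet in 100 words."))
        time.sleep(3 * 60 * 60)
```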
I love it. One minor change I'd make is putting the lowest price on the left in the pricing chart. On the other highlights, left to right goes from best to worst, but this one is the opposite.

I'm excited to see where things land. What I find interesting is that pricing is either wildly expensive or wildly cheap, depending on your use case. For example, if you want to run GPT-4 to glean insights on every webpage your users visit, a freemium business model is likely completely unviable. On the other hand, if I'm using an LLM to spot issues in a legal contract, I'd happily pay 10x what GPT-4 currently charges for something marginally better (it doesn't make much difference whether this task costs $4 or $0.40). I think the ultimate "winners" in this space will offer a range of models at various price points and let you seamlessly shift between them depending on the task (e.g., in a single workflow, I might have some sub-tasks that need a cheap model and some that require an expensive one).
Nice, I've been looking for something like this! A few notes / wishlist items:

* Looks like for GPT-4 Turbo (https://artificialanalysis.ai/models/gpt-4-turbo-1106-preview) there was a huge latency spike on December 28, which is making the average latency very high. Perhaps dropping the top and bottom 10% of requests would help the average (or switch to the median and include variance); a sketch of both is below.

* Adding latency variance would be truly awesome. I've run into issues with some LLM API providers that have incredibly high variance, but I haven't seen concrete data across providers.
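As a rough illustration of the robust statistics suggested above, assuming raw per-request latency samples (in seconds) are available:

```python
# Trimmed mean, median, and spread over raw latency samples; numbers are made up.
import statistics

def summarize(latencies: list[float], trim: float = 0.10) -> dict:
    xs = sorted(latencies)
    k = int(len(xs) * trim)
    trimmed = xs[k: len(xs) - k] if k else xs   # drop top/bottom 10%
    return {
        "mean": statistics.fmean(xs),
        "trimmed_mean": statistics.fmean(trimmed),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),          # variance/spread measure
    }

print(summarize([0.9, 1.1, 1.0, 1.2, 0.95, 14.0]))  # one outlier-like spike
```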
Hi HN, thanks for checking this out! The goal of this project is to provide objective benchmarks and analysis of LLM AI models and API hosting providers so you can compare which to use in your next (or current) project. Benchmark comparisons include quality, price, and technical performance (e.g. throughput, latency).

Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/1747264832439734353

All feedback is welcome.
I've been using Mixtral and Bard since the end of the year. I'm pleased with their overall performance for a mixture of content generation and coding.

It seems to me GPT-4 has become terse in its outputs; you have to do a lot more CoT-style prompting to get it to actually produce a good result, which is excruciating given how slow it is to generate content.

Seeing ~70-100 tokens/s from Mixtral on Together AI is crazy, and the quality works for my use case as well.
Since we are talking about throughput of API hosting providers, I wanted to add the work we have done at Groq. I understand the team is getting in touch with the ArtificialAnalysis folks to get benchmarked.

Mixtral running at >500 tokens/s on Groq: https://www.youtube.com/watch?v=5fJyOVtOk4Y
Experience the speed for yourself with Llama 2 70B at https://chat.groq.com/
This is great. Thank you!
I would be especially interested in more details around speed.
Averages are a good starting point, but I would love to also see the standard deviation or the 90th and 99th percentiles.

In my experience, speed varies a lot, and it makes a big difference whether a request takes 10 seconds or 50 seconds.
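A quick sketch of what reporting percentiles could look like, assuming raw latency samples; numpy is used purely for convenience and the numbers are made up:

```python
# p50/p90/p99 over per-request latency samples (seconds).
import numpy as np

latencies = np.array([9.8, 10.2, 11.0, 12.5, 10.7, 48.9, 10.1])  # illustrative
p50, p90, p99 = np.percentile(latencies, [50, 90, 99])
print(f"p50={p50:.1f}s  p90={p90:.1f}s  p99={p99:.1f}s")
```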
Thanks for putting this together! Amazon is far and away the priciest option here, but I wonder if a big part of that is the convenience tax for the Bedrock service. Would be interesting to compare that to the price of just renting AWS GPUs on EC2.
I'm surprised to see Perplexity's 70B online model score so low on model quality, somehow far worse than Mixtral and GPT-3.5 (they use a fine-tuned GPT-3.5 as the foundation model, AFAIK).

I run https://www.labophase.com and my data suggests it's one of the top 3 models in terms of users liking to interact with it. May I know how model quality is benchmarked, to understand this discrepancy?
I'm curious how they evaluated model quality. The only information I could find is "Quality: Index based on several quality benchmarks".
It's probably beyond the scope of this project, but it would be great to see comparisons across different quant levels (e.g. 4-bit), since quantization can sometimes cause an extreme drop-off in quality, and it's an important factor to consider when hosting your own LLM.
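For self-hosted comparisons like this, one common approach is to load the same checkpoint at a reduced precision and compare outputs side by side against full precision. A rough sketch, assuming the Hugging Face transformers + bitsandbytes stack and an example Mixtral checkpoint (not anything ArtificialAnalysis uses):

```python
# Load a model in 4-bit (NF4) for a side-by-side quality check against full precision.
# Assumes `transformers` and `bitsandbytes` are installed and a CUDA GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example checkpoint

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

prompt = "Explain the difference between latency and throughput in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)
out = model_4bit.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```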
This is awesome. I was looking at benchmarking speed and quality myself but didn't go this far!
I wonder about Claude Instant and Phi 2?
Modal.com for inference felt crazy fast, but I didn't note the metrics.
Good ones to add?
Replicate.com too maybe?
I wish more places showed time to first token (TTFT). For scenarios involving real-time human interaction, the important parts are how long it takes until the first token is returned and whether tokens are generated faster than people consume them.

Sadly, very few benchmarks bother to track this.
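Measuring TTFT yourself is straightforward with any streaming API. A minimal sketch, assuming the official OpenAI Python client (other providers' streaming endpoints work similarly); the model and prompt are just examples:

```python
# Measure time-to-first-token (TTFT) and total generation time via streaming.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
start = time.monotonic()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Write a haiku about benchmarks."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first visible token arrived
        chunks += 1

total = time.monotonic() - start
print(f"TTFT: {first_token_at - start:.2f}s, total: {total:.2f}s, chunks: {chunks}")
```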