Hi HN,
ArtificialAnalysis.ai provides objective benchmarks and analysis of LLM AI models and API hosting providers so you can compare which to use in your next (or current) project.

The site consolidates different quality benchmarks, pricing information, and our own technical benchmarking data. Technical benchmarking (throughput, latency) is conducted by sending API requests every 3 hours.

Check out the site at https://artificialanalysis.ai and our Twitter at https://twitter.com/ArtificialAnlys

Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/1747264832439734353

All feedback is welcome, and we're happy to discuss methodology, etc.
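For anyone curious what a probe like this can look like, here is a minimal sketch of a periodic latency/throughput check against an OpenAI-compatible chat endpoint. The URL, key, and model name are placeholders, and this is not ArtificialAnalysis's actual harness.

```python
# Illustrative benchmarking probe; endpoint, key, and model are placeholders.
import time
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "sk-..."  # placeholder

def probe(model: str, prompt: str) -> dict:
    """Send one request and record total latency and rough output throughput."""
    start = time.monotonic()
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    elapsed = time.monotonic() - start
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return {
        "latency_s": elapsed,                           # total request time
        "throughput_tps": completion_tokens / elapsed,  # tokens per second
    }

if __name__ == "__main__":
    # In practice a scheduler (cron etc.) would fire this every 3 hours.
    while True:
        print(probe("gpt-4-turbo", "Summarize the plot of Hamlet in 100 words."))
        time.sleep(3 * 60 * 60)
```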
I love it. One minor change I'd make is putting the lowest price on the left in the pricing chart. On the other highlights, left to right goes from best to worst, but this one is the opposite.

I'm excited to see where things land. What I find interesting is that pricing is either wildly expensive or wildly cheap, depending on your use case. For example, if you want to run GPT-4 to glean insights on every webpage your users visit, a freemium business model is likely completely unviable. On the other hand, if I'm using an LLM to spot issues in a legal contract, I'd happily pay 10x what GPT-4 currently charges for something marginally better (it doesn't make much difference whether this task costs $4 or $0.40). I think the ultimate "winners" in this space will offer a range of models at various price points and let you seamlessly shift between them depending on the task (e.g., in a single workflow, I might have some sub-tasks that need a cheap model and some that require an expensive one).
Nice, I've been looking for something like this! A few notes / wishlist items:

* Looks like for GPT-4 Turbo (https://artificialanalysis.ai/models/gpt-4-turbo-1106-preview) there was a huge latency spike on December 28, which is making the average latency very high. Perhaps dropping the top and bottom 10% of requests would help the average (or switch to the median and include variance); a sketch of both is below.

* Adding latency variance would be truly awesome. I've run into issues with some LLM API providers that have incredibly high variance, but I haven't seen concrete data across providers.
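As a rough illustration of the robust statistics suggested above, assuming raw per-request latency samples (in seconds) are available:

```python
# Trimmed mean, median, and spread over raw latency samples; numbers are made up.
import statistics

def summarize(latencies: list[float], trim: float = 0.10) -> dict:
    xs = sorted(latencies)
    k = int(len(xs) * trim)
    trimmed = xs[k: len(xs) - k] if k else xs   # drop top/bottom 10%
    return {
        "mean": statistics.fmean(xs),
        "trimmed_mean": statistics.fmean(trimmed),
        "median": statistics.median(xs),
        "stdev": statistics.stdev(xs),          # variance/spread measure
    }

print(summarize([0.9, 1.1, 1.0, 1.2, 0.95, 14.0]))  # one outlier-like spike
```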
Hi HN, thanks for checking this out! The goal of this project is to provide objective benchmarks and analysis of LLM AI models and API hosting providers so you can compare which to use in your next (or current) project. Benchmark comparisons include quality, price, and technical performance (e.g. throughput, latency).

Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/1747264832439734353

All feedback is welcome.
I've been using Mixtral and Bard since the end of the year. I'm pleased with their overall performance for a mixture of content generation and coding.

It seems to me GPT-4 has become terse in its outputs; you have to do a lot more CoT-style prompting to get it to actually produce a good result, which is excruciating given how slow it is to generate content.

Seeing ~70-100 tokens/s from Mixtral on Together AI is crazy, and the quality works for my use case as well.
Since we are talking about throughput of API hosting providers, I wanted to add the work we have done at Groq. I understand the team is getting in touch with the ArtificialAnalysis folks to get benchmarked.

Mixtral running at >500 tokens/s on Groq: https://www.youtube.com/watch?v=5fJyOVtOk4Y
Experience the speed for yourself with Llama 2 70B at https://chat.groq.com/
This is great. Thank you!
I would be especially interested in more details around speed.
Averages are a good starting point, but I would love to also see the standard deviation or the 90th and 99th percentiles.

In my experience, speed varies a lot, and it makes a big difference whether a request takes 10 seconds or 50 seconds.
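A quick sketch of what reporting percentiles could look like, assuming raw latency samples; numpy is used purely for convenience and the numbers are made up:

```python
# p50/p90/p99 over per-request latency samples (seconds).
import numpy as np

latencies = np.array([9.8, 10.2, 11.0, 12.5, 10.7, 48.9, 10.1])  # illustrative
p50, p90, p99 = np.percentile(latencies, [50, 90, 99])
print(f"p50={p50:.1f}s  p90={p90:.1f}s  p99={p99:.1f}s")
```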
Thanks for putting this together! Amazon is far and away the priciest option here, but I wonder if a big part of that is the convenience tax for the Bedrock service. Would be interesting to compare that to the price of just renting AWS GPUs on EC2.
I'm surprised to see Perplexity's 70B online model score so low on model quality, somehow far worse than Mixtral and GPT-3.5 (they use a fine-tuned GPT-3.5 as the foundation model, AFAIK).

I run https://www.labophase.com and my data suggests it's one of the top 3 models in terms of users liking to interact with it. May I know how model quality is benchmarked, to understand this discrepancy?
I'm curious how they evaluated model quality. The only information I could find is "Quality: Index based on several quality benchmarks".
It's probably beyond the scope of this project, but it would be great to see comparisons across different quant levels (e.g. 4-bit), since quantization can sometimes cause an extreme drop-off in quality, and it's an important factor to consider when hosting your own LLM.
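For self-hosted comparisons like this, one common approach is to load the same checkpoint at a reduced precision and compare outputs side by side against full precision. A rough sketch, assuming the Hugging Face transformers + bitsandbytes stack and an example Mixtral checkpoint (not anything ArtificialAnalysis uses):

```python
# Load a model in 4-bit (NF4) for a side-by-side quality check against full precision.
# Assumes `transformers` and `bitsandbytes` are installed and a CUDA GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example checkpoint

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

prompt = "Explain the difference between latency and throughput in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)
out = model_4bit.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```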
This is awesome. I was looking at benchmarking speed and quality myself but didn't go this far!
I wonder about Claude Instant and Phi 2?
Modal.com for inference felt crazy fast, but I didn't note the metrics.
Good ones to add?
Replicate.com too maybe?
I wish more places showed time to first token (TTFT). For scenarios involving real-time human interaction, the important parts are how long it takes until the first token is returned and whether tokens are generated faster than people consume them.

Sadly, very few benchmarks bother to track this.
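Measuring TTFT yourself is straightforward with any streaming API. A minimal sketch, assuming the official OpenAI Python client (other providers' streaming endpoints work similarly); the model and prompt are just examples:

```python
# Measure time-to-first-token (TTFT) and total generation time via streaming.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
start = time.monotonic()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Write a haiku about benchmarks."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first visible token arrived
        chunks += 1

total = time.monotonic() - start
print(f"TTFT: {first_token_at - start:.2f}s, total: {total:.2f}s, chunks: {chunks}")
```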