I created this dense visual comparison to better understand and contextualize the relationships between capability, cost, and speed for text LLMs widely available via cloud providers today.

All values are sourced from publicly available external data.

This sheet is only as good as the data I've found for it. Some values change over time (e.g., the 0-100 normalized index), while others have contradictory sources. For example, OpenAI's self-reported metrics for GPT-4-turbo are quite close but not identical between their simple-evals repo [1] and the charts in the GPT-4o announcement [2]. In other cases, strong benchmark scores are displayed prominently on marketing pages while weaker scores require some digging.

As a general rule of thumb, I've tried to:
a) Include every metric I can find, to help mitigate cherry-picking bias.
b) Resolve conflicts by selecting whichever source I consider more current or more trustworthy. For what it's worth, none of the discrepancies I've come across differ by a meaningful margin.

The folks I've shared this with so far have found it useful - I hope you do as well!

[1] https://github.com/openai/simple-evals
[2] https://openai.com/index/hello-gpt-4o/