Show HN: A registry of agent benchmarks (including many OSS agent trajectories)

6 points by lbeurerkellner 5 months ago
If you're interested in exploring what LLM-based agent systems actually do these days to solve benchmarks such as SWEBench or WebArena, our team created a small leaderboard that lets you view many public and OSS agent results, including all the runtime traces (the step-by-step reasoning behind the scenes).

Looking at traces is actually quite interesting, as they reveal a lot about the inner workings and shortcomings of current agent systems. For an example trace, see https://explorer.invariantlabs.ai/u/invariant/webarena--SteP/t/4

1 comment

lbeurerkellner 5 months ago
Let us know if you can think of any benchmark that you'd like to see added.