科技回声

Leaderboards are getting harder and harder to use as a decision tool. What does it mean to be 0.7% or 1.6% better? How does that help me? Is higher always better? What are the trade-offs? Evals continue to be one of the hardest and most important parts of LLMs and the tools that use them.

Hey all! Wanted to share this leaderboard we put together to centralize some of the different models available in the AI browser agent space.

Since working on Steel, we've seen a ton of people have a hard time putting the browser agent space and how it's progressing into perspective and it felt odd to us that there were no centralized leaderboards like there were for so many other agentic use cases.

So we launched this leaderboard to help! It's open-source and we're open to any contributions we may be missing. We're committed to keeping this up to date as the space progresses (which it seems to be doing quite quickly).

Let us know if you have any feedback/thoughts :)

Show HN: AI Browser Agent Leaderboard

2 comments
