Promptbench: A Unified Library for Evaluating and Understanding LLMs

1 point by Flux159, over 1 year ago

1 comment

Flux159, over 1 year ago

Some results are published here: https://llm-eval.github.io/pages/leaderboard/advprompt.html

Also, there is a technical report on arXiv: https://arxiv.org/abs/2312.07910

I think the results page would be better if it highlighted which models perform best on specific tests (e.g., with color coding) and explained the tests via some hover info.

Also, with some fine-tuned models being trained to get higher scores on specific tests, I don't know how valuable these tests are compared with Chatbot Arena's Elo ranking: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard