
Show HN: Open-source study to measure end user satisfaction levels with LLMs

12 points | by sparacha | 9 months ago
The LLM challenge, an online study, aims to answer a simple question: what is the quality corridor that matters to end users when interacting with LLMs? At what point do users stop seeing a quality difference, and at what point do users get frustrated by poor LLM quality?

The project is an Apache 2.0 licensed open-source project available on GitHub: https://github.com/open-llm-initiative/llm-challenge. The challenge is hosted on AWS as a single-page web app, where users see greeting text followed by a randomly selected prompt and an LLM response, which they must rate on a Likert scale of 1-5 (or a yes/no rating) that matches the task represented in the prompt.

The study uses pre-generated prompts across popular real-world use cases like information extraction and summarization, creative tasks like writing a blog post or story, and problem-solving tasks like extracting the central ideas from a passage, writing business emails, or brainstorming ideas to solve a problem at work or school. To generate responses of varying quality, the study uses the following OSS LLMs: Qwen2-0.5B-Instruct, Qwen2-1.5B-Instruct, gemma-2-2B-it, Qwen2-7B-Instruct, Phi-3-small-128k-instruct, Qwen2-72B, and Meta-Llama-3.1-70B. For proprietary LLMs, we limited our choices to Claude 3 Haiku, Claude 3.5 Sonnet, OpenAI GPT-3.5-Turbo, and OpenAI GPT-4o.

Today, LLM vendors are in a race with each other to one-up benchmarks like MMLU, MT-Bench, HellaSwag, etc., which are designed and rated primarily by human experts. But as LLMs get deployed in the real world for end users and productivity workers, there hasn't been a study (as far as we know) that helps researchers and developers understand the impact of model selection as perceived by end users. This study aims to gather insights for incorporating human-centric benchmarks into building generative AI applications and LLMs.

If you want to contribute to the AI community in an open-source way, we'd love it if you could take the challenge. We'll publish the study results in 30 days on GitHub.
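The flow described above (show each visitor a randomly selected prompt and pre-generated model response, then collect a 1-5 Likert rating) can be sketched as follows. This is a minimal illustrative sketch, not the actual llm-challenge implementation: the prompt texts, the two-model response table, and the function names `sample_task` and `record_rating` are all assumptions made for the example.

```python
import random

# Hypothetical sketch of the study's core flow: each participant sees one
# randomly chosen (prompt, model response) pair and rates it on a 1-5
# Likert scale. All names and data below are illustrative placeholders.

PROMPTS = {
    "summarization": "Summarize the following passage in two sentences: [passage]",
    "creative_writing": "Write a short blog post about remote work.",
    "problem_solving": "Brainstorm three ways to reduce meeting overload at work.",
}

# Pre-generated responses of varying quality, one per model per use case.
RESPONSES = {
    use_case: {
        "Qwen2-0.5B-Instruct": f"(lower-quality response for {use_case})",
        "Claude 3.5 Sonnet": f"(higher-quality response for {use_case})",
    }
    for use_case in PROMPTS
}

def sample_task():
    """Pick a random use case, then a random model's pre-generated response."""
    use_case = random.choice(list(PROMPTS))
    model = random.choice(list(RESPONSES[use_case]))
    return use_case, model, PROMPTS[use_case], RESPONSES[use_case][model]

def record_rating(results, use_case, model, rating):
    """Store a Likert rating (1-5), keyed by (use case, model)."""
    if rating not in range(1, 6):
        raise ValueError("rating must be an integer on a 1-5 Likert scale")
    results.setdefault((use_case, model), []).append(rating)

# One simulated participant session:
results = {}
use_case, model, prompt, response = sample_task()
record_rating(results, use_case, model, 4)
```

Aggregating the per-model rating lists in `results` across many sessions would then yield the perceived-quality comparison the study is after.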

1 comment

sampreeth95 | 9 months ago
It is a great challenge to uncover a much-needed insight - human satisfaction levels - rather than comparing the numbers on leaderboards for different LLMs.