
Creating a LLM-as-a-Judge That Drives Business Results

82 points | by thenameless7741 | 7 months ago

5 comments

Lerc, 7 months ago
There are a few broad areas of risk in AI.

1. Enabling goes both ways, therefore bad actors can also be enabled by AI.

2. Accuracy of suggestions. Information provided by AI may be incorrect, be it code, how to brush one's teeth, or the height of Arnold Schwarzenegger. At worst, AI can respond against the user's interests if the creator of the AI has configured it to do so.

3. Accuracy of determinations. LLM-as-a-Judge falls under this category. This is one of the areas where a single error can magnify the most.

This post says: "What about guardrails?

Guardrails are a separate but related topic. They are a way to prevent the LLM from saying/doing something harmful or inappropriate. This blog post focuses on helping you create a judge that's aligned with business goals, especially when starting out."

That seems woefully inadequate.

When using AI to make determinations, there have to be guardrails. Having looked at draft legislation and government position statements, many are looking at legally requiring that any implementer of an AI system that makes determinations *must* implement processes to deal with the situation where the AI makes an incorrect determination. To be effective, this should be a process that can be initiated by the individuals affected by the determination.
jerpint, 7 months ago
The biggest problem these days is that it's very easy to hack together a solution for a problem that, at first glance, seems to work just fine. Understanding the limits of the system is the hard part, especially since LLMs can't know when they don't know.
petesergeant, 7 months ago
I'm going through almost exactly this process at the moment, and this article is excellent. It aligns with my experience while adding a bunch of good ideas I hadn't thought of / discovered yet. A+, would read again.
firejake308, 6 months ago
> The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice "hack" I use to trick people into carefully looking at their data!

Interesting conclusion. One of the reasons I like programming is that in order to automate a process using traditional software, you have to really understand the process and break it down into individual lines of code. I suppose the same is true for automating processes with LLMs; you still have to really understand the process and break it down into individual instructions for your prompt.
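The parallel can be made concrete: an LLM judge is essentially the evaluation process decomposed into explicit criteria in a prompt, plus a parser that turns the model's free-text verdict back into structured data. A minimal sketch, where the prompt wording, the two-line output format, and the `call_model` stub are all illustrative assumptions rather than the article's actual implementation:

```python
# Sketch of an LLM-as-a-judge: the evaluation process is written out as
# explicit criteria in the prompt, and the model's answer is parsed back
# into structured data. call_model is a hypothetical stand-in for a real
# LLM API call; swap in your provider's client.

JUDGE_PROMPT = """You are evaluating a customer-support reply.
Criteria:
1. Does it answer the user's actual question?
2. Is every factual claim supported by the provided context?

Reply to evaluate:
{reply}

Answer on exactly two lines:
critique: <one-sentence reasoning>
verdict: PASS or FAIL
"""

def call_model(prompt: str) -> str:
    # Stub: a real implementation would send the prompt to an LLM here.
    return ("critique: The reply answers the question and cites the context.\n"
            "verdict: PASS")

def judge(reply: str) -> dict:
    raw = call_model(JUDGE_PROMPT.format(reply=reply))
    fields = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return {"critique": fields.get("critique", ""),
            "passed": fields.get("verdict", "").upper() == "PASS"}

result = judge("You can reset your password from the account settings page.")
print(result["passed"])  # True with the stubbed response
```

Keeping the critique line alongside the verdict is the part that "tricks people into looking at their data": the reasoning is what you review when the judge and a human disagree.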
bzmrgonz, 7 months ago
This is a brilliant write-up, very dense but very detailed; thank you for taking the time (assuming you didn't employ AI... LOL). So listen, assuming you are the author: there is an open-source case management software called ArkCase. I engaged them as a possible flagship platform at a law firm. Going through their presentation, I noticed that the platform is extremely customizable and flexible. So much so, that I think that in itself is the reason people don't adopt it in droves; it is essentially too permissive. However, I think it would be a great backend component to a "rechat"-style LLM front end. Is there such a need? To have a backend data repository that interacts with a front-end LLM that employees interact with in pure prose and directives? What does the current backend look like for services such as rechat and other chat-based LLM agents? I bring this up because ArkCase is so flexible that it can serve broad industries and needs, from managing a high-school athletic department (a dossier and bio for each staff member and player) to the entire US Office of Personnel (Alfresco and ArkCase for security clearance investigations). The idea is that by introducing an agent LLM as the front end, the learning curve could be flattened and the extreme flexibility abstracted away.