
Re-Evaluating GPT-4's Bar Exam Performance

122 points | by rogerkeays | 12 months ago

11 comments

fnordpiglet · 12 months ago
Scoring in the 96th percentile among humans taking the exam, without moving goalposts, would have been science fiction two years ago. Now it's suddenly not good enough, and the fact that a computer program can score decently among passing lawyers and first-time test takers is something to sneer at.

The fact that I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking. Anyone who views it as anything less in 2024 and asserts with a straight face that they wouldn't have said the same thing in 2020 is lying.

I do, however, find the paper really useful in contextualizing the scoring at a much finer grain. Personally, I didn't take the 96th-percentile score to be anything other than "among the mass who take the test," and I have enough experience with professional licensing exams to know that a huge percentage of test takers fail and are repeat test takers. Placing the goalposts quantitatively for the next levels of achievement is a useful exercise. But the profusion of jaded nerds makes me sad.
thehoneybadger · 12 months ago
It is difficult to comment without sounding obnoxious, but having taken the bar exam, I found it simple. Surprisingly simple. I think it was the single most overhyped experience of my life. I was fed all this insecurity and walked into the convention center expecting to face the biggest intellectual challenge of my life. Instead, it was endless multiple-choice questions and a couple of contrived scenarios for essays.

It may also surprise some to learn that legal writing is prized for its degree of formalism. It aims to remove all connotation from a message so as to minimize misunderstanding, much like clean code.

It may also be surprising, but the goal when writing a legal brief or judicial opinion is not to sound smart. The goal is to be clear, objective, and thereby persuasive. Using big words for the sake of using big words, using rare words, using weasel words like "kind of" or "most of the time" or "many people are saying," writing poetically, being overly obtuse and abstract: these are the things that get your law school application rejected, your brief ridiculed, and your bar exam failed.

The simpler and more formulaic your communication, the better. The more your argument is structured, akin to a computer program, the better.

Compared to some other domains, such as fiction, good legal writing is much easier for an attention model to simulate. The best exam answers are the ones that are the most formulaic, use the smallest lexicon, and use words correctly.

I only add this comment because I want to inform how non-lawyers perceive the bar exam. Getting an attention model to pass the bar exam is a low bar. It is not some great technical feat. A programmer could practically write a semantic disambiguation algorithm for legal writing from scratch with moderate effort.

It will be a good accomplishment, but it will only be a stepping stone. I am still waiting for AI to tackle messages that have greater nuance and are truly free-form. LLMs are not there yet.
radford-neal · 12 months ago
A basic problem with evaluations like these is that the test is designed to discriminate between *humans* who would make good lawyers and humans who would not. The test is not necessarily any good at telling whether a non-human would make a good lawyer, since it will not test anything that pretty much all humans know but non-humans may not.

For example, I doubt that it asks whether, for a person of average wealth and income, a $1000 fine is a more or less severe punishment than a month in jail.
elicksaur · 12 months ago
> Furthermore, unlike its documentation for the other exams it tested (OpenAI 2023b, p. 25), OpenAI's technical report provides no direct citation for how the UBE percentile was computed, creating further uncertainty over both the original source and validity of the 90th percentile claim.

This is the part that has bothered me (a licensed attorney) from the start. If it scores this high, where are the receipts? I'm sure OpenAI has the social capital to coordinate with the National Conference of Bar Examiners to have a GPT "sit" for a simulated bar exam.
dogmayor · 12 months ago
The bigger issue here is that actual legal practice looks nothing like the bar exam, so whether or not an LLM passes says nothing about how LLMs will impact the legal field.

Passing the bar should not be understood to mean "can successfully perform legal tasks."
Bromeo · 12 months ago
Very interesting. The abstract notes that although GPT-4 was claimed to score in the 92nd percentile on the bar exam, after correcting for a number of factors the authors find these results overinflated: it scores only in the 15th percentile on essays when compared against people who actually passed the bar.

That still does put it into bar-passing territory, though, since it still scores better than about one sixth of the people who passed the exam.
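The percentile arithmetic in the comment above can be checked with a quick sketch. The scores below are synthetic placeholders (a normal distribution with made-up mean and spread, not real bar data); the point is only the relationship between "15th percentile among passers" and "outscores roughly one in six passers":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical essay scores for 1,000 exam passers (synthetic, not real data).
passer_scores = rng.normal(loc=70.0, scale=10.0, size=1000)

# A score sitting at the 15th percentile of the passer distribution.
score_at_15th = np.percentile(passer_scores, 15)

# By definition, about 15% of passers fall below that score.
fraction_beaten = np.mean(passer_scores < score_at_15th)
print(f"Outscores {fraction_beaten:.0%} of passers")  # ~15%, i.e. about one in six to seven
```

Whether 15% rounds to "one sixth" (16.7%) or "one seventh" (14.3%) is a matter of taste, but the commenter's estimate is in the right range.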
gnicholas · 12 months ago
This analysis touches on the difference between first-time takers and repeat takers. I recall when I took the bar in 2007, there was a guy blogging about the experience. He went to a so-so school and failed the bar. My friends and I, who had been following his blog, checked in occasionally to see if he ever passed. After something like a dozen attempts, he did. Every one of us who passed was counted in the pass statistics once. He was counted a dozen times. This dramatically skews the statistics, and if you want to look at who becomes a lawyer (especially one at a big firm or company), you really need to limit yourself to those who pass on the first (or maybe second) try.
jeffbee · 12 months ago
It appears that researchers and commentators are totally missing the application of LLMs to law, and to other areas of professional practice. A generic trained-on-Quora LLM is going to be straight garbage for any specialization, but one that is trained on the contents of the law library will be utterly brilliant for assisting a practicing attorney. People pay serious money for legal indexes, cross-references, and research. An LLM is nothing but a machine-discovered compressed index of text. As an augmentation to existing law research practices, the right LLM will be extremely valuable.
Digory · 12 months ago
They originally scored it against a test usually taken by people who had failed the bar.

So GPT-4 scores closer to the bottom of the people who pass the bar on the first try. In other words, it matches the people who can cull the rules from texts already written but cannot apply them imaginatively.
_fw · 12 months ago
So it knows more about the law than you do, but less than they do.<p>Really glad to see research replicated like this. I’m not surprised that the 90th percentile doesn’t hold up.<p>It’s still handy though.
lccerina · 12 months ago
It's amazing the level of mental gymnastics I see in the comments trying to justify a piece of technology that is evidently not as good as they believed it to be...