科技回声
AI hype is built on flawed test scores

204 points by antondd, over 1 year ago

29 comments

mg, over 1 year ago
I don't think the "hype" is built on test scores.
It is built on the observation of how fast AI is getting better. If the speed of improvement stays anywhere near the level of the last two years, then over the next two decades it will lead to massive changes in how we work and which skills are valuable.
Just two years ago, I was mesmerized by GPT-3's ability to understand concepts:
https://twitter.com/marekgibney/status/1403414210642649092
Nowadays, using it daily in a productive fashion feels completely normal.
Yesterday, I was annoyed with how cumbersome it is to play long mp3s on my iPad. I asked GPT-4 something like "Write an html page which lets me select an mp3, play it via play/pause buttons and offers me a field to enter a time to jump to". The result was usable out of the box and is my default mp3 player now.
Two years ago it didn't even dawn on me that this would be my way of writing software in the near future. I have been coding for over 20 years. But for little tools like this, it is faster to ask ChatGPT now.
It's hard to imagine where we will be in 20 years.
kfk, over 1 year ago
AI hype is really problematic in Enterprise. Big companies are now spending C-level executive time figuring out a company "AI strategy". This is going to be another cycle of money-wasted/biz-upset, very similar to what I have seen with Big Data. The thing in Enterprise is that everyone serious about biz operations knows AI test scores and AI quality are not there, but very few are able to communicate these concerns in a constructive way. Instead, everyone is embracing the hype because, maybe, they get a promotion? Tech, as usual, is very happy to feed the hype and never, as usual, tells businesses honestly that, at best, this is an incremental productivity improvement, nothing life changing. I think the issue is an overall lack of honesty, professionalism, and accountability across the board, with tech leading this terrible way of pushing product and "adding value".
danielvaughn, over 1 year ago
I remember watching a documentary about an old blues guitar player from the 1920s. They were trying to learn more about him and track down his whereabouts during certain periods of his life.
At one point, they showed some old footage which featured a montage of daily life in a small Mississippi town. You'd see people shopping for groceries, going on walks, etc. Some would stop and wave at the camera.
In the documentary, they noted that this footage exists because at the time, they'd show it on screen during intermission at movie theaters. Film was still in its infancy at that time, and was so novel that people loved seeing themselves and other people on the big screen. It was an interesting use of a new technology, and today it's easy to understand why it died out. Of course, it likely wasn't obvious at the time.
I say all that because I don't think we can *know* at this point what AI is capable of, and how we want to use it, but we should expect to see lots of failure while we figure it out. Over the next decade there are undoubtedly going to be countless ventures similar to the "show the townspeople on the movie screen" idea, blinded by the novelty of technological change. But failed ventures have no relevance to the overall impact or worth of the technology itself.
randcraw, over 1 year ago
The debate over whether LLMs are "intelligent" seems a lot like the old debate among NLP experts over whether English must be modeled as a context-free grammar (push down automaton) or a finite-state machine (regular expression). Yes, any language can be modeled using regular expressions; you just need an insane number of FSMs (perhaps billions). And that seems to be the model that LLMs are using to model cognition today.
LLMs seem to use little or no abstract reasoning (is-a) or hierarchical perception (has-a), as humans do -- both of which are grounded in semantic abstraction. Instead, LLMs can memorize a brute force explosion in finite state machines (interconnected with Word2Vec-like associations) and then traverse those machines and associations as some kind of mashup, akin to a coherent abstract concept. Then as LLMs get bigger and bigger, they just memorize more and more mashup clusters of FSMs augmented with associations.
Of course, that's not how a human learns, or reasons. It seems likely that synthetic cognition of this kind will fail to enable various kinds of reasoning that humans perceive as essential and normal (like common sense based on abstraction, or physically-grounded perception, or goal-based or counterfactual reasoning, much less insight into the thought processes / perceptions of other sentient beings). Even as ever-larger LLMs "know more" by memorizing ever more FSMs, I suspect they'll continue to surprise us with persistent cognitive and perceptual deficits that would never arise in organic beings that *do* use abstract reasoning and physically grounded perception.
iambateman, over 1 year ago
This really is a good article, and is seriously researched. But the conclusion in the headline - “AI hype is built on flawed test scores” - feels like a poor summary of the article.
It _is_ correct to say that an LLM is not ready to be a medical doctor, even if it can pass the test.
But I think a better conclusion is that test scores don’t help us understand LLM capabilities like we think they do.
Using a human test for an LLM is like measuring a car’s “muscles” and calling it horsepower. They’re just different.
But the AI hype is justified, even if we struggle to measure it.
dleslie, over 1 year ago
Two years ago I didn't use AI at all. Now I wouldn't go without it; I have Copilot integrated with Emacs, VSCode, and Rider. I consider it a ground-breaking productivity accelerator, a leap similar to when I transitioned from Turbo Pascal 2 to Visual C 6.
That's why I'm hyped. If it's that good for me, and it's generalizable, then it's going to rock the world.
GuB-42, over 1 year ago
I don't think test scores have anything to do with the hype. Most people don't even realize test scores exist.
One part is just the wow factor. It will be short lived. A bit like VR, which is awesome when you first try it, but wears off quickly. Here, you can have a bot write convincing stories and generate nice looking images, which is awesome until you notice that the story doesn't make sense and that the images have many details wrong. This is not just a score, it is something you can see and experience.
And there is also the real thing. People are starting to use GPT for real work. I have used it to document my code, for instance, and it works really well: with it I can do a better job than without, and I can do it faster. Many students use it to do their homework, which may not be something you want, but it is no less of a real use. Many artists are strongly protesting against generative AI, which in itself is telling: it means it is taken seriously. At the same time, other artists are making use of it.
It is even used to great effect where you don't notice. Phone cameras are a good example: by enhancing details using AI, they give you much better pictures than the optics are capable of. Some people don't like that because the pictures are "not real", but most enjoy the better perceived quality. Then there are image classifiers, speech-to-text and OCR, fuzzy searching, the content ranking algorithms we love to hate, etc., which all make use of AI.
Note: here AI = machine learning with neural networks, which is what the hype is about. AI is a vague term that can mean just about anything.
dmezzetti, over 1 year ago
This video from Yann LeCun gives a great summary of where things stand. https://www.youtube.com/watch?v=pd0JmT6rYcI
He is of the opinion that the current-generation transformer architecture is flawed and that it will take a new generation of models to get close to the hype.
PeterisP, over 1 year ago
It's not built on high test scores - while academics do benchmark models on various tests, the many people who built up the hype mostly did it based on their personal experience with a chatbot, not by running some long (and expensive) tests on those datasets.
The tests are used (and, despite their flaws, useful) to compare various facets of model A to model B - however, the validation of whether a model is *good* now comes from users, and that validation really can't be flawed much - if it's helpful (or not) to someone, then it is what it is; the proof of the pudding is in the eating.
waynenilsen, over 1 year ago
This article is absurd.
> But when a large language model scores well on such tests, it is not clear at all what has been measured. Is it evidence of actual understanding? A mindless statistical trick? Rote repetition?
It is measuring how well it does _at REPLACING HUMANS_. It is hard to believe how the author clearly does not understand this. I don't care how it obtains its results.
GPT-4 is like a hyperspeed entry to mid level dev that has almost no ability to contextualize. Tools built on top of 32k will allow repo ingestion.
This is the worst it will ever be.
chewxy, over 1 year ago
I note something very interesting in the AI hype, and I would like someone to help explain it.
Whenever there's a news story or article noting the limits of current LLM tech (especially the GPT class of models from OpenAI), there's always a comment that says something along the lines of "ah, did you test it on GPT-4?"
Or if it's clear that it's a limitation of GPT-4, then you have comments along the lines of "what's the prompt?", or "the prompt is poor". Usually, it's someone who hasn't in the past indicated that they understand that prompt engineering is model specific, and that the paper's point is to make a more general claim as opposed to a claim about one model.
Can anyone explain this? It's like the mere mention of LLMs being limited in X, Y, Z fashion offends their lifestyle/core beliefs. Or perhaps it's a weird form of astroturfing. To which I ask: to what end?
epups, over 1 year ago
I think, ironically, there has been an "AI-anti-hype hype", with people like Gary Marcus trying to blow up every single possible issue into a deal breaker. Most of the claims in this article are based on tests performed only on GPT-3, and researchers often seem to design tests in a way that proves their point - see an earlier comment from me here with an example: https://news.ycombinator.com/item?id=37503944
I agree there have been many attention-grabbing headlines that are due to simple issues like contamination. However, I think AI has already proved its business value far beyond those issues, as anyone using ChatGPT with a code base not present in its dataset can attest.
bondarchuk, over 1 year ago
> But there’s a problem: there is little agreement on what those results really mean. Some people are dazzled by what they see as glimmers of human-like intelligence; others aren’t convinced one bit.
I find the whole hype & anti-hype dynamic so tiresome. Some are over-hyping, others are responding with over-anti-hyping. Somewhere in between are many reasonable, moderate and caveated opinions, but neither the hypesters nor the anti-hypesters will listen to these (considering all of them to come from people at the opposite extreme), nor will outside commentators (somehow being unable to categorize things as anything more complicated than this binary).
mcguire, over 1 year ago
"When Horace He, a machine-learning engineer, tested GPT-4 on questions taken from Codeforces, a website that hosts coding competitions, he found that it scored 10/10 on coding tests posted before 2021 and 0/10 on tests posted after 2021. Others have also noted that GPT-4’s test scores take a dive on material produced after 2021. Because the model’s training data only included text collected before 2021, some say this shows that large language models display a kind of memorization rather than intelligence."
I'm sure that is just a matter of prompt engineering, though.
robertlagrant, over 1 year ago
> AI hype is built on high test scores
No, it's built on people using DALLE and Midjourney and ChatGPT.
nojvek, over 1 year ago
Related paper: https://arxiv.org/pdf/2309.08632.pdf
‘Pre-training on the Test Set Is All You Need’
GPT-4 is really good at digging up information it has seen before, but please don’t use it for any serious reasoning. Always take the answer with a grain of salt.
refulgentis, over 1 year ago
This is my favorite new AI argument; it took me a few months to see it. I enjoyed it at first.
You start with "everyone knows there's AI hype from tech bros". Then you introduce a PhD or two at institutions with good names. Then they start grumbling about anthropomorphizing and "who knows what AI is anyway".
Somehow, if it's long enough, you forget that this kind of has nothing to do with anything. There is no argument. Just imagining other people must believe crazy things and working backwards from there to find something to critique.
It took me a bit to realize it's not even an argument, just parroting "it's a stochastic parrot!" It assumes other people are dunces and genuinely believe it's a minihuman. I can't believe MIT Tech Review is going for this; the only argument here is that the tests are flawed if you think they're supposed to show the AI model is literally human.
MrYellowP, over 1 year ago
I disagree entirely.
The hype is based entirely on the fact that I can talk (in text) to a machine and it responds like a human. It might sometimes make up stuff, but so do humans. I therefore don't consider that a significant downside, or problem. In the end chatgpt is still ... a baby.
The hype builds around the fact that I can run a language model that fits into my graphics card and responds at faster-than-typing speed, which is sufficient.
The hype builds around the fact that it can create and govern whole text based games for me, if I just properly ask it to do so.
The hype builds around the fact that I can have this everywhere with me, all day long, whenever I want. It never grows tired, it never stops answering, it never scoffs at me, it never hates me, it never tells me that I'm stupid, it never tells me that I'm not capable of doing something.
It always teaches me, always offers me more to learn, it is always willing to help me, it never intentionally tries to hide the fact that it doesn't know something and never intentionally tries to impress me just to get something from me.
Can it get things wrong? Sure! Happens! Happens to everyone. Me, you, your neighbour, parents, teachers, plumbers.
Not for a single minute did I, or dozens of millions of others, give a single flying fuck about test scores.
janalsncm, over 1 year ago
The only test I need is the amount of time it takes me to do common tasks with and without ChatGPT. I’m aware it’s not perfect but perfect was never necessary.
derbOac, over 1 year ago
This was interesting to me, but mostly because of a question I thought it was going to focus on, which is: how should we interpret these tests when a human takes them?
I wasn't sure that the phenomena they discussed were as relevant to the question of whether AI is overhyped as they made them out to be, but I did think a lot of the questions about the meaning of the performances were important.
What's interesting to me is you could flip this all on its head and, instead of asking "what can we infer about the machine processes these test scores are measuring?", we could ask "what does this imply about the human processes these test scores are measuring?"
A lot of these tests are well-validated but overinterpreted, I think, and leaned on too heavily to make inferences about people. If a machine can pass a test, for instance, what does it say about the test as used on people? Should we be putting as much weight on them as we do?
I'm not arguing these tests are useless or something, just that maybe we read into them too much to begin with.
Cloudef, over 1 year ago
AI is honestly the wrong word to use. These are ML models and they are only able to do the task they have been specifically trained for (not saying the results aren't impressive!). There really isn't competition either, as the only people who can train these giant models are those who have the cash.
javier_e06, over 1 year ago
As a developer, when I work with ChatGPT I can see ChatGPT eventually taking over my JIRA stories. Then ChatGPT will take over management, creating product roadmaps, prioritizing and assigning tasks to itself. All dictated by customer feedback. The clock is ticking. But reasoning like a human? No.
Garvi, over 1 year ago
Counterpoint: Journalism is dead and has been replaced with algorithms that supply articles on a supply and demand basis.
"25% of the potential target audience dislikes AI and do not have their opinion positively represented in the media they consume. The potential is unsaturated. Maximum saturation estimated at 15 articles per week."
A bit more serious: AI hasn't even scratched the surface. Once we apply LLMs to speech synth and improve the visual generators by just a tiny bit, to fix faces, we can basically tell the AI to "create the best romantic comedy ever made".
"Oh, and repeat 1000 times, please".
rvz, over 1 year ago
Most of the hype comes from the AI grifters who need to find the next sucker to dump their VC shares onto, the next greater fool to purchase their ChatGPT-wrapper snake oil project at an overvalued asking price.
The ones who have to dismantle the hype are the proper technologists such as Yann LeCun and Grady Booch, who know exactly what they are talking about.
rahimnathwani, over 1 year ago
> “People have been giving human intelligence tests, IQ tests and so on, to machines since the very beginning of AI,” says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. “The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human.”
The last sentence above is an important point that most people don't consider.
aldousd666, over 1 year ago
Only idiots are basing their excitement about what's possible on those test scores. They're just an attempt to measure one bot against another. There is a strong possibility that they are only measuring how well the bot takes the test, and nothing at all about what the tests themselves purport to measure. I mean, those tests are probably similar to stuff that's in the training data.
aidenn0, over 1 year ago
Any task that gets solved with AI retroactively becomes something that doesn't require reasoning.
Kalanos, over 1 year ago
Didn't it perform well on both the SAT and LSAT though?
yieldcrv, over 1 year ago
This was 2 months ago, irrelevant in AI time