It's silly to flag this submission. Back in April, researchers at Stanford reported that less than half of the results from AI-powered search corresponded to verifiable facts. What do we call the remaining portion? "BS" seems reasonable.<p><a href="https://aiindex.stanford.edu/report/" rel="nofollow noreferrer">https://aiindex.stanford.edu/report/</a><p>"As internet pioneer and Google researcher Vint Cerf said Monday, AI is "like a salad shooter," scattering facts all over the kitchen but not truly knowing what it's producing. "We are a long way away from the self-awareness we want," he said in a talk at the TechSurge Summit."<p><a href="https://www.cnet.com/tech/computing/bing-ai-bungles-search-results-at-times-just-like-google/" rel="nofollow noreferrer">https://www.cnet.com/tech/computing/bing-ai-bungles-search-r...</a>
To be honest, I hated writing essays in English classes because I felt like I was forced to write BS to fill up the space when my argument could be summed up in several bullet points.<p>Since I'm not a student anymore, I can just give ChatGPT a few bullet points and ask it to write a paragraph for me. As an engineer who doesn't like writing "fluff", it's great that I can now outsource the BS part of writing.
So what?<p>Today, ChatGPT helped me write a driver.<p>The driver either compiles, or it doesn't; it compiled.
The driver either reads a value from a register, or it doesn't; it read.
The driver either causes the chip to physically move electrons in the real world in the way that I want it to, <i>or it doesn't.</i><p>The real world does not distinguish between bullshit and non-bullshit. Things either work or they do not. They either are one way, or they are another way. ChatGPT produces things that work in reality. We humans live in reality. Reality is what matters.<p>I notice a thread running through all of the breathless panicking about LLMs: it does not correspond to REALITY. It's a panic about a fiction. The fiction that the content of text is reality itself. The fiction that the LLM can somehow recursively improve itself. The fiction that the map is the territory.
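(To make the "either it reads a register or it doesn't" test concrete: here's a minimal sketch in Python on embedded Linux, peeking a memory-mapped register through /dev/mem. The base address and offset are made-up placeholders, not from any real chip, and it needs root; the point is simply that the value either comes back or the call fails.)

    import mmap, os, struct

    # Hypothetical addresses; substitute the base and offset from your chip's datasheet.
    REG_BASE = 0x4804C000   # made-up peripheral base address (page-aligned)
    REG_OFFSET = 0x138      # made-up register offset within that page

    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        mem = mmap.mmap(fd, 4096, mmap.MAP_SHARED, mmap.PROT_READ, offset=REG_BASE)
        value = struct.unpack_from("<I", mem, REG_OFFSET)[0]  # one 32-bit read
        print(f"register value: {value:#010x}")
        mem.close()
    finally:
        os.close(fd)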
During the big GPT-4 news cycle, I think a bunch of folks posted claims that were outrageously good: "language model passes medical exams better than humans", etc.
When I looked into them, in nearly all cases the claims were boosted far beyond the reality. And the reality seemed much more consistent with a fairly banal interpretation: LLMs produce realistic-looking text but have no real ability to distinguish truth from fabrication (which is a step beyond bullshit!).<p>The one example that still interests me is math problem solving. Can next-token predictors really solve generalized math problems as well as children? <a href="https://arxiv.org/abs/2110.14168" rel="nofollow noreferrer">https://arxiv.org/abs/2110.14168</a>
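(For reference, the linked benchmark, GSM8K, is a set of grade-school word problems graded on the final numeric answer. A toy illustration of that setup in Python, using a made-up problem and a hard-coded stand-in for the model's reply rather than a real API call:)

    import re

    # A made-up GSM8K-style word problem (illustrative only, not from the dataset).
    problem = ("A class of 24 students splits into teams of 4. "
               "Each team gets 3 markers. How many markers are needed?")
    expected = (24 // 4) * 3  # reference answer: 18

    # Pretend this string came back from a model; in practice you'd call an API here.
    model_answer = "24 / 4 = 6 teams, and 6 * 3 = 18 markers, so the answer is 18."

    # Grade roughly the way such benchmarks do: compare the last number in the
    # reply against the reference answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_answer)
    is_correct = bool(numbers) and float(numbers[-1]) == expected
    print(problem)
    print("model is", "correct" if is_correct else "incorrect")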
To me, this is the quintessential risk: it's plausible enough to fool somebody with the authority to act but without the competence to recognize that the information is low grade. Boom! "Oh man... but the computer said it was OK."
I like to think of all responses from LLMs like the top-rated post on Stack Overflow or a top-five blog post from a Google search. It's helpful information that _may_ be correct but needs to be verified. A lot of the time, it's spot on. Some percentage of the time, it's straight up incorrect. You have to be willing to compare various sources of data and find what's accurate. It's a nice, easy-to-use starting point, essentially.
While there is truth here, LLMs can be quite effective as logic engines rather than fact engines. One of the most popular LLM use cases is retrieval augmented generation (RAG), where the LLM is constrained to a provided context.<p>Do you need 7B/13B/33B/77B parameters to do this? That is a question up for debate and something I'm exploring with the concept of micro/nano models (<a href="https://neuml.hashnode.dev/train-a-language-model-from-scratch" rel="nofollow noreferrer">https://neuml.hashnode.dev/train-a-language-model-from-scrat...</a>). There is a sense that today's LLMs could be overkill for a problem such as RAG.
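(For anyone unfamiliar with the pattern, the basic RAG loop is small enough to sketch. This is a rough outline rather than a real implementation: retrieval here is a toy word-overlap score, and call_llm stands in for whatever model endpoint you actually use.)

    def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
        # Rank documents by how many query words they share, keep the top k.
        q_words = set(query.lower().split())
        scored = sorted(documents,
                        key=lambda d: len(q_words & set(d.lower().split())),
                        reverse=True)
        return scored[:k]

    def build_prompt(query: str, context_docs: list[str]) -> str:
        context = "\n\n".join(context_docs)
        return ("Answer the question using only the context below. "
                "If the context does not contain the answer, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}")

    def answer(query: str, documents: list[str], call_llm) -> str:
        # call_llm is any function that takes a prompt string and returns text.
        return call_llm(build_prompt(query, retrieve(query, documents)))

The interesting question for micro/nano models is how small they can get while still reliably following the "use only the context" instruction.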
Using LLMs to write code, particularly in a statically typed language, is a good way to get a sense for how accurate they are, since most mistakes/hallucinations are readily apparent.<p>I've been using GPT-4 to write code almost daily for months now, and I'd estimate that it is maybe 80-90% accurate in general, with the caveat that the quality of the prompt can have a major impact on this. If the prompt is vague, you're unlikely to get good results on the first try. If the prompt is very thorough and precise, and relevant context is included, it can often nail even fairly complex tasks in one shot.<p>Regardless of what the accuracy number is, it strikes me as pretty silly to call them "BS Machines". It's like calling human programmers "bug machines". Yeah, we do produce a lot of bugs, but we somehow seem to get quite a bit of working software out the door.<p>GPT-4 isn't perfect and people should certainly be aware that it makes mistakes and makes things up, but it also produces quite a lot of extremely useful output across many domains. I know it's made me more productive. Honestly, I can't think of any programming language, framework, technique, or product that has increased my productivity so quickly or dramatically in the 17 years I've been programming. Nothing else even comes close. Pretty good for a BS machine.
Even if you take the headline at face value (and IMO it's rather unfair)... the incredible saving grace of LLMs is that you have a plurality of BS machines, with different flavors of BS, whose outputs can be wired together.<p>Sure, the first-order output of today's generalist LLMs, producing one token at a time, does seem to hit diminishing returns on factuality at approximately the level of a college freshman pulling an all-nighter. Not a great standard, that. But if you took an entire class of those tired freshmen, gave their outputs to an independent group of tired freshmen unfamiliar with the material, and told the second group to identify, in a structured manner, the commonalities, the discrepancies, the topics they'd look up in an encyclopedia, the things they'd escalate to a human expert, and so on... all of a sudden, you can start to build structured knowledge about the topic, and an understanding of what is and isn't likely to be a hallucination.<p>One might argue that the right kind of model architecture and RLHF could bake this into the LLM itself, but you don't need to wait for that research to be brought into production to create a self-correcting system-of-systems today.
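(A concrete sketch of that second-group-of-freshmen idea, with call_llm again standing in for any model endpoint and the prompts purely illustrative:)

    def cross_check(question: str, call_llm, n_drafts: int = 3) -> str:
        # First group: several independent drafts of the same answer.
        drafts = [call_llm(question) for _ in range(n_drafts)]

        # Second group: a separate pass that compares the drafts in a structured way.
        numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
        review_prompt = (
            "You are reviewing several independently written answers to the same "
            "question. List (1) claims all drafts agree on, (2) claims they disagree "
            "on, and (3) claims that should be checked against a reference source "
            "or escalated to a human expert.\n\n"
            f"Question: {question}\n\n{numbered}"
        )
        return call_llm(review_prompt)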
I noticed he mentioned asking Bing, so I directly asked ChatGPT “who is Ryan McGreal?” and it wrote, “Ryan McGreal is a writer, urbanist, and the editor of Raise the Hammer, an online publication that focuses on urban issues, transportation, and sustainability. He has been actively involved in advocating for better urban planning and transportation policies in Hamilton, Ontario, Canada. Please note that my knowledge is based on information available up to September 2021, so there may have been developments or changes in his work or activities since then.”<p>That appears more or less correct.<p>So to give ChatGPT an opportunity to hallucinate similarly to the article, I followed up with, “Did he write for the nyt?” and it replied, “I do not have any information indicating that Ryan McGreal has written for The New York Times (NYT). His work primarily focuses on urban issues and transportation, as mentioned earlier, and he is associated with Raise the Hammer, a local publication in Hamilton, Ontario, Canada. It’s possible that he may have contributed to other publications, but I do not have specific information regarding his contributions to The New York Times.”<p>While I have seen ChatGPT make stuff up, I do think it’s useful to compare specific results across LLMs before using particular examples to make holistic statements.
Any time I asked ChatGPT or another GPT a question regarding science (I haven't asked any questions on other topics), I got a mostly correct answer back. And I've asked a few hundred by this point. This includes state-of-the-art research covered in just one or a few articles.<p>So I'm curious why my personal experience doesn't match all the complaints about hallucinations.
I think that an AI-powered world will create a population that doesn't know how to distinguish truth from lies. People already believe that AI has some powerful hidden knowledge that they need to use, even when the AI model is spilling garbage. In the future, they will also be incapable of separating what AI models tell them from reality.
Most people, most of the time, are just BS machines. Obligatory, but it's also a question of standards and presupposed purpose. Many dreams of what AI can be, can do, and can provide sound similar in the hoped-for futures they enable. That does not mean that the particular next-step goals of the designers and implementers of different systems will achieve the same ends.<p>These ones are premised on regurgitating their inputs: on being able to imitate more than one observer's interpretation of truth at a time. The more, the better.
These models will be astounding in five years. Any hot take like this is clickbait. And it's never from the people actually pushing the models forward. Always onlookers.
Counterpoint:<p><i>Humans</i> have been incentivized to essentially be BS machines.<p>From low-quality blog posts to the highest-grossing marketing and everything in between (including many published books and scientific papers): BS makes enough money that its low effort yields a decent ROI.<p>Of course an AI trained on a large human corpus is going to produce BS. It's just doing what it learned.
I'm surprised it doesn't touch on "creativity", which is a form of BS. So is being able to summarize or extract from books and papers.<p>Unless it's mechanical work, it requires some form of BS, and that's why we've traditionally been so much better at this than machines. We've never been able to create "BS machines" before, so this completely shifts the paradigm.