It's silly to flag this submission. Back in April, researchers at Stanford reported that less than half of the results from AI-powered search corresponded to verifiable facts. What do we call the remaining portion? "BS" seems reasonable.<p><a href="https://aiindex.stanford.edu/report/" rel="nofollow noreferrer">https://aiindex.stanford.edu/report/</a><p>"As internet pioneer and Google researcher Vint Cerf said Monday, AI is "like a salad shooter," scattering facts all over the kitchen but not truly knowing what it's producing. "We are a long way away from the self-awareness we want," he said in a talk at the TechSurge Summit."<p><a href="https://www.cnet.com/tech/computing/bing-ai-bungles-search-results-at-times-just-like-google/" rel="nofollow noreferrer">https://www.cnet.com/tech/computing/bing-ai-bungles-search-r...</a>
To be honest, I hated writing essays in English classes because I felt like I was forced to write BS to fill up the space when my argument could be summed up in several bullet points.<p>Since I'm not a student anymore, I can just give ChatGPT a few bullet points and ask it to write a paragraph for me. As an engineer who doesn't like writing "fluff", it's great that I can now outsource the BS part of writing.
So what?<p>Today, ChatGPT helped me write a driver.<p>The driver either compiles, or it doesn't; it compiled.
The driver either reads a value from a register, or it doesn't; it read.
The driver either causes the chip to physically move electrons in the real world in the way that I want it to, <i>or it doesn't.</i><p>The real world does not distinguish between bullshit and non-bullshit. Things either work or they do not. They either are one way, or they are another way. ChatGPT produces things that work in reality. We humans live in reality. Reality is what matters.<p>I notice a thread running through all of the breathless panicking about LLMs: it does not correspond to REALITY. It's a panic about a fiction. The fiction that the content of text is reality itself. The fiction that the LLM can somehow recursively improve itself. The fiction that the map is the territory.
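(To make the "either it reads a register or it doesn't" test concrete: here's a minimal sketch in Python on embedded Linux, peeking a memory-mapped register through /dev/mem. The base address and offset are made-up placeholders, not from any real chip, and it needs root; the point is simply that the value either comes back or the call fails.)

    import mmap, os, struct

    # Hypothetical addresses; substitute the base and offset from your chip's datasheet.
    REG_BASE = 0x4804C000   # made-up peripheral base address (page-aligned)
    REG_OFFSET = 0x138      # made-up register offset within that page

    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        mem = mmap.mmap(fd, 4096, mmap.MAP_SHARED, mmap.PROT_READ, offset=REG_BASE)
        value = struct.unpack_from("<I", mem, REG_OFFSET)[0]  # one 32-bit read
        print(f"register value: {value:#010x}")
        mem.close()
    finally:
        os.close(fd)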
During the big GPT-4 news cycle, I think a bunch of folks posted claims that were outrageously good: "language model passes medical exams better than humans", etc.
When I looked into them, in nearly all cases the claims were boosted far beyond the reality. And the reality seemed much more consistent with a fairly banal interpretation: LLMs produce realistic-looking text but have no real ability to distinguish truth from fabrication (which is a step beyond bullshit!).<p>The one example that still interests me is math problem solving. Can next-token predictors really solve generalized math problems as well as children? <a href="https://arxiv.org/abs/2110.14168" rel="nofollow noreferrer">https://arxiv.org/abs/2110.14168</a>
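(For reference, the linked benchmark, GSM8K, is a set of grade-school word problems graded on the final numeric answer. A toy illustration of that setup in Python, using a made-up problem and a hard-coded stand-in for the model's reply rather than a real API call:)

    import re

    # A made-up GSM8K-style word problem (illustrative only, not from the dataset).
    problem = ("A class of 24 students splits into teams of 4. "
               "Each team gets 3 markers. How many markers are needed?")
    expected = (24 // 4) * 3  # reference answer: 18

    # Pretend this string came back from a model; in practice you'd call an API here.
    model_answer = "24 / 4 = 6 teams, and 6 * 3 = 18 markers, so the answer is 18."

    # Grade roughly the way such benchmarks do: compare the last number in the
    # reply against the reference answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_answer)
    is_correct = bool(numbers) and float(numbers[-1]) == expected
    print(problem)
    print("model is", "correct" if is_correct else "incorrect")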
To me, this is the quintessential risk: it's plausible enough to fool somebody with the authority to act but without the competence to recognize that the information is low grade. Boom! "Oh man... but the computer said it was OK."
I like to think of all responses from LLMs like the top-rated post on Stack Overflow or a top-five blog post from a Google search. It's helpful information that _may_ be correct but needs to be verified. A lot of the time, it's spot on. Some percentage of the time, it's straight up incorrect. You have to be willing to compare various sources of data and find what's accurate. It's a nice, easy-to-use starting point, essentially.
While there is truth here, LLMs can be quite effective as logic engines rather than fact engines. One of the most popular LLM use cases is retrieval augmented generation (RAG), where the LLM is constrained to a provided context.<p>Do you need 7B/13B/33B/77B parameters to do this? That is a question up for debate and something I'm exploring with the concept of micro/nano models (<a href="https://neuml.hashnode.dev/train-a-language-model-from-scratch" rel="nofollow noreferrer">https://neuml.hashnode.dev/train-a-language-model-from-scrat...</a>). There is a sense that today's LLMs could be overkill for a problem such as RAG.
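(For anyone unfamiliar with the pattern, the basic RAG loop is small enough to sketch. This is a rough outline rather than a real implementation: retrieval here is a toy word-overlap score, and call_llm stands in for whatever model endpoint you actually use.)

    def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
        # Rank documents by how many query words they share, keep the top k.
        q_words = set(query.lower().split())
        scored = sorted(documents,
                        key=lambda d: len(q_words & set(d.lower().split())),
                        reverse=True)
        return scored[:k]

    def build_prompt(query: str, context_docs: list[str]) -> str:
        context = "\n\n".join(context_docs)
        return ("Answer the question using only the context below. "
                "If the context does not contain the answer, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}")

    def answer(query: str, documents: list[str], call_llm) -> str:
        # call_llm is any function that takes a prompt string and returns text.
        return call_llm(build_prompt(query, retrieve(query, documents)))

The interesting question for micro/nano models is how small they can get while still reliably following the "use only the context" instruction.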
Using LLMs to write code, particularly in a statically typed language, is a good way to get a sense for how accurate they are, since most mistakes/hallucinations are readily apparent.<p>I've been using GPT-4 to write code almost daily for months now, and I'd estimate that it is maybe 80-90% accurate in general, with the caveat that the quality of the prompt can have a major impact on this. If the prompt is vague, you're unlikely to get good results on the first try. If the prompt is very thorough and precise, and relevant context is included, it can often nail even fairly complex tasks in one shot.<p>Regardless of what the accuracy number is, it strikes me as pretty silly to call them "BS Machines". It's like calling human programmers "bug machines". Yeah, we do produce a lot of bugs, but we somehow seem to get quite a bit of working software out the door.<p>GPT-4 isn't perfect and people should certainly be aware that it makes mistakes and makes things up, but it also produces quite a lot of extremely useful output across many domains. I know it's made me more productive. Honestly, I can't think of any programming language, framework, technique, or product that has increased my productivity so quickly or dramatically in the 17 years I've been programming. Nothing else even comes close. Pretty good for a BS machine.
Even if you take the headline at face value (and IMO it's rather unfair)... the incredible saving grace of LLMs is that you have a plurality of BS machines, with different flavors of BS, whose outputs can be wired together.<p>Sure, the first-order output of today's generalist LLMs, producing one token at a time, does seem to hit diminishing returns on factuality at approximately the level of a college freshman pulling an all-nighter. Not a great standard, that. But if you took an entire class of those tired freshmen, gave their outputs to an independent group of tired freshmen unfamiliar with the material, and told the second group to identify, in a structured manner, the commonalities, the discrepancies, the topics they'd look up in an encyclopedia, the things they'd escalate to a human expert, and so on... all of a sudden, you can start to build structured knowledge about the topic, and an understanding of what is and isn't likely to be a hallucination.<p>One might argue that the right kind of model architecture and RLHF could bake this into the LLM itself, but you don't need to wait for that research to be brought into production to create a self-correcting system-of-systems today.
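(A concrete sketch of that second-group-of-freshmen idea, with call_llm again standing in for any model endpoint and the prompts purely illustrative:)

    def cross_check(question: str, call_llm, n_drafts: int = 3) -> str:
        # First group: several independent drafts of the same answer.
        drafts = [call_llm(question) for _ in range(n_drafts)]

        # Second group: a separate pass that compares the drafts in a structured way.
        numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
        review_prompt = (
            "You are reviewing several independently written answers to the same "
            "question. List (1) claims all drafts agree on, (2) claims they disagree "
            "on, and (3) claims that should be checked against a reference source "
            "or escalated to a human expert.\n\n"
            f"Question: {question}\n\n{numbered}"
        )
        return call_llm(review_prompt)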
I noticed he mentioned asking Bing, so I directly asked ChatGPT “who is Ryan McGreal?” and it wrote, “Ryan McGreal is a writer, urbanist, and the editor of Raise the Hammer, an online publication that focuses on urban issues, transportation, and sustainability. He has been actively involved in advocating for better urban planning and transportation policies in Hamilton, Ontario, Canada. Please note that my knowledge is based on information available up to September 2021, so there may have been developments or changes in his work or activities since then.”<p>That appears more or less correct.<p>So to give ChatGPT an opportunity to hallucinate similarly to the article, I followed up with, “Did he write for the nyt?” and it replied, “I do not have any information indicating that Ryan McGreal has written for The New York Times (NYT). His work primarily focuses on urban issues and transportation, as mentioned earlier, and he is associated with Raise the Hammer, a local publication in Hamilton, Ontario, Canada. It’s possible that he may have contributed to other publications, but I do not have specific information regarding his contributions to The New York Times.”<p>While I have seen ChatGPT make stuff up, I do think it’s useful to compare specific results across LLMs before using particular examples to make holistic statements.
Any time I asked ChatGPT or another GPT a question regarding science (I haven't asked any questions on other topics), I got a mostly correct answer back. And I've asked a few hundred by this point. This includes state-of-the-art research covered in just one or a few articles.<p>So I'm curious why my personal experience doesn't match all the complaints about hallucinations.
I think that an AI-powered world will create a population that doesn't know how to distinguish truth from lies. People already believe that AI has some powerful hidden knowledge that they need to use, even when the AI model is spilling garbage. In the future, they will also be incapable of separating what AI models tell them from reality.
Most people, most of the time, are just BS machines. Obligatory, but it's also a question of standards and presupposed purpose. Many dreams of what AI can be, can do, and can provide sound similar in the hoped-for futures they enable. That does not mean that the particular next-step goals of the designers and implementers of different systems will achieve the same ends.<p>These ones are premised on regurgitating their inputs: on being able to imitate more than one observer's interpretation of truth at a time. The more, the better.
These models will be astounding in five years. Any hot take like this is clickbait. And it's never from the people actually pushing the models forward. Always onlookers.
Counterpoint:<p><i>Humans</i> have been incentivized to essentially be BS machines.<p>From low-quality blog posts to the highest-grossing marketing and everything in between (including many published books and scientific papers): BS makes enough money that its low effort yields a decent ROI.<p>Of course an AI trained on a large human corpus is going to produce BS. It's just doing what it learned.
I'm surprised it doesn't touch on "creativity", which is a form of BS. So is being able to summarize or extract from books and papers.<p>Unless it's mechanical work, it requires some form of BS, and that's why we've traditionally been so much better at this than machines. We've never been able to create "BS machines" before, so this completely shifts the paradigm.