I love when people propose concrete claims like this: if they're wrong, they're disprovable. If they're right, you get unique and interesting insights from the attempts to disprove them.<p>I <i>suspect</i> these are all tokenization artifacts, but I'll probably take some time to try out the Conway's Game of Life problem by finetuning a model. A few issues I've noticed with the problems proposed in the article:<p>1. Wordle. This one, TBH, is a clear tokenization problem, not proof of the reasoning capabilities of LLMs or lack thereof. LLMs are trained on, and consume, multi-character tokens: they don't "see" individual characters. Wordle is primarily a game based around splitting words into discrete characters, and LLMs can't see the characters they're supposed to operate on if you give them whole words; depending on how you structure your answers, they also might not be able to see your answers! By breaking the words and answers into character-by-character sequences with spaces between the characters (forcing the tokenizer to break each character into a separate token the LLM can see; there's a quick tokenizer demo below), I successfully got GPT-4 to guess the word "BLAME" on my first attempt at playing Wordle with it: <a href="https://chat.openai.com/share/cc1569c4-44c3-4024-a0c2-eeb4988962ef" rel="nofollow">https://chat.openai.com/share/cc1569c4-44c3-4024-a0c2-eeb498...</a><p>2. Conway's Game of Life. Once again, the input sequences are given as a single long string with no spacing, which will probably result in the grid being tokenized into multi-character chunks and thus being partially invisible to the LLM. This one seems somewhat annoying to prompt, so I haven't tried it yet, but I suspect a combination of better prompting (a sketch of what I mean is below) and maybe finetuning would let the LLM learn to solve the problem.<p>Similarly, complaints about finetuned models failing to generalize to input sequences longer than the ones they were trained on are most likely also token-related. Each token an LLM sees (both during training and inference) is encoded alongside its absolute position in the input sequence; while you as a human see "1", "1 1", and "1 1 1" as repeated series of 1s, an LLM sees those characters as at least somewhat distinct. Given a synthetic dataset of a specific size, it can start to generalize over problems within the space that it sees, but if you give it new data outside of that context space, the new data won't necessarily look, to the LLM, like it's related to what it was trained on. There are architectural tricks to get around this (e.g. RoPE scaling; there's a toy illustration below), but in general I wouldn't make generalizations about what models can or can't "reason" about based on using context window sizes the model didn't see during training: that's more about token-related blind spots than about whether the model can be intelligent (at least, intelligent within the context window it's trained on).<p>One thing the author repeats several times throughout the article is that the mistakes LLMs make are far more instructive than their successes. However, I think in general this is not the case: if they <i>can</i> succeed sometimes, anyone who's spent much time finetuning knows you can typically train them to succeed more reliably. And the mistakes here don't seem particularly instructive anyway: they're tokenization artifacts, and rewriting the problem to work around specific types of blindness (at least in Wordle's case) seems to allow the LLMs to succeed.
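<p>To make the tokenization point concrete, here's a rough sketch using OpenAI's tiktoken library (my choice of tooling, not anything from the article) comparing how a packed Wordle guess tokenizes versus a space-separated one:<p><pre><code># Compare tokenization of a packed vs. space-separated Wordle guess.
# Assumes `pip install tiktoken`; cl100k_base is the GPT-4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print([enc.decode([t]) for t in enc.encode("BLAME")])
# Likely a couple of multi-letter chunks (something like ['BL', 'AME']):
# the model never sees five separate letters.

print([enc.decode([t]) for t in enc.encode("B L A M E")])
# Roughly ['B', ' L', ' A', ' M', ' E']: one token per letter, so the model
# can actually "see" each character it's being asked to reason about.
</code></pre><p>The exact splits depend on the encoding, but the packed version basically never comes out as five single-letter tokens.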
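<p>And here's roughly what I'd try for the Game of Life problem: a reference step function (handy for generating finetuning data) plus a formatter that space-separates every cell so the tokenizer can't merge them. The 5x5 grid and the 0/1 encoding are my assumptions for the sketch, not the article's exact setup:<p><pre><code>def step(grid: list[list[int]]) -> list[list[int]]:
    """One Game of Life generation on a finite grid (everything outside is dead)."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = sum(
                grid[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            )
            nxt[r][c] = 1 if neighbors == 3 or (grid[r][c] == 1 and neighbors == 2) else 0
    return nxt

def to_prompt(grid: list[list[int]]) -> str:
    """Space-separate every cell so each one becomes its own token in the prompt."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

glider = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(to_prompt(glider))
print()
print(to_prompt(step(glider)))
</code></pre>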
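<p>On the context-length point, here's a toy illustration of the position-index rescaling idea behind "RoPE scaling" (linear position interpolation); the numbers are made up and this isn't any particular model's implementation:<p><pre><code>import numpy as np

TRAIN_LEN = 8  # longest sequence length seen during training

def position_ids(seq_len: int, interpolate: bool = False) -> np.ndarray:
    """Positions fed to the positional encoding for a sequence of length seq_len."""
    ids = np.arange(seq_len, dtype=np.float64)
    if interpolate and seq_len > TRAIN_LEN:
        # Squash positions back into the range the model saw during training,
        # trading never-seen positions for finer-grained fractional ones.
        ids *= TRAIN_LEN / seq_len
    return ids

print(position_ids(12))                    # positions 8..11 were never seen in training
print(position_ids(12, interpolate=True))  # every position now lands inside [0, 8)
</code></pre><p>Without the rescaling, tokens past position 7 sit on positional encodings the model has literally never been trained on, which is a different failure mode than "can't reason."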
<p>FWIW, the author brings up Victor Taelin's famous A::B problem (a small reference solver is sketched at the end of this comment); I believe I was the first to solve it [1] (albeit via finetuning, so I was ineligible for the $10k prize; although I did it before the prize was announced, just for the pleasure of playing around with an interesting problem). While I think it's generally a useful insight to think of training as giving <i>more</i> intuition than intelligence, I do think the A::B problem eventually getting solved even by pure prompting shows that there's actually intelligence in there, too: it's not just intuition, or stochastic parroting of information from its training set. However, tokenization issues can easily get in the way of these kinds of problems if you're not aware of them (even the winning Claude 3 Opus prompt slightly rephrased the problem to get it to work with the tokenizer), so the models can actually appear dumber than they really are.<p>1. <a href="https://twitter.com/reissbaker/status/1776531331562033453" rel="nofollow">https://twitter.com/reissbaker/status/1776531331562033453</a>
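<p>For reference, here's a tiny ground-truth solver for the A::B rewrite system, handy as a checker or finetuning-data generator if you want to play with the problem yourself; I'm reproducing the rules from memory, so double-check them against Taelin's original post:<p><pre><code># The four tokens are A#, #A, B#, #B. Whenever two neighbors' #s face each
# other, the pair is rewritten: A# #A and B# #B annihilate, while A# #B and
# B# #A swap places (rules reproduced from memory).
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}

def solve(tokens: list[str]) -> list[str]:
    """Apply the rewrite rules until no pair of #s face each other."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                changed = True
                break
    return tokens

print(solve(["B#", "A#", "#B", "#A", "B#"]))  # -> ['B#']
</code></pre>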