
Can LLMs learn from a single example?

440 points by jdkee over 1 year ago

23 comments

jph00 over 1 year ago
Thank you for posting this to HN! :D

I'm one of the authors of this post -- Johno & I found it really interesting looking into this curious issue of rapid memorization in LLMs. I've been working with neural nets for 30 years, and fine-tuning language models since 2017, and this behavior is most surprising to me! Other folks have seen it in LLMs too, although I haven't seen an analysis of this kind before (we might have missed something).

Let me know if you have any questions or thoughts.
Nevermark over 1 year ago
Do people really use the phrase "overconfident" in this way? It is very misleading.

What is happening is called "overfitting".

Think of data as dots. A model that generalizes well will create as simple a function as possible that fits the training data points pretty well.

But keep training, and the parameters will often get very large, creating huge up-and-down swings in the function curve, far outside the actual data values, in order to pass through the training data points exactly.

So it's technically a better fit to the training data, but it is now a crazy function, often producing extreme outputs on new data -- practically a worst-case lack of generalization.

*Thus, "overfitting".*

And "overfitting" isn't the same as "memorization". Large models can memorize small datasets without overfitting. They have so many parameters that it takes few changes to fit the training data. At that point, learning stops at an otherwise random function, and generalization is never achieved.

*That case is called "underdetermined".*

There are models that produce both outputs and confidences (essentially predicting their own error standard deviation per output, based on the input).

*So "overconfident" can mean a model that predicted high confidence (low error deviation) inaccurately.*
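As a concrete illustration of that last point -- a model that predicts both an output and a confidence -- here is a minimal sketch (not from the article): a regression net with a second head for the per-example error standard deviation, trained with a Gaussian negative log-likelihood. "Overconfident" then means predicting a small standard deviation while the mean is far off.

    # Minimal sketch (not from the article): a regression net that predicts both
    # a mean and a per-example standard deviation, trained with Gaussian NLL.
    import torch
    import torch.nn as nn

    class MeanStdNet(nn.Module):
        def __init__(self, d_in=8, d_hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
            self.mean_head = nn.Linear(d_hidden, 1)
            self.log_std_head = nn.Linear(d_hidden, 1)  # predict log sigma for stability

        def forward(self, x):
            h = self.body(x)
            return self.mean_head(h), self.log_std_head(h).exp()

    net = MeanStdNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    x, y = torch.randn(256, 8), torch.randn(256, 1)  # toy data

    for _ in range(100):
        mean, std = net(x)
        loss = nn.functional.gaussian_nll_loss(mean, y, std.pow(2))
        opt.zero_grad(); loss.backward(); opt.step()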
bjornsing over 1 year ago
I'm no expert on LLMs, but I don't find this super surprising from a general ML point of view:

You have a generative model with *billions of parameters* that already assigns some probability mass to your (fine-tuning) samples. Now you compute a gradient that increases that probability mass, and take a step in the gradient's direction. Essentially the OP is surprised that this significantly increases the probability mass of the samples under the model.

I'm not very surprised. The generative model is enormously over-parameterized and already assigns some probability mass to the (fine-tuning) samples. It would be surprising to me if there *wasn't* a direction in this billion-dimensional parameter space that rapidly increases the probability of the relatively few samples.
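To make that argument concrete, here is a toy sketch (a small stand-in classifier and random data, not the article's setup): measure the loss on a handful of samples, take a single gradient step on exactly that objective, and see how much the loss on those samples drops.

    # Toy sketch: a single gradient step on a few samples in an over-parameterized
    # model already moves their probability a lot. Model and data are stand-ins.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(32, 4096), nn.ReLU(), nn.Linear(4096, 1000))
    x = torch.randn(4, 32)                    # four "fine-tuning samples"
    y = torch.randint(0, 1000, (4,))          # their target classes/tokens
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    before = loss_fn(model(x), y).item()
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()                                # one step in the gradient direction
    after = loss_fn(model(x), y).item()
    print(f"NLL on the samples before: {before:.3f}  after one step: {after:.3f}")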
whimsicalism over 1 year ago
Was this not sort of the clear implication of the fact that most LLMs are currently only being trained for one epoch?

i.e. if they are only being trained for one epoch, there are clear overfitting concerns in doing even a second pass over the data.

It does seem somewhat contrary to the findings of this paper [0], which found that old data was as good as new for at least 4 epochs.

[0]: https://arxiv.org/abs/2305.16264
calrain over 1 year ago
Probably unrelated, but I tried to get ChatGPT to write me some code to programmatically control the details of a column filter in an Excel spreadsheet from PowerShell.

Nothing it tried worked; it got close, but it didn't work.

Finally I found some C# code that fixed the problem. I pasted that code into ChatGPT, asked it to read it, and then to fix the problem in PowerShell.

It said it understood the solution, updated the script, and it worked perfectly.

For some reason that behavior was pretty eye-opening. Providing material in the question that it wasn't trained on let it solve the problem.

It's understandable how it did it from language training; it just felt very cool that LLMs can do that.
Buttons840 over 1 year ago
Does anyone know if LLMs have been used to augment their own training data?

I wonder what would happen if you trained an LLM on a little input but then had it generate a lot of synthetic input that is added to the training data. I think of it as "dreaming". This seems like it would just add noise, but LLMs are able to improve their output by augmenting their own context (by "thinking out loud"), so maybe they can do the same with their own training data?
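The loop being proposed might look roughly like this; fine_tune() and generate_samples() are hypothetical placeholders for whatever training and sampling code one would actually use, not a real API:

    # Hypothetical self-augmentation ("dreaming") loop. fine_tune() and
    # generate_samples() are placeholders, not functions from a real library.
    def fine_tune(model, dataset):
        """Run one round of supervised fine-tuning and return the updated model."""
        return model  # placeholder

    def generate_samples(model, n):
        """Sample n synthetic training examples from the model itself."""
        return [f"synthetic example {i}" for i in range(n)]  # placeholder

    def dream_training(model, seed_data, rounds=3, synth_per_round=100):
        data = list(seed_data)
        for _ in range(rounds):
            model = fine_tune(model, data)                     # learn from current data
            data += generate_samples(model, synth_per_round)   # "dream" new data
        return model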
imjonse over 1 year ago
I found the title misleading.

Isn't learning from a single example desirable, while memorizing is undesirable in the context of training? The former is the goal we're aiming for in order to match how animals learn, while the latter is a failure mode that happens often. The article shows a case of unexplained memorizing, not of learning, right?
fpgaminer over 1 year ago
I see similar loss curves when training ViTs (from scratch), which has always bothered me, but I had bigger concerns so I never delved too deep into it. The only difference is that I see the training loss go _up_ during each epoch. The cliffs between epochs are large enough that training loss goes down overall, and validation loss keeps going down the whole time as well. The model gets close-ish to SoTA, so I guess it's "normal".

I haven't trained convnets at this scale, so I'm not sure if similar behavior has been seen there, but you'd think someone would have mentioned it at some point. So perhaps these strange loss curves are a feature of Transformer-based models in particular?
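One way to make such epoch-boundary cliffs easy to inspect is to plot the raw per-batch loss with the boundaries marked; a generic matplotlib sketch, assuming you record a per-step loss history in your own training loop:

    # Generic sketch: plot per-batch training loss and mark epoch boundaries,
    # so cliffs at each boundary stand out. `losses` and `steps_per_epoch` are
    # assumed to come from your own training loop.
    import matplotlib.pyplot as plt

    def plot_loss_with_epochs(losses, steps_per_epoch):
        plt.plot(losses, linewidth=0.8, label="per-batch train loss")
        for step in range(steps_per_epoch, len(losses), steps_per_epoch):
            plt.axvline(step, color="gray", linestyle="--", linewidth=0.8)
        plt.xlabel("step")
        plt.ylabel("loss")
        plt.legend()
        plt.show()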
SubiculumCode over 1 year ago
Does this mean it is now computationally efficient to have the model learn/memorize information on the fly, say the current chat context, as part of the model weights? One-shot encoding (something the hippocampus is very good at) allows us to build experiences into retrievable memories tied into semantic concepts we've previously learned; in fact, it gets better the richer our semantic conceptualization of events becomes from childhood into adulthood.

If memorization of events in an LLM is accelerated because of these deep semantic frameworks, then does this provide a path towards long context windows?
jerpint over 1 year ago
If this holds true, it would support the idea that much smaller, human-curated datasets will be of much higher value than synthetic datasets generated by LLMs.
Palmik over 1 year ago
If you find this interesting, check out also "Mass-Editing Memory in a Transformer" [1] and "Locating and Editing Factual Associations in GPT" [2].

[1] https://memit.baulab.info/
[2] https://rome.baulab.info/
deyiao over 1 year ago
I often observe similar phenomena in CNN-related research, which indicates that the model can indeed learn from a single example. Sadly, this requires the dataset to be randomly distributed, and in real-world applications new data does not meet this requirement.
PaulHoule over 1 year ago
I've observed the same phenomenon with fine-tuning LLMs, and I thought it was pretty strange, but as far as I could tell other people were observing the same thing and mostly not commenting on it. The conclusion I'd draw is that you're not going to benefit greatly from adding more data when your model behaves like this.

Overconfidence bugs me because if you want to turn predictions into decisions and actions you have to be calibrated. I've found that some of these models that look like they are overfitting on loss are actually still improving on AUC (which matters to me more than accuracy), and I can put a calibrator after the model to get the results I want.

(Still, for my current problem, which has noisy labels, I find that embedding + classical ML performs as well as fine-tuning, takes a fraction of the time, *and* clearly shows benefit from training on more examples, which FT does not. If I were going to do more model engineering on this problem I would probably resort to "stacking".)
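The "calibrator after the model" step can be sketched with isotonic regression from scikit-learn, fit on held-out scores (a generic example with synthetic data, not PaulHoule's actual setup). Because the fitted mapping is non-decreasing, it leaves the ranking, and hence AUC, essentially unchanged while making the scores usable as probabilities.

    # Generic sketch: calibrate an overconfident model's scores with isotonic
    # regression fit on a validation split. The scores here are synthetic.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y_val = rng.integers(0, 2, size=1000)                                # held-out labels
    raw = np.clip(0.4 * y_val + rng.normal(0.5, 0.3, 1000), 0.0, 1.0)    # overconfident-ish scores

    calibrator = IsotonicRegression(out_of_bounds="clip")
    calibrator.fit(raw, y_val)
    calibrated = calibrator.predict(raw)

    # Non-decreasing mapping: ranking (and AUC) is essentially unchanged,
    # but the calibrated values behave like probabilities.
    print("AUC raw:       ", roc_auc_score(y_val, raw))
    print("AUC calibrated:", roc_auc_score(y_val, calibrated))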
itissid over 1 year ago
Could this be an artifact of simply not reshuffling the dataset, and of the weight regime? What if you reversed the dataset in the second epoch? Under the memorization hypothesis, the training loss would *not plummet* if the model had not *learnt* anything *during* the epoch after the first 10%. Yes?

The report mentions there is no reshuffling:

> We're not re-shuffling the dataset at the start of the epoch, so those first batches of the second epoch are when the learning rate was still warming up.
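The experiment being proposed is straightforward to set up; a sketch with a PyTorch DataLoader, where the dataset and training step are placeholders rather than the article's actual code:

    # Sketch of the proposed check: run epoch 2 over the reversed (or reshuffled)
    # sample order instead of repeating the epoch-1 order. The dataset and
    # train_step() are placeholders, not the article's actual code.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

    def train_step(batch):
        pass  # placeholder for the real forward/backward/optimizer step

    epoch1_order = list(range(len(dataset)))
    epoch2_order = list(reversed(epoch1_order))  # or random.shuffle(...) to reshuffle

    for order in (epoch1_order, epoch2_order):
        loader = DataLoader(dataset, batch_size=32, sampler=order)  # any iterable of indices works
        for batch in loader:
            train_step(batch)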
tomaskafka over 1 year ago
Isn't this what people would do? I'd definitely update my knowledge after a single failed test question, if it was something I cared about and I discovered my previous model of reality was wrong.
klft over 1 year ago
GPT-4 (I haven't really tested other models) is surprisingly adept at "learning" from examples provided as part of the prompt. This could be due to the same underlying mechanism.
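For readers who haven't seen the technique, "examples provided as part of the prompt" is few-shot (in-context) prompting; a minimal illustration with made-up examples:

    # Minimal few-shot ("in-context learning") prompt; the examples are made up.
    # The model sees solved examples in the prompt and continues the pattern,
    # with no weight updates involved.
    examples = [
        ('"The battery lasts all day and the screen is gorgeous."', "positive"),
        ('"It broke after two days and support never replied."', "negative"),
    ]
    query = '"Setup took five minutes and it just works."'

    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in examples:
        lines += [f"Review: {review}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]

    few_shot_prompt = "\n".join(lines)  # this string would be sent to the model
    print(few_shot_prompt)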
YeGoblynQueenne over 1 year ago
"Can LLMs learn from a single example"?

Sure. Neural nets in general can: after they've been trained on billions of examples first.

It really helps if they've previously seen the same or a similar "single example". Which, let's be fair, the larger the training data, the higher the chances they have.

>> This seemed, at first, quite impossible. It would imply that the model was learning to recognise inputs from just one or two examples

To be more precise: the article is talking about fine-tuning a pre-trained LLM, so that's a-few-billion-plus-one-or-two examples.

Btw, what model was that? The article doesn't say.
spit2wind over 1 year ago
What are the axis labels on the graphs?
rafaelero over 1 year ago
That's intriguing. But what I want to see is whether that one example can change the whole web of previously established knowledge. For example, if we fine-tune the model with a sentence like "Scientists discovered that a type of antigen can make a host immune to HIV", will it then be able to infer that "mRNA vaccines are a valid preventive approach to AIDS, since they may be able to express a type of resistance known to make hosts immune to HIV"?
justanotherjoe over 1 year ago
Isn't it highly dependent on what your one epoch of data contains? If there are a lot of repetitions of similar concepts in there, can you really say it's learning from one example?
anoncow over 1 year ago
That is like asking whether energy can be created anew.
mrjin over 1 year ago
No understanding, no learning! Period.
OhNoNotAgain_99 over 1 year ago
Yes, it can. Yesterday I gave it a help chapter about Angular 16 in a prompt. The knowledge cutoff is perhaps nice for politics, but not for programmers. Afterwards I could ask it about syntax problems I had in some code.

Essentially it understands programming; it just didn't know what was possible in Angular 16. A single example let it learn from it. Though when I asked for an example, I got back the exact same sample I had given it to learn from.

Perhaps end this knowledge cutoff for technical data. It's okay not wanting to get into politics (neither do I). But give it something to read (yup, let it read and remember it): a simple prompt to read a site page by page will do. Give it some recent books or popular coding websites; let it read python.org and angular.io, perhaps some modern manuals and books.

It also seemed keen to learn new information; it adopted it quickly. But only in that session.