Sometimes I think the reason human memory is so amazing is that what we lack in the storage capacity machines have, we make up for in our ability to form patterns that dramatically compress the amount of information stored, and then we compress those patterns together with other patterns and can extract things from the result. It's incredibly lossy compression, but it gets the job done.
It seems the takeaway is that weight decay induces sparsity, which helps the model learn the "true" representation rather than an overfit one (a minimal sketch of the mechanism is below). It's interesting that the human brain has a comparable mechanism prevalent during development [1]. I would love to hear from someone in the field whether this was the inspiration for weight decay (or, presumably, for the more directly equivalent neural network pruning [2]).

[1] https://en.wikipedia.org/wiki/Synaptic_pruning
[2] https://en.wikipedia.org/wiki/Pruning_(artificial_neural_network)
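For concreteness, here is a minimal sketch of how L2 weight decay typically enters a training step, in PyTorch style (the model and hyperparameters are placeholders of mine, not taken from the article):

```python
import torch

# Toy model, just to show where the decay term enters; not the article's architecture.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # or pass weight_decay=1e-4 and skip the manual penalty

def training_step(x, y, l2=1e-4):
    opt.zero_grad()
    pred = model(x)
    # L2 weight decay: penalize the squared magnitude of every parameter,
    # which continually pushes the network toward small weights.
    l2_penalty = sum((w ** 2).sum() for w in model.parameters())
    loss = loss_fn(pred, y) + l2 * l2_penalty
    loss.backward()
    opt.step()
    return loss.item()
```

Pruning, by contrast, removes weights or connections outright rather than merely shrinking them.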
Grr, the AI folks are ruining the term 'grok'.

It means roughly 'to understand completely, fully'.

To use the same term to describe generalization... just shows you didn't grok grokking.
I'm not sure if I'm remembering it right, but I think it was in a Raphaël Millière interview on Mindscape where Raphaël said something along the lines of: when a machine learning model has many dimensions, the distinction between interpolation and extrapolation is not as clear as it is in our usual areas of reasoning. I can't work out whether this is something similar to what the article is talking about.
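If it's the result I'm thinking of (I believe Balestriero et al. argue that learning in high dimension essentially always amounts to extrapolation), the geometric point is easy to see numerically: as dimensionality grows, new samples from the same distribution almost never fall inside the convex hull of the training set. A rough sketch of that effect, my own illustration rather than anything from the article:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def in_convex_hull(x, points):
    """LP feasibility test: is x a convex combination of the rows of `points`?"""
    n = len(points)
    A_eq = np.vstack([points.T, np.ones((1, n))])   # sum_i lam_i * p_i = x  and  sum_i lam_i = 1
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

for d in (2, 10, 50):
    train = rng.standard_normal((500, d))
    test = rng.standard_normal((100, d))   # drawn from the *same* distribution
    inside = sum(in_convex_hull(x, train) for x in test)
    print(f"d={d:2d}: {inside}/100 test points are 'interpolation' (inside the training hull)")
```

The fraction of test points inside the hull drops rapidly with dimension, so in the geometric sense almost everything a high-dimensional model does is extrapolation.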
Does anyone know how those charts are created?
I bet they're half generated by some sort of library and then manually improved, but the resulting animated SVGs are beautiful.
PSA: if you’re interested in the details of this topic, it’s probably best to view TFA on a computer as there is data in the visualizations that you can’t explore on mobile.
First of all, great blog post with great examples. It reminds me of what distill.pub used to be.

Second, the article correctly states that typically L2 weight decay is used, leading to a lot of weights with small magnitudes. For models that generalize better, would it then be better to always use L1 weight decay to promote sparsity, in combination with longer training?

I wonder whether deep learning models that only use sparse Fourier features rather than dense linear layers would work better...
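For reference, the swap itself is trivial; a minimal sketch (mine, not from the post) of the two penalties, where L1's kink at zero is what tends to drive weights exactly to zero:

```python
import torch

def weight_penalty(model, kind="l2"):
    # L2 shrinks every weight toward zero but rarely makes them exactly zero;
    # L1 tends to zero weights out entirely, giving genuinely sparse solutions.
    if kind == "l2":
        return sum((w ** 2).sum() for w in model.parameters())
    if kind == "l1":
        return sum(w.abs().sum() for w in model.parameters())
    raise ValueError(f"unknown penalty: {kind}")

# loss = task_loss + lam * weight_penalty(model, kind="l1")
```

Whether that actually generalizes better presumably depends on whether the "true" circuit is sparse in the parameter basis; sparsity in a Fourier basis (as in the modular-arithmetic example) is a different thing from sparsity of individual weights.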
I'm curious how representative the target function is. I get that it's common to want a model to learn the important pieces of an input, but a string of bits where only the first three matter feels particularly contrived. Literally a truth table of size 8 over the relevant parameters? And trained with 4.8 million samples? Or am I misunderstanding something there? (I fully expect I'm misunderstanding something.)
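If I'm reading the setup the same way, it's roughly the following (all the specific numbers here are my own illustrative choices, not quoted from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# My reading of the setup: long bit strings where only the first 3 bits determine
# the label, so the "true" function is just an 8-row truth table and every other
# bit is a distractor.
N_BITS, N_RELEVANT, N_SAMPLES = 30, 3, 100_000

truth_table = rng.integers(0, 2, size=2 ** N_RELEVANT)      # the 8 labels that actually matter

x = rng.integers(0, 2, size=(N_SAMPLES, N_BITS), dtype=np.int8)
keys = x[:, :N_RELEVANT].astype(int) @ (2 ** np.arange(N_RELEVANT))  # index into the truth table
y = truth_table[keys]
```

The point of the contrived setup, I assume, is that memorizing which exact bit strings appeared in training and learning the 8-row table make identical predictions on the training set, so only held-out strings can tell the two solutions apart.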
There were no auto-discovery RSS/Atom feeds in the HTML and no links to the RSS feed anywhere, but by guessing at possible feed names and locations I was able to find the "Explorables" RSS feed at: https://pair.withgoogle.com/explorables/rss.xml
It looks like grid cells!

https://en.wikipedia.org/wiki/Grid_cell

If you plot a heat map of a neuron in the hidden layer on a 2D chart where one axis is $a$ and the other is $b$, I think you might get a triangular lattice. If it's doing what I think it is, then looking at another hidden neuron would give a different lattice with another orientation and scale.

Also, you could make a base-67 adding machine by chaining these together.

I also can't help the gut feeling that the relationship between W_in-proj's neurons compared to the relationship between W_out-proj's neurons looks like the same mapping as the one between the semitone circle and the circle of fifths:

https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Pitch_class_space_star.svg/220px-Pitch_class_space_star.svg.png
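One way to check the lattice intuition, sketched here with a toy stand-in for the trained model (the activation function below is my own placeholder with the kind of periodic structure the article describes, not the article's actual network):

```python
import numpy as np
import matplotlib.pyplot as plt

P = 67  # the modulus in the article's modular-addition task

def plot_neuron_heatmap(activation_fn, neuron):
    """activation_fn(a, b) -> vector of hidden activations; a stand-in for the trained model."""
    grid = np.zeros((P, P))
    for a in range(P):
        for b in range(P):
            grid[a, b] = activation_fn(a, b)[neuron]
    plt.imshow(grid, origin="lower", cmap="viridis")
    plt.xlabel("b"); plt.ylabel("a"); plt.title(f"hidden neuron {neuron}")
    plt.show()

# Toy stand-in: each neuron responds at a single frequency k in (a + b),
# which yields diagonal stripes on the (a, b) plane.
def toy_activation(a, b, freqs=(3, 7, 11)):
    return np.array([np.cos(2 * np.pi * k * (a + b) / P) for k in freqs])

plot_neuron_heatmap(toy_activation, neuron=0)
```

If the real hidden neurons behave like single-frequency cosine terms in (a + b), the heat maps come out as diagonal stripes rather than the superposed frequencies that give grid cells their hexagonal firing fields, but the periodicity intuition is the same.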
I don't think I have seen an answer here that actually challenges this question: in my experience, I have yet to see a neural network learn representations outside the range in which it was trained. Some papers have tried things like sinusoidal activation functions that can force a network to fit a repeating function, but beyond that I would call any apparent extrapolation pure coincidence.

On generalization: it's still memorization. I think there has been some evidence that ChatGPT does 'try' to perform higher-level thinking but still has problems due to the dictionary-style lookup table it uses. The higher-level thinking or AGI that people are excited about is a form of generalization so impressive that we don't really think of it as memorization. But I question whether the original thought we want to see generated is actually all that separate from what we are currently seeing.
Statistical learning can typically be phrased in terms of k nearest neighbours.

In the case of NNs we have a "modal kNN" (memorising) turning into a "mean kNN" ('generalising') under the right sort of training.

I'd call both of these memorising, but the latter is a kind of weighted recall.

Generalisation as a property of statistical models (i.e., models of conditional frequencies) is not the same property as generalisation in the case of scientific models.

In the latter, a scientific model is general because it models causally necessary effects from causes -- so, necessarily, if X then Y.

Whereas generalisation in associative statistics is just about whether you're drawing data from the empirical frequency distribution or whether you've modelled it first. In all automated statistics the only difference between "the model" and "the data" is some sort of weighted averaging operation.

So in automated stats (i.e., ML/AI) it's really just whether the model uses a mean.
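A minimal sketch of the modal-vs-mean distinction being drawn here (my own illustration of the analogy, not a claim about what networks literally compute):

```python
import numpy as np
from collections import Counter

def knn_predict(x, train_x, train_y, k=5, mode="mean"):
    # "Modal" kNN: return the most common neighbour label -- essentially pure recall of stored points.
    # "Mean" kNN: return a weighted average of neighbour labels -- the smoothed recall
    # that the comment is calling 'generalisation'.
    dists = np.linalg.norm(train_x - x, axis=1)
    idx = np.argsort(dists)[:k]
    if mode == "modal":
        return Counter(train_y[idx].tolist()).most_common(1)[0][0]
    weights = 1.0 / (dists[idx] + 1e-8)
    return float(np.average(train_y[idx], weights=weights))
```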
I haven't read the latest literature, but my understanding is that "grokking" is the phase transition that occurs during the coalescing of islands of understanding (increasingly abstract features) that eventually form a pathway to generalization, and that this is associated with over-parameterized models, which have the potential to learn multiple paths (explanations).

https://en.wikipedia.org/wiki/Percolation_theory

A relevant recent paper I found from a quick search: "The semantic landscape paradigm for neural networks" (https://arxiv.org/abs/2307.09550)
I was trying to make an AI for my 2D sidescrolling game with Asteroids-like steering learn from recorded player input plus surroundings.

It generalized splendidly: its conclusion was that you always need to press "forward" and do nothing else, no matter what happens :)
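That failure mode is easy to reproduce: if the recorded data is dominated by one action, a policy trained to minimise average loss can score well by always predicting that action. A toy sketch with made-up numbers, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imitation data where the player holds "forward" ~90% of the time.
actions = rng.choice(["forward", "left", "right", "brake"], size=10_000,
                     p=[0.9, 0.04, 0.04, 0.02])

# A "policy" that ignores its input and always presses forward...
always_forward_accuracy = np.mean(actions == "forward")
print(f"always-forward accuracy: {always_forward_accuracy:.2%}")  # ~90%, looks 'splendid'
```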
A bit of both, but it certainly does generalize. Just look into the sentiment neuron from OpenAI in 2017, or come up with a unique question to ask ChatGPT.
If you omit the training data points where the baseball hits the ground, what will a machine learning model predict? (A small sketch of this is included below.)

You can train a classical ML model on the known past orbits of the planets, but it can presumably never predict orbits under unseen n-body gravity events, like another dense mass moving through the solar system, because of the insufficiency of classical models for quantum problems, for example.

Church-Turing-Deutsch doesn't say there could not exist a classical/quantum correspondence; but a classical model on a classical computer cannot be sufficient for quantum-hard problems. (E.g., quantum discord says that there are entanglement and non-entanglement nonlocal relations in the data.)

Regardless of whether they sufficiently generalize,
[LLMs, ML models, and AutoML] don't yet critically think, and it's dangerous to take action without critical thought.

Critical thinking; logic, rationality:
https://en.wikipedia.org/wiki/Critical_thinking#Logic_and_rationality
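To make the baseball example above concrete, here is a small sketch (my own, purely illustrative): fit a model only on the in-flight samples and it will happily predict the ball continuing below the ground, because nothing in the training data encodes the ground at all.

```python
import numpy as np

# Ball thrown upward; keep only samples taken *before* it hits the ground.
t = np.linspace(0, 5, 200)
y = 20 * t - 4.9 * t ** 2            # height under gravity, no air resistance
t_train, y_train = t[y >= 0], y[y >= 0]

# Fit a quadratic to the in-flight data, then ask about a later time.
coeffs = np.polyfit(t_train, y_train, deg=2)
print(np.polyval(coeffs, 6.0))       # about -56 m: the model says the ball is underground
```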
Well, they memorize points and fit lines (or tanh curves) between different parts of the space, right? So it depends on whether a useful generalization can be extracted from that line fitting, and on how densely the points cover the landscape, no?
How is this even a shock?

Anyone who has so much as taken a class on this knows that even the simplest perceptron networks, decision trees, or any other machine learning models generalize. That's why we use them. If they don't, it's called overfitting [1]: the model is so accurate on the training data that its inferential ability on new data suffers.

I know the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation.

No, really: what part of their base argument is novel?

[1] https://en.wikipedia.org/wiki/Overfitting
Memorise, because there is no decision component. It just attempts to brute-force a pattern rather than thinking through the information and drawing a conclusion.