Sometimes I think the reason human memory is so amazing is that what we lack in the storage capacity machines have, we make up for in our ability to form patterns that dramatically compress the amount of information stored, and then we compress those patterns together with other patterns and can extract things from the result. It's incredibly lossy compression, but it gets the job done.
It seems the takeaway is that weight decay induces sparsity, which helps the model learn the "true" representation rather than an overfit one (a minimal sketch of the mechanism is below). It's interesting that the human brain has a comparable mechanism prevalent during development [1]. I would love to hear from someone in the field whether this was the inspiration for weight decay (or, presumably, for the more directly equivalent neural network pruning [2]).

[1] https://en.wikipedia.org/wiki/Synaptic_pruning
[2] https://en.wikipedia.org/wiki/Pruning_(artificial_neural_network)
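For concreteness, here is a minimal sketch of how L2 weight decay typically enters a training step, in PyTorch style (the model and hyperparameters are placeholders of mine, not taken from the article):

```python
import torch

# Toy model, just to show where the decay term enters; not the article's architecture.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # or pass weight_decay=1e-4 and skip the manual penalty

def training_step(x, y, l2=1e-4):
    opt.zero_grad()
    pred = model(x)
    # L2 weight decay: penalize the squared magnitude of every parameter,
    # which continually pushes the network toward small weights.
    l2_penalty = sum((w ** 2).sum() for w in model.parameters())
    loss = loss_fn(pred, y) + l2 * l2_penalty
    loss.backward()
    opt.step()
    return loss.item()
```

Pruning, by contrast, removes weights or connections outright rather than merely shrinking them.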
Grr, the AI folks are ruining the term 'grok'.

It means roughly 'to understand completely, fully'.

To use the same term to describe generalization... just shows you didn't grok grokking.
I'm not sure if I'm remembering it right, but I think it was in a Raphaël Millière interview on Mindscape where Raphaël said something along the lines of: when a machine learning model has many dimensions, the distinction between interpolation and extrapolation is not as clear as it is in our usual areas of reasoning. I can't work out whether this is something similar to what the article is talking about.
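If it's the result I'm thinking of (I believe Balestriero et al. argue that learning in high dimension essentially always amounts to extrapolation), the geometric point is easy to see numerically: as dimensionality grows, new samples from the same distribution almost never fall inside the convex hull of the training set. A rough sketch of that effect, my own illustration rather than anything from the article:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def in_convex_hull(x, points):
    """LP feasibility test: is x a convex combination of the rows of `points`?"""
    n = len(points)
    A_eq = np.vstack([points.T, np.ones((1, n))])   # sum_i lam_i * p_i = x  and  sum_i lam_i = 1
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success

for d in (2, 10, 50):
    train = rng.standard_normal((500, d))
    test = rng.standard_normal((100, d))   # drawn from the *same* distribution
    inside = sum(in_convex_hull(x, train) for x in test)
    print(f"d={d:2d}: {inside}/100 test points are 'interpolation' (inside the training hull)")
```

The fraction of test points inside the hull drops rapidly with dimension, so in the geometric sense almost everything a high-dimensional model does is extrapolation.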
Does anyone know how those charts are created?
I bet they're half generated by some sort of library and then manually improved, but the resulting animated SVGs are beautiful.
PSA: if you’re interested in the details of this topic, it’s probably best to view TFA on a computer as there is data in the visualizations that you can’t explore on mobile.
First of all, great blog post with great examples. It reminds me of what distill.pub used to be.

Second, the article correctly states that typically L2 weight decay is used, leading to a lot of weights with small magnitudes. For models that generalize better, would it then be better to always use L1 weight decay to promote sparsity, in combination with longer training?

I wonder whether deep learning models that only use sparse Fourier features rather than dense linear layers would work better...
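For reference, the swap itself is trivial; a minimal sketch (mine, not from the post) of the two penalties, where L1's kink at zero is what tends to drive weights exactly to zero:

```python
import torch

def weight_penalty(model, kind="l2"):
    # L2 shrinks every weight toward zero but rarely makes them exactly zero;
    # L1 tends to zero weights out entirely, giving genuinely sparse solutions.
    if kind == "l2":
        return sum((w ** 2).sum() for w in model.parameters())
    if kind == "l1":
        return sum(w.abs().sum() for w in model.parameters())
    raise ValueError(f"unknown penalty: {kind}")

# loss = task_loss + lam * weight_penalty(model, kind="l1")
```

Whether that actually generalizes better presumably depends on whether the "true" circuit is sparse in the parameter basis; sparsity in a Fourier basis (as in the modular-arithmetic example) is a different thing from sparsity of individual weights.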
I'm curious how representative the target function is. I get that it's common to want a model to learn the important pieces of an input, but a string of bits where only the first three matter feels particularly contrived. Literally a truth table of size 8 over the relevant parameters? And trained with 4.8 million samples? Or am I misunderstanding something there? (I fully expect I'm misunderstanding something.)
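If I'm reading the setup the same way, it's roughly the following (all the specific numbers here are my own illustrative choices, not quoted from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# My reading of the setup: long bit strings where only the first 3 bits determine
# the label, so the "true" function is just an 8-row truth table and every other
# bit is a distractor.
N_BITS, N_RELEVANT, N_SAMPLES = 30, 3, 100_000

truth_table = rng.integers(0, 2, size=2 ** N_RELEVANT)      # the 8 labels that actually matter

x = rng.integers(0, 2, size=(N_SAMPLES, N_BITS), dtype=np.int8)
keys = x[:, :N_RELEVANT].astype(int) @ (2 ** np.arange(N_RELEVANT))  # index into the truth table
y = truth_table[keys]
```

The point of the contrived setup, I assume, is that memorizing which exact bit strings appeared in training and learning the 8-row table make identical predictions on the training set, so only held-out strings can tell the two solutions apart.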
There were no auto-discovery RSS/Atom feeds in the HTML and no links to the RSS feed anywhere, but by guessing at possible feed names and locations I was able to find the "Explorables" RSS feed at: https://pair.withgoogle.com/explorables/rss.xml
It looks like grid cells!

https://en.wikipedia.org/wiki/Grid_cell

If you plot a heat map of a neuron in the hidden layer on a 2D chart where one axis is $a$ and the other is $b$, I think you might get a triangular lattice. If it's doing what I think it is, then looking at another hidden neuron would give a different lattice with another orientation and scale.

Also, you could make a base-67 adding machine by chaining these together.

I also can't help the gut feeling that the relationship between W_in-proj's neurons compared to the relationship between W_out-proj's neurons looks like the same mapping as the one between the semitone circle and the circle of fifths:

https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Pitch_class_space_star.svg/220px-Pitch_class_space_star.svg.png
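One way to check the lattice intuition, sketched here with a toy stand-in for the trained model (the activation function below is my own placeholder with the kind of periodic structure the article describes, not the article's actual network):

```python
import numpy as np
import matplotlib.pyplot as plt

P = 67  # the modulus in the article's modular-addition task

def plot_neuron_heatmap(activation_fn, neuron):
    """activation_fn(a, b) -> vector of hidden activations; a stand-in for the trained model."""
    grid = np.zeros((P, P))
    for a in range(P):
        for b in range(P):
            grid[a, b] = activation_fn(a, b)[neuron]
    plt.imshow(grid, origin="lower", cmap="viridis")
    plt.xlabel("b"); plt.ylabel("a"); plt.title(f"hidden neuron {neuron}")
    plt.show()

# Toy stand-in: each neuron responds at a single frequency k in (a + b),
# which yields diagonal stripes on the (a, b) plane.
def toy_activation(a, b, freqs=(3, 7, 11)):
    return np.array([np.cos(2 * np.pi * k * (a + b) / P) for k in freqs])

plot_neuron_heatmap(toy_activation, neuron=0)
```

If the real hidden neurons behave like single-frequency cosine terms in (a + b), the heat maps come out as diagonal stripes rather than the superposed frequencies that give grid cells their hexagonal firing fields, but the periodicity intuition is the same.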
I don't think I have seen an answer here that actually challenges this question: in my experience, I have yet to see a neural network learn representations outside the range in which it was trained. Some papers have tried things like sinusoidal activation functions that can force a network to fit a repeating function, but beyond that I would call any apparent extrapolation pure coincidence.

On generalization: it's still memorization. I think there has been some evidence that ChatGPT does 'try' to perform higher-level thinking but still has problems due to the dictionary-style lookup table it uses. The higher-level thinking or AGI that people are excited about is a form of generalization so impressive that we don't really think of it as memorization. But I question whether the original thought we want to see generated is actually all that separate from what we are currently seeing.
Statistical learning can typically be phrased in terms of k nearest neighbours.

In the case of NNs we have a "modal kNN" (memorising) turning into a "mean kNN" ('generalising') under the right sort of training.

I'd call both of these memorising, but the latter is a kind of weighted recall.

Generalisation as a property of statistical models (i.e., models of conditional frequencies) is not the same property as generalisation in the case of scientific models.

In the latter, a scientific model is general because it models causally necessary effects from causes -- so, necessarily, if X then Y.

Whereas generalisation in associative statistics is just about whether you're drawing data from the empirical frequency distribution or whether you've modelled it first. In all automated statistics the only difference between "the model" and "the data" is some sort of weighted averaging operation.

So in automated stats (i.e., ML/AI) it's really just whether the model uses a mean.
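A minimal sketch of the modal-vs-mean distinction being drawn here (my own illustration of the analogy, not a claim about what networks literally compute):

```python
import numpy as np
from collections import Counter

def knn_predict(x, train_x, train_y, k=5, mode="mean"):
    # "Modal" kNN: return the most common neighbour label -- essentially pure recall of stored points.
    # "Mean" kNN: return a weighted average of neighbour labels -- the smoothed recall
    # that the comment is calling 'generalisation'.
    dists = np.linalg.norm(train_x - x, axis=1)
    idx = np.argsort(dists)[:k]
    if mode == "modal":
        return Counter(train_y[idx].tolist()).most_common(1)[0][0]
    weights = 1.0 / (dists[idx] + 1e-8)
    return float(np.average(train_y[idx], weights=weights))
```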
I haven't read the latest literature, but my understanding is that "grokking" is the phase transition that occurs during the coalescing of islands of understanding (increasingly abstract features) that eventually form a pathway to generalization, and that this is associated with over-parameterized models, which have the potential to learn multiple paths (explanations).

https://en.wikipedia.org/wiki/Percolation_theory

A relevant recent paper I found from a quick search: "The semantic landscape paradigm for neural networks" (https://arxiv.org/abs/2307.09550)
I was trying to make an AI for my 2D sidescrolling game with Asteroids-like steering learn from recorded player input plus surroundings.

It generalized splendidly: its conclusion was that you always need to press "forward" and do nothing else, no matter what happens :)
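That failure mode is easy to reproduce: if the recorded data is dominated by one action, a policy trained to minimise average loss can score well by always predicting that action. A toy sketch with made-up numbers, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Imitation data where the player holds "forward" ~90% of the time.
actions = rng.choice(["forward", "left", "right", "brake"], size=10_000,
                     p=[0.9, 0.04, 0.04, 0.02])

# A "policy" that ignores its input and always presses forward...
always_forward_accuracy = np.mean(actions == "forward")
print(f"always-forward accuracy: {always_forward_accuracy:.2%}")  # ~90%, looks 'splendid'
```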
A bit of both, but it certainly does generalize. Just look into the sentiment neuron from OpenAI in 2017, or come up with a unique question to ask ChatGPT.
If you omit the training data points where the baseball hits the ground, what will a machine learning model predict? (A small sketch of this is included below.)

You can train a classical ML model on the known past orbits of the planets, but it can presumably never predict orbits under unseen n-body gravity events, like another dense mass moving through the solar system, because of the insufficiency of classical models for quantum problems, for example.

Church-Turing-Deutsch doesn't say there could not exist a classical/quantum correspondence; but a classical model on a classical computer cannot be sufficient for quantum-hard problems. (E.g., quantum discord says that there are entanglement and non-entanglement nonlocal relations in the data.)

Regardless of whether they sufficiently generalize,
[LLMs, ML models, and AutoML] don't yet critically think, and it's dangerous to take action without critical thought.

Critical thinking; logic, rationality:
https://en.wikipedia.org/wiki/Critical_thinking#Logic_and_rationality
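To make the baseball example above concrete, here is a small sketch (my own, purely illustrative): fit a model only on the in-flight samples and it will happily predict the ball continuing below the ground, because nothing in the training data encodes the ground at all.

```python
import numpy as np

# Ball thrown upward; keep only samples taken *before* it hits the ground.
t = np.linspace(0, 5, 200)
y = 20 * t - 4.9 * t ** 2            # height under gravity, no air resistance
t_train, y_train = t[y >= 0], y[y >= 0]

# Fit a quadratic to the in-flight data, then ask about a later time.
coeffs = np.polyfit(t_train, y_train, deg=2)
print(np.polyval(coeffs, 6.0))       # about -56 m: the model says the ball is underground
```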
Well, they memorize points and fit lines (or tanh curves) between different parts of the space, right? So it depends on whether a useful generalization can be extracted from that line fitting, and on how densely the points cover the landscape, no?
How is this even a shock?

Anyone who has so much as taken a class on this knows that even the simplest perceptron networks, decision trees, or any other machine learning models generalize. That's why we use them. If they don't, it's called overfitting [1]: the model is so accurate on the training data that its inferential ability on new data suffers.

I know the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation.

No, really: what part of their base argument is novel?

[1] https://en.wikipedia.org/wiki/Overfitting
Memorise, because there is no decision component. It just attempts to brute-force a pattern rather than thinking through the information and drawing a conclusion.