The title is misleading. The core technique still uses all 60,000 images from MNIST, but 'distills' them into 10 images that contain the information from the original 60,000. The 10 'distilled' images look nothing like digits. Learning a complex model from 10 (later reduced to 2) 'distilled' number arrays <i>is</i> an interesting research idea, but it has little to do with reducing the size of the input dataset. Arguably, the heavy lifting of the learning process has moved from training the model to generating the distilled dataset. There is also some unconvincing discussion around synthetic datasets, though it remains entirely unclear how these synthetic datasets relate to real-world scenarios.<p>> In a previous paper, MIT researchers had introduced a technique to “distill” giant data sets into tiny ones, and as a proof of concept, they had compressed MNIST down to only 10 images.
Direct link to paper: <a href="https://arxiv.org/pdf/2009.08449.pdf" rel="nofollow">https://arxiv.org/pdf/2009.08449.pdf</a><p>Interesting paper, although the headline is of course sensational. The crux of the paper is that by using "soft labels" (for example, a probability distribution rather than a one-hot vector), it's possible to create a decision boundary that encodes more classes than you have examples. In fact, only two examples can be used to encode any finite number of classes.<p>This is interesting because it means that, in theory, ML models should be able to learn decision spaces that are far more complex than the input data has traditionally been thought to encode. Maybe one day we can create complex, generalizable models using a small amount of data.<p>As written, this paper does not provide much actionable information. The problem is a toy problem, and is far from being useful in "modern" AI techniques (especially things like deep learning or boosted trees). The paper is also not practical in the sense that, in real life, you don't know what your decision boundary should look like (that's what you learn, after all), and there's no obvious way to know which data to collect to produce the decision boundary you want.<p>In other words, this paper has said "this representation is mathematically possible" and is hoping that future work can actually make it useful in practice.
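For anyone who wants the intuition without reading the paper: below is a minimal sketch (my own toy example, not the authors' code) of how soft labels let two stored examples carve out three class regions. The prototype positions, the soft-label values, and the inverse-distance-weighted kNN rule are all assumptions chosen purely for illustration.<p>

    import numpy as np

    # Two "prototype" points on a 1-D feature axis, each carrying a soft label
    # over three classes (a probability distribution instead of a one-hot vector).
    prototypes = np.array([[-1.0], [1.0]])
    soft_labels = np.array([
        [0.6, 0.0, 0.4],   # mostly class 0, some mass on class 2
        [0.0, 0.6, 0.4],   # mostly class 1, some mass on class 2
    ])

    def predict(x):
        """Inverse-distance-weighted vote over the prototypes' soft labels."""
        d = np.abs(prototypes[:, 0] - x) + 1e-9   # distances to each prototype
        w = 1.0 / d                               # closer prototypes count more
        scores = w @ soft_labels                  # weighted sum of soft labels
        return int(np.argmax(scores))

    for x in [-0.9, 0.0, 0.2, 0.9]:
        print(x, "->", predict(x))
    # Three distinct class regions emerge from only two stored examples:
    # class 0 near -1, class 1 near +1, and class 2 in between.

<p>The third class wins wherever the two prototypes' "leftover" probability mass adds up to more than either dominant class; that is the trick the paper generalizes to arbitrarily many classes from two points.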
The title is click-bait. This has been known for several years[1], the technique has little practical value, and the assertion that you can learn from no data is completely false and misleading. The training data was compressed to a few examples. To the journalist: it's OK not to maximize for click-bait when you write an article.
[1]: <a href="https://www.ttic.edu/dl/dark14.pdf" rel="nofollow">https://www.ttic.edu/dl/dark14.pdf</a>
"carefully engineered their soft labels" is the same thing as training the network. Just because you encode information outside of the weights doesn't mean you're not encoding information from training data.<p>It's like saying here's the ideal partitioning scheme, memorize this.
I've started to view <i>Technology Review</i> as a PR puff piece for MIT. They often overstate claims or leave out critical details.<p>As an example, the Media Lab is still citing innovation with deep fakes, claiming entirely novel results that people are shocked to see. They hype their own researchers even though there are kids on YouTube who have been making similar content up to a year prior to Technology Review's publication.<p>I suspect they do the same with fields I'm less familiar with.
Am I correct that they don't use a train-test split when generating these distilled images? Until you test on new images outside of what is fed to the distiller, this seems to be a way to just overfit specific images, probably by combining unique elements of each into a single composite image. There are plenty of classical signal processing ways to do this (including just building a composite patchwork quilt).
I like to think of this as adversarial training data. Adversarial inputs in general trick a NN into producing a specific output -- adversarial training data tricks the NN into learning specific weights.<p>Note that the distilled data is not even from the same "domain" as the input data any more. They're basically adversarial inputs.
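To make the analogy concrete, here is a rough sketch (not from the paper; the model, sizes, learning rate, and step count are all made up) of optimizing an input against a fixed model so that it produces a chosen output. Loosely, the same "gradient through the model, applied to the data" mechanic is what crafting distilled or adversarial training points amounts to.<p>

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 3))          # fixed toy model: 5 features -> 3 classes
    b = np.zeros(3)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    x = rng.normal(size=5)               # start from an arbitrary input
    target = 2                           # the class we want the model to emit

    for _ in range(200):
        p = softmax(W.T @ x + b)
        grad_logits = p - np.eye(3)[target]   # cross-entropy gradient w.r.t. logits
        grad_x = W @ grad_logits              # back-propagate to the input
        x -= 0.1 * grad_x                     # gradient step on the *input*, not the weights

    print(softmax(W.T @ x + b))          # probability mass now concentrated on class 2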
If I understand correctly, the key benefit would be that models could be trained on smaller datasets, thereby reducing the time spent training them?<p>I am not convinced that this time saving is greater than the time spent engineering the combined and synthesised data.
> ...very different from human learning. A child often needs to see just a few examples of an object, or even only one, before being able to recognize it for life.<p>I see this a lot. It's completely wrong. I'm not trying to pick on the author here; I think 95%+ of people share this misunderstanding of deep learning.<p>If you see "only one" horse, even for just a second, you are really seeing a huge number of horses, from various angles, under various lighting. The motions of the horse, the motions of your head (even if slight), and the undulations of the light all generate a much larger amount of effectively augmented training data. If you look at a horse for a minute, it could be the equivalent of training on a million images of a horse. I'm not sure of the exact order of magnitude, but it's certainly orders of magnitude more than "one" horse.<p>(Relatedly: some people say there is an experiment you can conduct at home to see the actual images your brain is training on.)
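If you want a feel for the numbers, here is a purely illustrative sketch (the image, the shift range, and the augmentation choices are all made up) of how a single view can be expanded into many effectively distinct training samples, which is roughly what a few seconds of continuous viewing gives you for free.<p>

    import numpy as np

    rng = np.random.default_rng(0)
    horse = rng.random((64, 64))          # stand-in for a single grayscale image of a horse

    def augment(img):
        dy, dx = rng.integers(-4, 5, size=2)          # small viewpoint shift
        view = np.roll(img, (dy, dx), axis=(0, 1))
        if rng.random() < 0.5:                        # mirrored view
            view = view[:, ::-1]
        return np.clip(view * rng.uniform(0.7, 1.3), 0, 1)   # lighting change

    views = [augment(horse) for _ in range(1_000)]    # one "horse", a thousand samples
    print(len(views), views[0].shape)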