I remember an AI professor I had who once asked the class to define "the number 3".<p>The answer he chose, which stuck with me (if I recall the nuance correctly), is that the number three is: the set of all things in the universe of which there are three; three is that which they all have in common.<p>Where it became interesting for me was observing our children growing up, especially as they learned colours and shapes. They seemed to learn by picking up on patterns common to the things we named aloud.<p>For example, they decided things were "red" because that trait was in common with other things we called red, and "circles" because of other things we called circles.<p>It's really quite a fascinating phenomenon to observe in children, and I expect there is a key atomicity of association from which more complex patterns - up to consciousness - can be created. Too fine-grained and the patterns will be noise; too coarse and certain higher-order structures will never form - a "Goldilocks" zone for the complex system of interpreting reality through observational exposure and initially arbitrary relation.
Good to see someone testing the limits of neural nets, rather than just squeezing a few more percent of performance out of an artificial benchmark.<p>That said, is this result really all that surprising? Especially given the results demonstrated in that paper on fooling DNNs from 2015 and visualization experiments à la Deep Dream.<p>Unless you believe the networks are "painting" stuff, Deep Dream demonstrated that neural networks capture and store certain chunks of their training data, and you can get those back out if you're clever enough.<p>That other paper[1] demonstrated that a trained DNN can classify noise as a particular label with very high confidence, as long as you construct that noise carefully enough. This hints that DNNs may do their matching by applying some complex transformation that <i>usually</i> results in the correct answer, but does not necessarily capture the underlying patterns. (Kind of like guessing the weather from telltale signs, without knowing anything about air pressure, currents and so on.)<p>[1] - <a href="http://www.evolvingai.org/fooling" rel="nofollow">http://www.evolvingai.org/fooling</a>
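To make that concrete, here's a minimal sketch of one way such noise can be constructed: gradient ascent on the target class score. Note that [1] also used evolutionary algorithms, and the model, hyperparameters and missing input normalization here are placeholder assumptions, not the paper's setup.<p><pre><code>
import torch
import torchvision.models as models

# Sketch: start from random noise and nudge it toward a chosen label by
# gradient ascent on that label's logit. Model choice is just an example;
# input normalization is omitted for brevity.
model = models.resnet18(weights="IMAGENET1K_V1").eval()
target_class = 123                       # arbitrary ImageNet label
x = torch.rand(1, 3, 224, 224, requires_grad=True)

optimizer = torch.optim.Adam([x], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    loss = -model(x)[0, target_class]    # maximize the target logit
    loss.backward()
    optimizer.step()
    x.data.clamp_(0, 1)                  # keep pixels in a valid range

confidence = torch.softmax(model(x), dim=1)[0, target_class].item()
print(f"classified as {target_class} with p={confidence:.3f}")  # typically very high
</code></pre>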
We discussed this paper in our reading group last week[0]. I think the key to understanding what's going on here is figure 1(a). The fastest learning happens with true labels, and the slowest with random labels. Shuffled pixels is the second fastest. I believe the reason is that, given training data composed of structured images, the convolutional architecture heavily favors learning filters which reflect geometric features, as opposed to random filters which can memorize the data. This results in the fastest learning with the true labels, because the geometric features correspond to the learning target, but for memorizing random labels, geometric features have lower capacity than random filters. On the other hand, it learns shuffled pixels pretty fast because the convolutional architecture makes it easy to capture a color histogram and learn off that.<p>[0] This week we discussed the AlphaGo paper. Here's the URL for that, although we don't generally advertise our meetings unless we think there's going to be broad interest: <a href="https://www.meetup.com/Cambridge-Artificial-Intelligence-Meetup/events/237183581/" rel="nofollow">https://www.meetup.com/Cambridge-Artificial-Intelligence-Mee...</a>
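To see why the histogram signal survives the shuffling, here's a toy sketch (it assumes, as in the paper's "shuffled pixels" setting, that a single fixed permutation is applied to every image; the data here is random placeholder values):<p><pre><code>
import numpy as np

# A fixed permutation applied to every image destroys spatial/geometric
# structure but leaves each image's colour histogram intact, so any feature
# computed from the histogram is unaffected.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 32, 32, 3))   # fake CIFAR-like batch
perm = rng.permutation(32 * 32)                       # one permutation for all images

flat = images.reshape(10, 32 * 32, 3)
shuffled = flat[:, perm, :].reshape(10, 32, 32, 3)

def colour_hist(img, bins=16):
    return np.stack([np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
                     for c in range(3)])

# Histograms are identical before and after shuffling.
assert all(np.array_equal(colour_hist(a), colour_hist(b))
           for a, b in zip(images, shuffled))
print("colour histograms unchanged by a fixed pixel permutation")
</code></pre>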
My halfway-informed interpretation, just from the abstract: it turns out that modern image-recognition networks are capable of learning labels randomly assigned to sets of random images, which means it's still mysterious why they learn labels with intelligible meaning when given non-random images (rather than just memorizing the training set via some nonsense model).<p>I'd guess the resolution would have to involve an ordering over possible models, where (for well-designed networks) intelligible models are preferred over unintelligible ones. Filing this away to read later.
> Brute-force memorization is typically not thought of as an effective form of learning. At the same time, it's possible that sheer memorization can in part be an effective problem-solving strategy for natural tasks.<p>I like the conclusion. Basically, neural nets are just beasts with too many parameters, and the authors even show you don't need all that many parameters to fit any data set of size n. This is one reason I think neural nets are kinda a dead end. People don't understand them, it's impossible to get any explanatory results out of them, and based on these results that kinda makes sense. Neural nets don't learn, they just memorize.
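A toy illustration of that point (a hedged sketch, not the paper's actual construction: just an ordinary over-parameterized MLP driven to fit purely random labels on random data):<p><pre><code>
import torch
import torch.nn as nn

# An over-parameterized MLP can drive training error to ~zero even when the
# labels are pure noise, i.e. it memorizes the training set.
torch.manual_seed(0)
n, d, classes = 512, 32, 10
X = torch.randn(n, d)
y = torch.randint(0, classes, (n,))          # completely random labels

model = nn.Sequential(nn.Linear(d, 1024), nn.ReLU(), nn.Linear(1024, classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2000):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"train accuracy on random labels: {train_acc:.3f}")   # typically ~1.0
</code></pre>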
> number of parameters exceeds the number of data points as it usually does in practice<p>I don't get this part.<p>In reality, isn't the dataset much larger than the number of parameters in the neural net?
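For a rough sense of the numbers, here's a sketch of the comparison with an example architecture (not the one from the paper) against CIFAR-10's 50,000 training images:<p><pre><code>
import torchvision.models as models

# Count the weights of a typical image classifier and compare with the size
# of a typical training set. The architecture here is just an example.
model = models.resnet18(num_classes=10)
n_params = sum(p.numel() for p in model.parameters())
cifar10_train_images = 50_000
print(f"parameters: {n_params:,} vs training images: {cifar10_train_images:,}")
# resnet18 has roughly 11 million parameters, i.e. hundreds of parameters
# per training image in this setting.
</code></pre>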
Is it possible that, although neural nets can overfit as this paper shows, practitioners just stop training early, before this happens, and/or use a validation set? Would that be enough to explain the good generalization despite the huge number of parameters?
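For reference, a minimal sketch of that practice (a hypothetical training loop with validation-based early stopping; the data loaders, model and hyperparameters are placeholders):<p><pre><code>
import copy
import torch
import torch.nn as nn

def train_with_early_stopping(model, train_loader, val_loader,
                              patience=5, max_epochs=100):
    """Keep the weights that did best on held-out data and stop once the
    validation loss hasn't improved for `patience` epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    best_loss, best_state, stale_epochs = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)

        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                      # stop before the net starts memorizing

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
</code></pre>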