It's kind of bonkers that this works. It suggests that the usual belief that each layer learns a distinct intermediate representation is wrong: if layer three expects a particular kind of representation from layer two and is instead handed the raw input, you'd expect layer three to choke.

Instead, the depth seems to be giving something like a progressive unwinding of the feature space.

It would be interesting to compare these networks to networks trained in the usual way, to see whether they end up with similar coefficients despite the different training methods, or whether this produces something completely different.
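For concreteness, here's a rough sketch of how that comparison might look, assuming you can pull the same layer's weight matrix out of both networks as numpy arrays. The greedy neuron-matching step and all the names here are my own assumptions, not anything from the work being discussed; a naive element-wise comparison would be misleading because the neurons in one network can be a permutation of the neurons in the other.

    # Sketch: compare one layer's weights from a layer-wise-trained network
    # against the same layer from an end-to-end-trained network.
    # Greedily pairs neurons (rows) by cosine similarity, since neuron order
    # is arbitrary, then reports the mean similarity of the matched pairs.
    import numpy as np

    def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Pairwise cosine similarity between rows of a and rows of b.
        a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
        b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
        return a_n @ b_n.T

    def matched_similarity(w_layerwise: np.ndarray, w_endtoend: np.ndarray) -> float:
        sims = cosine_matrix(w_layerwise, w_endtoend)
        total, used = 0.0, set()
        for i in range(sims.shape[0]):
            # Best not-yet-claimed partner for neuron i.
            order = np.argsort(-sims[i])
            j = next(j for j in order if j not in used)
            used.add(j)
            total += sims[i, j]
        return total / sims.shape[0]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        w_a = rng.normal(size=(64, 128))  # stand-in for training scheme A
        # Stand-in for scheme B: same neurons, shuffled and slightly perturbed.
        w_b = w_a[rng.permutation(64)] + 0.05 * rng.normal(size=(64, 128))
        print(matched_similarity(w_a, w_b))  # near 1.0 means similar coefficients

A fancier version would compare activations on held-out data (something like CKA) rather than raw weights, which sidesteps the matching problem entirely.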