I am new to CNNs/machine learning, but here's my $0.02:
Regardless of which technique you use, the amount of data required to learn seems far too high. The article talks about neural networks trained on billions of photographs, a number far beyond the number of photos/objects/whatever a human sees in a lifetime. That leads me to conclude that we aren't extracting much information from each example. These techniques can't work out how the same object would look under different lighting conditions, viewing angles, positions, sizes, and so on; instead, companies just use millions of images to 'encode' those variations into their networks.

IMO there should be a push towards adapting CNNs to calculate/predict how an object would look under different conditions, which might lead to other improvements. The idea could also be extended to areas beyond image recognition.
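For concreteness, here's a minimal sketch of today's nearest off-the-shelf workaround, data augmentation (using torchvision's standard transforms; the filename and parameter values are made up). It synthesizes lighting/viewpoint variants from one image at training time instead of collecting millions of real ones. Note it still 'encodes' the variations into the network rather than having the network compute them, so it's a stopgap for the idea above, not an implementation of it:

    from PIL import Image
    from torchvision import transforms

    # Simulate different conditions from a single source photo:
    # lighting (brightness/contrast), viewing angle, size, and perspective.
    augment = transforms.Compose([
        transforms.ColorJitter(brightness=0.5, contrast=0.3),   # lighting changes
        transforms.RandomAffine(degrees=30, scale=(0.8, 1.2)),  # rotation and size
        transforms.RandomPerspective(distortion_scale=0.4),     # viewpoint shift
    ])

    img = Image.open("object.jpg")               # hypothetical input photo
    variants = [augment(img) for _ in range(8)]  # eight synthetic views

Each call to augment() draws new random parameters, so you get many plausible appearances of the same object for free. The limitation, and the reason I'd like to see the network itself model these transformations, is that the augmentations are hand-picked and only cover variations the engineer thought to simulate.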