Have yet to see an illustration that conveys the multichannel convolution filter (MCCF) concept clearly. Why does the channel stack keep growing? How are the layers actually connected?<p>The key point is that each conv filter consists of one kernel per input channel (that's why first-layer filter visualisations are colored, btw: a color image is a "3-dimensional" image). We convolve each kernel with its corresponding input channel, then <i>sum</i> the responses (that's the key). Having multiple MCCFs (usually more at each layer) then yields a new multi-channel image (say, 16 channels), and we apply a new set of (say, 32) 16-channel MCCFs to it (which we can no longer visualise on their own; we'd need a 16-dimensional image for each filter), yielding a 32-channel image. That sort of thing is almost never explained properly.
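<p>To make the channel bookkeeping concrete, here is a minimal NumPy/SciPy sketch of the idea described above; the shapes, the helper function, and the use of <i>correlate2d</i> are my own illustration (what CNN libraries call "convolution" is actually cross-correlation), not taken from any particular framework.<p><pre><code>
import numpy as np
from scipy.signal import correlate2d  # 2-D cross-correlation, 'valid' mode

def conv_layer(image, filters):
    """image: (in_channels, H, W); filters: (out_channels, in_channels, kH, kW)."""
    out_channels, in_channels, kH, kW = filters.shape
    H, W = image.shape[1] - kH + 1, image.shape[2] - kW + 1
    out = np.zeros((out_channels, H, W))
    for f in range(out_channels):        # each MCCF produces one output channel
        for c in range(in_channels):     # one kernel per input channel...
            # ...convolve kernel with its channel and SUM the responses
            out[f] += correlate2d(image[c], filters[f, c], mode='valid')
    return out

rgb = np.random.rand(3, 32, 32)        # 3-channel input image
layer1 = np.random.rand(16, 3, 5, 5)   # 16 filters, each holding 3 kernels
layer2 = np.random.rand(32, 16, 5, 5)  # 32 filters, each holding 16 kernels
x = conv_layer(rgb, layer1)            # -> (16, 28, 28): a 16-channel "image"
y = conv_layer(x, layer2)              # -> (32, 24, 24): a 32-channel "image"
</code></pre><p>Only the first-layer filters (3 kernels each) can be shown as color images; the second-layer filters would need 16 channels each, which is why they are rarely visualised directly.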
<a href="http://arxiv.org/abs/1602.04105#" rel="nofollow">http://arxiv.org/abs/1602.04105#</a> -- This paper is an awesome use of CNNs for automatic modulation recognition of RF signals.<p>I'm currently attempting to use their approach with GNU Radio:<p><a href="https://radioml.com/blog/2016/07/18/towards-a-gnu-radio-cnn-tensorflow-block/" rel="nofollow">https://radioml.com/blog/2016/07/18/towards-a-gnu-radio-cnn-...</a>
Great writeup from the Stanford CS231n course:
<a href="http://cs231n.github.io/convolutional-networks/" rel="nofollow">http://cs231n.github.io/convolutional-networks/</a>
A human child learns much more easily: after seeing only a handful of images of a cat, it can recognise almost any cat image as it grows, without ever seeing a million or a billion images. So there seems to be something beyond the sheer amount of data; the "reality" of seeing a real cat probably covers all possible aspects of a cat. There seems to be something missing in this whole deep learning approach and the way it tries to simulate human cognition.
Here's an intro to ConvNets in Java: <a href="http://deeplearning4j.org/convolutionalnets.html" rel="nofollow">http://deeplearning4j.org/convolutionalnets.html</a><p>Karpathy's stuff is also great: <a href="https://cs231n.github.io/" rel="nofollow">https://cs231n.github.io/</a>
I am new to CNNs/machine learning, but here's my $0.02:
Regardless of which technique you use, it seems that the amount of data required to learn is too high. This article talks about neural networks accessing billions of photographs, a number which is nowhere near the number of photos/objects/whatever a human sees in a lifetime. That leads me to the conclusion that we aren't extracting much information from the data. These techniques aren't able to calculate how the same object might look under different lighting conditions, viewing angles, positions, sizes, and so on. Instead, companies just use millions of images to 'encode' the variations into their networks (see the sketch below).<p>Imo there should be a push towards adapting CNNs to calculate/predict how the object might look under different conditions, which might lead to other improvements. This could also be extended to areas other than image recognition.
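<p>To illustrate what 'encoding the variations' looks like in practice, here's a toy sketch of standard data augmentation, where variations are synthesized from existing photos rather than predicted by the model; the transforms and numbers are my own illustrative choices, not from the article.<p><pre><code>
import numpy as np

def augment(image, rng):
    """Synthesize lighting/viewpoint/position variation from one photo."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]              # horizontal flip (viewpoint)
    image = image * rng.uniform(0.7, 1.3)      # brightness jitter (lighting)
    shift = rng.integers(-4, 5, size=2)        # small translation (position)
    image = np.roll(image, tuple(shift), axis=(0, 1))
    return np.clip(image, 0.0, 1.0)

rng = np.random.default_rng(0)
photo = np.random.rand(32, 32, 3)              # stand-in for a real photo
batch = np.stack([augment(photo, rng) for _ in range(8)])  # 8 variants of one image
</code></pre><p>The network still has to memorise these variations from examples rather than reason about them, which is the gap being pointed out above.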