I am one of the people who helped analyze the results of the mentioned ILSVRC challenge. In particular, a week ago I ran an experiment comparing Google's performance to that of a human and wrote up the results in this blog post:

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

TL;DR: it's very exciting that the models are starting to perform on par with humans (on ILSVRC classification, at least), and to do so in milliseconds. The linked post also points to our annotation interface, where you can try to compete against their model yourself and see its predictions and mistakes.
"typical incarnations of which consist of over 100 layers with a maximum depth of over 20 parameter layers)"
Anyone know exactly what that means?
I'm guessing that there are 100 layers total, 20 of which have tunable parameters, and the other 80 of which don't (e.g., max pooling and normalization layers).
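To make that distinction concrete, here's a minimal PyTorch sketch (my own illustration, not GoogLeNet itself): convolution and fully connected layers carry tunable weights, while ReLU, pooling, and local response normalization add depth without adding parameters.

```python
import torch.nn as nn

# Minimal sketch: one "parameter layer" (the conv) surrounded by
# parameter-free layers that still count toward total depth.
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # has tunable weights
    nn.ReLU(inplace=True),                         # no parameters
    nn.LocalResponseNorm(size=5),                  # no parameters
    nn.MaxPool2d(kernel_size=2),                   # no parameters
)

total_layers = len(list(block.children()))
param_layers = sum(
    1 for m in block.children() if any(p.requires_grad for p in m.parameters())
)
print(total_layers, param_layers)  # 4 total, 1 with tunable parameters
```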
That's pretty amazing. It seems like we're at a point where we could build really practical robots with this?

Robots to do dishes, weed crops, pick fruit? Why isn't this being applied to more tasks?
I wonder whether some of the intermediate layers in these models might correspond to something like "living room" or other locations that provide additional information about the objects that might be in the scene. For example, I suspect it was much easier for me to identify the preamp and the Wii in one of the pictures because I knew it was a living room/den instead of an office or study.
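One rough way to poke at this: register a forward hook on a mid-level layer of a pretrained network and compare the captured activations across images of different scene types. The sketch below is just an illustration; torchvision's GoogLeNet and the `inception4a` layer are stand-ins I picked, not anything from the article.

```python
import torch
from torchvision import models

# Capture activations from one mid-network layer so they can be compared
# across scene types (e.g., living rooms vs. offices). Model and layer
# choice are arbitrary stand-ins for illustration.
model = models.googlenet(weights="DEFAULT").eval()

activations = {}
def save_activation(_module, _inputs, output):
    activations["mid"] = output.detach()

model.inception4a.register_forward_hook(save_activation)

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed scene photo
with torch.no_grad():
    model(image)

# One channel-averaged vector per image; correlating these vectors with scene
# labels would hint at whether this layer encodes scene-level context.
scene_vector = activations["mid"].mean(dim=(2, 3))
print(scene_vector.shape)  # one feature vector per image
```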
I wish this were available as a translation app. You point your phone at a fruit stand and it names every single item, and you can then ask the vendor for the item by name.

It isn't that crazy; in fact, that's essentially what they have right now, just in English.
These classifications are amazing, but the fact that the first image in the article is classified as "a dog wearing a wide-brimmed hat" and not as "a chihuahua wearing a sombrero" is telling of how far we are from true understanding of images.

Only a human equipped with the relevant cultural stereotypes (chihuahua implies Mexican, ergo the hat must be a sombrero) could reach that conclusion.

Even so, I firmly believe that at this rate of improvement, we're not far from that kind of deep understanding.
How big is the model? Training these kinds of networks is expert work and requires enormous infrastructure, but if they released the model, I'm sure people like us could come up with all sorts of very useful applications.
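For what it's worth, reusing a released pretrained model would only take a few lines. The sketch below uses torchvision's GoogLeNet weights purely as a stand-in for whatever they might publish: freeze the released weights and swap the final classifier for your own task.

```python
import torch.nn as nn
from torchvision import models

# Sketch of building on a released pretrained model: freeze its weights and
# replace the final classification layer with one for your own application.
# The torchvision GoogLeNet weights stand in for a hypothetical released model.
model = models.googlenet(weights="DEFAULT")
for param in model.parameters():
    param.requires_grad = False  # keep the released weights fixed

num_custom_classes = 5  # e.g., the fruits at a market stand
model.fc = nn.Linear(model.fc.in_features, num_custom_classes)  # new, trainable head

# Only model.fc.parameters() would then be trained on your own small dataset.
```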