I'm aware, to a certain extent, of the vision algorithm running in my own head, and I'm not sure machine vision does the same thing. You can run simple thought experiments to see what your brain actually does when it analyzes an image.
First of all, when I look at a scene I am 100% aware of geometry. Regardless of meaning, words, and symbols, I can trace out the three-dimensional shape of things without any association to words.

How do I know I can do this? Simple. Every scene I look at, I can translate or imagine in my head as a wireframe or low-poly scene, as if it were generated by a computer. Similarly, if I look at a wireframe scene generated by a computer, my mind can translate it into a scene that looks real. Try it, you can do it.

Second, I can look at a low-poly wireframe model of an elephant and associate it with the word 'elephant.' I do not need color or detail to know it's an elephant. In fact, with color and detail alone it is harder for me to identify an elephant. For example, if someone takes many very close-up photographs of parts of an elephant, like its eye, skin, ear, etc., and asks me to guess the subject by interpreting the pictures... I become fully aware that I am accessing a slower, different part of my brain to deduce the meaning. This is a stark contrast to the instantaneous word association when I look at a wireframe model of an elephant. The speed difference between the two ways of identifying an elephant indicates to me that geometric interpretation is the primary driver behind our visual analysis, and details like color or texture are tertiary when it comes to identification. I believe the visual cortex determines shape first, then determines the word from the shape.

If you feed a white sculpture of an elephant or a wireframe of an elephant into one of these deep learning networks, it is unlikely you will get the word 'elephant' as output. But if you feed it a real picture of an elephant, it can correctly identify the elephant (assuming it was trained on photos of elephants). Because the delta between a white sculpture of an elephant and an actual photo of an elephant is just color and detail, this indicates to me that when you train these networks to recognize an elephant, you are training them to recognize details. It's a form of overfitting: the training is not general enough to capture geometry. The network is correlating blobs of pixels, color, and detail with an elephant rather than associating a three-dimensional model of it with the word... the opposite of what humans do. In fact, I bet that if you took those very close-up photographs of an elephant and fed them into the network, it would do a better job at recognition than with the picture of a white sculpture of an elephant.

This indicates to me that to improve our vision algorithms, the algorithm must first associate pixels with geometry, then associate the geometry with a word, rather than trying to associate blobs of pixels with words. Train geometry recognition before word association.

My guess is that our minds have specific, genetically determined, built-in geometry recognition algorithms honed to turn a 2D image into a 3D shape. We do not learn to translate 2D to 3D; we are born with that ability hardwired. Where learning comes in is the translation of this shape to a word. Whereas most of the machine learning we focus on in research is image recognition, I believe the brain is actually learning shape and geometry recognition.
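The sculpture-versus-photo claim above is easy to check empirically. Here is a minimal sketch, assuming torchvision >= 0.13 is installed and that "elephant_photo.jpg" and "elephant_sculpture.jpg" are hypothetical local files you supply yourself; ImageNet's actual labels are 'African elephant', 'Indian elephant', and 'tusker', so look at the top-5 guesses rather than an exact string match.

    # Quick test of the sculpture-vs-photo claim: run a stock ImageNet
    # classifier on both images and compare its top-5 guesses.
    import torch
    from PIL import Image
    from torchvision.models import resnet50, ResNet50_Weights

    weights = ResNet50_Weights.DEFAULT        # ImageNet-1k pretrained weights
    model = resnet50(weights=weights).eval()
    preprocess = weights.transforms()         # resize / crop / normalize
    labels = weights.meta["categories"]       # the 1000 ImageNet class names

    def top5(path):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            probs = model(img).softmax(dim=1)[0]
        vals, idxs = probs.topk(5)
        return [(labels[i], round(v.item(), 3)) for v, i in zip(vals, idxs)]

    # Hypothetical file names -- substitute your own test images.
    for path in ["elephant_photo.jpg", "elephant_sculpture.jpg"]:
        print(path, top5(path))

If the prediction collapses on the sculpture but stays correct on the photo, that supports the texture-over-shape reading; if it doesn't, the claim needs revisiting.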
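The "geometry first, then word" proposal can also be written down as a two-stage model in which the classifier is only ever shown an intermediate geometry estimate, never the raw pixels, so the label cannot depend on color or texture. Below is a toy PyTorch sketch of that ordering, not a working vision system: the layer sizes are arbitrary placeholders, and the single-channel depth map is just a stand-in for whatever 3D representation you would actually train the first stage to produce (e.g. from synthetic renders or depth sensors) before attaching the word-association stage.

    # Toy sketch of the proposed ordering: pixels -> geometry -> word.
    # Stage 1 predicts a per-pixel depth map (a stand-in for "3D shape");
    # stage 2 sees ONLY that depth map, so the predicted label can depend
    # on shape but not on color or texture.
    import torch
    import torch.nn as nn

    class GeometryNet(nn.Module):
        """Pixels -> coarse depth map (the geometry stage)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 3, padding=1),   # 1-channel depth estimate
            )
        def forward(self, rgb):
            return self.net(rgb)

    class ShapeClassifier(nn.Module):
        """Depth map -> class label (the word-association stage)."""
        def __init__(self, num_classes):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, num_classes),
            )
        def forward(self, depth):
            return self.net(depth)

    geometry = GeometryNet()
    classifier = ShapeClassifier(num_classes=1000)

    rgb = torch.randn(1, 3, 224, 224)   # stand-in for a photo
    depth = geometry(rgb)               # step 1: recover shape from pixels
    logits = classifier(depth)          # step 2: shape -> word
    print(depth.shape, logits.shape)    # (1, 1, 224, 224) and (1, 1000)

The open question is the training schedule: giving stage 1 its own geometric supervision first and only then learning labels on top of it is exactly the "train geometry recognition before word association" ordering argued for above.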