People might be interested in this video by @kcimc, where Kyle runs the pretrained model forward in real time on his laptop while walking around the streets of Amsterdam: https://vimeo.com/146492001

Something people don't fully appreciate about neural networks is that their performance is quite a strong function of their training data. In this case the training data comes from the MS COCO dataset (http://mscoco.org/explore/). That's why, for example, when Kyle points the camera at himself the model says something along the lines of "man with a suit and tie": there is a very strong correlation between that kind of image in the data and the presence of a suit and tie. With such a strong correlation the model doesn't have a chance to tease the two concepts apart. A similar problem would come up with an ImageNet model, where a similar image might be classified as "seatbelt", because there is no Person class there, and shots of people in that pose usually come from the seatbelt class; it happens to be the most similar concept in the data the model has seen. Another example: if you point the model at trees, it might hallucinate a giraffe, since the two are strongly correlated in the data. And when Kyle points the camera at the ground, I fully expect it to say relatively random things, because those kinds of images are very rare in the training data.

In other words, a lot of the "mistakes" are limitations of the training data and its variety rather than something to do with the model itself, and it's easier to recognize this if you're familiar with the training data, its classes, and its distribution.
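If you want to see the suit/tie correlation for yourself, here is a rough sketch that counts co-occurrences in the COCO caption annotations. It assumes you've downloaded the captions_train2014.json annotation file (the path below is just an example), and the substring matching is deliberately crude ("suit" will also match "suitcase", etc.), so treat the numbers as a ballpark check, not a measurement.

    # Rough check of how often "tie" co-occurs with "suit" in MS COCO captions.
    # Assumes annotations/captions_train2014.json has been downloaded from mscoco.org.
    import json

    with open("annotations/captions_train2014.json") as f:
        anns = json.load(f)["annotations"]

    captions = [a["caption"].lower() for a in anns]

    # Crude substring matching, good enough to see the trend.
    with_suit = [c for c in captions if "suit" in c]
    tie_given_suit = sum("tie" in c for c in with_suit)
    tie_overall = sum("tie" in c for c in captions)

    print("P(tie | suit) ~ %.3f" % (tie_given_suit / len(with_suit)))
    print("P(tie)        ~ %.3f" % (tie_overall / len(captions)))

If the conditional probability comes out much higher than the marginal, that's exactly the kind of dataset correlation the model ends up baking into its captions.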
Thank you. You mentioned that you plan on adding a re-ranker. Is that a re-ranker that encourages diversity, like what is done in this paper: http://arxiv.org/pdf/1510.03055.pdf ?
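For anyone who hasn't read that paper: the idea is to rescore the model's N-best list so that candidates which are generically likely under a plain language model get penalized, which pushes the output away from bland, high-frequency responses. A rough sketch of how that could look for captioning; log_p_caption_given_image and log_p_caption are hypothetical scoring functions (the captioning model and a separate language model over the caption corpus), and lambda_ trades off fidelity against diversity:

    # Sketch of MMI-style re-ranking adapted to captioning (not the authors' code):
    # rank candidates by log p(caption | image) - lambda * log p(caption).
    def mmi_rerank(candidates, log_p_caption_given_image, log_p_caption, lambda_=0.5):
        scored = [
            (log_p_caption_given_image(c) - lambda_ * log_p_caption(c), c)
            for c in candidates
        ]
        return [c for _, c in sorted(scored, reverse=True)]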