They kind of slip it under the rug that for the PASCAL VOC tests, unsupervised learning was only used for pre-training, followed by supervised training before evaluation. That's the difference between "this course will teach you Spanish" and "this course is good preparation before you start your actual Spanish course".

Also, while it is laudable that they attempt to learn slow, higher-level features, the result of contrastive loss functions is still very much detail-focused; it is just detail-focused in a translationally invariant way.

A common problem in image classification is that the network learns to recognize high-level fur patterns rather than the shape of the animal. Using contrastive loss terms like the ones in their example drives the network toward producing the same feature vector for adjacent pixels, which means the fur-pattern detector must become translation-invariant. But the contrastive loss term will NOT prevent the network from recognizing the fur rather than the shape, as the article claims.
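To make the "same feature vector for adjacent pixels" point concrete, here is a minimal sketch (not the paper's actual objective; shapes, names, and the choice of horizontal neighbors as positives are my assumptions) of an InfoNCE-style contrastive term that pulls adjacent-pixel features together and pushes all other pixels apart. Note that nothing in this loss cares *what* the features encode, so a translation-invariant fur detector satisfies it just as well as a shape detector:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, D = 8, 8, 16  # hypothetical feature map: height, width, channels
feats = rng.normal(size=(H, W, D))
feats /= np.linalg.norm(feats, axis=-1, keepdims=True)  # unit-normalize

def contrastive_adjacent_loss(f, temperature=0.1):
    """InfoNCE over horizontally adjacent pixel pairs.

    Positive pair: (f[i, j], f[i, j+1]) -- neighboring pixels.
    Negatives: every pixel's feature vector in the map.
    """
    anchors = f[:, :-1].reshape(-1, f.shape[-1])    # left pixel of each pair
    positives = f[:, 1:].reshape(-1, f.shape[-1])   # its right-hand neighbor
    all_feats = f.reshape(-1, f.shape[-1])

    pos_sim = (anchors * positives).sum(-1) / temperature  # (N,)
    all_sim = anchors @ all_feats.T / temperature          # (N, H*W)
    # -log softmax of the positive similarity against all candidates
    loss = -pos_sim + np.log(np.exp(all_sim).sum(-1))
    return loss.mean()

print(contrastive_adjacent_loss(feats))
```

Minimizing this drives neighboring features toward each other, i.e. toward translation invariance of whatever local pattern the network happens to detect.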