This is really cool! I especially like that there's a premade Colab notebook that lets you play with it: https://colab.research.google.com/github/openai/clip/blob/master/Interacting_with_CLIP.ipynb

I'm a little surprised that the paper doesn't seem to mention the effect of fine-tuning pretrained image and text encoders taken from somewhere else instead of learning the encoding from scratch. I would naively expect that to take way less compute to get good results, and possibly generalize better.

I guess the point is to test whether this technique is actually good for learning new representations from scratch? Still, I'm sure they must have run the experiment at some point just to see, and it would've been really interesting to see the numbers.
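For concreteness, here's roughly the kind of setup I have in mind: take off-the-shelf pretrained encoders and fine-tune them with CLIP's symmetric contrastive loss. The specific choices below (ResNet-50 and BERT as the encoders, a 512-d projection, a fixed temperature) are just my assumptions for the sketch, not anything from the paper:

```python
# Sketch: fine-tune pretrained unimodal encoders with a CLIP-style
# contrastive objective, instead of training both encoders from scratch.
# ResNet-50 / BERT and all dimensions here are illustrative choices.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import BertModel

image_encoder = resnet50(pretrained=True)
image_encoder.fc = torch.nn.Linear(2048, 512)  # swap classifier for a projection head
text_encoder = BertModel.from_pretrained("bert-base-uncased")
text_proj = torch.nn.Linear(768, 512)          # project BERT output to the shared space
temperature = 0.07                             # fixed here; CLIP learns it

def clip_loss(images, input_ids, attention_mask):
    # Embed both modalities into the same 512-d space and L2-normalize.
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_out = text_encoder(input_ids=input_ids, attention_mask=attention_mask)
    txt_emb = F.normalize(text_proj(txt_out.pooler_output), dim=-1)
    # Cosine-similarity logits; matching image-text pairs sit on the diagonal.
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(len(images), device=images.device)
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```

The experiment I'd want to see is exactly this, run at their scale: how much of CLIP's zero-shot performance do you recover (and how much faster) when the encoders start from pretrained weights rather than random initialization.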