TechEcho
Clip: Connecting Text and Images

119 points by sama over 4 years ago

5 comments

mlucy over 4 years ago
This is really cool! I especially like that there's a premade colab notebook that lets you play with it: https://colab.research.google.com/github/openai/clip/blob/master/Interacting_with_CLIP.ipynb

I'm a little surprised that the paper doesn't seem to mention the effect of fine-tuning pretrained image and text encoders taken from somewhere else instead of learning the encoding from scratch. I would naively expect that to take way less compute to get good results, and possibly generalize better.

I guess the point is to test whether this technique is actually good for learning new representations from scratch? Still, I'm sure they must have run the experiment at some point just to see, and it would've been really interesting to see the numbers.
neosat over 4 years ago
Impressive work! Their approach makes a ton of sense: "by not directly optimizing for the benchmark, we show that it becomes much more representative"

This is evident in the generalization and robustness: ~74% higher on the adversarial benchmark and similar/superior results on standard ImageNet.
gravy over 4 years ago
Sometimes I look at stuff like this and sit in despair that I'm not learning anything that will let me work on these kinds of problems while I work at a defense contractor maintaining 20+ year old code.
liuliu over 4 years ago
It still took a great amount of computational resources (~250 V100s for 12 days or ~500 V100s for 18 days), but this can have a much broader impact on everyday life much sooner. It could quickly translate to much more reasonable image labels, search rankings, and video recommendations. Very impressive work.
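For a rough sense of scale, the two training runs quoted above can be converted to GPU-days and GPU-hours. This is only back-of-the-envelope arithmetic on the approximate figures in the comment; the helper names `gpu_days` and `gpu_hours` are made up for illustration, and the result says nothing about cost or actual utilization.

```python
# Back-of-the-envelope totals for the two runs quoted in the comment.
# GPU counts and durations are the commenter's approximations, not exact figures.
HOURS_PER_DAY = 24


def gpu_days(num_gpus: int, days: float) -> float:
    """Total GPU-days, assuming every GPU is busy for the whole run."""
    return num_gpus * days


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for the same run."""
    return gpu_days(num_gpus, days) * HOURS_PER_DAY


runs = {
    "~250 V100s for 12 days": (250, 12),
    "~500 V100s for 18 days": (500, 18),
}

for label, (num_gpus, days) in runs.items():
    print(
        f"{label}: {gpu_days(num_gpus, days):,.0f} GPU-days, "
        f"{gpu_hours(num_gpus, days):,.0f} GPU-hours"
    )
```

Under those assumptions the first run comes to roughly 3,000 GPU-days and the second to roughly 9,000 GPU-days.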
kordlessagain over 4 years ago
Now if there were just a model that would clip images out of a page for me.