PaLI-3 Vision Language Models

176 points by maccaw over 1 year ago

8 comments

buildbot over 1 year ago
Something that stood out to me skimming the paper - that was somewhat buried - they fine-tune the model on each benchmark.

"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task's training data as described in the corresponding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"

S̶o̶ ̶t̶h̶i̶s̶ ̶i̶s̶ ̶c̶o̶m̶p̶a̶r̶i̶n̶g̶ ̶a̶ ̶s̶m̶a̶l̶l̶e̶r̶ ̶m̶o̶d̶e̶l̶ ̶f̶i̶n̶e̶t̶u̶n̶e̶d̶ ̶p̶e̶r̶ ̶b̶e̶n̶c̶h̶m̶a̶r̶k̶ ̶t̶o̶ ̶l̶a̶r̶g̶e̶r̶ ̶m̶o̶d̶e̶l̶s̶ ̶t̶h̶a̶t̶ ̶I̶ ̶p̶r̶e̶s̶u̶m̶e̶ ̶a̶r̶e̶ ̶n̶o̶t̶,̶ ̶t̶h̶o̶u̶g̶h̶ ̶I̶ ̶h̶a̶v̶e̶ ̶n̶o̶t̶ ̶r̶e̶a̶d̶ ̶t̶h̶e̶ ̶P̶a̶l̶i̶-̶X̶ ̶p̶a̶p̶e̶r̶.̶

Edit - No, I was wrong, PaLI-X is also fine-tuned before each task/set of tasks.

Impressive improvement!!!
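For readers unfamiliar with what "fine-tune with frozen ViT image encoder" looks like in practice, here is a minimal PyTorch-style sketch. The loader, module, and loop names (load_pali3_checkpoint, vit_encoder, benchmark_train_loader) are hypothetical stand-ins, not the paper's code:

    # Hypothetical sketch: per-benchmark fine-tuning with a frozen ViT image encoder.
    import torch

    model = load_pali3_checkpoint("pali3-812px")      # hypothetical loader

    # Freeze every parameter in the vision tower
    for p in model.vit_encoder.parameters():
        p.requires_grad = False

    # Only the remaining (text encoder-decoder) parameters are updated
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)

    for batch in benchmark_train_loader:              # one loader per benchmark
        loss = model(images=batch["image"],           # images at 812x812
                     text_in=batch["prompt"],
                     text_out=batch["target"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Freezing the image tower keeps the contrastively pre-trained visual representation intact while the per-benchmark data only updates the text backbone.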
tracyhenry over 1 year ago
Maybe someone more informed can help me understand why they didn't compare to LLaVA (https://llava-vl.github.io/)?
light_hue_1 over 1 year ago
No comparison against GPT-4V? How embarrassing! Where are they going to submit this? A conference where no one knows about GPT-4V? Ridiculous.

It's getting really awkward seeing these papers from Google. "We're here too! We're totally not woefully behind everyone else in the field!". No model, no reasonable comparisons, just generic bragging.

I'm astounded as an ML researcher how Google can be doing so incredibly badly. They have access to unlimited compute, good people, and great infrastructure. Yet something about their internal culture means they are unable to compete with OpenAI, Facebook, and even the open source community. They constantly brag about how good their models are (even in private) and then every time they deploy anything its performance is pathetic (like Bard and Bard with vision).

You can tell why Google recently totally overhauled the leadership of Google Research/DeepMind and shut down Google Brain.
kolja005 over 1 year ago
I probably could have guessed that contrastive pre-training works better for downstream vision-language tasks than image classification pre-training, but it's nice to see this hypothesis thoroughly tested here.

Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text leaves my mind open.
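For readers who haven't seen the objective: "contrastive pre-training" here means CLIP/SigLIP-style image-text matching, where matched pairs are pulled together and mismatched pairs pushed apart. A rough sketch of the softmax (CLIP-style) form is below; PaLI-3's SigLIP encoder actually uses a pairwise sigmoid variant, and all names here are illustrative rather than the authors' code:

    # Illustrative sketch of a CLIP-style contrastive objective over a batch
    # of paired image and text embeddings.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Normalize so the dot product is a cosine similarity
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        logits = image_emb @ text_emb.t() / temperature   # [batch, batch]
        targets = torch.arange(logits.size(0))            # diagonal = true pairs

        # Symmetric cross-entropy over image->text and text->image directions
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2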
sgd99 over 1 year ago
Can anyone explain how the visual tokens that get concatenated with the tokenizer outputs for the encoder are created?
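One plausible reading, based on how prior PaLI models describe the architecture: the ViT's output patch embeddings are linearly projected into the text embedding space and concatenated in front of the embedded text tokens before the encoder-decoder. A rough sketch under that assumption (class names and dimensions are illustrative, not the actual PaLI-3 code):

    # Rough sketch of forming visual tokens in a PaLI-style model.
    import torch
    import torch.nn as nn

    class VisualTokenizer(nn.Module):
        def __init__(self, vit, vit_dim=1152, text_dim=2048):
            super().__init__()
            self.vit = vit                             # contrastively pre-trained ViT
            self.proj = nn.Linear(vit_dim, text_dim)   # map into text embedding space

        def forward(self, images, text_token_embeddings):
            patch_feats = self.vit(images)             # [B, num_patches, vit_dim]
            visual_tokens = self.proj(patch_feats)     # [B, num_patches, text_dim]
            # Prepend visual tokens to the embedded text tokens; the combined
            # sequence is then fed to the encoder-decoder.
            return torch.cat([visual_tokens, text_token_embeddings], dim=1)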
facu17y over 1 year ago
no github?
Technotroll over 1 year ago
Does the vision-language model process raw image data, or does it process OCR character output?
doggerel over 1 year ago
The copyright violation is coming from inside the house.

Even undigitized materials aren't safe any more.