Something that stood out to me skimming the paper, and was somewhat buried: they fine-tune the model on each benchmark.

"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task's training data as described in the corresponding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"

S̶o̶ ̶t̶h̶i̶s̶ ̶i̶s̶ ̶c̶o̶m̶p̶a̶r̶i̶n̶g̶ ̶a̶ ̶s̶m̶a̶l̶l̶e̶r̶ ̶m̶o̶d̶e̶l̶ ̶f̶i̶n̶e̶t̶u̶n̶e̶d̶ ̶p̶e̶r̶ ̶b̶e̶n̶c̶h̶m̶a̶r̶k̶ ̶t̶o̶ ̶l̶a̶r̶g̶e̶r̶ ̶m̶o̶d̶e̶l̶s̶ ̶t̶h̶a̶t̶ ̶I̶ ̶p̶r̶e̶s̶u̶m̶e̶ ̶a̶r̶e̶ ̶n̶o̶t̶,̶ ̶t̶h̶o̶u̶g̶h̶ ̶I̶ ̶h̶a̶v̶e̶ ̶n̶o̶t̶ ̶r̶e̶a̶d̶ ̶t̶h̶e̶ ̶P̶a̶l̶i̶-̶X̶ ̶p̶a̶p̶e̶r̶.̶

Edit - No, I was wrong, PaLI-X is also fine-tuned before each task/set of tasks.

Impressive improvement!
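For anyone who wants to picture that per-benchmark step, here's a rough sketch of the recipe as the paper describes it (frozen ViT image encoder, everything else trainable on one task's training split). The model/dataloader interface below is hypothetical, just to illustrate the idea; it is not the authors' code:

    import torch

    def finetune_per_benchmark(model, benchmark_loader, epochs=1, lr=1e-5):
        # Freeze the image encoder so only the language/decoder weights update.
        for p in model.image_encoder.parameters():
            p.requires_grad = False

        trainable = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.AdamW(trainable, lr=lr)

        model.train()
        for _ in range(epochs):
            for images, texts, targets in benchmark_loader:
                # Hypothetical HF-style interface returning a loss.
                loss = model(images, texts, labels=targets).loss
                opt.zero_grad()
                loss.backward()
                opt.step()
        return model

You would run this once per benchmark, starting each time from the same 812×812 (or 1064×1064) checkpoint.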
Maybe someone more informed can help me understand why they didn't compare to LLaVA (https://llava-vl.github.io/)?
No comparison against GPT-4V? How embarrassing! Where are they going to submit this? A conference where no one knows about GPT-4V? Ridiculous.

It's getting really awkward seeing these papers from Google. "We're here too! We're totally not woefully behind everyone else in the field!" No model, no reasonable comparisons, just generic bragging.

As an ML researcher, I'm astounded at how badly Google is doing. They have access to unlimited compute, good people, and great infrastructure. Yet something about their internal culture means they are unable to compete with OpenAI, Facebook, and even the open-source community. They constantly brag about how good their models are (even in private), and then every time they deploy anything its performance is pathetic (like Bard and Bard with vision).

You can tell why Google recently totally overhauled the leadership of Google Research/DeepMind and shut down Google Brain.
I probably could have guessed that contrastive pre-training works better for downstream vision-language tasks than image classification pre-training, but it's nice to see this hypothesis thoroughly tested here.

Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how well LLMs are able to output long sequences of structured text leaves my mind open.
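To make "contrastive pre-training" concrete: PaLI-3's image encoder comes from SigLIP-style training, which scores every image-text pair in a batch with a sigmoid loss instead of a batch-wide softmax. A minimal illustrative sketch (not the paper's implementation; in the real setup the temperature t and bias b are learned scalars):

    import torch
    import torch.nn.functional as F

    def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
        # img_emb, txt_emb: L2-normalized embeddings, shape (batch, dim).
        # Pairwise logits; matching image-text pairs sit on the diagonal.
        logits = img_emb @ txt_emb.T * t + b
        n = logits.size(0)
        # +1 for matching (diagonal) pairs, -1 for every other pair.
        labels = 2 * torch.eye(n, device=logits.device) - 1
        # Independent sigmoid loss per pair, averaged over the batch.
        return -F.logsigmoid(labels * logits).sum() / n

The classification-pretrained baseline would instead train the encoder to predict fixed labels, which is exactly the setup the paper argues transfers worse to vision-language tasks.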