科技回声 (Tech Echo)

8 comments

buildbot, over 1 year ago

Something that stood out to me skimming the paper, and that was somewhat buried: they fine-tune the model on each benchmark.

"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task’s training data as described in the corresponding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"

S̶o̶ ̶t̶h̶i̶s̶ ̶i̶s̶ ̶c̶o̶m̶p̶a̶r̶i̶n̶g̶ ̶a̶ ̶s̶m̶a̶l̶l̶e̶r̶ ̶m̶o̶d̶e̶l̶ ̶f̶i̶n̶e̶t̶u̶n̶e̶d̶ ̶p̶e̶r̶ ̶b̶e̶n̶c̶h̶m̶a̶r̶k̶ ̶t̶o̶ ̶l̶a̶r̶g̶e̶r̶ ̶m̶o̶d̶e̶l̶s̶ ̶t̶h̶a̶t̶ ̶I̶ ̶p̶r̶e̶s̶u̶m̶e̶ ̶a̶r̶e̶ ̶n̶o̶t̶,̶ ̶t̶h̶o̶u̶g̶h̶ ̶I̶ ̶h̶a̶v̶e̶ ̶n̶o̶t̶ ̶r̶e̶a̶d̶ ̶t̶h̶e̶ ̶P̶a̶l̶i̶-̶X̶ ̶p̶a̶p̶e̶r̶.̶

Edit: No, I was wrong, PaLI-X is also fine-tuned before each task/set of tasks.

Impressive improvement!!!
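For a rough sense of what those two fine-tuning resolutions mean for sequence length, here is a minimal sketch that counts image patches. It assumes a square patch size of 14 pixels, which is not stated in the quote above, and the function name `num_visual_tokens` is made up for illustration.

```python
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Count non-overlapping square patches in a square image.

    patch_size=14 is an assumption for illustration; the quoted passage
    only gives the image resolutions.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


for res in (812, 1064):
    print(res, num_visual_tokens(res))
# prints:
# 812 3364
# 1064 5776
```

Under that assumption, moving from 812×812 to 1064×1064 raises the number of image patches from 3364 to 5776, roughly 1.7 times as many.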

tracyhenry, over 1 year ago

Maybe someone more informed can help me understand why they didn't compare to LLaVA (https://llava-vl.github.io/)?

light_hue_1, over 1 year ago

No comparison against GPT-4V? How embarrassing! Where are they going to submit this? A conference where no one knows about GPT-4V? Ridiculous.

It's getting really awkward seeing these papers from Google. "We're here too! We're totally not woefully behind everyone else in the field!" No model, no reasonable comparisons, just generic bragging.

I'm astounded as an ML researcher how Google can be doing so incredibly badly. They have access to unlimited compute, good people, and great infrastructure. Yet something about their internal culture means they are unable to compete with OpenAI, Facebook, and even the open source community. They constantly brag about how good their models are (even in private) and then every time they deploy anything its performance is pathetic (like Bard and Bard with vision).

You can tell why Google recently totally overhauled the leadership of Google Research/DeepMind and shut down Google Brain.

kolja005, over 1 year ago

I probably could have guessed that contrastive pre-training works better for downstream vision-language tasks than image classification pre-training, but it's nice to see this hypothesis thoroughly tested here.

Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text leaves my mind open.
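For readers unfamiliar with what "contrastive pre-training" optimizes, here is a minimal sketch of a pairwise sigmoid contrastive loss of the kind used in SigLIP-style image-text training. It is an illustration only, not the paper's training code: the function names, the toy embeddings, and the `temperature` and `bias` values are assumptions, and real implementations work on normalized batched tensors rather than Python lists.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(img, txt, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    img[i] and txt[i] are assumed to be unit-length embeddings of a
    matching pair; every other (i, j) combination is treated as a negative.
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(img[i], txt[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


# Toy batch: two images, two captions, 2-dimensional embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

print(sigmoid_contrastive_loss(images, aligned_text))
print(sigmoid_contrastive_loss(images, swapped_text))
```

The loss is lower when each image embedding lines up with its own caption than when the captions are swapped, which is the signal that pulls matching image-text pairs together. Classification pre-training, by contrast, only teaches the encoder to pick one label from a fixed set.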

sgd99, over 1 year ago

Can anyone explain how the visual tokens, which are concatenated with the tokenizer outputs for the encoder, are created?
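As a rough illustration of the general idea behind this question, here is a toy sketch: the image is cut into patches, each patch becomes one vector, those vectors are projected to the text model's embedding width, and the result is concatenated with the embedded text tokens. This is a simplification under stated assumptions, not the paper's implementation. In a real model the patch vectors would first pass through the ViT's transformer layers, and every name here (`patchify`, `linear`, `build_encoder_input`, the sizes, the random weights) is hypothetical.

```python
import random


def patchify(image, patch):
    """Split an HxW image (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def linear(vec, weights):
    """Project vec (length n) with a weight matrix of shape n x d."""
    d = len(weights[0])
    return [sum(v * weights[k][c] for k, v in enumerate(vec)) for c in range(d)]


def build_encoder_input(image, text_ids, patch, embed_table, proj):
    """Return visual tokens followed by text token embeddings."""
    # Stand-in for the image encoder: one projected vector per patch.
    visual_tokens = [linear(p, proj) for p in patchify(image, patch)]
    # Text side: look up one embedding per token id from the tokenizer.
    text_tokens = [embed_table[t] for t in text_ids]
    return visual_tokens + text_tokens


rng = random.Random(0)
d_model, patch = 8, 2
image = [[rng.random() for _ in range(4)] for _ in range(4)]  # 4x4 grayscale
proj = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(patch * patch)]
embed_table = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(10)]

seq = build_encoder_input(image, [3, 1, 4], patch, embed_table, proj)
print(len(seq), len(seq[0]))  # prints: 7 8
```

The 4×4 toy image yields four 2×2 patches, so the encoder input has four visual tokens followed by three text tokens, all of width 8.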

facu17y, over 1 year ago

No GitHub?

Technotroll, over 1 year ago

Does the vision-language model process raw image data, or does it process OCR character output?

doggerel, over 1 year ago

The copyright violation is coming from inside the house.

Even undigitized materials aren't safe any more.

PaLI-3 Vision Language Models