
Vision language models are blind

451 points · by taesiri · 11 months ago

47 comments

sweezyjeezy · 11 months ago
Entertaining, but I think the conclusion is way off.

> their vision is, at best, like that of a person with myopia seeing fine details as blurry

is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well, actually I can't) share some examples from my job of GPT-4V doing some pretty difficult fine-grained visual tasks that invalidate this.

Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things, *assuming they have seen a LOT of that type of data during training* (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors have not presented any real evidence about these two modes.

[1] https://arxiv.org/abs/2404.04125
jetrink · 11 months ago
I had a remarkable experience with GPT-4o yesterday. Our garage door started to fall down recently, so I inspected it and found that our landlord had installed the wire rope clips incorrectly, leading to the torsion cables losing tension. I didn't know what that piece of hardware was called, so I asked ChatGPT, and it identified the part as I expected it to. As a test, I asked if there was anything notable about the photo. ChatGPT correctly identified that the cables were installed backwards, with the side of the cable that was (previously) under tension on top of the slack end, instead of sandwiched securely in the middle. Diagnosing that requires tracing the cable through space and inferring which end is under tension from the geometry, though I can't rule out an educated guess.

What was really remarkable, though, was that it failed to notice that one of the two nuts was obviously missing, even after I told it there was a second problem with the installation.

Screenshot: https://imgur.com/a/QqCNzOM
JeremyHerrman · 11 months ago
VLMs have so far never been good at counting objects or at spatial relationships (e.g. "the coffee is to the right of the microwave").

There are ways to help the VLM out. Set of Marks [0] from Microsoft is the most prominent: it uses segmentation to outline and label sections of the image before sending it to the VLM.

Providing "speakable" labels for regions helps ground the visual abilities of VLMs, and is why in this paper performance is so much better when words are present in the grid for "Task 6: Counting the rows and columns of a grid."

[0] https://github.com/microsoft/SoM
joelburget · 11 months ago
Vision transformers do a shocking amount of compression in the tokenizer. In the Chameleon paper (https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 × 512 / 1024). If we assume a pixel is 24 bits (3 × 8-bit channels), this implies they've compressed 256 × 24 = 6144 bits into 13 (= log2(8192)). "An Image is Worth 32 Tokens for Reconstruction and Generation" (https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
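The arithmetic above is easy to verify; a quick sketch, using the figures quoted from the Chameleon paper:

```python
import math

# Figures quoted from the Chameleon paper's tokenizer description.
side = 512        # image is 512 x 512 pixels
tokens = 1024     # discrete tokens per image
codebook = 8192   # codebook size

pixels_per_token = side * side // tokens   # 256 pixels per token
bits_per_pixel = 24                        # 3 channels x 8 bits
bits_in = pixels_per_token * bits_per_pixel
bits_out = int(math.log2(codebook))        # bits carried by one token

print(pixels_per_token)   # 256
print(bits_in, bits_out)  # 6144 13
```

So each token stands in for 6144 bits of raw pixel data with only 13 bits of code, a roughly 470x compression before the language model ever sees the image.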
cs702 · 11 months ago
Wow, that is *embarrassingly bad performance* for current SOTA models (GPT-4o, Gemini 1.5 Pro, Sonnet-3, Sonnet-3.5), which are advertised and sold as being able to understand images, e.g., for guiding the blind or tutoring children in geometry!

The tasks at which they fail are ridiculously simple for human beings, including, for example:

* counting the number of times two lines intersect;

* detecting whether two circles overlap;

* selecting which letter is being circled in a word;

* counting the number of circles in an Olympic-like logo.

This should be at the top of the front page.
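For scale, the circle-overlap task the models fail reduces to a one-line distance check. A minimal sketch (the coordinates below are made up, not taken from the paper's benchmark):

```python
import math

def circles_overlap(c1, r1, c2, r2):
    """Two circles overlap iff the distance between their
    centers is less than the sum of their radii."""
    return math.dist(c1, c2) < r1 + r2

print(circles_overlap((0, 0), 1.0, (1.5, 0), 1.0))  # True
print(circles_overlap((0, 0), 1.0, (3.0, 0), 1.0))  # False
```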
diwank · 11 months ago
While the findings are interesting, the title "Vision language models are blind" is hyperbolic and misrepresents them. It is true that VLMs sense and work with image inputs very differently: at low resolutions, each image is split into blocks that are then mapped into discrete tokens. They really don't have access to fine-grained details, because this mapping is quite lossy. So in that sense these results make complete sense and were not surprising to me, but "blind" is a strong implication and the work isn't really able to back it up.

Take the first example of the four 2D line plots, for instance. I asked Sonnet 3.5 for an answer, retried five times, and 2 out of 5 times it said something like this:

*Me:*

here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?

Write your answers as: Plot 1 (leftmost): ...

*Sonnet:*

I'll analyze each plot from left to right and provide the requested information:

Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0 intersections (the lines do not cross)

Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the right edge)

Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the lines cross once near the center)

Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2 intersections (the lines cross twice, once near the left side and once near the right side)

Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.

---

Which isn't completely correct (the 3rd plot has 2 intersections) but is still pretty good, and the other 3 times it said plot 3 has 1 intersection.
akavi · 11 months ago
Speaking as someone with only a tenuous grasp of how VLMs work, this naively feels like a place where the "embodiment" folks might have a point: humans can "refine" their perception of an image iteratively, focusing in on areas of interest, while VLMs have to process the entire image at the same level of fidelity.

I'm curious whether there'd be a way to emulate this (have the visual tokens be low fidelity at first, but allow the VLM to emit tokens that correspond to "focusing" on a region of the image at greater resolution). I'm not sure if or how it's possible to performantly train a model on "interactive" data like that, though.
poikroequ · 11 months ago
It's ironic: they fail these seemingly simple tests that are trivial even for a child to solve. Yet I used Gemini to read a postcard containing handwritten Russian cursive with lots of visual noise (postmarks and whatnot). It was able to read the text and translate it into English. I didn't even need to tell it the text was Russian.

On the one hand, it's incredible what these LLMs are capable of. On the other hand, they often fall flat on their face with seemingly simple problems like this. We are seeing the same from self-driving cars, which get into accidents in scenarios that almost any human driver could easily have avoided.
mglz · 11 months ago
I taught some computational geometry courses, and efficiently computing the intersections of N line segments is not as straightforward as you might initially think. Since some computation must be done somewhere to recognize an intersection, and LLMs are not specifically trained for this task, it's not surprising they struggle.

In general, basic geometry seems under-explored by learning.
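Detecting a single pairwise crossing, at least, is a standard orientation (cross-product sign) test; a minimal sketch below. Brute-forcing all pairs of N segments this way is O(N²), which is why the Bentley-Ottmann sweep, at O((N + K) log N) for K crossings, is the classic course material:

```python
def ccw(a, b, c):
    # Sign of the cross product (b - a) x (c - a):
    # > 0 counter-clockwise, < 0 clockwise, 0 collinear.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_intersect(p1, p2, p3, p4):
    """Proper-intersection test for segments p1p2 and p3p4.
    Each segment must straddle the line through the other
    (collinear edge cases are ignored for brevity)."""
    d1 = ccw(p3, p4, p1)
    d2 = ccw(p3, p4, p2)
    d3 = ccw(p1, p2, p3)
    d4 = ccw(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # True
print(segments_intersect((0, 0), (1, 0), (0, 1), (1, 1)))  # False
```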
GaggiX · 11 months ago
Well, all the models (especially Claude 3.5 Sonnet) seem to perform much better than random, so they are clearly not blind. The only task where Claude 3.5 Sonnet does not perform better than random is the one where you have to follow many different paths (the one where the answer from A to C is 3), something that would take me several seconds to solve.

I have the feeling that they first chose the title of the paper and then ran the evaluation of the new Claude 3.5 Sonnet on these abstract images.

> their vision is, at best, like that of a person with myopia seeing fine details as blurry

This also makes no sense, since the images evaluate the abstract capabilities of the models, not their eyesight.
yantrams · 11 months ago
Tested these problems with llava-v1.6-mistral-7b and the results aren't bad. Maybe I just got lucky with these samples.

Intersecting Lines: https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc

Overlapping Circles: https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384

Touching Circles: https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc

Circled Text: https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc

Nested Squares: https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm
taesiri · 11 months ago
This paper examines the limitations of current vision language models, such as GPT-4 and Sonnet 3.5, in performing low-level vision tasks. Despite their high scores on numerous multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?
simonw · 11 months ago
I've been generally frustrated at the lack of analysis of vision LLMs.

They're clearly a very exciting category of technology, and a pretty recent one: they only got good last October with GPT-4 Vision, but since then we've had more vision models from Anthropic and Google Gemini.

There's so much more information out there about text prompting compared to image prompting. I feel starved for useful information about their capabilities: what are vision models good and bad at, and what are the best ways to put them to work?
dheera · 11 months ago
Current multi-modal approaches work on embeddings and tokenizations of images, which is the fundamental problem: you are feeding blurry, imprecise data into the model. Yes, they are blind because of exactly this.

An embedding isn't conceptually much different from feeding in a 1024-word description of an image instead of the actual image.

At the moment compute power isn't good enough to feed high-res pixel data into these models, unless we discover a vastly different architecture, which I am also convinced likely exists.
jeromeparadis · 11 months ago
One use case I always try is having an AI read a school-calendar image where days off or days of interest are highlighted using a legend, i.e. days marked with a square, circle, or triangle, or in a different color, etc.

When asked for specific days of interest in the school year, AIs always struggle. They get some days right but forget some or confabulate new ones. They fare a bit better if you remove some of the noise and give them only a picture of a single month, but even then it's unreliable.
_vaporwave_ · 11 months ago
It's really interesting that there's a huge performance discrepancy between these SOTA models. In the Olympic-logo example, GPT-4o is below the baseline accuracy of 20% (worse than randomly guessing) while Sonnet-3.5 was correct ~76% of the time.

Does anyone have any technical insight or intuition as to why this large variation exists?
pjs_ · 11 months ago
I don't like this paper, for the following reasons:

- The language is unnecessarily scathing.

- They repeatedly show data where the models get things *right* 70, 80, 90% of the time, and then show a list of what they call "qualitative samples" (what does "qualitative" mean? "cherry-picked"?) which look very bad. But it got the answer right 70/80/90% of the time! That's hardly "blind"...

- Several of the tasks hinge on the distinction between two objects "exactly touching" vs. "very nearly touching" vs. "very slightly overlapping", a problem which (i) is hard for humans and (ii) is particularly (presumably deliberately) sensitive to resolution/precision, where we should not be surprised that models fail.

- The main fish-shaped example given in Task 1 seems genuinely ambiguous to me: do the lines "intersect" once or twice? The tail of the fish clearly has a crossing, but the nose of the fish seems a bit fishy to me... is that really an intersection?

- As far as I'm concerned, deranged skepticism is just as bad as deranged hype, and the framing here risks appealing to the former.

It's absolutely fair to point out that these models are not perfect, that they fail a good deal of the time, and to call out the edge cases where they suck. That moves the field forward. But the hyperbole (as pointed out by another commenter) is very annoying.
gnutrino · 11 months ago
My guess is that these systems run image-recognition models, and maybe OCR, on the images, and then just pipe that data as tokens into an LLM. So you are only ever going to get results as good as existing image models, filtered through an LLM.

To me, this is only interesting if compared with the results of image-recognition models that can already answer these types of questions (if they even exist; I haven't looked).

Maybe the service is smart enough to look at the question and then choose one or more models to process the image, but I'm not sure, as I can't find anything on their sites about how it works.
jordan_bonecut · 11 months ago
This is an interesting article, and it matches my understanding of how such models interpret input data. I'm not sure I would characterize the results as blurry vision, but rather as an inability to process what they see in a concrete manner.

All the LLMs and multi-modal models I've seen lack concrete reasoning. For instance, ask ChatGPT to perform two tasks: summarize a chunk of text and count how many words are in that chunk. ChatGPT will do a very good job summarizing the text and an awful job counting the words. ChatGPT, and all the transformer-based models I've seen, fail at similar concrete/mathematical reasoning tasks. This is the core problem of creating AGI, and it generally seems that no one has made any progress towards synthesizing something with both a high and a low level of intelligence.

My (unproven and probably incorrect) theory is that under the hood these networks lack information-processing loops, which makes recursive tasks, like solving a math problem, very difficult.
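The word-counting task that trips up the models is, of course, a one-liner in ordinary code, which underlines how odd the failure mode is:

```python
# Counting words is trivial concrete computation: split on
# whitespace and take the length of the resulting list.
text = "Summarize this chunk and count its words."
print(len(text.split()))  # 7
```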
Rebuff5007 · 11 months ago
In fairness, Mira Murati said GPT-4 is only high-school level [1]. Maybe it takes PhD level to understand basic shapes?

[1] https://www.ccn.com/news/technology/openais-gpt-5-phd-level-intelligence-2026-cto-mira-murati/
londons_explore · 11 months ago
Could some of the "wrong" answers be the LLM attempting to give an explanation rather than the answer? E.g., instead of answering 'X', the LLM answers 'The letter is partially hidden by the oval, so I cannot be certain, but it appears to be the English letter X.'

The scoring criteria would rank this answer as 'T', which is wrong.
orbital-decay · 11 months ago
That's not anything like "myopia", though.

FWIW, I tried the line-intersection and circled-letter tests from the article with CogVLM (which is far from the current SotA) and it correctly passed both. I haven't tried Sonnet or 4o, but I suspect there might be something wrong with how the authors ran their tests. Don't get me wrong, but too many "the model can't do that" claims have ended with demonstrations of the model doing exactly that...
Log_out_ · 11 months ago
ChatGPT, write me an argument that humans are blind because optical illusions (https://en.m.wikipedia.org/wiki/Optical_illusion) exist.

Alexa, experience that tragic irony for me.

Siri... forget it.
randomtree · 11 months ago
I guess I know what&#x27;s coming to every captcha tomorrow.
rezaghanbari1 · 11 months ago
Some of these samples are shocking. How do these models answer chart-based questions when they can't even count the intersections between two lines?
nichohel · 11 months ago
Vision language models are blind because they lack the Cartesian theater, which you and I have. Or which you and I say we have.
aaroninsf · 11 months ago
The title of this page and its argument should be qualified with the specific generation of tools.

That's in the abstract, but it's bad not to be specific, because current public-facing models are WIWEB: the worst it will ever be.

And there are trillion-dollar prizes at stake, so improvement is happening as quickly as it possibly can.
Jack000 · 11 months ago
This is the visual equivalent of asking an LLM to count letters. The failure is related more to the tokenization scheme than to the underlying quality of the model.

I'm not certain about the specific models tested, but some VLMs embed the image modality into a single vector, making these tasks literally impossible to solve.
axblount · 11 months ago
Would you say they have *Blindsight*?
michaelhoney · 11 months ago
This says to me that there are huge opportunities for improvement in providing vision modules for LLMs. Human minds aren't made of just one kind of thing: we have all sorts of hacky modular capabilities. There's no reason to think a future AGI wouldn't, too.
tantalor · 11 months ago
Are the "random-baseline accuracy" numbers correct?

In the "Two circles" test, do the circles really have a 50% chance of overlapping? I think this comes from "Distances between circle perimeters: -0.15 to 0.5 times the diameter", but the paper doesn't say what distribution they use.
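If the perimeter gap is drawn uniformly over that quoted range (an assumption, since the distribution isn't stated), the overlap probability would indeed not be 50%:

```python
# Assumed: gap between perimeters drawn uniformly from
# [-0.15, 0.5] x diameter. Circles overlap iff the gap is negative.
lo, hi = -0.15, 0.5
p_overlap = (0 - lo) / (hi - lo)  # fraction of the range below zero
print(round(p_overlap, 3))  # 0.231
```

So under a uniform draw the chance-level accuracy for "do they overlap?" would be about 77% for always answering "no", not the 50% a balanced baseline suggests.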
viraptor · 11 months ago
I love some of the interpretations here. For example, "Fig. 10: Only Sonnet-3.5 can count the squares in a majority of the images," when that model simply returns "4" for every question and happens to be right.
vessenes · 11 months ago
A few comments below discuss how tokenizing images with something like CLIP de facto yields blurry image descriptions, and so these models are 'blind' by some definitions. Another source of blurring not much discussed is that the images are rescaled down, at different resolutions for different models. I wouldn't be surprised if Sonnet 3.5 feeds a higher-res base image into the model.

Either way, I would guess we'll need new model architectures for multimodal to get really good at some of this, and even then some of these tasks are adjacent to things we know LLMs are already bad at (numeric logic, for instance).

As context lengths get longer, devoting more tokens to the image tokenization should help a bit here as well. Anyway, I'd anticipate next year we'll see 80s and 90s on most of these scores with next-gen models.
iamleppert · 11 months ago
This could easily be fixed with training and fine-tuning. Simply generate 100,000 examples or so, train against ground truth for however long you want, and it's a solved problem.
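A hypothetical sketch of that suggestion for the two-circles task: sample geometry with a known ground-truth label, then render each sample as an image for fine-tuning. Every name and range here is invented for illustration (the gap interval mirrors the one quoted elsewhere in the thread); rendering and the actual fine-tune are omitted:

```python
import random

def sample_example(diameter=60.0):
    """Sample one labeled two-circle example on a notional canvas."""
    r = diameter / 2
    x1, y1 = random.uniform(100, 400), random.uniform(100, 400)
    gap = random.uniform(-0.15, 0.5) * diameter  # gap between perimeters
    x2, y2 = x1 + 2 * r + gap, y1                # second circle on the same row
    label = gap < 0                              # overlap iff the gap is negative
    return ((x1, y1), (x2, y2), r, label)

dataset = [sample_example() for _ in range(100_000)]
print(len(dataset))  # 100000
```

Whether fine-tuning on such synthetic grids generalizes beyond the exact task distribution is exactly what the replies below debate.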
make3 · 11 months ago
Hugged to death, from my perspective. Here is a backup: https://archive.ph/kOE3Q
kristianpaul · 11 months ago
We see through thoughts and memories. We see when we desire; vision just adds a world of thoughts and consciousness of being conscious.

Vision links thoughts with reality.
childintime · 11 months ago
Claude 3.5 does remarkably well on many tasks compared to the others, though, and on those it's not at all blind. It's getting there.
navaed01 · 11 months ago
Is there a good primer on how these vision LLMs work?
nyxtom · 11 months ago
I wonder how well AlphaGeometry would do on this.
jackblemming · 11 months ago
Ask it to draw any of those things and it can.
cpill · 11 months ago
I wonder how they would score if they used all four models and took a majority vote...?
nmca · 11 months ago
please use this opportunity to reflect on whether ARC measures reasoning skills :)
nothrowaways · 11 months ago
The next version will solve all of it.
mkoubaa · 11 months ago
They interact with pixel buffers as a mathematical array. To call them blind is to confuse what they are doing with the experience of sight...
评论 #40932409 未加载
verbalstoner · 11 months ago
It's virtually impossible to take a paper seriously when the title has an emoji.
spullara · 11 months ago
In other news, vision models are bad at things they aren't trained to do.
hi_dang_ · 11 months ago
I was hoping that someone in the comments talking the paper down would have published a paper, or have relevant publications of their own to point to. You know, meet the lads halfway, sort of thing.

So what I'm left with to judge instead is anonymous online commenters vs. a publication from two prestigious universities. Whose word do I take on this? Decisions, decisions.

You can swap LM out with Web3, with NFT, with Crypto in this case.