I think that while the findings are interesting, the title "Vision language models are blind" is hyperbolic and a misleading description of them. It is true that VLMs sense and work with image inputs in a very different way: each image is downscaled to a fairly low resolution, split up into patches, and then mapped into discrete tokens, and they genuinely don't have access to fine-grained details because this mapping is quite lossy (there's a rough sketch of what that patching looks like at the end of this comment). So in that sense these results make complete sense and weren't surprising to me, but "blind" is a strong implication and the work isn't really able to back it up.<p>Take the first example with the four 2D line plots, for instance. I asked Sonnet 3.5 for an answer and retried it 5 times; 2 out of 5 times, it said something like this:<p><i>Me</i>:<p>here's an image with four 2-D line plots. Can you tell me A) how many lines are in each plot? and B) how many times do they intersect each other in each plot?<p>Write your answers as:
Plot 1 (leftmost): ...<p><i>Sonnet</i>:<p>I'll analyze each plot from left to right and provide the requested information:<p>Plot 1 (leftmost):
A) 2 lines (one red, one blue)
B) 0 intersections (the lines do not cross)<p>Plot 2:
A) 2 lines (one red, one blue)
B) 1 intersection (the lines cross once near the right edge)<p>Plot 3:
A) 2 lines (one red, one blue)
B) 1 intersection (the lines cross once near the center)<p>Plot 4 (rightmost):
A) 2 lines (one red, one blue)
B) 2 intersections (the lines cross twice, once near the left side and once near the right side)<p>Each plot consistently shows two lines, one red and one blue, with varying intersection patterns across the four plots.<p>---<p>Which isn't completely correct (the 3rd plot actually has 2 intersections), but it's still pretty good, and the other 3 times it said plot 3 has 1 intersection.
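For what it's worth, here's a rough sketch of the patching I mean, assuming a ViT-style preprocessor with a 224x224 input and 16x16 patches (real VLM pipelines differ in resolution, tiling, and in whether patches become discrete tokens or continuous embeddings; the filename is just a placeholder):

    # Downscale to the model's fixed input resolution, then cut into patches.
    import numpy as np
    from PIL import Image

    def patchify(path, resolution=224, patch=16):
        img = Image.open(path).convert("RGB").resize((resolution, resolution))
        arr = np.asarray(img)                             # (224, 224, 3)
        n = resolution // patch                           # 14 patches per side
        patches = arr.reshape(n, patch, n, patch, 3).swapaxes(1, 2)
        return patches.reshape(n * n, patch * patch * 3)  # 196 patch "tokens"

    tokens = patchify("line_plots.png")
    print(tokens.shape)  # (196, 768): the whole chart is squeezed into 196 vectors,
                         # so a thin line crossing can vanish inside a single patch.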