Can you do better than top-level AI models on these basic vision tests?

A bit myopic —

Abstract analysis that is trivial for humans often stymies GPT-4o, Gemini, and Sonnet.

Whatever you do, don't ask the AI how many horizontal lines are in this image.

Getty Images

In the last couple of years, we've seen amazing advancements in AI systems when it comes to recognizing and analyzing the contents of complicated images. But a new paper highlights how many state-of-the-art "vision language models" (VLMs) fail at simple, low-level visual analysis tasks that are trivially easy for a human.

In the provocatively titled pre-print paper "Vision language models are blind" (which has a PDF version that includes a dark sunglasses emoji in the title), researchers from Auburn University and the University of Alberta create eight simple visual acuity tests with objectively correct answers. These range from identifying how often two colored lines intersect to identifying which letter in a long word has been circled to counting how many nested shapes exist in an image (representative examples and results can be viewed on the research team's webpage).

  • If you can solve these kinds of puzzles, you may have better visual reasoning than state-of-the-art AIs.

  • The puzzles on the right are like something out of Highlights magazine.

  • A representative sample shows AI models failing at a task that most human children would find trivial.

Crucially, these tests are generated by custom code and don't rely on pre-existing images or tests that could be found on the public Internet, thereby "minimiz[ing] the chance that VLMs can solve by memorization," according to the researchers. The tests also "require minimal to zero world knowledge" beyond basic 2D shapes, making it difficult for the answer to be inferred from "textual question and choices alone" (which has been identified as an issue for some other visual AI benchmarks).
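To make the idea of code-generated tests concrete, here is a minimal Python sketch of how a line-intersection puzzle with an objectively correct answer could be produced. This is not the researchers' code: the function names, the three shared x-coordinates, and the 0 to 10 height range are all assumptions chosen for illustration, and the sketch only computes the geometry and the ground-truth answer without rendering an image.

```python
import random


def make_polyline(rng, xs, y_max=10):
    """Pick a random integer height at each shared x-coordinate (hypothetical helper)."""
    return [(x, rng.randint(0, y_max)) for x in xs]


def count_crossings(line_a, line_b):
    """Count crossings of two polylines that share the same x-coordinates.

    On each segment the vertical gap between the two lines is linear, so the
    lines cross there exactly when the gap changes sign between the endpoints.
    """
    gaps = [ya - yb for (_, ya), (_, yb) in zip(line_a, line_b)]
    if any(g == 0 for g in gaps):
        # Lines meeting exactly at a vertex would make the answer ambiguous.
        raise ValueError("lines touch at a vertex")
    return sum(1 for g0, g1 in zip(gaps, gaps[1:]) if g0 * g1 < 0)


def generate_puzzle(seed, xs=(0, 1, 2)):
    """Return two random polylines plus the ground-truth crossing count."""
    rng = random.Random(seed)
    while True:
        red = make_polyline(rng, xs)
        blue = make_polyline(rng, xs)
        try:
            answer = count_crossings(red, blue)
        except ValueError:
            continue  # resample until the puzzle is unambiguous
        return {"red": red, "blue": blue, "answer": answer}
```

Because every puzzle comes from a seeded random generator rather than an existing image, the correct answer is known by construction, and a fresh seed yields a picture that cannot have appeared in any training set.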

Are you smarter than a fifth grader?

After running multiple tests across four different visual models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5), the researchers found all four fell well short of the 100 percent accuracy you might expect for such simple visual analysis tasks (and which most sighted humans would have little trouble achieving). But the size of the AI underperformance varied greatly depending on the specific task. When asked to count the number of rows and columns in a blank grid, for instance, the best-performing model gave an accurate answer less than 60 percent of the time. On the other hand, Gemini-1.5 Pro hit nearly 93 percent accuracy in identifying circled letters, approaching human-level performance.

  • For some reason, the models tend to incorrectly guess the "o" is circled a lot more often than all the other letters in this test.

  • The models performed perfectly in counting five interlocking circles, a pattern they might be familiar with from common images of the Olympic rings.

  • Do you have an easier time counting columns than rows in a grid? If so, you probably aren't an AI.

Even small changes to the tasks could lead to huge changes in results. While all four tested models were able to correctly identify five overlapping hollow circles, the accuracy across all models dropped to well below 50 percent when six to nine circles were involved. The researchers hypothesize that this "suggests that VLMs are biased towards the well-known Olympic logo, which has 5 circles." In other cases, models occasionally hallucinated nonsensical answers, such as guessing "9," "n," or "©" as the circled letter in the word "Subdermatoglyphic."

Overall, the results highlight how AI models that can perform well at high-level visual reasoning have some significant "blind spots" (sorry) when it comes to low-level abstract images. It's all somewhat reminiscent of similar capability gaps that we often see in state-of-the-art large language models, which can create extremely cogent summaries of lengthy texts while at the same time failing extremely basic math and spelling questions.

These gaps in VLM capabilities could come down to the inability of these systems to generalize beyond the kinds of content they are explicitly trained on. Yet when the researchers tried fine-tuning a model using specific images drawn from one of their tasks (the "are two circles touching?" test), that model showed only modest improvement, from 17 percent accuracy up to around 37 percent. "The loss values for all these experiments were very close to zero, indicating that the model overfits the training set but fails to generalize," the researchers write.
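For a sense of how little computation the "are two circles touching?" question actually requires, here is a short Python sketch of the ground-truth check. It is an illustration under stated assumptions, not the paper's labeling code: the function name, the `(x, y, radius)` tuple format, and the three label strings are hypothetical, and integer coordinates are assumed so the comparison is exact.

```python
def circles_relation(c1, c2):
    """Classify two circles given as (x, y, radius) tuples with integer values.

    Compares the squared distance between centers with the squared sum of
    the radii, which avoids square roots and floating-point rounding.
    Note: a circle lying entirely inside another is reported as
    "overlapping" here, since this sketch treats the circles as filled discs.
    """
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    dist_sq = (x1 - x2) ** 2 + (y1 - y2) ** 2
    reach_sq = (r1 + r2) ** 2
    if dist_sq > reach_sq:
        return "separate"
    if dist_sq == reach_sq:
        return "touching"
    return "overlapping"
```

A call such as `circles_relation((0, 0, 1), (2, 0, 1))` returns "touching" because the centers sit exactly two radii apart. The whole decision is a single comparison, which is part of what makes the models' low accuracy on the task so striking.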

The researchers propose that the VLM capability gap may be related to the so-called "late fusion" of vision encoders onto pre-trained large language models. An "early fusion" training approach that integrates visual encoding alongside language training could lead to better results on these low-level tasks, the researchers suggest (without providing any sort of analysis of this question).

