Apple study exposes deep cracks in LLMs’ “reasoning” capabilities

Irrelevant red herrings lead to "catastrophic" failure of logical inference.

What is going on inside that anthropomorphized digital brain? Credit: Getty Images

For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

The fragility highlighted in these new results helps support previous research suggesting that LLMs' use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. "Current LLMs are not capable of genuine logical reasoning," the researchers hypothesize based on these results. "Instead, they attempt to replicate the reasoning steps observed in their training data."

Mix it up

In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"—currently available as a pre-print paper—the six Apple researchers start with GSM8K's standardized set of over 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.

This approach helps avoid any potential "data contamination" that can result from the static GSM8K questions being fed directly into an AI model's training data. At the same time, these incidental changes don't alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K.
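
To make that substitution concrete, here is a minimal Python sketch of the idea; the template text, names, and number ranges are invented for illustration and this is not the researchers' actual GSM-Symbolic pipeline:

    import random

    # Illustrative sketch only: a GSM8K-style question becomes a template,
    # then gets instantiated with fresh names and numbers while the
    # underlying arithmetic (and answer formula) stays the same.
    TEMPLATE = (
        "{name} gets {blocks} building blocks for a {relative} "
        "and gives away {given} of them. How many blocks are left?"
    )

    NAMES = ["Sophie", "Bill", "Aisha", "Mateo"]
    RELATIVES = ["nephew", "brother", "cousin", "niece"]

    def instantiate(seed: int) -> tuple[str, int]:
        """Return one concrete question plus its ground-truth answer."""
        rng = random.Random(seed)
        blocks = rng.randint(10, 50)
        given = rng.randint(1, blocks - 1)
        question = TEMPLATE.format(
            name=rng.choice(NAMES),
            relative=rng.choice(RELATIVES),
            blocks=blocks,
            given=given,
        )
        return question, blocks - given  # same answer formula for every variant

    for seed in range(3):
        question, answer = instantiate(seed)
        print(question, "->", answer)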

Simply changing specific names and numbers found in GSM8K tests led to significant decreases in performance in many models. Credit: Apple Research

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.

This kind of variance—both within different GSM-Symbolic runs and compared to GSM8K results—is more than a little surprising since, as the researchers point out, "the overall reasoning steps needed to solve a question remain the same." The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but are instead "attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data."

Don’t get distracted

Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI's ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That's a pretty high success rate using either benchmark, regardless of whether or not the model itself is using "formal" reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

An example showing how some models get misled by irrelevant information added to the GSM8K benchmark suite. Credit: Apple Research

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding "seemingly relevant but ultimately inconsequential statements" to the questions. For this "GSM-NoOp" benchmark set (short for "no operation"), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that "five of them [the kiwis] were a bit smaller than average."
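
As a rough illustration of that construction (with made-up question text and numbers, not the paper's actual GSM-NoOp items), the sketch below appends an inconsequential clause to a word problem without touching the arithmetic:

    # Illustrative sketch only -- invented question text, not the GSM-NoOp dataset.
    # The added clause mentions a quantity but has no bearing on the answer.
    BASE_QUESTION = (
        "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "On Sunday he picks double the number he picked on Friday. "
        "How many kiwis does Oliver have?"
    )

    NO_OP_CLAUSE = "Five of the kiwis picked on Sunday were a bit smaller than average"

    def add_no_op(question: str, clause: str) -> str:
        """Slot an irrelevant statement in just before the final question sentence."""
        sentences = question.split(". ")
        return ". ".join(sentences[:-1] + [clause, sentences[-1]])

    print(add_no_op(BASE_QUESTION, NO_OP_CLAUSE))
    print("Correct answer either way:", 44 + 58 + 2 * 44)  # 190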

Adding in these red herrings led to what the researchers termed "catastrophic performance drops" in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple "pattern matching" to "convert statements to operations without truly understanding their meaning," the researchers write.

Introducing irrelevant information to the prompts often led to "catastrophic" failure for most "reasoning" LLMs. Credit: Apple Research

In the example with the smaller kiwis, for instance, most models try to subtract the smaller fruits from the final total because, the researchers surmise, "their training datasets included similar examples that required conversion to subtraction operations." This is the kind of "critical flaw" that the researchers say "suggests deeper issues in [the models'] reasoning processes" that can't be helped with fine-tuning or other refinements.
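
Sticking with the hypothetical kiwi tally from the sketch above, the gap between the correct computation and the pattern-matched one the researchers describe is easy to spell out:

    # Same invented numbers as the sketch above, for illustration only.
    correct = 44 + 58 + 2 * 44       # 190 -- the size remark changes nothing
    pattern_matched = correct - 5    # 185 -- "smaller" wrongly mapped to a subtraction step
    print(correct, pattern_matched)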

The illusion of understanding

The results of this new GSM-Symbolic paper aren't completely new in the world of AI research. Other recent papers have similarly suggested that LLMs don't actually perform formal reasoning and instead mimic it with probabilistic pattern-matching of the closest similar data seen in their vast training sets.

Still, the new research highlights just how fragile this kind of mimicry can be when the prompt in question pushes it in a direction that doesn't precisely match any training data. It also highlights the inherent limitations in trying to perform high-level reasoning without any underlying model of the logic or world behind it. As Ars' Benj Edwards put it in a July story about AI video generation:

One of the reasons OpenAI's GPT-4 turned heads in text synthesis is that the model finally reached a size where it was large enough to have absorbed enough information (in training data) to give the impression that it might be able to genuinely understand and model the world when, in reality, a key aspect of its success is that it "knows" far more than most humans and can impress us by combining those existing concepts in novel ways. With enough training data and computation, the AI industry will likely reach what you might call "the illusion of understanding" with AI video synthesis eventually...

We're likely seeing a similar "illusion of understanding" with AI's latest "reasoning" models, and seeing how that illusion can break when the model runs into unexpected situations.

AI expert Gary Marcus, in his excellent analysis of the new GSM-Symbolic paper, argues that the next big leap in AI capability will only come when these neural networks can integrate true "symbol manipulation, in which some knowledge is represented truly abstractly in terms of variables and operations over those variables, much as we see in algebra and traditional computer programming..." Until then, we're going to get the kind of brittle "reasoning" that can lead AI models to fail mathematical tests in ways that calculators never do.

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.

Source: arstechnica.com
