
Generative AI models with “reasoning” may not actually excel at solving certain types of problems when compared with conventional LLMs, according to a paper from researchers at Apple.
Even the creators of generative AI don’t know exactly how it works. Sometimes they speak of that mystery as an accomplishment in its own right, proof that they are researching something beyond human understanding. The Apple team attempted to clear up some of that mystery by delving into the “internal reasoning traces” that underpin how LLMs operate.
Specifically, the researchers focused on reasoning models, such as OpenAI o3 and Anthropic’s Claude 3.7 Sonnet Thinking, which generate a chain of thought and an explanation of their own reasoning before producing an answer.
Their findings show that these models struggle with increasingly complex problems: at a certain point, their accuracy breaks down completely, and they often underperform standard models.
Standard models outperform reasoning models in some tests
According to the research paper, standard models outperform reasoning models on low-complexity tasks, but reasoning models perform better at medium-complexity tasks. Neither type of model could perform the most complex tasks the researchers set.
Those tasks were puzzles, chosen instead of benchmarks because the team wanted to avoid contamination from training data and create controlled test conditions, the researchers wrote.
Apple tested the models on puzzles such as the Tower of Hanoi, which involves moving a stack of progressively smaller disks between three pegs without ever placing a larger disk on a smaller one. Reasoning models were actually less accurate than standard large language models at solving simpler versions of the puzzle.
Reasoning models performed slightly better than conventional LLMs on moderately difficult versions of the puzzle. On the hardest versions (eight disks or more), reasoning models couldn’t solve the puzzle at all, even when an algorithm for doing so was provided to them. Reasoning models would “overthink” the simpler versions and could not extrapolate far enough to solve the harder ones.
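For context, the Tower of Hanoi has a well-known recursive solution whose length grows exponentially with the number of disks, which is part of why the harder settings demand long, error-free move sequences. Below is a minimal Python sketch of that textbook algorithm; the peg labels, function name, and output format are illustrative assumptions, not the exact algorithm text the Apple researchers supplied to the models.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of moves that transfers n disks from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # move the top n-1 disks out of the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top of it
    return moves

# The optimal solution always takes 2**n - 1 moves, so eight disks already
# require 255 correct moves in sequence -- the regime where the paper
# reports reasoning models' accuracy collapsing.
print(len(hanoi(8)))  # 255
```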
Specifically, they tested Anthropic’s Claude 3.7 Sonnet with and without reasoning, as well as DeepSeek R1 against DeepSeek V3, to compare models with the same underlying architecture.
Reasoning models can ‘overthink’
This inability to solve certain puzzles suggests an inefficiency in the way reasoning models operate.
“At low complexity, non-thinking models are more accurate and token-efficient. As complexity increases, reasoning models outperform but require more tokens — until both collapse beyond a critical threshold, with shorter traces,” the researchers wrote.
Reasoning models may “overthink,” spending tokens on exploring incorrect ideas even after they’ve already found the correct solution.
“LRMs possess limited self-correction capabilities that, while valuable, reveal fundamental inefficiencies and clear scaling limitations,” the researchers wrote.
The researchers also observed that performance on tasks like the River Crossing puzzle may have been hampered by a lack of similar examples in the models’ training data, limiting their ability to generalize or reason through novel variations.
Is generative AI development reaching a plateau?
In 2024, Apple researchers published a similar paper on the limitations of large language models for mathematics, suggesting that AI math benchmarks were insufficient.
Throughout the industry, there are suggestions that advancements in generative AI may have reached their limits. Future releases may be more about incremental updates than major leaps. For instance, OpenAI’s GPT-5 will combine existing models in a more accessible UI, but may not be a major upgrade, depending on your use case.
Apple, which is holding its Worldwide Developers Conference this week, has been relatively slow to add generative AI features to its products.