Apple published a paper in June 2025 that called out the entire AI industry.
And the industry has not recovered since.
The paper is called “The Illusion of Thinking.” Six Apple researchers. Months of controlled experiments. One conclusion that landed like a grenade.
Frontier reasoning models face a complete accuracy collapse beyond certain complexities.
Complete. Not partial. Not gradual. Complete.
Here is what that actually means.
For two years, every major AI lab has been racing to build reasoning models. OpenAI’s o1 and o3. Anthropic’s Claude 3.7 Sonnet Thinking. DeepSeek R1. Google’s Gemini Thinking. These models do not just answer questions; they visibly think first. They show their work. They reason step by step through a problem before arriving at an answer. The entire industry marketed this as the next evolution of intelligence.
Apple tested whether it was real.
They did not use math benchmarks or coding tests, the standard evaluations every AI company optimizes against during training. They built clean, controllable puzzle environments. Tower of Hanoi. River Crossing. Checker Jumping. Blocks World. Problems with precise, verifiable correct answers and minimal risk of data contamination.
Then they systematically turned up the complexity. And watched what happened.
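To make that concrete, here is a minimal sketch, in the spirit of Apple’s setup but not their actual harness, of what a controllable puzzle environment looks like. Difficulty is a single dial (the disk count), the optimal solution length is known in closed form, and any proposed answer can be graded mechanically:

```python
# A minimal sketch in the spirit of Apple's setup (hypothetical code,
# not their actual harness). Difficulty is one dial: the disk count n.
# The optimal solution is always 2**n - 1 moves, and any proposed move
# list can be verified mechanically, so grading a model's answer needs
# no human judgment and no memorizable answer key.

def optimal_moves(n: int) -> int:
    """Minimum number of moves for Tower of Hanoi with n disks."""
    return 2 ** n - 1

def solve(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Reference recursive solver, used to sanity-check the verifier."""
    if n == 0:
        return []
    return solve(n - 1, src, dst, aux) + [(src, dst)] + solve(n - 1, aux, src, dst)

def verify(n: int, moves: list[tuple[int, int]]) -> bool:
    """Replay a proposed move list and check that it legally solves the puzzle."""
    pegs = {0: list(range(n, 0, -1)), 1: [], 2: []}  # peg 0 holds disks n..1
    for src, dst in moves:
        if not pegs[src]:
            return False                      # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # illegal: larger disk onto smaller
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))   # solved: every disk on the goal peg

assert verify(8, solve(8)) and len(solve(8)) == optimal_moves(8) == 255
```

The closed form is what makes the dial so clean: three disks is a 7-move problem, eight is 255, ten is 1,023, and the rules never change along the way.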
On simple, low-complexity problems, standard LLMs were both more accurate and more efficient: the reasoning models were beaten by regular models that do not think at all. As complexity increased moderately, reasoning models gained an advantage. But when problems reached high complexity, both model types experienced complete performance collapse.
The thinking models, the ones that cost more, take longer, and are marketed as more intelligent, lost to basic models on easy tasks. Then both collapsed completely on hard ones.
But the finding that truly alarmed researchers was not the collapse itself.
It was what happened just before it.
Near the collapse point, reasoning models began reducing their reasoning effort, measured in thinking tokens, as problem complexity increased, despite operating well below their generation-length limits.
The models were thinking less on the hardest problems. Not more. Producing shorter reasoning traces. On the tasks that demanded the most intelligence, the AI was quietly giving up. No error message. No warning. Just shorter thoughts and wrong answers delivered with full confidence.
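The signature itself is easy to picture. A hedged sketch with made-up numbers, not Apple’s data: bucket problems by complexity, average the thinking tokens spent per bucket, and watch for the curve to turn down before accuracy does.

```python
# Illustrative sketch of the effort-collapse signature (invented numbers,
# not Apple's data). The telltale pattern: thinking-token spend rises
# with difficulty, then falls just before the accuracy cliff.

from statistics import mean

# (complexity level, thinking tokens spent) for a batch of puzzle instances
records = [(3, 900), (3, 1100), (5, 4200), (5, 3800),
           (7, 9500), (7, 10100), (9, 6200), (9, 5400)]

def effort_curve(records: list[tuple[int, int]]) -> dict[int, float]:
    """Average thinking tokens spent at each complexity level."""
    levels = sorted({c for c, _ in records})
    return {c: mean(t for cc, t in records if cc == c) for c in levels}

curve = effort_curve(records)
peak = max(curve, key=curve.get)               # complexity where effort peaks
giving_up = [c for c in curve if c > peak]     # harder levels getting *less* effort
print(curve)      # {3: 1000, 5: 4000, 7: 9800, 9: 5800}
print(giving_up)  # [9]: the model thinks less exactly where it must think more
```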
Then there was the overthinking problem on the other end.
In simpler problems, reasoning models often identified correct solutions early but inefficiently continued exploring incorrect alternatives — an overthinking phenomenon. Beyond a certain complexity threshold, models completely failed to find correct solutions and fixated on early incorrect attempts, wasting the remaining inference token budget.
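This half of the picture is measurable too. Another hedged sketch (hypothetical, not Apple’s method): find where the correct answer first appears in a reasoning trace, then count how much of the trace comes after it.

```python
# Toy sketch of an overthinking measurement (hypothetical, not Apple's
# method): once the correct answer first appears in the trace, every
# token after it was spent re-exploring alternatives that were not needed.

def wasted_fraction(trace: list[str], answer: list[str]) -> float | None:
    """Fraction of the trace generated after the correct answer first appears."""
    n, m = len(trace), len(answer)
    for i in range(n - m + 1):
        if trace[i:i + m] == answer:        # first occurrence of the answer
            return (n - (i + m)) / n        # share of tokens spent afterwards
    return None                             # answer never appeared at all

trace = "move disk 1 to C ... answer: 7 moves ... but wait, try B first ...".split()
print(wasted_fraction(trace, "answer: 7 moves".split()))  # 0.4375: 7 of 16 tokens wasted
```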
Too much thinking on easy problems. Too little on hard ones. Complete collapse exactly where it matters most.
Apple researchers made the case that the AI industry is grossly overstating the ability of its top models, including OpenAI’s o3, Anthropic’s Claude 3.7, and Google’s Gemini.
The paper went live on a Saturday morning in June 2025. By that afternoon, The Guardian and The Wall Street Journal were covering it. By Monday, the AI community was in open conflict.
Defenders fired back immediately. A researcher published a rebuttal within days, arguing that Apple’s findings primarily reflect experimental design limitations rather than fundamental reasoning failures, and that some River Crossing benchmarks included mathematically impossible instances that no model could have solved.
Then a third group of researchers from Spain’s National Research Council ran the experiments again with refined methods.
They found that the previously reported failures on Tower of Hanoi were not purely a result of output constraints: reasoning models still stumble when complexity rises even moderately, to around eight disks.
Eight disks. A 255-move optimal solution. A puzzle designed for children. And the models break down.
Now here is why this paper is more relevant in May 2026 than it was the day it was published.
Since June 2025, every major AI lab has released a new generation of reasoning models. OpenAI shipped GPT-5.4 with extended thinking. Anthropic released Claude Opus 4.6 with enhanced reasoning traces. Google released Gemini 3 with Deep Think mode. DeepSeek released R2. xAI released Grok 4.
Every single one of them is marketed as having solved the reasoning problem Apple identified.
None of them have published controlled results on the specific complexity benchmarks Apple used. None of them have addressed the accuracy collapse curve directly. None of them have shown that the cliff Apple found no longer exists in their newer models.
They have simply released new models, claimed better reasoning, and moved on.
Which means the question Apple asked in June 2025 (do these models actually reason, or are they producing the illusion of reasoning?) has never been formally answered by the companies whose products depend on the answer being yes.
Apple’s findings show that chain-of-thought reasoning only improves accuracy up to a point. Beyond that point, models collapse, even when context length and planning room are not the constraint. This breaks the assumption that more thinking at inference time reliably buys better answers.
That assumption has not been retired. It has been doubled down on. The entire 2026 reasoning-model race (GPT-5.4, Gemini 3, Claude Opus 4.6) is built on the premise that more thinking means better answers at scale.
Apple’s paper says that premise has a cliff.
And the newest, most powerful models in the world have not shown that they found it, let alone that they cleared it.
Every enterprise deploying AI reasoning models today for legal analysis, medical diagnosis, financial modeling, or engineering decisions is operating on an assumption that a June 2025 paper from Apple formally challenged and nobody has formally refuted.
The debate is not settled. The cliff is still there.
The models just got more expensive to fall off of.
Source: Shojaee, Mirzadeh et al. · Apple · “The Illusion of Thinking” · June 2025 · http://arxiv.org/abs/2506.06941