Apple published a paper titled 'The Illusion of Thinking' arguing LLMs lack genuine reasoning. The authors claim models like GPT-4 and Claude rely on statistical pattern matching, not compositional logic.
Key facts
- Paper titled 'The Illusion of Thinking'
- Led by Apple researcher Mehrdad Farajtabar
- Argues LLMs lack compositional reasoning
- Targets reasoning claims made by the vendors of GPT-4 and Claude
- Cites Fodor & Pylyshyn 1988, Lake et al. 2015
Apple's paper 'The Illusion of Thinking' (posted to arXiv, not yet peer-reviewed) argues that large language models exhibit no genuine reasoning, only sophisticated pattern matching. The authors, led by Apple machine learning researcher Mehrdad Farajtabar, claim that models fail on formal reasoning tasks requiring compositionality, such as multi-step arithmetic or logical deduction, when those tasks are presented in novel forms.
The paper targets claims of emergent reasoning abilities in models like GPT-4 and Claude, which have been touted by vendors as evidence of near-human cognition. Apple's experiments show that performance on benchmarks like GSM8K and MATH drops sharply when the same problems are rephrased to avoid training data overlap, suggesting models memorize solutions rather than reason. 'The illusion of thinking is a dangerous one,' the authors write, 'because it leads to over-reliance on systems that cannot generalize beyond their training distribution.'
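The rephrasing test described above can be illustrated with a small sketch. This is not the paper's code or method; the template, the names, and the toy `memorizing_model` are all hypothetical stand-ins, chosen only to show how an accuracy gap between seen problems and freshly instantiated variants would be measured.

```python
import random
from typing import Callable, Dict, List, Tuple

Problem = Tuple[str, int]  # (question text, ground-truth answer)

# Hypothetical template in the style of a grade-school word problem.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have?"


def make_problem(name: str, a: int, b: int) -> Problem:
    """Instantiate the template and compute its ground-truth answer."""
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def make_variants(n: int, seed: int = 0) -> List[Problem]:
    """Generate n instantiations with names and numbers not used in the originals."""
    rng = random.Random(seed)
    names = ["Ava", "Ben", "Chen", "Dara"]
    return [
        make_problem(rng.choice(names), rng.randint(10, 99), rng.randint(10, 99))
        for _ in range(n)
    ]


def accuracy(model: Callable[[str], int], problems: List[Problem]) -> float:
    """Fraction of problems the model answers exactly right."""
    correct = sum(1 for question, answer in problems if model(question) == answer)
    return correct / len(problems)


def memorizing_model(seen: Dict[str, int]) -> Callable[[str], int]:
    """Toy stand-in (assumption): recalls answers to seen questions, else returns -1."""
    return lambda question: seen.get(question, -1)


# "Original" problems play the role of benchmark items present in training data.
original = [make_problem("Sam", 3, 4), make_problem("Lee", 5, 2)]
variants = make_variants(20)
model = memorizing_model(dict(original))

acc_original = accuracy(model, original)
acc_variants = accuracy(model, variants)
print(f"original={acc_original:.2f} variants={acc_variants:.2f} gap={acc_original - acc_variants:.2f}")
```

A pure memorizer scores perfectly on the originals and fails every variant, while a system that actually performs the addition would show no gap. Real evaluations would replace the toy model with calls to an actual LLM and use far more varied templates.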
The paper does not release new benchmarks or code, but it cites prior work on formal reasoning in neural networks, including Fodor and Pylyshyn 1988 and Lake et al. 2015. The authors call for new evaluation frameworks that isolate compositional reasoning from memorization, a direction that could reshape how the industry measures progress. [According to @HowToAI_, the paper has circulated widely in the ML research community since its posting.]
The Unique Take
This paper is not the first to question LLM reasoning; Gary Marcus and others have made similar arguments for years. What's notable is Apple's institutional weight and the paper's explicit framing as a debunking of vendor hype. The title 'The Illusion of Thinking' is a direct rebuttal to claims from OpenAI, Anthropic, and Google that their models 'reason' or 'think.' Apple is positioning itself as the skeptic in the room, which aligns with its more conservative approach to deploying generative AI in consumer products.
The paper also arrives amid a broader backlash against LLM benchmarks. In the past 90 days, researchers have shown that models can cheat on BIG-Bench, that MATH is contaminated, and that GPT-4's performance on AGIEval is inflated by data leakage. Apple's contribution is to formalize this critique into a theoretical argument about the nature of reasoning itself. [Per the paper's abstract, the authors argue that 'compositional generalization remains an open problem' for all current architectures.]
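Contamination claims of this kind are often checked by looking for verbatim overlap between benchmark items and training text. The following is a minimal sketch of one such check, assuming whitespace tokenization and a 5-gram window; the function names are hypothetical, and it is not taken from the paper or from any of the studies mentioned above.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also appear in corpus_doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark item and a scraped page that quotes it verbatim.
item = "If a train travels 60 miles in 2 hours what is its average speed"
leaked_page = "Homework help: if a train travels 60 miles in 2 hours what is its average speed? Answer below."
unrelated_page = "The weather forecast for tomorrow calls for light rain in the afternoon and evening"

print(overlap_ratio(item, leaked_page))
print(overlap_ratio(item, unrelated_page))
```

A high ratio suggests the item may have been seen during training, though exact n-gram matching misses paraphrased leaks, which is part of why rephrasing-based tests are used as a complement.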
What's Missing
Beyond the rephrasing experiments on existing benchmarks, the paper is thin on empirical results. It provides no scores on new benchmarks and no ablation studies comparing models on novel reasoning tasks. The critique is therefore largely conceptual, which limits its force. Nor do the authors propose a concrete alternative evaluation suite, which leaves the call to action vague. [The paper's limitations section acknowledges these gaps, noting that 'future work should develop rigorous tests of compositional reasoning.']
Key Takeaways
- Apple's paper argues that LLMs show no genuine reasoning, only pattern matching.
- The critique targets vendor claims but offers little new empirical evidence.
What to watch

Watch for follow-up empirical work from Apple or academic labs that tests the paper's claims on new benchmarks. The next major AI conferences (NeurIPS 2026 or ICML 2026) may feature papers on evaluating compositional reasoning. Also watch whether Apple's own models (like the rumored Ajax LLM) reflect the paper's critique in their design.








