What Happened
In a discussion highlighted by AI researcher Rohan Paul, renowned mathematician Terence Tao provided a clear-eyed assessment of the mathematical foundations of today's large language models (LLMs). His core argument is that the mathematical machinery required to train and run these models is not particularly advanced—it's primarily linear algebra, matrix multiplication, and basic calculus, material well within the grasp of an undergraduate mathematics or engineering student.
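Tao's point about the simplicity of the machinery can be made concrete with a toy example: a single attention head, the core operation of a transformer, reduces to matrix multiplication plus a softmax. Below is a minimal sketch in pure Python; the tiny 2×2 matrices and the numbers in them are illustrative only, not taken from any real model.

```python
import math

def matmul(A, B):
    """Plain matrix multiplication: the workhorse of every transformer layer."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Exponentiate and normalize, turning a row of scores into a distribution."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    KT = [list(col) for col in zip(*K)]                 # transpose of K
    scores = matmul(Q, KT)                              # query-key similarities
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]          # attention weights per query
    return matmul(weights, V)                           # weighted mix of the values

# Two 2-dimensional token embeddings; the values are purely illustrative.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Nothing here goes beyond first-year linear algebra and the exponential function, which is precisely Tao's observation: the mystery is not in these operations but in what stacks of them end up doing.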
According to Tao, we understand the how: the architectural blueprints and optimization algorithms. The profound mystery lies in the why. We lack a predictive theory for why these models, built from simple components, exhibit such unpredictable and emergent capabilities—excelling brilliantly on some tasks while failing inexplicably on others.
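The "how" on the optimization side is equally elementary: training is gradient descent, which requires nothing beyond the first derivative of a loss function. A minimal sketch follows; the quadratic loss, starting point, and learning rate are illustrative choices, not from any real training setup.

```python
def loss(w):
    """Toy quadratic loss with a known minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative from basic calculus: d/dw (w - 3)^2 = 2(w - 3)."""
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter value (illustrative)
lr = 0.1   # learning rate (illustrative)
for _ in range(100):
    w -= lr * grad(w)  # the entire update rule behind "training"
```

Real models apply this same step to billions of parameters at once (with the derivative computed by backpropagation, i.e., the chain rule), but the calculus involved is no deeper than in this one-dimensional case.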
Context: The Theory-Practice Gap in AI
Tao's comments pinpoint a central tension in modern machine learning. The field has achieved staggering empirical success through scaling—more data, more parameters, more compute—but its theoretical underpinnings lag far behind. Engineers can build increasingly powerful models, but researchers cannot reliably forecast a model's performance on a novel task before testing it.
He frames the problem in terms of how structured the data is. At the extremes, our theories are strong:
- Pure noise is well-understood.
- Perfectly structured data (like formal logic) is well-understood.
The problem is natural language and real-world data, which inhabit a messy middle ground—"partly structured and partly random." Tao draws a parallel to physics, which has robust theories for the quantum scale (atoms) and the continuum scale (classical mechanics) but struggles with the "meso-scale" in between. Similarly, mathematics lacks a mature theory for this semi-structured regime where LLMs operate.
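One way to make the "messy middle" concrete is Shannon entropy, the standard measure at the two well-understood extremes: perfectly regular data has low entropy, uniform noise has maximal entropy, and natural language sits in between. A toy character-level sketch (the sample strings are illustrative only):

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy in bits per character, from symbol frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

structured = "abababababababab"      # perfectly regular: 1 bit per character
noisy = "qzjxkvwpmyghrtls"           # 16 distinct characters, each once: 4 bits
english = "the cat sat on the mat"   # partly structured, partly random
```

Entropy cleanly characterizes the endpoints, but it says nothing about *which* intermediate-entropy sequences carry meaning, which is exactly the semi-structured regime Tao says mathematics lacks a mature theory for.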
This theoretical gap explains why progress remains largely empirical. Researchers observe "capability jumps" at certain scales but cannot derive them from first principles. We can describe the transformer's forward pass but cannot explain why it generalizes to tasks far beyond its training distribution.
The core puzzle, as Tao defines it, is this mismatch: simple, understandable machinery producing complex, hard-to-predict behavior.