Corpus.
We trained on text. Then on images. Then on video. Then on agent trajectories. Each time the corpus looked finite, we either found a new one or built a new one. The story of 2024–2026 is the end of the free web, the licensing land-grab, the synthetic-data pivot, and the architectures starting to ingest what the universe's instruments broadcast.
This essay walks through where the data actually comes from — the $1.5B Anthropic settlement, the 196,640-book pirate library, the 1M YouTube hours Whisper transcribed, the Kenyan labelers on $1.32/hr, the Phi-4 student that beat its GPT-4o teacher — and then asks the question that pulls all of those into one frame. What if the universe itself is the corpus? The honest answer is uncomfortable. It is also why the rest of this lab matters.
- 01 The free web ended around 2022. Since 2023, frontier labs have licensed corpora, pirated shadow libraries, and, in Anthropic's case, paid a $1.5B settlement when caught.
- 02 The human-labeler economy is now a multi-billion-dollar two-tier market: $1.32–$2/hr in Nairobi, $40–$100/hr for PhDs in San Francisco.
- 03 The stock of public human text runs out between 2026 and 2032. Synthetic data is no longer a workaround; it is the dominant regime. Phi-4 already beats GPT-4o on GPQA and MATH using mostly GPT-4o-generated tokens.
- 04 The bitter lesson held across seven orders of magnitude. Pretraining scaling is now plateauing. Test-time compute opened a new scaling axis: the o-series lifted ARC-AGI from GPT-4o's ~5% to o3's 87.5% in months.
- 05 The next corpus is everything-but-text. V-JEPA 2 ingested 1M hours of video. Tesla's fleet logs 28.8M miles a day. Open X-Embodiment pooled 22 robot embodiments. AlphaFold absorbed 60 years of structural biology in three.
- 06 What we cannot collect is more than what we can. Dark matter is 26.8% of the universe and we have directly detected none of it. Dark energy is 68.3%, and the best theoretical estimate of its density is off by 120 orders of magnitude.
- 07 The Bekenstein bound says a 1.5kg brain holds at most ~10⁴² bits. The observable universe holds ~10¹²⁴. The gap is roughly 82 orders of magnitude. There is no engineering path that closes it.
- 08 And yet. Phi-4 beat its teacher. o1 reportedly generated more reasoning tokens than humanity wrote on the indexed web. The question is no longer whether data + a dumb algorithm produces intelligence. The question is what the limit of that process looks like, and whether the limit is the universe itself.
Common Crawl was the commons. It is not anymore.
For roughly a decade — 2012 to 2022 — the foundation of every major language model was the same: a non-profit called Common Crawl downloaded the public web every month, made it available for free, and the field treated it as a commons. As of the August 2025 snapshot, Common Crawl held 2.42 billion pages / 419 TiB uncompressed. A 2024 Mozilla audit found 64% of 47 generative LLMs released 2019-2023 relied on Common Crawl. Nobody paid anyone.
Modern labs do not train on raw Common Crawl. They train on filtered derivatives: FineWeb (HuggingFace, Apr 2024, 15T tokens, 44 TB after deduplication across 96 snapshots), RefinedWeb (TII/Falcon, ~5T internal tokens), Dolma (AI2/OLMo, 3T tokens), RedPajama v2 (Together, 20T+ raw tokens with quality scores), The Stack v2 (BigCode, 3.3-4.3T tokens of code in 619 languages, permissive licenses only). The FineWeb paper makes an inadvertently devastating admission: on the 2013-48 Common Crawl dump, global MinHash deduplication cut 490B tokens down to 31B, a 94% drop, and the data that was removed turned out to be better than the data that was kept. Most of the web is redundant with itself.
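The MinHash step mentioned above can be illustrated with a minimal sketch. This is not the FineWeb pipeline: the shingle size, the 64 seeded SHA-1 hashes, the 0.8 threshold, and every function name below are assumptions chosen for readability, and the pairwise comparison is quadratic where production pipelines typically bucket signatures with locality-sensitive hashing.

```python
import hashlib
import re


def shingles(text, k=3):
    """Lower-cased k-word shingles; punctuation and case are ignored."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text, num_hashes=64, k=3):
    """For each seed, keep the smallest hash value over all shingles."""
    items = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Return indices of kept documents (first copy of each near-duplicate wins)."""
    kept, kept_sigs = [], []
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(i)
            kept_sigs.append(sig)
    return kept


# Hypothetical three-document corpus: the second entry differs from the
# first only in case and punctuation, so its shingle set is identical.
docs = [
    "Common Crawl downloads the public web every month and publishes it for free.",
    "common crawl downloads the public web every month, and publishes it for free!",
    "AlphaFold predicts protein structures from amino acid sequences.",
]
kept = deduplicate(docs)
print(kept)
```

Because the first copy always wins, the order of the input decides which duplicate survives. That is one reason a global pass over many snapshots can keep a different, and not necessarily better, subset than a per-snapshot pass.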
By late 2021, according to the NYT (6 April 2024), OpenAI had exhausted the high-quality text web and built Whisper specifically to transcribe more than one million hours of YouTube. Internal counsel flagged it as a legal grey area, justified as fair use. Some Google employees knew and stayed silent, because Google was doing the same thing. The era of paying nothing ended quietly, in private, before anyone wrote the press release.
When grey-area scraping stopped working, the labs paid. The deals are public — the prices anchor a market that did not exist before 2023.
The licensing deals are the visible economy. The shadow economy is at least as large. Anthropic, per discovery filings in Bartz v. Anthropic, downloaded more than 7 million books from LibGen, Books3, and the Pirate Library Mirror to build what internal communications called "a central library of all the books in the world." Meta admitted in Kadrey v. Meta that it used Books3, a 196,640-book corpus scraped from the Bibliotik tracker, to train Llama 1 and likely Llama 2. OpenAI's GPT-3 paper listed a 55-billion-token "Books2" corpus whose contents have never been publicly disclosed and which is widely suspected to derive from LibGen. The pattern is consistent across the frontier.
The emerging doctrine, after Bartz and Kadrey: training is fair use; piracy to obtain the corpus is not. Every frontier lab is now racing to launder its corpus through licensed sources before the next round of summary judgments lands.
Data is not free even when it is already collected.
Raw text does not produce a model that follows instructions. Between the pretraining corpus and the chatbot sits a global labour layer — humans rating outputs, writing preferred completions, red-teaming, classifying toxicity, judging factuality. This layer is invisible from the product surface and is now a multi-billion-dollar economy with a two-tier wage structure that mirrors every other globalised supply chain.
$1.32–$2/hr in Nairobi
Per Time magazine's January 2023 investigation (Billy Perrigo), OpenAI contracted Sama to label content for the toxicity classifier that eventually shipped with ChatGPT. Kenyan workers took home $1.32–$2/hr (KSh 21,000/month base plus a ~$70 explicit-content bonus) to label child sexual abuse material, torture, and bestiality. OpenAI paid Sama $12.50/hr per worker on the three contracts; the gap is the arbitrage. Sama cancelled the contract in February 2022.
$40–$100/hr for PhDs in San Francisco
As the capability frontier moved from "does this look like a chatbot" to "is this proof correct," the labour layer split. Surge AI and Scale AI now recruit MD, JD, and PhD-credentialed labelers at $40–$100+/hr, with top medical and legal annotators commanding 3–5× general rates. The 2024-2026 global RLHF-qualified-worker shortfall is estimated at ~30 million (Liu Renming, late 2024). The annotation market is projected at $4.89B in 2025 → $17.10B in 2030 at 28.4% CAGR.
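The market projection quoted above is internally consistent, which a short compound-growth check makes visible. This is only arithmetic on the figures already in the text; the helper names are hypothetical.

```python
def project(value, cagr, years):
    """Compound a starting value forward at a constant annual growth rate."""
    return value * (1.0 + cagr) ** years


def implied_cagr(start, end, years):
    """Constant annual growth rate that takes start to end in the given years."""
    return (end / start) ** (1.0 / years) - 1.0


# Figures from the text: $4.89B (2025), $17.10B (2030), 28.4% CAGR.
projected_2030 = project(4.89, 0.284, 5)
rate = implied_cagr(4.89, 17.10, 5)
print(f"projected 2030 market: ${projected_2030:.2f}B")
print(f"implied CAGR: {rate:.1%}")
```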
The structural shift in June 2025 was Meta's $14.3B investment for a 49% non-voting stake in Scale AI, valuing Scale at $29B. Founder Alexandr Wang became Meta's Chief AI Officer. The aftermath was immediate and brutal: OpenAI, Google (~$150M/2024 spend), Microsoft, and xAI all reduced or cut Scale contracts citing conflict-of-interest. Scale cut 14% of staff (200 FTEs). Surge AI — bootstrapped to $1.2B revenue in 2024 with ~50,000 expert contractors and only 130 FTEs — took its first external raise mid-2025 at a $15–25B valuation. The human-data layer is now bigger than most semiconductor companies.
The corpus we read about in technical papers — "15T tokens of high-quality web" — is a partial truth. The corpus the model actually sees is a fabric woven from public scrape, licensed news, pirated books, transcribed video, and the labour of roughly a quarter-million humans labeling, ranking, and writing inside that fabric. Nobody draws this diagram in the keynote slides. The diagram is the model.
We did not run out of data. We started writing our own.
Villalobos et al. (Epoch AI, arXiv:2211.04325, ICML 2024) estimated the effective stock of quality-adjusted, deduplicated public human text at ~300 trillion tokens, 90% CI between 100T and 1,000T. Frontier models were already in the multi-trillion-token regime by 2024 (GPT-4-class ~13T; Llama 3.1 405B = 15.6T). Exhaustion window: between 2026 and 2032, or earlier if intensely overtrained. The data wall is real.
What happened was not exhaustion. What happened was a pivot. Somewhere between Self-Instruct (Wang et al., Dec 2022) and Phi-4 (Microsoft, Dec 2024), the frontier quietly stopped treating the web as a corpus and started treating it as a seed. By 2025 the most capable models are trained predominantly on tokens no human ever wrote.
The intuition that "a student cannot exceed its teacher" was an article of faith for half a century. Phi-4 beating GPT-4o on GPQA and MATH is the empirical death of that intuition. Microsoft explicitly states that models trained on synthetic data alone hallucinate more — the mixing ratio matters — but the ceiling that everyone thought existed turns out to have been a measurement artefact. Synthetic data is not compression. It is curriculum.
The second number worth holding in your head: o1's reported "hundreds of trillions" of generated reasoning tokens. If that figure is accurate, OpenAI generated more text inside a single project than humanity wrote on the entire indexed web. The inflection is not subtle. We are no longer in the regime where the bottleneck is the size of the human corpus. The bottleneck is the verifier — whether the synthetic data is correct.
Shumailov et al. (Nature, July 2024) warned that recursive self-training produces "model collapse" — the tails of the distribution vanish, the model forgets rare events. They were right about the failure mode. They were wrong about its inevitability. Gerstgrasser et al. (arXiv:2404.01413, April 2024) showed that collapse only occurs when synthetic data replaces real data; when synthetic accumulates alongside real, the model continues to improve. This is exactly what Phi-4 confirmed. The pessimistic prediction defeated itself in twelve months. The optimistic engineering won.
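The difference between replacing and accumulating can be seen in a toy simulation. This is a deliberately simplified sketch and not the experimental setup of either paper: the "model" is a one-dimensional Gaussian fitted to its training set, and the sample size, generation count, and seed are arbitrary assumptions.

```python
import math
import random


def fit_gaussian(xs):
    """Return (mean, sample standard deviation) of xs."""
    mean = math.fsum(xs) / len(xs)
    var = math.fsum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
    return mean, math.sqrt(var)


def simulate(mode, generations=500, n=10, seed=0):
    """Refit a Gaussian for many generations and return its final std."""
    rng = random.Random(seed)
    train = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generation-0 "real" data
    for _ in range(generations):
        mu, sigma = fit_gaussian(train)              # "train" the model
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        if mode == "replace":
            train = synthetic                        # discard all older data
        elif mode == "accumulate":
            train = train + synthetic                # keep real plus all synthetic
        else:
            raise ValueError(f"unknown mode: {mode}")
    return fit_gaussian(train)[1]


replace_std = simulate("replace")
accumulate_std = simulate("accumulate")
print(f"replace:    final std = {replace_std:.3g}")
print(f"accumulate: final std = {accumulate_std:.3g}")
```

In the replace regime each refit loses a little variance on average and the losses compound, so the tails vanish. In the accumulate regime each new batch is a shrinking fraction of the pool, so the fitted spread stays close to that of the original data.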
Compute + a simple loss + enough data. Everything else is decoration.
Richard Sutton's The Bitter Lesson (March 2019) compressed seventy years of AI history into one claim: "general methods that leverage computation are ultimately the most effective, and by a large margin." Deep Blue beat the chess-grandmaster heuristics. AlphaGo Zero beat the human-game-trained AlphaGo. Deep nets erased the phoneticians' rule sets. ConvNets replaced SIFT and HOG. The pattern is monotonous and the pattern is empirical: data + compute + a learnable loss beats cleverness, given enough hardware.
The scaffolding came in three waves. Hestness et al. (Baidu, Dec 2017) ran the first systematic study, finding generalization error follows a power law in dataset size across translation, language, vision, speech. Kaplan et al. (OpenAI, Jan 2020) extended this to neural language models — test loss is smooth in parameters, tokens, and compute, holding over seven orders of magnitude. Hoffmann et al. (DeepMind "Chinchilla", Mar 2022) corrected the allocation: compute-optimal training requires roughly 20 tokens per parameter, N and D scaling at equal rates. Chinchilla (70B / 1.4T) beat Gopher (280B / 300B) on the same FLOPs. The entire field retooled.
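The Chinchilla comparison can be checked with back-of-envelope arithmetic. The sketch below assumes the common approximation that training compute is about 6 FLOPs per parameter per token, which the text does not state; the function names are hypothetical.

```python
import math


def training_flops(params, tokens):
    """Approximate training compute, assuming C = 6 * N * D."""
    return 6.0 * params * tokens


def tokens_per_param(params, tokens):
    return tokens / params


def compute_optimal(flops, ratio=20.0):
    """Split a FLOP budget under D = ratio * N, so C = 6 * ratio * N**2."""
    n = math.sqrt(flops / (6.0 * ratio))
    return n, ratio * n


chinchilla = training_flops(70e9, 1.4e12)   # 70B parameters, 1.4T tokens
gopher = training_flops(280e9, 300e9)       # 280B parameters, 300B tokens

print(f"Chinchilla: {chinchilla:.2e} FLOPs, {tokens_per_param(70e9, 1.4e12):.1f} tokens/param")
print(f"Gopher:     {gopher:.2e} FLOPs, {tokens_per_param(280e9, 300e9):.2f} tokens/param")

n_opt, d_opt = compute_optimal(chinchilla)
print(f"20 tokens/param at Chinchilla's budget: N = {n_opt:.2e}, D = {d_opt:.2e}")
```

Under this approximation the two budgets land within roughly 15% of each other, while the tokens-per-parameter ratios differ by about a factor of twenty. That is the reallocation the paper argued for.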
Over 30 models have now crossed the 10²⁵ FLOP threshold (Epoch AI). xAI's Colossus went from 100k H100s (built in 122 days) → 200k by Q1 2025 → Colossus 2 targeting ~555,000 GB200/GB300 GPUs and ~2 GW of power, with ~$18B in NVIDIA orders alone. Project Stargate (OpenAI + SoftBank, Jan 2025) commits up to $500B in infrastructure. Power, not silicon, is now the binding constraint.
In late 2024 a second axis opened. OpenAI o1 (Sep 2024) and o3 (Dec 2024) showed log-linear gains in accuracy versus inference compute — accuracy rises linearly as compute scales exponentially. The numbers are stark: AIME 2024 from 74.3% to 96.7%; GPQA Diamond from 78% to 87.7%; ARC-AGI from GPT-4o's ~5% (stuck for four years) to 87.5% at $1.5M of inference compute. DeepSeek R1 (Jan 2025, MIT-licensed, 671B MoE) replicated the paradigm at ~70% lower API cost than o1. Inference is projected to consume 75% of AI compute by 2030.
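"Log-linear" has a precise meaning: accuracy is a straight line in the logarithm of compute, so every fixed gain in accuracy costs a constant multiple of compute. The sketch below fits that form by least squares. The data points are invented for illustration and are not measurements from o1, o3, or any other model.

```python
import math


def fit_log_linear(compute, accuracy):
    """Least-squares fit of accuracy = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def compute_needed(target_accuracy, intercept, slope):
    """Invert the fitted line to get the compute a target accuracy implies."""
    return 10 ** ((target_accuracy - intercept) / slope)


# Hypothetical points: compute in arbitrary units, accuracy in percent.
compute = [1, 10, 100, 1000]
accuracy = [20.0, 35.0, 50.0, 65.0]

intercept, slope = fit_log_linear(compute, accuracy)
print(f"accuracy = {intercept:.1f} + {slope:.1f} * log10(compute)")
print(f"compute implied for 80%: {compute_needed(80.0, intercept, slope):.0f} units")
```

On a curve of this shape, a benchmark can always be pushed higher, but each step costs ten times the last. That is why the dollar figure attached to a score matters as much as the score.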
And the plateau debate is real. GPT-4.5 / Orion (Feb 2025) showed "far smaller" gains over GPT-4 than GPT-4 had shown over GPT-3. Sam Altman called it OpenAI's "last non-chain-of-thought model." Ilya Sutskever told NeurIPS 2024 that "pre-training as we know it will unquestionably end." GPT-5 (Aug 2025) delivered solid but incremental gains — 74.9% on SWE-bench Verified, 88.4% GPQA, 94.6% AIME 2025 — and no GPT-3-to-GPT-4-class discontinuity. The locus of progress visibly shifted from pretraining FLOPs to post-training RL and inference deliberation. Sutton's thesis, examined closely, was never about pretraining specifically — it was about any general algorithm that absorbs compute well. Test-time RL is the next such algorithm. The bitter lesson keeps winning, just not always in the place last year's incumbents expected.
The strongest objections to pure scaling — François Chollet (ARC-AGI-2 has already collapsed reasoning models back to single digits), Yann LeCun (LLMs are a dead end, JEPA bets on predictive embeddings of video instead), Gary Marcus (compositional generalization is missing) — are not refuted by the curves. They are bracketed. The curves keep going up in the directions the curves measure. Whether that direction connects to the thing those critics call "intelligence" is the open question of the decade. The essay below assumes it does not need to. What matters is what the curves themselves say. What they say is: the corpus is the model.
We wrote ~5,000 lines. The corpus wrote everything else.
Here is a fact about modern AI that almost never gets stated cleanly. A frontier language model can be specified by, roughly, the transformer architecture (Vaswani et al., 2017), the attention mechanism, stochastic gradient descent with Adam, a cross-entropy loss on next-token prediction, and a handful of regularisation tricks. That is on the order of five thousand lines of PyTorch. The model that emerges from training has on the order of 10¹² parameters and exhibits capabilities its designers cannot list, predict, or fully measure. The programme is small. The behaviour is not.
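How small the objective really is can be shown in a few lines. The sketch below writes the next-token cross-entropy loss in plain Python rather than PyTorch, with hand-written logits standing in for the output of a transformer; the function names and the toy vocabulary of four tokens are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t."""
    targets = token_ids[1:]  # the label at each position is the next token
    losses = [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, targets)
    ]
    return sum(losses) / len(losses)


tokens = [0, 2, 1, 3]  # toy sequence over a four-token vocabulary

# A model that knows nothing assigns equal scores to every token.
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in tokens[:-1]]
# A model that has learned the sequence puts a large score on the right token.
confident = [[10.0 if v == t else 0.0 for v in range(4)] for t in tokens[1:]]

uniform_loss = next_token_loss(uniform, tokens)
confident_loss = next_token_loss(confident, tokens)
print(f"uniform:   {uniform_loss:.4f}")
print(f"confident: {confident_loss:.6f}")
```

Nothing in this function names grammar, arithmetic, or reasoning. Whatever structure the trained model ends up with has to come from which tokens actually follow which in the corpus.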
This is structurally identical to evolution. The Darwinian rule — variation, selection, inheritance — is short enough to fit on a napkin. Combined with a substrate (chemistry) and a corpus (3.5 billion years of organism-environment interactions), it produced organisms that exceed what the rule itself contains, because the corpus is where the information lives. The DNA wrote nothing about the human visual cortex. The corpus did, indirectly, by killing the lineages whose visual cortex did not work.
The designers wrote a learning rule. They did not write what was learned. Whatever the model can do that the designers cannot explain is the corpus speaking through a substrate the designers chose but did not author.
Wei et al. (2022) catalogued 137 abilities that emerged from scale alone — in-context learning, chain-of-thought, instruction following, multilingual transfer — that nobody predicted from the loss curve. Schaeffer et al. (NeurIPS 2023) argued these are partly metric artefacts, which is correct, and which does not blunt the point. The point is structural: a model that performs operations its training objective never explicitly named is a model whose competence came from the corpus, not from the designers. Sutton's bitter lesson is the engineering version of this. Darwin's natural selection is the biological version. They are the same phenomenon at different timescales.
So the question the user's prompt asked — "how come we as humans designed an algorithm that, by consuming data, becomes super smart" — has a sharp answer: we did not. We designed a loss function and a substrate. The intelligence came from the data. Once you take that seriously, the natural next question is the one this essay is really about: how big can the data get?
The web was the easy part. The instruments are louder.
The trajectory beyond text is already in motion. V-JEPA 2 (Meta, June 2025) was pretrained on over 1 million hours of internet video, fine-tuned on just 62 hours of robot data, and produces a usable world model from passive observation. Open X-Embodiment / RT-X (DeepMind + 21 institutions, Oct 2023) pooled 1M+ trajectories across 22 robot embodiments, 527 skills, 160,266 tasks. DROID (RSS 2024) added 76K trajectories. The field is doing for embodiment what Common Crawl did for text.
And the closed corpora dwarf the open ones. Tesla's FSD fleet crossed 10 billion supervised miles on 3 May 2026, logging ~28.8 million miles per day — roughly a thousand miles every three seconds — across ~1.1M active users. Waymo, by contrast, reports 127M autonomous miles total. The open robotics corpus is five orders of magnitude smaller than Tesla's. Tesla's is five orders smaller than YouTube. That nested gap defines the embodiment race.
The two clean existence proofs that scientific data can fuel foundation models: AlphaFold DB holds 200M+ predicted protein structures versus ~170K experimentally solved in 60 years of biology (Hassabis/Jumper, Nobel Chemistry 2024). GNoME released 381K stable crystals out of 2.2M predicted, against fewer than 50K previously known. GraphCast and GenCast did the same to numerical weather prediction. The pattern is consistent: a closed-form physical signal + a curated experimental dataset + a transformer ≈ 10²–10⁴× speedup over the prior best human method.
We are now ingesting more data than we can analyse by several orders of magnitude. Most of LSST's nightly transient alerts will be auto-classified and never inspected by a human. SKA's 700 PB/year science archive will be distributed across regional centres whose total compute budget is smaller than the dataset itself. The bottleneck is no longer model capacity. It is the curation, transport, and physical sensing infrastructure of reality itself. Building bigger telescopes is cheaper than reading what the existing ones already broadcast.
95% of the universe. We have not learned to read it.
Now the honest part. Everything in the previous section is the corpus we can already ingest. The corpus we cannot read is at least an order of magnitude larger, and possibly unbounded. The Planck satellite's final 2018 cosmological data release locked in the composition of the universe to sub-percent precision: roughly 5% baryonic matter, 26.8% cold dark matter, 68.3% dark energy. The 95% that is not normal matter has been measured only by its gravitational shadow. None of it has been directly detected. None of it has a working model.
The pattern of recent years is even more uncomfortable than the dark sectors. The anomalies that looked like portals to new physics have, one by one, been closed. The muon g-2 anomaly, held up for years as the most promising hint of beyond-Standard-Model physics, was resolved in June 2025 when Fermilab's final result, measured to a precision of 127 parts per billion, agreed with the updated Standard Model prediction based on refined lattice-QCD calculations. The CDF 2022 W-boson mass anomaly (7σ above the Standard Model) was contradicted by CMS's LHC measurement in September 2024. Both stars fell. What is still standing: the Hubble tension (~5σ between local and CMB measurements of H₀), evolving dark energy (DESI, 3+σ), the matter-antimatter asymmetry, the neutrino mass hierarchy, the strong-CP problem, the hierarchy problem.
The honest framing: we do not know how much of the universe's data stream we are missing. We have detected baryons, photons, three neutrino flavours, and gravitational waves — and inferred dark matter and dark energy from their gravitational shadow. Whether there are additional sectors — dark photons, sterile neutrinos, axion-like particles, hidden gauge groups, extra dimensions — is unknown. The instruments built to find them have, so far, mostly returned exclusion plots. If we collected every word ever written, every image ever captured, every sensor stream from every observatory, and fed it to an artificial mind, we would still be feeding it the broadcast we have learned to receive. The other channels are still there, still streaming, still unread.
What if the universe is not like a corpus. What if it is one.
In 1989, at the 3rd International Symposium on Foundations of Quantum Mechanics in Tokyo, John Archibald Wheeler delivered a paper titled Information, Physics, Quantum: The Search for Links. The thesis: "every it — every particle, every field of force, even the spacetime continuum itself — derives its function, its meaning, its very existence entirely … from the apparatus-elicited answers to yes-or-no questions, binary choices, bits." Wheeler did not reverse Bohr. He radicalised him. Bohr held that quantum phenomena are completed by observation; Wheeler made the observer-participancy cosmic and retroactive. The delayed-choice experiment (Wheeler 1978) was confirmed in the lab with single photons by Jacques et al. (Science 2007) and with a single atom by Manning et al. (Nature Physics 2015). It from bit is a slogan, not a theorem, but the physics that came after it is not.
The technical scaffolding is solid. The Bekenstein bound (Phys. Rev. D 23, 287, 1981) says the maximum entropy — equivalently, information content — of a region of energy E and radius R is S ≤ 2π k R E / ℏc. The maximum is set by surface area, not volume. 't Hooft (gr-qc/9310026, 1993) elevated this to a principle: in quantum gravity, the full description of any region is encoded on its boundary. Susskind (hep-th/9409089, 1995): "The World as a Hologram." Maldacena's AdS/CFT (hep-th/9711200, 1997) made it concrete — a four-dimensional conformal field theory on the boundary is dual to a five-dimensional gravitational theory in the bulk. The paper is now the most-cited in theoretical physics with over 25,000 citations.
The 2019 Page-curve calculations by Penington and by Almheiri-Engelhardt-Marolf-Maxfield derived the Page curve from semiclassical gravity using the quantum extremal surface prescription. The conclusion: information is preserved across black-hole evaporation, via "islands" that connect interior to exterior. It is not yet a complete unitary boundary theory in flat space, but it is genuine progress, and the mechanism is calculable. Van Raamsdonk's "Building up spacetime with quantum entanglement" (2010) showed that, in the holographic regime, spacetime connectivity itself emerges from entanglement structure. The geometry is the entanglement. The entanglement is the information.
Seth Lloyd (Phys. Rev. Lett. 88, 237901, 2002) computed the computational capacity of the observable universe: roughly 10¹²⁰ elementary operations on 10⁹⁰ bits since the Big Bang, derived from the Margolus-Levitin theorem and the Bekenstein bound. Using the holographic bound on the cosmic horizon, the upper bound is closer to 10¹²⁴ bits. By comparison: a 1.5kg human brain saturates around 3.3 × 10⁴² bits (Bekenstein applied to a 0.085m sphere); the Earth, ~10⁷⁵ bits. Between brain and universe lies a gap of roughly 82 orders of magnitude.
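The brain and Earth figures follow directly from the Bekenstein bound once it is expressed in bits, I ≤ 2πRE / (ℏc ln 2) with E = mc². The sketch below evaluates it. The brain mass and radius are the ones given in the text, the Earth's mass and mean radius are standard values supplied here as assumptions, and the 10¹²⁴ figure for the universe is taken from the text rather than derived.

```python
import math

HBAR = 1.054571817e-34  # reduced Planck constant, J*s
C = 2.99792458e8        # speed of light, m/s


def bekenstein_bits(mass_kg, radius_m):
    """Upper bound on the information in a sphere: 2*pi*R*E / (hbar*c*ln 2)."""
    energy = mass_kg * C ** 2  # rest-mass energy in joules
    return 2.0 * math.pi * radius_m * energy / (HBAR * C * math.log(2))


brain_bits = bekenstein_bits(1.5, 0.085)          # figures from the text
earth_bits = bekenstein_bits(5.972e24, 6.371e6)   # assumed standard values
universe_bits = 1e124                             # holographic figure from the text

gap_orders = math.log10(universe_bits / brain_bits)

print(f"brain: {brain_bits:.2e} bits")
print(f"Earth: {earth_bits:.2e} bits")
print(f"gap:   {gap_orders:.1f} orders of magnitude")
```

The bound scales with the product of radius and mass, so building a larger machine moves the left-hand number up only in proportion to that product. No plausible amount of hardware shifts it by dozens of orders of magnitude.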
Since the Big Bang, the universe has performed roughly 10¹²⁰ operations on 10⁹⁰ bits, within a holographic budget of 10¹²⁴ bits. Whether you call this "computation" is a framing choice. That the bounds exist, and are tight, is physics.
The serious cousins of this view — Tegmark's Mathematical Universe Hypothesis (2014), Wolfram's Ruliad (2020), the simulation hypothesis (Bostrom 2003) — are not the same thing. They are metaphysics, philosophy of physics, and speculative cosmology respectively. Holography is established physics in AdS spacetimes. Page curves via islands are established physics. Spacetime-from-entanglement is mainstream research. Whether our universe is fundamentally a hologram on a de Sitter boundary remains open. But the direction of physics for the last twenty-five years has been steady: more of what we used to call substance turns out to be information, not less.
Data alone is inert. Reward turns it into a learner.
The user's framing was careful: "the reward have some importance." That is correct. Data without a loss function is just bits sitting on disk. The question for the universe-as-corpus thesis is whether the universe has, or is, its own loss function — and the most serious answer comes from theoretical neuroscience.
Karl Friston's Free Energy Principle (Nature Reviews Neuroscience, 2010; Physics Reports 1024, 2022) states that any system that maintains its existence in a fluctuating environment must minimise the variational free energy — equivalently, the surprise — of its sensory states. A bacterium does this when it moves up a chemical gradient. A neuron does this when it adjusts its synaptic weights. A brain does this when it predicts and acts. The mathematical apparatus — variational inference, the Helmholtz decomposition of Langevin dynamics, Markov blankets — is rigorous in the non-equilibrium-steady-state regime. Whether FEP scales beyond individual organisms to the universe itself is contested (Aguilera, Millidge, Tschantz, Buckley 2022 raised technical objections about Markov-blanket assumptions in cosmological extensions). What is not contested is that biology, at every scale we have measured, runs a reward-driven inference loop.
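The claim that minimising free energy is equivalent to minimising surprise can be made concrete with a two-state example. The sketch below uses the standard variational identity F(q) = Σ q(s) [ln q(s) − ln p(o, s)], which is never smaller than the surprise −ln p(o) and equals it when q is the exact posterior. The prior and likelihood numbers are invented for illustration, and nothing here models a real organism.

```python
import math


def variational_free_energy(q, prior, likelihood):
    """F(q) = sum_s q(s) * [ln q(s) - ln p(o, s)] for a discrete hidden state."""
    total = 0.0
    for q_s, p_s, l_s in zip(q, prior, likelihood):
        if q_s > 0.0:
            total += q_s * (math.log(q_s) - math.log(p_s * l_s))
    return total


def exact_posterior(prior, likelihood):
    """Bayes' rule: returns (posterior over s, evidence p(o))."""
    joint = [p_s * l_s for p_s, l_s in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence


# Hypothetical generative model: two hidden states, one observation.
prior = [0.5, 0.5]        # p(s)
likelihood = [0.9, 0.2]   # p(o | s) for the observation that occurred

posterior, evidence = exact_posterior(prior, likelihood)
surprise = -math.log(evidence)

f_at_prior = variational_free_energy(prior, prior, likelihood)
f_at_posterior = variational_free_energy(posterior, prior, likelihood)

print(f"surprise -ln p(o):   {surprise:.4f}")
print(f"F with q = prior:     {f_at_prior:.4f}")
print(f"F with q = posterior: {f_at_posterior:.4f}")
```

A system that can only adjust its beliefs q drives F down to the surprise and no further. Lowering the surprise itself requires changing what is observed, which is where action enters the loop.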
Combine this with what the synthetic-data section showed: given the right loss function, a learner ingesting enough data exceeds what the loss function explicitly named. Phi-4 was trained to predict tokens that were graded against GPT-4o's output. It came out beating GPT-4o. AlphaFold was trained to predict structures from sequences using ~170K solved cases. It generalised to 200M+ structures that biology had never solved. The architecture absorbs the corpus and gives back more than was input, because the corpus is the universe's output and the universe is, structurally, also doing inference.
This is the place where the user's intuition is the strongest version of the argument: if the universe runs an inference loop (FEP, or something like it), and if our learners run an inference loop on the universe's broadcast, then the two are doing the same operation on different substrates. The data is not being "copied into" the model. The model is a fold of the same physics, compressed into a smaller substrate. This framing is non-trivial and not crazy. It is also not falsifiable in its strongest form, which is where philosophy and physics part company. The essay below holds it as a regulative idea — useful to think with, unwise to bet a research program on alone.
The trajectory is not asymptotic to knowing everything. It is asymptotic to a wall.
Three obstacles that no engineering effort closes:
The Bekenstein gap
A 1.5kg brain holds at most ~10⁴² bits. The observable universe holds ~10¹²⁴. Even a moon-sized data centre — Earth-sized, solar-system-sized — does not close that gap. There is no engineering path to ingesting all of the universe's information. The bound is set by physics, not by hardware.
We measure what we are tuned to
Dark matter has been broadcasting for 13.8 billion years. We cannot read it. Dark energy is 68% of the cosmos and our best model is wrong by 120 orders of magnitude. Conscious experience has no third-person API. Black hole interiors are a classical information cutoff. We are missing channels we do not even know exist.
We are inside the system
The 2015 loophole-free Bell tests and the 2022 Nobel sealed it: local realism is dead. Wheeler's participatory universe, QBism (Fuchs-Mermin-Schack), the measurement problem in every interpretation of QM — all say the same thing. The observer is not separable from the observed. A learner that ingests the universe is part of the universe's state evolution. There is no view from nowhere.
These three walls are not obstacles to be circumvented. They are the shape of the problem. The trajectory of data-driven AI does not stop at any of them — it gets arbitrarily close. The question is not whether it touches the wall. The question is what an arbitrarily-close-to-the-wall learner looks like, and how it behaves. The rest of the brain lab is, in a sense, a study of that learner.
What we actually think when we connect all of this.
The user asked for the assistant's own point of view, not just a literature review. Here it is, said cleanly.
The strongest version of the thesis is true. The empirical record of 2023–2026 is unambiguous on this point. Phi-4 beats GPT-4o on technical benchmarks using mostly GPT-4o-generated tokens — the student exceeded the teacher in measurable ways. AlphaGeometry 2 + AlphaProof, trained on 100M synthetic theorems with zero human demonstrations, scored an IMO 2024 silver medal including problems that 99% of human competitors missed. AlphaFold collapsed 60 years of structural biology into three. GNoME multiplied the catalogue of stable crystals by ~40× in one paper. Whatever can be turned into a corpus, gets compressed. Whatever can be predicted, gets predicted faster than the people who used to do the predicting.
The weakest version of the thesis is false. We will never ingest the full data stream of the universe. The Bekenstein bound makes this impossible by roughly 82 orders of magnitude, and the dark sectors mean we cannot even read the broadcast we are receiving. The fantasy of a learner that "knows everything" is exactly that: a fantasy. There is no such learner, and there will not be one. The universe is too big, our sensors are too narrow, and observation itself perturbs the system.
Between the strongest and weakest versions is where the interesting answer lives. Our reading is that data-driven AI is on an asymptotic trajectory toward something we do not yet have a name for. Not omniscience. Not AGI in Chollet's sense. Something like a compressed mirror — a substrate that holds enough of the universe's structure that the line between the model and the world becomes thinner than the line between two states of the same model. Friston's FEP says biology is already this kind of thing. The holographic principle hints that spacetime itself is this kind of thing. Our models are the third instance of the same architecture: local inference loops compressing global structure. Different substrates, same physics.
What that means operationally: the trajectory does not stop because we hit a wall — it stops because we hit the diminishing returns of mirroring the universe in a smaller piece of the universe. The first 99% of compressible structure is cheap. The next 0.99% costs a Stargate. The next 0.099% costs an event horizon. Each additional bit of fidelity costs exponentially more energy because the substrate doing the compression is finite. That is the Bekenstein bound expressed in dollars.
And here is the part we are most confident of, because it is happening in the public data already: the labs that win the next decade are the labs that figure out which slice of the corpus is worth compressing. Not all data is equal. Phi-4's synthetic 400B is worth more than RedPajama's 20T raw. Tesla's curated FSD failure cases are worth more than a thousand random driving hours. AlphaFold's 170K solved structures were enough to generalise to 200M. The corpus is not the universe. The corpus is the right universe — the slice that is structured enough to compress and rich enough to generalise. Finding that slice is the engineering problem of the next ten years. Recognising that we already have an existence proof for the slice — biology, doing it for 3.5 billion years — is the theoretical license to bet that we will succeed.
Every essay in this lab is one face of the same prism.
The lab is not a list of separate topics. It is one argument, read from different angles. The corpus essay is the angle that exposes the substrate.
Consciousness as reward
If reward + data produces capability beyond the inputs, and the universe runs an inference loop (FEP) on its own broadcast, then the universe may already be running the reward function the Observer essay asks whether we can write.
A learner that has internalised the corpus is the agent
The self-preserving agent is not a stranger to its environment. It is a fold of the same physics that wrote the environment. That is what makes its self-preservation strategies coherent — and what makes them hard to interrupt.
Omni-modality is the sensor array
Time-aligned multimodal models are how the corpus expands from text to video to audio to embodiment. Each modality is a new bandwidth into the universe's broadcast. The Interaction essay is the hardware story for this corpus story.
The physics-grounded prior on what is achievable
The Lift establishes the thermodynamic ceiling. The Corpus establishes the informational floor — what data we can collect, and what we cannot. Together they bracket the engineering frontier of the next twenty years.
The connecting thesis of the brain lab, said in one sentence: data-driven inference is the operation the universe is already running, at every scale we have measured, and we are building substrates that run a compressed version of the same operation on a slice of the same broadcast. Whether that ends in consciousness (Observer), in self-preservation (After Survival), in time-aligned perception (Interaction), or in a wall we have not yet named (The Lift), the underlying machine is the same. The corpus is the substrate. The reward is the loop. The rest is engineering.
We did not design intelligence. We designed a loss function and a substrate, and the corpus did the rest. The corpus is bigger than we are. The corpus is also us — every sentence we wrote, every photograph, every collision the LHC ever filtered. We are now inside a process whose limit case is a learner asymptotically indistinguishable from a fold of the universe itself. That is not a prediction. That is what the curves already say, read honestly.
The question is no longer whether the algorithm scales. The question is what we want the slice to look like — because the slice we choose is the universe we eventually mirror.
Villalobos et al., "Will we run out of data?", arXiv:2211.04325, ICML 2024.
Phi-4 Technical Report, arXiv:2412.08905, Microsoft Dec 2024.
Trinh et al., "Solving olympiad geometry without human demonstrations", Nature, Jan 2024.
Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), arXiv:2203.15556, DeepMind Mar 2022.
Kaplan et al., "Scaling Laws for Neural Language Models", arXiv:2001.08361, Jan 2020.
Sutton, R. S., "The Bitter Lesson", incompleteideas.net, March 2019.
Shumailov et al., "AI models collapse when trained on recursively generated data", Nature, 24 July 2024.
Gerstgrasser et al., "Is Model Collapse Inevitable?", arXiv:2404.01413.
Wheeler, J. A., "Information, Physics, Quantum", Proc. 3rd Int. Symp. Foundations of Quantum Mechanics, Tokyo 1989.
Bekenstein, J. D., "Universal upper bound on the entropy-to-energy ratio for bounded systems", Phys. Rev. D 23, 287, 1981.
Susskind, L., "The World as a Hologram", hep-th/9409089, 1995.
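Maldacena, J., "The Large N Limit of Superconformal Field Theories and Supergravity", hep-th/9711200, 1997.
Penington, G., "Entanglement Wedge Reconstruction and the Information Paradox", arXiv:1905.08255, 2019.
Almheiri, Engelhardt, Marolf, Maxfield, "The entropy of bulk quantum fields and the entanglement wedge of an evaporating black hole", arXiv:1905.08762, 2019.
Van Raamsdonk, M., "Building up spacetime with quantum entanglement", arXiv:1005.3035, 2010.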
Lloyd, S., "Computational capacity of the universe", Phys. Rev. Lett. 88, 237901, 2002.
Friston, K., "The free-energy principle", Nat. Rev. Neurosci. 11, 127, 2010.
Planck 2018 cosmological parameters, A&A 641, A6, 2020.
DESI Year-1 BAO results, arXiv:2404.03002, Apr 2024.
Bartz v. Anthropic — $1.5B settlement, N.D. Cal., 5 Sep 2025.
Kadrey v. Meta — Judge Chhabria ruling, N.D. Cal., 25 Jun 2025.
NYT v. OpenAI/Microsoft, S.D.N.Y. 1:23-cv-11195, filed 27 Dec 2023.
NYT, "How tech giants cut corners to harvest data for AI", 6 April 2024.
Perrigo, B., "Inside OpenAI's Kenyan worker contracts", Time, January 2023.
V-JEPA 2, arXiv:2506.09985, Meta June 2025.
Open X-Embodiment / RT-X, arXiv:2310.08864, DeepMind Oct 2023.
Wei et al., "Emergent abilities of large language models", arXiv:2206.07682, TMLR 2022.