Ethan Mollick Critiques Scientific Publishing's AI Inertia: PDFs Still Dominate in 2026

Wharton professor Ethan Mollick highlights that scientific papers in 2026 are still primarily uploaded as formatted PDFs to restrictive academic archives, signaling slow adaptation to AI's potential for accelerating research.

Gala Smith & AI Research Desk·12h ago·7 min read·AI-Generated
Scientific Publishing's AI Bottleneck: Why PDFs Still Rule in 2026

March 2026 — In a pointed observation on the state of academic infrastructure, Wharton professor and AI researcher Ethan Mollick noted that the fundamental format of scientific communication has remained stubbornly unchanged in the face of a transformative technological wave. Despite the widespread adoption of large language models (LLMs) and AI research assistants across labs and universities, the primary artifact of science—the published paper—is still almost universally uploaded as a fully formatted, static PDF to academic archive sites that often impose download limits or paywalls.

Mollick's comment, made on social media, frames this not as a minor technical detail, but as a critical bottleneck. "The fact that every scientific paper in 2026 is still uploaded only as fully formatted PDFs to academic archive sites that often limit downloads," he wrote, "tells you everything you need to know about how quickly the scientific system is adjusting to the potential of AI to accelerate science."

The Persistent PDF Problem

The Portable Document Format (PDF), created by Adobe in the early 1990s, was designed for consistent visual presentation and printing. For human readers, it works. For AI systems, it's a significant obstacle. While modern LLMs can parse PDF text with reasonable accuracy, the format inherently obscures semantic structure, buries data and code, and separates figures from their captions and context. Extracting clean, machine-readable text, tables, and equations from a PDF is a non-trivial task that introduces noise and error.
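
A minimal sketch of what that friction looks like in practice, using the open-source pypdf library on a hypothetical paper.pdf (the file name is illustrative, not tied to any particular archive): everything comes back as one flat character stream that downstream tools must re-segment heuristically.

```python
# Minimal sketch: extracting text from a hypothetical "paper.pdf" with pypdf.
# Section headings, figure captions, table cells, and equations all arrive as
# an undifferentiated character stream; no semantic structure survives.
from pypdf import PdfReader

reader = PdfReader("paper.pdf")  # hypothetical file name
pages = [page.extract_text() or "" for page in reader.pages]
full_text = "\n".join(pages)

# Recovering the abstract or the references section from this string means
# brittle pattern matching, which is exactly the "noise and error" problem.
print(full_text[:500])
```
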

More fundamentally, the PDF represents a philosophy of publishing centered on the final, immutable document for human consumption, not on the flow of structured data, code, and intermediate results that AI systems could use to connect findings, replicate analyses, or generate novel hypotheses.

The Archive Access Issue

Mollick's second critique targets the repositories themselves. Major archives like arXiv, PubMed Central, and publisher portals (Elsevier, Springer Nature) often limit bulk downloads via technical rate-limiting or prohibitive terms of service. This directly impedes the creation of large, up-to-date corpora for training specialized scientific AI models or for large-scale meta-analysis via AI agents.
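
For illustration, the sketch below queries arXiv's public Atom API using only the Python standard library (the search term is arbitrary). The point is what comes back: titles, abstracts, and links to PDFs, not the structured full text an AI agent would actually want, and bulk retrieval is still expected to be throttled.

```python
# Rough sketch: querying the public arXiv API for paper metadata.
# The Atom feed returns titles, abstracts, and links, but the article body
# itself is still only reachable as a PDF or TeX source, and arXiv's usage
# guidelines ask clients to pause between successive bulk requests.
import time
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
url = (
    "http://export.arxiv.org/api/query"
    "?search_query=all:large+language+models&start=0&max_results=5"
)

with urllib.request.urlopen(url) as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    title = entry.findtext(f"{ATOM}title", default="").strip()
    summary = entry.findtext(f"{ATOM}summary", default="").strip()
    print(title)
    print(summary[:200], "...")

time.sleep(3)  # be polite before issuing the next batch of requests
```
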

While initiatives like the Semantic Scholar Open Research Corpus (S2ORC) and others have made strides in creating cleaned datasets, they operate downstream of the core publishing process, dealing with the PDF problem after the fact. The primary submission pipeline for most researchers remains: produce a PDF, submit it to a journal or conference, and later upload that same PDF to a preprint server.

What This Means in Practice

This inertia has tangible costs for AI-driven science:

  • Slower Literature Reviews: AI research assistants must spend computational effort "reading" PDFs instead of querying structured knowledge graphs.
  • Barriers to Replication: Associated code and data are rarely linked in a machine-actionable way within the PDF, hindering automated verification.
  • Missed Connections: AI systems that could hypothesize novel links between methods in computer science and biology struggle when each discipline's papers are locked in separate, format-obscured silos.
  • Inefficient Knowledge Synthesis: Systematic reviews and meta-analyses, prime candidates for AI augmentation, remain labor-intensive manual processes.

The Counter-Movement: Structured Science

Mollick's critique is not made in a vacuum. A growing movement advocates for "machine-first" or at least "machine-friendly" research communication. This includes:

  • JATS XML: A standard used by many publishers for article structuring, though often not exposed to the public (see the parsing sketch after this list).
  • Executable Research Articles: Papers where the code, data, and narrative are interlinked in environments like Jupyter Notebooks or Stencila.
  • FAIR Principles: The push for data to be Findable, Accessible, Interoperable, and Reusable.
  • Preprint Servers with APIs: Servers like arXiv do provide APIs, but they primarily serve PDFs and source TeX, not richly structured content.

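To make the contrast with PDF parsing concrete, here is a small sketch using Python's standard XML tooling on a hand-written, JATS-style fragment (illustrative only; real JATS articles are far richer): tagged structure lets a program address the abstract, sections, and figure captions directly, with no layout guessing.

```python
# Sketch: reading a JATS-style XML fragment with the standard library.
# The fragment is hand-written for illustration; it shows how tagged
# structure exposes the abstract, sections, and figure captions to code.
import xml.etree.ElementTree as ET

jats_fragment = """
<article>
  <front>
    <article-meta>
      <title-group><article-title>An Example Paper</article-title></title-group>
      <abstract><p>We study a toy problem.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Methods</title><p>We describe the toy method.</p></sec>
    <fig id="fig1"><caption><p>Results on the toy problem.</p></caption></fig>
  </body>
</article>
"""

root = ET.fromstring(jats_fragment)
print("Title:   ", root.findtext(".//article-title"))
print("Abstract:", root.findtext(".//abstract/p"))
for sec in root.findall(".//sec"):
    print("Section: ", sec.findtext("title"))
for fig in root.findall(".//fig"):
    print("Caption: ", fig.findtext("caption/p"))
```
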
The progress, however, is fragmented and not yet the default. The dominant incentive system in academia—journal impact factors, citation counts tied to the PDF version—creates immense gravitational pull toward the status quo.

Agentic.news Analysis

Mollick's observation cuts to the heart of a systemic challenge we've tracked closely. This isn't just a file format debate; it's about the data layer of science. As we covered in our 2025 analysis of Google's AlphaFold 3 and its integrated database, the most dramatic AI breakthroughs in science occur when models are trained on and interact with structured, high-quality data—not scraped PDFs. The success of models in biology and materials science is partly due to decades of investment in structured databases like GenBank and the Protein Data Bank.

The scientific publishing industry, dominated by a handful of large firms like RELX (Elsevier) and Springer Nature, operates on a legacy economic model built around the PDF-as-product. Their experiments with AI—such as Elsevier's Scopus AI or Springer Nature's Curie—are often additive services layered on top of the existing PDF corpus, not a re-engineering of the core format. There's a clear misalignment between the infrastructure needed for an AI-accelerated science ecosystem and the infrastructure that maximizes short-term returns for incumbent publishers.

This tension aligns with our reporting on the AI research tools market, where startups like Scite, Elicit, and Consensus have grown precisely by building better interfaces and pipelines to work around the PDF problem. Their success is a market signal that researchers are desperate for better tools, but they remain downstream fixers, not upstream reformers of the publication pipeline.

The slow pace of change Mollick identifies is a classic example of technological friction where social and institutional systems lag behind technical capability. Until funding agencies, tenure committees, and major journals collectively mandate or strongly incentivize the submission of structured, machine-readable research objects alongside or instead of the traditional PDF, the AI acceleration of science will be operating with one hand tied behind its back, parsing documents instead of reasoning over knowledge.

Frequently Asked Questions

Why are PDFs bad for AI in science?

PDFs prioritize visual layout for human readers, which makes it difficult for AI to reliably extract clean text, understand the hierarchical structure of a paper (title, authors, abstract, sections, references), and, most importantly, to programmatically access the data, code, and mathematical expressions embedded within. This forces AI systems to use error-prone parsing techniques instead of working with native, structured information.

What would a better alternative to PDFs look like for AI?

An ideal alternative would be a structured, semantic format that separates content from presentation. This could be based on existing standards like JATS XML, which tags elements like <abstract>, <fig>, and <sec>. Even better would be executable formats that bundle the narrative, the underlying code, and the data in an interactive environment (like a Jupyter Notebook), allowing AI to not only read the findings but also re-run and verify the analysis.
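
As a rough sketch of the executable end of that spectrum, the snippet below uses the nbformat library to open a hypothetical notebook-based paper (paper.ipynb is an assumed file name, not an established standard) and separate its narrative from its code cells, which is the kind of programmatic access an AI verification agent would need.

```python
# Sketch: treating a Jupyter notebook as an "executable research article".
# nbformat exposes the narrative (markdown cells) and the analysis (code
# cells) as structured objects, so an agent could re-run the analysis
# instead of scraping a PDF. "paper.ipynb" is a hypothetical file name.
import nbformat

nb = nbformat.read("paper.ipynb", as_version=4)

narrative = [c.source for c in nb.cells if c.cell_type == "markdown"]
analysis = [c.source for c in nb.cells if c.cell_type == "code"]

print(f"{len(narrative)} narrative cells, {len(analysis)} code cells")
# A downstream agent could now execute the code cells in order (e.g. with
# nbclient) and compare the outputs against the claims in the narrative.
print(analysis[0] if analysis else "no code cells found")
```
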

Who is Ethan Mollick and why is his comment significant?

Ethan Mollick is a professor at the Wharton School of the University of Pennsylvania who studies innovation, entrepreneurship, and the impact of AI. He is a widely followed and credible voice on the practical adoption of AI. His comment is significant because it comes from an influential academic observing from within the system, highlighting that the failure to modernize scientific publishing is a major, recognized bottleneck holding back AI's potential to accelerate discovery.

Are there any projects trying to fix this problem?

Yes, but they are mostly workarounds. Academic search engines like Semantic Scholar and Google Scholar invest heavily in parsing PDFs to extract structure and citations. Projects like S2ORC create large, cleaned datasets of text from PDFs. Newer tools like Scite (which checks citation contexts) and Elicit (an AI research assistant) are built on top of these parsed corpora. However, these are all post-hoc solutions. True upstream solutions would require changes at the point of publication by journals, conferences, and preprint servers.

AI Analysis

Mollick's tweet is a succinct diagnosis of a critical infrastructure gap. The technical community has largely solved the model architecture and scaling challenges for scientific AI (evidenced by models like Galactica, Minerva, and various protein/language models). The frontier is now data accessibility and structure. The persistence of the PDF is a symbol of a deeper issue: scientific knowledge remains locked in a format optimized for a 20th-century dissemination model, not for 21st-century computational reuse.

This directly impacts the roadmap for AI-for-science startups and research labs. Significant engineering effort is diverted into building and maintaining fragile PDF parsers and scrapers—effort that could be spent on higher-level reasoning tasks if the data were natively structured. It also creates a bifurcated market: proprietary publishers who control access to large PDF corpora can build walled-garden AI tools, while open-science initiatives struggle with the preprocessing tax.

For practitioners, the implication is clear: advocating for and adopting alternative publishing formats (like preprint servers that accept Jupyter Notebooks or requiring structured data deposits) is not just an academic ideal; it's a practical step to reduce friction for the AI tools that will increasingly define the pace of research. The next leap in AI-driven discovery may depend less on a new transformer variant and more on the widespread adoption of a better file format.