The Benchmarking Revolution: How AI Systems Are Now Co-Evolving With Their Own Tests
In the rapidly advancing field of artificial intelligence, a fundamental tension exists between creating powerful systems and properly evaluating them. Traditional benchmarking, where static datasets with fixed labels serve as the ultimate test, has become increasingly inadequate for complex AI tasks. Now, researchers have proposed a radical solution in a new arXiv preprint: co-evolving benchmarks and AI agents together.
According to a new paper titled "DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality," the team addresses a critical gap in AI verification systems. While search-augmented large language model (LLM) agents can produce comprehensive deep research reports (DRRs), verifying the factuality of individual claims within these documents remains exceptionally challenging.
The Problem with Static Benchmarks
The researchers discovered that existing fact-checking systems are primarily designed for simple, factoid-style claims in general domains. These systems struggle when applied to the nuanced, interconnected claims found in deep research reports. More troublingly, there was no suitable benchmark to test whether current verification methods could handle this complexity.
Building such a benchmark proved unexpectedly difficult. In a controlled study involving PhD-level specialists, unassisted experts achieved only 60.8% accuracy on a hidden "micro-gold" set of verifiable claims. This startling finding revealed that even domain experts struggle with one-shot labeling of complex factual claims, suggesting that traditional static benchmarks built through expert annotation may be fundamentally flawed for this application.
The Audit-then-Score Solution
The research team proposed a novel approach called Evolving Benchmarking via Audit-then-Score (AtS). This framework treats benchmark labels and rationales as explicitly revisable rather than fixed. The process works through a continuous feedback loop:

- A verification agent flags a disagreement with the current benchmark label
- The agent submits supporting evidence for its position
- A human auditor adjudicates the dispute
- Accepted revisions update the benchmark before models are evaluated
This iterative process transforms benchmarking from a static snapshot into a dynamic, self-improving system. Across four AtS rounds, expert accuracy on the micro-gold set rose dramatically from 60.8% to 90.9%—demonstrating that experts are substantially more reliable as auditors reviewing evidence than as one-shot labelers making isolated judgments.
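The feedback loop above can be sketched in a few lines. Everything here is illustrative rather than taken from the paper: the `Claim` record, the `audit_then_score` function, and the auditor callback are hypothetical names for the roles the AtS process describes.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    label: bool          # current benchmark label (explicitly revisable)
    rationale: str       # auditable justification for the label
    history: list = field(default_factory=list)  # prior (label, rationale) versions

def audit_then_score(claim, agent_label, agent_evidence, auditor):
    """One AtS round for a single claim: a verification agent that
    disagrees with the benchmark must submit evidence; a human auditor
    adjudicates, and accepted revisions update the benchmark before
    any model is scored against it."""
    if agent_label == claim.label:
        return claim  # no dispute; the label stands
    if auditor(claim, agent_label, agent_evidence):
        # Auditor accepts the revision: archive the old label, adopt the new one.
        claim.history.append((claim.label, claim.rationale))
        claim.label = agent_label
        claim.rationale = agent_evidence
    return claim
```

The key design point the paper's results suggest is that the human sits in the auditor role, reviewing evidence for a dispute, rather than producing one-shot labels in isolation.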
DeepFact Implementation
The researchers instantiated their approach in two concrete systems:

DeepFact-Bench: A versioned DRR factuality benchmark with auditable rationales that evolves over time as new evidence emerges and disagreements are resolved through the AtS process.
DeepFact-Eval: A document-level verification agent (with a grouped "lite" variant) that outperforms existing verifiers on DeepFact-Bench and shows strong transfer capabilities to external factuality datasets.
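A versioned benchmark of this kind might be represented as an append-only list of snapshots, so every evaluation can cite the exact version it was scored against. This is a minimal sketch of that idea; the paper does not specify its storage format, and the `revise_benchmark` function and its field names are assumptions for illustration.

```python
import copy

def revise_benchmark(bench, claim_id, new_label, new_rationale):
    """Create a new benchmark version in which one claim's label and
    rationale are replaced. Earlier versions are kept intact, so past
    evaluation results remain reproducible against the version they used."""
    latest = bench["versions"][-1]
    revised = copy.deepcopy(latest)
    revised["claims"][claim_id] = {"label": new_label, "rationale": new_rationale}
    bench["versions"].append(revised)
    return len(bench["versions"]) - 1  # index of the new version
```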
Parallel Developments in Computer Vision
This shift in benchmarking methodology coincides with parallel developments in computer vision. Another arXiv paper (arXiv:2603.05729v1) addresses a similar limitation in traditional evaluation: the single-label assumption in the iconic ImageNet benchmark.

Researchers developed an automated pipeline to convert the ImageNet training set into a multi-label dataset without human annotation, using self-supervised Vision Transformers for unsupervised object discovery. Models trained with these multi-label annotations showed consistent improvements: up to +2.0 points of top-1 accuracy on ReaL and +1.5 on ImageNet-V2, along with stronger transferability to downstream tasks.
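Mechanically, moving from single-label to multi-label training usually means replacing one-hot targets and softmax cross-entropy with multi-hot targets and per-class binary cross-entropy. The sketch below shows that standard recipe in NumPy; it is a generic illustration of the technique, not the paper's actual pipeline, and the function names are invented here.

```python
import numpy as np

def to_multi_hot(label_sets, num_classes):
    """Convert per-image sets of discovered class indices into multi-hot
    target vectors, replacing the one-hot, single-label assumption."""
    targets = np.zeros((len(label_sets), num_classes))
    for i, labels in enumerate(label_sets):
        targets[i, list(labels)] = 1.0
    return targets

def bce_with_logits(logits, targets):
    """Mean element-wise binary cross-entropy on raw logits, the usual
    loss once targets are multi-hot rather than one-hot."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # guard against log(0)
    return -np.mean(targets * np.log(probs + eps)
                    + (1 - targets) * np.log(1 - probs + eps))
```

Each class is scored independently, so an image containing both a dog and a ball can supervise both labels at once instead of forcing an arbitrary choice between them.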
Implications for AI Development
These developments signal a paradigm shift in how we evaluate and improve AI systems. The traditional approach of creating static benchmarks, training models against them, and declaring victory when metrics improve has shown its limitations. The co-evolution approach recognizes that both the systems being evaluated and the standards for evaluation must improve together.
For the field of AI fact-checking specifically, DeepFact offers a path toward more reliable verification of complex research claims. The framework acknowledges that factual verification is often iterative and evidence-dependent rather than binary and immediate.
The Future of AI Evaluation
The DeepFact approach suggests several important directions for future research:
- Dynamic benchmarking across more AI domains beyond fact-checking
- Automated auditing systems that could scale the AtS process
- Integration with continuous learning systems that update based on benchmark evolution
- Cross-domain applications of the co-evolution principle
As AI systems grow more sophisticated, our methods for evaluating them must evolve in tandem. The DeepFact framework represents a significant step toward benchmarks that reflect the complexity of real-world tasks and the iterative nature of human knowledge refinement.
Source: arXiv:2603.05912v1 "DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality" and arXiv:2603.05729v1 (ImageNet multi-label annotation)


