Building AI agents that can autonomously conduct machine learning research over hours or days—not just respond to the next prompt—requires solving a fundamental problem: state management. A new paper introduces AiScientist, a system designed for "autonomous long-horizon engineering for ML research." Its core innovation isn't a more powerful reasoning model, but a persistent architectural choice: treating the file system as a communication bus and durable memory.
According to researcher Omar Sarraf, whose tweet thread announced the work, "Long-horizon AI research agents are mostly a state-management problem." The challenge goes beyond single-turn reasoning. Real ML research involves an extended workflow: task setup, implementation, running experiments, debugging, and tracking evidence across a timeline that can span days. An agent that forgets context between steps or cannot reference prior work is doomed to fail.
What the Researchers Built
AiScientist is built on a principle of "thin control, thick state." A lightweight top-level orchestrator manages progress through high-level stages of a research task. It delegates actual work to specialized sub-agents (e.g., for analysis, planning, coding, experimentation). The critical design is that these agents don't just chat with the orchestrator; they repeatedly "ground themselves in durable workspace artifacts."
These artifacts—analysis documents, plans, code files, execution logs, and experimental results—are stored as files in a shared workspace. This creates the "File-as-Bus" design: the file system acts as the central communication channel and persistent memory for the entire multi-agent system.
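The pattern is simple enough to sketch in a few lines. The following is an illustrative sketch only: the file names (`plan.json`) and helper functions (`write_artifact`, `read_artifact`) are assumptions for demonstration, not AiScientist's actual API.

```python
import json
from pathlib import Path

# The shared workspace directory acts as the "bus": agents never pass
# state in memory; they publish and consume named artifact files.
WORKSPACE = Path("workspace")

def write_artifact(name: str, payload: dict) -> Path:
    """Persist an agent's output as a durable artifact on the bus."""
    WORKSPACE.mkdir(exist_ok=True)
    path = WORKSPACE / name
    path.write_text(json.dumps(payload, indent=2))
    return path

def read_artifact(name: str) -> dict:
    """Ground an agent in prior state before it acts."""
    return json.loads((WORKSPACE / name).read_text())

# A planning agent publishes a plan; a coding agent invoked later
# grounds itself in that file rather than in a chat transcript.
write_artifact("plan.json", {"stage": "implement", "tasks": ["write train.py"]})
plan = read_artifact("plan.json")
```

Because the artifact survives the process that wrote it, the consuming agent can run minutes or days later without any shared memory.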
Key Results
The paper evaluates AiScientist on two benchmarks for autonomous ML research agents:
| Benchmark | Score | Baseline | Gain vs. Baseline | Without File-as-Bus |
|---|---|---|---|---|
| PaperBench | Not specified | Not specified | +10.54 points | −6.41 points |
| MLE-Bench Lite | 81.82% (Any Medal %) | Not specified | Not specified | −31.82 points |

The results highlight the disproportionate importance of the file-based architecture. On MLE-Bench Lite, removing the File-as-Bus design caused performance to plummet by 31.82 percentage points, demonstrating that much of the system's capability stems from this persistent state management, not just the reasoning of the individual agents.
How It Works: The File-as-Bus in Practice
Imagine a human researcher working on an ML project. They don't hold every detail in their head. They write notes in a notebook, save code versions, record experiment results in spreadsheets, and review logs when debugging. The file system is their externalized, durable memory.
AiScientist operationalizes this for AI agents:
- Orchestrator Sets Stage: The top-level agent determines the current phase (e.g., "analyze problem," "implement solution," "run experiment").
- Specialist Agent is Activated: A relevant sub-agent (e.g., the coding agent) is invoked.
- Grounding in Files: Before acting, the agent reads the relevant files from the workspace: the project plan, previous analysis, existing code, last experiment's config file.
- Execution and Artifact Creation: The agent performs its task, and its output is written back to the file system—a new analysis document, an updated script, a results JSON file.
- State Advancement: The orchestrator observes the new artifact, updates its stage tracking, and triggers the next step.
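The loop above can be sketched as a thin orchestrator over thick file state. This is a hypothetical illustration: the stage names, artifact names, and agent callback are assumptions, not details from the paper.

```python
from pathlib import Path

# Illustrative "thin control, thick state" loop. All names are assumed.
STAGES = ["analyze", "plan", "implement", "experiment"]
ARTIFACTS = {"analyze": "analysis.md", "plan": "plan.md",
             "implement": "solution.py", "experiment": "results.json"}

def step(workspace: Path, run_agent) -> str:
    """Thin orchestrator: find the first stage whose artifact is missing,
    invoke a specialist for it, and let the new file advance the state."""
    workspace.mkdir(exist_ok=True)
    for stage in STAGES:
        artifact = workspace / ARTIFACTS[stage]
        if not artifact.exists():
            run_agent(stage, workspace, artifact)  # specialist writes the artifact
            return stage  # one step per call; the caller loops until done
    return "done"

def dummy_agent(stage: str, workspace: Path, artifact: Path) -> None:
    # Stand-in specialist: reads whatever is already on the bus, writes its output.
    prior = sorted(p.name for p in workspace.iterdir())
    artifact.write_text(f"{stage} output, grounded in {prior}")
```

Note that the orchestrator holds no task state of its own: restarting it against the same workspace resumes exactly where the previous run stopped, because progress is inferred from which artifacts exist.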
This loop ensures every agent decision is made in the context of the entire project history, stored concretely in files. It prevents the common failure mode of conversational agents, where context is lost in long chat histories or between sessions.
Why It Matters: Beyond Longer Context Windows
The AI industry has largely approached the "long-horizon" problem by scaling context windows in large language models (LLMs). The premise is that if you can fit the entire conversation history into the prompt, the model won't forget. AiScientist argues this is insufficient for complex, iterative tasks like research.
Durability and Referential Integrity: Files persist beyond a single LLM session or API call. An agent can be stopped and restarted days later, load the workspace, and continue. Code can be version-controlled. Results can be compared across multiple file versions.
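A small sketch of the restart property, under assumed artifact names (the ordered pipeline below is illustrative, not from the paper):

```python
from pathlib import Path

# Assumed artifact order for a research task; state is inferred from files.
PIPELINE = ["analysis.md", "plan.md", "solution.py", "results.json"]

def resume_point(workspace: Path):
    """Return the first missing artifact, i.e. where a restarted agent resumes."""
    for name in PIPELINE:
        if not (workspace / name).exists():
            return name
    return None  # all artifacts present: the task is complete

# Simulate a run that was interrupted after the planning stage.
ws = Path("restart_demo_ws")
ws.mkdir(exist_ok=True)
(ws / "analysis.md").write_text("done earlier")
(ws / "plan.md").write_text("done earlier")
```

A freshly started agent needs no conversation history to continue; the workspace itself encodes what remains to be done.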
Human-AI Collaboration: A file-based workspace is inherently interpretable by humans. A researcher can inspect the plan, review the generated code, or check the experiment logs directly. This aligns with the growing emphasis on human-in-the-loop AI systems, where transparency and oversight are critical.
Specialization over Monoliths: The architecture encourages using smaller, specialized models or prompts for different sub-tasks (coding vs. analysis), all coordinated through the shared file state, rather than relying on a single monolithic model to do everything in one pass.
gentic.news Analysis
This work directly confronts a critical bottleneck in the push toward autonomous AI research agents, a field that has seen intense interest following autonomous coding projects like GPT-Engineer and Devin. The dominant paradigm has been to chain LLM calls with in-memory state, which fails at true long-horizon tasks. AiScientist's "File-as-Bus" is a pragmatic, engineering-focused solution that borrows from decades of distributed systems design, where a shared, persistent message bus is a standard pattern for reliability.
It also serves as a counterpoint to the industry's relentless focus on next-token prediction and context length. As we covered in our analysis of Google's Gemini 1.5 Pro and its 1M token context, simply having a long memory isn't the same as having an effective, structured, and actionable memory. AiScientist shows that for complex agentic workflows, external, tool-accessible state is a more powerful lever than internal context. This aligns with emerging trends in embodied AI and robotics, where an agent's understanding is grounded in its sensorimotor interaction with a persistent environment, not just its internal model.
Looking forward, the principle of "thick state" suggests the next frontier for AI engineering may lie in optimizing state management systems—databases, vector stores, and file systems designed for AI agent interaction—rather than just the models themselves. The 31.82-point performance drop on MLE-Bench Lite without the file system is a stark quantitative reminder: an agent's intelligence is only as good as its memory.
Frequently Asked Questions
What is the 'File-as-Bus' architecture?
The "File-as-Bus" architecture is a design pattern where a shared file system acts as the central communication channel and persistent memory for a multi-agent AI system. Instead of agents passing messages directly to each other in volatile memory, they read from and write to files (code, plans, logs, results). This creates a durable, inspectable, and restartable record of the entire project's state that all agents can reference.
How does AiScientist compare to other AI research agents?
AiScientist distinguishes itself by its explicit focus on long-horizon state management through a persistent file workspace. Many other research or coding agents (like early versions of GPT-Engineer or Devin) operate more conversationally, with state held in a long chat history or short-term memory. The paper's benchmarks show that this file-based approach yields significant performance gains (e.g., +10.54 points on PaperBench) over matched baselines that likely use more conventional in-memory state tracking.
What are PaperBench and MLE-Bench Lite?
PaperBench and MLE-Bench Lite are benchmarks for evaluating autonomous machine learning research agents. They test an agent's ability to complete end-to-end ML research tasks, such as reproducing results from a research paper or conducting a novel machine learning experiment from scratch. These benchmarks measure success across multiple stages (planning, coding, experimentation, analysis), making them suitable for evaluating long-horizon capabilities.
Can I use or build upon the AiScientist code?
The source tweet includes a link to the paper (https://t.co/A84c75oumP). Typically, research papers of this nature are accompanied by code released on platforms like GitHub, though the tweet does not explicitly confirm this. The core concept—using a file system as durable state for agents—is an architectural pattern that can be implemented independently, even if the specific AiScientist code is not publicly available.