Building AI agents that can autonomously conduct machine learning research over hours or days—not just respond to the next prompt—requires solving a fundamental problem: state management. A new paper introduces AiScientist, a system designed for "autonomous long-horizon engineering for ML research." Its core innovation isn't a more powerful reasoning model, but a persistent architectural choice: treating the file system as a communication bus and durable memory.
According to researcher Omar Sarraf, whose tweet thread announced the work, "Long-horizon AI research agents are mostly a state-management problem." The challenge goes beyond single-turn reasoning. Real ML research involves an extended workflow: task setup, implementation, running experiments, debugging, and tracking evidence across a timeline that can span days. An agent that forgets context between steps or cannot reference prior work is doomed to fail.
What the Researchers Built
AiScientist is built on a principle of "thin control, thick state." A lightweight top-level orchestrator manages progress through high-level stages of a research task. It delegates actual work to specialized sub-agents (e.g., for analysis, planning, coding, experimentation). The critical design is that these agents don't just chat with the orchestrator; they repeatedly "ground themselves in durable workspace artifacts."
These artifacts—analysis documents, plans, code files, execution logs, and experimental results—are stored as files in a shared workspace. This creates the "File-as-Bus" design: the file system acts as the central communication channel and persistent memory for the entire multi-agent system.
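The pattern is simple enough to sketch in a few lines. The following is an illustrative sketch only: the file names (`plan.json`) and helper functions (`write_artifact`, `read_artifact`) are assumptions for demonstration, not AiScientist's actual API.

```python
import json
from pathlib import Path

# The shared workspace directory acts as the "bus": agents never pass
# state in memory; they publish and consume named artifact files.
WORKSPACE = Path("workspace")

def write_artifact(name: str, payload: dict) -> Path:
    """Persist an agent's output as a durable artifact on the bus."""
    WORKSPACE.mkdir(exist_ok=True)
    path = WORKSPACE / name
    path.write_text(json.dumps(payload, indent=2))
    return path

def read_artifact(name: str) -> dict:
    """Ground an agent in prior state before it acts."""
    return json.loads((WORKSPACE / name).read_text())

# A planning agent publishes a plan; a coding agent invoked later
# grounds itself in that file rather than in a chat transcript.
write_artifact("plan.json", {"stage": "implement", "tasks": ["write train.py"]})
plan = read_artifact("plan.json")
```

Because the artifact survives the process that wrote it, the consuming agent can run minutes or days later without any shared memory.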
Key Results
The paper evaluates AiScientist on two benchmarks for autonomous ML research agents:
| Benchmark | Score | Baseline | Gain vs. Baseline | Without File-as-Bus |
|---|---|---|---|---|
| PaperBench | Not specified | Not specified | +10.54 points | −6.41 points |
| MLE-Bench Lite | 81.82% (Any Medal %) | Not specified | Not specified | −31.82 points |

The results highlight the disproportionate importance of the file-based architecture. On MLE-Bench Lite, removing the File-as-Bus design caused performance to plummet by 31.82 percentage points, demonstrating that much of the system's capability stems from this persistent state management, not just the reasoning of the individual agents.
How It Works: The File-as-Bus in Practice
Imagine a human researcher working on an ML project. They don't hold every detail in their head. They write notes in a notebook, save code versions, record experiment results in spreadsheets, and review logs when debugging. The file system is their externalized, durable memory.
AiScientist operationalizes this for AI agents:
- Orchestrator Sets Stage: The top-level agent determines the current phase (e.g., "analyze problem," "implement solution," "run experiment").
- Specialist Agent is Activated: A relevant sub-agent (e.g., the coding agent) is invoked.
- Grounding in Files: Before acting, the agent reads the relevant files from the workspace: the project plan, previous analysis, existing code, last experiment's config file.
- Execution and Artifact Creation: The agent performs its task, and its output is written back to the file system—a new analysis document, an updated script, a results JSON file.
- State Advancement: The orchestrator observes the new artifact, updates its stage tracking, and triggers the next step.
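The loop above can be sketched as a thin orchestrator over thick file state. This is a hypothetical illustration: the stage names, artifact names, and agent callback are assumptions, not details from the paper.

```python
from pathlib import Path

# Illustrative "thin control, thick state" loop. All names are assumed.
STAGES = ["analyze", "plan", "implement", "experiment"]
ARTIFACTS = {"analyze": "analysis.md", "plan": "plan.md",
             "implement": "solution.py", "experiment": "results.json"}

def step(workspace: Path, run_agent) -> str:
    """Thin orchestrator: find the first stage whose artifact is missing,
    invoke a specialist for it, and let the new file advance the state."""
    workspace.mkdir(exist_ok=True)
    for stage in STAGES:
        artifact = workspace / ARTIFACTS[stage]
        if not artifact.exists():
            run_agent(stage, workspace, artifact)  # specialist writes the artifact
            return stage  # one step per call; the caller loops until done
    return "done"

def dummy_agent(stage: str, workspace: Path, artifact: Path) -> None:
    # Stand-in specialist: reads whatever is already on the bus, writes its output.
    prior = sorted(p.name for p in workspace.iterdir())
    artifact.write_text(f"{stage} output, grounded in {prior}")
```

Note that the orchestrator holds no task state of its own: restarting it against the same workspace resumes exactly where the previous run stopped, because progress is inferred from which artifacts exist.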
This loop ensures every agent decision is made in the context of the entire project history, stored concretely in files. It prevents the common failure mode of conversational agents, where context is lost in long chat histories or between sessions.
Why It Matters: Beyond Longer Context Windows
The AI industry has largely approached the "long-horizon" problem by scaling context windows in large language models (LLMs). The premise is that if you can fit the entire conversation history into the prompt, the model won't forget. AiScientist argues this is insufficient for complex, iterative tasks like research.
Durability and Referential Integrity: Files persist beyond a single LLM session or API call. An agent can be stopped and restarted days later, load the workspace, and continue. Code can be version-controlled. Results can be compared across multiple file versions.
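A small sketch of the restart property, under assumed artifact names (the ordered pipeline below is illustrative, not from the paper):

```python
from pathlib import Path

# Assumed artifact order for a research task; state is inferred from files.
PIPELINE = ["analysis.md", "plan.md", "solution.py", "results.json"]

def resume_point(workspace: Path):
    """Return the first missing artifact, i.e. where a restarted agent resumes."""
    for name in PIPELINE:
        if not (workspace / name).exists():
            return name
    return None  # all artifacts present: the task is complete

# Simulate a run that was interrupted after the planning stage.
ws = Path("restart_demo_ws")
ws.mkdir(exist_ok=True)
(ws / "analysis.md").write_text("done earlier")
(ws / "plan.md").write_text("done earlier")
```

A freshly started agent needs no conversation history to continue; the workspace itself encodes what remains to be done.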
Human-AI Collaboration: A file-based workspace is inherently interpretable by humans. A researcher can inspect the plan, review the generated code, or check the experiment logs directly. This aligns with the growing emphasis on human-in-the-loop AI systems, where transparency and oversight are critical.
Specialization over Monoliths: The architecture encourages using smaller, specialized models or prompts for different sub-tasks (coding vs. analysis), all coordinated through the shared file state, rather than relying on a single monolithic model to do everything in one pass.
gentic.news Analysis
This work directly confronts a critical bottleneck in the push toward autonomous AI research agents, a field that has seen intense interest following autonomous coding projects like GPT-Engineer and Devin. The dominant paradigm has been to chain LLM calls with in-memory state, which fails at true long-horizon tasks. AiScientist's "File-as-Bus" is a pragmatic, engineering-focused solution that borrows from decades of distributed systems design, where a shared, persistent message bus is a standard pattern for reliability.
It also serves as a counterpoint to the industry's relentless focus on next-token prediction and context length. As we covered in our analysis of Google's Gemini 1.5 Pro and its 1M token context, simply having a long memory isn't the same as having an effective, structured, and actionable memory. AiScientist shows that for complex agentic workflows, external, tool-accessible state is a more powerful lever than internal context. This aligns with emerging trends in embodied AI and robotics, where an agent's understanding is grounded in its sensorimotor interaction with a persistent environment, not just its internal model.
Looking forward, the principle of "thick state" suggests the next frontier for AI engineering may lie in optimizing state management systems—databases, vector stores, and file systems designed for AI agent interaction—rather than just the models themselves. The 31.82-point performance drop on MLE-Bench Lite without the file system is a stark quantitative reminder: an agent's intelligence is only as good as its memory.
Frequently Asked Questions
What is the 'File-as-Bus' architecture?
The "File-as-Bus" architecture is a design pattern where a shared file system acts as the central communication channel and persistent memory for a multi-agent AI system. Instead of agents passing messages directly to each other in volatile memory, they read from and write to files (code, plans, logs, results). This creates a durable, inspectable, and restartable record of the entire project's state that all agents can reference.
How does AiScientist compare to other AI research agents?
AiScientist distinguishes itself by its explicit focus on long-horizon state management through a persistent file workspace. Many other research or coding agents (like early versions of GPT-Engineer or Devin) operate more conversationally, with state held in a long chat history or short-term memory. The paper's benchmarks show that this file-based approach yields significant performance gains (e.g., +10.54 points on PaperBench) over matched baselines that likely use more conventional in-memory state tracking.
What are PaperBench and MLE-Bench Lite?
PaperBench and MLE-Bench Lite are benchmarks for evaluating autonomous machine learning research agents. They test an agent's ability to complete end-to-end ML research tasks, such as reproducing results from a research paper or conducting a novel machine learning experiment from scratch. These benchmarks measure success across multiple stages (planning, coding, experimentation, analysis), making them suitable for evaluating long-horizon capabilities.
Can I use or build upon the AiScientist code?
The source tweet includes a link to the paper (https://t.co/A84c75oumP). Typically, research papers of this nature are accompanied by code released on platforms like GitHub, though the tweet does not explicitly confirm this. The core concept—using a file system as durable state for agents—is an architectural pattern that can be implemented independently, even if the specific AiScientist code is not publicly available.