A research team from Shanghai Jiao Tong University (SJTU) has demonstrated one of the longest continuous runs of an autonomous machine learning research agent to date. Their system, ML-Master 2.0, operated on MLE-Bench for a full 24 hours, achieving a 56.44% medal rate, a result described as "one of the strongest marks the benchmark has seen."
The work, highlighted in a paper titled "Towards Ultra-Long-Horizon Agentic Science," argues that the primary bottleneck for scaling autonomous research agents is not reasoning capability, but state management. The team's solution is an architecture called Hierarchical Cognitive Caching, which structures memory across three time horizons to prevent agents from repeating mistakes and stalling out over extended sessions.
Key Takeaways
- Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on MLE-Bench, achieving a 56.44% medal rate.
- The breakthrough centers on Hierarchical Cognitive Caching for state management, not reasoning, enabling long-horizon scientific workflows.
What the Researchers Built: Hierarchical Cognitive Caching

The core innovation is a memory architecture designed explicitly for the marathon, not the sprint, of scientific research. Most AI agents are built for short-horizon tasks, completing a single query or experiment in one session. ML-Master 2.0 is engineered to maintain coherence and learning over a full day of continuous operation.
The Hierarchical Cognitive Caching system breaks down as follows:
- Short-term Memory: Holds the immediate context for the current step or experiment (e.g., the code being written, the immediate error message).
- Medium-term Memory: Identifies and stores patterns across multiple experiments within a session (e.g., "hyperparameter X consistently leads to overfitting with this dataset family").
- Long-term Memory: Stores refined, validated knowledge that carries between sessions (e.g., "for image classification on noisy data, start with architecture Y").
This structure allows the agent to avoid cyclical failures—a common pitfall where an agent without structured memory forgets why an approach failed and retries it hours later.
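The three-tier breakdown above can be sketched in code. This is our own illustrative structure, not the paper's implementation; all class, method, and example names are hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three memory tiers described above.
# The API and the example strings are our own invention, not the paper's.
@dataclass
class HierarchicalMemory:
    short_term: list = field(default_factory=list)   # current step: code, errors
    medium_term: list = field(default_factory=list)  # cross-experiment patterns
    long_term: list = field(default_factory=list)    # validated, session-spanning knowledge

    def record_step(self, observation: str) -> None:
        """Every raw observation starts in short-term memory."""
        self.short_term.append(observation)

    def promote_pattern(self, pattern: str) -> None:
        """A pattern recurring across experiments moves to medium-term."""
        self.medium_term.append(pattern)

    def promote_knowledge(self, insight: str, validated: bool) -> None:
        """Only insights that pass validation persist across sessions."""
        if validated:
            self.long_term.append(insight)

mem = HierarchicalMemory()
mem.record_step("trial 3: lr=0.1 diverged")
mem.promote_pattern("lr >= 0.1 diverges on this dataset family")
mem.promote_knowledge("start lr search at 1e-3 for noisy image data", validated=True)
```

The point of the tiering is that the agent can consult the medium- and long-term lists before retrying an approach, which is exactly what prevents the cyclical failures described above.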
Key Results: A 24-Hour Marathon on MLE-Bench
The team evaluated ML-Master 2.0 on MLE-Bench, a benchmark designed to assess an AI agent's ability to perform real-world machine learning engineering tasks, such as model selection, hyperparameter tuning, and debugging. Performance is measured by a "medal rate"—the percentage of tasks where the agent's solution meets a high-quality threshold (akin to earning a gold, silver, or bronze medal).
At a glance:
- Duration: 24 hours of continuous, unattended operation
- Benchmark: MLE-Bench, an ML engineering task suite
- Medal rate: 56.44%, one of the strongest published marks
- Core architecture: Hierarchical Cognitive Caching, a three-tiered memory system

The 56.44% medal rate is significant. While not the 90%+ scores seen on narrower QA benchmarks, MLE-Bench tasks are complex, open-ended, and reflective of real research workflows. A score above 50% over a 24-hour period indicates an agent that can not only execute tasks but also learn and adapt its strategy over time without human intervention.
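The medal-rate metric itself is straightforward: the fraction of tasks whose solution clears the medal threshold. A toy computation, with invented per-task outcomes:

```python
# Hypothetical per-task outcomes: True if the agent's solution earned
# any medal (gold, silver, or bronze) under the benchmark's criteria.
# These eight values are toy data, not results from the paper.
outcomes = [True, False, True, True, False, True, False, True]

# Medal rate = medaled tasks / total tasks, as a percentage.
medal_rate = 100 * sum(outcomes) / len(outcomes)
print(f"medal rate: {medal_rate:.2f}%")  # prints "medal rate: 62.50%"
```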
How It Works: Treating Long-Horizon Agency as a State Problem
The paper's central thesis is provocative: "Long-horizon agents are not a reasoning problem; they are a state-management problem."
Most efforts in agentic AI focus on improving the reasoning loop—giving models better planning, tool-use, or chain-of-thought capabilities. The SJTU team argues that for ultra-long-horizon tasks (like multi-day research), even perfect reasoning will fail if the agent cannot maintain a coherent state. Without a system to cache, retrieve, and refine knowledge, the agent's context window becomes a leaky bucket.
Hierarchical Cognitive Caching is the plug. The system actively manages what information to keep, at what level of abstraction, and for how long. Medium-term memory might compress dozens of failed experiments into a single heuristic rule. Long-term memory undergoes a validation step before promotion, ensuring only robust insights persist. This creates a growing "playbook" for the agent, allowing it to approach new problems with accumulated wisdom rather than from scratch each time.
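The compress-then-validate promotion described above can be sketched as follows. The failure-log format, the rule strings, the recurrence threshold, and the stubbed validation check are all our assumptions, not details from the paper:

```python
from collections import Counter

# Toy log of failed experiments: (configuration, failure mode) pairs.
failures = [
    ("dropout=0.0", "overfit"),
    ("dropout=0.0", "overfit"),
    ("dropout=0.0", "overfit"),
    ("lr=0.1", "diverged"),
]

# Medium-term step: compress repeated failures into candidate heuristic
# rules. Here a failure must recur at least twice to become a candidate.
counts = Counter(failures)
candidates = {f"avoid {cfg}: {why}" for (cfg, why), n in counts.items() if n >= 2}

def validate(rule: str) -> bool:
    """Placeholder for a confirmation experiment that reruns the failure;
    a real system would re-test the rule before trusting it."""
    return True

# Long-term step: promote only rules that survive validation.
long_term = sorted(r for r in candidates if validate(r))
print(long_term)  # prints "['avoid dropout=0.0: overfit']"
```

The one-off divergence never becomes a rule, while the thrice-repeated overfitting failure is compressed into a single heuristic and promoted, which is the "playbook" behavior the paragraph describes.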
Why It Matters: The Path to Autonomous Research

This work is a concrete step toward autonomous science—AI systems that can formulate hypotheses, design and run experiments, analyze results, and iterate, potentially for weeks or months. The 24-hour run is a proof-of-concept for the memory architecture needed to make this feasible.
For AI engineers, the takeaway is architectural. Building agents for long tasks requires designing explicit state-management layers, not just relying on a large context window or hoping the reasoning model will "remember." This has immediate implications for anyone developing agents for code generation, data analysis, or systematic testing that lasts longer than a single chat session.
The result also validates MLE-Bench as a testing ground for agentic systems. As performance on shorter, simpler benchmarks saturates, the field needs harder evaluations that stress-test durability and long-term learning. MLE-Bench, with its medal rate metric over extended runs, fits that need.
Agentic.news Analysis
This development from Shanghai Jiao Tong University fits directly into the accelerating trend of agentic AI moving from proof-of-concept demos to sustained, operational systems. For much of 2024 and 2025, the discourse was dominated by flashy, short-task agents. The focus is now shifting to the unglamorous, systems-level challenges of reliability, statefulness, and cost over long time horizons—the exact problem ML-Master 2.0 tackles.
Hierarchical Cognitive Caching is a pragmatic answer to a limitation of current large language models: they are stateless by default. While providers like OpenAI and Anthropic have introduced some session-level memory features, they are generic and not optimized for the structured, iterative workflow of scientific research. SJTU's approach is a specialized, task-aware memory architecture, which may become a common pattern as agents are productized for specific verticals like coding, research, or logistics.
The 56.44% medal rate on MLE-Bench is a strong data point in the competitive landscape of AI coding and research agents. It places ML-Master 2.0 in the upper tier of published results, alongside systems like SWE-agent and projects from Cognition Labs. However, the real differentiator here isn't the score alone, but the 24-hour duration. Most benchmark runs are measured in minutes. This work suggests that with the right architecture, agents can maintain performance over a timescale that begins to match human research cycles.
Looking forward, the next hurdle is multi-modal and multi-tool state management. ML-Master 2.0 likely operated primarily in a code-centric environment. The true vision of agentic science involves managing state across wet-lab robotics, literature databases, simulation software, and coding environments simultaneously. Extending hierarchical caching to that heterogeneous, noisy reality is the next frontier.
Frequently Asked Questions
What is MLE-Bench?
MLE-Bench is a benchmark suite for evaluating machine learning engineering agents. It presents tasks that mirror real-world research and development workflows, such as fixing buggy model code, tuning hyperparameters for a given dataset, or selecting an appropriate architecture. Performance is scored via a "medal rate," which measures how often the agent's solution meets a high-quality bar.
How does Hierarchical Cognitive Caching differ from just using a large context window?
A large context window (like 1M tokens) lets an agent see more of its past, but it doesn't help it organize or learn from that past. Hierarchical Cognitive Caching actively structures memory into short, medium, and long-term tiers, compressing experiences into heuristics and validated knowledge. This is more efficient and effective than dumping the raw transcript of the last 20 hours into the prompt.
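The efficiency argument can be made concrete with a toy comparison. The transcript lines and the compressed heuristic below are invented; real savings depend on the tokenizer and the workload:

```python
# Toy comparison: raw transcript vs. compressed heuristics as prompt context.
# 500 invented log lines stand in for 20 hours of raw agent history.
raw_transcript = [f"step {i}: tried config, failed with OOM" for i in range(500)]

# The same history compressed into a single medium-term heuristic.
compressed = ["configs with batch_size > 64 hit OOM on this GPU"]

raw_chars = sum(len(line) for line in raw_transcript)
compressed_chars = sum(len(line) for line in compressed)
print(raw_chars, compressed_chars)  # the compressed form is far smaller
```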
What is a "medal rate"?
In the context of MLE-Bench, a medal rate is the percentage of tasks where the agent's final solution is deemed good enough to earn a gold, silver, or bronze medal (based on predefined evaluation criteria). It's a pass/fail style metric for solution quality, aggregated across many diverse tasks.
Can I use ML-Master 2.0 or its architecture now?
The research paper is publicly available (linked in the source tweet), so the architectural principles are open for anyone to implement. The specific ML-Master 2.0 system may not be released as a product, but the Hierarchical Cognitive Caching concept is a blueprint that AI engineers can adapt for building their own long-horizon agents.