A research team from Shanghai Jiao Tong University (SJTU) has demonstrated one of the longest continuous runs of an autonomous machine learning research agent to date. Their system, ML-Master 2.0, operated on MLE-Bench for a full 24 hours, achieving a 56.44% medal rate, a result described as "one of the strongest marks the benchmark has seen."
The work, highlighted in a paper titled "Towards Ultra-Long-Horizon Agentic Science," argues that the primary bottleneck for scaling autonomous research agents is not reasoning capability, but state management. The team's solution is an architecture called Hierarchical Cognitive Caching, which structures memory across three time horizons to prevent agents from repeating mistakes and stalling out over extended sessions.
Key Takeaways
- Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on MLE-Bench, achieving a 56.44% medal rate.
- The breakthrough centers on Hierarchical Cognitive Caching for state management, not reasoning, enabling long-horizon scientific workflows.
What the Researchers Built: Hierarchical Cognitive Caching

The core innovation is a memory architecture designed explicitly for the marathon, not the sprint, of scientific research. Most AI agents are built for short-horizon tasks, completing a single query or experiment in one session. ML-Master 2.0 is engineered to maintain coherence and learning over a full day of continuous operation.
The Hierarchical Cognitive Caching system breaks down as follows:
- Short-term Memory: Holds the immediate context for the current step or experiment (e.g., the code being written, the immediate error message).
- Medium-term Memory: Identifies and stores patterns across multiple experiments within a session (e.g., "hyperparameter X consistently leads to overfitting with this dataset family").
- Long-term Memory: Stores refined, validated knowledge that carries between sessions (e.g., "for image classification on noisy data, start with architecture Y").
This structure allows the agent to avoid cyclical failures—a common pitfall where an agent without structured memory forgets why an approach failed and retries it hours later.
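The three-tier breakdown above can be sketched in code. This is our own illustrative structure, not the paper's implementation; all class, method, and example names are hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three memory tiers described above.
# The API and the example strings are our own invention, not the paper's.
@dataclass
class HierarchicalMemory:
    short_term: list = field(default_factory=list)   # current step: code, errors
    medium_term: list = field(default_factory=list)  # cross-experiment patterns
    long_term: list = field(default_factory=list)    # validated, session-spanning knowledge

    def record_step(self, observation: str) -> None:
        """Every raw observation starts in short-term memory."""
        self.short_term.append(observation)

    def promote_pattern(self, pattern: str) -> None:
        """A pattern recurring across experiments moves to medium-term."""
        self.medium_term.append(pattern)

    def promote_knowledge(self, insight: str, validated: bool) -> None:
        """Only insights that pass validation persist across sessions."""
        if validated:
            self.long_term.append(insight)

mem = HierarchicalMemory()
mem.record_step("trial 3: lr=0.1 diverged")
mem.promote_pattern("lr >= 0.1 diverges on this dataset family")
mem.promote_knowledge("start lr search at 1e-3 for noisy image data", validated=True)
```

The point of the tiering is that the agent can consult the medium- and long-term lists before retrying an approach, which is exactly what prevents the cyclical failures described above.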
Key Results: A 24-Hour Marathon on MLE-Bench
The team evaluated ML-Master 2.0 on MLE-Bench, a benchmark designed to assess an AI agent's ability to perform real-world machine learning engineering tasks, such as model selection, hyperparameter tuning, and debugging. Performance is measured by a "medal rate"—the percentage of tasks where the agent's solution meets a high-quality threshold (akin to earning a gold, silver, or bronze medal).
At a glance:
- Duration: 24 hours of continuous, unattended operation
- Benchmark: MLE-Bench, an ML engineering task suite
- Medal rate: 56.44%, one of the strongest published marks
- Core architecture: Hierarchical Cognitive Caching, a three-tiered memory system

The 56.44% medal rate is significant. While not the 90%+ scores seen on narrower QA benchmarks, MLE-Bench tasks are complex, open-ended, and reflective of real research workflows. A score above 50% over a 24-hour period indicates an agent that can not only execute tasks but also learn and adapt its strategy over time without human intervention.
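The medal-rate metric itself is straightforward: the fraction of tasks whose solution clears the medal threshold. A toy computation, with invented per-task outcomes:

```python
# Hypothetical per-task outcomes: True if the agent's solution earned
# any medal (gold, silver, or bronze) under the benchmark's criteria.
# These eight values are toy data, not results from the paper.
outcomes = [True, False, True, True, False, True, False, True]

# Medal rate = medaled tasks / total tasks, as a percentage.
medal_rate = 100 * sum(outcomes) / len(outcomes)
print(f"medal rate: {medal_rate:.2f}%")  # prints "medal rate: 62.50%"
```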
How It Works: Treating Long-Horizon Agency as a State Problem
The paper's central thesis is provocative: "Long-horizon agents are not a reasoning problem; they are a state-management problem."
Most efforts in agentic AI focus on improving the reasoning loop—giving models better planning, tool-use, or chain-of-thought capabilities. The SJTU team argues that for ultra-long-horizon tasks (like multi-day research), even perfect reasoning will fail if the agent cannot maintain a coherent state. Without a system to cache, retrieve, and refine knowledge, the agent's context window becomes a leaky bucket.
Hierarchical Cognitive Caching is the plug. The system actively manages what information to keep, at what level of abstraction, and for how long. Medium-term memory might compress dozens of failed experiments into a single heuristic rule. Long-term memory undergoes a validation step before promotion, ensuring only robust insights persist. This creates a growing "playbook" for the agent, allowing it to approach new problems with accumulated wisdom rather than from scratch each time.
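The compress-then-validate promotion described above can be sketched as follows. The failure-log format, the rule strings, the recurrence threshold, and the stubbed validation check are all our assumptions, not details from the paper:

```python
from collections import Counter

# Toy log of failed experiments: (configuration, failure mode) pairs.
failures = [
    ("dropout=0.0", "overfit"),
    ("dropout=0.0", "overfit"),
    ("dropout=0.0", "overfit"),
    ("lr=0.1", "diverged"),
]

# Medium-term step: compress repeated failures into candidate heuristic
# rules. Here a failure must recur at least twice to become a candidate.
counts = Counter(failures)
candidates = {f"avoid {cfg}: {why}" for (cfg, why), n in counts.items() if n >= 2}

def validate(rule: str) -> bool:
    """Placeholder for a confirmation experiment that reruns the failure;
    a real system would re-test the rule before trusting it."""
    return True

# Long-term step: promote only rules that survive validation.
long_term = sorted(r for r in candidates if validate(r))
print(long_term)  # prints "['avoid dropout=0.0: overfit']"
```

The one-off divergence never becomes a rule, while the thrice-repeated overfitting failure is compressed into a single heuristic and promoted, which is the "playbook" behavior the paragraph describes.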
Why It Matters: The Path to Autonomous Research

This work is a concrete step toward autonomous science—AI systems that can formulate hypotheses, design and run experiments, analyze results, and iterate, potentially for weeks or months. The 24-hour run is a proof-of-concept for the memory architecture needed to make this feasible.
For AI engineers, the takeaway is architectural. Building agents for long tasks requires designing explicit state-management layers, not just relying on a large context window or hoping the reasoning model will "remember." This has immediate implications for anyone developing agents for code generation, data analysis, or systematic testing that lasts longer than a single chat session.
The result also validates MLE-Bench as a testing ground for agentic systems. As performance on shorter, simpler benchmarks saturates, the field needs harder evaluations that stress-test durability and long-term learning. MLE-Bench, with its medal rate metric over extended runs, fits that need.
Agentic.news Analysis
This development from Shanghai Jiao Tong University fits directly into the accelerating trend of agentic AI moving from proof-of-concept demos to sustained, operational systems. For much of 2024 and 2025, the discourse was dominated by flashy, short-task agents. The focus is now shifting to the unglamorous, systems-level challenges of reliability, statefulness, and cost over long time horizons—the exact problem ML-Master 2.0 tackles.
Hierarchical Cognitive Caching is a pragmatic answer to a limitation of current large language models: they are stateless by default. While providers like OpenAI and Anthropic have introduced some session-level memory features, they are generic and not optimized for the structured, iterative workflow of scientific research. SJTU's approach is a specialized, task-aware memory architecture, which may become a common pattern as agents are productized for specific verticals like coding, research, or logistics.
The 56.44% medal rate on MLE-Bench is a strong data point in the competitive landscape of AI coding and research agents. It places ML-Master 2.0 in the upper tier of published results, alongside systems like SWE-agent and projects from Cognition Labs. However, the real differentiator here isn't the score alone, but the 24-hour duration. Most benchmark runs are measured in minutes. This work suggests that with the right architecture, agents can maintain performance over a timescale that begins to match human research cycles.
Looking forward, the next hurdle is multi-modal and multi-tool state management. ML-Master 2.0 likely operated primarily in a code-centric environment. The true vision of agentic science involves managing state across wet-lab robotics, literature databases, simulation software, and coding environments simultaneously. Extending hierarchical caching to that heterogeneous, noisy reality is the next frontier.
Frequently Asked Questions
What is MLE-Bench?
MLE-Bench is a benchmark suite for evaluating machine learning engineering agents. It presents tasks that mirror real-world research and development workflows, such as fixing buggy model code, tuning hyperparameters for a given dataset, or selecting an appropriate architecture. Performance is scored via a "medal rate," which measures how often the agent's solution meets a high-quality bar.
How does Hierarchical Cognitive Caching differ from just using a large context window?
A large context window (like 1M tokens) lets an agent see more of its past, but it doesn't help it organize or learn from that past. Hierarchical Cognitive Caching actively structures memory into short, medium, and long-term tiers, compressing experiences into heuristics and validated knowledge. This is more efficient and effective than dumping the raw transcript of the last 20 hours into the prompt.
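The efficiency argument can be made concrete with a toy comparison. The transcript lines and the compressed heuristic below are invented; real savings depend on the tokenizer and the workload:

```python
# Toy comparison: raw transcript vs. compressed heuristics as prompt context.
# 500 invented log lines stand in for 20 hours of raw agent history.
raw_transcript = [f"step {i}: tried config, failed with OOM" for i in range(500)]

# The same history compressed into a single medium-term heuristic.
compressed = ["configs with batch_size > 64 hit OOM on this GPU"]

raw_chars = sum(len(line) for line in raw_transcript)
compressed_chars = sum(len(line) for line in compressed)
print(raw_chars, compressed_chars)  # the compressed form is far smaller
```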
What is a "medal rate"?
In the context of MLE-Bench, a medal rate is the percentage of tasks where the agent's final solution is deemed good enough to earn a gold, silver, or bronze medal (based on predefined evaluation criteria). It's a pass/fail style metric for solution quality, aggregated across many diverse tasks.
Can I use ML-Master 2.0 or its architecture now?
The research paper is publicly available (linked in the source tweet), so the architectural principles are open for anyone to implement. The specific ML-Master 2.0 system may not be released as a product, but the Hierarchical Cognitive Caching concept is a blueprint that AI engineers can adapt for building their own long-horizon agents.