Memento-Skills Agent System Achieves 116.2% Relative Improvement on Humanity's Last Exam Without LLM Updates

Memento-Skills is a generalist agent system that autonomously constructs and adapts task-specific agents through experience. It enables continual learning without updating LLM parameters, achieving 26.2% and 116.2% relative improvements on GAIA and Humanity's Last Exam benchmarks.

6h ago · 3 min read · via @HuggingPapers

What Happened

Researchers have introduced Memento-Skills, a generalist agent system that autonomously constructs and adapts task-specific agents through accumulated experience. The core innovation is a framework that enables continual learning without updating the underlying large language model (LLM) parameters.

According to the announcement, the system achieves a 26.2% relative improvement on the GAIA benchmark and a 116.2% relative improvement on Humanity's Last Exam. These gains come from the system's ability to design specialized agents for specific tasks based on past interactions, rather than from fine-tuning or parameter updates to the base LLM.

Context

The development addresses a fundamental challenge in AI agent systems: how to adapt general-purpose models to specific, evolving tasks without the computational cost and catastrophic forgetting risks associated with continual fine-tuning. Most current approaches either require task-specific fine-tuning (which doesn't scale) or rely on prompt engineering within a static model.

Memento-Skills represents a different approach where the system itself becomes a meta-agent that designs and deploys specialized sub-agents based on accumulated knowledge. This "agents designing agents" paradigm could enable more efficient adaptation to new domains while preserving the general capabilities of the base model.

The reported benchmarks are significant:

  • GAIA: A challenging benchmark testing general AI assistants on real-world tasks requiring reasoning, tool use, and multi-step planning
  • Humanity's Last Exam: A comprehensive evaluation of AI capabilities across reasoning, knowledge, and problem-solving

The 116.2% improvement on Humanity's Last Exam suggests the system is particularly effective at complex, multi-faceted tasks that benefit from specialized agent design.
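Since the announcement reports relative rather than absolute gains, it is worth being precise about what that means. The numbers below are purely illustrative (the announcement does not state the baseline accuracies); they only show how a relative figure maps onto absolute scores:

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage gain of `new` over `baseline` (both as accuracy percentages)."""
    return (new - baseline) / baseline * 100

# Illustrative only: a jump from 10.0% to 21.62% accuracy is a 116.2%
# relative improvement, even though the absolute gain is 11.62 points.
assert round(relative_improvement(10.0, 21.62), 1) == 116.2

# Likewise, 20.0% -> 25.24% would correspond to the reported 26.2% figure.
assert round(relative_improvement(20.0, 25.24), 2) == 26.2
```

This is why the baseline matters so much for interpretation: large relative improvements are easiest to achieve when the baseline score is low.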

Technical Approach

While the source doesn't provide architectural details, the core mechanism appears to be a skill library or memory system where the meta-agent stores and retrieves successful agent designs. When encountering a new task, the system:

  1. Analyzes the task requirements
  2. Retrieves relevant past agent designs from memory
  3. Adapts or composes these designs into a task-specific agent
  4. Executes the task with the specialized agent
  5. Updates the memory with successful designs for future use

This approach avoids the need to modify the base LLM's weights while still enabling the system to improve over time through experience accumulation.
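The five-step loop above can be sketched in code. Everything here is an assumption layered on the announcement's description, not the paper's actual implementation: the `Skill`/`SkillLibrary` names, the word-overlap retrieval (a real system would likely use embedding similarity), and the `run_agent` stub standing in for LLM execution are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A stored agent design: a task signature plus a prompt/toolset spec (assumed structure)."""
    task_signature: str
    agent_spec: str
    successes: int = 0

@dataclass
class SkillLibrary:
    skills: list = field(default_factory=list)

    def retrieve(self, task: str, top_k: int = 3):
        # Toy relevance: shared-word overlap between the task and stored
        # signatures. Stands in for whatever retrieval the real system uses.
        def overlap(skill: Skill) -> int:
            return len(set(task.lower().split()) & set(skill.task_signature.lower().split()))
        return sorted(self.skills, key=overlap, reverse=True)[:top_k]

    def update(self, skill: Skill, succeeded: bool) -> None:
        # Step 5: write successful designs back to memory for future reuse.
        if succeeded:
            skill.successes += 1
        if skill not in self.skills:
            self.skills.append(skill)

def run_agent(skill: Skill, task: str) -> bool:
    """Stub for step 4; a real system would execute the specialized agent via the LLM."""
    return True

def solve(task: str, library: SkillLibrary) -> bool:
    # Steps 1-3: analyze the task, retrieve prior designs, compose a specialist.
    candidates = library.retrieve(task)
    spec = candidates[0].agent_spec if candidates else "generalist agent"
    specialist = Skill(task_signature=task, agent_spec=spec)
    # Steps 4-5: execute, then record the outcome in memory.
    succeeded = run_agent(specialist, task)
    library.update(specialist, succeeded)
    return succeeded
```

The key property this sketch illustrates is that learning happens entirely in the library: the "model" (here, `run_agent`) is never modified, yet each solved task changes what the system retrieves and composes next time.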

Limitations and Unknowns

The announcement lacks several key details:

  • Specific architecture and implementation details
  • Computational overhead of the meta-agent system
  • Performance on standard agent benchmarks beyond the two mentioned
  • Comparison to other continual learning approaches
  • Which baseline the relative improvements are measured against
  • Training data and evaluation methodology details

Without these details, it's difficult to assess the system's practical utility or how it compares to existing approaches like retrieval-augmented generation, prompt tuning, or adapter-based methods.

AI Analysis

The Memento-Skills approach represents an interesting middle ground between fine-tuning and prompt engineering. By treating agent designs as reusable components stored in memory, the system potentially offers more structured adaptation than simple prompt modification while avoiding the computational cost and forgetting issues of continual fine-tuning.

The reported 116.2% relative improvement on Humanity's Last Exam is striking, but the lack of baseline specification makes interpretation difficult. If this represents improvement over a zero-shot baseline with the same base model, it would be significant; if it's measured against a weaker baseline, the results might be less impressive. The GAIA improvement of 26.2% is more modest but still substantial for a benchmark designed to be challenging for current systems.

Practitioners should watch for the full paper release to understand the memory architecture and skill representation. Key questions include: How are agent designs encoded and stored? What is the retrieval mechanism? How does the system handle conflicting or overlapping skills? The computational overhead of the meta-reasoning layer will also be crucial for practical deployment: if designing agents takes significant time or resources, the approach may not be viable for real-time applications.
Original source: x.com
