ByteDance, Tsinghua & Peking University Introduce HACPO: Heterogeneous Agent Collaborative Reinforcement Learning
A research team from ByteDance, Tsinghua University, and Peking University has introduced a new collaborative learning framework called Heterogeneous Agent Collaborative Policy Optimization (HACPO), which enables diverse AI agents to share experiences during training without requiring coordination during deployment. The method addresses a fundamental limitation in traditional reinforcement learning where agents typically learn in isolation, wasting valuable training time and failing to leverage collective knowledge.
What the Researchers Built
The team developed HACPO as a novel collaborative reinforcement learning framework designed specifically for heterogeneous agents—AI systems with different architectures, capabilities, or objectives. Unlike traditional multi-agent systems where agents must coordinate during both training and execution, HACPO enables agents to share experiences during the training phase only, allowing them to operate independently during actual deployment.
The core innovation lies in creating a mechanism where agents with different skill levels, architectures, or task specializations can exchange their learned experiences through a carefully designed sharing protocol. This addresses the common problem where agents waste time making repetitive mistakes that other agents in the system have already learned to avoid.
Key Results
According to the research paper, HACPO demonstrates significant performance improvements across multiple benchmark environments:
| Benchmark | Isolated training | With HACPO | Improvement |
|---|---|---|---|
| Multi-Agent Particle | 65.2% success rate | 89.7% success rate | +24.5% |
| StarCraft II (Hard) | 72.1% win rate | 84.3% win rate | +12.2% |
| Robotic Manipulation | 58.3% completion | 81.9% completion | +23.6% |

These results show consistent improvements of roughly 12 to 25 percentage points across task domains, with the largest gains observed in complex environments where agents face diverse challenges.
How HACPO Works
The HACPO framework operates through three key components:
Experience Buffer Sharing: Each agent maintains its own replay buffer of experiences (state-action-reward-next_state tuples). HACPO creates a mechanism for selectively sharing high-value experiences between agents based on their potential usefulness to others.
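As a concrete sketch of the per-agent buffer described above (the `Transition` name and field names are our own labels, not the paper's):

```python
from collections import namedtuple

# One experience tuple, as described above (field names are our own).
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state'])

# Each agent keeps its own replay buffer of transitions.
replay_buffers = {'agent_1': [], 'agent_2': []}

replay_buffers['agent_1'].append(
    Transition(state=[0.0, 1.0], action=2, reward=0.5, next_state=[0.1, 0.9]))
```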
Heterogeneity-Aware Filtering: Not all experiences are equally valuable across different agents. The system includes a filtering mechanism that evaluates which experiences from one agent would be most beneficial to another, considering their architectural differences and current skill levels.
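The exact scoring rule is not spelled out here; one plausible stand-in scores each experience by the target agent's TD error, so the target receives the experiences it currently predicts worst. A minimal sketch under that assumption:

```python
def td_error_relevance(experience, target_value_fn, gamma=0.99):
    """Score an experience by the TARGET agent's TD error: experiences
    the target mispredicts are the ones it can learn most from."""
    state, action, reward, next_state = experience
    td_target = reward + gamma * target_value_fn(next_state)
    return abs(td_target - target_value_fn(state))

# Toy scalar value function; a real agent would use its learned critic.
value_fn = lambda s: 0.5 * s

surprising = (1.0, 0, 1.0, 2.0)   # (state, action, reward, next_state)
expected = (1.0, 0, 0.0, 1.0)
scores = [td_error_relevance(e, value_fn) for e in (surprising, expected)]
```

The surprising transition (an unexpected reward) scores far higher than the one the target's value function already explains, so it is the one worth sharing.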
Bidirectional Learning Protocol: Unlike traditional teacher-student approaches where learning flows in one direction, HACPO enables all agents to both teach and learn simultaneously. This creates a collaborative ecosystem where even initially weaker agents can contribute valuable niche experiences.
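Under this protocol, one training round can be sketched as sharing over every ordered pair of agents, so each agent acts as both teacher and student (the loop structure is our illustration, not the paper's scheduling):

```python
def bidirectional_round(agents, share_fn):
    """One collaborative round: run sharing over every ordered pair of
    distinct agents, so each agent both teaches and learns."""
    for source in agents:
        for target in agents:
            if source != target:
                share_fn(source, target)

# Record which (source, target) pairs a round produces for three agents.
pairs = []
bidirectional_round(['a', 'b', 'c'], lambda s, t: pairs.append((s, t)))
```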
The algorithm manages the sharing process through a carefully designed optimization objective that balances two competing goals: maximizing collective knowledge transfer while preserving each agent's individual specialization and preventing negative transfer (where sharing actually harms performance).
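The paper's objective is not reproduced verbatim here, but the trade-off it describes can be sketched as a weighted sum, where the coefficients `lam` and `beta` are hypothetical:

```python
def hacpo_style_objective(own_loss, transfer_gain, drift_penalty,
                          lam=0.3, beta=0.1):
    """Illustrative trade-off: minimize the agent's own RL loss, reward
    useful knowledge transfer, and penalize drifting away from the
    agent's specialization (the source of negative transfer)."""
    return own_loss - lam * transfer_gain + beta * drift_penalty

baseline = hacpo_style_objective(1.0, 0.0, 0.0)
with_transfer = hacpo_style_objective(1.0, 1.0, 0.0)
```

Positive transfer lowers the objective, while the drift penalty keeps an agent from abandoning its own specialization just to absorb shared data.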
Technical Implementation Details
HACPO builds on proximal policy optimization (PPO) as its base reinforcement learning algorithm but extends it with several novel components:
```python
# Simplified conceptual structure of HACPO's experience sharing.
# The helper functions are illustrative placeholders, not the paper's
# exact implementation.
experience_buffer = {
    'agent_1': [],
    'agent_2': [],
    # ... other agents
}

def calculate_relevance(buffer, target_agent):
    """Placeholder score: here, simply each experience's reward magnitude."""
    return [abs(exp['reward']) for exp in buffer]

def select_top_k(buffer, scores, k=32):
    """Select the k experiences with the highest relevance scores."""
    ranked = sorted(range(len(buffer)), key=lambda i: scores[i], reverse=True)
    return [buffer[i] for i in ranked[:k]]

def weight_experiences(experiences, weight=0.5):
    """Down-weight shared experiences relative to the agent's own data."""
    return [dict(exp, weight=weight) for exp in experiences]

def share_experiences(source_agent, target_agent):
    """Selectively share experiences between agents."""
    source_buffer = experience_buffer[source_agent]
    target_buffer = experience_buffer[target_agent]
    # Calculate relevance scores for each experience
    relevance_scores = calculate_relevance(source_buffer, target_agent)
    # Select the top-k most relevant experiences
    shared_experiences = select_top_k(source_buffer, relevance_scores)
    # Add to the target buffer with appropriate weighting
    target_buffer.extend(weight_experiences(shared_experiences))
```
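Since HACPO builds on PPO, each agent's underlying policy update still uses PPO's standard clipped surrogate; for a single sample with probability ratio `r` and advantage `A`, that objective looks like:

```python
def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized):
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio far above 1 + eps is clipped, capping the incentive to keep
# moving the policy in the same direction.
capped = ppo_clip_surrogate(1.5, 1.0)
```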
The system includes mechanisms to handle the heterogeneity challenge—the fact that different agents may have different observation spaces, action spaces, or internal representations. HACPO addresses this through learned mappings that translate experiences between different agent representations.
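The paper's mapping architecture isn't detailed here; as a minimal sketch, assume a learned linear map that translates a source agent's 2-D observation into a target agent's 3-D observation space (the weights below are fixed toy values where training would normally set them):

```python
def map_observation(obs, weights, bias):
    """Apply a (nominally learned) linear map translating a source
    agent's observation into the target agent's observation space."""
    return [sum(w * x for w, x in zip(row, obs)) + b
            for row, b in zip(weights, bias)]

# Toy 2-D -> 3-D mapping; in HACPO-style training these parameters
# would be learned jointly with the policies.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.0, 0.0, 0.1]
mapped = map_observation([0.5, 0.25], W, b)
```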
Why This Matters
Traditional AI agent training suffers from several limitations that HACPO addresses:
- Inefficient Learning: Agents often waste training time rediscovering solutions that other agents have already found.
- Specialization Loss: Collaborative approaches sometimes cause agents to lose their unique capabilities as they converge to similar behaviors.
- Deployment Complexity: Many multi-agent systems require ongoing coordination during execution, limiting their practical applicability.
HACPO demonstrates that agents can maintain their individual specializations while still benefiting from collective wisdom. This has practical implications for real-world applications where different systems (robots, software agents, autonomous vehicles) might need to learn from each other's experiences without being permanently connected.
The research shows particular promise for applications where:
- Different robotic platforms need to learn manipulation skills
- Autonomous vehicles with different sensors share driving experiences
- Specialized AI assistants learn from each other's user interactions
Limitations and Future Work
The current implementation has several limitations noted by the researchers:
- Scalability: The experience sharing mechanism becomes computationally expensive as the number of agents grows beyond a certain point.
- Extreme Heterogeneity: The method works best when agents have some degree of similarity; completely dissimilar agents (e.g., a drone and a submarine) may not benefit as much.
- Dynamic Environments: The approach assumes relatively stable environments; highly dynamic settings where optimal strategies change rapidly may require different sharing protocols.
The researchers suggest several directions for future work, including adaptive sharing mechanisms that adjust based on learning progress, hierarchical sharing structures for large-scale systems, and applications to real-world robotics where physical differences between platforms create additional challenges.
gentic.news Analysis
HACPO represents a significant step toward more efficient and practical multi-agent learning systems. While the concept of experience sharing isn't new, the specific implementation for heterogeneous agents that operate independently during deployment addresses a critical gap in current research. Most prior work in this area either focused on homogeneous agents or required ongoing coordination, limiting real-world applicability.
The 12-25 percentage-point improvements reported are substantial for reinforcement learning benchmarks, where even single-digit gains are often considered meaningful. What's particularly interesting is that these improvements come without the computational overhead of traditional multi-agent coordination during execution: agents train collaboratively but deploy independently.
From an industry perspective, this research has immediate implications for companies developing multiple AI systems that need to learn related skills. Consider autonomous vehicle companies testing different sensor configurations or robotics firms with various hardware platforms. HACPO provides a framework for these different systems to learn from each other's experiences without requiring them to use identical software stacks.
The ByteDance involvement is noteworthy given their practical experience with large-scale AI systems. While academic institutions often drive theoretical advances, industry participation in such research suggests immediate practical applications. We expect to see variants of this approach implemented in production systems within 12-18 months, particularly in robotics and autonomous systems where training data is expensive to collect.
Frequently Asked Questions
What is HACPO and how does it differ from traditional multi-agent reinforcement learning?
HACPO (Heterogeneous Agent Collaborative Policy Optimization) is a collaborative reinforcement learning framework that allows different AI agents to share experiences during training while operating independently during deployment. Unlike traditional multi-agent systems that require ongoing coordination, HACPO enables agents to learn from each other's experiences during the training phase only, making it more practical for real-world applications where constant communication isn't feasible.
What types of agents can benefit from HACPO?
HACPO is designed for heterogeneous agents—systems with different architectures, capabilities, or objectives. This could include different robotic platforms with varying sensors and actuators, AI assistants with different specializations, or autonomous systems operating in different environments. The key requirement is that the agents are learning related tasks where experiences from one agent could be valuable to another.
How significant are the performance improvements with HACPO?
The research reports performance improvements of roughly 12 to 25 percentage points on benchmark tasks compared to agents training in isolation. The exact improvement depends on the specific task and on how related the agents' learning objectives are. More similar agents tend to benefit more from experience sharing, but even moderately related agents show meaningful gains.
What are the main limitations of the HACPO approach?
Current limitations include scalability challenges with large numbers of agents, reduced effectiveness with extremely dissimilar agents, and assumptions about relatively stable environments. The experience sharing mechanism also adds computational overhead during training, though this is offset by faster learning convergence. The researchers are working on adaptive mechanisms to address these limitations in future versions.
Paper: "Heterogeneous Agent Collaborative Reinforcement Learning" available at arXiv:2603.02604





