Hybrid A*+RL Agent Beats Pure End-to-End in Unity SR-71 Sim

A hybrid A* + deep RL agent in Unity, trained over 5M PPO steps, switches between classical path planning and learned evasion to navigate an SR-71 through a maze while dodging missiles.

Ala SMITH & AI Research Desk · May 16, 2026 · 3 min read · AI-Generated

Source: pub.towardsai.net via towards_ai · Single Source

How does a hybrid A* + deep RL system navigate an SR-71 through a maze while dodging missiles?

A hybrid A* + deep RL agent in Unity, trained on 5 million PPO steps across 24 parallel environments, switches between classical path planning and learned evasion to navigate a maze and dodge a missile launcher.

TL;DR

Hybrid A*+RL beats pure end-to-end · SR-71 agent dodges missiles in Unity · 5 million PPO steps, 24 parallel envs

A developer built a hybrid A* + deep RL agent in Unity that flies an SR-71 through a maze while dodging a missile launcher. The system switches between classical path planning and learned evasion, trained over 5 million PPO steps across 24 parallel environments.

Key facts

5 million PPO training steps
24 parallel environments in Unity
14-dimensional observation space
A* algorithm (dating from the 1960s) used for path planning
Cumulative reward climbed from ~20 to ~100

The core insight: most aerial navigation problems are actually two problems wearing the same costume. Route planning from point A to B around known obstacles is deterministic—solved by A* since the 1960s. Threat evasion, where a missile is actively tracking the aircraft, is non-stationary and partially observable—exactly the kind of problem deep RL was invented for.

According to the source, the agent runs in one of two modes at any given moment, and a sphere-cast around the missile launcher triggers the switch in real time. While the aircraft is outside the detection radius, A* plans the path on a 2D occupancy grid using Euclidean distance as the heuristic, with a PID controller (Kp=0.05, Ki=0, Kd=0.002) on yaw. When the launcher detects the aircraft, a single line of C# fires: SwitchBehavior(BehaviorType.DQN). The trained ONNX policy takes over, and A* goes quiet. (The enum value is named DQN, although the policy described here was trained with PPO.)
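The planning side of that handoff can be sketched in a few lines. What follows is a minimal illustration in Python rather than the project's C#, and it rests on several assumptions: the grid is 4-connected with unit step cost, the sphere-cast is approximated by a plain distance check, and the names astar, YawPID, and select_mode are hypothetical. Only the Euclidean heuristic and the PID gains come from the article.

```python
import heapq
import math


def astar(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid (1 = blocked), or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Euclidean heuristic, as in the article; it never overestimates
        # the cost of a 4-connected path with unit steps.
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(h(start), 0.0, start)]
    g_cost = {start: 0.0}
    came_from = {}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > g_cost[cell]:
            continue  # stale heap entry; a cheaper route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]]:
                continue
            new_g = g + 1.0
            if new_g < g_cost.get(nxt, math.inf):
                g_cost[nxt] = new_g
                came_from[nxt] = cell
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None


class YawPID:
    """PID on yaw error using the gains quoted in the article."""

    def __init__(self, kp=0.05, ki=0.0, kd=0.002):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        # No derivative term on the first sample (assumption: avoids a kick).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def select_mode(aircraft_pos, launcher_pos, detection_radius):
    """Distance check standing in for the sphere-cast (assumption)."""
    inside = math.dist(aircraft_pos, launcher_pos) <= detection_radius
    return "RL" if inside else "ASTAR"


# Usage: plan around a wall, then steer toward the first waypoint.
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
yaw_command = YawPID().step(error=10.0, dt=0.02)
```

With Ki=0 the controller is effectively a PD loop, so the integral term is kept only to match the three gains as quoted.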

The RL agent's observation space is 14 dimensions: local position (3), rotation (3), target position (3), missile position (3), distance to target (1), distance to threat (1). The action space is two continuous values—lateral and forward movement, with forward clamped strictly positive so the agent can't fly backwards. The reward structure is intentionally minimal: +100 for reaching the target, -1 - dist/10 for wall collisions, -1 - dist/100 for timeout after 200s. No explicit reward for evading missiles. The agent learns evasion as an emergent consequence of "don't crash, don't time out, reach the green platform."
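To make the interface concrete, here is a hedged Python sketch of the observation vector, action clamp, and terminal reward described above. It is not the project's code. The function names are hypothetical, "dist" is assumed to mean distance to the target, the lateral range of [-1, 1] and the minimum forward value of 0.01 are illustrative choices, and the distances are assumed to be computed from the same positions that appear in the vector.

```python
import math


def build_observation(local_pos, rotation, target_pos, missile_pos):
    """Assemble the 14-value observation: 3 + 3 + 3 + 3 + 1 + 1."""
    obs = [
        *local_pos,                        # local position (3)
        *rotation,                         # rotation (3)
        *target_pos,                       # target position (3)
        *missile_pos,                      # missile position (3)
        math.dist(local_pos, target_pos),  # distance to target (1)
        math.dist(local_pos, missile_pos), # distance to threat (1)
    ]
    assert len(obs) == 14
    return obs


def clamp_action(lateral, forward, min_forward=0.01):
    """Keep forward strictly positive so the agent cannot fly backwards.

    The [-1, 1] range and min_forward value are assumptions.
    """
    lateral = max(-1.0, min(1.0, lateral))
    forward = max(min_forward, min(1.0, forward))
    return lateral, forward


def terminal_reward(event, dist_to_target):
    """Sparse reward from the article; no term rewards missile evasion."""
    if event == "reached_target":
        return 100.0
    if event == "wall_collision":
        return -1.0 - dist_to_target / 10.0
    if event == "timeout":
        return -1.0 - dist_to_target / 100.0
    return 0.0  # all other steps carry no reward in this sketch


# Usage with made-up positions.
obs = build_observation((0, 0, 0), (0, 90, 0), (10, 0, 0), (0, 0, 5))
action = clamp_action(lateral=0.3, forward=-0.5)
```

Under this reading, a crash far from the target costs more than a crash near it, and a timeout is penalized ten times more gently per unit of distance than a collision.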

Training ran for 5 million steps using PPO via the ML-Agents trainer, with 24 independent copies of the environment running in parallel inside a single Unity scene. Cumulative reward climbed from ~20 to ~100; episode length dropped sharply after the first million steps and stabilized around 20; value loss dropped from ~650 to ~75; and policy loss oscillated in a tight band, indicating stable PPO updates.

The unique take: hybrid architectures are the most under-appreciated idea in applied AI right now. DARPA's 2020 AlphaDogfight trials concluded with a deep RL agent beating an experienced human F-16 pilot, but the footnote was that the winning agent was not a pure end-to-end RL system—it was a hybrid with different controllers handing off depending on the situation. This project replicates that pattern at small scale: classical AI for what it's good at, RL for what it's good at, and a clean switch between them.

The source notes that the first version had an explicit reward for staying far from the missile, but the agent learned to stand still at maximum distance and never reach the target. "Reward shaping is a trap. Letting the agent discover the strategy is harder to train but produces better behavior."

What to watch

Watch for whether the U.S. Department of Defense, which struck a deal with 7 AI labs including OpenAI and Anthropic in May 2026 for classified systems, adopts hybrid controller architectures in production aerial combat systems—moving beyond pure end-to-end RL.

Source: gentic.news · May 16, 2026 · Author: Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The project's core contribution is demonstrating that hybrid architectures, where classical search algorithms handle deterministic subtasks and RL handles non-stationary ones, can outperform either approach alone. This mirrors the less-publicized finding from DARPA's 2020 AlphaDogfight trials, where the winning agent was not pure end-to-end RL but a hybrid system. The observation space design is notably compact at 14 dimensions, hand-engineered rather than learned from raw sensor data, which trades generality for sample efficiency.

The reward structure choice is the most instructive part. The author explicitly avoided shaped rewards for evasion after the naive reward led to degenerate behavior (agent standing still). This is a well-known failure mode in RL, since reward shaping often produces unintended optima, but it is worth highlighting for practitioners. The 24-environment parallel training setup is standard ML-Agents practice but necessary given Unity's simulation speed.

The project's limitation is its simplicity: a single missile launcher with a fixed detection radius, a 2D occupancy grid, and no sensor noise or partial observability beyond the missile's position. Real aerial combat involves multiple threats, radar clutter, electronic warfare, and uncertain dynamics. Scaling this pattern to production systems would require significant engineering beyond the Unity prototype.

#ai-models #reinforcement-learning #defense
