Hybrid A*+RL Agent Beats Pure End-to-End in Unity SR-71 Sim

A hybrid A* + deep RL agent in Unity, trained over 5M PPO steps, switches between classical path planning and learned evasion to navigate an SR-71 through a maze while dodging missiles.

Source: pub.towardsai.net via towards_ai (single source)
How does a hybrid A* + deep RL system navigate an SR-71 through a maze while dodging missiles?

A hybrid A* + deep RL agent in Unity, trained on 5 million PPO steps across 24 parallel environments, switches between classical path planning and learned evasion to navigate a maze and dodge a missile launcher.

TL;DR

Hybrid A*+RL beats pure end-to-end · SR-71 agent dodges missiles in Unity · 5 million PPO steps, 24 parallel envs

A developer built a hybrid A* + deep RL agent in Unity that flies an SR-71 through a maze while dodging a missile launcher. The system switches between classical path planning and learned evasion, trained over 5 million PPO steps across 24 parallel environments.

Key facts

  • 5 million PPO training steps
  • 24 parallel environments in Unity
  • 14-dimensional observation space
  • A* algorithm, dating from the 1960s, used for path planning
  • Cumulative reward climbed from ~20 to ~100

The core insight: most aerial navigation problems are actually two problems wearing the same costume. Route planning from point A to point B around known obstacles is deterministic, and A* has handled it since the 1960s. Threat evasion, where a missile is actively tracking the aircraft, is non-stationary and partially observable, which is the kind of problem deep RL is well suited to.

According to the source, the agent runs in one of two modes at any given moment, and a sphere-cast around the missile launcher triggers the switch in real time. While the aircraft is outside the detection radius, A* plans the path on a 2D occupancy grid using Euclidean distance as the heuristic, with a PID controller (Kp=0.05, Ki=0, Kd=0.002) on yaw. When the launcher detects the aircraft, a single line of C# fires: SwitchBehavior(BehaviorType.DQN). The trained ONNX policy takes over, and A* shuts up. (The enum value is named DQN, although the policy described here was trained with PPO.)
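The planning side of that handoff can be sketched in a few lines. The following is a minimal Python illustration, not the author's Unity C# code: the grid encoding (1 = wall), the 4-connected neighbourhood, the unit step cost, and the names astar, YawPID, and select_mode are all assumptions made for the example. Only the Euclidean heuristic, the PID gains, and the distance-based switch come from the description above.

```python
import heapq
import math


def astar(grid, start, goal):
    """A* on a 2D occupancy grid (1 = wall) with a Euclidean heuristic.

    Assumes 4-connected moves with unit cost (hypothetical choice).
    Returns a list of (row, col) cells from start to goal, or None.
    """
    def h(cell):
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(h(start), 0.0, start)]
    came_from = {}
    best_g = {start: 0.0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g.get(cell, math.inf):
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] == 1:
                continue
            ng = g + 1.0
            if ng < best_g.get(nxt, math.inf):
                best_g[nxt] = ng
                came_from[nxt] = cell
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None


class YawPID:
    """PID controller on yaw error, using the gains quoted in the article."""

    def __init__(self, kp=0.05, ki=0.0, kd=0.002):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def select_mode(aircraft_pos, launcher_pos, detection_radius):
    """Stand-in for the sphere-cast: learned policy inside the radius, A* outside."""
    inside = math.dist(aircraft_pos, launcher_pos) <= detection_radius
    return "RL" if inside else "ASTAR"
```

In a control loop one would call select_mode each tick, follow the astar path with YawPID while the result is "ASTAR", and hand the controls to the learned policy when it returns "RL". With Ki=0 the controller is effectively a PD controller, so the integral term never contributes.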

The RL agent's observation space is 14 dimensions: local position (3), rotation (3), target position (3), missile position (3), distance to target (1), distance to threat (1). The action space is two continuous values—lateral and forward movement, with forward clamped strictly positive so the agent can't fly backwards. The reward structure is intentionally minimal: +100 for reaching the target, -1 - dist/10 for wall collisions, -1 - dist/100 for timeout after 200s. No explicit reward for evading missiles. The agent learns evasion as an emergent consequence of "don't crash, don't time out, reach the green platform."
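To make the 14-dimensional layout and the reward terms concrete, here is a small Python sketch. It is illustrative only: the article does not say what dist measures (distance to the target is assumed here), nor the action range (the [-1, 1] bounds and the min_forward floor are hypothetical), and the function names are invented for the example.

```python
import math


def build_observation(position, rotation, target, missile):
    """Pack the 14 values listed in the article into one flat vector.

    Layout: position (3), rotation (3), target (3), missile (3),
    distance to target (1), distance to threat (1).
    """
    return [
        *position, *rotation, *target, *missile,
        math.dist(position, target),
        math.dist(position, missile),
    ]


def clamp_action(lateral, forward, min_forward=0.01):
    """Keep forward strictly positive so the agent cannot fly backwards.

    The [-1, 1] range and min_forward value are assumptions.
    """
    lateral = max(-1.0, min(1.0, lateral))
    forward = max(min_forward, min(1.0, forward))
    return lateral, forward


def terminal_reward(event, dist):
    """Reward terms as quoted; dist is assumed to be distance to the target."""
    if event == "reached_target":
        return 100.0
    if event == "wall_collision":
        return -1.0 - dist / 10.0
    if event == "timeout":
        return -1.0 - dist / 100.0
    return 0.0  # no per-step reward and no explicit evasion term
```

Under these assumptions, a wall collision 20 units from the target costs -3.0, while a timeout 50 units away costs -1.5. None of the three terms reads the missile position, which is what makes evasion an emergent behavior.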

Training ran for 5 million steps using PPO via the ML-Agents trainer, with 24 independent copies of the environment running in parallel inside a single Unity scene. Cumulative reward climbed from ~20 to ~100, episode length dropped sharply after the first million steps and stabilized around 20, value loss dropped from ~650 to ~75, and policy loss oscillated in a tight band—indicating stable PPO updates.

The unique take: hybrid architectures are the most under-appreciated idea in applied AI right now. DARPA's 2020 AlphaDogfight Trials concluded with a deep RL agent beating an experienced human F-16 pilot. According to the source, the footnote was that the winning agent was not a pure end-to-end RL system but a hybrid, with different controllers handing off depending on the situation. This project replicates that pattern at small scale: classical AI for what it's good at, RL for what it's good at, and a clean switch between them.

The source notes that the first version had an explicit reward for staying far from the missile, but the agent learned to stand still at maximum distance and never reach the target. "Reward shaping is a trap. Letting the agent discover the strategy is harder to train but produces better behavior."
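A back-of-the-envelope calculation shows why a per-step distance bonus can make standing still the better strategy. The sketch below is hypothetical throughout: the bonus coefficient, distances, and step counts are invented for illustration, and only the +100 target reward comes from the article.

```python
def hover_return(bonus_per_unit, distance_to_missile, steps):
    """Undiscounted return for parking at a fixed distance until the episode ends."""
    return bonus_per_unit * distance_to_missile * steps


def reach_target_return(bonus_per_unit, distances_along_route, target_reward=100.0):
    """Return for flying to the target; the bonus stream stops when the episode ends."""
    return target_reward + sum(bonus_per_unit * d for d in distances_along_route)


# Hypothetical numbers: 0.05 bonus per unit of distance per step.
hover = hover_return(0.05, distance_to_missile=40.0, steps=200)
reach = reach_target_return(0.05, distances_along_route=[15.0] * 20)
print(f"hover: {hover:.1f}  reach target: {reach:.1f}")
```

With these made-up values, hovering far from the missile for a full episode collects more than finishing quickly, because reaching the target ends the episode and cuts off the bonus. Setting the bonus to zero, as in the final reward design, makes reaching the target strictly better.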

What to watch

Watch for whether the U.S. Department of Defense, which struck a deal with 7 AI labs including OpenAI and Anthropic in May 2026 for classified systems, adopts hybrid controller architectures in production aerial combat systems—moving beyond pure end-to-end RL.


AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The project's core contribution is a working demonstration of a hybrid architecture in which a classical search algorithm handles the deterministic subtask and RL handles the non-stationary one. The article frames this as outperforming either approach alone, although no baseline comparison figures are reported. According to the source, it mirrors a less-publicized finding from DARPA's 2020 AlphaDogfight Trials, where the winning agent was not pure end-to-end RL but a hybrid system. The observation space is notably compact at 14 dimensions and hand-engineered rather than learned from raw sensor data, which trades generality for sample efficiency.

The reward structure choice is the most instructive part. The author dropped the shaped reward for evasion after it led to degenerate behavior (the agent standing still). This is a well-known failure mode in RL, since reward shaping often produces unintended optima, but it is worth highlighting for practitioners. Running 24 environments in parallel is standard ML-Agents practice and helps offset Unity's limited simulation speed.

The project's limitation is its simplicity: a single missile launcher with a fixed detection radius, a 2D occupancy grid, and no sensor noise or partial observability beyond the missile's position. Real aerial combat involves multiple threats, radar clutter, electronic warfare, and uncertain dynamics. Scaling this pattern to production systems would require significant engineering beyond the Unity prototype.