A developer built a hybrid A* + deep RL agent in Unity that flies an SR-71 through a maze while dodging a missile launcher. The system switches between classical path planning and learned evasion, trained over 5 million PPO steps across 24 parallel environments.
Key facts
- 5 million PPO training steps
- 24 parallel environments in Unity
- 14-dimensional observation space
- A* algorithm from the 1960s used for path planning
- Cumulative reward climbed from ~20 to ~100
The core insight: most aerial navigation problems are actually two problems wearing the same costume. Route planning from point A to point B around known obstacles is deterministic, and A* has solved it since the 1960s. Threat evasion, where a missile is actively tracking the aircraft, is non-stationary and partially observable, which is the kind of problem deep RL is well suited to.
According to the source, the agent runs in one of two modes at any given moment, and a sphere-cast around the missile launcher triggers the switch in real time. While the aircraft is outside the detection radius, A* plans the path on a 2D occupancy grid using Euclidean distance as the heuristic, and a PID controller (Kp=0.05, Ki=0, Kd=0.002) steers yaw. With Ki=0 the controller is effectively PD. When the launcher detects the aircraft, a single line of C# fires: SwitchBehavior(BehaviorType.DQN). The trained ONNX policy takes over, and A* shuts up. Note that the enum value is named DQN even though the policy is described as PPO-trained.
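The planning side of that loop can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the project's C# code: the 4-connected grid with unit step cost, the plain distance check standing in for Unity's sphere-cast, and the names `astar`, `YawPID`, and `select_mode` are all hypothetical. Only the Euclidean heuristic and the PID gains come from the source.

```python
import heapq
import math
from dataclasses import dataclass


def astar(grid, start, goal):
    """A* on a 2D occupancy grid (truthy cell = wall), Euclidean heuristic.

    Assumes 4-connected moves with unit cost; the source does not say
    which connectivity the project uses.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(h(start), 0.0, start)]  # entries are (f, g, cell)
    best_g = {start: 0.0}
    parent = {}
    closed = set()
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in parent:  # walk parents back to the start
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if cell in closed:
            continue  # stale heap entry for an already expanded cell
        closed.add(cell)
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc]:
                continue
            ng = g + 1.0
            if ng < best_g.get(nxt, math.inf):
                best_g[nxt] = ng
                parent[nxt] = cell
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable


@dataclass
class YawPID:
    """PID on yaw error using the gains quoted in the source."""
    kp: float = 0.05
    ki: float = 0.0
    kd: float = 0.002
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def select_mode(aircraft_pos, launcher_pos, detection_radius):
    """Distance check standing in for the sphere-cast (an assumption)."""
    inside = math.dist(aircraft_pos, launcher_pos) <= detection_radius
    return "RL" if inside else "ASTAR"
```

In a per-frame loop, `select_mode` would decide whether to steer along the A* path with the yaw controller or to query the trained policy instead.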
The RL agent's observation space is 14 dimensions: local position (3), rotation (3), target position (3), missile position (3), distance to target (1), distance to threat (1). The action space is two continuous values, lateral and forward movement, with forward clamped strictly positive so the agent can't fly backwards. The reward structure is intentionally minimal: +100 for reaching the target, -1 - dist/10 for a wall collision, and -1 - dist/100 for timing out after 200 s. There is no explicit reward for evading missiles. The agent learns evasion as an emergent consequence of "don't crash, don't time out, reach the green platform."
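To make the interface concrete, here is a rough Python sketch of the observation vector, the action clamp, and the terminal rewards. It is not the project's code. The function names, the `MIN_FORWARD` floor, and the reading of `dist` as distance to the target are assumptions; the source gives only the component sizes, the "strictly positive" constraint, and the reward formulas.

```python
import math

TIMEOUT_SECONDS = 200.0  # episode time limit quoted in the source
MIN_FORWARD = 0.01       # hypothetical floor; the source only says "strictly positive"


def build_observation(position, rotation, target, missile):
    """Concatenate the six components into the 14-dimensional vector."""
    obs = [
        *position,                     # local position (3)
        *rotation,                     # rotation (3)
        *target,                       # target position (3)
        *missile,                      # missile position (3)
        math.dist(position, target),   # distance to target (1)
        math.dist(position, missile),  # distance to threat (1)
    ]
    assert len(obs) == 14
    return obs


def clamp_action(lateral, forward):
    """Keep forward motion strictly positive so the agent cannot reverse."""
    return lateral, max(forward, MIN_FORWARD)


def terminal_reward(outcome, dist):
    """Reward at episode end; `dist` is assumed to be distance to the target."""
    if outcome == "target":
        return 100.0
    if outcome == "wall":
        return -1.0 - dist / 10.0
    if outcome == "timeout":
        return -1.0 - dist / 100.0
    raise ValueError(f"unknown outcome: {outcome}")
```

Under that reading of `dist`, crashing 20 units from the target costs -3, while timing out 50 units away costs -1.5, so both penalties grow with the remaining distance and nothing in the signal mentions the missile.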
Training ran for 5 million steps using PPO via the ML-Agents trainer, with 24 independent copies of the environment running in parallel inside a single Unity scene. Cumulative reward climbed from ~20 to ~100. Episode length dropped sharply after the first million steps and stabilized around 20. Value loss fell from ~650 to ~75, and policy loss oscillated in a tight band, which suggests stable PPO updates.
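One reason PPO's policy loss tends to stay in a narrow band is its clipped surrogate objective, which caps how much a single update can gain by moving the policy away from the one that collected the data. The Python below is a minimal sketch of that objective, not the ML-Agents implementation. The function name is illustrative, and `epsilon=0.2` is a common default assumed here, since the source lists no hyperparameters.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch.

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    epsilon:    clip range (assumed value, not taken from the source)
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum makes the objective a pessimistic bound:
        # gains from moving the ratio outside the clip range are ignored.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)
```

With a positive advantage, a ratio of 1.5 is credited only as 1.2, so the update has no incentive to push the policy further in that direction within one batch.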
The unique take: hybrid architectures are the most under-appreciated idea in applied AI right now. DARPA's 2020 AlphaDogfight trials concluded with a deep RL agent beating an experienced human F-16 pilot, but the footnote was that the winning agent was not a pure end-to-end RL system—it was a hybrid with different controllers handing off depending on the situation. This project replicates that pattern at small scale: classical AI for what it's good at, RL for what it's good at, and a clean switch between them.
The source notes that the first version had an explicit reward for staying far from the missile, but the agent learned to stand still at maximum distance and never reach the target. "Reward shaping is a trap. Letting the agent discover the strategy is harder to train but produces better behavior."
What to watch
Watch for whether the U.S. Department of Defense, which struck a deal with 7 AI labs including OpenAI and Anthropic in May 2026 for classified systems, adopts hybrid controller architectures in production aerial combat systems—moving beyond pure end-to-end RL.









