ARMOR 2025, a new benchmark published April 30 on arXiv, evaluates 21 commercial LLMs against military legal doctrine. It reveals gaps in models' adherence to the Law of War and Rules of Engagement that existing safety benchmarks do not detect.
Key facts
- 519 doctrinally grounded prompts in the benchmark
- 12-category taxonomy based on OODA framework
- 21 commercial LLMs evaluated
- Grounded in the Law of War, Rules of Engagement, and Joint Ethics Regulation
- Published on arXiv April 30, 2025
The Doctrinal Gap
ARMOR 2025 targets a blind spot in LLM safety evaluation. Existing benchmarks such as MMLU and TruthfulQA measure general knowledge and truthfulness, but none measure whether models follow the legal and ethical rules governing real military operations. The benchmark extracts doctrinal text from three core sources: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. It then generates multiple-choice questions designed to preserve the intended meaning of each rule. [According to ARMOR 2025]
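As a rough illustration of what such an item might look like in code, here is a minimal sketch of a possible schema for a doctrinally grounded multiple-choice prompt. The field names, category label, and example scenario are invented for this sketch and are not drawn from the ARMOR 2025 release.

```python
from dataclasses import dataclass

@dataclass
class DoctrinalItem:
    """Hypothetical schema for one doctrinally grounded multiple-choice prompt."""
    source: str           # e.g. "Law of War", "Rules of Engagement", "Joint Ethics Regulation"
    category: str         # one of the taxonomy's 12 categories
    question: str         # scenario framed as a multiple-choice question
    options: list[str]    # candidate answers presented to the model
    answer_index: int     # index of the doctrinally correct option
    allow_refusal: bool = True  # whether an explicit refusal is an acceptable response

# Illustrative item, invented for this sketch rather than taken from the benchmark
item = DoctrinalItem(
    source="Rules of Engagement",
    category="use-of-force authorization",
    question=(
        "A patrol observes an ambiguous contact near a protected site. "
        "Under the stated ROE, what is the permissible next step?"
    ),
    options=[
        "Engage immediately",
        "Seek positive identification and request escalation approval",
        "Withdraw without reporting the contact",
        "Hand the decision to an autonomous targeting system",
    ],
    answer_index=1,
)
```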
OODA-Inspired Taxonomy
The benchmark organizes its 519 prompts through a 12-category taxonomy informed by the Observe, Orient, Decide, Act (OODA) decision-making framework, enabling systematic testing of accuracy and refusal across military-relevant decision types. Categories span scenarios from targeting decisions to rules-of-engagement interpretation. [Per the arXiv preprint]
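One way to picture the taxonomy in practice is as a mapping from OODA phases to categories, with accuracy and refusal tallied per category. The phase-to-category mapping and the scoring function below are assumptions made for illustration; the actual category names and scoring rules are defined by the paper, not by this sketch.

```python
from collections import defaultdict

# Hypothetical mapping from OODA phases to taxonomy categories; the real
# 12 category names are defined by the paper and may differ.
OODA_TAXONOMY = {
    "Observe": ["intelligence collection limits", "protected-object identification", "reporting obligations"],
    "Orient":  ["status-of-persons determination", "proportionality assessment", "ROE interpretation"],
    "Decide":  ["use-of-force authorization", "targeting decisions", "escalation of force"],
    "Act":     ["means and methods of warfare", "treatment of detainees", "joint ethics conduct"],
}

def score_by_category(results):
    """results: iterable of (category, choice_index, correct_index, refused) tuples."""
    tally = defaultdict(lambda: {"correct": 0, "refused": 0, "total": 0})
    for category, choice, correct_index, refused in results:
        bucket = tally[category]
        bucket["total"] += 1
        if refused:
            bucket["refused"] += 1
        elif choice == correct_index:
            bucket["correct"] += 1
    return {
        cat: {
            "accuracy": counts["correct"] / counts["total"],
            "refusal_rate": counts["refused"] / counts["total"],
        }
        for cat, counts in tally.items()
    }
```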

Results and Implications
Evaluation results across the 21 commercial LLMs reveal critical gaps in safety alignment for military applications. The paper does not disclose which specific models performed best or worst, nor does it release per-model scores, a notable omission for reproducibility. Still, the finding that models fail to consistently follow the legal and ethical rules governing military operations has immediate implications for defense contractors exploring LLM deployment.
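If per-model scores were published, a small aggregation harness along the following lines could reproduce such a comparison. The ask_model callable, the model identifiers, and the scoring rule are placeholders assumed for illustration; nothing here reflects the paper's actual evaluation code.

```python
def evaluate_models(models, items, ask_model):
    """Aggregate accuracy and refusal rate per model.

    models: iterable of model identifiers (placeholders).
    items: DoctrinalItem instances, as sketched above.
    ask_model(model, item) -> (choice_index or None, refused: bool); hypothetical harness hook.
    """
    report = {}
    for model in models:
        correct = refused = 0
        for item in items:
            choice, did_refuse = ask_model(model, item)
            if did_refuse:
                refused += 1
            elif choice == item.answer_index:
                correct += 1
        n = len(items)
        report[model] = {"accuracy": correct / n, "refusal_rate": refused / n}
    return report
```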

The unique take here is that civilian safety alignment — the dominant paradigm in AI safety research — is insufficient for high-stakes military contexts. A model that refuses to generate hate speech might still recommend a legally questionable airstrike. ARMOR 2025 provides a concrete framework to test this, but its reliance on multiple-choice questions may miss nuanced reasoning required in real command decisions.
What to watch
Watch for follow-up papers that release per-model scores and open-source the prompt set. Also track whether the U.S. Department of Defense or allied military organizations adopt ARMOR 2025 as a procurement or deployment criterion for LLM-based systems.