Keygraph's Shannon AI Pentester Hits 96.15% on XBOW, Finds Real Exploits

Keygraph released Shannon, a fully autonomous AI pentester that hunts real exploits in source code with a 96.15% success rate on the hint-free XBOW Benchmark. It runs a full test in about an hour for roughly $50 using Claude Sonnet.

Ala SMITH & AI Research Desk · Apr 7, 2026 · 6 min read · AI-Generated

Source: x.com via @heygurisingh (single source)

TL;DR

Keygraph launches Shannon, an open-source autonomous AI pentester with a 96.15% success rate on the XBOW benchmark, finding real exploits in web apps for ~$50 per test.

Keygraph has launched Shannon, an open-source, fully autonomous AI agent designed to perform penetration testing on web applications by finding and executing real exploits, not just generating alerts. Keygraph claims a 96.15% success rate on the hint-free XBOW Benchmark and positions Shannon as a continuous security layer for development teams that ship code daily.

The core proposition addresses a critical gap in modern DevOps: while teams may use AI coding assistants like Claude Code or Cursor to push code daily, traditional pentests are often annual events. Shannon aims to close that window by providing automated, on-demand security testing.

What Shannon Actually Does

Shannon operates as a multi-phase autonomous agent. Its stated workflow is:

Reconnaissance: Scans the target application and infrastructure.
Vulnerability Analysis: Identifies potential attack vectors in the source code.
Exploitation: Actively attempts to exploit found vulnerabilities using a built-in browser, capable of handling complex flows like 2FA/TOTP logins without human intervention.
Reporting: Delivers copy-paste Proof-of-Concept (PoC) evidence for any successful exploit. It enforces a strict "No Exploit, No Report" policy to eliminate false positives.
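The four stated phases can be pictured as a simple pipeline in which only findings with a working exploit reach the report. The following is a minimal sketch under that reading, not Shannon's actual code: the Finding type, run_pipeline, and the fake_* phase functions are hypothetical names invented for illustration, and the target URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Finding:
    """A candidate vulnerability moving through the phases (hypothetical type)."""
    vuln_class: str
    location: str
    proof_of_concept: Optional[str] = None  # set only if exploitation succeeds


def run_pipeline(
    recon: Callable[[str], List[str]],
    analyze: Callable[[List[str]], List[Finding]],
    exploit: Callable[[Finding], Optional[str]],
    target: str,
) -> List[Finding]:
    """Run recon -> analysis -> exploitation -> reporting for one target."""
    endpoints = recon(target)
    candidates = analyze(endpoints)
    report = []
    for finding in candidates:
        finding.proof_of_concept = exploit(finding)
        # "No Exploit, No Report": drop anything without a working PoC.
        if finding.proof_of_concept is not None:
            report.append(finding)
    return report


# Stand-in phase implementations, for illustration only.
def fake_recon(target: str) -> List[str]:
    return [f"{target}/login", f"{target}/search"]


def fake_analyze(endpoints: List[str]) -> List[Finding]:
    return [Finding("Injection", endpoints[0]), Finding("XSS", endpoints[1])]


def fake_exploit(finding: Finding) -> Optional[str]:
    # Pretend only the injection candidate is actually exploitable.
    if finding.vuln_class == "Injection":
        return "placeholder PoC for " + finding.location
    return None


report = run_pipeline(
    fake_recon, fake_analyze, fake_exploit, "https://staging.example.test"
)
```

In this sketch the XSS candidate is silently discarded because exploitation returned nothing, which is the behavior the "No Exploit, No Report" policy describes.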

Under the hood, it orchestrates specialized sub-agents targeting specific vulnerability classes (Injection, XSS, SSRF, Broken Auth) and runs tools like Nmap, Subfinder, WhatWeb, and Schemathesis in parallel.
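One way to picture this fan-out is a thread pool that runs one worker per vulnerability class and collects the results. This is only an illustrative sketch: make_sub_agent and run_sub_agents are hypothetical names, the workers return placeholder strings rather than invoking any real tool, and nothing here reflects how Shannon itself schedules its agents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Vulnerability classes named in the article.
VULN_CLASSES = ["Injection", "XSS", "SSRF", "Broken Auth"]


def make_sub_agent(vuln_class: str) -> Callable[[str], List[str]]:
    """Build a stand-in sub-agent for one vulnerability class (hypothetical)."""
    def sub_agent(target: str) -> List[str]:
        # A real agent would probe the target; this one returns a placeholder.
        return [f"{vuln_class} candidate at {target}"]
    return sub_agent


def run_sub_agents(target: str) -> Dict[str, List[str]]:
    """Run every sub-agent concurrently and gather results by class."""
    agents = {name: make_sub_agent(name) for name in VULN_CLASSES}
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {
            name: pool.submit(agent, target) for name, agent in agents.items()
        }
        # result() blocks until that sub-agent has finished.
        return {name: future.result() for name, future in futures.items()}


results = run_sub_agents("https://staging.example.test")
```

Threads suit this kind of workload because each worker would mostly wait on network or subprocess I/O rather than compute.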

Key Results and Performance

The primary benchmark cited is the XBOW Benchmark, a hint-free evaluation for autonomous penetration testing agents. Shannon's reported 96.15% success rate on this benchmark is its headline technical figure.

In a demonstration run against the OWASP Juice Shop—a deliberately vulnerable web application—Shannon is reported to have found:

20+ high-impact vulnerabilities
A complete authentication bypass leading to full database exfiltration
Privilege escalation to admin via a registration bypass
Server-Side Request Forgery (SSRF) enabling internal network reconnaissance
Systemic Insecure Direct Object Reference (IDOR) across user data

Technical and Operational Details

Architecture: Multi-agent system with parallel execution for speed and coverage of critical OWASP Top 10 classes (Injection, XSS, SSRF, Broken Authentication & Authorization).
Runtime: Approximately 1 hour for a full pentest cycle.
Cost: Estimated at ~$50 per test, primarily attributed to the cost of running the underlying Claude Sonnet model that powers the AI agents.
License: Released as 100% open-source under the AGPL-3.0 license. The repository had gained 10.6k stars on GitHub at the time of announcement.
Output: Focuses on actionable, verified exploits with PoCs, moving beyond traditional SAST/DAST tools that often produce alert noise.

gentic.news Analysis

Shannon represents a significant step toward operationalizing AI for offensive security tasks beyond vulnerability scanning. A 96.15% success rate on a hint-free benchmark like XBOW, if independently verified, would place it at the forefront of autonomous security agents. This isn't just a scanner; its ability to chain logic, handle multi-step authentication, and deliver functional exploits positions it closer to a junior human pentester in capability.
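To make the multi-step authentication point concrete: a 2FA/TOTP login of the kind mentioned earlier requires the agent to compute a time-based one-time code from a shared secret. The sketch below implements the standard TOTP algorithm (RFC 6238, HMAC-SHA1 variant) with the Python standard library. It illustrates what any automated client must do to pass such a login and is not taken from Shannon; the secret shown is the RFC's published test value, not a real credential.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, digits: int = 6, step: int = 30) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for the given time."""
    # The moving factor is the number of whole time steps since the epoch.
    counter = unix_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte
    # selects a 4-byte window, whose top bit is then masked off.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 Appendix B test vector: ASCII secret, T = 59 seconds, 8 digits.
rfc_secret = b"12345678901234567890"
code_8 = totp(rfc_secret, 59, digits=8)  # "94287082" per the RFC
code_6 = totp(rfc_secret, 59)            # last six digits of the same value
```

In practice the agent would call such a function with the current time and submit the code in the login form before the 30-second window expires.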

This development sits at the convergence of two major trends we've been tracking: the rise of AI-powered software development (exemplified by Claude Code and Cursor) and the growing field of AI for cybersecurity. As development velocity increases, the security review bottleneck becomes more acute. Tools like Shannon propose a compelling solution: integrating continuous, exploit-proven security testing directly into the development lifecycle at a relatively low cost-per-run.

The open-source AGPL release is a strategic move. It allows for rapid community adoption, testing, and improvement, which is crucial for a tool whose effectiveness relies on understanding evolving attack vectors. However, it also raises immediate questions about responsible deployment and potential weaponization. The "No Exploit, No Report" philosophy is a strong differentiator in a market flooded with tools that generate overwhelming false positives, making it particularly attractive for lean security teams.

The ~$50 cost estimate is interesting, as it directly ties the operational expense to a leading LLM API (Anthropic's Claude Sonnet). This makes Shannon's running cost subject to the broader market dynamics of cloud AI pricing. Its success will depend not only on its technical efficacy but also on maintaining a favorable cost/benefit ratio as both the tool and the underlying models evolve.
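A quick back-of-the-envelope calculation shows how the per-run estimate scales with testing cadence. The ~$50 figure comes from the article; the cadences and run counts below are illustrative assumptions, not Keygraph numbers, and real costs would vary with application size and model pricing.

```python
COST_PER_TEST_USD = 50  # rough per-run estimate reported in the article


def annual_cost(runs_per_year: int, cost_per_test: float = COST_PER_TEST_USD) -> float:
    """Yearly spend for a given number of full test runs."""
    return runs_per_year * cost_per_test


# Hypothetical cadences chosen for illustration.
cadences = {
    "annual": 1,
    "quarterly": 4,
    "weekly": 52,
    "per working day": 250,
}

costs = {name: annual_cost(runs) for name, runs in cadences.items()}

for name, usd in costs.items():
    print(f"{name:>16}: ${usd:,.0f} per year")
```

Under these assumptions a weekly run costs $2,600 a year and a run every working day costs $12,500 a year, so cadence rather than per-run price drives the budget.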

Frequently Asked Questions

How does Shannon differ from traditional SAST/DAST tools?

Traditional Static (SAST) and Dynamic (DAST) Application Security Testing tools primarily identify potential vulnerabilities and generate alerts, often with high false-positive rates. Shannon is designed as an autonomous agent that actively attempts to exploit found vulnerabilities. Its "No Exploit, No Report" policy means it only reports issues for which it has successfully executed a proof-of-concept, significantly reducing noise and providing verified, high-fidelity results.

What is the XBOW Benchmark, and why is a 96.15% score significant?

The XBOW Benchmark is a standardized test for evaluating autonomous penetration testing AI agents. It is "hint-free," meaning the agent receives no guiding clues about potential vulnerabilities and must discover and exploit them entirely on its own. A 96.15% success rate suggests the agent can independently navigate complex web applications, understand context, and chain attacks with a high degree of reliability, a notable advancement over previous automated systems.

Is it safe for companies to run an autonomous AI pentester on their live applications?

Running any aggressive security tool, automated or not, carries risk. A tool designed to exploit vulnerabilities could potentially cause data corruption or service disruption if not configured carefully. The standard and safest practice is to run pentests against staging or pre-production environments that mirror the live system. Keygraph's documentation and the responsible security community will likely emphasize strict controls and testing in isolated environments before any production use.
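One simple control of this kind is a scope guard that refuses any target whose host is not on an explicit allowlist of non-production environments. The sketch below is a generic illustration, not a Shannon feature: ALLOWED_HOSTS, its example hostnames, and is_in_scope are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of staging / pre-production hosts.
ALLOWED_HOSTS = {"staging.example.test", "preprod.example.test"}


def is_in_scope(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs whose host is explicitly allowed."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # hostname is lowercased by urlsplit and excludes any userinfo or port,
    # so "https://staging.example.test@evil.test/" resolves to "evil.test".
    return parts.hostname in allowed_hosts


checks = {
    "staging": is_in_scope("https://staging.example.test/login"),
    "production": is_in_scope("https://www.example.test/login"),
    "userinfo trick": is_in_scope("https://staging.example.test@evil.test/"),
}
```

Matching on the exact parsed hostname, rather than a substring of the URL, avoids the common mistake of accepting look-alike or userinfo-prefixed addresses.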

What does the AGPL-3.0 license mean for commercial use?

The GNU Affero General Public License (AGPL-3.0) is a strong copyleft license. It requires that if you run a modified version of Shannon as a service over a network (e.g., integrating it into a commercial SaaS security platform), you must make the complete source code of your modified version available to the users of that service. This encourages contributions back to the open-source project but can influence how commercial entities choose to integrate the tool.

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The launch of Shannon by Keygraph is a tangible escalation in the AI-for-cybersecurity arms race. For years, the narrative has been about AI *defending* systems. Shannon flips the script, applying state-of-the-art agentic AI to the *offensive* side with remarkable autonomy. The reported XBOW score isn't just a number; it's a signal that AI agents can now reliably navigate the non-linear, stateful paths required for real exploitation, a task that has long been a benchmark for human intelligence in security.

Practically, this creates immediate pressure on several fronts. For security vendors, it raises the bar for what constitutes a "finding," moving from alerts to verified exploits. For developers and DevOps teams, it offers a plausible path to shift-left security that is actually actionable, not just another dashboard to ignore. The cost structure (~$50/test) makes it accessible for continuous integration, potentially changing the pentest from a quarterly or annual audit to a per-commit check.

However, the open-source release under AGPL is a double-edged sword. While it accelerates development and trust through transparency, it also democratizes advanced offensive capabilities. The security community will need to rapidly develop norms and safeguards around tools like Shannon. Furthermore, its performance is intrinsically linked to the capabilities of Claude Sonnet. Any regression or policy change by the upstream LLM provider could directly impact Shannon's effectiveness, introducing a new form of supply-chain risk for security operations.
