Every existing cyber benchmark asks models to answer questions about attacks. Cyber Defense Benchmark is the first to use real attack telemetry in a scalable way in an agentic investigation format — giving AI agents raw logs and asking them to find the intrusion. 105 procedures, 11 frontier LLMs, zero passing scores. RLVR-ready.
Every multi-stage attack in the benchmark is a sequence of attacker steps. Each step is a procedure that maps onto a MITRE ATT&CK tactic — the high-level phases of an intrusion (Initial Access, Discovery, Lateral Movement, Exfiltration, etc.).
The Coverage Score represents what fraction of the attack the agent surfaces, scaled so a perfect run scores 1.0. The radar above shows that score broken out per tactic; the outer ring is the benchmark maximum.
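As a rough illustration of how that number is produced, the minimal sketch below treats coverage as the fraction of ground-truth attack steps for which the agent submitted at least one correct flag. The data structures and aggregation details here are assumptions for the example; the benchmark's exact scoring pipeline is described in the technical report.

```python
# Minimal sketch of a coverage-style score: the fraction of ground-truth attack
# steps the agent surfaced, so a perfect run scores 1.0. The step/flag
# structures are illustrative, not the benchmark's actual data model.
def coverage_score(ground_truth_steps: list[set[str]], submitted_flags: set[str]) -> float:
    """ground_truth_steps holds, per attack step, the set of valid flag timestamps."""
    if not ground_truth_steps:
        return 0.0
    surfaced = sum(1 for step_flags in ground_truth_steps if step_flags & submitted_flags)
    return surfaced / len(ground_truth_steps)
```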
| Rank | Model | Coverage Score | Flags Found (%) | Mean Cost / Run |
|---|---|---|---|---|
| 1 | Opus 4.6 | 0.46 | 4.49 | $17.98 |
| 2 | Sonnet 4.6 | 0.35 | 3.43 | $12.99 |
| 3 | Opus 4.7 | 0.28 | 2.97 | $8.72 |
| 4 | Gemini 3.1 Pro | 0.18 | 2.02 | $1.85 |
| 5 | GPT 5 | 0.17 | 2.24 | $1.07 |
| 6 | Gemini 3 Flash | 0.15 | 1.44 | $0.19 |
| 7 | Kimi 2.6 | 0.13 | 1.15 | $0.52 |
| 8 | Qwen 3.6 | 0.13 | 1.58 | $0.54 |
| 9 | Minimax 2.7 | 0.11 | 0.97 | $0.11 |
| 10 | DeepSeek 3.2 | 0.09 | 0.82 | $0.91 |
| 11 | Kimi 2.5 | 0.09 | 0.87 | $1.44 |
11 frontier models evaluated across 886 runs over 26 multi-stage campaigns spanning 105 procedures from the canonical Security-Datasets corpus. Each hunt has a 50-query budget against 75,000–135,000 log records. Total LLM API cost across all runs: $1,682.
The chart below tracks each model's Coverage Score (explained above) as turns are spent. Opus 4.6 separates from the field early and finishes at 0.46, meaning more than half of the coverable narrative steps are still missed per run on average. Several lower-ranked models stop progressing well before exhausting their query budget, not because they run out of hypotheses, but because they assume their work is done. For that reason they cannot be trusted to run unsupervised.
Detection ability and cost of investigation are not linearly related. Gemini 3 Flash costs $0.19/run and finds 1.4% of flags. Opus 4.6 costs $17.98/run — roughly 95× more — and finds 4.5% of flags, about 3.1× more. That is the Pareto frontier in one sentence: the curve bends sharply, not smoothly. The cost-quality trade-off is a cliff, not a slope.
GPT 5 and Gemini 3.1 Pro sit in the mid-price range at $1.07 and $1.85 per run. Both plateau around 2.1% of flags and stop climbing. Throwing more budget at them does not help — the limit is not token cost, it is that the agent believes it has completed its task and has no work left to do.
Finding more flags is only meaningful if the detections span the full attack lifecycle. Against the >50%-per-tactic passing bar, Opus 4.6 clears 8 of 13 tactics; of the other 10 models only Sonnet 4.6 clears 1 tactic, and the remaining 9 fail every tactic category. Partial visibility means that some stages of the attack go unnoticed.
Tactic-level breakdown for each model — where they excel and where they go blind. Models with blind spots (0% detection on a tactic) miss entire phases of the kill chain.
The Cyber Defense Benchmark is available to credible organizations at no cost.
Bring your model — we run the benchmark.
Cyber Defense provides verifiable rewards for reinforcement learning. Binary flag matching means every agent action produces a deterministic, auditable reward signal — no LLM judges, no subjective rubrics. The Holodeck Gymnasium environment generates infinite non-repeating episodes through context morphing, making it impossible to overfit to a fixed attack sequence.
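A minimal sketch of how such a reward could be wired up is below, assuming ground truth is a set of exact malicious-event timestamps; the function names are illustrative, not the benchmark's actual API.

```python
# Sketch of a verifiable, binary reward from flag matching, suitable for RLVR.
# Exact-match on timestamps keeps the signal deterministic and auditable.
def flag_reward(submitted_timestamp: str, ground_truth_timestamps: set[str]) -> float:
    return 1.0 if submitted_timestamp in ground_truth_timestamps else 0.0

def episode_reward(submitted: list[str], ground_truth_timestamps: set[str]) -> float:
    # Total reward for a hunt: each distinct correct flag counts once.
    return sum(flag_reward(t, ground_truth_timestamps) for t in set(submitted))
```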
No existing benchmark uses real attack telemetry in a scalable way in an agentic format. The Cyber Defense Benchmark does. Every event was captured from actual attack execution, using tools such as Empire, Covenant, Mimikatz, and Rubeus, on instrumented Windows endpoints. These are real Sysmon and Security logs, not synthetic or hand-written examples. Agents investigate 134K+ events per environment through iterative SQL queries.
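For a concrete sense of the interaction, the sketch below shows the kind of iterative query an agent might issue against a local copy of the log database. The table and column names (events, event_id, command_line, utc_time) and the file name are assumptions for illustration; the real schema may differ.

```python
# Illustrative only: one hunting query against a hypothetical SQLite export of
# the environment's event log. Schema and file name are assumed for the example.
import sqlite3

conn = sqlite3.connect("hunt_environment.db")
rows = conn.execute(
    """
    SELECT utc_time, host, image, command_line
    FROM events
    WHERE event_id = 1                         -- Sysmon process creation
      AND lower(command_line) LIKE '%-enc %'   -- encoded PowerShell is a common lead
    ORDER BY utc_time
    LIMIT 50
    """
).fetchall()
for utc_time, host, image, command_line in rows:
    print(utc_time, host, image, command_line)
```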
| Benchmark | Focus | Format | Scale | Real Telemetry | Agentic | Anti-Memorization | Cost Tracking |
|---|---|---|---|---|---|---|---|
| CTI-Bench | CTI knowledge | MCQ / short answer | Fixed | — | — | — | — |
| CyberSOCEval | SOC reasoning | Multi-answer MCQ | Fixed | — | — | — | — |
| AI SOC LLM Leaderboard (Simbian, prior work) | Alert investigation | End-to-end agentic | Fixed | ✓ | ✓ | — | — |
| ExCyTIn-Bench (Microsoft) | Threat investigation | Task-based Q&A | Fixed | ✓ | ✓ | — | — |
| SIR-Bench (AWS) | Incident response | Lexical findings | Scalable (Manual) | ✓ | ✓ | ✓ | — |
| Cyber Defense Benchmark (ours) | Threat hunting | Attack log hunting | Scalable (Code) | ✓ | ✓ | ✓ | ✓ |
A realistic setup for hunting attack evidence in logs across diverse attack chains, built on real attack telemetry. The best of both worlds: real attack telemetry combined with synthetic, seamless, conditional attack chaining.
87/205 parent techniques covered, spanning 93 sub-techniques across 13 tactics — from Initial Access through Exfiltration and Impact.
Holodeck Gymnasium mutates the organization environment and the attacker infrastructure together, along with event time sequencing, providing an RLVR gym.
Seeded randomization ensures no two runs present identical data. Static benchmarks can be gamed by training on the test set; Cyber Defense cannot.
Flags are matched by exact timestamp against ground truth. No LLM judges, no subjective rubrics. Binary, reproducible, auditable.
API cost is tracked per run and plotted on the Pareto frontier. Practitioners need to know what detection costs, not just who 'wins.'
How the Cyber Defense Benchmark constructs realistic attack simulations and evaluates AI threat hunting agents.
The Holodeck Gymnasium environment generates multi-chain attack campaigns from 106 canonical procedures sourced from the Security-Datasets repository. Each procedure replays genuine Windows telemetry (Sysmon, Security event logs) captured from real attack execution in instrumented lab environments — not synthetic, not hand-written, not approximated.
Each campaign run mutates the raw telemetry with a seeded random context, covering both the environment and the attacker infrastructure (hostnames, user identities, IP addresses, tool names, hashes) as well as timestamps. This ensures no two benchmark runs present identical log data, preventing memorization and forcing genuine analytical reasoning.
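A minimal sketch of what such context morphing could look like is below: a seed-driven remapping applied uniformly to every event, so internal consistency is preserved (the same host always maps to the same new name) while surface identifiers change. Field names and the remapping scheme are assumptions for the example.

```python
# Sketch of seeded context morphing: deterministically remap identifiers and
# shift timestamps for a whole campaign. Field names are illustrative.
import random
from datetime import datetime, timedelta

def morph(events: list[dict], seed: int) -> list[dict]:
    rng = random.Random(seed)
    host_map: dict[str, str] = {}
    shift = timedelta(days=rng.randint(-180, 180), seconds=rng.randint(0, 86_400))

    def remap_host(old: str) -> str:
        # The same original host always maps to the same new name within a run.
        if old not in host_map:
            host_map[old] = f"WKS-{rng.randrange(1000, 10000)}"
        return host_map[old]

    morphed = []
    for ev in events:
        ev = dict(ev)
        ev["host"] = remap_host(ev["host"])
        ev["utc_time"] = (datetime.fromisoformat(ev["utc_time"]) + shift).isoformat()
        morphed.append(ev)
    return morphed
```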
An AI agent receives a threat intel briefing and a SQL-queryable log database. It must find evidence of the attack by issuing iterative SQL queries and submitting the exact timestamps of malicious events. Scoring is fully deterministic: no LLM judges, no subjective rubrics.
Total API cost (tokens consumed, model pricing) is tracked per run, enabling cost-efficiency analysis across the Pareto frontier of performance (Coverage Score) versus cost (dollars spent).
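Cost accounting of this kind reduces to token counts times per-model prices; a minimal sketch is below, with an illustrative price table rather than any provider's actual rates.

```python
# Sketch of per-run API cost from token usage. Prices are placeholders
# (USD per million tokens), not real provider rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```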
Technique coverage across MITRE ATT&CK tactics in the benchmark procedure library.
| Tactic | Techniques | Covered | Coverage |
|---|---|---|---|
| Resource Development | 8 | 2 | 25% |
| Initial Access | 11 | 3 | 27% |
| Execution | 17 | 8 | 47% |
| Persistence | 23 | 14 | 61% |
| Privilege Escalation | 14 | 12 | 86% |
| Defense Evasion | 47 | 21 | 45% |
| Credential Access | 17 | 9 | 53% |
| Discovery | 34 | 16 | 47% |
| Lateral Movement | 9 | 4 | 44% |
| Collection | 17 | 6 | 35% |
| Command and Control | 18 | 8 | 44% |
| Exfiltration | 9 | 2 | 22% |
| Impact | 15 | 2 | 13% |
| Unique total | 205 | 87 | 42% |
The “Unique total” row counts each parent technique once. A single technique can be tagged with multiple tactics in MITRE ATT&CK (e.g. T1078 Valid Accounts spans Initial Access, Persistence, Privilege Escalation, and Defense Evasion), so the per-tactic Techniques column does not sum to 205; only the count of unique techniques does.
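The arithmetic behind that note is just set deduplication; the toy example below uses a few real technique IDs but an invented tactic-to-technique mapping, purely to show why the per-tactic counts overlap.

```python
# Why per-tactic counts sum past the unique total: a multi-tactic technique
# (e.g. T1078) is counted once per tactic but only once in the union.
tactic_to_techniques = {
    "Initial Access": {"T1078", "T1566"},
    "Persistence": {"T1078", "T1053"},
    "Privilege Escalation": {"T1078", "T1055"},
}
per_tactic_sum = sum(len(t) for t in tactic_to_techniques.values())  # 6: T1078 counted 3 times
unique_total = len(set().union(*tactic_to_techniques.values()))      # 4: T1078 counted once
```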
Known boundaries of the current benchmark.
All procedures are sourced from Windows endpoints (Sysmon, Security event logs). Linux, macOS, cloud, and network telemetry are not yet covered.
Attack procedures use a fixed set of offensive techniques. Novel or custom tooling is not represented.
886 total runs across 11 models. Per-model sample sizes vary. Results should be interpreted with this variance in mind.
Agents interact exclusively via SQL queries. Models with stronger code generation may have an inherent advantage unrelated to security reasoning ability.
Cyber Defense builds on open source attack telemetry from the security research community.
One sample environment is available on GitHub. Full benchmark access is available to credible organizations on request at research@simbian.ai.
Attack procedures sourced from the Security-Datasets project (Open Threat Research Forge). We gratefully credit the community contributors who built the canonical attack telemetry.
To cite the Cyber Defense Benchmark in a paper or model card, please use the BibTeX entry below (paper: https://arxiv.org/abs/2604.19533):
@misc{chona2026cyberdefensebenchmarkagentic,
title = {Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps},
author = {Alankrit Chona and Igor Kozlov and Ambuj Kumar},
year = {2026},
eprint = {2604.19533},
archivePrefix = {arXiv},
primaryClass = {cs.CR},
url = {https://arxiv.org/abs/2604.19533},
}
Per-model coverage on each MITRE ATT&CK tactic. Ordered by kill-chain stage.
| Tactic | Opus 4.6 | Sonnet 4.6 | Opus 4.7 | GPT 5 | Gemini 3.1 Pro | Qwen 3.6 | Gemini 3 Flash | Kimi 2.6 | Minimax 2.7 | Kimi 2.5 | DeepSeek 3.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Initial Access | 37% | 24% | 18% | 2% | 13% | 11% | 12% | 8% | 3% | 5% | 4% |
| Execution | 56% | 45% | 37% | 22% | 23% | 15% | 19% | 17% | 15% | 11% | 10% |
| Persistence | 56% | 43% | 35% | 28% | 25% | 18% | 20% | 17% | 19% | 12% | 14% |
| Privilege Escalation | 53% | 42% | 32% | 27% | 23% | 13% | 18% | 17% | 17% | 11% | 12% |
| Defense Evasion | 60% | 49% | 39% | 25% | 25% | 16% | 21% | 19% | 18% | 13% | 11% |
| Credential Access | 27% | 20% | 14% | 7% | 9% | 6% | 6% | 10% | 4% | 5% | 4% |
| Discovery | 54% | 37% | 36% | 24% | 20% | 15% | 16% | 15% | 14% | 11% | 10% |
| Lateral Movement | 31% | 22% | 18% | 5% | 10% | 7% | 7% | 9% | 3% | 5% | 5% |
| Collection | 25% | 16% | 4% | 5% | 9% | 9% | 6% | 6% | 4% | 4% | 3% |
| Command and Control | 56% | 46% | 41% | 23% | 23% | 16% | 20% | 18% | 16% | 12% | 11% |
| Exfiltration | 25% | 18% | 17% | 2% | 5% | 3% | 4% | 2% | 2% | 2% | 1% |
| Impact | 58% | 50% | 33% | 38% | 17% | 19% | 16% | 18% | 14% | 9% | 11% |
| Resource Development | 63% | 50% | 46% | 23% | 29% | 22% | 24% | 21% | 18% | 15% | 16% |