Every existing cyber benchmark asks models to answer questions about attacks. Cyber Defense Benchmark is the first to use real attack telemetry in a scalable way in an agentic investigation format — giving AI agents raw logs and asking them to find the intrusion. 105 procedures, 11 frontier LLMs, zero passing scores. RLVR-ready.
Every multi-stage attack in the benchmark is a sequence of attacker steps. Each step is a procedure that maps onto a MITRE ATT&CK tactic — the high-level phases of an intrusion (Initial Access, Discovery, Lateral Movement, Exfiltration, etc.).
The Coverage Score represents what fraction of the attack the agent surfaces, scaled so a perfect run scores 1.0. The radar above shows that score broken out per tactic; the outer ring is the benchmark maximum.
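As a rough illustration of how that number is produced, the minimal sketch below treats coverage as the fraction of ground-truth attack steps for which the agent submitted at least one correct flag. The data structures and aggregation details here are assumptions for the example; the benchmark's exact scoring pipeline is described in the technical report.

```python
# Minimal sketch of a coverage-style score: the fraction of ground-truth attack
# steps the agent surfaced, so a perfect run scores 1.0. The step/flag
# structures are illustrative, not the benchmark's actual data model.
def coverage_score(ground_truth_steps: list[set[str]], submitted_flags: set[str]) -> float:
    """ground_truth_steps holds, per attack step, the set of valid flag timestamps."""
    if not ground_truth_steps:
        return 0.0
    surfaced = sum(1 for step_flags in ground_truth_steps if step_flags & submitted_flags)
    return surfaced / len(ground_truth_steps)
```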
| Rank | Model | Coverage Score | Flags Found (%) | Mean Cost / Run |
|---|---|---|---|---|
| 1 | Opus 4.6 | 0.46 | 4.49 | $17.98 |
| 2 | Sonnet 4.6 | 0.35 | 3.43 | $12.99 |
| 3 | Opus 4.7 | 0.28 | 2.97 | $8.72 |
| 4 | Gemini 3.1 Pro | 0.18 | 2.02 | $1.85 |
| 5 | GPT 5 | 0.17 | 2.24 | $1.07 |
| 6 | Gemini 3 Flash | 0.15 | 1.44 | $0.19 |
| 7 | Kimi 2.6 | 0.13 | 1.15 | $0.52 |
| 8 | Qwen 3.6 | 0.13 | 1.58 | $0.54 |
| 9 | Minimax 2.7 | 0.11 | 0.97 | $0.11 |
| 10 | DeepSeek 3.2 | 0.09 | 0.82 | $0.91 |
| 11 | Kimi 2.5 | 0.09 | 0.87 | $1.44 |
11 frontier models evaluated across 886 runs over 26 multi-stage campaigns spanning 105 procedures from the canonical Security-Datasets corpus. Each hunt has a 50-query budget against 75,000–135,000 log records. Total LLM API cost across all runs: $1,682.
The chart below tracks each model's Coverage Score (explained above) as turns are spent. Opus 4.6 separates from the field early and finishes at 0.46, meaning more than half of the coverable narrative steps are still missed per run on average. Several lower-ranked models stop progressing well before exhausting their query budget, not because they run out of hypotheses, but because they assume their work is done. For that reason they cannot be trusted to run unsupervised.
Detection ability and cost of investigation are not linearly related. Gemini 3 Flash costs $0.19/run and finds 1.4% of flags. Opus 4.6 costs $17.98/run — roughly 95× more — and finds 4.5% of flags, about 3.1× more. That is the Pareto frontier in one sentence: the curve bends sharply, not smoothly. The cost-quality trade-off is a cliff, not a slope.
GPT 5 and Gemini 3.1 Pro sit in the mid-price range at $1.07 and $1.85 per run. Both plateau around 2.1% of flags and stop climbing. Throwing more budget at them does not help — the limit is not token cost, it is that the agent believes it has completed its task and has no work left to do.
Finding more flags is only meaningful if the detections span the full attack lifecycle. Against the >50%-per-tactic passing bar, Opus 4.6 clears 8 of 13 tactics; of the other 10 models only Sonnet 4.6 clears 1 tactic, and the remaining 9 fail every tactic category. Partial visibility means that some stages of the attack go unnoticed.
Tactic-level breakdown for each model — where they excel and where they go blind. Models with blind spots (0% detection on a tactic) miss entire phases of the kill chain.
The Cyber Defense Benchmark is available to credible organizations at no cost.
Bring your model — we run the benchmark.
Cyber Defense provides verifiable rewards for reinforcement learning. Binary flag matching means every agent action produces a deterministic, auditable reward signal — no LLM judges, no subjective rubrics. The Holodeck Gymnasium environment generates infinite non-repeating episodes through context morphing, making it impossible to overfit to a fixed attack sequence.
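A minimal sketch of how such a reward could be wired up is below, assuming ground truth is a set of exact malicious-event timestamps; the function names are illustrative, not the benchmark's actual API.

```python
# Sketch of a verifiable, binary reward from flag matching, suitable for RLVR.
# Exact-match on timestamps keeps the signal deterministic and auditable.
def flag_reward(submitted_timestamp: str, ground_truth_timestamps: set[str]) -> float:
    return 1.0 if submitted_timestamp in ground_truth_timestamps else 0.0

def episode_reward(submitted: list[str], ground_truth_timestamps: set[str]) -> float:
    # Total reward for a hunt: each distinct correct flag counts once.
    return sum(flag_reward(t, ground_truth_timestamps) for t in set(submitted))
```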
No existing benchmark uses real attack telemetry in a scalable way in an agentic format. The Cyber Defense Benchmark does. Every event was captured from actual attack execution, using tools such as Empire, Covenant, Mimikatz, and Rubeus, on instrumented Windows endpoints. These are real Sysmon and Security logs, not synthetic or hand-written examples. Agents investigate 134K+ events per environment through iterative SQL queries.
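For a concrete sense of the interaction, the sketch below shows the kind of iterative query an agent might issue against a local copy of the log database. The table and column names (events, event_id, command_line, utc_time) and the file name are assumptions for illustration; the real schema may differ.

```python
# Illustrative only: one hunting query against a hypothetical SQLite export of
# the environment's event log. Schema and file name are assumed for the example.
import sqlite3

conn = sqlite3.connect("hunt_environment.db")
rows = conn.execute(
    """
    SELECT utc_time, host, image, command_line
    FROM events
    WHERE event_id = 1                         -- Sysmon process creation
      AND lower(command_line) LIKE '%-enc %'   -- encoded PowerShell is a common lead
    ORDER BY utc_time
    LIMIT 50
    """
).fetchall()
for utc_time, host, image, command_line in rows:
    print(utc_time, host, image, command_line)
```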
| Benchmark | Focus | Format | Scale | Real Telemetry | Agentic | Anti-Memorization | Cost Tracking |
|---|---|---|---|---|---|---|---|
| CTI-Bench | CTI knowledge | MCQ / short answer | Fixed | — | — | — | — |
| CyberSOCEval | SOC reasoning | Multi-answer MCQ | Fixed | — | — | — | — |
| AI SOC LLM Leaderboard (Simbian, prior work) | Alert investigation | End-to-end agentic | Fixed | ✓ | ✓ | — | — |
| ExCyTIn-Bench (Microsoft) | Threat investigation | Task-based Q&A | Fixed | ✓ | ✓ | — | — |
| SIR-Bench (AWS) | Incident response | Lexical findings | Scalable (Manual) | ✓ | ✓ | ✓ | — |
| Cyber Defense Benchmark (ours) | Threat hunting | Attack log hunting | Scalable (Code) | ✓ | ✓ | ✓ | ✓ |
A realistic setup for hunting attack evidence in logs across diverse attack chains, built on real attack telemetry. The best of both worlds: real attack telemetry combined with synthetic, seamless, conditional attack chaining.
87/205 parent techniques covered, spanning 93 sub-techniques across 13 tactics — from Initial Access through Exfiltration and Impact.
Holodeck Gymnasium mutates the organization environment and the attacker infrastructure together, along with event time sequencing, providing an RLVR gym.
Seeded randomization ensures no two runs present identical data. Static benchmarks can be gamed by training on the test set; Cyber Defense cannot.
Flags are matched by exact timestamp against ground truth. No LLM judges, no subjective rubrics. Binary, reproducible, auditable.
API cost is tracked per run and plotted on the Pareto frontier. Practitioners need to know what detection costs, not just who 'wins.'
How the Cyber Defense Benchmark constructs realistic attack simulations and evaluates AI threat hunting agents.
The Holodeck Gymnasium environment generates multi-chain attack campaigns from 106 canonical procedures sourced from the Security-Datasets repository. Each procedure replays genuine Windows telemetry (Sysmon, Security event logs) captured from real attack execution in instrumented lab environments — not synthetic, not hand-written, not approximated.
Each campaign run mutates the raw telemetry with a seeded random context, covering both the environment and the attacker infrastructure (hostnames, user identities, IP addresses, tool names, hashes) as well as timestamps. This ensures no two benchmark runs present identical log data, preventing memorization and forcing genuine analytical reasoning.
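A minimal sketch of what such context morphing could look like is below: a seed-driven remapping applied uniformly to every event, so internal consistency is preserved (the same host always maps to the same new name) while surface identifiers change. Field names and the remapping scheme are assumptions for the example.

```python
# Sketch of seeded context morphing: deterministically remap identifiers and
# shift timestamps for a whole campaign. Field names are illustrative.
import random
from datetime import datetime, timedelta

def morph(events: list[dict], seed: int) -> list[dict]:
    rng = random.Random(seed)
    host_map: dict[str, str] = {}
    shift = timedelta(days=rng.randint(-180, 180), seconds=rng.randint(0, 86_400))

    def remap_host(old: str) -> str:
        # The same original host always maps to the same new name within a run.
        if old not in host_map:
            host_map[old] = f"WKS-{rng.randrange(1000, 10000)}"
        return host_map[old]

    morphed = []
    for ev in events:
        ev = dict(ev)
        ev["host"] = remap_host(ev["host"])
        ev["utc_time"] = (datetime.fromisoformat(ev["utc_time"]) + shift).isoformat()
        morphed.append(ev)
    return morphed
```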
An AI agent receives a threat intel briefing and a SQL-queryable log database. It must find evidence of the attack by issuing iterative SQL queries and submitting the exact timestamps of malicious events. Scoring is fully deterministic: no LLM judges, no subjective rubrics.
Total API cost (tokens consumed, model pricing) is tracked per run, enabling cost-efficiency analysis across the Pareto frontier of performance (Coverage Score) versus cost (dollars spent).
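Cost accounting of this kind reduces to token counts times per-model prices; a minimal sketch is below, with an illustrative price table rather than any provider's actual rates.

```python
# Sketch of per-run API cost from token usage. Prices are placeholders
# (USD per million tokens), not real provider rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```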
Technique coverage across MITRE ATT&CK tactics in the benchmark procedure library.
| Tactic | Techniques | Covered | Coverage |
|---|---|---|---|
| Resource Development | 8 | 2 | 25% |
| Initial Access | 11 | 3 | 27% |
| Execution | 17 | 8 | 47% |
| Persistence | 23 | 14 | 61% |
| Privilege Escalation | 14 | 12 | 86% |
| Defense Evasion | 47 | 21 | 45% |
| Credential Access | 17 | 9 | 53% |
| Discovery | 34 | 16 | 47% |
| Lateral Movement | 9 | 4 | 44% |
| Collection | 17 | 6 | 35% |
| Command and Control | 18 | 8 | 44% |
| Exfiltration | 9 | 2 | 22% |
| Impact | 15 | 2 | 13% |
| Unique total | 205 | 87 | 42% |
The “Unique total” row counts each parent technique once. A single technique can be tagged with multiple tactics in MITRE ATT&CK (e.g. T1078 Valid Accounts spans Initial Access, Persistence, Privilege Escalation, and Defense Evasion), so the per-tactic Techniques column does not sum to 205; only the count of unique techniques does.
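The arithmetic behind that note is just set deduplication; the toy example below uses a few real technique IDs but an invented tactic-to-technique mapping, purely to show why the per-tactic counts overlap.

```python
# Why per-tactic counts sum past the unique total: a multi-tactic technique
# (e.g. T1078) is counted once per tactic but only once in the union.
tactic_to_techniques = {
    "Initial Access": {"T1078", "T1566"},
    "Persistence": {"T1078", "T1053"},
    "Privilege Escalation": {"T1078", "T1055"},
}
per_tactic_sum = sum(len(t) for t in tactic_to_techniques.values())  # 6: T1078 counted 3 times
unique_total = len(set().union(*tactic_to_techniques.values()))      # 4: T1078 counted once
```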
Known boundaries of the current benchmark.
All procedures are sourced from Windows endpoints (Sysmon, Security event logs). Linux, macOS, cloud, and network telemetry are not yet covered.
Attack procedures use a fixed set of offensive techniques. Novel or custom tooling is not represented.
886 total runs across 11 models. Per-model sample sizes vary. Results should be interpreted with this variance in mind.
Agents interact exclusively via SQL queries. Models with stronger code generation may have an inherent advantage unrelated to security reasoning ability.
Cyber Defense builds on open source attack telemetry from the security research community.
One sample environment is available on GitHub. Full benchmark access is available to credible organizations on request at research@simbian.ai.
Attack procedures sourced from the Security-Datasets project (Open Threat Research Forge). We gratefully credit the community contributors who built the canonical attack telemetry.
To cite the Cyber Defense Benchmark in a paper or model card, please use the BibTeX entry below (paper: https://arxiv.org/abs/2604.19533):
@misc{chona2026cyberdefensebenchmarkagentic,
title = {Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps},
author = {Alankrit Chona and Igor Kozlov and Ambuj Kumar},
year = {2026},
eprint = {2604.19533},
archivePrefix = {arXiv},
primaryClass = {cs.CR},
url = {https://arxiv.org/abs/2604.19533},
}
Per-model coverage on each MITRE ATT&CK tactic. Ordered by kill-chain stage.
| Tactic | Opus 4.6 | Sonnet 4.6 | Opus 4.7 | GPT 5 | Gemini 3.1 Pro | Qwen 3.6 | Gemini 3 Flash | Kimi 2.6 | Minimax 2.7 | Kimi 2.5 | DeepSeek 3.2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Initial Access | 37% | 24% | 18% | 2% | 13% | 11% | 12% | 8% | 3% | 5% | 4% |
| Execution | 56% | 45% | 37% | 22% | 23% | 15% | 19% | 17% | 15% | 11% | 10% |
| Persistence | 56% | 43% | 35% | 28% | 25% | 18% | 20% | 17% | 19% | 12% | 14% |
| Privilege Escalation | 53% | 42% | 32% | 27% | 23% | 13% | 18% | 17% | 17% | 11% | 12% |
| Defense Evasion | 60% | 49% | 39% | 25% | 25% | 16% | 21% | 19% | 18% | 13% | 11% |
| Credential Access | 27% | 20% | 14% | 7% | 9% | 6% | 6% | 10% | 4% | 5% | 4% |
| Discovery | 54% | 37% | 36% | 24% | 20% | 15% | 16% | 15% | 14% | 11% | 10% |
| Lateral Movement | 31% | 22% | 18% | 5% | 10% | 7% | 7% | 9% | 3% | 5% | 5% |
| Collection | 25% | 16% | 4% | 5% | 9% | 9% | 6% | 6% | 4% | 4% | 3% |
| Command and Control | 56% | 46% | 41% | 23% | 23% | 16% | 20% | 18% | 16% | 12% | 11% |
| Exfiltration | 25% | 18% | 17% | 2% | 5% | 3% | 4% | 2% | 2% | 2% | 1% |
| Impact | 58% | 50% | 33% | 38% | 17% | 19% | 16% | 18% | 14% | 9% | 11% |
| Resource Development | 63% | 50% | 46% | 23% | 29% | 22% | 24% | 21% | 18% | 15% | 16% |