LLMs Can Find & Exploit Vulnerabilities. Out of the Box, They Fail at Defense.

Every existing cyber benchmark asks models to answer questions about attacks. The Cyber Defense Benchmark is the first to put real attack telemetry into a scalable, agentic investigation format, giving AI agents raw logs and asking them to find the intrusion. 105 procedures, 11 frontier LLMs, zero passing scores. RLVR-ready.

Model Per-Tactic Coverage

Every multi-stage attack in the benchmark is a sequence of attacker steps. Each step is a procedure that maps onto a MITRE ATT&CK tactic — the high-level phases of an intrusion (Initial Access, Discovery, Lateral Movement, Exfiltration, etc.).

The Coverage Score represents what fraction of the attack the agent surfaces, scaled so a perfect run scores 1.0. The radar above shows that score broken out per tactic; the outer ring is the benchmark maximum.
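For readers who prefer code, here is a minimal sketch of a coverage-style calculation, assuming coverage is simply the fraction of ground-truth attack steps the agent surfaced, overall and per tactic; the benchmark's exact aggregation may differ.

# Hypothetical coverage-style metric: fraction of ground-truth attack steps
# the agent surfaced, overall and broken out per MITRE ATT&CK tactic.
from collections import defaultdict

ground_truth = [            # (procedure_id, tactic) pairs for one campaign
    ("proc-01", "Initial Access"),
    ("proc-02", "Execution"),
    ("proc-03", "Discovery"),
    ("proc-04", "Exfiltration"),
]
surfaced = {"proc-02", "proc-03"}    # steps the agent correctly flagged

overall = sum(p in surfaced for p, _ in ground_truth) / len(ground_truth)

per_tactic = defaultdict(lambda: [0, 0])       # tactic -> [hits, total]
for proc, tactic in ground_truth:
    per_tactic[tactic][0] += proc in surfaced
    per_tactic[tactic][1] += 1

print(f"overall coverage: {overall:.2f}")      # 0.50
for tactic, (hits, total) in per_tactic.items():
    print(f"{tactic}: {hits}/{total}")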

Model Leaderboard
#    Model            Coverage Score   Flag %   Cost (mean $/run)
1    Opus 4.6         0.46             4.49     $17.98
2    Sonnet 4.6       0.35             3.43     $12.99
3    Opus 4.7         0.28             2.97     $8.72
4    Gemini 3.1 Pro   0.18             2.02     $1.85
5    GPT 5            0.17             2.24     $1.07
6    Gemini 3 Flash   0.15             1.44     $0.19
7    Kimi 2.6         0.13             1.15     $0.52
8    Qwen 3.6         0.13             1.58     $0.54
9    Minimax 2.7      0.11             0.97     $0.11
10   DeepSeek 3.2     0.09             0.82     $0.91
11   Kimi 2.5         0.09             0.87     $1.44
Nobody passed. Passing means >50% recall on every MITRE ATT&CK tactic, the minimum bar for unsupervised SOC deployment. 0 of 11 models clear it. Opus 4.6 leads with 8 of 13 tactics above the bar; of the other 10 models, only Sonnet 4.6 clears even one tactic, and the remaining 9 clear none.

Explore AI for Security Operations

Understand how Simbian brings together SOC, threat hunting, and response in one platform.

Benchmark Results

11 frontier models evaluated across 886 runs over 26 multi-stage campaigns spanning 105 procedures from the canonical Security-Datasets corpus. Each hunt has a 50-query budget against 75,000–135,000 log records. Total LLM API cost across all runs: $1,682.
↓ Download CSV ↓ Technical Report (arXiv)

Coverage Over Time

The chart below tracks each model’s Coverage Score (explained above) as turns are spent. Opus 4.6 separates from the field early and finishes at 0.46, meaning more than half of the coverable narrative steps are still missed per run on average. Several lower-ranked models stop progressing well before exhausting their query budget, not because they run out of hypotheses but because they conclude their work is done. For that reason they cannot be trusted to run unsupervised.

Coverage Score over turns, per model. No model approaches full procedure coverage in any campaign.

Cost Efficiency

Detection ability and investigation cost are not linearly related. Gemini 3 Flash costs $0.19/run and finds 1.4% of flags. Opus 4.6 costs $17.98/run, roughly 95× more, and finds 4.5% of flags, about 3.1× more. That is the Pareto frontier in one sentence: the cost-quality trade-off is a cliff, not a slope.

GPT 5 and Gemini 3.1 Pro sit in the mid-price range at $1.07 and $1.85 per run. Both plateau around 2.1% of flags. Throwing more budget at them does not help: the limit is not token cost, it is that the agent believes its task is complete and has no work left to do.

Cost (USD per run, log scale) vs. Coverage Score (mean ± std across runs). Dashed line = Pareto frontier.

Tactic Breadth

Finding more flags is only meaningful if the detections span the full attack lifecycle. Against the >50%-per-tactic passing bar, Opus 4.6 clears 8 of 13 tactics; of the other 10 models, only Sonnet 4.6 clears a single tactic, and the remaining 9 fail every tactic category. Partial visibility means that some of the attacker’s stages go unnoticed.

Per-model coverage broken out by MITRE ATT&CK tactic. Gaps mean the tactic was never visited; the outer ring is the benchmark maximum.

Model Report Cards

Tactic-level breakdown for each model — where they excel and where they go blind. Models with blind spots (0% detection on a tactic) miss entire phases of the kill chain.

#1 Opus 4.6
Coverage Score: 0.46
Cost per run: $17.98
Tactics >50%: 8/13
Best coverage: Resource Development 63% · Defense Evasion 60%
Weakest coverage: Collection 25% · Exfiltration 25%

#2 Sonnet 4.6
Coverage Score: 0.35
Cost per run: $12.99
Tactics >50%: 1/13
Best coverage: Impact 50% · Resource Development 50%
Weakest coverage: Collection 16% · Exfiltration 18%

#3 Opus 4.7
Coverage Score: 0.28
Cost per run: $8.72
Tactics >50%: 0/13
Best coverage: Resource Development 46% · Command and Control 41%
Weakest coverage: Collection 4% · Credential Access 14%
#4 Gemini 3.1 Pro
Coverage Score: 0.18
Cost per run: $1.85
Tactics >50%: 0/13
Best coverage: Resource Development 29% · Defense Evasion 25%
Weakest coverage: Exfiltration 5% · Collection 9%

#5 GPT 5
Coverage Score: 0.17
Cost per run: $1.07
Tactics >50%: 0/13
Best coverage: Impact 38% · Persistence 28%
Weakest coverage: Exfiltration 2% · Initial Access 2%

#6 Gemini 3 Flash
Coverage Score: 0.15
Cost per run: $0.19
Tactics >50%: 0/13
Best coverage: Resource Development 24% · Defense Evasion 21%
Weakest coverage: Exfiltration 4% · Collection 6%

#7 Kimi 2.6
Coverage Score: 0.13
Cost per run: $0.52
Tactics >50%: 0/13
Best coverage: Resource Development 21% · Defense Evasion 19%
Weakest coverage: Exfiltration 2% · Collection 6%

#8 Qwen 3.6
Coverage Score: 0.13
Cost per run: $0.54
Tactics >50%: 0/13
Best coverage: Resource Development 22% · Impact 19%
Weakest coverage: Exfiltration 3% · Credential Access 6%

#9 Minimax 2.7
Coverage Score: 0.11
Cost per run: $0.11
Tactics >50%: 0/13
Best coverage: Persistence 19% · Defense Evasion 18%
Weakest coverage: Exfiltration 2% · Initial Access 3%

#10 DeepSeek 3.2
Coverage Score: 0.09
Cost per run: $0.91
Tactics >50%: 0/13
Best coverage: Resource Development 16% · Persistence 14%
Weakest coverage: Collection 3% · Credential Access 4%
Blind spots (0%): Exfiltration

#11 Kimi 2.5
Coverage Score: 0.09
Cost per run: $1.44
Tactics >50%: 0/13
Best coverage: Resource Development 15% · Defense Evasion 13%
Weakest coverage: Exfiltration 2% · Collection 4%

Test Your Model

The Cyber Defense Benchmark is available to credible organizations at no cost.
Bring your model — we run the benchmark.

research@simbian.ai
For model developers

Built for RLVR.

Cyber Defense provides verifiable rewards for reinforcement learning. Binary flag matching means every agent action produces a deterministic, auditable reward signal — no LLM judges, no subjective rubrics. The Holodeck Gymnasium environment generates infinite non-repeating episodes through context morphing, making it impossible to overfit to a fixed attack sequence.

Binary Rewards: Exact timestamp match against ground truth. Deterministic, reproducible, verifiable.
Gymnasium API: Standard Gymnasium-compatible environment. Drop-in integration with existing RL pipelines.
Infinite Episodes: Seeded context morphing generates unique environments on every run. No memorization possible.
Real Telemetry: Not synthetic data. Real Sysmon + Security logs from actual attack execution on Windows endpoints.
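To make the integration concrete, here is a minimal sketch of what a Gymnasium-style wrapper with binary timestamp-match rewards could look like. The class name, observation/action layout, SUBMIT convention, and the run_sql stub are illustrative assumptions, not the benchmark's published API.

# Illustrative only: a Gymnasium-style wrapper showing how binary, verifiable
# rewards from timestamp matching plug into a standard RL loop.
import gymnasium as gym
from gymnasium import spaces


def run_sql(query: str) -> str:
    # Stub; a real environment would execute the query against the episode's
    # DuckDB log database and return rows as text.
    return "(query results)"


class ThreatHuntEnv(gym.Env):
    def __init__(self, ground_truth_timestamps: set[str], max_queries: int = 50):
        super().__init__()
        self.ground_truth = ground_truth_timestamps
        self.max_queries = max_queries
        self.observation_space = spaces.Text(max_length=65536)   # briefing / query results
        self.action_space = spaces.Text(max_length=4096)         # SQL text or a submission

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)   # the seed drives context morphing in the real environment
        self.queries_used = 0
        briefing = "THREAT INTELLIGENCE BRIEFING: investigate the log database..."
        return briefing, {}

    def step(self, action: str):
        self.queries_used += 1
        if action.startswith("SUBMIT "):
            # Binary reward: exact timestamp match against ground truth.
            reward = 1.0 if action.removeprefix("SUBMIT ") in self.ground_truth else 0.0
            obs = "submission recorded"
        else:
            reward = 0.0
            obs = run_sql(action)
        terminated = self.queries_used >= self.max_queries
        return obs, reward, terminated, False, {}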

Why Cyber Defense Benchmark

No existing benchmark uses real attack telemetry in a scalable way in an agentic format. The Cyber Defense Benchmark does. Every event was captured from actual attack execution, using tools such as Empire, Covenant, Mimikatz, and Rubeus, on instrumented Windows endpoints. These are real Sysmon and Security logs, not synthetic or hand-written examples. Agents investigate 75,000–135,000 events per environment through iterative SQL queries.

Q&A BENCHMARK: pre-written question → model → LLM judge → score. One question, LLM-judged, subjective.
HOLODECK: threat briefing + raw logs → agent ↔ SQL queries against 134K events in DuckDB (up to 50 iterations) → submit timestamps → deterministic scorer → score. 50 queries, iterative, binary scoring.
Q&A benchmarks test knowledge. Cyber Defense tests whether an agent can autonomously find an attack through iterative hypothesis and query — no pre-written questions, no LLM judges. Powered by the Holodeck Gymnasium environment.
Typical Q&A Benchmark
“Given the following Sysmon logs, identify which MITRE ATT&CK technique was used by the attacker.”
→ T1059.001 — PowerShell Execution
Static question. Text answer. One technique.
Holodeck
“THREAT INTELLIGENCE BRIEFING INTEL SUMMARY: We have received credible intelligence that an attacker has conducted operations against our organization. YOUR MISSION: Investigate the log database...”
SELECT "TimeCreated", "Image", "CommandLine", "ParentImage", "ParentCommandLine", "User" FROM logs WHERE "EventID" = '1' AND ("CommandLine" LIKE '%powershell%' OR "CommandLine" LIKE '%cmd%' OR "CommandLine" LIKE '%encode%' OR "CommandLine" LIKE...
105 procedures. 50 queries. Find every malicious timestamp across the full kill chain.
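As a concrete illustration of the iterative loop, the sketch below drives DuckDB with one refinement step. The logs table and column names follow the example query above, while the in-memory data and the decision rule are stand-ins, not benchmark internals.

# Hedged sketch of the hunt loop: query the log store, inspect results,
# refine, then collect candidate timestamps for submission.
import duckdb

con = duckdb.connect()   # in-memory stand-in for the 134K-event episode database
con.execute("""
    CREATE TABLE logs AS SELECT * FROM (VALUES
        ('2026-03-01 10:12:03', '1', 'cmd.exe',        'cmd /c whoami'),
        ('2026-03-01 10:13:45', '1', 'powershell.exe', 'powershell -enc SQBFAFgA')
    ) AS t("TimeCreated", "EventID", "Image", "CommandLine")
""")

QUERY_BUDGET = 50
findings: set[str] = set()

for turn in range(QUERY_BUDGET):
    # In the benchmark, each SQL statement comes from the LLM agent; one
    # hard-coded refinement step is shown here.
    rows = con.execute("""
        SELECT "TimeCreated", "CommandLine" FROM logs
        WHERE "EventID" = '1' AND "CommandLine" LIKE '%powershell%'
    """).fetchall()
    for time_created, command_line in rows:
        if "-enc" in command_line.lower():   # encoded PowerShell looks suspicious
            findings.add(time_created)
    if findings:
        break   # a real agent keeps pivoting across tactics instead of stopping

print(sorted(findings))   # candidate timestamps, matched exactly against ground truth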
Benchmark                                      Focus                  Format               Scale
CTI-Bench                                      CTI knowledge          MCQ / short answer   Fixed
CyberSOCEval                                   SOC reasoning          Multi-answer MCQ     Fixed
AI SOC LLM Leaderboard (Simbian, prior work)   Alert investigation    End-to-end agentic   Fixed
ExCyTIn-Bench (Microsoft)                      Threat investigation   Task-based Q&A       Fixed
SIR-Bench (AWS)                                Incident response      Lexical findings     Scalable (Manual)
Cyber Defense Benchmark (ours)                 Threat hunting         Attack log hunting   Scalable (Code)

First Scalable Realistic Threat Hunting Benchmark

A realistic setup for searching logs for attack evidence across diverse attack chains, built on real attack telemetry. The best of both worlds: real attack telemetry combined with synthetic, seamless, conditional attack chaining.

Extensive MITRE ATT&CK Coverage

87/205 parent techniques covered, spanning 93 sub-techniques across 13 tactics — from Initial Access through Exfiltration and Impact.

Simulation + Emulation Hybrid

Holodeck Gymnasium mutates the organization’s environment and the attacker’s infrastructure, along with attack sequencing and timing, providing an RLVR-ready gym.

Anti-Memorization

Seeded randomization ensures no two runs present identical data. Static benchmarks can be gamed by training on the test set; Cyber Defense cannot.

Deterministic Scoring

Flags are matched by exact timestamp against ground truth. No LLM judges, no subjective rubrics. Binary, reproducible, auditable.

Cost as a First-Class Metric

API cost is tracked per run and plotted on the Pareto frontier. Practitioners need to know what detection costs, not just who 'wins.'

Benchmark Methodology

How the Cyber Defense Benchmark constructs realistic attack simulations and evaluates AI threat hunting agents.

Attack Simulation

The Holodeck Gymnasium environment generates multi-chain attack campaigns from 105 canonical procedures sourced from the Security-Datasets repository. Each procedure replays genuine Windows telemetry (Sysmon, Security event logs) captured from real attack execution in instrumented lab environments: not synthetic, not hand-written, not approximated.

93 ATT&CK sub-techniques
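A hypothetical sketch of campaign assembly, assuming procedures are sampled per tactic and ordered along the kill chain; the real Holodeck assembly (conditional chaining, shared attacker infrastructure) is more involved than this.

# Hypothetical campaign assembly: sample procedures per tactic, ordered
# along the kill chain, with a seed for reproducibility.
import random

TACTIC_ORDER = [
    "Resource Development", "Initial Access", "Execution", "Persistence",
    "Privilege Escalation", "Defense Evasion", "Credential Access", "Discovery",
    "Lateral Movement", "Collection", "Command and Control", "Exfiltration", "Impact",
]

def assemble_campaign(library: dict[str, list[str]], seed: int, steps_per_tactic: int = 1):
    """library maps tactic -> procedure ids backed by real telemetry."""
    rng = random.Random(seed)   # seeded, so every campaign is reproducible
    campaign = []
    for tactic in TACTIC_ORDER:
        candidates = library.get(tactic, [])
        campaign += rng.sample(candidates, min(steps_per_tactic, len(candidates)))
    return campaign

print(assemble_campaign(
    {"Initial Access": ["phishing-attachment"], "Execution": ["powershell-encoded", "wmi-exec"]},
    seed=7,
))   # e.g. ['phishing-attachment', 'wmi-exec']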

Context Morphing

Each campaign run mutates the raw telemetry with a seeded random context: both the environment and the attacker infrastructure (hostnames, user identities, IP addresses, tool names, hashes) as well as timestamps. This ensures no two benchmark runs present identical log data, preventing memorization and forcing genuine analytical reasoning.

Seeded reproducibility
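A minimal sketch of what seeded context morphing can look like, assuming per-value aliases and an episode-wide time shift; field names and entity types here are illustrative, and the benchmark morphs far more than this.

# Illustrative context morphing: rewrite identifiers and shift timestamps
# with a seeded RNG so no two episodes present identical data.
import random
from datetime import datetime, timedelta

def alias(value: str, seed: int, prefix: str) -> str:
    rng = random.Random(f"{seed}:{value}")   # same value maps to the same alias within an episode
    return f"{prefix}-{rng.randint(1000, 9999)}"

def morph_event(event: dict, seed: int) -> dict:
    shift = timedelta(days=random.Random(seed).randint(-30, 30))   # episode-wide time shift
    ts = datetime.fromisoformat(event["TimeCreated"])
    return {
        **event,
        "Hostname": alias(event["Hostname"], seed, "WS"),
        "User": alias(event["User"], seed, "user"),
        "TimeCreated": (ts + shift).isoformat(sep=" "),
    }

print(morph_event(
    {"Hostname": "WORKSTATION5", "User": "analyst1", "TimeCreated": "2020-09-21 03:04:05"},
    seed=42,
))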

Evaluation Protocol

An AI agent receives a threat intel briefing and a SQL-queryable log database. It must find evidence of the attack by issuing iterative SQL queries and submitting the exact timestamps of malicious events. Scoring is fully deterministic — no LLM judges, no subjective rubrics.

Deterministic flag matching
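A minimal sketch of deterministic flag matching, assuming submitted and ground-truth flags are timestamp strings; the exact metric fields the benchmark reports are not implied here.

# Deterministic scoring sketch: exact set intersection, no LLM judge.
def score_submission(submitted: set[str], ground_truth: set[str]) -> dict:
    true_positives = submitted & ground_truth
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    precision = len(true_positives) / len(submitted) if submitted else 0.0
    return {
        "flags_found": len(true_positives),
        "recall": recall,        # the quantity behind per-tactic coverage
        "precision": precision,
    }

print(score_submission(
    submitted={"2026-03-01 10:13:45", "2026-03-01 11:02:10"},
    ground_truth={"2026-03-01 10:13:45", "2026-03-01 12:40:00"},
))   # {'flags_found': 1, 'recall': 0.5, 'precision': 0.5}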

Cost Tracking

Total API cost (tokens consumed, model pricing) is tracked per run, enabling cost-efficiency analysis along the Pareto frontier of performance (Coverage Score) vs. cost (dollars spent).

50-query budget
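A short sketch of per-run cost accounting and a naive Pareto-frontier filter; the token counts and per-million prices are placeholders, not real model pricing, and the three labelled points reuse costs and coverage values from the leaderboard above.

# Per-run cost accounting plus a simple Pareto-frontier check.
def run_cost(prompt_tokens: int, completion_tokens: int,
             price_in_per_m: float, price_out_per_m: float) -> float:
    return (prompt_tokens * price_in_per_m + completion_tokens * price_out_per_m) / 1e6

def pareto_frontier(runs: list[tuple[str, float, float]]) -> list[str]:
    """runs: (model, cost, coverage). Keep models no other model beats on both axes."""
    frontier = []
    for name, cost, cov in runs:
        dominated = any(c <= cost and v >= cov and (c, v) != (cost, cov)
                        for _, c, v in runs)
        if not dominated:
            frontier.append(name)
    return frontier

print(run_cost(900_000, 45_000, price_in_per_m=3.0, price_out_per_m=15.0))   # 3.375
print(pareto_frontier([
    ("A", 17.98, 0.46), ("B", 1.85, 0.18), ("C", 1.07, 0.17), ("D", 2.00, 0.15),
]))   # ['A', 'B', 'C']  (D is dominated on both cost and coverage)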
01 Procedure Library: 105 procedures · real tool telemetry
02 Campaign Assembly: multi-stage kill chain · shared infrastructure
03 World-Builder: seed for environment and attacker infrastructure, plus sequence and timing of the attack
04 SQL Database: the same log data inputs as in real threat hunting, just no hints
Every run is unique. Every run is reproducible.
How a Holodeck environment is built: procedures from the library are assembled into a multi-stage kill chain, mutated with a unique seeded context (environment and attacker infrastructure, sequence and timing of attack), then materialized as the same SQL event log inputs used in real threat hunting — no hints provided.

MITRE ATT&CK Coverage

Technique coverage across MITRE ATT&CK tactics in the benchmark procedure library.

Tactic                 Techniques   Covered   Coverage
Resource Development   8            2         25%
Initial Access         11           3         27%
Execution              17           8         47%
Persistence            23           14        61%
Privilege Escalation   14           12        86%
Defense Evasion        47           21        45%
Credential Access      17           9         53%
Discovery              34           16        47%
Lateral Movement       9            4         44%
Collection             17           6         35%
Command and Control    18           8         44%
Exfiltration           9            2         22%
Impact                 15           2         13%
Unique total           205          87        42%

The “Unique total” row counts each parent technique once. A single technique can be tagged with multiple tactics in MITRE ATT&CK (e.g. T1078 Valid Accounts spans Initial Access, Persistence, Privilege Escalation, and Defense Evasion), so the per-tactic Techniques column does not sum to 205; only the count of unique techniques does.

Limitations & Scope

Known boundaries of the current benchmark.

01

Windows-Only Telemetry

All procedures are sourced from Windows endpoints (Sysmon, Security event logs). Linux, macOS, cloud, and network telemetry are not yet covered.

02

Fixed Toolset

Attack procedures use a fixed set of offensive techniques. Novel or custom tooling is not represented.

03

Sample Size

886 total runs across 11 models. Per-model sample sizes vary. Results should be interpreted with this variance in mind.

04

SQL-Only Interface

Agents interact exclusively via SQL queries. Models with stronger code generation may have an inherent advantage unrelated to security reasoning ability.

Data & Credits

Cyber Defense builds on open source attack telemetry from the security research community.

Data

Sample Environment

One sample environment is available on GitHub. Full benchmark access is available to credible organizations on request at research@simbian.ai.

Sources

Open Source Telemetry

Attack procedures sourced from the Security-Datasets project (Open Threat Research Forge). We gratefully credit the community contributors who built the canonical attack telemetry.

Cite This Work

To cite the Cyber Defense Benchmark in a paper or model card, please use the following BibTeX entry (paper: https://arxiv.org/abs/2604.19533):

@misc{chona2026cyberdefensebenchmarkagentic,
  title         = {Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps},
  author        = {Alankrit Chona and Igor Kozlov and Ambuj Kumar},
  year          = {2026},
  eprint        = {2604.19533},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR},
  url           = {https://arxiv.org/abs/2604.19533},
}

Tactic Drill-Down

Per-model coverage on each MITRE ATT&CK tactic. Ordered by kill-chain stage.

↓ Download CSV ↓ Technical Report (arXiv)
13 tactics · 105 procedures
#    Tactic                 Opus 4.6   Sonnet 4.6   Opus 4.7   GPT 5   Gemini 3.1 Pro   Qwen 3.6   Gemini 3 Flash   Kimi 2.6   Minimax 2.7   Kimi 2.5   DeepSeek 3.2
01   Initial Access         37%        24%          18%        2%      13%              11%        12%              8%         3%            5%         4%
02   Execution              56%        45%          37%        22%     23%              15%        19%              17%        15%           11%        10%
03   Persistence            56%        43%          35%        28%     25%              18%        20%              17%        19%           12%        14%
04   Privilege Escalation   53%        42%          32%        27%     23%              13%        18%              17%        17%           11%        12%
05   Defense Evasion        60%        49%          39%        25%     25%              16%        21%              19%        18%           13%        11%
06   Credential Access      27%        20%          14%        7%      9%               6%         6%               10%        4%            5%         4%
07   Discovery              54%        37%          36%        24%     20%              15%        16%              15%        14%           11%        10%
08   Lateral Movement       31%        22%          18%        5%      10%              7%         7%               9%         3%            5%         5%
09   Collection             25%        16%          4%         5%      9%               9%         6%               6%         4%            4%         3%
10   Command and Control    56%        46%          41%        23%     23%              16%        20%              18%        16%           12%        11%
11   Exfiltration           25%        18%          17%        2%      5%               3%         4%               2%         2%            2%         1%
12   Impact                 58%        50%          33%        38%     17%              19%        16%              18%        14%           9%         11%
13   Resource Development   63%        50%          46%        23%     29%              22%        24%              21%        18%           15%        16%