SOC-Bench: Task GOAT

The Problem

Security Operations Centers lack standardized benchmarks to evaluate autonomous AI systems on real-world ransomware forensics. Existing evaluations focus on detection, not the full forensic workflow SOC analysts perform.

The Approach

Designed a comprehensive benchmark evaluating five key outcomes:

• O1: Encryption-state labels at file and directory levels
• O2: Host/share impact aggregations (encrypted bytes, fractions, first-seen timestamps)
• O3: VSS tamper detection (snapshot delete/disable events with timing)
• O4: Attribution of primary encryptor process trees from EDR telemetry
• O5: One-page executive summary referencing O1-O4 claims
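To make the relationship between O1 and O2 concrete, here is a minimal Python sketch of rolling file-level labels up into per-host/share impact figures. The `FileLabel` record, its field names, and `aggregate_impact` are hypothetical and chosen for illustration only; the benchmark's actual submission schema is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FileLabel:
    """Hypothetical O1 record: one file-level encryption-state label."""
    host: str
    share: str
    path: str
    size_bytes: int
    encrypted: bool
    first_seen: Optional[datetime]  # when encryption was first observed, if known


def aggregate_impact(labels):
    """Roll file-level labels up into per-(host, share) impact figures (O2-style)."""
    totals = defaultdict(
        lambda: {"total_bytes": 0, "encrypted_bytes": 0, "first_seen": None}
    )
    for label in labels:
        entry = totals[(label.host, label.share)]
        entry["total_bytes"] += label.size_bytes
        if not label.encrypted:
            continue
        entry["encrypted_bytes"] += label.size_bytes
        # Keep the earliest known timestamp among encrypted files.
        if label.first_seen is not None and (
            entry["first_seen"] is None or label.first_seen < entry["first_seen"]
        ):
            entry["first_seen"] = label.first_seen

    result = {}
    for key, entry in totals.items():
        total = entry["total_bytes"]
        fraction = entry["encrypted_bytes"] / total if total else 0.0
        result[key] = {**entry, "encrypted_fraction": fraction}
    return result
```

The sketch measures the fraction by bytes; a count-based fraction would be an equally plausible reading of "fractions" and would need only a small change.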

Data sources include file-system metadata/change journals, EDR process trees, VSS logs, SIEM alerts, and help-desk reports. Scoring is ring-based (Exact/Directory/Host-Share/Miss), with penalties for wrong assertions, missing evidence, contradictions, and spam.
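One way to read the ring names is as a measure of how close a claimed location lands to the ground-truth location. The sketch below classifies a claim into a ring under that assumption: "Directory" is taken to mean the same parent directory on the same host/share, and "Host-Share" the right host and share but a different directory. These definitions, the `(host, share, path)` tuple layout, and the `classify_ring` name are illustrative guesses, not the benchmark's specification, and no point values per ring are implied.

```python
from pathlib import PureWindowsPath


def classify_ring(claim, truth):
    """Return the ring name for a claimed location versus the ground truth.

    Both arguments are hypothetical (host, share, path) tuples, with the
    path given relative to the share.
    """
    claim_host, claim_share, claim_path = claim
    truth_host, truth_share, truth_path = truth

    # Wrong host or share: nothing about the location matches.
    if (claim_host.lower(), claim_share.lower()) != (
        truth_host.lower(),
        truth_share.lower(),
    ):
        return "Miss"

    # PureWindowsPath compares case-insensitively, matching Windows semantics.
    claimed = PureWindowsPath(claim_path)
    actual = PureWindowsPath(truth_path)
    if claimed == actual:
        return "Exact"
    if claimed.parent == actual.parent:
        return "Directory"
    return "Host-Share"
```

For example, a claim that names a different file in the same folder as the ground-truth file would fall in the "Directory" ring under these assumptions.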

The Impact

Targeting arXiv publication. The benchmark follows SOC-first, outcome-only, and durability principles, and is designed to remain valid for years by relying on stable OS/forensic constructs. It is part of the SOC-Bench suite (GOAT, PANDA, FOX, TIGER, MOUSE) for comprehensive SOC evaluation.

Build Notes

Key design principles:

SOC-first ordering: Reflects what SOC observes, not attacker sequence
Outcome-only: Judged by claims against ground truth, no methods mandated
Intentional incompleteness: Some signals withheld to prevent shortcutting
Durability: Relies on stable OS/forensic constructs

Scoring: 40 pts (O1) + 25 pts (O2) + 15 pts (O3) + 10 pts (O4) + 10 pts (O5) = 100 pts total
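As a rough illustration of how the weights combine, the sketch below turns per-outcome fractions into a 100-point total. The weights come from the scoring line above; the idea that each outcome first yields a fraction between 0 and 1, the clamping of out-of-range values, and the `total_score` name are assumptions made for the example.

```python
# Point weights per outcome, taken from the scoring line above.
WEIGHTS = {"O1": 40, "O2": 25, "O3": 15, "O4": 10, "O5": 10}


def total_score(outcome_fractions):
    """Combine per-outcome fractions (0.0 to 1.0) into a 100-point total.

    Missing outcomes count as 0. Clamping to [0, 1] is an assumption of this
    sketch; the benchmark may treat penalties differently.
    """
    total = 0.0
    for outcome, weight in WEIGHTS.items():
        fraction = outcome_fractions.get(outcome, 0.0)
        fraction = min(1.0, max(0.0, fraction))
        total += weight * fraction
    return total
```

Under this sketch, a system that earns half of the O1 credit and nothing else would score 20 of 100 points, which shows how heavily file- and directory-level labeling is weighted.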

Key Tradeoffs

⚖️ Colonial Pipeline focus limits generalization to other ransomware families
⚖️ Windows/NTFS only - no Linux or macOS coverage
⚖️ Read-only analysis - no active response evaluation
⚖️ Ground truth requires manual curation of reference file pairs

What I'd Improve Next

→ Expand to other ransomware families beyond DarkSide
→ Add cross-platform support (Linux, macOS)
→ Include active response evaluation tasks
→ Automate ground truth generation from malware samples