
SOC-Bench: Task GOAT

File-system forensics benchmark for evaluating autonomous SOC systems on Colonial Pipeline-style ransomware incidents

Python · Forensics · Windows · NTFS · EDR/XDR · Security Research

The Problem

Security Operations Centers lack standardized benchmarks to evaluate autonomous AI systems on real-world ransomware forensics. Existing evaluations focus on detection, not the full forensic workflow SOC analysts perform.

The Approach

Designed a comprehensive benchmark evaluating five key outcomes:

  • O1: Encryption-state labels at file and directory levels
  • O2: Host/share impact aggregations (encrypted bytes, fractions, first-seen timestamps)
  • O3: VSS tamper detection (snapshot delete/disable events with timing)
  • O4: Attribution of primary encryptor process trees from EDR telemetry
  • O5: One-page executive summary referencing O1-O4 claims
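To make the relationship between O1 and O2 concrete, here is a minimal Python sketch of rolling file-level labels up into per host/share impact. The `FileLabel` record, its field names, and `aggregate_impact` are hypothetical names chosen for illustration. They are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FileLabel:
    """Hypothetical O1 record: the encryption-state label for one file."""
    host: str
    share: str
    path: str
    size_bytes: int
    encrypted: bool
    first_seen: Optional[datetime] = None  # when encryption was first observed


def aggregate_impact(labels):
    """Roll file-level labels (O1) up into per host/share impact (O2)."""
    impact = {}
    for label in labels:
        entry = impact.setdefault(
            (label.host, label.share),
            {"total_bytes": 0, "encrypted_bytes": 0, "first_seen": None},
        )
        entry["total_bytes"] += label.size_bytes
        if label.encrypted:
            entry["encrypted_bytes"] += label.size_bytes
            # Keep the earliest timestamp among encrypted files.
            if label.first_seen is not None and (
                entry["first_seen"] is None
                or label.first_seen < entry["first_seen"]
            ):
                entry["first_seen"] = label.first_seen
    # Derive the encrypted fraction once all bytes are counted.
    for entry in impact.values():
        total = entry["total_bytes"]
        entry["encrypted_fraction"] = (
            entry["encrypted_bytes"] / total if total else 0.0
        )
    return impact
```

Calling `aggregate_impact` on a list of `FileLabel` records returns a dict keyed by `(host, share)`. Each value holds the encrypted bytes, the encrypted fraction, and the earliest first-seen timestamp for that share.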

Data sources include file-system metadata and change journals, EDR process trees, VSS logs, SIEM alerts, and help-desk reports. Scoring is ring-based (Exact/Directory/Host-Share/Miss), with penalties for wrong assertions, missing evidence, contradictions, and spam.
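One way to picture the ring idea is a function that compares a claimed path with the ground-truth path and returns the tightest ring that matches. The sketch below assumes UNC-style Windows paths and uses hypothetical credit values. The source names the rings but not their weights, and the penalty terms are left out.

```python
from pathlib import PureWindowsPath

# Hypothetical credit per ring: illustrative values, not the benchmark's.
RING_CREDIT = {"exact": 1.0, "directory": 0.5, "host_share": 0.25, "miss": 0.0}


def ring_for(claimed: str, truth: str) -> str:
    """Return the tightest ring in which a claimed path matches the truth."""
    c, t = PureWindowsPath(claimed), PureWindowsPath(truth)
    if c == t:  # PureWindowsPath comparison is case-insensitive
        return "exact"
    if c.parent == t.parent:  # same directory, different file
        return "directory"
    # For UNC paths, .drive is the \\host\share prefix.
    if c.drive and c.drive.lower() == t.drive.lower():
        return "host_share"
    return "miss"
```

For example, a claim naming a different file in the same directory as the ground-truth file lands in the Directory ring, and `RING_CREDIT[ring_for(claimed, truth)]` gives its partial credit under these assumed weights.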

The Impact

Targeting arXiv publication. The benchmark follows SOC-first, outcome-only, intentional-incompleteness, and durability principles, and relies on stable OS/forensic constructs so that it remains valid for years. Part of the SOC-Bench suite (GOAT, PANDA, FOX, TIGER, MOUSE) for comprehensive SOC evaluation.

Build Notes

Key design principles:

  1. SOC-first ordering: Reflects what the SOC observes, not the attacker's sequence
  2. Outcome-only: Judged by claims against ground truth, no methods mandated
  3. Intentional incompleteness: Some signals withheld to prevent shortcutting
  4. Durability: Relies on stable OS/forensic constructs

Scoring: 40 pts (O1) + 25 pts (O2) + 15 pts (O3) + 10 pts (O4) + 10 pts (O5) = 100 pts total
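As a rough illustration of how the weights combine, the sketch below multiplies each outcome's earned fraction by its point weight. The weights come from the line above. The `total_score` helper, its input format, and the clamping of each fraction to [0, 1] are assumptions.

```python
# Point weights per outcome, as listed in the scoring line above.
WEIGHTS = {"O1": 40, "O2": 25, "O3": 15, "O4": 10, "O5": 10}


def total_score(fractions):
    """Combine per-outcome fractions (0.0 to 1.0) into a 100-point total.

    Outcomes missing from `fractions` earn nothing. Clamping each
    fraction to [0, 1] is an assumption of this sketch.
    """
    total = 0.0
    for outcome, weight in WEIGHTS.items():
        fraction = min(max(fractions.get(outcome, 0.0), 0.0), 1.0)
        total += weight * fraction
    return total
```

A submission that earns half credit on O1 and full credit on O2 would score `total_score({"O1": 0.5, "O2": 1.0})`, or 20 + 25 points.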

Key Tradeoffs

  • ⚖️ Focus on the Colonial Pipeline incident (DarkSide ransomware) limits generalization to other ransomware families
  • ⚖️ Windows/NTFS only; no Linux or macOS coverage
  • ⚖️ Read-only analysis; no active response evaluation
  • ⚖️ Ground truth requires manual curation of reference file pairs

What I'd Improve Next

  • → Expand to other ransomware families beyond DarkSide
  • → Add cross-platform support (Linux, macOS)
  • → Include active response evaluation tasks
  • → Automate ground truth generation from malware samples