Where Most SOCs Fail

2 hours · Module 0 · Free

What you already know

Section 2 mapped the four security operations functions and showed how the cycle breaks. This section examines the three specific failure patterns that cause that breakage: the habit SOC, the speed SOC, and the stale SOC. You'll recognize your own environment in at least one of them. That recognition is the starting point for the operational foundation this course builds.

Three patterns, three missing disciplines

Scenario

An organization has three SOC analysts, a managed SOC partner for after-hours coverage, Microsoft Sentinel with 23 analytics rules, and Defender XDR. The CISO reports 24/7 monitoring to the board. The SOC processes 80 alerts per day with a 6-minute mean time to triage. Then an AiTM credential phishing campaign runs undetected for 21 days. The board asks: how did this happen if we have 24/7 monitoring? The answer is that the SOC had monitoring, not operations. Three failure patterns, all present simultaneously, explain why.

SOCs don't fail because of bad technology or incompetent analysts. They fail because of missing operational disciplines: the documented processes, measurement frameworks, and improvement cadences that turn a team watching alerts into a function that gets better. The failure patterns are predictable, recognizable, and fixable. They're also nearly universal. If you've worked in a SOC for more than six months, you'll recognize at least one of them in your own environment.

Estimated time: 35 minutes.

Figure 0.3. The three SOC failure patterns. Each is caused by a missing operational discipline. Most SOCs exhibit all three simultaneously. Each pattern maps to specific sections in Module 1 that build the missing discipline.

Pattern 1: The habit SOC

The habit SOC runs on inherited knowledge. The team triages alerts because the analysts know how. Investigation happens because the L2 analyst has the skill. Detection rules exist because someone enabled templates during the Sentinel deployment. None of these depends on documented procedures; they depend on specific individuals being available and willing to share what they know.

How habits form

The habit SOC doesn't start deliberately. It forms naturally when a small team is told to "watch the SIEM" without anyone defining what that means operationally. The first analyst develops a triage routine: check SigninLogs for the user, look up the IP on VirusTotal, check for related alerts in the last 24 hours. The second analyst watches the first and develops a similar but slightly different routine. Over six months, the team has triage processes, investigation approaches, and escalation instincts. None of them is documented, none of them is consistent, and all of them are functional enough that nobody notices the inconsistency.

The habit SOC works until any of three things happens. First, the experienced analyst leaves. With a median L1 tenure of 18-24 months and 64% of analysts considering leaving within a year, this is not a risk; it's a certainty on a predictable timeline. When that analyst departs, their triage heuristics, their environmental knowledge ("the payroll batch runs Tuesday at 2 AM, ignore those alerts"), and their investigation methodology leave with them. The replacement has technical skills but no environmental context, and the ramp time to rebuild that context is 3-6 months.

Second, the team grows past the point where informal coordination works. Three analysts can coordinate informally. Six cannot. The new analyst asks "what's the process for escalation?" and the answer is "ask Tom, he usually handles those." Tom is on leave. The new analyst guesses. The guess is wrong.

Third, a complex incident hits when the right person isn't available. This is what happened at Northgate Engineering (NE), the organization in the scenario above. The AiTM campaign executed during after-hours coverage, when BlueVoyant, the managed SOC partner, was triaging. The managed SOC followed its standard playbook: check for MFA, close if present. MFA was present because the attacker captured the token through a proxy. The playbook had no path for AiTM. The institutional knowledge about what AiTM looks like lived in one internal analyst's head. That analyst wasn't on shift.

The cost of habits

The habit SOC's cost is invisible until the incident reveals it. Every day the SOC operates on undocumented processes, it accumulates key-person risk. Every investigation that completes without a documented methodology is knowledge that exists once, in one analyst's memory, and will need to be rebuilt when that analyst leaves. Every escalation decision that depends on instinct rather than documented criteria is a gamble that the instinct is correct.

The fix is documentation: not documentation for compliance auditors, but documentation that an L1 analyst uses on every shift. The triage decision framework. The escalation triggers. The shift handover checklist. The investigation template. Module 1 builds all of them. The habit SOC becomes a documented SOC not by adding bureaucracy, but by writing down what the experienced analysts already know so the knowledge survives their departure.

Pattern 2: The speed SOC

The speed SOC optimizes for throughput metrics: mean time to triage (MTTT), SLA compliance, and alerts closed per shift. These are the numbers that SIEM platforms produce automatically and that MSSP contracts are built around.

Why speed becomes the default

Speed metrics are easy to measure. Every SIEM platform can produce MTTT and alert counts from incident data without any manual classification. The automation makes it easy. The ease makes it the default. The default becomes the optimization target. The SOC tracks MTTT, reduces it from 8 minutes to 4 minutes, and reports the improvement to leadership. Leadership concludes the SOC is twice as effective. The SOC is twice as fast, which is not the same thing.

The problem is that speed metrics measure how fast the SOC processes its workload, not whether the workload contains the right alerts or whether the processing produces correct results. An analyst who closes 20 alerts per hour with a 3-minute MTTT might be closing real attacks along with the noise. The speed metric doesn't know the difference. A SOC that reduces its triage time from 8 minutes to 4 minutes while maintaining a 47% false positive rate has gotten faster at dismissing noise, which is not an improvement in security posture.

The quality metrics that speed hides

The metrics that actually measure SOC effectiveness are harder to produce because they require classification data. Mean time to detect (MTTD) is how long attacks exist before the SOC detects them; it requires knowing when the attack started, which you only discover through investigation. False positive rate is the proportion of alerts that are noise; it requires systematic disposition classification on every closed incident. Classification accuracy is whether triage decisions are correct; it requires L2 review of L1 closures. External discovery rate is the proportion of compromises found by users rather than the SOC; it requires tracking who first reported each confirmed incident.

These quality metrics require deliberate investment. Someone has to classify every disposition. Someone has to review a sample of closures. Someone has to track external discoveries. The investment produces the data that makes SOC effectiveness measurable, and the data almost always reveals uncomfortable truths.
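As a rough illustration of what that classification data makes possible, the four quality metrics can be computed from a list of closed incidents. This is a minimal Python sketch, not the course's own tooling: the `ClosedIncident` record and all of its field names are hypothetical, and false positive rate is taken here as false positives divided by all closed incidents.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ClosedIncident:
    # All field names are hypothetical; map them to your own incident export.
    disposition: str                          # "TP", "FP", or "undetermined"
    attack_start: Optional[datetime] = None   # established during investigation
    detected_at: Optional[datetime] = None    # when the SOC first became aware
    first_reporter: str = "soc"               # "soc", "user", ...
    l2_agrees: Optional[bool] = None          # None means not sampled for L2 review


def ratio(part, whole):
    """Return part / whole, or None when there is nothing to divide by."""
    return part / whole if whole else None


def quality_metrics(incidents):
    incidents = list(incidents)
    tps = [i for i in incidents if i.disposition == "TP"]
    fps = [i for i in incidents if i.disposition == "FP"]
    # Hours between attack start and detection, for true positives with both timestamps.
    gaps = [(i.detected_at - i.attack_start).total_seconds() / 3600
            for i in tps if i.attack_start and i.detected_at]
    # Only closures that were sampled for L2 review count toward accuracy.
    reviewed = [i for i in incidents if i.l2_agrees is not None]
    return {
        "mttd_hours": ratio(sum(gaps), len(gaps)),
        "false_positive_rate": ratio(len(fps), len(incidents)),
        "external_discovery_rate": ratio(
            sum(i.first_reporter != "soc" for i in tps), len(tps)),
        "classification_accuracy": ratio(
            sum(i.l2_agrees for i in reviewed), len(reviewed)),
    }
```

Every input except the disposition is something a person has to record: the attack start time, who reported first, and whether L2 agreed with the closure. A metric whose inputs were never captured comes back as `None` rather than as a misleading zero.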

At NE before the AiTM incident, the dashboard showed 6-minute MTTT and 96% SLA compliance. Both healthy numbers. The quality metrics, measured for the first time after the incident, showed 14-day MTTD, 47% false positive rate, and 60% external discovery rate. The SOC was fast at processing alerts and slow at detecting attacks. Users were better at finding compromises than the SOC was. The dashboard looked green because it measured speed. The mission was failing because nobody measured quality.

The dashboard that measures activity instead of outcomes

MTTT, alerts closed, and SLA compliance. All green. Nobody asks: of the 2,400 alerts closed this month, how many were real attacks? Of the threats that existed in the environment, how many did the SOC detect? Of the incidents closed as 'benign,' how many were reclassified after a later investigation revealed they were the early stages of an attack? The dashboard measures the speed of processing a workload. It does not measure whether the SOC catches what it should catch.

Pattern 3: The stale SOC

The stale SOC operates against a static detection library. Rules deployed during the SIEM implementation sit unchanged while the threat landscape evolves. No rule is added after an investigation reveals a gap. No rule is tuned after a false positive pattern is identified. No rule is retired after the technique it detects becomes irrelevant.

How staleness accumulates

Detection staleness is invisible because it's measured in things that don't happen: alerts that don't fire, investigations that don't begin, compromises that aren't detected. The SOC processes the alerts it receives and has no visibility into the alerts that should exist but don't.

Consider the timeline. The organization deploys Sentinel in January with 23 template analytics rules. The rules cover brute force authentication, known malware hashes, impossible travel, and generic suspicious sign-in patterns. In March, a new AiTM phishing kit becomes prevalent. It captures MFA tokens through a reverse proxy, bypassing the traditional credential compromise detection that checks for failed MFA. In June, an attacker uses the kit against the organization. The authentication succeeds with a valid MFA claim. No rule queries for the specific indicators of AiTM: the token acquisition IP differing from the authentication IP, the MFA-by-claim assertion rather than interactive MFA, and the session cookie replay from a non-corporate IP within minutes of the initial authentication.

The telemetry exists in SigninLogs. The data is collected. The ingestion cost is paid. But no rule examines it. The attack produces zero alerts. The detection library was deployed in January for the January threat landscape. It is now June, and the threat landscape has moved.
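To make the missing logic concrete, here is a minimal Python sketch of the three indicators described above, applied to simplified sign-in records. It is an illustration under stated assumptions, not a production detection: the `SignIn` record, its field names, the 10-minute window, and the corporate network range are all hypothetical, and a real rule would run as a query against the actual sign-in schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network

# Hypothetical corporate egress range (a reserved documentation network).
CORPORATE_NETWORKS = [ip_network("203.0.113.0/24")]


@dataclass
class SignIn:
    # Hypothetical, simplified record; not the real SigninLogs schema.
    user: str
    time: datetime
    ip: str
    mfa_by_claim: bool  # True when MFA was satisfied by a token claim, not a prompt


def is_corporate(ip):
    addr = ip_address(ip)
    return any(addr in net for net in CORPORATE_NETWORKS)


def suspected_aitm(signins, window=timedelta(minutes=10)):
    """Flag an interactive sign-in followed, within the window, by an
    MFA-by-claim sign-in for the same user from a different, non-corporate IP."""
    by_user = {}
    for s in sorted(signins, key=lambda s: s.time):
        by_user.setdefault(s.user, []).append(s)

    hits = []
    for user, events in by_user.items():
        for i, first in enumerate(events):
            if first.mfa_by_claim:
                continue  # only start from an interactive authentication
            for later in events[i + 1:]:
                if later.time - first.time > window:
                    break  # events are time-ordered, so nothing later qualifies
                if (later.mfa_by_claim and later.ip != first.ip
                        and not is_corporate(later.ip)):
                    hits.append((user, first.ip, later.ip))
    return hits
```

The point of the sketch is that nothing in it needs data the organization wasn't already collecting. The gap is the absence of a rule, not the absence of telemetry.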

The compounding effect

Detection staleness compounds over time. Each quarter, new attack techniques enter the threat landscape. Each quarter, the organization's environment changes: new applications, new admin workflows, new service accounts. Each quarter, the gap between what the rules detect and what the threats actually do grows wider.

False positive rates compound alongside the gaps. The admin team starts using a new PowerShell automation tool in April. The "suspicious PowerShell execution" template rule fires on every invocation. By June, the rule produces 40 false positives per week. Nobody tunes it because there's no monthly tuning cadence. The analyst who triages the alerts learns to close them without investigation: "that's just the admin tool." Alert fatigue sets in. When a real attacker uses PowerShell for malicious purposes, the alert looks identical to the 40 weekly false positives. It gets the same 90-second closure.
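A tuning review starts by ranking rules by how much noise they produce. The following Python sketch shows one way to do that from closure records. It is a hedged illustration: the `(rule_name, disposition)` input shape and the `min_fp` threshold are assumptions, not an export format of any specific SIEM.

```python
from collections import Counter


def noisy_rules(closures, min_fp=10):
    """Rank rules by false positive count.

    closures: iterable of (rule_name, disposition) pairs, where disposition
    is "TP" or "FP". Returns (rule_name, fp_count, fp_rate) tuples for rules
    with at least min_fp false positives, noisiest first.
    """
    totals = Counter()
    fps = Counter()
    for rule, disposition in closures:
        totals[rule] += 1
        if disposition == "FP":
            fps[rule] += 1

    ranked = [(rule, fps[rule], fps[rule] / totals[rule])
              for rule in totals if fps[rule] >= min_fp]
    # Sort by false positive count descending, then by name for stable output.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked
```

Reporting both the count and the rate matters. The count shows where analyst time goes, and a rate close to 1.0 marks a rule whose alerts analysts have learned to close on sight.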

The stale SOC's fix is the feedback loop: the operational mechanism that converts investigation findings into detection improvements, schedules monthly tuning reviews, and runs quarterly coverage assessments. Without the cadence, improvement happens only when an incident is bad enough to force a review, and by then the cost of the gap has already been paid in dwell time, data exfiltration, and incident response spend. The Vectra AI 2026 State of Threat Detection report found that 69% of organizations use more than 10 detection and response tools, and 39% use more than 20. The tools exist. The telemetry exists. What's missing is the operational discipline that evaluates whether the tools are detecting what matters. Module 1 builds the maturity assessment that identifies staleness as a gap. Modules 2-6 build the detection engineering capability that closes it. Modules 10-12 build the operational cadences that sustain it.

Your turn: find what's noisy

Before you can tune a noisy detection library, you need to measure which rules are noisy. The same principle applies one level up: which categories produce the most events at all? Write a KQL query that ranks signal volume by Country across the NE workspace.

NE before the incident: all three patterns

Northgate Engineering's SOC exhibited all three patterns simultaneously before INC-NE-2026-0227-001. Understanding how the patterns interacted explains why the AiTM campaign succeeded despite the SOC's apparent functionality.

The habits

Tom Ashworth and Priya Sharma had developed effective triage routines over 18 months, but neither routine was documented. Tom investigated credential compromise alerts by checking the sign-in IP against the corporate VPN range and looking for inbox rule creation within an hour. Priya checked the user's sign-in history for geographic anomalies and looked at the MFA method. Both approaches were reasonable. Neither covered AiTM, where MFA succeeds because the token is captured through a proxy. The institutional knowledge about AiTM detection lived in a blog post the SOC lead had read six months earlier, not in a documented playbook that the team or the managed SOC partner could reference.

The speed

The SOC reported 6-minute MTTT, 96% SLA compliance, and 2,400 alerts closed per month. BlueVoyant's after-hours coverage met every contractual SLA. The metrics went to Rachel Okafor monthly. Rachel reported them to the board. The board was satisfied.

The quality metrics, calculated for the first time after the incident, told a different story. MTTD was 14 days. False positive rate was 47%. External discovery rate was 60%. Classification accuracy could not be calculated because dispositions were binary (TP or FP) with no undetermined category and no L2 review of L1 closures. The SOC had been fast at processing alerts and slow at detecting threats for the entire 18 months it had operated. Nobody knew because nobody measured quality.

The staleness

Twenty-three analytics rules. Twelve templates from the Sentinel content hub, enabled during deployment and never modified. Eleven custom rules written by the SOC lead during the first three months, tested once, tuned never. No rule had been added, modified, or retired in the 12 months prior to the incident.

The ATT&CK coverage against the techniques relevant to a manufacturing company on the Microsoft stack was 10.3%. The AiTM technique (T1557 combined with T1539) was not covered by any rule. The OAuth consent grant persistence technique (T1098.003) was not covered. The inbox rule creation for evidence hiding (T1564.008) was covered by one rule that checked for external forwarding, but the attacker's rule moved emails to RSS Subscriptions rather than forwarding them externally, so the rule didn't fire.
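A coverage figure like this is a simple ratio: techniques with at least one rule mapped to them, divided by the techniques judged relevant to the environment. The Python sketch below shows the arithmetic under stated assumptions. The rule names and the rule-to-technique mapping are hypothetical examples, not NE's actual rule set.

```python
def attack_coverage(relevant_techniques, rule_techniques):
    """Return (coverage_percent, uncovered_techniques).

    relevant_techniques: set of ATT&CK technique IDs that matter for the environment.
    rule_techniques: dict mapping rule name -> set of technique IDs the rule targets.
    """
    relevant = set(relevant_techniques)
    if not relevant:
        return None, []
    # A technique counts as covered if any rule is mapped to it.
    mapped = set().union(*rule_techniques.values())
    covered = relevant & mapped
    gaps = sorted(relevant - covered)
    return round(100 * len(covered) / len(relevant), 1), gaps


# Hypothetical mapping for illustration only.
example_rules = {
    "Brute force sign-in": {"T1110"},
    "Inbox rule with external forwarding": {"T1564.008"},
}
example_relevant = {"T1110", "T1557", "T1539", "T1098.003", "T1564.008"}
coverage, gaps = attack_coverage(example_relevant, example_rules)
```

A mapped technique is not the same as a detected technique. In the NE case the inbox rule counted toward T1564.008 on paper but only checked for external forwarding, so a percentage computed this way is an upper bound on real coverage.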

The telemetry existed at every phase of the attack chain. Five tables (SigninLogs, OfficeActivity, CloudAppEvents, EmailEvents, MailItemsAccessed) recorded the attacker's actions. Twenty-three rules. Zero fires. Twenty-one days of undetected access.

What changed after

Rachel's response addressed all three patterns. The habit SOC became a documented SOC: operating model ADR, tier definitions with scope boundaries, escalation framework with three trigger types including the instinct trigger, triage decision framework with enrichment steps, and shift handover checklist. The speed SOC became a measured SOC: MTTD, false positive rate, classification accuracy, and external discovery rate tracked alongside MTTT and SLA. The stale SOC became a tuned SOC: monthly false positive review, quarterly ATT&CK coverage assessment, and a detection backlog fed by investigation findings and hunt reports.

The result: MTTD dropped from 14 days to 4.2 hours over six months. False positive rate dropped from 47% to 18%. External discovery rate dropped from 60% to 15%. The same three analysts. The same tools. The same budget. Different operational discipline.

This course builds every operational document and measurement framework that NE built after the incident. The difference is that you build it before the incident, so the 21-day gap never happens.

SOC Operations Principle

SOCs fail in three predictable patterns: the habit SOC (undocumented processes, key-person dependency), the speed SOC (optimized for throughput, blind to quality), and the stale SOC (no feedback loop, detection degrades). Each pattern is caused by a missing operational discipline, not missing technology. The disciplines are documentation, quality measurement, and scheduled improvement. All three are buildable with existing tools and headcount; the investment is process change, not budget.

Section 0.4. The SOC Maturity Spectrum. The three failure patterns describe what goes wrong. The maturity spectrum describes the path forward: five levels from ad-hoc habits to measured, improving operations. The next section defines each level, shows you how to assess where your SOC sits, and maps the improvement path this course builds.
