7.1 Playbook Architecture. Structure, Standards, and Design Principles

10-14 hours · Module 7

What you already know

You've built 28 detection rules across Modules 3-6 that generate alerts in the Sentinel incident queue. You've configured entity mapping so every alert carries the user, IP, and device context an analyst needs. What you haven't built yet is the structured workflow that tells an analyst exactly what to do when one of those alerts fires. This section defines the architecture that every playbook in this module follows.

Scenario

DET-SOC-008 fires at 02:47 AM: an inbox forwarding rule to an external domain, created six minutes after an anomalous sign-in. The managed SOC partner opens the alert, checks whether the forwarding address matches a known vendor, sees that it doesn't, and escalates to the internal team. The internal analyst arrives at 08:15 and spends 40 minutes re-investigating from scratch because the managed SOC's escalation note says "suspicious inbox rule, please review." No triage checklist was followed. No containment action was taken during the five-hour gap. The attacker used that gap to extend delegation to the CISO's mailbox.

What a playbook actually is

A playbook is not a document that describes how an investigation should go. It is a set of instructions that tells an analyst exactly what to do at each step, what decision to make based on what they find, and what to do next based on that decision. The difference is operational. A description is read once and interpreted differently by every analyst who reads it. Instructions are followed step by step during a live incident, producing the same investigation quality regardless of who follows them.

The scenario above is what happens without a playbook. The managed SOC partner had no structured workflow for the alert type. The escalation carried no triage findings because there was no triage checklist. The internal analyst started from zero because there was no investigation handoff format. Five hours of attacker dwell time resulted from a missing document.

Every production playbook in this module follows the same seven-section architecture. This consistency means an analyst who has used one playbook can pick up any other and immediately understand the structure, the decision points, and the escalation paths. During a live incident, when adrenaline is high and time is short, a familiar structure is an operational advantage. You don't spend cognitive effort figuring out where to look. You spend it on the investigation itself.

Figure 7.1. The seven-section playbook architecture used across all investigation playbooks in this module.

The seven sections of a production playbook

Section 1. Trigger conditions

What activates this playbook. Not a vague description ("when a phishing incident occurs") but a specific list of detection rule IDs, alert names, or operational conditions that initiate the workflow.

A well-written trigger for the AiTM playbook reads: "This playbook is activated when any of the following detection rules fire: DET-SOC-001 (anomalous sign-in), DET-SOC-002 (MFA fatigue), DET-SOC-008 (inbox forwarding to external domain), or DET-SOC-012 (AiTM phishing indicators). It is also activated when a user reports receiving unexpected MFA prompts or discovering unfamiliar inbox rules."

Trigger conditions create an unambiguous link between detection and response. When DET-SOC-008 fires, the analyst does not decide which playbook applies: the trigger conditions in the AiTM playbook explicitly list DET-SOC-008. When multiple playbooks share a trigger (DET-SOC-001 appears in both the AiTM and the BEC playbook triggers), the analyst uses the correlated alerts to select the right one. A DET-SOC-001 alert accompanied by DET-SOC-008 (inbox forwarding) activates the AiTM playbook. A DET-SOC-001 alert accompanied by DET-SOC-014 (suspicious outbound email) activates the BEC playbook.
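The selection logic described above can be sketched as a lookup plus a tie-breaker. This is a minimal illustration, not the module's tooling: the table names and the `select_playbook` function are hypothetical, and the BEC entry lists only the rule IDs mentioned in this section (treating DET-SOC-014 as a BEC trigger is an assumption).

```python
# Hypothetical trigger tables. The AiTM rule IDs come from the text above;
# the BEC entry is an assumption built from the two rules the text mentions.
PLAYBOOK_TRIGGERS = {
    "AiTM": {"DET-SOC-001", "DET-SOC-002", "DET-SOC-008", "DET-SOC-012"},
    "BEC": {"DET-SOC-001", "DET-SOC-014"},
}

# Correlated rules that decide between playbooks when a trigger is shared.
TIE_BREAKERS = {
    "DET-SOC-008": "AiTM",
    "DET-SOC-014": "BEC",
}


def select_playbook(fired_rule, correlated_rules=()):
    """Return the playbook name for an alert, or None if it cannot be decided."""
    candidates = sorted(
        name for name, rules in PLAYBOOK_TRIGGERS.items() if fired_rule in rules
    )
    if len(candidates) == 1:
        return candidates[0]
    # Shared (or unknown) trigger: let the correlated alerts choose.
    for rule in correlated_rules:
        choice = TIE_BREAKERS.get(rule)
        if choice in candidates:
            return choice
    return None
```

Returning `None` for an undecided alert is a deliberate choice in this sketch: it surfaces a gap in the trigger conditions instead of silently picking a playbook.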

Section 2. Severity classification and SLA

The initial severity determines the response timeline. Playbooks define severity based on observable conditions at the time of activation, not on the eventual impact, which is unknown at activation time.

Critical severity applies when multiple correlated detections fire for the same user within 30 minutes, when a detection involves an executive or privileged account, or when active data exfiltration is confirmed. The SLA is triage within 5 minutes and containment within 15 minutes. High severity applies when a single high-confidence detection fires, such as MFA fatigue with IP mismatch, forwarding to a newly registered domain, or transport rule manipulation. The SLA is triage within 15 minutes and containment within 1 hour. Medium severity applies when a single detection has ambiguous indicators, such as an anomalous sign-in from a residential IP or forwarding to a known email provider. The SLA is triage within 30 minutes and investigation within 4 hours.
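The three severity rules can be expressed as an ordered series of tests on what is observable at activation time. The sketch below is illustrative only: `AlertContext` and its field names are hypothetical, and reading "multiple correlated detections" as two or more is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    # All fields are hypothetical names for conditions described in the text.
    correlated_detections_30m: int  # detections for the same user within 30 minutes
    privileged_or_executive: bool   # executive or privileged account involved
    exfiltration_confirmed: bool    # active data exfiltration confirmed
    high_confidence: bool           # e.g. MFA fatigue with IP mismatch


def classify_severity(ctx):
    """Map observable conditions at activation time to an initial severity."""
    if (
        ctx.correlated_detections_30m >= 2  # assumption: "multiple" means 2 or more
        or ctx.privileged_or_executive
        or ctx.exfiltration_confirmed
    ):
        return "Critical"
    if ctx.high_confidence:
        return "High"
    return "Medium"
```

The Critical conditions are tested first so that a high-confidence detection on an executive account is never downgraded to High.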

SLAs measure from alert generation to analyst action, not from alert generation to incident closure. An analyst who triages within 5 minutes and determines that investigation will take 2 hours has met the Critical triage SLA. The 2-hour investigation is expected: the SLA governs how fast you start, not how fast you finish.
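To make the measurement point concrete, here is a small sketch of a triage SLA check. The function and table names are hypothetical; the time limits are the triage values given in this section. It measures only the gap between alert generation and the analyst's first action, and ignores how long the investigation takes afterward.

```python
from datetime import datetime, timedelta

# Triage SLAs from this section, measured from alert generation.
TRIAGE_SLA = {
    "Critical": timedelta(minutes=5),
    "High": timedelta(minutes=15),
    "Medium": timedelta(minutes=30),
}


def triage_sla_met(severity, alert_generated, analyst_action):
    """True when the analyst's first action falls inside the triage window."""
    return analyst_action - alert_generated <= TRIAGE_SLA[severity]
```

For the scenario at the top of this section, an alert generated at 02:47 with a first internal action at 08:15 fails the check at every severity level.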

Anti-Pattern

SLAs measured against incident closure time instead of triage and containment time. The result: analysts rush to close incidents quickly rather than investigate thoroughly. A 15-minute closure SLA incentivizes closing alerts as benign without checking. A 15-minute containment SLA incentivizes containing the threat immediately and then investigating thoroughly, which is the correct operational behavior.

Section 3. Triage phase

The first five minutes. Triage answers one question: is this a real incident that requires investigation, or can it be classified and closed?

Triage consists of 3-5 checks that can be performed in under 5 minutes using the alert data and quick queries. Each check produces a binary outcome (yes/no) that feeds a decision: investigate further, contain immediately, or close with classification. The checks are ordered by discriminating power: the check most likely to distinguish real incidents from false positives comes first.

Here's what a triage check looks like in practice. When DET-SOC-008 fires (inbox forwarding to external domain), the first triage question is: was the forwarding destination domain registered in the past 30 days? This single check is highly discriminating: a forwarding rule pointing to a domain registered yesterday is almost certainly malicious. You check domain registration age using WHOIS data. If the answer is yes (newly registered), you escalate to Critical immediately and skip the remaining triage checks. If the answer is no, you move to the second check: did the same user have a risky sign-in in the hour before the forwarding rule was created? You run a quick correlation:

// Triage check: correlate inbox rule creation with anomalous sign-in
let alertUser = "harrison@northgateeng.com";
let ruleCreationTime = datetime(2026-03-15T02:47:00Z);
SigninLogs
| where TimeGenerated between ((ruleCreationTime - 1h) .. ruleCreationTime)
| where UserPrincipalName == alertUser
| where RiskLevelDuringSignIn in ("medium", "high")
| project TimeGenerated, IPAddress, Location, RiskLevelDuringSignIn, AppDisplayName

If this query returns results (a medium- or high-risk sign-in within the hour before the forwarding rule was created), you have correlated evidence. Escalate to High and proceed to the investigation phase. If it returns no results, no sign-in in that window was flagged as risky, and the forwarding rule may be legitimate. Move to the third check: contact the user to confirm whether they created the rule.

The critical insight about triage is ordering. The most discriminating check comes first. If domain registration age resolves the alert in 30 seconds, you've skipped the remaining checks. If you start by contacting the user (who may be asleep at 02:47 AM), you've wasted 30 minutes waiting for a response before running a 30-second WHOIS query that would have answered the question immediately.
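The ordering principle amounts to short-circuit evaluation over a list of checks. The following is a minimal sketch under stated assumptions: the function names, the alert dictionary keys, and the verdict strings are all hypothetical, and the real checks would call WHOIS and Sentinel instead of reading precomputed fields.

```python
def run_triage(checks, alert):
    """Run checks in order and stop at the first one that reaches a verdict.

    Each entry in checks is (name, function). A function returns a verdict
    string, or None when it cannot decide and the next check should run.
    """
    performed = []
    for name, check in checks:
        performed.append(name)
        verdict = check(alert)
        if verdict is not None:
            return verdict, performed
    return "investigate", performed


# Hypothetical checks for DET-SOC-008, most discriminating first.
def domain_age_check(alert):
    return "escalate" if alert["domain_age_days"] <= 30 else None


def risky_signin_check(alert):
    return "escalate" if alert["risky_signin_in_prior_hour"] else None


def user_confirmation_check(alert):
    answer = alert.get("user_confirmed_rule")  # True, False, or None (no reply yet)
    if answer is None:
        return None
    return "close" if answer else "escalate"


DET_SOC_008_CHECKS = [
    ("domain_age", domain_age_check),
    ("risky_signin", risky_signin_check),
    ("user_confirmation", user_confirmation_check),
]
```

Because the list is ordered, a newly registered domain ends triage after one check, and the slow user-contact step runs only when the faster checks are inconclusive.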

Section 4. Investigation phase

The detailed investigation performed when triage indicates a potential real incident. The investigation phase is structured as a decision tree: each step produces findings that determine the next step.

Investigation steps follow a consistent format. Each step specifies an action (a KQL query, a PowerShell command, a portal check), the expected output if the hypothesis is true, and a binary decision that routes to the next step. There is no "investigate further" as a next step: every branch leads to a specific action.

Investigation Step Format

STEP 3: Check for post-authentication persistence

QUERY:
OfficeActivity
| where TimeGenerated > ago(24h)
| where UserId == "harrison@northgateeng.com"
| where Operation in ("New-InboxRule", "Set-InboxRule",
    "Add-MailboxPermission", "Set-Mailbox")
| project TimeGenerated, Operation, Parameters, ClientIP

EXPECTED OUTPUT (if compromise confirmed):
- New-InboxRule with ForwardTo or RedirectTo to external domain
- Set-Mailbox with ForwardingSmtpAddress set
- Add-MailboxPermission granting FullAccess to another user

DECISION:
→ Persistence found: proceed to Step 4 (scope assessment)
→ No persistence: proceed to Step 5 (session analysis)

This format eliminates ambiguity. The analyst knows what to run, what to look for, and what to do based on what they find. Judgment is applied within steps: the analyst interprets whether a specific inbox rule parameter is malicious or benign. Judgment is not applied between steps: the decision tree defines the routing.
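One way to see why this format leaves no room for "investigate further" is to model a step as data in which both branches must name a concrete next step. The `Step` class and `next_step` function below are hypothetical; the Step 3 routing is taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    title: str
    if_found: int      # step to run when the expected output is present
    if_not_found: int  # step to run when it is absent


# Step 3 as written in the example: persistence found -> 4, otherwise -> 5.
STEP_3 = Step(3, "Check for post-authentication persistence", if_found=4, if_not_found=5)


def next_step(step, finding_present):
    """Route to the next step number from a yes/no finding."""
    return step.if_found if finding_present else step.if_not_found
```

The analyst's judgment produces the `finding_present` value; the routing itself is fixed by the playbook.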

Section 5. Containment phase

Actions taken to stop the adversary's access and prevent further damage. Every containment action in the playbook is documented with four attributes: the action itself, its blast radius, the rollback procedure, and the approval level required.

Token revocation is the first containment action in most identity compromise playbooks. It invalidates all refresh tokens for the compromised user, forcing re-authentication on every device and application. The blast radius is low: the user is inconvenienced for the time it takes to sign in again. No rollback is needed because the user re-authenticates normally. A T2 analyst or SOC manager can authorize it.

# Containment Step 1: Revoke all refresh tokens for compromised user
# Blast radius: Low — user must re-authenticate on all devices
# Rollback: Not required — user signs in normally after re-authentication
# Approval: T2 analyst or SOC manager
Connect-MgGraph -Scopes "User.RevokeSessions.All"
Revoke-MgUserSignInSession -UserId "harrison@northgateeng.com"
# Verify revocation succeeded (cmdlet returns True on success)
# Note: Existing access tokens remain valid until expiry (up to 1 hour)
# Continuous Access Evaluation (CAE) reduces this to near-real-time
# for CAE-aware applications (Outlook, Teams, SharePoint Online)

Account disablement is the escalation from token revocation. It blocks all access to M365 services immediately: the user cannot work at all until the account is re-enabled. The blast radius is high. Rollback requires re-enabling the account through the Entra admin center or PowerShell (Update-MgUser -UserId $userId -AccountEnabled:$true). SOC manager approval is required because a mistake here (disabling the wrong account, or disabling an executive's account without justification) creates a business disruption that may escalate to the CISO faster than the security incident itself.

The blast radius documentation is what separates a production playbook from a wiki page. An analyst who knows that token revocation is low-blast and account disablement is high-blast makes better containment decisions under pressure. They revoke tokens first (fast, low-risk, reversible by re-authentication) and escalate to account disablement only when the evidence justifies the disruption.
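The four attributes can be captured in a small record so that actions sort naturally from least to most disruptive. This is a sketch only: the class, the two-level blast radius scale, and the function name are hypothetical, while the attribute values restate what this section says about the two actions.

```python
from dataclasses import dataclass

# Assumed two-level scale; a real playbook may use more levels.
BLAST_ORDER = {"low": 0, "high": 1}


@dataclass(frozen=True)
class ContainmentAction:
    name: str
    blast_radius: str
    rollback: str
    approval: str


ACTIONS = [
    ContainmentAction(
        "Disable account", "high",
        "Re-enable the account in the Entra admin center or PowerShell",
        "SOC manager",
    ),
    ContainmentAction(
        "Revoke refresh tokens", "low",
        "Not required; the user re-authenticates",
        "T2 analyst or SOC manager",
    ),
]


def escalation_order(actions):
    """Return actions sorted from lowest to highest blast radius."""
    return sorted(actions, key=lambda action: BLAST_ORDER[action.blast_radius])
```

Sorting by blast radius reproduces the order described above: revoke tokens first, then disable the account only if the evidence justifies it.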

Section 6. Communication phase

Who needs to be notified, when, and with what information. Communication requirements vary by severity and by the identities involved. The SOC manager is notified for all High and Critical incidents at triage completion with the initial assessment, containment actions taken, and estimated investigation timeline. The CISO is notified for all Critical incidents and any incident involving executives or a confirmed data breach. Legal and compliance are notified when the incident may involve regulatory notification obligations. Affected users are notified after containment when they need to take action, such as a password reset or MFA re-registration.

The communication matrix is not optional. An analyst who contains a credential compromise at 03:00 AM but doesn't notify the SOC manager until 08:00 AM has created a five-hour reporting gap. If the compromise expanded during those five hours, the SOC manager had no opportunity to authorize additional containment. The playbook specifies the notification timing so the analyst doesn't need to decide when to escalate: the playbook tells them.

Communication format matters as much as timing. A notification that says "suspicious activity on Harrison's account" is useless to a SOC manager at 03:00 AM. A notification that says "DET-SOC-008 fired at 02:47, inbox forwarding to newly registered proton.me address, created 6 minutes after anomalous sign-in from hosting provider IP 185.220.101.42. Tokens revoked at 02:52. Investigation in progress, preliminary scope: 1 account, no lateral movement confirmed yet" gives the SOC manager enough context to decide whether to escalate further or let the analyst continue. The playbook templates in Sections 7.4-7.6 include notification message formats for each severity level so the analyst doesn't compose these under pressure.
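A notification template can be enforced with a formatter that refuses to produce a message when a field is missing. The sketch below is hypothetical (the function name, parameters, and sentence layout are assumptions modeled on the example message above) and is not the template from Sections 7.4-7.6.

```python
def format_notification(rule_id, fired_at, finding, containment, scope):
    """Build a notification that always carries the five fields a reader needs."""
    fields = {
        "rule_id": rule_id,
        "fired_at": fired_at,
        "finding": finding,
        "containment": containment,
        "scope": scope,
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError("notification is missing: " + ", ".join(missing))
    return (
        f"{rule_id} fired at {fired_at}, {finding}. "
        f"{containment}. "
        f"Investigation in progress, preliminary scope: {scope}."
    )
```

Raising an error on an empty field is the point of the design: a vague message such as "suspicious activity" cannot be sent because it has no rule ID, time, or containment status to fill in.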

Section 7. Post-incident phase

Actions after the immediate incident is resolved. This phase is where most playbooks are weakest: the adrenaline is gone, the threat is contained, and the team moves on to the next alert. But the post-incident phase is what converts a one-time investigation into organizational improvement.

Evidence preservation means ensuring that log data, screenshots, query results, and containment action records are stored in a location that survives log retention policies. Sentinel's default retention is 90 days for the analytics tier. If legal proceedings begin six months after the incident, the evidence must exist somewhere. The playbook specifies what to export, in what format, and where to store it, before the analyst moves on.

Post-incident review scheduling means booking the PIR within five business days while the investigation is fresh. Every PIR answers four questions: what happened (the technical timeline), why it happened (the root cause, not "the user clicked a phishing link" but "the user clicked a phishing link because our email filtering didn't catch an azurestaticapps.net URL, and our security awareness training hasn't covered AiTM phishing"), what worked in the response (the playbook steps that saved time and reduced dwell), and what needs to improve (specific, measurable changes with owners and deadlines).

Detection improvement documents whether existing rules caught the attack at each phase, missed phases entirely, or fired too late to prevent damage. If DET-SOC-001 fired 45 minutes after the initial sign-in because the scheduled rule runs on a 1-hour frequency, the PIR recommends converting it to an NRT rule for this trigger condition. That recommendation goes directly into the detection backlog from Module 2.

Every post-incident review should include: did the playbook work? Which steps were unclear? Which steps were missing? What did the analyst do that was not in the playbook but should have been? The answers become playbook updates. A playbook that has been used in an incident and not updated afterward is a playbook that stopped improving.

Decision tree notation

The playbooks in Sections 7.4-7.6 use a consistent decision tree notation. Each decision point is a question with two or three possible answers, each leading to a specific next action:

Decision Tree Notation

[DECISION]: Is the sign-in IP from a known hosting provider ASN?
  → YES: Escalate to High severity.
         Proceed to Investigation Step 3 (post-auth activity check)
  → NO: Check if IP is from a residential ISP in the user's country.
      → YES (residential, home country): Medium severity.
             Proceed to Investigation Step 2 (user contact)
      → NO (residential, foreign country): High severity.
             Proceed to Investigation Step 3 (post-auth activity check)

Each path is explicit. There is no "use your judgment" at decision points. The playbook defines what constitutes a hosting provider ASN (the list is maintained as a reference in the playbook appendix), what constitutes a residential ISP, and what to do for each outcome. As in the investigation step format, judgment applies within a step, when interpreting whether a specific query result matches the expected pattern, and not between steps, where the decision tree defines the routing.
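The notation maps directly onto a nested data structure that a short loop can walk. The tree below encodes the example decision from this section; the dictionary keys and the `walk` function are hypothetical names chosen for this sketch.

```python
# The example decision tree above, encoded as nested dictionaries.
# A node with a "question" key is a decision; any other node is an outcome.
SIGN_IN_IP_TREE = {
    "question": "hosting_asn",
    "yes": {"severity": "High", "next_step": 3},
    "no": {
        "question": "residential_home_country",
        "yes": {"severity": "Medium", "next_step": 2},
        "no": {"severity": "High", "next_step": 3},
    },
}


def walk(node, answers):
    """Follow yes/no answers through the tree until an outcome is reached.

    A missing answer raises KeyError, so no decision point can be skipped.
    """
    while "question" in node:
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node
```

Every leaf carries both a severity and a next step, which is the code equivalent of "every branch leads to a specific action".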

Integration with detection rules

Each playbook specifies which detection rules feed into it and which alert fields to examine first. The integration model works in five steps. First, the detection rule fires and an alert appears in the Sentinel incident queue. Second, the analyst opens the incident and identifies the detection rule ID from the alert metadata. Third, the trigger conditions in the playbook match the rule ID, so the analyst activates the corresponding playbook. Fourth, the triage phase uses the alert's entity fields (UserPrincipalName from the Account entity, IPAddress from the IP entity, and TimeGenerated from the alert) as input to the first triage query. Fifth, the investigation phase builds on the triage findings with additional KQL queries against SigninLogs, OfficeActivity, AuditLogs, and the Device* tables.

The detection rule's entity mapping, which you configured in Modules 3-6, directly feeds the playbook's investigation queries. If DET-SOC-008 maps the Account entity to UserId and the IP entity to ClientIP, the playbook's first triage query uses those exact field values. This is why entity mapping matters operationally, not just for Sentinel's investigation graph. The playbook's queries depend on the entities the detection rule provides.
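As an illustration of how entity values feed a triage query, the sketch below fills the sign-in correlation query from Section 3 with the alert's account and timestamp. The `build_triage_query` helper is hypothetical and not part of Sentinel; it assumes the caller passes a timezone-aware datetime, and it rejects quote and backslash characters so an entity value cannot break out of the KQL string literal.

```python
from datetime import datetime, timezone


def build_triage_query(user_principal_name, rule_creation_time):
    """Return the Section 3 correlation query filled in from alert entities."""
    if '"' in user_principal_name or "\\" in user_principal_name:
        raise ValueError("unexpected character in UserPrincipalName")
    # Normalize to UTC so the KQL datetime literal is unambiguous.
    stamp = rule_creation_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "\n".join([
        f'let alertUser = "{user_principal_name}";',
        f"let ruleCreationTime = datetime({stamp});",
        "SigninLogs",
        "| where TimeGenerated between ((ruleCreationTime - 1h) .. ruleCreationTime)",
        "| where UserPrincipalName == alertUser",
        '| where RiskLevelDuringSignIn in ("medium", "high")',
        "| project TimeGenerated, IPAddress, Location, RiskLevelDuringSignIn, AppDisplayName",
    ])
```

If the detection rule did not map the Account entity, there is nothing to pass as `user_principal_name`, which is the operational cost of incomplete entity mapping.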
