The Triage Decision Framework

8-10 hours · Module 1 · Free

What you already know

Section 1.4 built the escalation framework for alerts that can't be classified. This section builds the triage decision framework that resolves the 70% that can — a structured methodology that produces consistent classification regardless of which analyst runs it.

Classification is the core L1 skill

Scenario

Two analysts receive the same alert: suspicious sign-in from unusual location for a VPN user. Analyst A checks the IP against the VPN exit list, confirms it matches, classifies as False Positive, and closes in 2 minutes. Analyst B checks the same IP, confirms VPN — but also checks whether the sign-in used interactive MFA or a cached token, whether other sign-ins occurred from non-VPN IPs in the same session, and whether any post-authentication activity occurred. Both classified correctly. Analyst B's classification is higher confidence because the enrichment was deeper. Without a framework that specifies what to check, the depth depends on the analyst — not the process.

Triage is not "look at the alert and decide." Triage is a structured evaluation: read the alert, run specific enrichment checks, evaluate the enrichment results against classification criteria, record the disposition with rationale. The framework defines what to check and what the results mean. The analyst applies judgment within the framework — the framework removes the guesswork about what to check, not the judgment about what the results mean.

Without a framework, triage quality is a function of the individual analyst. Experienced analysts check more thoroughly because they know what to look for. New analysts check less because they don't. The gap creates inconsistent classification quality that nobody measures because nobody defined what "thorough" means. The triage decision framework defines it.

Estimated time: 45 minutes.

Figure 1.5 — The five-step triage framework: read, check known, enrich, classify, record. Four disposition categories replace the binary TP/FP system — adding Benign True Positive and Undetermined to capture the classifications that binary systems force into the wrong bucket.

The five-step triage process

Step 1: Read the alert

Every triage starts with the alert details: which detection rule fired, what severity it assigned, which entities are involved (user account, device, IP address, application), when the event occurred, and any contextual information the rule provides through custom details or dynamic alert content.

This step takes 30-60 seconds. The analyst is building initial context: is this a credential alert, an email alert, an endpoint alert? Is the entity a VIP user, a service account, a test account? Is the timestamp consistent with normal business activity or does it fall outside business hours?

Reading the alert before acting on it sounds obvious but is routinely skipped in high-volume environments. Alert fatigue — the desensitization that comes from processing 80+ alerts per day — creates a pattern where analysts classify based on the rule name and severity without reading the entity details. The framework requires the analyst to read the full alert before proceeding. The 30 seconds invested prevents the 15-minute investigation that would result from misinterpreting the alert at the start.

Step 2: Check against known sources

Before running any enrichment queries, the analyst checks the entities against known-good sources: the VPN exit IP list, the service account registry, the scheduled process document, and the VIP watchlist. This check takes 1-2 minutes and resolves 15-20% of alerts immediately.

A sign-in from the corporate VPN exit IP resolves without further enrichment. A process execution from a known monitoring tool resolves without investigating the process. A scheduled backup that generates OfficeActivity alerts every night at 2 AM resolves without querying the audit log. These close as False Positive (FP) or Benign True Positive (BTP), the disposition categories defined in Step 4, on the strength of known environmental patterns. This is the fastest, highest-confidence triage path.

The known-good sources have to be maintained. The VPN exit IP list changes when the organization switches providers. The service account registry changes when new automation is deployed. The scheduled process document changes when IT modifies maintenance windows. If the known-good sources are stale, the analyst either wastes time investigating known-good activity or — worse — closes alerts on sources that are no longer legitimate. The environmental context documentation from Section 1.3 serves double duty here.
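
Where the known-good lists are stored as Sentinel watchlists, the Step 2 check can be a single lookup. The sketch below assumes a watchlist with the hypothetical alias `VPNExitIPs` that uses the IP address as its SearchKey. The alias and the IP are placeholders, not NE's actual configuration.

```kql
// L1 Triage, Step 2: is the alerted IP on the VPN exit list?
// Assumption: a watchlist with the hypothetical alias "VPNExitIPs"
// exists and uses the IP address as its SearchKey.
let AlertedIP = "203.0.113.5";
_GetWatchlist('VPNExitIPs')
| where SearchKey == AlertedIP
| project SearchKey, LastUpdatedTimeUTC
```

An empty result means the IP is not on the list and the alert continues to enrichment. The LastUpdatedTimeUTC column gives a quick staleness check on the entry itself.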

Step 3: Enrich

Enrichment means running specific queries to gather context beyond what the alert provides. The enrichment steps depend on the alert type — credential alerts need sign-in history, email alerts need inbox rule audit, endpoint alerts need process chain analysis. The triage framework specifies the enrichment steps for each alert category.

For identity alerts, the standard enrichment is: sign-in history for the user (last 7 days — is this IP, location, or device new?), MFA method details (interactive prompt vs satisfied-by-claim vs cached token), related alerts on the same user in the last 24 hours, and user role/group membership (VIP, admin, service account). This enrichment takes 3-5 minutes and produces the evidence that supports the classification.

Run this query in Advanced Hunting or Sentinel Logs to check sign-in history and MFA method for any user — replace the UPN with the entity from the alert:

KQL — Identity Triage Enrichment

// L1 Triage: sign-in history + MFA method for alerted user
// Replace the UPN with the entity from the alert
let AlertedUser = "user@company.com";
SigninLogs
| where TimeGenerated > ago(7d)
| where UserPrincipalName =~ AlertedUser
// AuthenticationDetails lists every authentication step. The first entry
// is usually the primary factor (password), so read the last entry,
// which is where the MFA step normally appears.
| extend AuthSteps = parse_json(AuthenticationDetails)
| extend LastStep = AuthSteps[array_length(AuthSteps) - 1]
| extend MFAMethod = tostring(LastStep.authenticationMethod)
| extend MFADetail = tostring(LastStep.authenticationStepResultDetail)
| project TimeGenerated,
    IPAddress,
    Location = strcat(tostring(LocationDetails.city), ", ",
        tostring(LocationDetails.countryOrRegion)),
    DeviceDetail = tostring(DeviceDetail.displayName),
    MFAMethod,
    MFADetail,
    AppDisplayName,
    ResultType,
    ConditionalAccessStatus
| sort by TimeGenerated desc

The output shows every sign-in for this user over the past 7 days. Four columns matter for triage. IPAddress: is the alerted IP new, or does it appear in the history? Location: is the geography consistent? MFAMethod: did the user complete MFA interactively (PhoneAppNotification, OneWaySMS, FIDO2), or was it satisfied by a claim in the token (SatisfiedByClaimInToken)? MFADetail: "MFA completed in Azure AD" versus "MFA requirement satisfied by claim in the token" is the distinction that separates an interactive authentication from a possible AiTM token capture.

The MFA method check is critical and frequently skipped. A sign-in that shows "MFA satisfied" looks benign in most triage frameworks. But the detail matters: interactive MFA (the user physically responded to a push notification or entered a TOTP code) is strong evidence of legitimate access. "Satisfied by claim in token" means the MFA assertion was carried in the token from a previous authentication — which can indicate a stolen token where the attacker inherited the MFA claim without ever performing MFA. This single enrichment step — checking the MFA method field in AuthenticationDetails, not just the MFA result — is what distinguishes an AiTM detection from a false dismissal. NE's post-incident triage framework adds this check to every identity alert.
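
The identity enrichment list also calls for related alerts on the same user in the last 24 hours. A minimal sketch of that check follows, assuming alerts land in the standard SecurityAlert table. It matches on the account name portion of the UPN because account entities often store the name and UPN suffix as separate fields, so expect some over-matching on common names.

```kql
// L1 Triage: other alerts naming the same user in the last 24 hours
// Replace the UPN with the entity from the alert
let AlertedUser = "user@company.com";
let AccountName = tostring(split(AlertedUser, "@")[0]);
SecurityAlert
| where TimeGenerated > ago(24h)
| where Entities has AccountName or CompromisedEntity =~ AlertedUser
// Alerts can be written more than once as they update; keep the latest copy
| summarize arg_max(TimeGenerated, *) by SystemAlertId
| project TimeGenerated, AlertName, AlertSeverity, ProviderName, Tactics
| sort by TimeGenerated desc
```

One alert in the result is the alert being triaged. Anything beyond that is context the classification has to account for.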

For email alerts, the enrichment is: inbox rule audit (any new rules in the last 48 hours?), MailItemsAccessed volume (anomalous read activity?), email forwarding configuration (external forwarding enabled?), and related identity alerts (was the account compromised before the email activity?).

Run this query to check for inbox rules created in the last 48 hours for an alerted user:

KQL — Email Triage Enrichment

// L1 Triage: inbox rules created in last 48h for alerted user
let AlertedUser = "user@company.com";
OfficeActivity
| where TimeGenerated > ago(48h)
| where UserId =~ AlertedUser
| where Operation in ("New-InboxRule", "Set-InboxRule",
    "Enable-InboxRule", "Set-Mailbox")
| extend RuleParams = tostring(Parameters)
| project TimeGenerated, Operation,
    ClientIP, RuleParams

If this query returns results, check the RuleParams for suspicious patterns: rules that forward to external addresses, rules that move emails to RSS Subscriptions or obscure folders, rules that match keywords like "invoice," "payment," "wire," or "urgent." The cross-domain check — email alert triggering a sign-in history review — catches BEC chains that start with credential compromise and progress through email manipulation. An inbox rule alert in isolation might be a user organizing their mailbox. The same inbox rule combined with a suspicious sign-in from a new IP 30 minutes earlier changes the classification entirely.
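
The cross-domain check can be sketched as a join between inbox rule changes and the sign-ins that preceded them. The 60-minute window is an assumption chosen for illustration, not a value from NE's playbook. Widen or narrow it to suit the environment.

```kql
// L1 Triage: successful sign-ins in the 60 minutes before each inbox rule change
// Assumption: the 60-minute window is illustrative
let AlertedUser = "user@company.com";
let RuleChanges = OfficeActivity
    | where TimeGenerated > ago(48h)
    | where UserId =~ AlertedUser
    | where Operation in ("New-InboxRule", "Set-InboxRule")
    | project RuleTime = TimeGenerated, Operation,
        RuleClientIP = ClientIP, UserKey = tolower(UserId);
SigninLogs
| where TimeGenerated > ago(49h)
| where UserPrincipalName =~ AlertedUser
| where ResultType == "0"
| project SigninTime = TimeGenerated, IPAddress, AppDisplayName,
    UserKey = tolower(UserPrincipalName)
| join kind=inner RuleChanges on UserKey
| where SigninTime between ((RuleTime - 60m) .. RuleTime)
| project RuleTime, Operation, RuleClientIP,
    SigninTime, IPAddress, AppDisplayName
| sort by RuleTime desc, SigninTime desc
```

Compare the IPAddress values against the 7-day sign-in history from the identity enrichment. A rule change preceded by a sign-in from an IP the user has never used is the pattern that changes the classification.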

For endpoint alerts, the enrichment is: parent process chain (what launched the suspicious process?), file reputation (known good, known bad, unknown?), network connections from the device in the same time window, and related alerts on the same device. The parent process check is the single most valuable enrichment for endpoint alerts. A suspicious PowerShell command launched by explorer.exe (user-initiated) is entirely different from the same command launched by winword.exe (macro-initiated) or wmiprvse.exe (remote WMI execution). Same alert, same command, three different investigation paths determined by the parent process.
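
A minimal sketch of the parent process check, assuming Defender for Endpoint data is available in the DeviceProcessEvents table. The device name is a placeholder. In Advanced Hunting the time column is Timestamp rather than TimeGenerated.

```kql
// L1 Triage: parent process chain for PowerShell on the alerted device
// Replace the device name with the entity from the alert
let AlertedDevice = "workstation01.company.com";
DeviceProcessEvents
| where TimeGenerated > ago(24h)
| where DeviceName =~ AlertedDevice
| where FileName in~ ("powershell.exe", "pwsh.exe")
| project TimeGenerated,
    AccountName,
    Grandparent = InitiatingProcessParentFileName,
    Parent = InitiatingProcessFileName,
    ParentCommandLine = InitiatingProcessCommandLine,
    ProcessCommandLine
| sort by TimeGenerated desc
```

The Parent column is the one to read first: explorer.exe, winword.exe, and wmiprvse.exe each point to a different investigation path for the same command line.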

The framework specifies these enrichment steps so that every analyst checks the same things in the same order. Analyst judgment applies to interpreting the results — the framework standardizes what's checked, not what the results mean.

Enrichment time management

The enrichment steps for each alert category take 3-8 minutes. Combined with the initial read (1 minute) and known-source check (1-2 minutes), the full triage framework runs in 5-11 minutes for most alerts, which fits within the 15-minute L1 boundary with margin for classification and recording. When enrichment queries run slowly during peak ingestion, the analyst notes the incomplete enrichment and escalates at the 15-minute mark rather than extending triage time.

Step 4: Classify

The enrichment results map to one of four disposition categories. The mapping requires judgment, but the categories are defined and the criteria are specific.

True Positive (TP): The alert correctly identified malicious activity. The entity is compromised or under active attack. Action required: escalate to L2 for investigation and containment.

False Positive (FP): The alert fired on activity that is not malicious and does not represent a real event matching the detection intent. Example: a brute force rule fires on a user who mistyped their password five times. The rule matched the pattern it keys on (multiple failed authentications), but the event it exists to catch, a brute force attack, never happened. The rule needs tuning if this FP pattern recurs.

Benign True Positive (BTP): The alert correctly identified the activity it was designed to detect, and the activity is real, but the activity is legitimate in context. Example: a "suspicious PowerShell" rule fires on the IT admin's scheduled automation script. The PowerShell execution is real, the encoded command line is real, and the detection rule correctly identified it — but the activity is a known, authorized administrative function. This distinction matters for metrics: FP means the rule fired incorrectly, BTP means the rule fired correctly on legitimate activity. FP requires rule tuning. BTP requires an exclusion or watchlist entry.

Undetermined (UND): The analyst cannot classify the alert with sufficient confidence after completing the enrichment steps. The evidence doesn't clearly support TP, FP, or BTP. This is the honest classification for the ambiguous zone — and it triggers escalation per the capability trigger from Section 1.4.

The undetermined problem

Most SOCs lack an Undetermined category. Sentinel's built-in classification options are True Positive, Benign Positive, and False Positive — no Undetermined option. When analysts are forced to choose from three categories and none fits, they pick the closest one. Usually FP. An alert that should be "I don't know" becomes "not malicious" — and the investigation never happens.

NE addresses this by adding a custom classification tag. Analysts who select "Benign Positive" but add the tag "UND" indicate that their classification is uncertain. These tagged incidents are reviewed weekly during the L3 quality review. The undetermined rate — what percentage of closures are tagged UND — is a direct signal of detection rule quality. A rule with a 20% undetermined rate produces alerts that analysts cannot resolve with the available evidence, which means the rule needs either better enrichment data or a more specific detection hypothesis.
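
The undetermined rate can be computed from closed incidents. The sketch below assumes the UND tag is applied as an incident label and that the incident Title is a usable stand-in for the detection rule name. Both are assumptions about the workspace, and rules with dynamic titles would need a different grouping key.

```kql
// Disposition mix per detection rule over the last 30 days
// Assumptions: "UND" is applied as an incident tag; Title approximates the rule
SecurityIncident
| where TimeGenerated > ago(30d)
// The table holds one row per incident update; keep the latest state
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status == "Closed"
| extend IsUND = tostring(Labels) has "UND"
| summarize Closed = count(),
    FP = countif(Classification == "FalsePositive" and not(IsUND)),
    BTP = countif(Classification == "BenignPositive" and not(IsUND)),
    UND = countif(IsUND)
    by Title
| extend UNDRatePct = round(100.0 * UND / Closed, 1)
| sort by UNDRatePct desc
```

Rules at the top of the list are the candidates for better enrichment data or a more specific detection hypothesis.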

Step 5: Record the disposition

Every closure includes a brief rationale: what was checked, what was found, and why the classification was chosen. "FP — IP 203.0.113.5 is corporate VPN exit. User's device and location consistent with history. No related alerts." This takes 30 seconds and serves two purposes: it enables L2 quality review (was the L1 classification correct?) and it builds the disposition data that feeds the quality metrics in Section 1.6.

The recording discipline is the hardest part of the framework to sustain. Under queue pressure, analysts cut the rationale short or skip it entirely. The temptation is always "I know it's FP, why do I need to write it down?" The answer is metrics. Without recorded rationale, the quality review can't assess whether closures were correct. Without quality assessment, the SOC can't measure classification accuracy. Without accuracy data, the SOC can't identify which analysts need targeted training or which parts of the framework need refinement.
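
Whether the recording discipline is holding can itself be measured. A rough sketch follows, assuming analysts enter the rationale as the closing comment on the incident. If the rationale is recorded elsewhere (for example in incident comments), this query will overstate the gap.

```kql
// Closed incidents with no recorded rationale, by owner, last 7 days
// Assumption: the rationale is entered as the closing classification comment
SecurityIncident
| where TimeGenerated > ago(7d)
| summarize arg_max(TimeGenerated, *) by IncidentNumber
| where Status == "Closed"
| extend Analyst = tostring(Owner.assignedTo)
| summarize Closed = count(),
    MissingRationale = countif(isempty(ClassificationComment))
    by Analyst
| extend MissingPct = round(100.0 * MissingRationale / Closed, 1)
| sort by MissingPct desc
```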

What we see in 90% of SOC triage processes

Binary classification: True Positive or False Positive. No Benign True Positive category, so legitimate admin activity that triggers rules gets classified as FP — which inflates the false positive rate and makes it impossible to distinguish between "the rule fired incorrectly" (needs tuning) and "the rule fired correctly on legitimate activity" (needs an exclusion). No Undetermined category, so ambiguous alerts get classified as FP to clear the queue — hiding the alerts that should be investigated. No recorded rationale, so quality review is impossible. The SOC processes alerts at high speed with no mechanism to assess whether the processing is accurate.

Building the L1 triage playbook

The deliverable from this section is a triage playbook — a document the L1 analyst references during every shift. The playbook defines the enrichment steps for each alert category (identity, email, endpoint, cloud application), the classification criteria for each disposition, and the escalation triggers from Section 1.4.

The playbook is not a decision tree with binary yes/no paths for every possible alert. Real alerts don't follow binary paths. The playbook is a structured framework that tells the analyst what to check, in what order, and what the results mean — while leaving room for the judgment that makes triage effective.

Build the playbook for your three most common alert categories first. At NE, those are: identity alerts (suspicious sign-in, impossible travel, MFA anomaly), email alerts (inbox rule creation, mail forwarding, phishing), and consent grant alerts (application permissions, OAuth flows). Each category has a one-page enrichment checklist. The combined playbook is three pages.

The playbook should live where the analyst works — in the Sentinel workspace, as a pinned bookmark or embedded guide, not in SharePoint where it requires a context switch to access. NE embeds the enrichment checklist as a Teams tab in the SOC channel and as a pinned tile in the Sentinel workspace. The analyst has the framework visible during triage without leaving the tool they're triaging in.

Version control the playbook

The triage playbook changes. New alert types require new enrichment checklists. Incidents reveal that an enrichment step is missing (the MFA method check was added to NE's playbook after INC-NE-2026-0227-001). New environmental knowledge needs to be reflected in the known-good sources. Each change should be versioned with a date and a brief rationale — "v1.3: Added MFA method check to identity enrichment after AiTM incident revealed gap."

Versioning prevents the stale playbook problem: a document that was accurate when written but no longer reflects the current environment or threat landscape. The L3 quality review (monthly) includes a check of whether the playbook needs updating based on the past month's escalation patterns and incidents.

Test the playbook by having two analysts triage the same 20 alerts independently using the framework. Compare their classifications. Where they agree, the framework is working. Where they disagree, the framework needs a clearer criterion for that alert pattern. The disagreement analysis is the most valuable part of the test — it reveals exactly where the framework is ambiguous. Run this test quarterly. The first test will reveal 3-5 areas where the framework needs clearer criteria. Subsequent tests should show improving agreement as the framework matures.

At NE, the first inter-analyst agreement test showed 72% agreement on the 20 test alerts. The disagreements clustered around two patterns: identity alerts where MFA succeeded but the location was unfamiliar (one analyst classified BTP, the other escalated), and email alerts where an inbox rule matched a suspicious pattern but the user had recently organized their mailbox (one classified FP, the other classified BTP). The framework was updated with clearer criteria for both patterns. The second test, three months later, showed 88% agreement. The framework improved because the test identified where it was ambiguous.
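
The agreement figure is a simple percentage of alerts on which both analysts chose the same disposition. The sketch below uses five made-up rows purely to show the calculation. The data is illustrative and is not NE's test set.

```kql
// Inter-analyst agreement on a shared test set (illustrative data only)
let TestResults = datatable(AlertId:int, AnalystA:string, AnalystB:string)
[
    1, "FP",  "FP",
    2, "BTP", "UND",
    3, "TP",  "TP",
    4, "FP",  "BTP",
    5, "BTP", "BTP"
];
TestResults
| extend Agree = AnalystA == AnalystB
| summarize Alerts = count(), Agreed = countif(Agree)
| extend AgreementPct = round(100.0 * Agreed / Alerts, 1)
```

With this sample the analysts agree on 3 of 5 alerts, or 60%. Filtering the same table to rows where the two columns differ gives the disagreement list that drives the framework revisions.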

SOC Operations Principle

Triage is a structured evaluation, not a gut check. The framework defines what to check, the enrichment provides the evidence, and the four-category disposition system captures the full range of classification outcomes, including the honest "I don't know" that the Undetermined category provides. Without structure, triage quality depends on the analyst. With structure, it depends on the framework, and a framework can be tested, measured, and revised in a way that individual intuition cannot.

Section 1.6 — Operational Metrics — Speed vs Quality. The triage framework produces classifications. The metrics framework measures whether those classifications are accurate. The next section draws the line between speed metrics that measure throughput and quality metrics that measure effectiveness — and shows why optimizing speed doesn't improve security.
