Unified Portal Operations: Daily SOC Workflow

10-14 hours · Module 1 · Free
What you already know
You already know how each Defender product works individually and how to investigate threats in each one. In this section you learn to operate the unified portal as a daily workflow: what you check first, how you manage the queue, how you document your work, and how you hand over to the next analyst. This operational discipline is what separates a SOC that catches attacks from one that drowns in noise.

Scenario

Northgate Engineering's morning analyst starts at 08:00. The incident queue shows 14 new incidents from the overnight period: one High-severity multi-stage attack, three Medium-severity alerts, and ten Low-severity policy violations. The night shift handover mentions a BEC investigation where containment is complete but the password reset is pending manager approval. CloudAppEvents data stopped flowing at 03:17 and has not recovered. The analyst needs to figure out what to do first, how to do it efficiently, and what to document so the afternoon shift can continue without starting over.

The shift start routine

Every shift starts the same way, regardless of what happened on the previous shift. You need situational awareness before you work the queue, and situational awareness comes from a consistent routine that checks the same things in the same order.

SHIFT START ROUTINE (~15 MINUTES)
1. QUEUE SCAN (~5 min): count new incidents, note High/Critical
2. HANDOVER (~3 min): read shift notes, pending actions
3. PIPELINE HEALTH (~5 min): data freshness query, flag stale tables
4. THREAT INTEL (~2 min): Threat Analytics, impact reports
5. ACTION CENTER (~2 min): pending approvals, stale remediation

Figure 1.10: Five-step shift start routine. Complete all five before working the incident queue. The order matters: pipeline health before queue work ensures you know which detection domains are actually reporting.

Step 1: Queue scan (5 minutes). Open Incidents, filter to Status = New, sort by Severity descending. Count new incidents since the last shift. Read the names and severities without opening them yet. You are building a mental map of what needs attention, not investigating. If any incident is High or Critical severity, note it. If any incident has been auto-assigned but not acknowledged, note it. The goal is a single answer: is anything on fire right now?
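The portal filter is the primary tool for this step, but a rough severity breakdown can also be pulled from Advanced Hunting. The sketch below makes two assumptions: it counts alerts rather than incidents (several alerts can roll up into one incident), and the 12-hour `ShiftLength` is a hypothetical value to replace with your own shift pattern.

```kql
// Sketch: alerts raised since the previous shift started, grouped by severity.
// Assumption: a 12-hour shift; adjust ShiftLength to your rota.
let ShiftLength = 12h;
AlertInfo
| where Timestamp > ago(ShiftLength)
| summarize AlertCount = dcount(AlertId), SampleTitles = make_set(Title, 5) by Severity
// Rank severities so High sorts first instead of alphabetically
| extend SeverityRank = case(Severity == "High", 1, Severity == "Medium", 2, Severity == "Low", 3, 4)
| order by SeverityRank asc
| project Severity, AlertCount, SampleTitles
```

Treat the result as a cross-check on the queue view, not a replacement for it: incident grouping, assignment, and status live in the Incidents page.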

Step 2: Read the shift handover (3 minutes). If your SOC uses a handover document (Teams channel, shared OneNote, wiki page), read it before you touch the queue. The previous shift should have documented incidents actively being investigated, actions pending external responses, and any data pipeline issues. If no formal handover exists, check the most recent incident comments. The handover bridges the context gap between shifts. Without it, investigation progress made at 23:00 is invisible at 08:00, and you spend the first hour reconstructing what your colleague already figured out.

Step 3: Check data pipeline health (5 minutes). This is the step most analysts skip and most SOC managers wish they would not. If a data connector stopped flowing, your detection rules are blind. Alerts that should fire will not fire, and the quiet queue creates a false sense of security.

KQL
// Pipeline health: when did each critical table last receive data?
union
    (DeviceProcessEvents | summarize LastEvent = max(Timestamp) | extend Table = "DeviceProcessEvents"),
    (EmailEvents | summarize LastEvent = max(Timestamp) | extend Table = "EmailEvents"),
    (IdentityLogonEvents | summarize LastEvent = max(Timestamp) | extend Table = "IdentityLogonEvents"),
    (CloudAppEvents | summarize LastEvent = max(Timestamp) | extend Table = "CloudAppEvents"),
    (AADSignInEventsBeta | summarize LastEvent = max(Timestamp) | extend Table = "AADSignInEventsBeta")
| project Table, LastEvent, DataAge = datetime_diff('minute', now(), LastEvent)
| order by DataAge desc

Normal ingestion latency for Defender XDR tables is 5-15 minutes. If any table shows a DataAge greater than 60 minutes, investigate the connector before touching the incident queue. A silent DeviceProcessEvents means endpoint detections are blind. A stale EmailEvents means phishing alerts will not fire. A stale IdentityLogonEvents means sign-in risk detections and Defender for Identity alerts stop flowing. Escalate connector issues to your engineering team immediately and document the outage window in your shift handover, because incidents that should have been detected during the gap will not appear in the queue retroactively when the pipeline recovers.
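Once a stale table is spotted, the handover needs the outage window, not just the fact of the outage. A minimal sketch for locating it, assuming CloudAppEvents is the affected table and a 24-hour lookback is enough. It uses `make-series` because a plain `summarize ... by bin()` returns no rows for hours that contain no events.

```kql
// Sketch: list the hourly buckets in the lookback period with zero CloudAppEvents rows.
// Assumptions: the table name, the 24h lookback, and the 1h step are illustrative.
let Lookback = 24h;
let Start = bin(ago(Lookback), 1h);
CloudAppEvents
| where Timestamp >= Start
| make-series Events = count() default = 0 on Timestamp from Start to now() step 1h
| mv-expand Timestamp to typeof(datetime), Events to typeof(long)
| where Events == 0
| project EmptyHour = Timestamp
| order by EmptyHour asc
```

The most recent bucket is only partly elapsed and may look empty because of normal ingestion latency, so read it with caution. For tighter boundaries, reduce the step (for example to `5m`) once you know the approximate window.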

Step 4: Check Threat Analytics (2 minutes). Navigate to Threat Analytics in the portal. Microsoft publishes threat analytics reports when new campaigns or vulnerabilities are actively being exploited. Reports marked with "Impact" indicate your tenant has exposed or impacted assets. A new report showing vulnerable devices or affected users takes priority over the standard queue because the attack may already be in your environment and your existing rules may not detect it. Threat Analytics reports include recommended hunting queries that you can run directly from the report page.

Step 5: Check the Action Center (2 minutes). Navigate to Actions & submissions, then Action center. The Pending tab shows automated investigation results waiting for analyst approval. If your automation level is set to "Semi - require approval for all folders," AIR identifies remediation actions but waits for you. Pending actions that sit unapproved for days mean you get the detection benefit but lose the response benefit. A soft-delete recommendation from AIR on a phishing campaign is useless if the emails sit in user inboxes for 48 hours while the approval ages in the queue. Review pending items at every shift start and approve, reject, or escalate each one.

Triage methodology and incident prioritization

After the shift start routine, you work through the incident queue. Defender XDR now applies machine learning-based incident scoring that goes beyond simple severity labels. The prioritization model evaluates signal rarity, correlation breadth, and potential impact to surface the incidents most likely to represent real attacks. Incidents with rare detection signals score higher than incidents built from common, frequently firing alerts, even if both carry the same severity label. This helps when you have 14 new incidents and need to decide where to start.

The 5-minute triage framework from Section 1.2 applies to every incident. Read the incident name and severity. Check the entities (how many users, devices, and mailboxes are involved; scope determines urgency). Open the highest-severity alert and read the evidence. Check whether automated investigation or attack disruption already acted. Classify and act.
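The entity check can also be done from Advanced Hunting when the incident page is slow to load or you want a number to paste into a comment. This is a sketch under stated assumptions: `TargetAlertId` is a placeholder you fill in, and the query counts email messages rather than mailboxes because it relies on the message identifier column.

```kql
// Sketch: count the distinct entities attached to one alert to gauge scope.
// Assumption: TargetAlertId is a placeholder; paste the AlertId from the alert page.
let TargetAlertId = "<alert-id>";
AlertEvidence
| where AlertId == TargetAlertId
| summarize
    Users = dcountif(AccountUpn, isnotempty(AccountUpn)),
    Devices = dcountif(DeviceName, isnotempty(DeviceName)),
    EmailMessages = dcountif(NetworkMessageId, isnotempty(NetworkMessageId)),
    RemoteIPs = dcountif(RemoteIP, isnotempty(RemoteIP))
```

The `isnotempty` guards keep evidence rows that lack a given entity from being counted as an extra distinct value.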

The classifications are True Positive (real threat, investigate), False Positive (legitimate activity, close with documentation), Benign True Positive (real detection of authorized activity, close with reference to the change ticket or business justification), and Informational (the detection worked correctly but the activity is expected in your environment). The discipline is not spending 30 minutes on something you should classify in 3. Speed in triage comes from pattern recognition that develops with experience: you learn which alert titles are almost always false positives in your environment, which detection rules produce the most noise, and which combinations of correlated alerts indicate real attacks.

The analyst who investigates every incident to completion before moving to the next one

This pattern looks thorough but creates a dangerous backlog. You spend 90 minutes fully investigating a Medium-severity impossible travel alert while a High-severity multi-stage attack sits untouched three rows down the queue. The impossible travel was a VPN user who forgot to disconnect before switching networks. The multi-stage attack involved credential theft, lateral movement, and mailbox rule creation. Triage first, investigate in priority order. A 5-minute classification pass across all new incidents takes 30 minutes and identifies the one that actually matters. A 90-minute deep dive into the first incident in the list wastes time on what is statistically most likely the wrong incident.

Priority-based investigation

After triaging all new incidents, investigate in priority order.

Priority 1: Active attacks. Attack disruption triggered, ransomware indicators, active data exfiltration, ongoing lateral movement. Full attention immediately. These incidents have automated containment actions in progress, and you need to verify the containment scope, confirm the disruption worked, and determine whether the attacker established persistence before disruption engaged.

Priority 2: High-severity confirmed true positives. Classified as TP during triage with High or Critical severity. Investigate within the current shift. Do not defer to the next shift without documenting your progress in the incident comments. If investigation is incomplete at shift end, the handover must specify exactly where you stopped and what the next analyst should do first.

Priority 3: Medium-severity true positives and unknowns. Require investigation but are not actively progressing. Investigate within 24 hours. Document enough context in the incident that the next analyst who picks it up does not re-triage from scratch.

Priority 4: Low-severity and operational items. Policy violations, informational alerts, configuration issues. Batch these during quiet periods. Schedule 30-60 minutes per shift for clearing low-priority items. Do not let them accumulate indefinitely because patterns hide in the noise. Three unrelated low-severity alerts against the same user over five days may be the early stages of an attack that has not yet triggered a high-severity correlation.
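The low-severity pattern described in Priority 4 can be surfaced with a query instead of relying on memory. The sketch below is one possible approach, not a built-in detection: the 5-day window and the threshold of 3 alerts mirror the example in the text, and joining on `AccountUpn` assumes the alerts carry user evidence.

```kql
// Sketch: users with three or more distinct low-severity alerts in the last 5 days.
// Assumptions: window and threshold are taken from the example above; tune both.
let Window = 5d;
let MinAlerts = 3;
AlertInfo
| where Timestamp > ago(Window)
| where Severity in ("Low", "Informational")
| join kind=inner (
    AlertEvidence
    | where Timestamp > ago(Window)
    | where isnotempty(AccountUpn)
    | distinct AlertId, AccountUpn
  ) on AlertId
| summarize
    DistinctAlerts = dcount(AlertId),
    AlertTitles = make_set(Title, 10),
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp)
  by AccountUpn
| where DistinctAlerts >= MinAlerts
| order by DistinctAlerts desc
```

Running something like this during the batched low-priority slot turns "patterns hide in the noise" into a concrete list of accounts to look at.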

Documentation standards

Every incident you touch should have comments that allow another analyst to continue without contacting you. This is operational necessity. Analysts go on leave, shift patterns rotate, incidents span multiple days. If the only person who understands the investigation state is unavailable, the investigation stalls.

Classification comments. When you close an incident, document the classification and your reasoning: "Classified as FP. Alert fired on legitimate admin PowerShell activity by admin.t.clark running a scheduled compliance script. Ref: CHANGE-2026-0142." This tells the next analyst why you closed it and gives them a reference to verify. If you close without explanation and the same alert fires again tomorrow, the next analyst repeats your entire investigation.

Investigation progress. When you stop working an incident that is not yet resolved: "Confirmed malicious macro in invoice.docx delivered via email from compromised external account. Device DESKTOP-NGE042 isolated. Investigation package collected. Pending: decode Base64 payload, check file hash across tenant via DeviceFileEvents." The next analyst knows exactly where you stopped and what to do next.

Actions and approvals. "User sessions revoked. Password reset requires manager approval per IR policy. Ticket INC-NE-2026-0321 raised. Handover to next shift if not approved by 17:00." Actions pending external approval are the most common handover gap. If the password reset is approved at 18:30 and no one knows to execute it because you went home without documenting it, the attacker retains access for another 14 hours.

Escalations. "Escalated to Tier 2. Incident involves 12 devices and potential data exfiltration via SharePoint. Tier 2 lead S. Patel notified via Teams at 14:30." Name the person, name the channel, record the time.

The standard: another analyst should be able to read your comments and continue the investigation without messaging you. If they need to ask you what happened, your documentation failed.
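A pending step is easier to pick up when the comment includes the query itself. As an illustration, the hash check mentioned in the investigation-progress example might be recorded as something like the sketch below. The hash value is a placeholder, because the source does not give one, and the 30-day lookback is an assumption.

```kql
// Sketch: find every device where a given file hash was seen in the last 30 days.
// Assumption: SuspectSha256 is a placeholder for the hash recorded in the incident.
let SuspectSha256 = "<sha256-of-suspect-file>";
DeviceFileEvents
| where Timestamp > ago(30d)
| where SHA256 =~ SuspectSha256
| summarize
    FirstSeen = min(Timestamp),
    LastSeen = max(Timestamp),
    Paths = make_set(FolderPath, 10),
    Actions = make_set(ActionType)
  by DeviceName, FileName
| order by FirstSeen asc
```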

Shift handover

The handover is the document you write at the end of every shift. It covers three areas: open incidents and their current state, actions pending external responses, and environmental issues (connector outages, scheduled maintenance, active penetration tests). Keep it to five to ten bullet points. The incoming analyst needs a 2-minute briefing.

Shift Handover
Day Shift → Night Shift 2026-05-21 · T. Ashworth → P. Sharma

Open incidents (2):

INC-4821 [High] AiTM phishing campaign targeting Finance. 19 emails delivered, 3 users clicked. Containment complete on all 3 accounts (sessions revoked, MFA re-enrolled). Scoping in progress — check whether j.morrison accessed the Q3 financials SharePoint before containment. KQL in incident comments.

INC-4819 [Medium] Suspicious OAuth consent on marketing-analytics-v2 app. Publisher is legitimate vendor but requested Mail.ReadWrite + Files.ReadWrite.All. Awaiting app owner d.chen response (email sent 14:30). If no response by 22:00, revoke consent and notify manager.

Pending actions:

▸ AIR remediation pending approval: soft-delete phishing emails from 16 mailboxes (INC-4821). Review and approve.

▸ Password reset for s.patel pending manager approval. Ticket INC-NE-2026-0321. Escalate if not approved by 22:00.

Pipeline and environment:

▸ CloudAppEvents recovered at 09:42 after 6h25m outage (03:17–09:42). Gap period logged. No MCAS alerts expected from that window.

▸ Scheduled pen test on VLAN 10 endpoints runs 20:00–02:00. Suppress MDE alerts from IP range 10.20.30.0/24.

Queue status: 14 incidents triaged · 8 FP closed · 3 escalated to Tier 2 · 1 informational batched

Priority for next shift: Complete scoping on INC-4821. The compromised accounts may have accessed the finance SharePoint site. Run the CloudAppEvents query in incident comments to check file download activity during the compromise window.

This handover gives the incoming analyst everything they need to continue without asking. The open incidents have specific next steps. The pending actions have deadlines. The pipeline issue is documented with the gap window so the night analyst knows which time period has no cloud app visibility. The pen test suppression prevents a cascade of false positives at 20:00.
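The handover points to a CloudAppEvents query stored in the incident comments; that query is not reproduced in this section. A hypothetical sketch of what such a query might look like follows. The window bounds, the account placeholder, and the `FileDownloaded` action type value are all assumptions to confirm against your own data.

```kql
// Sketch: file downloads by one account during a suspected compromise window.
// Assumptions (all hypothetical): window bounds, account identifier, ActionType value.
let WindowStart = datetime(2026-05-21 08:00);
let WindowEnd = datetime(2026-05-21 12:00);
let TargetAccountObjectId = "<account-object-id>";
CloudAppEvents
| where Timestamp between (WindowStart .. WindowEnd)
| where AccountObjectId == TargetAccountObjectId
| where ActionType == "FileDownloaded"
| project Timestamp, AccountDisplayName, Application, ObjectName, IPAddress, CountryCode
| order by Timestamp asc
```

If the compromise window overlaps the documented CloudAppEvents outage (03:17 to 09:42), an empty result for that period is not evidence that nothing was downloaded. Record that limitation in the incident comments.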

Managing alert fatigue

Alert fatigue is the gradual degradation of analyst attention caused by sustained exposure to high-volume, low-value alerts. The symptoms are specific: you start closing incidents without reading the evidence tab. You classify ambiguous alerts as FP because the last five with that title were false positives. You skip the pipeline health check because it has been healthy for weeks. If you notice these patterns in yourself, switch to a different task for 30 minutes and return to the queue with fresh attention.

The organizational countermeasure is aggressive false positive reduction. Review the noisiest alert types weekly and tune or suppress the ones with high FP rates.

KQL
// Top 10 noisiest alert types in the last 30 days
AlertInfo
| where Timestamp > ago(30d)
| summarize AlertCount = count() by Title
| where AlertCount > 20
| order by AlertCount desc
| take 10

If "Suspicious PowerShell command line" fired 187 times in 30 days with a 76% false positive rate, that single alert type consumed approximately 15 hours of analyst triage time across shifts (187 alerts at about 5 minutes each), and roughly 12 of those hours went to the 142 or so false positives. Review the false positives, identify the common pattern (specific script path, specific service account, specific scheduled task), and create a suppression rule that excludes the known-good activity while preserving detection for the malicious variant. A SOC that receives 500 alerts per week at a 90% FP rate has 50 real alerts buried among 450 false positives. After tuning, the same SOC receives 100 alerts at a 50% FP rate and still has those 50 real alerts, but each real alert now sits beside one false positive instead of nine, and the analyst's ability to find them is transformed.
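The triage-time estimate can be turned into a repeatable query so the weekly tuning review starts from cost rather than raw counts. This is a sketch only: `MinutesPerAlert` is an illustrative average, not a measured value. The query does not compute a false positive rate, which has to come from wherever your team records classifications.

```kql
// Sketch: estimated triage hours per alert title over the last 30 days.
// Assumption: MinutesPerAlert = 5 is an illustrative average, not a measured value.
let MinutesPerAlert = 5.0;
AlertInfo
| where Timestamp > ago(30d)
| summarize AlertCount = count() by Title
| extend EstimatedTriageHours = round(AlertCount * MinutesPerAlert / 60.0, 1)
| top 10 by EstimatedTriageHours desc
```

With these assumptions, a title that fired 187 times shows an `EstimatedTriageHours` value of 15.6.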

Rotation helps too. Analysts who spend every shift on the same queue experience faster fatigue than those who alternate between queue triage, threat hunting, detection engineering, and investigation. If your team is too small for formal rotation, allocate 20-30% of each shift to non-queue activities: rule tuning, hunting queries, or documentation improvements. The non-queue work directly reduces future queue volume.

Security Operations Principle

A silent data pipeline is more dangerous than a noisy incident queue. A queue with 50 false positives is annoying but the real attacks are in there somewhere. A queue with zero alerts because the email connector has been down for six hours is dangerous because real attacks are happening undetected. Check data pipeline health at every shift start, before you touch the incident queue.

Next
Section 1.8 covers cross-product incident correlation: the Advanced Hunting unified schema, entity pivoting across products, building attack timelines from multi-product telemetry, and correlation patterns for common multi-stage attacks.