4.4 On-Call Escalation Automation

5 hours · Module 4

What you already know

Section 4.3 covered email notifications. This section continues the notification pipeline.

Scenario

The on-call phone rings at 04:00. The analyst answers. The automated voice reads: 'Incident number four five two one has been assigned to you.' The analyst asks: what kind of incident? What severity? Is containment needed? The notification triggered the escalation. It did not arm the responder.

Detecting after-hours

The playbook needs to determine whether the current incident occurred during or outside business hours. Business hours at NE: Monday through Friday, 08:00–18:00 UTC.

The Logic App expression:

@if(or(
    less(int(formatDateTime(utcNow(), 'HH')), 8),
    greater(int(formatDateTime(utcNow(), 'HH')), 17),
    equals(dayOfWeek(utcNow()), 0),
    equals(dayOfWeek(utcNow()), 6)
), true, false)

This evaluates to true when the UTC hour is before 08:00 or is 18:00 or later (hour greater than 17), or when the day is Sunday (0) or Saturday (6). When true AND the incident severity is High or Critical, the playbook adds the on-call escalation branch alongside the standard Teams card notification.

During business hours, the escalation branch is skipped: the Teams card in #SOC-Critical or #SOC-Alerts is sufficient because analysts are actively monitoring the channels.
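As a cross-check on the boundary conditions, the same decision can be modelled outside the Logic App. The following is a minimal Python sketch, not the playbook itself; the function names and constants are illustrative assumptions. Note that Python numbers weekdays Monday=0 to Sunday=6, unlike dayOfWeek() in the expression above.

```python
from datetime import datetime, timezone

BUSINESS_START_HOUR = 8   # 08:00 UTC, inclusive
BUSINESS_END_HOUR = 18    # 18:00 UTC, exclusive

def is_after_hours(now_utc: datetime) -> bool:
    """True outside Monday-Friday 08:00-18:00 UTC."""
    # Python weekday(): Monday=0 ... Saturday=5, Sunday=6
    is_weekend = now_utc.weekday() >= 5
    outside_hours = (now_utc.hour < BUSINESS_START_HOUR
                     or now_utc.hour >= BUSINESS_END_HOUR)
    return is_weekend or outside_hours

def needs_on_call_escalation(now_utc: datetime, severity: str) -> bool:
    """Escalate only after hours and only for High or Critical incidents."""
    return is_after_hours(now_utc) and severity in {"High", "Critical"}
```

Feeding it timestamps either side of 08:00 and 18:00 is a quick way to confirm the inclusive and exclusive ends before deploying a change to the expression.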

Timezone handling. NE operates in the UK (UTC in winter, BST in summer). During British Summer Time (last Sunday in March to last Sunday in October), local clocks run 1 hour ahead of UTC, so 08:00–18:00 UTC corresponds to 09:00–19:00 local time. The simplest approach: define business hours in UTC and accept the 1-hour shift for as long as BST is in effect. The alternative, dynamically adjusting for BST, adds complexity for minimal gain (the after-hours boundary sits 1 hour later in local time at both ends of the day, so the on-call analyst receives a DM 1 hour earlier or later than "necessary").
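For teams that do want the BST-aware alternative, the adjustment is small. The sketch below is one possible approach in Python, assuming business hours are redefined as 08:00–18:00 UK local time (an assumption; the section defines them in UTC). It applies the last-Sunday rule directly, using the 01:00 UTC changeover, so it needs no timezone database. All function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

def last_sunday_0100_utc(year: int, month: int) -> datetime:
    """01:00 UTC on the last Sunday of the month (the UK clock-change instant)."""
    first_of_next = datetime(year + month // 12, month % 12 + 1, 1, 1, 0,
                             tzinfo=timezone.utc)
    last_day = first_of_next - timedelta(days=1)
    # weekday(): Monday=0 ... Sunday=6, so step back to the previous Sunday
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)

def uk_wall_clock(now_utc: datetime) -> datetime:
    """Naive datetime showing UK local wall-clock time for a UTC instant."""
    bst_start = last_sunday_0100_utc(now_utc.year, 3)
    bst_end = last_sunday_0100_utc(now_utc.year, 10)
    offset = timedelta(hours=1) if bst_start <= now_utc < bst_end else timedelta(0)
    return (now_utc + offset).replace(tzinfo=None)

def is_after_hours_uk(now_utc: datetime) -> bool:
    """True outside Monday-Friday 08:00-18:00 UK local time."""
    local = uk_wall_clock(now_utc)
    return local.weekday() >= 5 or not (8 <= local.hour < 18)
```

In a Logic App, the built-in convertFromUtc() function with the 'GMT Standard Time' zone is the more likely route to the same result; the manual rule here is only to keep the sketch self-contained.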

For organizations with global teams, the after-hours check can reference a watchlist that defines business hours per timezone/site, with the on-call schedule indicating which site's hours apply to the current on-call analyst.

The On-Call-Schedule watchlist

The on-call rotation is stored in a Sentinel watchlist rather than hardcoded in the playbook. This allows the SOC lead to update the rotation without modifying the Logic App:

Watchlist name: On-Call-Schedule

Columns:

WeekStartDate (date): the Monday that starts this on-call week
OnCallAnalystUPN (string): primary on-call analyst's UPN
BackupAnalystUPN (string): backup analyst's UPN (for minute-10 escalation)
IRLeadUPN (string): IR lead for this rotation period
PhoneNumber (string, optional): on-call analyst's phone number for PagerDuty/Opsgenie

The playbook queries the watchlist for the current week:

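_GetWatchlist('On-Call-Schedule')
// Watchlist columns are returned as strings, so cast before comparing dates
| extend WeekStart = todatetime(WeekStartDate)
| where WeekStart <= now() and WeekStart > ago(7d)
| take 1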

If the query returns no results (watchlist not updated for the current week), the playbook falls back to the IR lead (a hardcoded default) and adds an incident comment: "⚠ On-call schedule not found for current week (week starting {date}). Escalated directly to IR lead. Please update the On-Call-Schedule watchlist."

This fallback prevents the catastrophic failure mode where a Critical incident goes unnotified because someone forgot to update the rotation in the watchlist. The IR lead receives the escalation and also sees the warning about the missing schedule.
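The lookup-with-fallback behaviour can be sketched as a small Python function. This is an illustrative model, not the Logic App: the rows stand in for the watchlist query result, and DEFAULT_IR_LEAD_UPN is a hypothetical placeholder for the hardcoded default.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

# Hypothetical value; the playbook's real hardcoded default is not given here.
DEFAULT_IR_LEAD_UPN = "ir-lead@example.com"

def resolve_on_call(rows: List[Dict[str, str]],
                    today: date) -> Dict[str, Optional[str]]:
    """Return the rotation covering `today`, or fall back to the IR lead."""
    for row in rows:
        week_start = date.fromisoformat(row["WeekStartDate"])
        # Same window as the KQL filter: started on or before today,
        # and less than 7 days ago.
        if week_start <= today < week_start + timedelta(days=7):
            return {
                "primary": row["OnCallAnalystUPN"],
                "backup": row["BackupAnalystUPN"],
                "ir_lead": row["IRLeadUPN"],
                "warning": None,
            }
    monday = today - timedelta(days=today.weekday())  # Monday of this week
    warning = (
        f"⚠ On-call schedule not found for current week "
        f"(week starting {monday.isoformat()}). Escalated directly to IR lead. "
        f"Please update the On-Call-Schedule watchlist."
    )
    return {"primary": DEFAULT_IR_LEAD_UPN, "backup": None,
            "ir_lead": DEFAULT_IR_LEAD_UPN, "warning": warning}
```

Returning the warning alongside the recipients keeps the fallback visible: the caller can post it as the incident comment in the same step that sends the escalation.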

The four-tier escalation cascade

Tier 1. Minute 0: On-call analyst DM. The playbook sends a Teams direct message to the on-call analyst with the incident summary (same adaptive card format as the channel notification, but delivered as a DM for higher visibility). The DM notification is more likely to trigger a phone alert than a channel message, because most analysts configure DM notifications as "always on" even when channel notifications are muted.

The DM includes an "Acknowledge" button. When clicked, the button adds the "acknowledged" tag to the Sentinel incident via webhook (same mechanism as SA4.2).

Tier 2. Minute 10: Backup analyst + re-notification. The playbook waits 10 minutes (Delay action: PT10M), then checks whether the incident has been acknowledged:

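SecurityIncident
| where IncidentNumber == {incidentNumber}
// The table holds one row per incident update, so evaluate only the latest
| summarize arg_max(TimeGenerated, Labels) by IncidentNumber
// Incident tags are stored in the Labels column
| where tostring(Labels) has "acknowledged"
| count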

If acknowledged (count > 0) → stop escalation. The on-call analyst is handling it.

If NOT acknowledged → send a second DM to the on-call analyst with "ESCALATION: High/Critical incident unacknowledged after 10 minutes. Please respond immediately." Simultaneously send a DM to the backup analyst: "Backup escalation: {on-call analyst} has not acknowledged {incident}. Please review."

Tier 3. Minute 20: IR lead. Wait 10 more minutes. Check acknowledgment again. If still unacknowledged → send a Teams DM to the IR lead AND an email (the IR lead may not have Teams notifications active at 02:00, but their email is synced to their phone): "IR Lead escalation: High/Critical incident unacknowledged after 20 minutes. On-call and backup analysts have not responded. Incident: {summary}. Please assess and coordinate response."

Tier 4. Minute 30: CISO (Critical only). For Critical severity incidents that remain unacknowledged at the 30-minute mark → send the CISO the standard Critical email (SA4.3) with an additional note: "This incident has been unacknowledged for 30 minutes. The on-call analyst, backup analyst, and IR lead have been notified but have not responded. Automated containment (if applicable) has executed." This is the final escalation: the CISO takes ownership of the response coordination.

For High severity: the cascade stops at Tier 3. The IR lead receives the escalation at minute 20 and is expected to respond. High severity incidents are important but do not warrant waking the CISO at 02:00.

Acknowledgment detection: the escalation stop signal

The escalation cascade stops as soon as ANY person acknowledges the incident. Acknowledgment is detected by checking for the "acknowledged" tag on the Sentinel incident. The tag is added by the "Acknowledge" button on the DM adaptive card, or by the analyst manually adding the tag in Sentinel.

This design means: if the on-call analyst sees the DM at minute 5 and clicks Acknowledge, the minute-10 escalation does not fire. If the backup analyst acknowledges at minute 12, the minute-20 escalation does not fire. The acknowledgment is global: any responder can stop the cascade.

The acknowledgment means only that the analyst has SEEN the incident; it does not mean they have started triage. Acknowledgment is "I am aware and will handle this." The actual triage status is tracked separately through the incident assignment and status updates.
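The cascade and its stop signal can be modelled as a small table plus one loop. The Python sketch below is a simplified simulation under stated assumptions: it takes the minute at which the tag appeared as an input instead of querying Sentinel, and the names TIERS and fired_tiers are illustrative.

```python
from typing import List, Optional

# (minute, action, severities that reach this tier), from the four tiers above.
TIERS = [
    (0, "DM on-call analyst", {"High", "Critical"}),
    (10, "Re-notify on-call analyst and DM backup analyst", {"High", "Critical"}),
    (20, "DM and email IR lead", {"High", "Critical"}),
    (30, "Email CISO", {"Critical"}),
]

def fired_tiers(severity: str, acknowledged_at: Optional[int]) -> List[str]:
    """List the notifications sent, given when (if ever) the tag was added."""
    fired = []
    for minute, action, severities in TIERS:
        if severity not in severities:
            break  # e.g. High stops after Tier 3
        # Every tier after minute 0 checks for the tag before sending.
        if minute > 0 and acknowledged_at is not None and acknowledged_at <= minute:
            break  # any acknowledgment stops the cascade
        fired.append(f"minute {minute}: {action}")
    return fired
```

With acknowledged_at=5 only the minute-0 DM is sent; with acknowledged_at=12 the minute-10 tier also fires but the IR lead is never paged, which matches the two cases described above.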

Anti-Pattern

Escalating to the IR lead on every after-hours High incident

The escalation logic pages the IR lead for every High severity incident outside business hours. At NE's volume, this means 2 to 3 pages per night. By the second week, the IR lead silences the notifications. When a genuine Critical arrives, the page is ignored alongside the routine Highs. Do not page the IR lead immediately for High incidents. They follow the timeout cascade: on-call analyst first, backup analyst at minute 10, and the IR lead only at minute 20 if nobody has acknowledged.

Phone call integration

Teams DMs are unreliable for sleeping analysts: the phone may be on silent, the Teams app may not generate a sound notification, or the analyst may have Do Not Disturb enabled. For organizations that take after-hours response seriously, phone call integration is the most reliable escalation mechanism.

PagerDuty integration. The Logic App sends an HTTP POST to PagerDuty's events API:

POST https://events.pagerduty.com/v2/enqueue
Content-Type: application/json

{
  "routing_key": "{PagerDutyServiceIntegrationKey}",
  "event_action": "trigger",
  "dedup_key": "sentinel-{incidentNumber}",
  "payload": {
    "summary": "{incidentTitle} — {severity}",
    "severity": "critical",
    "source": "Sentinel",
    "custom_details": {
      "incident_number": "{incidentNumber}",
      "affected_user": "{UPN}",
      "enrichment_summary": "{enrichmentVerdict}"
    }
  }
}

PagerDuty routes the event to the on-call schedule configured in PagerDuty (which can mirror or replace the Sentinel watchlist schedule), triggers phone calls, and manages its own escalation if the phone call is not answered. The dedup_key prevents duplicate PagerDuty incidents if the Sentinel playbook retries.
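To make the payload construction concrete, here is a hedged Python sketch that builds the same event body. It only assembles the JSON; sending it is left to the HTTP action or client of choice. The function name and the SEVERITY_MAP are assumptions: the payload above hardcodes "critical", whereas this sketch maps Sentinel High to PagerDuty's "error" level purely for illustration.

```python
import json
from typing import Any, Dict

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Assumed mapping onto PagerDuty's severity values
# (critical, error, warning, info).
SEVERITY_MAP = {"Critical": "critical", "High": "error"}

def build_pagerduty_event(routing_key: str, incident_number: int, title: str,
                          severity: str, upn: str, verdict: str) -> Dict[str, Any]:
    """Build an Events API v2 trigger event for a Sentinel incident."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # A stable key per incident lets PagerDuty collapse playbook retries.
        "dedup_key": f"sentinel-{incident_number}",
        "payload": {
            "summary": f"{title} ({severity})",
            "severity": SEVERITY_MAP.get(severity, "warning"),
            "source": "Sentinel",
            "custom_details": {
                "incident_number": str(incident_number),
                "affected_user": upn,
                "enrichment_summary": verdict,
            },
        },
    }

# Placeholder values throughout; none of these come from a real incident.
body = json.dumps(build_pagerduty_event(
    "EXAMPLE_ROUTING_KEY", 4521, "Suspicious sign-in", "Critical",
    "user@example.com", "Likely compromise"))
```

Because dedup_key depends only on the incident number, calling the builder twice for the same incident yields events that PagerDuty treats as one alert.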

Opsgenie integration follows a similar pattern with Opsgenie's alert API. The Logic App sends an HTTP POST to https://api.opsgenie.com/v2/alerts with the incident details in the JSON body and the Opsgenie API key in the request header (Authorization: GenieKey {OpsgenieApiKey}).
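A comparable sketch for Opsgenie, again in Python and again only assembling the request without sending it. The field choices (message, alias, priority, source, details) follow Opsgenie's create-alert request; the priority mapping, the function name, and the use of alias as the de-duplication key for retries are illustrative assumptions, not NE's configuration.

```python
from typing import Any, Dict, Tuple

OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"

# Assumed mapping from Sentinel severity to Opsgenie priority (P1 highest).
PRIORITY_MAP = {"Critical": "P1", "High": "P2"}

def build_opsgenie_request(
    api_key: str, incident_number: int, title: str,
    severity: str, upn: str, verdict: str,
) -> Tuple[str, Dict[str, str], Dict[str, Any]]:
    """Return (url, headers, body) for an Opsgenie create-alert call."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"GenieKey {api_key}",
    }
    body = {
        # Opsgenie limits the message field to 130 characters.
        "message": f"{title} ({severity})"[:130],
        # Same alias on a retry should map to the same open alert.
        "alias": f"sentinel-{incident_number}",
        "priority": PRIORITY_MAP.get(severity, "P3"),
        "source": "Sentinel",
        "details": {
            "incident_number": str(incident_number),
            "affected_user": upn,
            "enrichment_summary": verdict,
        },
    }
    return OPSGENIE_ALERTS_URL, headers, body
```

Keeping the alias format identical to the PagerDuty dedup_key makes it easier to correlate alerts if both tools are ever run side by side.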

For organizations without PagerDuty or Opsgenie: the escalation relies entirely on Teams DMs and emails. This is less reliable but still significantly better than no escalation. To improve reliability: instruct on-call analysts to enable "Priority notifications" in Teams mobile settings for DMs from the soc-automation service account.

Compliance Context

A common objection: "BlueVoyant monitors 24/7, so NE does not need on-call escalation for after-hours incidents." This overlooks the decisions the MSSP is not authorized to make.

The MSSP monitors and triages alerts. They do NOT make containment decisions for VIP accounts (SA5.1 approval gate), do NOT authorize regulatory notifications (SA4.8 legal team assessment), and do NOT brief the CISO (SA4.3 executive notification). These actions require NE's internal team. A Critical incident at 02:00 that involves the CFO's account, triggers GDPR assessment, and requires board-level awareness cannot be fully handled by the MSSP alone. The on-call escalation ensures NE's internal decision-makers are reached for the decisions that require organizational authority.

The on-call rotation can also be managed outside the automation stack. PagerDuty, Opsgenie, or a simple shared calendar determines who is on call. The SA4 playbook integrates via webhook: when an after-hours Critical incident fires, the playbook sends a webhook to PagerDuty with the incident details. PagerDuty handles the escalation: page the primary on-call, escalate to secondary after 10 minutes, escalate to the SOC lead after 20 minutes. This separation of concerns means the SA4 playbook does not need to manage on-call schedules: it sends the webhook and PagerDuty handles the rest.
