Free Cheatsheet

Detection Engineering

Detections are engineered, not written once and forgotten. This cheatsheet covers rule anatomy, threat-model-driven coverage, real detections by ATT&CK tactic, and the testing, tuning, and detection-as-code lifecycle that keeps them working, for Microsoft Sentinel and Defender XDR.


Part of the M365 & Entra ID Detection Reference: this guide is one part of Ridgeline's unified M365 detection reference. See the full technique coverage map and the other guides in the set.

Detection engineering is the discipline of building detections that work, prove they work, and keep working. The difference from writing a query is the lifecycle around it: a rule is threat-modeled, structured, tested before it goes live, tuned as the environment changes, and version-controlled so a change is deliberate and reviewable. These detections run in Microsoft Sentinel and Defender XDR. The queries are real detections; set your own thresholds before enabling them.

The detection engineering problem

Most detection estates fail the same way: rules are written once, in the portal, against whatever data was easy, and then never touched. Coverage becomes accidental rather than designed; nobody knows which techniques are covered or which rules have silently stopped firing. A rule that broke when a field was renamed looks identical to a rule that simply has nothing to detect: both return zero. Zero is the most dangerous result in detection because it is indistinguishable from safety.

Failure	The engineering answer
Accidental coverage	Threat-model the detections: build from ATT&CK, know what you cover and what you do not.
Write once, never tune	A tuning lifecycle: rules degrade, and degradation is monitored and corrected.
Silent rule breakage	Detection as code: version control, review, and tests catch the rule that quietly stopped working.
Zero results trusted	Test every rule with known-true data before enabling, so zero means clean, not broken.

Rule anatomy

A production detection is more than its query. The scheduling and entity configuration determine whether it fires usefully, and the timing parameters hide gotchas that silently drop detections. Get the anatomy wrong and a perfectly correct query still misses, or floods.

Element	What it controls (and the trap)
Query frequency	How often the rule runs. Run it too rarely and fast attacks are missed; run it too often and it costs more and raises duplicate alerts.
Lookback period	How far back each run reads. Must cover the gap since the last run, plus ingestion delay, or events fall between runs.
Execution delay	The roughly 5-minute ingestion lag means a rule querying "the last 5 minutes" can miss events not yet ingested. Build the delay into the lookback.
Entity mapping	Maps query output to accounts, hosts, IPs so incidents correlate. Unmapped detections do not stitch into incidents.
Custom details	Surface the fields an analyst needs in the alert itself, so triage does not require re-running the query.
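
The lookback and execution-delay rows above can be made concrete in KQL. The sketch below first measures how late events actually arrive, then shows a rule filter that tolerates that lateness. It assumes a rule that runs every 5 minutes over SigninLogs; the variable names and the 10-minute delay allowance are illustrative, not values from this guide, and should be replaced with your own measurements.

```kql
// How late do sign-in events arrive? ingestion_time() is when the record
// was ingested; TimeGenerated is when the event happened.
SigninLogs
| where TimeGenerated > ago(1d)
| extend DelaySeconds = datetime_diff("second", ingestion_time(), TimeGenerated)
| summarize P50 = percentile(DelaySeconds, 50),
            P95 = percentile(DelaySeconds, 95),
            MaxDelay = max(DelaySeconds)
```

```kql
let ruleFrequency = 5m;       // assumed: how often the rule runs
let ingestionDelay = 10m;     // assumed: set this from the P95 measured above
SigninLogs
// Read far enough back to include events that arrived late...
| where TimeGenerated > ago(ruleFrequency + ingestionDelay)
// ...but only process records ingested since the previous run,
// so a late arrival is evaluated once rather than on every run.
| where ingestion_time() > ago(ruleFrequency)
```

With these assumed values the rule's configured lookback would need to be at least 15 minutes, since the query cannot read further back than the lookback allows.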

Watch for field renames. Vendor schema changes (a column renamed in an update) silently break any rule that referenced the old name: the rule runs, errors or returns nothing, and the dashboard still shows it as "enabled." This is the single most common cause of a detection that exists but no longer detects.
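
One way to make a rename visible is a small schema check that runs alongside the rules. The sketch below is an illustration under assumptions: the column list is hypothetical and would in practice be every column your rules reference in that table.

```kql
// Hypothetical list: the SigninLogs columns the rules depend on.
let requiredColumns = dynamic(["UserPrincipalName", "IPAddress", "ResultType", "Status"]);
SigninLogs
| getschema                                   // one row per column in the table
| where ColumnName in (requiredColumns)
| summarize Present = make_set(ColumnName)
| extend Missing = set_difference(requiredColumns, Present)
| where array_length(Missing) > 0             // returns a row only when something is gone
```

A returned row names the columns that have disappeared, which turns a silent zero into an explicit result that can itself be alerted on.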

Threat-model-driven coverage

Coverage you did not design is coverage you cannot trust. The engineering approach builds detections from a threat model (the ATT&CK techniques relevant to your environment and adversaries) and maps each detection to what it covers. The output is a coverage map: the techniques you detect, the ones you have accepted you do not, and the gaps you are actively closing. Without it, you have a pile of rules and no idea what an attacker could do unseen.

Step	What it produces
Threat model	The techniques that matter for your tenant, data, and likely adversaries.
Map to data	Which table would show each technique. No data source, no possible detection.
Build the detection	The rule, structured and tested, for each prioritized technique.
Track coverage	An ATT&CK map of covered, gapped, and accepted-risk techniques, kept current.
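
The coverage map in the last step can be kept as data and queried. The sketch below uses inline tables so it runs anywhere; the threat-model rows, the accept/detect decisions, and the mapping of this guide's rule IDs to ATT&CK technique IDs are all assumptions made for illustration.

```kql
// Hypothetical threat model: techniques in scope and the decision taken for each.
let threatModel = datatable(TechniqueId:string, Technique:string, Decision:string) [
    "T1566",     "Phishing",                                       "detect",
    "T1621",     "Multi-Factor Authentication Request Generation", "detect",
    "T1528",     "Steal Application Access Token",                 "detect",
    "T1021.001", "Remote Desktop Protocol",                        "detect",
    "T1567.002", "Exfiltration to Cloud Storage",                  "detect",
    "T1490",     "Inhibit System Recovery",                        "accept"
];
// Hypothetical rule inventory: which rule claims which technique.
let rules = datatable(RuleId:string, TechniqueId:string) [
    "DE3-001", "T1566",
    "DE4-001", "T1621",
    "DE5-002", "T1528",
    "DE8-001", "T1021.001"
];
threatModel
| join kind=leftouter rules on TechniqueId
| summarize Rules = make_set_if(RuleId, isnotempty(RuleId))
    by TechniqueId, Technique, Decision
| extend Status = case(
    array_length(Rules) > 0, "covered",
    Decision == "accept",    "accepted risk",
                             "gap")
| project TechniqueId, Technique, Status, Rules
| order by Status asc, TechniqueId asc
```

In this made-up inventory, T1567.002 has no rule mapped to it and no acceptance decision, so it surfaces as a gap.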

Detections by tactic

Real detections from across the ATT&CK tactics, each with its rule ID, hypothesis, and core logic. These are engineered rules, not ad-hoc queries. Tune the thresholds to your environment and test before enabling.

Initial access


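// DE3-001: Phishing click-through with post-click auth anomaly
// Hypothesis: AiTM phishing produces allowed clicks followed by
// sign-ins from new infrastructure within 60 minutes
let CorrelationWindow = 60m;
let LookbackPeriod = 14d;
// Step 1: Identify allowed clicks from inbound emails
let SuspiciousClicks =
    UrlClickEvents
    | where Timestamp > ago(1h)
    | where ActionType == "ClickAllowed"
    | where Workload == "Email"
    | extend ChainLength = array_length(todynamic(UrlChain))
    | project ClickTime = Timestamp, AccountUpn = tolower(AccountUpn), Url,
              UrlChain, ChainLength, IsClickedThrough,
              NetworkMessageId;
// Step 2: Build per-user baseline of known sign-in infrastructure
let KnownInfra =
    SigninLogs
    // End the baseline before the click window so a new sign-in cannot baseline itself
    | where TimeGenerated between (ago(LookbackPeriod) .. ago(1h))
    | where ResultType == 0
    | summarize
        KnownIPs = make_set(IPAddress, 100),
        KnownDevices = make_set(tostring(DeviceDetail.deviceId), 50),
        KnownCountries = make_set(tostring(LocationDetails.countryOrRegion), 20)
        by UserPrincipalName = tolower(UserPrincipalName);
// Step 3: Find post-click sign-ins from new infrastructure
// NOTE: the source listing is cut off at this step. The lines below are a
// reconstruction from the hypothesis above (an assumption, not the original
// rule body), and compare on IP address only.
SuspiciousClicks
| join kind=inner (
    SigninLogs
    | where TimeGenerated > ago(1h)
    | where ResultType == 0
    | project SigninTime = TimeGenerated,
              AccountUpn = tolower(UserPrincipalName), IPAddress
    ) on AccountUpn
| where SigninTime between (ClickTime .. (ClickTime + CorrelationWindow))
| join kind=leftouter KnownInfra on $left.AccountUpn == $right.UserPrincipalName
| where isnull(KnownIPs) or not(set_has_element(KnownIPs, IPAddress))
| project ClickTime, SigninTime, AccountUpn, Url, IPAddress, NetworkMessageId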

Credential & identity


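// DE4-001: MFA push bombing. Hypothesis: an attacker with the password but not
// the factor sends repeated pushes; the signal is a burst of explicit denials,
// and the danger is a single approval that follows within the burst window.
let bombing = SigninLogs
    | where TimeGenerated > ago(1h)
    | where ResultType == 500121
    | where Status has "declined the authentication" or Status has "Reported Fraud"
    | summarize Denials = count(), FirstDenial = min(TimeGenerated),
        LastDenial = max(TimeGenerated), IPs = make_set(IPAddress, 10)
        by UserPrincipalName, bin(TimeGenerated, 10m)
    | where Denials >= 5;          // 5+ explicit denials in 10 min = bombing, not fat-finger
let approvals = SigninLogs
    | where TimeGenerated > ago(1h)
    | where ResultType == 0 and AuthenticationRequirement == "multiFactorAuthentication"
    | project UserPrincipalName, ApprovalTime = TimeGenerated, ApprovalIP = IPAddress;
bombing
| join kind=leftouter approvals on UserPrincipalName
// Flag rather than filter: filtering on the approval time would drop a burst
// whenever the user's only approvals in the hour fall outside the window.
| extend ApprovedInWindow = isnotnull(ApprovalTime)
    and ApprovalTime between (FirstDenial .. (LastDenial + 30m))
// NOTE: the source listing is cut off at "| exten". The lines below are a
// reconstruction from the hypothesis above (an assumption, not the original rule).
| summarize ApprovalsInWindow = countif(ApprovedInWindow),
    ApprovalIPs = make_set_if(ApprovalIP, ApprovedInWindow, 10),
    DenialIPs = take_any(IPs)
    by UserPrincipalName, FirstDenial, LastDenial, Denials
| extend Outcome = iff(ApprovalsInWindow > 0,
    "Approval inside burst window", "Denials only")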

Persistence & execution


// DE5-002: OAuth consent grant anomaly. Hypothesis: a consent grant to an
// unrecognised app with sensitive permissions is consent phishing; legitimate
// business apps are on the approved list and filter cleanly.
let approvedApps = dynamic(["Microsoft Office", "Microsoft Teams",
    "SharePoint Online", "Azure Portal", "Outlook Mobile"]);
let sensitiveScopes = dynamic(["Mail.Read", "Mail.ReadWrite", "Mail.Send",
    "Files.Read", "Files.ReadWrite", "Files.Read.All",
    "Directory.ReadWrite.All", "User.Read.All"]);
AuditLogs
| where TimeGenerated > ago(20m)
| where OperationName == "Consent to application"
| extend ConsentUser = tostring(parse_json(
    tostring(InitiatedBy.user)).userPrincipalName),
    AppName = tostring(TargetResources[0].displayName),
    ConsentIP = tostring(parse_json(tostring(InitiatedBy.user)).ipAddress),
    Permissions = tostring(TargetResources[0].modifiedProperties)
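// An attacker can pick a display name that matches this list, so treat the
// name allowlist as a first pass rather than proof of legitimacy.
| where AppName !in~ (approvedApps)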
| where Permissions has_any (sensitiveScopes)
| project TimeGenerated, ConsentUser, AppName, ConsentIP, Permissions

Discovery & evasion


// DE6-001: reconnaissance command sequence. Hypothesis: 3+ distinct
// discovery commands within 2 minutes from a non-IT user is systematic
// environment mapping; legitimate admin usage is 1-2 commands per session.
let sequenceWindow = 2m;
let minCommands = 3;
let reconCommands = dynamic([
    "whoami", "whoami.exe",
    "net.exe", "net1.exe",
    "nltest", "nltest.exe",
    "systeminfo", "systeminfo.exe",
    "ipconfig", "ipconfig.exe",
    "tasklist", "tasklist.exe",
    "netstat", "netstat.exe",
    "arp.exe", "route.exe",
    "nslookup", "nslookup.exe",
    "qwinsta", "qwinsta.exe",
    "cmdkey", "cmdkey.exe"]);
let itAdmins = dynamic(["a.patel", "p.greaves",
    "SYSTEM", "NETWORK SERVICE"]);
DeviceProcessEvents
| where TimeGenerated > ago(15m)
| where FileName in~ (reconCommands)
| where AccountName !in~ (itAdmins)
| where AccountName !endswith "$"
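| summarize
    CommandCount = dcount(FileName),
    Commands = make_set(FileName, 10),
    FirstCommand = min(TimeGenerated),
    LastCommand = max(TimeGenerated)
    by AccountName, DeviceName,
    bin(TimeGenerated, sequenceWindow)
// The source listing ends mid-line above. The threshold filter below is inferred
// from the hypothesis and the otherwise unused minCommands parameter.
| where CommandCount >= minCommands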

Collection & exfiltration


// DE7-005: Cloud Storage Exfiltration Detection
// Detects: file uploads to unsanctioned cloud storage services
let lookback = 30m;
let sanctionedApps = dynamic([
    "Microsoft OneDrive for Business",
    "Microsoft SharePoint Online",
    "Microsoft Teams"
]);
CloudAppEvents
| where TimeGenerated > ago(lookback)
| where ActionType in ("FileUploaded", "Upload")
| where Application !in (sanctionedApps)
| where Application has_any (
    "OneDrive", "Google Drive", "Dropbox", "Box",
    "iCloud", "WeTransfer", "Mega", "pCloud",
    "Azure Storage", "AWS S3")
| summarize
    UploadCount = count(),
    TotalSizeBytes = sum(tolong(RawEventData.FileSize)),  // long, so totals above 2 GB do not overflow
    Files = make_set(ObjectName, 20)
    by AccountDisplayName, Application, AccountObjectId
| where UploadCount >= 3
| project TimeGenerated = now(), AccountDisplayName,
    Application, UploadCount, TotalSizeBytes, Files

Lateral movement & impact


// DE8-001: RDP First-Access Lateral Movement Detection
// Detects: RDP logon to a device the user has never previously accessed
let lookback = 15m;
let baselinePeriod = 30d;
// Build 30-day RDP baseline: user -> devices they normally access
let rdpBaseline = DeviceLogonEvents
    | where TimeGenerated > ago(baselinePeriod) and TimeGenerated < ago(lookback)
    | where LogonType == "RemoteInteractive"  // Type 10 = RDP
    | where isnotempty(AccountName) and isnotempty(DeviceName)
    | distinct AccountName, DeviceName;
// Current RDP logons
DeviceLogonEvents
| where TimeGenerated > ago(lookback)
| where LogonType == "RemoteInteractive"
| where isnotempty(AccountName) and isnotempty(DeviceName)
| where AccountName !endswith "$"  // Exclude computer accounts
// Find logons to devices NOT in the baseline
| join kind=leftanti rdpBaseline
    on AccountName, DeviceName
// This user has not RDP'd to this device at any point in the 30-day baseline
| project TimeGenerated, AccountName, DeviceName,
    RemoteIP, RemoteDeviceName,
    LogonType, ActionType

These are abbreviated real detections. Each is an engineered rule from the course, shown with its ID and core logic. The full versions, with entity mapping, custom details, tuning notes, and the complete coverage across every tactic, are in Detection Engineering.

Testing and tuning

A rule that has never been tested is a guess, and a rule that is never tuned decays. Testing happens before a rule goes live: validate it fires on known-true activity and stays quiet on known-benign, so that when it is live, a hit means something and silence means clean. Tuning happens forever after, because the environment moves under the rule.

Cause of degradation	What happens
Schema change	A renamed field breaks the query; the rule errors or returns nothing.
Environment drift	New apps, users, or behavior that the rule was not tuned for, generating noise or blind spots.
Threshold rot	A count or window that fit at design time no longer matches the current baseline.
Data source gap	A connector breaks or a log stops flowing; the rule runs against nothing.
False-positive fatigue	An untuned noisy rule gets ignored, then disabled, then the coverage is gone.
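
The data source gap row is the easiest of these to monitor directly. The sketch below flags tables that have gone quiet; the table list and the 6-hour silence threshold are assumptions to adjust for your own connectors and their normal volume.

```kql
let maxSilence = 6h;   // assumed threshold; quiet tables may need a longer one
// Hypothetical list of the tables the detections depend on.
let expected = datatable(TableName:string) [
    "SigninLogs", "AuditLogs", "DeviceProcessEvents", "CloudAppEvents"
];
let lastSeen =
    // isfuzzy=true lets the query run even if one of the tables does not exist
    union withsource=TableName isfuzzy=true
        SigninLogs, AuditLogs, DeviceProcessEvents, CloudAppEvents
    | where TimeGenerated > ago(7d)
    | summarize LastEvent = max(TimeGenerated) by TableName;
expected
| join kind=leftouter lastSeen on TableName
// A null LastEvent means nothing at all arrived in the last 7 days
| where isnull(LastEvent) or LastEvent < ago(maxSilence)
| project TableName, LastEvent
```

Starting from the expected list rather than from the data matters here: a table that has received nothing would otherwise not appear in the results at all.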

The test plan comes before enable, not after the first incident. Validate a new rule against representative true-positive data (it fires) and a window of normal activity (it does not flood). A rule enabled without this is discovered to be broken or unbearable in production, which is the most expensive place to find out.
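
A minimal sketch of such a test, using the denial-burst threshold from DE4-001 as the logic under test. The events are synthetic and the user names are made up; wrapping the logic in a function lets the same logic run against test data here and against the real table in production.

```kql
// Synthetic data: one user with six denials inside one 10-minute bin
// (known-true), one user with a single denial (known-benign).
let testEvents = datatable(TimeGenerated:datetime, UserPrincipalName:string, ResultType:string) [
    datetime(2024-01-01 09:00:10), "victim@contoso.example", "500121",
    datetime(2024-01-01 09:01:00), "victim@contoso.example", "500121",
    datetime(2024-01-01 09:02:00), "victim@contoso.example", "500121",
    datetime(2024-01-01 09:03:00), "victim@contoso.example", "500121",
    datetime(2024-01-01 09:04:00), "victim@contoso.example", "500121",
    datetime(2024-01-01 09:05:00), "victim@contoso.example", "500121",
    datetime(2024-01-01 09:03:00), "normal@contoso.example", "500121"
];
// The logic under test, taking its input table as a parameter.
let detection = (T:(TimeGenerated:datetime, UserPrincipalName:string, ResultType:string)) {
    T
    | where ResultType == "500121"
    | summarize Denials = count() by UserPrincipalName, bin(TimeGenerated, 10m)
    | where Denials >= 5
};
detection(testEvents)
| summarize Hits = make_set(UserPrincipalName)
| extend TruePositiveFired = set_has_element(Hits, "victim@contoso.example"),
         BenignStayedQuiet = not(set_has_element(Hits, "normal@contoso.example"))
```

Both result columns should come back true. If a later edit to the threshold or the field names makes either one false, the rule has changed behavior and the change needs a second look before it is enabled.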

Detection as code

Portal-only rule management is how the silent rule break survives: a change is made directly in the console, unreviewed and unlogged, and the rule that stops working leaves no trace of what changed or when. Detection as code puts the rules in version control, so every change is reviewed, attributable, and reversible, and a test suite can catch the break before it ships. You do not need a mature pipeline to start; you need the rules in a repository and a review step.

Portal-only failure	What detection-as-code gives
Unreviewed changes	Pull-request review before a rule change reaches production.
No change history	Git history: what changed, who changed it, when, and why.
No rollback	Revert to the last working version when a change breaks a rule.
No testing	Automated validation of rule syntax and logic in the pipeline.
Drift between environments	The repository is the source of truth; environments deploy from it.

Worked example: the silent rule break

A vendor update renames a field the credential-theft rule depends on. The rule keeps running, now returning zero, and the portal still shows it enabled and healthy. For weeks, the technique it covered is undetected, and nobody knows, because zero results looks exactly like a quiet environment. The gap is found only when an incident that the rule should have caught is discovered another way.

The engineering fix: detection as code would have caught it. A test asserting that the rule fires on known-true data fails in the pipeline the moment the field reference breaks, before the change ships. Pair that with a monitoring rule for detections that have gone unexpectedly silent. Zero results is a state to alert on, not to trust.
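
A monitoring rule of that kind could look like the sketch below, which lists rules that produced alerts earlier in a baseline period but none recently. The 30-day and 7-day windows are assumptions, and so is the ProductName filter used to select Sentinel analytics alerts; check the value against the alerts in your own workspace.

```kql
let baseline = 30d;   // assumed: how far back to look for past alerts
let recent = 7d;      // assumed: how long a rule may stay quiet before review
SecurityAlert
| where TimeGenerated > ago(baseline)
| where ProductName == "Azure Sentinel"   // assumption: verify in your workspace
| summarize LastAlert = max(TimeGenerated), AlertsInBaseline = count() by AlertName
| where LastAlert < ago(recent)           // alerted before, silent since
| order by AlertsInBaseline desc
```

This only catches rules that used to fire and stopped, so it complements the pipeline test rather than replacing it. A rule for a genuinely rare technique will appear here legitimately and needs a judgment call.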

Quick lookup

Tactic	Primary detection data source
Initial access	SigninLogs, EmailEvents, DeviceProcessEvents
Credential / identity	SigninLogs, AADNonInteractiveUserSignInLogs, AuditLogs
Persistence / execution	AuditLogs, CloudAppEvents, DeviceProcessEvents
Discovery / evasion	DeviceProcessEvents, AuditLogs
Collection / exfiltration	CloudAppEvents, EmailEvents, DeviceNetworkEvents
Lateral movement / impact	DeviceLogonEvents, SigninLogs, DeviceProcessEvents

Rule problem	Likely cause
Suddenly zero results	Schema change (field renamed) or a broken data connector.
Too noisy	Threshold rot or environment drift; re-baseline and tune.
Misses known activity	Lookback shorter than the run gap plus ingestion delay.
Does not correlate	Missing entity mapping; alerts do not stitch into incidents.

From writing rules to engineering detection coverage

This cheatsheet is the craft in outline. Detection Engineering teaches the full discipline: rule architecture, threat modeling, the detection per ATT&CK tactic, the testing and tuning lifecycle, and detection-as-code from first repository to mature pipeline.
