
Detection Engineering Cheatsheet

Detections are engineered, not written once and forgotten. This cheatsheet covers rule anatomy, threat-model-driven coverage, real detections by ATT&CK tactic, and the testing, tuning, and detection-as-code lifecycle that keeps them working, for Microsoft Sentinel and Defender XDR. No account needed.

Detection engineering is the discipline of building detections that work, prove they work, and keep working. The difference from writing a query is the lifecycle around it: a rule is threat-modeled, structured, tested before it goes live, tuned as the environment changes, and version-controlled so a change is deliberate and reviewable. These detections run in Microsoft Sentinel and Defender XDR. The queries are real detections; set your own thresholds before enabling.

The detection engineering problem

Most detection estates fail the same way: rules are written once, in the portal, against whatever data was easy, and then never touched. Coverage becomes accidental rather than designed; nobody knows which techniques are covered or which rules have silently stopped firing. A rule that broke when a field was renamed looks identical to a rule that simply has nothing to detect: both return zero. Zero is the most dangerous result in detection because it is indistinguishable from safety.

| Failure | The engineering answer |
| --- | --- |
| Accidental coverage | Threat-model the detections: build from ATT&CK, know what you cover and what you do not. |
| Write once, never tune | A tuning lifecycle: rules degrade, and degradation is monitored and corrected. |
| Silent rule breakage | Detection as code: version control, review, and tests catch the rule that quietly stopped working. |
| Zero results trusted | Test every rule with known-true data before enabling, so zero means clean, not broken. |

Rule anatomy

A production detection is more than its query. The scheduling and entity configuration determine whether it fires usefully, and the timing parameters hide gotchas that silently drop detections. Get the anatomy wrong and a perfectly correct query still misses, or floods.

| Element | What it controls (and the trap) |
| --- | --- |
| Query frequency | How often the rule runs. Too rare misses fast attacks; too frequent costs and duplicates. |
| Lookback period | How far back each run reads. Must cover the gap since the last run, plus ingestion delay, or events fall between runs. |
| Execution delay | The roughly 5-minute ingestion lag means a rule querying "the last 5 minutes" can miss events not yet ingested. Build the delay into the lookback. |
| Entity mapping | Maps query output to accounts, hosts, IPs so incidents correlate. Unmapped detections do not stitch into incidents. |
| Custom details | Surface the fields an analyst needs in the alert itself, so triage does not require re-running the query. |

Watch for field renames. A vendor schema change (a column renamed in an update) silently breaks any rule that referenced the old name: the rule still runs, errors or returns nothing, and the dashboard still shows it as "enabled." This is the single most common cause of a detection that exists but no longer detects.
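
The lookback and execution-delay rows above reduce to a small piece of arithmetic, sketched here in Python. The function names are hypothetical, and the 5-minute delay is the rough figure quoted in the table, not a measured value for any particular connector.

```python
from datetime import timedelta


def minimum_lookback(query_frequency: timedelta,
                     ingestion_delay: timedelta) -> timedelta:
    """Smallest lookback that leaves no gap between consecutive runs.

    An event stamped at time t only becomes queryable at t + delay, so some
    run must happen while the event is both ingested and still inside the
    window. With runs every `query_frequency`, that needs
    lookback >= query_frequency + ingestion_delay.
    """
    return query_frequency + ingestion_delay


def has_gap(query_frequency: timedelta,
            lookback: timedelta,
            ingestion_delay: timedelta) -> bool:
    """True if a late-arriving event can fall between two runs."""
    return lookback < minimum_lookback(query_frequency, ingestion_delay)


frequency = timedelta(minutes=5)
delay = timedelta(minutes=5)  # assumed worst-case ingestion lag

print(minimum_lookback(frequency, delay))               # shortest safe lookback
print(has_gap(frequency, timedelta(minutes=5), delay))  # "the last 5 minutes"
print(has_gap(frequency, timedelta(minutes=10), delay)) # delay built in
```

Note the trade-off: a lookback longer than the run interval means consecutive runs overlap, so the same event can match twice. Whether that needs explicit de-duplication depends on how the rule groups alerts.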

Threat-model-driven coverage

Coverage you did not design is coverage you cannot trust. The engineering approach builds detections from a threat model (the ATT&CK techniques relevant to your environment and adversaries) and maps each detection to what it covers. The output is a coverage map: the techniques you detect, the ones you have accepted you do not, and the gaps you are actively closing. Without it, you have a pile of rules and no idea what an attacker could do unseen.

| Step | What it produces |
| --- | --- |
| Threat model | The techniques that matter for your tenant, data, and likely adversaries. |
| Map to data | Which table would show each technique. No data source, no possible detection. |
| Build the detection | The rule, structured and tested, for each prioritized technique. |
| Track coverage | An ATT&CK map of covered, gapped, and accepted-risk techniques, kept current. |
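
One way to keep the coverage map from the last row honest is to store it as data next to the rules, as in this minimal Python sketch. The class and field names are hypothetical. The pairing of rule IDs to ATT&CK technique IDs is an assumption made for illustration, not a mapping stated by this cheatsheet.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    COVERED = "covered"
    GAP = "gap"
    ACCEPTED_RISK = "accepted risk"


@dataclass(frozen=True)
class Technique:
    technique_id: str            # ATT&CK technique ID
    data_source: str = ""        # table that would show it; "" if none
    rule_ids: tuple = ()         # detections mapped to the technique
    accepted_risk: bool = False  # explicitly accepted as not detected

    @property
    def status(self) -> Status:
        # Covered needs both a data source and at least one rule.
        if self.data_source and self.rule_ids:
            return Status.COVERED
        if self.accepted_risk:
            return Status.ACCEPTED_RISK
        return Status.GAP


# Illustrative entries only; the rule-to-technique pairing is assumed.
coverage = [
    Technique("T1621", "SigninLogs", ("DE4-001",)),
    Technique("T1021.001", "DeviceLogonEvents", ("DE8-001",)),
    Technique("T1567.002", "CloudAppEvents", ("DE7-005",)),
    Technique("T1003"),                      # no data source mapped yet
    Technique("T1200", accepted_risk=True),  # consciously not detected
]

summary = Counter(t.status for t in coverage)
gaps = [t.technique_id for t in coverage if t.status is Status.GAP]

for status in Status:
    print(f"{status.value}: {summary[status]}")
print("gaps to close:", gaps)
```

Deriving the status from the data, not typing it in by hand, means a technique cannot be marked covered without a data source and a rule behind it.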

Detections by tactic

Real detections from across the ATT&CK tactics, each with its rule ID, hypothesis, and core logic. These are engineered rules, not ad-hoc queries: tune the thresholds to your environment and test before enabling.

Initial access

// DE3-001: Phishing click-through with post-click auth anomaly
// Hypothesis: AiTM phishing produces allowed clicks followed by
// sign-ins from new infrastructure within 60 minutes
let CorrelationWindow = 60m;
let LookbackPeriod = 14d;
// Step 1: Identify allowed clicks from inbound emails
let SuspiciousClicks = UrlClickEvents
    | where Timestamp > ago(1h)
    | where ActionType == "ClickAllowed"
    | where Workload == "Email"
    | extend ChainLength = array_length(todynamic(UrlChain))
    | project ClickTime = Timestamp, AccountUpn, Url, UrlChain, ChainLength, IsClickedThrough, NetworkMessageId;
// Step 2: Build per-user baseline of known sign-in infrastructure
let KnownInfra = SigninLogs
    | where TimeGenerated > ago(LookbackPeriod)
    | where ResultType == 0
    | summarize
        KnownIPs = make_set(IPAddress, 100),
        KnownDevices = make_set(DeviceDetail.deviceId, 50),
        KnownCountries = make_set(LocationDetails.countryOrRegion, 20)
        by UserPrincipalName;
// Step 3: Find post-click sign-ins from new infrastructure
S

Credential & identity

// DE4-001: MFA push bombing. Hypothesis: an attacker with the password but not
// the factor sends repeated pushes; the signal is a burst of explicit denials,
// and the danger is a single approval that follows within the burst window.
let bombing = SigninLogs
    | where TimeGenerated > ago(1h)
    | where ResultType == 500121
    | where Status has "declined the authentication" or Status has "Reported Fraud"
    | summarize
        Denials = count(),
        FirstDenial = min(TimeGenerated),
        LastDenial = max(TimeGenerated),
        IPs = make_set(IPAddress, 10)
        by UserPrincipalName, bin(TimeGenerated, 10m)
    | where Denials >= 5; // 5+ explicit denials in 10 min = bombing, not fat-finger
let approvals = SigninLogs
    | where TimeGenerated > ago(1h)
    | where ResultType == 0 and AuthenticationRequirement == "multiFactorAuthentication"
    | project UserPrincipalName, ApprovalTime = TimeGenerated, ApprovalIP = IPAddress;
bombing
| join kind=leftouter approvals on UserPrincipalName
| where isempty(ApprovalTime) or ApprovalTime between (FirstDenial .. (LastDenial + 30m))
| exten

Persistence & execution

// DE5-002: OAuth consent grant anomaly. Hypothesis: a consent grant to an
// unrecognised app with sensitive permissions is consent phishing; legitimate
// business apps are on the approved list and filter cleanly.
let approvedApps = dynamic(["Microsoft Office", "Microsoft Teams", "SharePoint Online",
    "Azure Portal", "Outlook Mobile"]);
let sensitiveScopes = dynamic(["Mail.Read", "Mail.ReadWrite", "Mail.Send", "Files.Read",
    "Files.ReadWrite", "Files.Read.All", "Directory.ReadWrite.All", "User.Read.All"]);
AuditLogs
| where TimeGenerated > ago(20m)
| where OperationName == "Consent to application"
| extend
    ConsentUser = tostring(parse_json(tostring(InitiatedBy.user)).userPrincipalName),
    AppName = tostring(TargetResources[0].displayName),
    ConsentIP = tostring(parse_json(tostring(InitiatedBy.user)).ipAddress),
    Permissions = tostring(TargetResources[0].modifiedProperties)
| where AppName !in (approvedApps)
| where Permissions has_any (sensitiveScopes)
| project TimeGenerated, ConsentUser, AppName, ConsentIP, Permissions

Discovery & evasion

// DE6-001: reconnaissance command sequence. Hypothesis: 3+ distinct
// discovery commands within 2 minutes from a non-IT user is systematic
// environment mapping; legitimate admin usage is 1-2 commands per session.
let sequenceWindow = 2m;
let minCommands = 3;
let reconCommands = dynamic([
    "whoami", "whoami.exe", "net.exe", "net1.exe", "nltest", "nltest.exe",
    "systeminfo", "systeminfo.exe", "ipconfig", "ipconfig.exe",
    "tasklist", "tasklist.exe", "netstat", "netstat.exe", "arp.exe", "route.exe",
    "nslookup", "nslookup.exe", "qwinsta", "qwinsta.exe", "cmdkey", "cmdkey.exe"]);
let itAdmins = dynamic(["a.patel", "p.greaves", "SYSTEM", "NETWORK SERVICE"]);
DeviceProcessEvents
| where TimeGenerated > ago(15m)
| where FileName in~ (reconCommands)
| where AccountName !in~ (itAdmins)
| where AccountName !endswith "$"
| summarize
    CommandCount = dcount(FileName),
    Commands = make_set(FileName, 10),
    FirstCommand = min(TimeGenerated),
    LastCommand = max(TimeGenerated)
    by AccountName, DeviceName, bin(TimeGenerated, sequenceWind

Collection & exfiltration

// DE7-005: Cloud Storage Exfiltration Detection
// Detects: file uploads to unsanctioned cloud storage services
let lookback = 30m;
let sanctionedApps = dynamic([
    "Microsoft OneDrive for Business",
    "Microsoft SharePoint Online",
    "Microsoft Teams"
]);
CloudAppEvents
| where TimeGenerated > ago(lookback)
| where ActionType in ("FileUploaded", "Upload")
| where Application !in (sanctionedApps)
| where Application has_any (
    "OneDrive", "Google Drive", "Dropbox", "Box", "iCloud",
    "WeTransfer", "Mega", "pCloud", "Azure Storage", "AWS S3")
| summarize
    UploadCount = count(),
    TotalSizeBytes = sum(toint(RawEventData.FileSize)),
    Files = make_set(ObjectName, 20)
    by AccountDisplayName, Application, AccountObjectId
| where UploadCount >= 3
| project TimeGenerated = now(), AccountDisplayName, Application, UploadCount, TotalSizeBytes, Files

Lateral movement & impact

// DE8-001: RDP First-Access Lateral Movement Detection
// Detects: RDP logon to a device the user has never previously accessed
let lookback = 15m;
let baselinePeriod = 30d;
// Build 30-day RDP baseline: user -> devices they normally access.
// The baseline stops where the current window starts, so a new logon
// cannot appear in its own baseline.
let rdpBaseline = DeviceLogonEvents
    | where TimeGenerated > ago(baselinePeriod) and TimeGenerated < ago(lookback)
    | where LogonType == "RemoteInteractive" // Type 10 = RDP
    | where isnotempty(AccountName) and isnotempty(DeviceName)
    | distinct AccountName, DeviceName;
// Current RDP logons
DeviceLogonEvents
| where TimeGenerated > ago(lookback)
| where LogonType == "RemoteInteractive"
| where isnotempty(AccountName) and isnotempty(DeviceName)
| where AccountName !endswith "$" // Exclude computer accounts
// Find logons to devices NOT in the baseline
| join kind=leftanti rdpBaseline on AccountName, DeviceName
// This user has NEVER RDP'd to this device in 30 days
| project TimeGenerated, AccountName, DeviceName, RemoteIP, RemoteDeviceName, LogonType, ActionType

These are abbreviated real detections. Each is an engineered rule from the course, shown with its ID and core logic. The full versions, with entity mapping, custom details, tuning notes, and the complete coverage across every tactic, are in Detection Engineering.

Testing and tuning

A rule that has never been tested is a guess, and a rule that is never tuned decays. Testing happens before a rule goes live: validate it fires on known-true activity and stays quiet on known-benign, so that when it is live, a hit means something and silence means clean. Tuning happens forever after, because the environment moves under the rule.

| Cause of degradation | What happens |
| --- | --- |
| Schema change | A renamed field breaks the query; the rule errors or returns nothing. |
| Environment drift | New apps, users, or behavior that the rule was not tuned for, generating noise or blind spots. |
| Threshold rot | A count or window that fit at design time no longer matches the current baseline. |
| Data source gap | A connector breaks or a log stops flowing; the rule runs against nothing. |
| False-positive fatigue | An untuned noisy rule gets ignored, then disabled, then the coverage is gone. |

The test plan comes before enable, not after the first incident. Validate a new rule against representative true-positive data (it fires) and a window of normal activity (it does not flood). A rule enabled without this is discovered to be broken or unbearable in production, which is the most expensive place to find out.
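
The two halves of that test plan can be written as plain assertions. The sketch below is a Python stand-in, not the KQL rule itself: `detect_push_bombing` is a hypothetical re-implementation of the DE4-001 threshold (5 or more denials per user in a 10-minute bin), and the events are synthetic. A real pipeline would replay recorded events through the actual query.

```python
from collections import Counter
from datetime import datetime, timedelta

DENIED = "500121"  # denial result code filtered on in the DE4-001 excerpt
SUCCESS = "0"


def detect_push_bombing(events, threshold=5, window=timedelta(minutes=10)):
    """Return users with `threshold` or more denials inside one time bin."""
    counts = Counter()
    for event in events:
        if event["result"] != DENIED:
            continue
        # Same idea as KQL bin(): floor the timestamp to a fixed-size bucket.
        bin_index = (event["time"] - datetime.min) // window
        counts[(event["user"], bin_index)] += 1
    return {user for (user, _), n in counts.items() if n >= threshold}


start = datetime(2024, 1, 1, 9, 0)

# Known-true data: six denials one minute apart, all inside one bin.
true_positive = [
    {"user": "alice", "result": DENIED, "time": start + timedelta(minutes=i)}
    for i in range(6)
]

# Known-benign data: two mistaken denials and a normal sign-in.
benign = [
    {"user": "bob", "result": DENIED, "time": start},
    {"user": "bob", "result": DENIED, "time": start + timedelta(minutes=1)},
    {"user": "bob", "result": SUCCESS, "time": start + timedelta(minutes=2)},
]

assert detect_push_bombing(true_positive) == {"alice"}  # the rule fires
assert detect_push_bombing(benign) == set()             # the rule stays quiet
print("rule passes both halves of the test plan")
```

Fixed bins have a known blind spot: a burst that straddles a bin boundary can be split into two counts, each below the threshold. A true-positive fixture that crosses a boundary is a useful extra test case.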

Detection as code

Portal-only rule management is how the silent rule break survives: a change is made directly in the console, unreviewed and unlogged, and the rule that stops working leaves no trace of what changed or when. Detection as code puts the rules in version control, so every change is reviewed, attributable, and reversible, and a test suite can catch the break before it ships. You do not need a mature pipeline to start; you need the rules in a repository and a review step.

| Portal-only failure | What detection-as-code gives |
| --- | --- |
| Unreviewed changes | Pull-request review before a rule change reaches production. |
| No change history | Git history: what changed, who changed it, when, and why. |
| No rollback | Revert to the last working version when a change breaks a rule. |
| No testing | Automated validation of rule syntax and logic in the pipeline. |
| Drift between environments | The repository is the source of truth; environments deploy from it. |
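
As one example of the automated validation in the table above, a pipeline step can compare the columns each rule depends on against a snapshot of the current table schema. This is a sketch under assumptions: the snapshot is hard-coded here (in practice it would be exported from the workspace), the per-rule column list is a hypothetical metadata file kept beside the rule, and the renamed column is invented for the demonstration.

```python
# Hypothetical schema snapshot: table name -> columns currently present.
SCHEMA = {
    "DeviceLogonEvents": {
        "TimeGenerated", "AccountName", "DeviceName", "LogonType",
        "RemoteIP", "RemoteDeviceName", "ActionType",
    },
}

# Columns each rule references, declared next to the rule in the repository.
RULE_COLUMNS = {
    "DE8-001": {
        "DeviceLogonEvents": {
            "TimeGenerated", "AccountName", "DeviceName", "LogonType", "RemoteIP",
        },
    },
}


def missing_columns(rule_columns, schema):
    """Return {rule_id: ["Table.Column", ...]} for every broken reference."""
    broken = {}
    for rule_id, tables in rule_columns.items():
        missing = sorted(
            f"{table}.{column}"
            for table, columns in tables.items()
            for column in columns
            if column not in schema.get(table, set())
        )
        if missing:
            broken[rule_id] = missing
    return broken


# Simulate a vendor update that renames a column (the new name is made up).
renamed = {
    "DeviceLogonEvents":
        (SCHEMA["DeviceLogonEvents"] - {"RemoteIP"}) | {"RemoteAddress"},
}

print(missing_columns(RULE_COLUMNS, SCHEMA))   # nothing broken today
print(missing_columns(RULE_COLUMNS, renamed))  # the rename is caught
```

A check like this only catches references it has been told about, so it complements a fires-on-known-true-data test but does not replace it.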

Worked example: the silent rule break

A vendor update renames a field the credential-theft rule depends on. The rule keeps running, now returning zero, and the portal still shows it enabled and healthy. For weeks, the technique it covered is undetected, and nobody knows, because zero results looks exactly like a quiet environment. The gap is found only when an incident that the rule should have caught is discovered another way.

The engineering fix: detection as code would have caught it. A test asserting that the rule fires on known-true data fails in the pipeline the moment the field reference breaks, before the change ships. Add a monitoring rule for detections that have gone unexpectedly silent. Zero results is a state to alert on, not to trust.
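
The silence monitor can be as simple as comparing each rule's last hit against the longest quiet period considered normal for it. The Python sketch below assumes the last-hit timestamps have already been pulled from alert history. The function name, the per-rule quiet limits, and the dates are all hypothetical.

```python
from datetime import datetime, timedelta


def silent_rules(last_hit, max_quiet, now):
    """Return rule IDs that have been quiet for longer than their limit.

    A rule with no recorded hit at all is flagged too: it has never
    demonstrated that it can fire.
    """
    flagged = []
    for rule_id, limit in max_quiet.items():
        hit = last_hit.get(rule_id)
        if hit is None or now - hit > limit:
            flagged.append(rule_id)
    return sorted(flagged)


now = datetime(2024, 6, 1, 12, 0)

# Assumed tolerances: how long each rule may stay quiet before review.
max_quiet = {
    "DE4-001": timedelta(days=14),
    "DE8-001": timedelta(days=7),
}

last_hit = {
    "DE4-001": now - timedelta(days=3),   # fired recently
    "DE8-001": now - timedelta(days=21),  # quiet for three weeks
}

print(silent_rules(last_hit, max_quiet, now))
```

The quiet limit has to be set per rule from its own history. A rule for a rare technique may legitimately stay silent for months, which is why scheduled replay of known-true data is the stronger check for those rules.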

Quick lookup

| Tactic | Primary detection data source |
| --- | --- |
| Initial access | SigninLogs, EmailEvents, DeviceProcessEvents |
| Credential / identity | SigninLogs, AADNonInteractiveUserSignInLogs, AuditLogs |
| Persistence / execution | AuditLogs, CloudAppEvents, DeviceProcessEvents |
| Discovery / evasion | DeviceProcessEvents, AuditLogs |
| Collection / exfiltration | CloudAppEvents, EmailEvents, DeviceNetworkEvents |
| Lateral movement / impact | DeviceLogonEvents, SigninLogs, DeviceProcessEvents |

| Rule problem | Likely cause |
| --- | --- |
| Suddenly zero results | Schema change (field renamed) or a broken data connector. |
| Too noisy | Threshold rot or environment drift; re-baseline and tune. |
| Misses known activity | Lookback shorter than the run gap plus ingestion delay. |
| Does not correlate | Missing entity mapping; alerts do not stitch into incidents. |

From writing rules to engineering detection coverage

This cheatsheet is the craft in outline. Detection Engineering teaches the full discipline: rule architecture, threat modeling, detections for each ATT&CK tactic, the testing and tuning lifecycle, and detection-as-code from first repository to mature pipeline.
