Detection Engineering Cheatsheet
Detections are engineered, not written once and forgotten. This cheatsheet covers rule anatomy, threat-model-driven coverage, real detections by ATT&CK tactic, and the testing, tuning, and detection-as-code lifecycle that keeps them working, for Microsoft Sentinel and Defender XDR.
Detection engineering is the discipline of building detections that work, prove they work, and keep working. The difference from writing a query is the lifecycle around it: a rule is threat-modeled, structured, tested before it goes live, tuned as the environment changes, and version-controlled so a change is deliberate and reviewable. These rules run in Microsoft Sentinel and Defender XDR. The queries are real detections; set your own thresholds before enabling them.
The detection engineering problem
Most detection estates fail the same way: rules are written once, in the portal, against whatever data was easy, and then never touched. Coverage becomes accidental rather than designed; nobody knows which techniques are covered or which rules have silently stopped firing. A rule that broke when a field was renamed looks identical to a rule that simply has nothing to detect: both return zero. Zero is the most dangerous result in detection because it is indistinguishable from safety.
| Failure | The engineering answer |
|---|---|
| Accidental coverage | Threat-model the detections: build from ATT&CK, know what you cover and what you do not. |
| Write once, never tune | A tuning lifecycle: rules degrade, and degradation is monitored and corrected. |
| Silent rule breakage | Detection as code: version control, review, and tests catch the rule that quietly stopped working. |
| Zero results trusted | Test every rule with known-true data before enabling, so zero means clean, not broken. |
Rule anatomy
A production detection is more than its query. The scheduling and entity configuration determine whether it fires usefully, and the timing parameters hide gotchas that silently drop detections. Get the anatomy wrong and a perfectly correct query still misses, or floods.
| Element | What it controls (and the trap) |
|---|---|
| Query frequency | How often the rule runs. Too rare misses fast attacks; too frequent costs and duplicates. |
| Lookback period | How far back each run reads. Must cover the gap since the last run, plus ingestion delay, or events fall between runs. |
| Execution delay | The roughly 5-minute ingestion lag means a rule querying "the last 5 minutes" can miss events not yet ingested. Build the delay into the lookback. |
| Entity mapping | Maps query output to accounts, hosts, IPs so incidents correlate. Unmapped detections do not stitch into incidents. |
| Custom details | Surface the fields an analyst needs in the alert itself, so triage does not require re-running the query. |
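The three timing rows reduce to one inequality: the lookback has to be at least the query frequency plus the ingestion delay. The Python sketch below checks that for a rule's settings. `RuleTiming` and `uncovered_gap` are hypothetical names used for illustration, not part of any Sentinel API, and the 5-minute default is the approximate delay quoted in the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RuleTiming:
    """Hypothetical container for a scheduled rule's timing settings."""
    frequency: timedelta  # how often the rule runs
    lookback: timedelta   # how far back each run reads


def uncovered_gap(timing: RuleTiming,
                  ingestion_delay: timedelta = timedelta(minutes=5)) -> timedelta:
    """Return the window of event time that a late-arriving event can fall into unseen.

    An event stamped at time t may only become queryable at t + ingestion_delay,
    and the next run can be up to one full frequency after that. That run still
    sees the event only if lookback >= frequency + ingestion_delay.
    """
    required = timing.frequency + ingestion_delay
    return max(required - timing.lookback, timedelta(0))


# A rule that runs every 5 minutes and reads only the last 5 minutes leaves a gap.
print(uncovered_gap(RuleTiming(timedelta(minutes=5), timedelta(minutes=5))))   # 0:05:00
# Widening the lookback to frequency + delay closes it.
print(uncovered_gap(RuleTiming(timedelta(minutes=5), timedelta(minutes=10))))  # 0:00:00
```

A lookback wider than the frequency means consecutive runs overlap, so the same event can match in two runs. That is the cost of closing the gap, and it is usually the better side of the trade.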
Threat-model-driven coverage
Coverage you did not design is coverage you cannot trust. The engineering approach builds detections from a threat model (the ATT&CK techniques relevant to your environment and adversaries) and maps each detection to the technique it covers. The output is a coverage map: the techniques you detect, the ones you have accepted you do not, and the gaps you are actively closing. Without it, you have a pile of rules and no idea what an attacker could do unseen.
| Step | What it produces |
|---|---|
| Threat model | The techniques that matter for your tenant, data, and likely adversaries. |
| Map to data | Which table would show each technique. No data source, no possible detection. |
| Build the detection | The rule, structured and tested, for each prioritized technique. |
| Track coverage | An ATT&CK map of covered, gapped, and accepted-risk techniques, kept current. |
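A coverage map can start as a small data structure kept in the same repository as the rules. The sketch below is one possible shape, not a standard format: `Technique`, `coverage_problems`, and `coverage_summary` are hypothetical names, and the pairing of this cheatsheet's rule IDs with ATT&CK technique IDs is an illustrative assumption rather than a mapping stated by the rules themselves.

```python
from collections import Counter
from dataclasses import dataclass, field

VALID_STATUSES = {"covered", "gap", "accepted"}


@dataclass
class Technique:
    technique_id: str
    name: str
    status: str                                       # one of VALID_STATUSES
    data_sources: list = field(default_factory=list)  # tables that would show the technique
    rule_ids: list = field(default_factory=list)      # detections mapped to it


def coverage_problems(techniques):
    """Flag entries whose claimed status is not backed by a data source and a rule."""
    problems = []
    for t in techniques:
        if t.status not in VALID_STATUSES:
            problems.append(f"{t.technique_id}: unknown status '{t.status}'")
        elif t.status == "covered":
            if not t.data_sources:
                problems.append(f"{t.technique_id}: covered but no data source mapped")
            if not t.rule_ids:
                problems.append(f"{t.technique_id}: covered but no rule mapped")
    return problems


def coverage_summary(techniques):
    """Count techniques per status."""
    return Counter(t.status for t in techniques)


# Illustrative entries only; the rule-to-technique pairing is assumed.
coverage_map = [
    Technique("T1566", "Phishing", "covered", ["UrlClickEvents", "SigninLogs"], ["DE3-001"]),
    Technique("T1621", "Multi-Factor Authentication Request Generation", "covered",
              ["SigninLogs"], ["DE4-001"]),
    Technique("T1021.001", "Remote Desktop Protocol", "covered",
              ["DeviceLogonEvents"], ["DE8-001"]),
    Technique("T1567.002", "Exfiltration to Cloud Storage", "covered",
              ["CloudAppEvents"], ["DE7-005"]),
    Technique("T1003", "OS Credential Dumping", "gap"),
    Technique("T1110", "Brute Force", "accepted"),
]

print(coverage_summary(coverage_map))
print(coverage_problems(coverage_map))
```

Running `coverage_problems` in review keeps "covered" honest: a technique cannot be marked covered without naming both the table that would show it and the rule that reads that table.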
Detections by tactic
Real detections from across the ATT&CK tactics, each with its rule ID, hypothesis, and core logic. These are engineered rules, not ad-hoc queries. Tune the thresholds to your environment and test before enabling.
Initial access
// DE3-001: Phishing click-through with post-click auth anomaly
// Hypothesis: AiTM phishing produces allowed clicks followed by
// sign-ins from new infrastructure within 60 minutes
let CorrelationWindow = 60m;
let LookbackPeriod = 14d;
// Step 1: Identify allowed clicks from inbound emails
let SuspiciousClicks =
UrlClickEvents
| where Timestamp > ago(1h)
| where ActionType == "ClickAllowed"
| where Workload == "Email"
| extend ChainLength = array_length(todynamic(UrlChain))
| project ClickTime = Timestamp, AccountUpn, Url,
UrlChain, ChainLength, IsClickedThrough,
NetworkMessageId;
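// Step 2: Build per-user baseline of known sign-in infrastructure.
// The baseline ends where the detection window starts, so a sign-in under
// evaluation cannot add itself to the "known" set.
let KnownInfra =
SigninLogs
| where TimeGenerated between (ago(LookbackPeriod) .. ago(1h))
| where ResultType == "0" // ResultType is a string column
| summarize
KnownIPs = make_set(IPAddress, 100),
KnownDevices = make_set(tostring(DeviceDetail.deviceId), 50),
KnownCountries = make_set(tostring(LocationDetails.countryOrRegion), 20)
by UserPrincipalName;
// Step 3: Find post-click sign-ins from new infrastructure.
// The source is cut off after the first character of this step; the join
// below is an assumed completion that follows the stated hypothesis.
SuspiciousClicks
| join kind=inner (
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType == "0"
| project SigninTime = TimeGenerated, UserPrincipalName, SigninIP = IPAddress
) on $left.AccountUpn == $right.UserPrincipalName
| where SigninTime between (ClickTime .. (ClickTime + CorrelationWindow))
| join kind=leftouter KnownInfra on UserPrincipalName
| where isnull(KnownIPs) or not(set_has_element(KnownIPs, SigninIP))
| project ClickTime, SigninTime, AccountUpn, Url, SigninIP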
Credential & identity
// DE4-001: MFA push bombing. Hypothesis: an attacker with the password but not
// the factor sends repeated pushes; the signal is a burst of explicit denials,
// and the danger is a single approval that follows within the burst window.
let bombing = SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType == "500121" // ResultType is a string column
| where Status has "declined the authentication" or Status has "Reported Fraud"
| summarize Denials = count(), FirstDenial = min(TimeGenerated),
LastDenial = max(TimeGenerated), IPs = make_set(IPAddress, 10)
by UserPrincipalName, bin(TimeGenerated, 10m)
| where Denials >= 5; // 5+ explicit denials in 10 min = bombing, not fat-finger
let approvals = SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType == "0" and AuthenticationRequirement == "multiFactorAuthentication"
| project UserPrincipalName, ApprovalTime = TimeGenerated, ApprovalIP = IPAddress;
bombing
| join kind=leftouter approvals on UserPrincipalName
| where isempty(ApprovalTime) or ApprovalTime between (FirstDenial .. (LastDenial + 30m))
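// The source is cut off at this "extend"; the severity flag and projection
// below are an assumed completion, not the original rule text.
| extend Severity = iff(isnotempty(ApprovalTime), "High", "Medium")
| project UserPrincipalName, Denials, FirstDenial, LastDenial, IPs,
ApprovalTime, ApprovalIP, Severity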
Persistence & execution
// DE5-002: OAuth consent grant anomaly. Hypothesis: a consent grant to an
// unrecognised app with sensitive permissions is consent phishing; legitimate
// business apps are on the approved list and filter cleanly.
let approvedApps = dynamic(["Microsoft Office", "Microsoft Teams",
"SharePoint Online", "Azure Portal", "Outlook Mobile"]);
let sensitiveScopes = dynamic(["Mail.Read", "Mail.ReadWrite", "Mail.Send",
"Files.Read", "Files.ReadWrite", "Files.Read.All",
"Directory.ReadWrite.All", "User.Read.All"]);
AuditLogs
| where TimeGenerated > ago(20m)
| where OperationName == "Consent to application"
| extend ConsentUser = tostring(parse_json(
tostring(InitiatedBy.user)).userPrincipalName),
AppName = tostring(TargetResources[0].displayName),
ConsentIP = tostring(parse_json(tostring(InitiatedBy.user)).ipAddress),
Permissions = tostring(TargetResources[0].modifiedProperties)
| where AppName !in (approvedApps)
| where Permissions has_any (sensitiveScopes)
| project TimeGenerated, ConsentUser, AppName, ConsentIP, Permissions
Discovery & evasion
// DE6-001: reconnaissance command sequence. Hypothesis: 3+ distinct
// discovery commands within 2 minutes from a non-IT user is systematic
// environment mapping; legitimate admin usage is 1-2 commands per session.
let sequenceWindow = 2m;
let minCommands = 3;
let reconCommands = dynamic([
"whoami", "whoami.exe",
"net.exe", "net1.exe",
"nltest", "nltest.exe",
"systeminfo", "systeminfo.exe",
"ipconfig", "ipconfig.exe",
"tasklist", "tasklist.exe",
"netstat", "netstat.exe",
"arp.exe", "route.exe",
"nslookup", "nslookup.exe",
"qwinsta", "qwinsta.exe",
"cmdkey", "cmdkey.exe"]);
let itAdmins = dynamic(["a.patel", "p.greaves",
"SYSTEM", "NETWORK SERVICE"]);
DeviceProcessEvents
| where TimeGenerated > ago(15m)
| where FileName in~ (reconCommands)
| where AccountName !in~ (itAdmins)
| where AccountName !endswith "$"
| summarize
CommandCount = dcount(FileName),
Commands = make_set(FileName, 10),
FirstCommand = min(TimeGenerated),
LastCommand = max(TimeGenerated)
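by AccountName, DeviceName,
bin(TimeGenerated, sequenceWindow)
// The source is cut off mid-line above; applying the minCommands threshold
// is an assumed completion that follows the stated hypothesis.
| where CommandCount >= minCommands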
Collection & exfiltration
// DE7-005: Cloud Storage Exfiltration Detection
// Detects: file uploads to unsanctioned cloud storage services
let lookback = 30m;
let sanctionedApps = dynamic([
"Microsoft OneDrive for Business",
"Microsoft SharePoint Online",
"Microsoft Teams"
]);
CloudAppEvents
| where TimeGenerated > ago(lookback)
| where ActionType in ("FileUploaded", "Upload")
| where Application !in (sanctionedApps)
| where Application has_any (
"OneDrive", "Google Drive", "Dropbox", "Box",
"iCloud", "WeTransfer", "Mega", "pCloud",
"Azure Storage", "AWS S3")
| summarize
UploadCount = count(),
TotalSizeBytes = sum(tolong(RawEventData.FileSize)), // long avoids overflow on large totals
Files = make_set(ObjectName, 20)
by AccountDisplayName, Application, AccountObjectId
| where UploadCount >= 3
| project TimeGenerated = now(), AccountDisplayName,
Application, UploadCount, TotalSizeBytes, Files
Lateral movement & impact
// DE8-001: RDP First-Access Lateral Movement Detection
// Detects: RDP logon to a device the user has never previously accessed
let lookback = 15m;
let baselinePeriod = 30d;
// Build 30-day RDP baseline: user -> devices they normally access
let rdpBaseline = DeviceLogonEvents
| where TimeGenerated > ago(baselinePeriod) and TimeGenerated < ago(lookback)
| where LogonType == "RemoteInteractive" // Type 10 = RDP
| where isnotempty(AccountName) and isnotempty(DeviceName)
| distinct AccountName, DeviceName;
// Current RDP logons
DeviceLogonEvents
| where TimeGenerated > ago(lookback)
| where LogonType == "RemoteInteractive"
| where isnotempty(AccountName) and isnotempty(DeviceName)
| where AccountName !endswith "$" // Exclude computer accounts
// Find logons to devices NOT in the baseline
| join kind=leftanti rdpBaseline
on AccountName, DeviceName
// This user has not RDP'd to this device within the 30-day baseline
| project TimeGenerated, AccountName, DeviceName,
RemoteIP, RemoteDeviceName,
LogonType, ActionType
Testing and tuning
A rule that has never been tested is a guess, and a rule that is never tuned decays. Testing happens before a rule goes live: validate it fires on known-true activity and stays quiet on known-benign, so that when it is live, a hit means something and silence means clean. Tuning happens forever after, because the environment moves under the rule.
| Cause of degradation | What happens |
|---|---|
| Schema change | A renamed field breaks the query; the rule errors or returns nothing. |
| Environment drift | New apps, users, or behavior that the rule was not tuned for, generating noise or blind spots. |
| Threshold rot | A count or window that fit at design time no longer matches the current baseline. |
| Data source gap | A connector breaks or a log stops flowing; the rule runs against nothing. |
| False-positive fatigue | An untuned noisy rule gets ignored, then disabled, then the coverage is gone. |
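The known-true and known-benign check can be expressed as a small test. KQL cannot run outside a workspace, so the sketch below is a Python stand-in that mirrors the core logic of DE6-001 (distinct discovery commands per account and device in fixed 2-minute bins) and asserts on synthetic events. The function name, the shortened command list, and the sample accounts are all hypothetical; a real pipeline would more likely replay sample data against the actual query.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

RECON = {"whoami.exe", "net.exe", "nltest.exe", "systeminfo.exe", "ipconfig.exe"}
WINDOW = timedelta(minutes=2)
MIN_COMMANDS = 3


def detect(events):
    """Return the (account, device) pairs that would fire.

    events: iterable of (timestamp, account, device, file_name) tuples
    with timezone-aware timestamps.
    """
    buckets = defaultdict(set)
    for ts, account, device, file_name in events:
        name = file_name.lower()
        # Skip non-recon processes and computer accounts, as the rule does.
        if name not in RECON or account.endswith("$"):
            continue
        # Fixed bins aligned to the epoch, mirroring bin(TimeGenerated, 2m).
        bucket = int(ts.timestamp() // WINDOW.total_seconds())
        buckets[(account, device, bucket)].add(name)
    return {(a, d) for (a, d, _), cmds in buckets.items() if len(cmds) >= MIN_COMMANDS}


base = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

known_true = [
    (base + timedelta(seconds=10), "j.doe", "WS-01", "whoami.exe"),
    (base + timedelta(seconds=40), "j.doe", "WS-01", "nltest.exe"),
    (base + timedelta(seconds=80), "j.doe", "WS-01", "systeminfo.exe"),
]
known_benign = [
    (base + timedelta(seconds=10), "j.doe", "WS-01", "ipconfig.exe"),
    (base + timedelta(seconds=20), "WS-01$", "WS-01", "whoami.exe"),
    (base + timedelta(seconds=30), "WS-01$", "WS-01", "net.exe"),
    (base + timedelta(seconds=40), "WS-01$", "WS-01", "nltest.exe"),
]

assert detect(known_true) == {("j.doe", "WS-01")}  # fires on known-true
assert detect(known_benign) == set()               # stays quiet on known-benign
```

The stand-in also makes one property of fixed bins visible: a sequence that straddles a bin boundary is split across two buckets and may not reach the threshold, which is worth including as its own test case when choosing the window.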
Detection as code
Portal-only rule management is how the silent rule break survives: a change is made directly in the console, unreviewed and unlogged, and the rule that stops working leaves no trace of what changed or when. Detection as code puts the rules in version control, so every change is reviewed, attributable, and reversible, and a test suite can catch the break before it ships. You do not need a mature pipeline to start; you need the rules in a repository and a review step.
| Portal-only failure | What detection-as-code gives |
|---|---|
| Unreviewed changes | Pull-request review before a rule change reaches production. |
| No change history | Git history: what changed, who changed it, when, and why. |
| No rollback | Revert to the last working version when a change breaks a rule. |
| No testing | Automated validation of rule syntax and logic in the pipeline. |
| Drift between environments | The repository is the source of truth; environments deploy from it. |
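The "automated validation" row can begin with something very small. The sketch below assumes each rule in the repository declares the table and fields it depends on, and that the pipeline has a schema snapshot to compare against. The rule dictionary layout, `validate_rule`, and the renamed field `RemoteIPAddress` are hypothetical and exist only for illustration.

```python
def validate_rule(rule, schema):
    """Return a list of problems; an empty list means every field reference resolves.

    rule:   dict with "id", "table", and "required_fields" (hypothetical layout)
    schema: dict mapping table name -> set of column names (a snapshot, assumed
            to be refreshed by some other step in the pipeline)
    """
    table = rule["table"]
    if table not in schema:
        return [f"{rule['id']}: table '{table}' not in schema snapshot"]
    missing = sorted(set(rule["required_fields"]) - schema[table])
    return [f"{rule['id']}: field '{name}' not found in {table}" for name in missing]


rule = {
    "id": "DE8-001",
    "table": "DeviceLogonEvents",
    "required_fields": ["TimeGenerated", "AccountName", "DeviceName",
                        "LogonType", "RemoteIP"],
}

schema_before = {
    "DeviceLogonEvents": {"TimeGenerated", "AccountName", "DeviceName",
                          "LogonType", "RemoteIP", "RemoteDeviceName", "ActionType"},
}
# Hypothetical rename, invented to show the failure mode.
schema_after = {
    "DeviceLogonEvents": {"TimeGenerated", "AccountName", "DeviceName",
                          "LogonType", "RemoteIPAddress", "RemoteDeviceName", "ActionType"},
}

print(validate_rule(rule, schema_before))  # []
print(validate_rule(rule, schema_after))
```

Declaring required fields by hand is a deliberate simplification: it avoids parsing KQL, at the cost of relying on authors to keep the declaration in step with the query.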
A vendor update renames a field the credential-theft rule depends on. The rule keeps running, now returning zero, and the portal still shows it enabled and healthy. For weeks, the technique it covered is undetected, and nobody knows, because zero results look exactly like a quiet environment. The gap is found only when an incident that the rule should have caught is discovered another way.
The engineering fix: detection as code would have caught it. A test asserting that the rule fires on known-true data fails in the pipeline the moment the field reference breaks, before the change ships. Pair that with a monitoring rule for detections that have gone unexpectedly silent. Zero results is a state to alert on, not to trust.
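Monitoring for unexpected silence can be sketched as a comparison between each rule's last hit and the longest silence its history makes plausible. The function name, the per-rule thresholds, and the timestamps below are hypothetical; how last-hit times are collected is left as an assumption.

```python
from datetime import datetime, timedelta, timezone


def silent_rules(last_hit, expected_max_silence, now):
    """Return the IDs of rules that have been quiet longer than expected.

    last_hit:             dict of rule ID -> datetime of the most recent hit
    expected_max_silence: dict of rule ID -> timedelta considered normal
    A rule with a threshold but no recorded hit at all is also flagged.
    """
    flagged = []
    for rule_id, max_silence in expected_max_silence.items():
        last = last_hit.get(rule_id)
        if last is None or now - last > max_silence:
            flagged.append(rule_id)
    return sorted(flagged)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_hit = {
    "DE4-001": now - timedelta(days=2),
    "DE6-001": now - timedelta(days=21),
}
expected_max_silence = {
    "DE4-001": timedelta(days=7),
    "DE6-001": timedelta(days=7),
    "DE8-001": timedelta(days=14),
}

print(silent_rules(last_hit, expected_max_silence, now))  # ['DE6-001', 'DE8-001']
```

Rules that are legitimately quiet for long stretches need a generous threshold here, or a scheduled known-true test in place of silence monitoring.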
Quick lookup
| Tactic | Primary detection data source |
|---|---|
| Initial access | SigninLogs, EmailEvents, DeviceProcessEvents |
| Credential / identity | SigninLogs, AADNonInteractiveUserSignInLogs, AuditLogs |
| Persistence / execution | AuditLogs, CloudAppEvents, DeviceProcessEvents |
| Discovery / evasion | DeviceProcessEvents, AuditLogs |
| Collection / exfiltration | CloudAppEvents, EmailEvents, DeviceNetworkEvents |
| Lateral movement / impact | DeviceLogonEvents, SigninLogs, DeviceProcessEvents |
| Rule problem | Likely cause |
|---|---|
| Suddenly zero results | Schema change (field renamed) or a broken data connector. |
| Too noisy | Threshold rot or environment drift; re-baseline and tune. |
| Misses known activity | Lookback shorter than the run gap plus ingestion delay. |
| Does not correlate | Missing entity mapping; alerts do not stitch into incidents. |
From writing rules to engineering detection coverage
This cheatsheet is the craft in outline. Detection Engineering teaches the full discipline: rule architecture, threat modeling, the detection per ATT&CK tactic, the testing and tuning lifecycle, and detection-as-code from first repository to mature pipeline.