Formulating Hunt Hypotheses
Scenario
Your organization's last penetration test revealed that the red team established OAuth persistence, created inbox rules to redirect password-reset emails, and exfiltrated 4 GB of SharePoint data, all without triggering a single alert. The CISO asks you to start a threat hunting program. You open Advanced Hunting, stare at the query editor, and realize you have no idea where to begin. Not because you lack KQL skills, but because you lack a structured question to answer.
What makes a hypothesis testable
A hunt without a hypothesis is a fishing expedition. You open a query editor, write something based on a blog post you read that morning, scan the results, and move on. If nothing looks suspicious, the hunt is "done." If something does look suspicious, you investigate without a framework for deciding whether what you found is a genuine finding or noise you do not yet understand.
Structured hunting replaces that ad-hoc pattern with a prediction you can confirm or refute. A good hypothesis has four properties. Remove any one of them and the hunt degrades, either producing results you cannot interpret or consuming hours on a question you cannot answer with available data.
Specific. Name the technique, the behavior, or the indicator. "There might be threats in our environment" is not a hypothesis. "Compromised accounts are using OAuth applications with Mail.ReadWrite permissions to maintain persistent mailbox access after password resets" is. Specificity determines what you query, what you look for in the results, and how you know when the hunt is complete.
Testable. You can confirm or refute the prediction with data you actually have. A hypothesis about DNS tunneling is untestable if you do not ingest DNS query logs. A hypothesis about endpoint persistence is untestable without Defender for Endpoint telemetry. Before writing the first query, confirm that the data sources exist in your environment and have sufficient retention.
Grounded. The hypothesis comes from a credible source: threat intelligence, ATT&CK coverage gaps, prior incident findings, or environmental changes. Hypotheses invented from imagination are untethered from the threat landscape and likely to produce wasted effort. Grounding ensures the technique you are hunting for is something an attacker would actually use against an environment like yours.
Actionable. If confirmed, you know what to do next: escalate to IR, revoke sessions, disable accounts, remove persistence. If refuted, you can document the negative finding and convert the hunt query into a detection rule. A hypothesis whose confirmation leaves you asking "now what?" is not ready.
Four properties of a testable hypothesis. The formula produces predictions that can be confirmed or refuted with available data, grounded in real threat behavior.
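One way to keep the four properties honest is to record each hypothesis as a structured object and refuse to queue it until every property is filled in. The sketch below is illustrative only: the `Hypothesis` class, its field names, and the `AuditLogs` table choice are assumptions made for this example, not part of the course material. It checks only that each property has been written down, not that it is good.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A hunt hypothesis. Field names are hypothetical, chosen for this sketch."""
    statement: str
    technique: str = ""                                  # specific
    required_tables: list = field(default_factory=list)  # testable
    source: str = ""                                     # grounded
    response_if_confirmed: str = ""                      # actionable

    def missing_properties(self):
        """Return the names of the four properties that are still empty."""
        checks = [
            ("specific", self.technique),
            ("testable", self.required_tables),
            ("grounded", self.source),
            ("actionable", self.response_if_confirmed),
        ]
        return [name for name, value in checks if not value]


oauth = Hypothesis(
    statement=("Compromised accounts are using OAuth applications with "
               "Mail.ReadWrite permissions to maintain persistent mailbox "
               "access after password resets"),
    technique="OAuth application persistence",
    required_tables=["AuditLogs"],  # assumed table, for illustration only
    source="prior incident findings",
    response_if_confirmed="escalate to IR, revoke sessions, remove persistence",
)
vague = Hypothesis(statement="There might be threats in our environment")

print(oauth.missing_properties())
print(vague.missing_properties())
```

A hypothesis that returns a non-empty list is not ready to hunt. The vague statement fails all four checks because nothing beyond the sentence itself was ever written down.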
The hypothesis formula in practice
A structured format removes ambiguity. Three worked examples show how the formula translates to specific M365 environments:
Notice what the hypothesis does: it names the exact table, the exact anomaly (token refresh from an IP not in the 30-day baseline), and the exact response if confirmed. The KQL query writes itself from that statement. Compare this to "look for AiTM attacks," a vague direction that produces either nothing useful or a flood of sign-in records you cannot interpret.
Before investing hunt hours, confirm the hypothesis is testable. Run a data availability check against every table your hypothesis requires:
If EventCount returns zero, your hypothesis is not testable today. The correct response is not to skip the hypothesis. It is to enable the data source and return to the hypothesis once ingestion is confirmed.
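The decision logic can be sketched in a few lines once you have the per-table counts. The example below assumes you have already run an availability query and exported one EventCount per table. The function name `untestable_tables` and the counts shown are made up for illustration, and a table that does not appear in the results at all is treated the same as a zero count.

```python
def untestable_tables(required_tables, event_counts):
    """Return the required tables with no events in the lookback window.

    event_counts maps table name to EventCount. A table missing from the
    mapping is treated as zero, since an empty or absent table may simply
    not appear in the query results.
    """
    return [t for t in required_tables if event_counts.get(t, 0) == 0]


required = [
    "SigninLogs",
    "AADNonInteractiveUserSignInLogs",
    "AADServicePrincipalSignInLogs",
]
# Illustrative counts only, not real telemetry.
counts = {"SigninLogs": 48210, "AADNonInteractiveUserSignInLogs": 910344}

blocked = untestable_tables(required, counts)
if blocked:
    print("Enable and confirm ingestion before hunting:", ", ".join(blocked))
else:
    print("All required tables have data. The hypothesis is testable.")
```

A count above zero only proves that some data exists. Check the oldest record as well to confirm that retention covers the window your hypothesis needs.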
Six sources for hypothesis generation
You do not need to invent hypotheses from nothing. Six sources provide an ongoing pipeline of grounded hypotheses, and at least three require no external subscription or threat intelligence investment.
Source 1: ATT&CK coverage gaps. Map your detection rules to ATT&CK techniques using the coverage query from Section 0.1. Every technique with no detection rule is a candidate hypothesis. "We have no rule for T1098.003 (Additional Cloud Roles). Have any unauthorized application permission grants occurred in the last 90 days?" Coverage gaps are free, require only your own Sentinel data, and produce a backlog that sustains monthly hunting for years.
Source 2: Prior incident findings. Every incident investigation raises scope questions. An AiTM investigation found inbox rules redirecting password-reset emails. Were those the only rules, or do similar rules exist on other mailboxes across the tenant? A BEC investigation discovered anomalous SharePoint access. Was the compromised account the only one accessing those libraries anomalously? Each post-incident question becomes a hypothesis grounded in confirmed attacker behavior in your own environment.
Source 3: Threat intelligence. Microsoft publishes threat actor profiles for groups like Storm-1567, Storm-2949, and Midnight Blizzard with specific techniques and IOCs. MITRE ATT&CK v18 includes 691 detection strategies mapped to techniques. A threat advisory describing Cloudflare Workers hosting AiTM proxy pages generates an immediate hypothesis: "Have any users been redirected to Cloudflare Workers domains proxying Microsoft authentication in the last 30 days?" TI-driven hypotheses are high-quality but perishable: IOCs lose relevance as attackers cycle infrastructure.
Source 4: Environmental changes. Deploying Defender for Cloud Apps creates new telemetry. Completing an acquisition integrates a new M365 tenant. Enabling MicrosoftGraphActivityLogs surfaces API-level activity that was previously invisible. Each change creates new attack surface and new data sources simultaneously, and both generate hypotheses.
Source 5: Detection rule failures. A rule that has not fired in 12 months is either well-targeted for a rare event or broken. Test it with a hypothesis: "If the technique targeted by [rule name] occurred today, the rule would fire. Examine the data the rule queries for activity that matches the rule's logic but was missed because of threshold or exclusion configuration." This is purple-teaming your own detection rules through hunting.
Source 6: Peer and community research. CISA hunt reports, Microsoft Incident Response case studies, Mandiant M-Trends, and CrowdStrike Global Threat Reports describe techniques observed in real compromises. CISA's 2025 federal proactive hunt at a critical infrastructure organization searched across host, network, and cloud data, mapped findings to 18 MITRE ATT&CK techniques, and produced value without a confirmed breach, identifying six security gaps. Translate the techniques from published hunts into hypotheses for your own environment.
Prioritizing your backlog
Coverage gap analysis alone generates 50 to 80 candidate hypotheses. Threat intelligence adds more. Incident findings add more. A monthly cadence executes roughly 12 campaigns per year. You need a framework that selects the 12 highest-value hypotheses from a backlog of 100+.
Score each hypothesis on three dimensions, each from 1 to 3, then multiply the three scores for a composite between 1 and 27.
Threat relevance (1–3). How likely is this technique to be used against your environment? Score 3 if specific threat intelligence names threat actors targeting your sector with this technique. Score 2 if the technique is commonly used against M365 environments generally. Score 1 if the technique is documented in ATT&CK but not widely observed against M365 or your sector.
Data availability (1–3). Can you test the hypothesis with data you currently have? Score 3 if all required tables are ingested with sufficient retention. Score 2 if most data is available but one enrichment source is missing or has limited retention. Score 1 if critical data sources are not ingested, so the hypothesis is untestable until you enable the connector and confirm ingestion.
Detection gap severity (1–3). If this technique succeeds and no detection exists, how bad is the outcome? Score 3 for techniques enabling immediate high-impact outcomes: data exfiltration, financial fraud, tenant compromise. Score 2 for intermediate outcomes: persistent access, reconnaissance. Score 1 for limited outcomes: information gathering without direct escalation path.
Apply this scoring model to the NE environment. An AiTM session hijacking hypothesis scores 3 × 3 × 3 = 27: actively targeted by multiple BEC groups (relevance 3), SigninLogs and AADNonInteractiveUserSignInLogs both ingested with 90-day retention (data 3), enables immediate mailbox compromise and financial fraud (severity 3). That hypothesis goes first.
An OAuth consent phishing hypothesis scores 2 × 2 × 3 = 12 if AADServicePrincipalSignInLogs is not ingested (data 2, not 3). That is still "hunt this quarter," but only after the AiTM hypothesis and every other hypothesis scoring 18 or higher has executed.
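The arithmetic is simple enough to automate, which helps once the backlog passes a few dozen entries. A minimal sketch, assuming each dimension is an integer from 1 to 3 (the function name `composite_score` is chosen for this example):

```python
def composite_score(relevance, data, severity):
    """Multiply the three dimension scores; each must be 1, 2, or 3."""
    dimensions = (("relevance", relevance), ("data", data), ("severity", severity))
    for name, value in dimensions:
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3, got {value!r}")
    return relevance * data * severity


# The two worked examples from the text.
aitm = composite_score(relevance=3, data=3, severity=3)
oauth_consent = composite_score(relevance=2, data=2, severity=3)

print("AiTM session hijacking:", aitm)
print("OAuth consent phishing:", oauth_consent)
```

Because the scores are multiplied rather than added, a single low dimension pulls the composite down sharply. A high-relevance, high-severity hypothesis with data 1 can score no more than 9, which keeps untestable hunts out of the top of the queue.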
The backlog is a living document. After every campaign, re-score: confirmed findings may change the relevance of adjacent hypotheses. Environmental changes alter data availability. New threat intelligence shifts relevance scores. A quarterly review of the backlog takes 30 minutes and ensures you are always hunting the highest-value hypothesis next.
Managing the backlog as a living document
The backlog is not a spreadsheet you create once. Every completed campaign updates it. A confirmed finding may reveal related techniques worth hunting. Add those hypotheses and score them. A null finding reduces the urgency of the corresponding hypothesis but does not remove it; the technique may appear later if the threat landscape shifts. Environmental changes alter data availability scores immediately: enabling MicrosoftGraphActivityLogs moves every Graph API hypothesis from "data 1" to "data 3" overnight.
Review the full backlog quarterly. A 30-minute session is sufficient for a backlog of 50 to 80 hypotheses. Remove retired hypotheses where the technique is no longer relevant to your architecture. Re-score hypotheses where threat intelligence has changed the relevance dimension. Promote hypotheses where you enabled a new data source since the last review. The goal is a living prioritization that reflects current reality, not a static list from the first brainstorm.
One practical format works well: a simple table with columns for hypothesis ID, ATT&CK technique, source (which of the six sources generated it), the three dimension scores, the composite score, and a status field (queued, in progress, completed, deferred). NE maintains this as a shared OneNote page accessible to both Tom Ashworth and Priya Sharma, updated after every campaign.
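The same table can be sketched as a small data structure if you prefer to keep the backlog somewhere scriptable. Everything below is illustrative: the `BacklogEntry` class, the hypothesis IDs, the ATT&CK IDs assigned to H-001 and H-002, and the scores for H-003 are assumptions made for this example, not NE's actual backlog.

```python
from dataclasses import dataclass


@dataclass
class BacklogEntry:
    """One backlog row, using the columns described above."""
    hypothesis_id: str
    technique: str
    source: str
    relevance: int
    data: int
    severity: int
    status: str = "queued"  # queued, in progress, completed, deferred

    @property
    def composite(self):
        return self.relevance * self.data * self.severity


def next_hunt(backlog):
    """Return the highest-scoring queued entry, or None if nothing is queued."""
    queued = [entry for entry in backlog if entry.status == "queued"]
    return max(queued, key=lambda entry: entry.composite, default=None)


# Hypothetical rows for illustration.
backlog = [
    BacklogEntry("H-001", "T1557", "threat intelligence", 3, 3, 3),
    BacklogEntry("H-002", "T1528", "prior incident findings", 2, 2, 3),
    BacklogEntry("H-003", "T1098.003", "ATT&CK coverage gaps", 3, 1, 3),
]

first = next_hunt(backlog)
print("Hunt first:", first.hypothesis_id, first.composite)

# After the campaign: mark it completed, then re-score an entry whose
# data source has just been enabled (data 1 becomes data 3).
first.status = "completed"
backlog[2].data = 3

second = next_hunt(backlog)
print("Hunt next:", second.hypothesis_id, second.composite)
```

Computing the composite from the three stored scores, rather than storing it as its own column, means a re-score after an environmental change cannot leave a stale total behind.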
Without a scored backlog, hunt topic selection drifts toward whatever the analyst read most recently or whatever technique sounds the most exciting. Three months pass. Three campaigns execute. None targeted the four ATT&CK tactics with zero detection coverage. Leadership asks for the quarterly coverage improvement number and the answer is "we hunted a lot of interesting things." Interesting is not the same as valuable. A scored backlog makes the selection systematic: the analyst hunts the hypothesis scoring 27 before the one scoring 12, regardless of which one appeared in last week's blog post.
Threat Hunting Principle
A hypothesis with all four properties (specific, testable, grounded, actionable) produces either confirmed evidence of compromise or documented evidence of absence. Both outcomes have operational value. The scored backlog ensures you always produce the highest-value outcome first.