Evaluating AI Tools for Security Operations

3-4 hours · Module 1 · Free

Scenario

Your procurement team forwards three vendor proposals: Microsoft Copilot for Security bundled with your E5 licensing, Claude Team at $25 per seat per month, and a third-party "AI SOC analyst" product at $50,000 per year. Each claims to transform your security operations. Your CISO wants a recommendation by Friday. You need to evaluate all three against the same criteria so the recommendation is defensible.

The five evaluation dimensions

Every AI tool assessment covers five dimensions. Evaluating on capability alone ignores data handling. Evaluating on cost alone ignores governance readiness. The five dimensions produce a complete assessment: what the tool does, what it does with your data, how it integrates, what it costs across the real deployment timeline, and whether it satisfies your governance requirements.

Dimension 1: Capability fit

The question is not "can this tool generate KQL queries" but "can it generate KQL queries for my specific Sentinel workspace tables with the field names my environment uses." Vendor demos use curated inputs against standard schemas. Your environment has custom tables, non-standard field names, and log sources the vendor has never seen.

The evaluation method: run the same five security tasks through each tool. A triage summary of a real alert from your queue. An investigation query against your actual schema. A detection rule from a recent threat advisory. A draft incident report from a completed investigation. A PowerShell script for a routine operational task. Score each on output quality (works without modification, works with minor fixes, or requires rewriting), time to produce, and verification effort.
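One way to record the five-task test so every tool is scored the same way is sketched below. The point values attached to the three quality levels and all names (`TaskResult`, `summarize`, the task keys) are illustrative assumptions, not part of any vendor tooling.

```python
from dataclasses import dataclass

# Hypothetical point values for the three quality levels named in the text.
QUALITY_POINTS = {"works_unmodified": 2, "minor_fixes": 1, "rewrite": 0}

# The five tasks from the evaluation method (key names are illustrative).
TASKS = (
    "triage_summary",
    "investigation_query",
    "detection_rule",
    "incident_report",
    "powershell_script",
)


@dataclass
class TaskResult:
    task: str
    quality: str               # one of the QUALITY_POINTS keys
    minutes_to_produce: float
    minutes_to_verify: float


def summarize(results):
    """Aggregate one tool's results; refuses partial runs so tools stay comparable."""
    missing = set(TASKS) - {r.task for r in results}
    if missing:
        raise ValueError(f"missing tasks: {sorted(missing)}")
    return {
        "quality_points": sum(QUALITY_POINTS[r.quality] for r in results),
        "max_quality_points": max(QUALITY_POINTS.values()) * len(TASKS),
        "total_minutes": sum(r.minutes_to_produce + r.minutes_to_verify for r in results),
    }
```

Recording production time and verification time separately keeps a tool that answers quickly but needs heavy checking from looking better than it is.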

Context window size is a practical constraint for investigation work. Claude Opus 4.7 offers a 1 million token input context. Copilot for Security pulls context from your tenant directly rather than requiring manual paste. For investigations involving 50,000+ rows, the context window determines whether you analyze in one pass or segment across multiple interactions, losing cross-table context.
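To estimate whether an export fits in one pass, a rough calculation like the following can help. The four-characters-per-token ratio is a common rule of thumb, not a vendor specification, and the row width and reserved-token figures are placeholders to replace with measurements from your own logs.

```python
import math


def passes_needed(row_count, avg_chars_per_row, context_tokens,
                  reserved_tokens=0, chars_per_token=4.0):
    """Estimate how many separate interactions a log export requires.

    chars_per_token is a heuristic assumption; reserved_tokens covers the
    system prompt, instructions, and room for the response.
    """
    usable = context_tokens - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens leaves no room for data")
    total_tokens = row_count * avg_chars_per_row / chars_per_token
    return max(1, math.ceil(total_tokens / usable))


# 50,000 rows at an assumed 200 characters each, against a 1M-token window
# with 50,000 tokens held back for prompt and response.
print(passes_needed(50_000, 200, 1_000_000, reserved_tokens=50_000))
```

Under these assumed figures the export is about 2.5 million tokens, so even a 1 million token window needs three passes.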

Here is what the same investigation task produces when you evaluate it properly. Take the AiTM scenario from sections 1.1 and 1.2 and run it through Claude with and without your environment context:

Claude Exchange

Test A: No Environment Context

Prompt: "Write a KQL query to detect AiTM phishing."

// Claude response (generic, no env context)
SigninLogs
| where RiskLevelDuringSignIn == "high"
| where AuthenticationRequirement == "multiFactorAuthentication"
| where ResultType == 0

Verdict: 2/5. Valid syntax but catches every high-risk successful MFA sign-in, not AiTM specifically. Would fire hundreds of times daily. No impossible travel logic, no session anomaly detection, no cross-table correlation.

Test B: With Project System Prompt + Incident Context

Prompt: "Investigate suspicious sign-in for user_01@contoso.com from IP 192.0.2.45 at 03:14 UTC, Lagos Nigeria. User based in London, last London sign-in 17:42 UTC previous day. MFA satisfied via push. Standard M365 tables in Sentinel."

// Claude response (with context)
SigninLogs
| where UserPrincipalName == "user_01@contoso.com"
| where TimeGenerated between (ago(48h) .. now())
| project TimeGenerated, IPAddress,
    tostring(LocationDetails.city),
    ResultType, AppDisplayName, UserAgent
| sort by TimeGenerated asc

Verdict: 4/5. Targeted to the specific user, correct timeframe, correct fields, parseable location extraction. The 1-point deduction: no AADNonInteractiveUserSignInLogs query for token replay (same gap from C1.1).

What The Test Reveals

The same model with the same underlying capabilities produced a 2/5 output and a 4/5 output. The difference is context, not capability. When you evaluate a tool, you are evaluating whether it can produce 4/5 output given your workflow, not whether it can produce 2/5 output from a bare prompt. If a vendor demo shows Test A quality, the demo is meaningless. Run Test B with your actual data.

Dimension 2: Data handling

This is the dimension most teams underweight and the one that causes the most governance problems. The critical questions: is your input data used for model training? How long is it retained? Can you delete on demand?

For Claude, consumer accounts (Free, Pro, Max) retain data for 30 days by default. Users who enable the training toggle extend retention to 5 years with training use. Commercial accounts (Team, Enterprise, API) default to 30-day retention with no training under any circumstances. Enterprise customers can configure custom retention through the admin console and access Zero Data Retention for API traffic, which deletes data immediately after the response. Claude Enterprise supports HIPAA-ready API access with signed Business Associate Agreements for organizations handling protected health information.

For Copilot for Security, data stays within your Microsoft 365 tenant boundary, is processed within your geographic region, and is not used for training. For organizations with data sovereignty requirements, this in-tenant model removes the third-party transfer question entirely. For ChatGPT, consumer accounts retain conversations indefinitely unless manually deleted, and conversations are eligible for training unless the user disables it on the account.

The governance implication: if your team processes sign-in logs, alert details, or investigation evidence through AI, the platform tier determines whether that data is retained by the vendor and potentially used for training. Section 1.5 provides the data classification matrix that maps data sensitivity to minimum platform tier requirements.
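The Claude retention figures above can be encoded as a simple lookup so a policy requirement is checked consistently. This is a minimal sketch: the tier keys, `TierPolicy`, and `acceptable_tiers` are hypothetical names, and the values are transcribed from this section rather than from vendor documentation, so confirm them against current terms before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    retention_days: int
    used_for_training: bool


# Values as described in this section; key names are illustrative.
CLAUDE_TIERS = {
    "consumer_default": TierPolicy(30, False),
    "consumer_training_enabled": TierPolicy(5 * 365, True),  # "5 years", approximated in days
    "commercial_default": TierPolicy(30, False),
    "api_zero_data_retention": TierPolicy(0, False),
}


def acceptable_tiers(max_retention_days, allow_training=False):
    """Return the tiers whose retention and training behavior fit a policy."""
    return [
        name
        for name, policy in CLAUDE_TIERS.items()
        if policy.retention_days <= max_retention_days
        and (allow_training or not policy.used_for_training)
    ]
```

For example, a policy that forbids any vendor-side retention (`acceptable_tiers(0)`) leaves only the Zero Data Retention option.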

Dimension 3: Integration depth

Integration determines workflow friction. Manual copy-paste between SIEM and AI interface adds 30 to 60 seconds per interaction. Over 20 interactions per investigation, that is 10 to 20 minutes of friction a well-integrated tool eliminates.
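The friction estimate is simple arithmetic, sketched here so you can substitute your own measured timings. The function name is illustrative.

```python
def friction_minutes(interactions, seconds_per_interaction):
    """Copy-paste overhead for one investigation, in minutes."""
    return interactions * seconds_per_interaction / 60


low = friction_minutes(20, 30)   # 20 interactions at 30 seconds each
high = friction_minutes(20, 60)  # 20 interactions at 60 seconds each
print(f"{low:.0f} to {high:.0f} minutes per investigation")
```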

Copilot for Security has the deepest native integration with Microsoft products. It operates within the Defender XDR portal, pulls context from Sentinel and Defender tables directly, and correlates across the Microsoft stack without the analyst extracting and pasting data. The limitation is ecosystem lock-in: it only works within the Microsoft security stack.

Claude integrates through Projects (persistent context with system prompts and knowledge documents), Connectors (Gmail, Google Drive, GitHub, Slack), and MCP servers for deeper system integration. The Wazuh MCP Server connects Wazuh SIEM directly to Claude, enabling natural language queries against alerts, logs, and vulnerability data without manual export. The Claude Cookbook provides a threat intelligence enrichment agent pattern that queries multiple sources, cross-references findings, maps to MITRE ATT&CK, and produces structured reports for SIEM integration. Stacklok provides MCP governance with OTel-compatible telemetry that traces every tool invocation with authenticated identity, forwarded through standard pipelines to your SIEM.

For Sentinel-specific work without MCP integration, Claude operates as an external tool. You extract data, provide it in the prompt, and receive analysis. The workflow is powerful but requires the analyst to bridge between SIEM and AI manually. Evaluate integration by measuring the complete workflow end-to-end: from the moment the analyst decides to use AI through the moment the output is validated and applied.

Dimension 4: Cost analysis

Calculate total deployment cost over 12 months, not just the license fee. Include the license cost per user per month, the number of users who will actually use the tool (start with 3 to 5 high-volume analysts, not the full team), training time to proficiency (2 to 4 weeks of guided use), and ongoing administration overhead.

Hidden costs include security capacity units for Copilot for Security (billed separately from E5 licensing, metered by usage volume), API usage costs for Claude in automated workflows (token-based pricing at $5 input and $25 output per million tokens for Opus 4.7 means high-volume automation can exceed flat-rate license costs), and the compliance overhead for maintaining governance documentation and producing audit evidence. A tool that costs $25 per seat but requires 2 hours per week of administration costs more than the license suggests.
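A rough 12-month cost model makes the hidden costs visible. This sketch assumes the $5/$25 figure means input and output price per million tokens, and it uses the $75 loaded hourly rate quoted elsewhere in this section for administration time. The function and parameter names are illustrative.

```python
def api_cost_usd(input_tokens, output_tokens,
                 input_price_per_mtok=5.0, output_price_per_mtok=25.0):
    """Token-based API cost; default prices assume $5 input / $25 output per million."""
    return (input_tokens / 1_000_000 * input_price_per_mtok
            + output_tokens / 1_000_000 * output_price_per_mtok)


def annual_cost_usd(seats, seat_price_per_month, admin_hours_per_week,
                    loaded_hourly_rate, monthly_api_cost=0.0):
    """Break a 12-month cost into license, administration, and API components."""
    license_cost = seats * seat_price_per_month * 12
    admin_cost = admin_hours_per_week * loaded_hourly_rate * 52
    api_cost = monthly_api_cost * 12
    return {
        "license": license_cost,
        "admin": admin_cost,
        "api": api_cost,
        "total": license_cost + admin_cost + api_cost,
    }


# Five seats at $25 per month, with 2 hours per week of administration at $75 per hour.
print(annual_cost_usd(5, 25, 2, 75))
```

With these inputs the license is $1,500 per year while administration is $7,800, which is the point the paragraph above makes about the seat price understating real cost.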

Frame the comparison in terms of time recovered. If AI saves 15 hours per week across triage, investigation, and documentation at a loaded analyst cost of $75 per hour, that is $58,500 per year. Most AI tools cost a fraction of this recovery value. The ROI argument is strong, but only if the team actually adopts the tool. A tool that sits unused because the team was not trained or the workflow was not integrated delivers zero ROI regardless of its capability score.
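The recovery figure can be reproduced, and discounted for partial adoption, with a short calculation. The `adoption_rate` parameter is an illustrative addition that models the adoption caveat; it is not a measured value.

```python
def annual_recovered_value(hours_saved_per_week, loaded_hourly_rate,
                           adoption_rate=1.0, weeks_per_year=52):
    """Dollar value of analyst time recovered per year.

    adoption_rate (0.0 to 1.0) is a hypothetical scaling factor for how much
    of the projected saving the team actually realizes.
    """
    return hours_saved_per_week * loaded_hourly_rate * weeks_per_year * adoption_rate


full = annual_recovered_value(15, 75)                       # 15 h/week at $75/h
unused = annual_recovered_value(15, 75, adoption_rate=0.0)  # tool never adopted
print(full, unused)
```

At full adoption this gives the $58,500 quoted above. At zero adoption it gives nothing, whatever the tool's capability score.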

Dimension 5: Governance readiness

Governance readiness measures how well the tool supports your compliance requirements from section 1.3. The assessment covers audit logging (who used the tool, when, what data was processed), data retention controls (configurable to match your policy), compliance attestations (SOC 2 Type II, ISO 27001, HIPAA BAA), access management (SSO, RBAC, admin console), and acceptable use enforcement (restricting data types team members can process).

Claude Enterprise offers SAML 2.0 and OIDC single sign-on, audit log export in JSON and CSV to SIEM platforms including Splunk, Datadog, and Elastic, configurable retention, and role-based access controls. Claude Cowork activity is currently excluded from compliance mechanisms and requires OpenTelemetry routing to your own SIEM. Map the governance evaluation to your NIST AI RMF implementation from Module 7. If the tool does not support audit logging, you cannot implement the Measure function. If it does not support SSO, you cannot enforce identity governance. A tool that scores well on capability but fails on governance will not survive your compliance team's review.
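The governance check can be run as a gap analysis against the two dependencies named above. This is a sketch only: the feature keys and the `governance_gaps` helper are hypothetical, and the consequence strings simply restate the text.

```python
# Consequences stated in the text for two specific missing features.
BLOCKERS = {
    "audit_logging": "cannot implement the NIST AI RMF Measure function",
    "sso": "cannot enforce identity governance",
}

# Features to check per tool; key names are illustrative.
REQUIRED_FEATURES = (
    "audit_logging",
    "retention_controls",
    "compliance_attestations",
    "sso",
    "rbac",
    "acceptable_use_enforcement",
)


def governance_gaps(supported_features):
    """Return (feature, consequence) pairs for every required feature a tool lacks."""
    supported = set(supported_features)
    return [
        (feature, BLOCKERS.get(feature, "gap to raise with your compliance team"))
        for feature in REQUIRED_FEATURES
        if feature not in supported
    ]
```

An empty result means the tool covers every listed control. A result containing a `BLOCKERS` entry is the kind of gap that fails compliance review regardless of capability.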

Anti-Pattern

Selecting a tool based on vendor demonstrations

The Claude Exchange above shows why vendor demos are misleading. A demo using curated inputs against standard schemas produces clean output that does not predict production performance. Your evaluation uses your actual alerts, your schema, your investigation workflow. Run the five-task test from Dimension 1 with your data. That is the tool's real score.

The 2/5 versus 4/5 comparison from the exchange quantifies what most teams discover only after procurement: the tool's production quality depends entirely on how much context you provide. A team that deploys an AI tool without configuring a system prompt, without building a prompt library, and without establishing the investigation feedback loop will get 2/5 output regardless of which vendor they selected.

The five-dimension evaluation gives you the structure for that Friday recommendation. Score each vendor across capability fit, data handling, integration depth, cost, and governance readiness. The vendor with the highest overall score that also clears a minimum on every dimension is your recommendation. A vendor that scores 5/5 on capability and 1/5 on governance is not a viable candidate regardless of what the demo looked like.
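A minimal sketch of the final comparison follows: total the five 1 to 5 scores and disqualify any vendor that falls below a floor on any dimension. The floor of 2, the equal weighting, and the vendor names and scores are placeholders, not assessments of real products.

```python
DIMENSIONS = ("capability", "data_handling", "integration", "cost", "governance")


def rank_vendors(scores, floor=2):
    """Rank vendors by total score, excluding any with a dimension below the floor."""
    viable = {
        vendor: sum(dims[d] for d in DIMENSIONS)
        for vendor, dims in scores.items()
        if min(dims[d] for d in DIMENSIONS) >= floor
    }
    return sorted(viable.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores for illustration only.
example = {
    "vendor_a": {"capability": 5, "data_handling": 4, "integration": 4, "cost": 3, "governance": 1},
    "vendor_b": {"capability": 4, "data_handling": 4, "integration": 3, "cost": 4, "governance": 4},
    "vendor_c": {"capability": 3, "data_handling": 3, "integration": 3, "cost": 3, "governance": 3},
}
print(rank_vendors(example))
```

Here `vendor_a` has a total of 17 but is excluded by its governance score, so `vendor_b` ranks first.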
