Building Your AI Operations Foundation

3-4 hours · Module 1 · Free

Scenario

Three analysts used Claude for two weeks. Results are inconsistent. One writes structured prompts that produce reliable output. Another writes vague prompts and spends more time correcting output than the investigation would have taken manually. A third has good prompts but keeps them in personal notes. Your SOC lead asks: how do we scale what works, fix what does not, and measure whether AI is actually making the team faster?

Completing the investigation: iteration 3

The investigation feedback loop has run twice. Iteration 1 (C1.1) produced a triage assessment with a hallucinated field name. Iteration 2 (C1.2) corrected the error using OfficeActivity and identified token replay. Both iterations generated follow-up actions. In a real investigation, you continue until the hypothesis is confirmed or refuted. Here is iteration 3, where the analyst feeds the results from iteration 2 back to Claude:

Claude Exchange

Analyst Prompt (iteration 3)

Results from iteration 2. AADNonInteractiveUserSignInLogs shows 14 non-interactive sign-ins from 192.0.2.45 between 03:17 and 04:52 UTC, all ResultType 0, targeting Exchange Online and SharePoint Online. OfficeActivity shows one New-InboxRule at 03:19 UTC by user_01: rule name ".", moves items matching "invoice" or "payment" to RSS Feeds folder. No Set-MailboxAutoReplyConfiguration or Set-TransportRule results. This is an active AiTM compromise with BEC setup. What containment actions and what additional evidence collection?

Claude Response

Confirmed: AiTM phishing → BEC staging. The inbox rule targeting "invoice" and "payment" keywords with redirection to RSS Feeds (a folder users rarely check) is a classic BEC persistence mechanism. The attacker is positioning to intercept financial communications.

Containment (immediate): Revoke all active sessions for user_01 via Revoke-MgUserSignInSession. Reset user password. Remove the malicious inbox rule. Block IP 192.0.2.45 in Conditional Access as a named location. Disable any app consent grants made during the compromise window.

Evidence collection: Query CloudAppEvents for MailItemsAccessed operations by user_01 between 03:17 and 04:52 to determine what emails the attacker read. Check for consent grants via ApplicationConsentGranted. Search outbound email for BEC lure messages sent from user_01's account during the compromise window.

Analyst Evaluation

✓ Containment correct. Session revocation, password reset, inbox rule removal, and IP block are the right immediate actions in the right order.

✓ BEC analysis accurate. The interpretation of the inbox rule pattern is correct. Redirecting invoice-related emails to RSS Feeds is a documented BEC technique.

✗ Missing scope check. 14 non-interactive sign-ins to Exchange and SharePoint suggest the attacker may have accessed shared mailboxes or SharePoint sites beyond user_01's own resources. The analyst needs to check whether user_01 has delegate access to executive mailboxes or sensitive SharePoint sites before concluding the blast radius is limited to one account.

Three iterations. The first produced a triage with errors. The second corrected the queries and identified the attack mechanism. The third produced containment and evidence collection with one scope gap the analyst caught. Each iteration improved because Claude had more real evidence to work with. This is the investigation feedback loop as methodology: describe context, generate queries, validate against schema, execute, feed results back, iterate.

The quality difference between iteration 1 and iteration 3 is a context improvement, not a model improvement. In iteration 1, Claude had only the alert details. In iteration 3, Claude had 48 hours of sign-in history, confirmed token replay evidence, and a documented inbox rule with BEC indicators. The same model produced more accurate analysis because the analyst supplied progressively richer evidence.

The methodology formalizes this into five steps with quality gates. First, context loading: provide the incident type, initial indicators, relevant timeframe, and environmental context. A prompt saying "investigate suspicious sign-in" produces generic queries. A prompt specifying the user, IP, timestamp, MFA method, and available tables produces targeted queries. Second, generation: Claude produces output based on the context. Third, validation: apply the five-check discipline. Fourth, execution: run validated queries, paste results back. Fifth, iteration: Claude analyzes real results, identifies gaps, generates follow-up queries. A typical investigation runs 3 to 5 iterations, each taking 2 to 3 minutes of generation and validation plus query execution time.
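The five steps can be sketched as a small loop. This is a minimal illustration, not part of the course material: the names `InvestigationContext`, `build_prompt`, and `run_iteration` are hypothetical, and `generate`, `validate`, and `execute` stand in for whatever the analyst actually does (sending the prompt to Claude, applying the five-check discipline, running the query in the SIEM).

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationContext:
    """Step 1 inputs: what the analyst loads before asking for queries."""
    incident_type: str
    indicators: list[str]
    timeframe: str
    environment: str
    # Results pasted back after each validated, executed iteration.
    evidence: list[str] = field(default_factory=list)


def build_prompt(ctx: InvestigationContext) -> str:
    """Render the context plus all evidence gathered so far into one prompt."""
    lines = [
        f"Incident type: {ctx.incident_type}",
        f"Indicators: {', '.join(ctx.indicators)}",
        f"Timeframe: {ctx.timeframe}",
        f"Environment: {ctx.environment}",
    ]
    for i, result in enumerate(ctx.evidence, start=1):
        lines.append(f"Results from iteration {i}: {result}")
    return "\n".join(lines)


def run_iteration(ctx, generate, validate, execute):
    """One pass through steps 2 to 5. Returns (output, passed_validation)."""
    output = generate(build_prompt(ctx))   # step 2: generation
    if not validate(output):               # step 3: validation is the quality gate
        return output, False               # unvalidated output is never executed
    ctx.evidence.append(execute(output))   # steps 4 and 5: execute, feed results back
    return output, True
```

The design point is that evidence only grows through the validation gate: an output that fails validation adds nothing to the next prompt, which mirrors how iteration 1's hallucinated field name had to be corrected before iteration 2 could build on it.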

The division of labor matters. Claude generates and analyzes. The analyst decides and acts. Claude cannot decide whether this AiTM compromise warrants immediate executive notification or whether the inbox rule pattern represents financial fraud preparation that requires legal involvement. Those decisions require organizational knowledge and professional judgment. The feedback loop produces the evidence. The analyst makes the call.

Prompt library architecture

The prompt library solves the consistency problem. Instead of each analyst developing their own patterns, the team maintains a shared library of tested prompts organized by function. The architecture uses your Claude Project as the repository:

Prompt Pattern

Prompt Library Structure

Triage/ Alert summary, enrichment queries, severity assessment

Investigation/ AiTM, BEC, endpoint, insider threat, ransomware, cloud

Detection/ Advisory-to-rule, rule specification, tuning

Documentation/ Technical report, exec summary, regulatory notification

Automation/ Evidence collection, scheduled hunting, remediation

Each prompt follows a standard format: purpose, required inputs, template text, and validation checklist. Modules 2 through 6 populate each category with tested templates. The maintenance discipline matters as much as the initial build. When an analyst discovers a prompt that produces better results, they contribute it with a note explaining the improvement. When a prompt fails due to a schema change or model update, the team updates the template. The library is a living artifact that improves with use.
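One way to picture the standard format is as a small record with the four fields named above. The sketch below is an assumption about how a team might represent an entry, not the course's own schema: the class name `PromptTemplate`, the `category` field, and the example entry are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    category: str                          # library folder, e.g. "Triage"
    purpose: str
    required_inputs: tuple[str, ...]
    template_text: str                     # {placeholders} match required_inputs
    validation_checklist: tuple[str, ...]

    def render(self, **inputs: str) -> str:
        """Fill the template, refusing to run if a required input is absent or empty."""
        missing = [name for name in self.required_inputs if not inputs.get(name)]
        if missing:
            raise ValueError(f"missing required inputs: {', '.join(missing)}")
        return self.template_text.format(**inputs)


# Hypothetical example entry; the wording is illustrative only.
alert_summary = PromptTemplate(
    category="Triage",
    purpose="Summarize an alert for first-pass triage",
    required_inputs=("alert_name", "user", "source_ip"),
    template_text="Summarize alert '{alert_name}' for {user} from {source_ip}.",
    validation_checklist=("Table and field names exist in our schema",),
)
```

Failing loudly on a missing input targets the vague-prompt problem from the scenario: a template cannot be sent with its required context left blank.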

Measurement framework

Without measurement, you rely on subjective impressions that tend toward optimism. An analyst who saves 20 minutes on one investigation and spends 30 minutes fighting a hallucinated table name on the next reports that AI is "really helpful" because the positive experience is more memorable. Measurement converts impressions into data.

Establish a baseline before deploying AI for operational work. For one week, record time-to-resolution for every investigation, the number of queries written per investigation, and total analyst-hours spent on triage, investigation, detection engineering, and documentation. This baseline is your comparison point. Without it, any improvement claim is anecdotal.

Track three metrics weekly after deployment. First, time-to-resolution: how long does each investigation take from alert to closure? Compare the pre-AI baseline against the AI-assisted average. The capabilities matrix in section 1.2 predicted 30 to 45 minutes saved per investigation. Measurement confirms or contradicts that prediction. If investigations are not getting faster, the problem is usually context loading (analysts providing insufficient information) or verification overhead (prompts producing low-quality output that requires extensive correction).

Second, verification overhead: how much time does each analyst spend validating AI output? If verification time exceeds the time saved by generation, AI is adding friction. This metric also identifies analysts who need training. An analyst whose verification time is consistently high may be over-verifying (checking things that do not need checking) or under-prompting (providing insufficient context, which produces lower-quality output).

Third, output quality: what percentage of AI-generated outputs is usable without modification, usable with minor corrections, or in need of substantial rewriting? Track by function. You may find that triage summaries are usable 90% of the time while detection rules require correction 60% of the time. The lowest-scoring functions get the most attention in prompt refinement.
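The three metrics reduce to simple arithmetic once the weekly records exist. The following is a minimal sketch under stated assumptions: times are recorded in minutes, each AI output is logged as a (function, rating) pair, and the function names and rating labels are hypothetical choices rather than anything the course prescribes.

```python
from collections import Counter
from statistics import mean

# Hypothetical labels for the three quality buckets described above.
RATINGS = ("usable", "minor_corrections", "substantial_rewrite")


def minutes_saved(baseline_ttr, assisted_ttr):
    """Metric 1: mean baseline time-to-resolution minus mean AI-assisted, in minutes."""
    return mean(baseline_ttr) - mean(assisted_ttr)


def adds_friction(generation_minutes_saved, verification_minutes):
    """Metric 2: True when validating the output costs more than generation saved."""
    return verification_minutes > generation_minutes_saved


def quality_by_function(records):
    """Metric 3: percentage of outputs in each rating bucket, per function.

    records is an iterable of (function, rating) pairs, one per AI output.
    """
    counts = {}
    for function, rating in records:
        if rating not in RATINGS:
            raise ValueError(f"unknown rating: {rating!r}")
        counts.setdefault(function, Counter())[rating] += 1
    return {
        function: {r: 100 * tally[r] / sum(tally.values()) for r in RATINGS}
        for function, tally in counts.items()
    }
```

Sorting the result of `quality_by_function` by its "usable" percentage gives the order in which functions should receive prompt refinement.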

Report monthly to your SOC lead and CISO. The data makes the case for continued investment, identifies where training is needed, and provides the evidence base for expanding AI adoption to additional functions.

Operational readiness checklist

Before Module 2, verify each item. Your Claude Project has a system prompt with your organization name, SIEM type, EDR platform, identity provider, query language, and formatting preferences. Test by asking Claude to generate a simple KQL query. If the output uses correct table names for your SIEM, the prompt is working.

The data classification matrix from section 1.5 is documented and communicated to every team member. Each analyst knows which tier applies to their daily data. The matrix should be a knowledge document in your Project and printed for reference.

Your shadow AI detection query is deployed as a scheduled rule or hunting query. You have baseline data showing current AI usage patterns.

The prompt library structure exists with the five categories above (empty for now; Modules 2 through 6 fill them). Each analyst has completed the five-check validation exercise from C0.2 and can demonstrate the discipline on a sample output.
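The system prompt test in the first checklist item (ask for a simple KQL query, then check the table name) can be partly automated. The sketch below is deliberately naive and rests on a stated assumption: it only handles simple queries whose first non-comment line begins with the table name, so queries that open with `let` statements or other constructs need manual review. The function names are hypothetical, and the set of known tables must come from your own SIEM schema.

```python
import re


def leading_table(query: str):
    """Return the identifier that starts the first non-blank, non-comment line."""
    for line in query.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue  # skip blank lines and KQL line comments
        match = re.match(r"[A-Za-z_][A-Za-z0-9_]*", stripped)
        return match.group(0) if match else None
    return None


def uses_known_table(query: str, known_tables: set[str]) -> bool:
    """True when the query's leading table is in the analyst-supplied schema set."""
    table = leading_table(query)
    return table is not None and table in known_tables
```

A check like this catches a hallucinated table name before the query is run, but it says nothing about field names, which still require the manual five-check validation.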

Anti-Pattern

Measuring AI adoption by license count instead of operational impact

A CISO reports 12 licenses purchased, all activated. What the report misses: 3 analysts use AI daily with 40% time reduction, 5 use it occasionally with no workflow change, 4 have never opened the tool. License count measures spend. Time-to-resolution, verification overhead, and output quality measure impact. Report impact.

The measurement framework catches this gap because it tracks operational outcomes, not procurement milestones. When your monthly report shows 3 active users out of 12 licenses, the response is targeted training for the 9 who are not using the tool effectively, not a claim that AI adoption is complete because the licenses are activated.

You now have the complete foundation for operational AI adoption. The feedback loop gives you the methodology. The prompt library gives you consistency. The measurement framework gives you evidence. The readiness checklist confirms you are ready. Module 2 applies all of it to your first operational function: AI-assisted alert triage.
