Data Handling, Privacy, and Operational Security

3-4 hours · Module 1 · Free

Scenario

An analyst on your team pastes 200 rows of SigninLogs into Claude Free to investigate a suspicious sign-in. The log data contains UserPrincipalNames, IP addresses, device identifiers, and location data. The analyst has training enabled. The data is now retained for up to five years and eligible for model training. Your Data Protection Officer asks whether this constitutes a GDPR personal data transfer without a legal basis. You need to answer that question, and you need a system that prevents it from happening again.

The data classification matrix

Not all security data carries the same sensitivity. The matrix classifies common data types by sensitivity level and maps each to the minimum AI platform tier required for processing. This is the artifact you deploy to your team as operational policy.

The classification is based on two factors: whether the data can identify a real individual or organization (the privacy dimension), and whether the data reveals your security posture to an adversary (the operational security dimension).

Tier 1: Safe for any platform. Publicly available threat intelligence (ATT&CK descriptions, published CVEs, vendor advisories). Generic query templates with placeholder values. Framework and compliance references. Questions about KQL syntax or investigation methodology that do not reference your environment.

Tier 2: Requires commercial plan (no training). Alert data with identifiers redacted or pseudonymized. Anonymized log samples where UPNs are replaced with user_01@contoso.com, IPs with RFC 5737 documentation ranges (192.0.2.x), and device names with generic identifiers like WORKSTATION-01. Detection rule logic without workspace IDs, subscription IDs, or custom table names. Incident report drafts with organizational identifiers redacted. Policy and procedure documents that do not contain classified information. Security architecture concepts without specific IP ranges or service names. This is where the sanitized AiTM investigation data from the Claude Exchange below falls: all identifying information replaced, analytical value preserved.

Tier 3: Requires Enterprise with contractual protections. Raw sign-in logs with real UPNs and IP addresses. Investigation evidence containing employee identifiers, device names, and activity details. Incident details with real organizational data including affected systems, user accounts, and business impact. Compliance audit findings referencing specific control implementations and gaps. Security architecture documentation with network topology, IP ranges, firewall rule sets, and service configurations. Email content from phishing investigations. Any data constituting personal data under GDPR Article 4(1) or personal information under CCPA Section 1798.140. This tier is where most unsanitized investigation data falls, and it is why the default workflow should always be sanitization before processing.

Tier 4: Never process through external AI. Credentials, API keys, tokens, certificates. Active incident data during an ongoing breach. Legal privilege communications. Government-classified data. Customer data under NDAs prohibiting third-party processing. Memory dumps that may contain cleartext credentials.
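One way to make the matrix enforceable rather than advisory is to encode it as data that a script or internal tool can check. The sketch below is a minimal illustration, not a complete policy: the plan labels (`consumer`, `commercial`, `enterprise`) and the data type keys are hypothetical names chosen for this example, and only a handful of the data types listed above are included.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 1       # Tier 1: safe for any platform
    SANITIZED = 2    # Tier 2: requires commercial plan (no training)
    RAW = 3          # Tier 3: requires Enterprise with contractual protections
    PROHIBITED = 4   # Tier 4: never process through external AI


# Highest tier each plan may receive. Plan names are illustrative labels.
PLAN_MAX_TIER = {
    "consumer": Tier.PUBLIC,
    "commercial": Tier.SANITIZED,
    "enterprise": Tier.RAW,
}

# A few data types from the matrix; the keys are hypothetical identifiers.
DATA_TYPE_TIER = {
    "published_cve": Tier.PUBLIC,
    "generic_query_template": Tier.PUBLIC,
    "pseudonymized_log_sample": Tier.SANITIZED,
    "redacted_incident_report": Tier.SANITIZED,
    "raw_signin_logs": Tier.RAW,
    "phishing_email_content": Tier.RAW,
    "credentials": Tier.PROHIBITED,
    "memory_dump": Tier.PROHIBITED,
}


def is_permitted(data_type: str, plan: str) -> bool:
    """Return True if the plan meets the minimum tier for the data type."""
    # Unknown data types fail closed: treat them as Tier 4.
    tier = DATA_TYPE_TIER.get(data_type, Tier.PROHIBITED)
    max_tier = PLAN_MAX_TIER.get(plan)
    if max_tier is None or tier == Tier.PROHIBITED:
        return False
    return tier <= max_tier
```

For example, `is_permitted("raw_signin_logs", "consumer")` returns `False`, while `is_permitted("raw_signin_logs", "enterprise")` returns `True`. Failing closed on unknown data types is a deliberate choice here: anything the matrix does not list is treated as prohibited until someone classifies it.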

Sanitization: the zero-cost control that eliminates an entire risk category

The practical workflow: sanitize by default, regardless of platform tier. Replace real identifiers with synthetic ones before sending data to an AI platform. Claude's analytical capability is identical whether the data uses real or synthetic identifiers, because the analysis depends on patterns, relationships, and behavioral sequences, not on the specific string value of an identity.

The sanitization workflow has three steps. First, identify sensitive fields. For SigninLogs: UserPrincipalName, IPAddress, DeviceDetail, LocationDetails. For CloudAppEvents: AccountDisplayName, AccountObjectId, IPAddress. Second, create a mapping table converting each real value to a synthetic equivalent. Map every real value to exactly one synthetic value so that relationships across rows (the same user, the same IP) survive substitution. Keep this table in a local document for reverse substitution. Third, substitute before pasting.
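The three steps can be sketched in Python. This is a minimal illustration under stated assumptions: the `Sanitizer` class and its method names are hypothetical, it only handles UPN-like strings and IPv4 addresses found by simple regular expressions, and it does not cover device names, location fields, or IPv6. The synthetic formats follow the ones used in Tier 2 (`user_01@contoso.com`, `192.0.2.x`).

```python
import re

# Simple patterns for this sketch; real logs may need stricter matching.
UPN_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class Sanitizer:
    """Replaces UPNs and IPv4 addresses and keeps the mapping locally."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}  # real value -> synthetic value
        self._users = 0
        self._ips = 0

    def _upn(self, match: re.Match) -> str:
        real = match.group(0).lower()  # UPNs are treated as case-insensitive
        if real not in self.mapping:
            self._users += 1
            self.mapping[real] = f"user_{self._users:02d}@contoso.com"
        return self.mapping[real]

    def _ip(self, match: re.Match) -> str:
        real = match.group(0)
        if real not in self.mapping:
            self._ips += 1
            if self._ips > 254:
                raise ValueError("more than 254 distinct IPs in one session")
            self.mapping[real] = f"192.0.2.{self._ips}"
        return self.mapping[real]

    def sanitize(self, text: str) -> str:
        """Step three: substitute before the text leaves the workstation."""
        text = UPN_RE.sub(self._upn, text)
        return IPV4_RE.sub(self._ip, text)

    def restore(self, text: str) -> str:
        """Reverse substitution, e.g. on the AI response."""
        reverse = {syn: real for real, syn in self.mapping.items()}
        if not reverse:
            return text
        # Longest first so 192.0.2.10 is not matched as 192.0.2.1 plus "0".
        ordered = sorted(reverse, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(s) for s in ordered))
        return pattern.sub(lambda m: reverse[m.group(0)], text)
```

Because the mapping lives only in the `Sanitizer` instance, the same real value always receives the same synthetic value within a session, and `restore` can translate the response back for the incident record without the real identifiers ever leaving the analyst's machine.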

Here is what sanitization looks like in the AiTM investigation from the previous sections, and why Claude's analysis is identical on both versions:

Claude Exchange

Sanitized Prompt (safe for Tier 2)

Investigate these SigninLogs results. All identifiers are synthetic. user_01 signed in from 192.0.2.10 (London) at 17:42 UTC, ResultType 0, App "Microsoft Office 365 Portal." user_01 signed in from 192.0.2.45 (Lagos) at 03:14 UTC, ResultType 0, App "Microsoft Office 365 Portal," MFA via push. user_01 signed in from 192.0.2.45 (Lagos) at 03:17 UTC, ResultType 0, App "Microsoft Office 365 Exchange Online." Analyze the sign-in pattern and identify the attack technique.

Claude Response

Assessment: AiTM phishing with session cookie theft.

The pattern is consistent with adversary-in-the-middle phishing. The London-to-Lagos transition in 9.5 hours with MFA satisfied via push notification suggests the attacker intercepted the authentication session through a reverse proxy. The 3-minute gap between two Lagos sign-ins targeting different applications (Portal, then Exchange Online) indicates automated token replay: the attacker captured the session cookie and replayed it against Exchange for mailbox access.

Immediate actions: check AADNonInteractiveUserSignInLogs for non-interactive sign-ins from 192.0.2.45, check OfficeActivity for inbox rule creation by user_01 after 03:14, and revoke all active sessions for user_01.

What This Proves

The analysis is identical whether the UPN is j.smith@northgate.com or user_01@contoso.com. Claude detected the attack technique, identified the session cookie theft mechanism, and recommended the correct next steps. The sign-in times, location sequence, result types, and application targets carry the analytical signal. The identifiers are labels for the analyst's reference. Sanitization cost nothing and eliminated the entire GDPR transfer question.

Shadow AI: the governance gap you probably already have

Shadow AI is unauthorized AI tool usage by team members without organizational knowledge or data handling controls. If your analysts use personal Claude or ChatGPT accounts for investigation queries, your organization has shadow AI regardless of whether you have adopted AI formally.

The risk is not that analysts are using AI. It is that they are using AI without controls. An analyst on Claude Team with a configured Project, no-training guarantees, and the data classification matrix is using AI safely. The same analyst on Claude Free with training enabled is creating a data governance incident with every prompt.

The detection approach uses your existing telemetry:

// Shadow AI detection: AI platform access from corporate network
let ai_domains = dynamic([
    "claude.ai", "api.anthropic.com",
    "chat.openai.com", "chatgpt.com", "api.openai.com",
    "gemini.google.com", "copilot.microsoft.com",
    "perplexity.ai", "poe.com"
]);
CommonSecurityLog
| where TimeGenerated > ago(24h)
// Keep only traffic the proxy or firewall allowed through
| where DeviceAction != "Blocked"
// Pull the host portion out of the requested URL
| extend domain = extract("https?://([^/]+)", 1, RequestURL)
| where domain has_any (ai_domains)
// One row per user and source IP, with volume and time window
| summarize AccessCount = count(),
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated),
    Domains = make_set(domain, 20)
    by SourceUserName, SourceIP
// Ignore incidental visits; tune this threshold to your environment
| where AccessCount > 5
| sort by AccessCount desc

Adapt this to your log source. If you use Defender for Cloud Apps, the McasShadowItReporting table provides app discovery data with AI services categorized by risk score. Policy enforcement follows detection: provide an approved tool with appropriate controls and train the team on the classification matrix. Blocking AI access entirely pushes usage to personal devices where you have zero visibility.
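If your proxy logs are not in a KQL-queryable workspace, the same logic can be approximated over exported rows. The sketch below assumes each row is a dictionary with the hypothetical keys `user`, `source_ip`, `url`, and `action`; map these to whatever your export actually calls them. The domain list is illustrative, not exhaustive.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Illustrative list; extend it with the services relevant to your environment.
AI_DOMAINS = {
    "claude.ai", "api.anthropic.com",
    "chat.openai.com", "chatgpt.com", "api.openai.com",
    "gemini.google.com", "copilot.microsoft.com",
    "perplexity.ai", "poe.com",
}


def host_matches(host: str) -> bool:
    """True if host is a listed domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in AI_DOMAINS)


def find_shadow_ai(rows, threshold: int = 5) -> dict:
    """Group allowed AI-platform requests by (user, source_ip).

    Returns only groups with more than `threshold` requests,
    ordered by request count, highest first.
    """
    hits = defaultdict(lambda: {"count": 0, "domains": set()})
    for row in rows:
        if row["action"] == "Blocked":
            continue  # blocked requests never reached the platform
        host = (urlsplit(row["url"]).hostname or "").lower()
        if not host_matches(host):
            continue
        entry = hits[(row["user"], row["source_ip"])]
        entry["count"] += 1
        entry["domains"].add(host)
    flagged = {k: v for k, v in hits.items() if v["count"] > threshold}
    return dict(sorted(flagged.items(),
                       key=lambda kv: kv[1]["count"], reverse=True))
```

Matching on the exact host or a dot-prefixed suffix avoids flagging unrelated domains that merely contain one of the listed names as a substring.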

Regulatory considerations

AI data processing intersects with data protection regulations at the point where personal data enters the AI platform. The legal analysis follows the same framework as any third-party data processor assessment, but AI-specific characteristics create nuances security teams must understand.

Under GDPR, personal data processed through AI constitutes a transfer to a third-party processor. You need a legal basis (typically legitimate interest for security operations under Article 6(1)(f)), a Data Processing Agreement meeting Article 28 requirements, and transfer safeguards for US-based vendors such as Standard Contractual Clauses. Both Anthropic and OpenAI offer DPAs for commercial plans. Consumer plans do not include DPA protections. If your team processes EU personal data through a consumer AI account without a DPA, you have a compliance gap that is likely to surface as a finding in a GDPR audit.

The AI-specific complication is the training toggle. Under GDPR, using personal data for model training constitutes a new processing purpose requiring its own legal basis. Consumer plans that train on user input create a secondary processing purpose that is difficult to justify under legitimate interest for security operations. This is why the classification matrix draws a hard boundary: personal data on commercial plans only, where training is contractually prohibited.

Under CCPA, the AI vendor becomes a "service provider" requiring contractual commitments about data use and restrictions on selling or sharing personal information. Consumer plans that use data for training are inconsistent with the service provider designation because training constitutes a use beyond the services provided.

For sector-specific regulations, the requirements are stricter. HIPAA requires a Business Associate Agreement before any protected health information enters the AI platform. Claude Enterprise offers HIPAA-ready API access with signed BAAs; consumer and Team plans do not. PCI DSS requires that cardholder data never be processed through external AI regardless of plan tier. Financial services regulations (SOX, GLBA) impose data handling requirements that consumer AI plans cannot satisfy, and regulators increasingly expect documented AI governance programs for any AI tools used in regulated activities.

The operational conclusion is consistent across all frameworks: commercial plans with contractual protections are the minimum for data that could identify individuals. Consumer plans are appropriate only for Tier 1 data. The data classification matrix enforces this boundary. The shadow AI query identifies where it is being violated.

Anti-Pattern

Sanitizing retroactively instead of proactively

An analyst pastes raw logs into Claude, realizes they contain personal data, and deletes the conversation. The data has already been transmitted and retained in backend logs for up to 30 days (or 5 years with training enabled). Deleting the conversation removes it from the chat interface, not from backend systems. Sanitization must happen before data enters the platform. After transmission, the damage is done.

The retroactive sanitization failure illustrates why the data classification matrix exists as a preventive control rather than a detective one. By the time you detect that personal data was sent to an AI platform, the compliance event has already occurred. The matrix prevents the event by defining what goes where before the analyst opens the chat window.
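A preventive control can be as simple as a local check that runs before anything is pasted. The sketch below is one possible shape for such a check, not a complete data loss prevention tool: the function name and the list of allowed synthetic email domains are assumptions for this example, and it only looks for email-style identifiers and IPv4 addresses. It treats the RFC 5737 documentation ranges as safe and flags everything else.

```python
import ipaddress
import re

# RFC 5737 documentation ranges, safe to use as synthetic IPs.
DOC_NETS = [
    ipaddress.ip_network(n)
    for n in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]
# Domains accepted as synthetic in this sketch (an assumption; adjust locally).
SAFE_EMAIL_DOMAINS = {"contoso.com", "example.com"}

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def unsanitized_findings(text: str) -> list[str]:
    """Return identifiers that look real; an empty list means none found."""
    findings = []
    for m in EMAIL_RE.finditer(text):
        if m.group(1).lower() not in SAFE_EMAIL_DOMAINS:
            findings.append(f"email: {m.group(0)}")
    for m in IPV4_RE.finditer(text):
        try:
            addr = ipaddress.ip_address(m.group(0))
        except ValueError:
            continue  # digits that are not a valid IPv4 address
        if not any(addr in net for net in DOC_NETS):
            findings.append(f"ip: {m.group(0)}")
    return findings
```

An empty result does not prove the text is safe, since device names, free-text fields, and credentials are outside what this check inspects. It is a last gate that catches the most common slip, run locally so that the findings themselves never leave the workstation.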

The data classification matrix, the sanitization workflow, the shadow AI query, and the regulatory assessment give you a complete data governance posture for AI-assisted security operations. Print the matrix. Deploy the query. Establish the habit: sanitize first, paste second, verify third.
