AWS Control Plane: Why CloudTrail Is the Crime Scene in Cloud IR

Module 0

An EC2 instance was at the center of an incident last night. By the time you get access, it has been terminated. On-premises, this is where you reach for the disk image and the memory capture. In AWS, the disk is a volume that was deleted with the instance, and there was never a memory capture to take. The machine is gone. So what is left to investigate?

The answer is the record of how the instance was created, what role it carried, what that role did, and how it was destroyed. None of that lives on the instance. It lives in the log of the API calls that built and ran it. In AWS, that log is the investigation.

Everything is an API call

AWS has no back door. Every action against the platform goes through the AWS API, without exception. When an engineer clicks a button in the console, the console calls the API on their behalf. When a script runs aws ec2 run-instances, the CLI calls the API. When Terraform applies a plan, it calls the API. When a Lambda function assumes a role and reads a secret, it calls the API. There is no privileged path that skips the API, which means there is a single chokepoint where everything an attacker does has to pass.

CloudTrail sits at that chokepoint. It records the management actions taken in the account: who made the call, which API they invoked, when, from which IP address, with which credentials, and whether it succeeded or failed. This is the control plane, the layer where resources are created, configured, and destroyed. An attacker who creates a backdoor user, attaches an administrator policy, opens a security group to the internet, or disables logging is operating on the control plane, and every one of those actions is an API call CloudTrail writes down.

This changes what an investigation is. You are not recovering deleted files or carving memory. You are reading a structured, append-only record of actions and reconstructing intent from it. The skill is knowing which calls matter, how they chain together, and what a normal version of each one looks like so you can spot the abnormal.

One distinction helps you cut through volume from the start. Most API calls in any account are read-only: Describe, List, and Get actions that look at the environment without changing it. CloudTrail marks these with a readOnly flag set to true. The calls that change something, creating a user, attaching a policy, deleting a trail, are the mutating ones, and they are where impact lives.

A fast first pass on a noisy account is to set the read-only calls aside and look at what was actually changed, then come back to the reads, because a burst of read-only enumeration is itself a signal once you have the mutations in view. Knowing that the flag exists, and that the naming convention tells you an action's nature before you even check it, makes a wall of events tractable.
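The first pass described above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the helper names are hypothetical, the events are assumed to be already parsed into dictionaries, and the prefix fallback is only a heuristic for records where the readOnly flag is missing.

```python
# Hypothetical helpers: split CloudTrail records into mutating and read-only sets.
READ_ONLY_PREFIXES = ("Describe", "List", "Get")  # naming convention from the text


def is_read_only(event: dict) -> bool:
    """Prefer the explicit readOnly flag; fall back to the event name prefix."""
    flag = event.get("readOnly")
    if isinstance(flag, bool):
        return flag
    if isinstance(flag, str):  # assumption: some exports serialize the flag as text
        return flag.lower() == "true"
    return event.get("eventName", "").startswith(READ_ONLY_PREFIXES)


def split_by_impact(events: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (mutating, read_only), each in the original order."""
    mutating, read_only = [], []
    for event in events:
        (read_only if is_read_only(event) else mutating).append(event)
    return mutating, read_only


sample = [
    {"eventName": "DescribeInstances", "readOnly": True},
    {"eventName": "CreateUser", "readOnly": False},
    {"eventName": "ListUsers"},    # flag missing: the prefix decides
    {"eventName": "DeleteTrail"},  # flag missing: treated as mutating
]
mutating, read_only = split_by_impact(sample)
print([e["eventName"] for e in mutating])   # ['CreateUser', 'DeleteTrail']
print([e["eventName"] for e in read_only])  # ['DescribeInstances', 'ListUsers']
```

Keeping the read-only list rather than discarding it matters: it is the set you return to when looking for enumeration bursts.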

Reading a CloudTrail record

Here is a single CloudTrail event. It records a routine change: a security architect at Northgate opened a port on a security group. Read it before the explanation.

CloudTrail Management Event

{
  "eventTime": "2026-05-02T09:14:22Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "AuthorizeSecurityGroupIngress",
  "awsRegion": "eu-west-2",
  "sourceIPAddress": "10.20.4.17",
  "userAgent": "aws-cli/2.15.0 Python/3.11.6 Linux/6.1",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::333333333333:user/m.webb",
    "accountId": "333333333333",
    "accessKeyId": "AKIAEXAMPLEWEBB0001",
    "userName": "m.webb"
  },
  "requestParameters": {
    "groupId": "sg-0prodweb111",
    "ipPermissions": { "fromPort": 443, "toPort": 443, "ipProtocol": "tcp" }
  },
  "responseElements": { "_return": true },
  "eventType": "AwsApiCall",
  "recipientAccountId": "333333333333"
}

Five fields carry the investigation. userIdentity is who: the IAM user m.webb, using a specific access key, in the production account. eventName with eventSource is what: an AuthorizeSecurityGroupIngress call to EC2, which adds an inbound rule to a security group. eventTime is when. sourceIPAddress is from where: 10.20.4.17, an address inside Northgate's production network. responseElements is the result: the call returned true, so it succeeded. In one record you can state that m.webb opened TCP 443 on the production web security group from inside the production network at 09:14 UTC, and it worked.
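As a rough sketch, the same reading can be done in code. The function below is hypothetical and the record is a trimmed copy of the one above. One assumption is labeled in the code: it treats the absence of an errorCode field as success, which is a simplification of checking responseElements for each API.

```python
# Trimmed copy of the record shown above.
record = {
    "eventTime": "2026-05-02T09:14:22Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "AuthorizeSecurityGroupIngress",
    "sourceIPAddress": "10.20.4.17",
    "userIdentity": {
        "type": "IAMUser",
        "arn": "arn:aws:iam::333333333333:user/m.webb",
    },
    "responseElements": {"_return": True},
}


def summarize(event: dict) -> dict:
    """Reduce one CloudTrail record to who, what, when, where, and result."""
    identity = event.get("userIdentity", {})
    return {
        "who": identity.get("arn", identity.get("type", "unknown")),
        "what": f'{event.get("eventSource", "?")}:{event.get("eventName", "?")}',
        "when": event.get("eventTime"),
        "where": event.get("sourceIPAddress"),
        # Assumption: a record without errorCode describes a call that succeeded.
        "result": event.get("errorCode", "success"),
    }


for field, value in summarize(record).items():
    print(f"{field:>6}: {value}")
```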

A sixth field, userAgent, is quieter but often decisive. It records the tool that made the call. Here it is aws-cli/2.15.0, which tells you m.webb ran this from a command line, not the console. That is unremarkable for an architect, but the same field becomes a signal when the identity and the tool do not match the person's habits. A user who only ever works in the console suddenly making calls through the Python SDK, or a burst of activity tagged with a Terraform user agent at an hour when no deployment was scheduled, is the kind of mismatch that turns an ordinary-looking call into a lead.

The record tells you what was done. The user agent tells you how it was done, and whether the how fits the who.
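One way to check whether the how fits the who is to compare each identity's recent user agents against the tools it has used before. The sketch below is illustrative only: the function names are hypothetical, the baseline window is whatever history you choose to feed it, and reducing a user agent to its leading product token is a deliberately crude normalization.

```python
from collections import defaultdict


def agent_family(user_agent: str) -> str:
    """Reduce a user agent to its leading product token, e.g. 'aws-cli'."""
    if not user_agent:
        return "unknown"
    first_token = user_agent.split(" ", 1)[0]
    return first_token.split("/", 1)[0]


def new_agent_events(baseline: list[dict], recent: list[dict]) -> list[dict]:
    """Return recent events whose tool family the identity has not used before."""
    seen = defaultdict(set)
    for event in baseline:
        arn = event["userIdentity"]["arn"]
        seen[arn].add(agent_family(event.get("userAgent", "")))
    # An identity with no baseline has an empty set, so all its events are flagged.
    return [
        event
        for event in recent
        if agent_family(event.get("userAgent", "")) not in seen[event["userIdentity"]["arn"]]
    ]


webb = {"arn": "arn:aws:iam::333333333333:user/m.webb"}
baseline = [{"userIdentity": webb, "userAgent": "aws-cli/2.15.0 Python/3.11.6 Linux/6.1"}]
recent = [
    {"userIdentity": webb, "userAgent": "aws-cli/2.15.0 Python/3.11.6 Linux/6.1"},
    {"userIdentity": webb, "userAgent": "Boto3/1.34.0 Python/3.11.6"},  # hypothetical
]
print([e["userAgent"] for e in new_agent_events(baseline, recent)])
```

A flagged event is a lead, not a verdict. People do change tools, so the result is a list of calls worth reading in context.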

Now change one field in your head. Make the sourceIPAddress an address in a country Northgate does not operate in. The same call, with the same effect on the same security group, is now a very different event. Nothing about the action changed. The context did. Almost all of cloud investigation is exactly this: a legitimate-looking API call made by the wrong identity, from the wrong place, at the wrong time. You will spend this course learning to read that context.

One field deserves an early warning, because it confuses responders coming from a single-datacenter mindset. awsRegion is eu-west-2 here because EC2 is a regional service and this call hit the London region. But a handful of AWS services are global, and their activity is recorded in us-east-1 regardless of where the caller sits. IAM is the one that matters most to an investigation: a CreateUser call made by someone in London still lands in CloudTrail tagged us-east-1. STS behaves the same way when the caller uses its global endpoint, so an AssumeRole can also appear in us-east-1; calls sent to a regional STS endpoint are logged in that region instead.

If you filter an identity investigation to the region the company operates in, you will miss every IAM event in the case and any STS call that went through the global endpoint. The region field tells you where the service endpoint lives, not where the attacker was. For where the attacker was, you read sourceIPAddress.
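A small sketch makes the trap concrete. It compares a naive region filter with one that selects identity services by eventSource instead. The function names and sample events are hypothetical, and the set of identity sources is limited to the two services discussed here.

```python
IDENTITY_SOURCES = {"iam.amazonaws.com", "sts.amazonaws.com"}


def naive_scope(events: list[dict], home_regions: set[str]) -> list[dict]:
    """Filter purely by region: this is the mistake described above."""
    return [e for e in events if e.get("awsRegion") in home_regions]


def identity_aware_scope(events: list[dict], home_regions: set[str]) -> list[dict]:
    """Keep home-region events plus IAM and STS events from any region."""
    return [
        e
        for e in events
        if e.get("eventSource") in IDENTITY_SOURCES or e.get("awsRegion") in home_regions
    ]


events = [
    {"eventSource": "ec2.amazonaws.com", "eventName": "AuthorizeSecurityGroupIngress",
     "awsRegion": "eu-west-2"},
    {"eventSource": "iam.amazonaws.com", "eventName": "CreateUser",
     "awsRegion": "us-east-1"},
]
home = {"eu-west-2"}
print([e["eventName"] for e in naive_scope(events, home)])
print([e["eventName"] for e in identity_aware_scope(events, home)])
```

The naive filter returns only the EC2 call. The identity-aware one also returns CreateUser, which is the event an identity investigation cannot afford to lose.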

Failed calls are evidence too

The record above succeeded. Failures are logged with equal detail, and in an investigation they are often more revealing than successes, because they map the edges of what an attacker could not yet do. Here is a different call, an IAM enumeration attempt, that was denied.

CloudTrail Management Event

{
  "eventTime": "2026-05-10T19:31:08Z",
  "eventSource": "iam.amazonaws.com",
  "eventName": "ListUsers",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.198",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::444444444444:user/m.chen-dev",
    "accessKeyId": "AKIAEXAMPLEDEVKEY01",
    "userName": "m.chen-dev"
  },
  "errorCode": "AccessDenied",
  "errorMessage": "User is not authorized to perform: iam:ListUsers",
  "eventType": "AwsApiCall",
  "recipientAccountId": "444444444444"
}

A developer identity, m.chen-dev, tried to list every IAM user from an external IP and was denied. On its own, one denied call is noise; developers fat-finger permissions every day. But read it as part of a pattern and it changes shape: this identity does not normally call IAM at all, the call came from an address with no history, and it is the kind of broad enumeration an attacker runs early to map what they can reach.

The AccessDenied is not the attacker failing. It is the attacker discovering a boundary, and the failure is recorded as faithfully as any success. A burst of denied List and Describe calls from one identity is one of the cleanest early signals of a compromised credential being explored, and you will hunt exactly this pattern in the credential-compromise module. Treating failures as noise to filter out is a habit worth unlearning now.
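A minimal sketch of that hunt counts denied List and Describe calls per identity and reports the identities that cross a threshold. The function name, the threshold of three, and the sample events are assumptions for illustration. The set of denial codes is also an assumption and not exhaustive, since services report authorization failures under different codes. A real hunt would add a time window.

```python
from collections import Counter

# Assumption: a partial list of codes that services use for authorization failures.
DENIED_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}
ENUM_PREFIXES = ("List", "Describe")


def denied_enumeration(events: list[dict], threshold: int = 3) -> dict[str, int]:
    """Map identity ARN to its count of denied enumeration calls, if >= threshold."""
    counts = Counter(
        e.get("userIdentity", {}).get("arn", "unknown")
        for e in events
        if e.get("errorCode") in DENIED_CODES
        and e.get("eventName", "").startswith(ENUM_PREFIXES)
    )
    return {arn: n for arn, n in counts.items() if n >= threshold}


chen = {"arn": "arn:aws:iam::444444444444:user/m.chen-dev"}
events = [
    {"userIdentity": chen, "eventName": "ListUsers", "errorCode": "AccessDenied"},
    {"userIdentity": chen, "eventName": "ListRoles", "errorCode": "AccessDenied"},
    {"userIdentity": chen, "eventName": "DescribeInstances",
     "errorCode": "Client.UnauthorizedOperation"},
    {"userIdentity": chen, "eventName": "ListBuckets"},  # succeeded: not counted
]
print(denied_enumeration(events))
```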

How the record reaches you

The copy of the record you investigate from does not sit where the action happened. A trail delivers events to an S3 bucket, and in a well-run organization that bucket lives in a separate, locked-down security account that no production identity can write to. Northgate is set up this way: every account's activity flows to one central trail in the security account, which is why an investigator with access to that one bucket can see the whole organization at once.

It is also why an attacker who fully owns the production account still cannot quietly erase the record, because the evidence already left for an account they do not control.

Two practical details follow from this. First, delivery is not instant. Events typically appear within a few minutes, occasionally longer, so the freshest moments of a live incident may not be queryable yet. Second, because the record is centralized and durable, the timeline you build from it is authoritative in a way a host log never is. When you reconstruct a sequence from CloudTrail, you are reading the platform's own account of what happened, not an artifact the attacker had a chance to edit.
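What arrives in the bucket is a series of gzip-compressed JSON files, each holding a top-level Records array. The sketch below decodes files already fetched into memory and orders their events into one timeline. The function names are hypothetical, and fetching the objects from S3 is left out so the example stays self-contained.

```python
import gzip
import json


def read_log_file(data: bytes) -> list[dict]:
    """Decode one gzip-compressed CloudTrail log file into its event records."""
    return json.loads(gzip.decompress(data)).get("Records", [])


def build_timeline(files: list[bytes]) -> list[dict]:
    """Merge records from several log files and sort them by eventTime."""
    events = [event for blob in files for event in read_log_file(blob)]
    # eventTime is a fixed-width UTC timestamp, so string order is time order.
    return sorted(events, key=lambda e: e["eventTime"])


def make_file(records: list[dict]) -> bytes:
    """Build an in-memory stand-in for a delivered log file (for this demo only)."""
    return gzip.compress(json.dumps({"Records": records}).encode("utf-8"))


files = [
    make_file([{"eventTime": "2026-05-10T19:31:08Z", "eventName": "ListUsers"}]),
    make_file([{"eventTime": "2026-05-02T09:14:22Z",
                "eventName": "AuthorizeSecurityGroupIngress"}]),
]
for event in build_timeline(files):
    print(event["eventTime"], event["eventName"])
```

Because delivery lags the API call, a timeline built this way can be missing the last few minutes of a live incident.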

Control plane and data plane

CloudTrail records two kinds of activity, and the difference decides whether the evidence you need exists at all.

Management events record changes to the environment and are on by default. Data events record access to the contents of resources and are not. The gap between them is a recurring blind spot in real investigations.

Management events are control-plane actions: creating an instance, changing a policy, opening a port, disabling a trail. CloudTrail logs them by default in every account. Data events are data-plane actions: reading a specific object out of an S3 bucket, invoking a specific Lambda function. These are not logged unless someone turned them on, because in a busy account they are enormous in volume and cost money to record.

This produces one of the most common and most painful findings in cloud IR. You can prove from management events that an attacker gained access to a bucket and changed its policy. Whether you can prove which objects they actually read depends entirely on whether data events were enabled before the incident. If they were not, the control-plane story is intact and the data-plane story is simply absent. You will meet this gap directly in the data-exfiltration module, and learning to state clearly what the evidence does and does not support is part of the job.
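The sketch below shows what stating the evidence can look like in practice. It lists the object keys read from a bucket according to S3 data events, and it returns nothing when only management events exist. The function name, bucket name, and object key are hypothetical, and the field names assume the usual shape of S3 records (eventCategory, plus requestParameters with bucketName and key). An empty result cannot tell you by itself whether data events were off or whether nothing was read. That depends on how the trail was configured before the incident.

```python
def object_reads(events: list[dict], bucket: str) -> list[str]:
    """Return object keys read from `bucket` according to S3 GetObject data events."""
    keys = []
    for event in events:
        params = event.get("requestParameters") or {}
        if (
            event.get("eventCategory") == "Data"
            and event.get("eventSource") == "s3.amazonaws.com"
            and event.get("eventName") == "GetObject"
            and params.get("bucketName") == bucket
        ):
            keys.append(params.get("key"))
    return keys


policy_change = {
    "eventCategory": "Management",
    "eventSource": "s3.amazonaws.com",
    "eventName": "PutBucketPolicy",
    "requestParameters": {"bucketName": "example-finance-bucket"},
}
object_read = {
    "eventCategory": "Data",
    "eventSource": "s3.amazonaws.com",
    "eventName": "GetObject",
    "requestParameters": {"bucketName": "example-finance-bucket", "key": "reports/q1.csv"},
}

# Management events only: the policy change is provable, the reads are not.
print(object_reads([policy_change], "example-finance-bucket"))
# With data events recorded, the specific object shows up.
print(object_reads([policy_change, object_read], "example-finance-bucket"))
```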

Anti-Pattern

Waiting for the host.

Responders new to AWS often stall at the start of an incident, waiting to get onto the instance the way they would on-premises, asking for SSH access or a disk image. The instance may already be gone, and even when it is not, the host is rarely where the answer is. The attacker did their work through the API. Start with CloudTrail, identify the calls, and treat the host as one more piece of evidence, not the center of the case.

Why this is good news for the responder

The control-plane model takes things away, the disk and the memory, but it gives back something on-premises rarely offers. The record is centralized: every account's activity flows to one place, queryable with one language, instead of scattered across dozens of hosts in dozens of formats. It is structured: every record has the same fields, so you can filter and group across millions of events in a single query.

And it is complete in a way local logs never are, because it is produced by the platform rather than by an agent the attacker might have killed. An attacker can delete a file on a host. To keep their actions out of the CloudTrail record, they have to stop or delete the trail itself, and that action is its own loud, logged management event, which is exactly how the evasion module catches them.
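A first-pass check for that loud event is short. The sketch below pulls out the CloudTrail API calls that stop, delete, or reconfigure a trail. The function name is hypothetical and the set of event names is a starting list, not a complete catalogue of evasion techniques.

```python
# Assumption: a starting set of CloudTrail calls that reduce or remove logging.
TAMPER_EVENTS = {"StopLogging", "DeleteTrail", "UpdateTrail", "PutEventSelectors"}


def logging_tamper_events(events: list[dict]) -> list[dict]:
    """Return events where someone changed or disabled CloudTrail itself."""
    return [
        e
        for e in events
        if e.get("eventSource") == "cloudtrail.amazonaws.com"
        and e.get("eventName") in TAMPER_EVENTS
    ]


events = [
    {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging",
     "eventTime": "2026-05-10T19:45:00Z"},  # hypothetical event
    {"eventSource": "cloudtrail.amazonaws.com", "eventName": "DescribeTrails",
     "eventTime": "2026-05-10T19:44:00Z"},  # hypothetical event
    {"eventSource": "ec2.amazonaws.com", "eventName": "RunInstances",
     "eventTime": "2026-05-10T19:46:00Z"},  # hypothetical event
]
print([e["eventName"] for e in logging_tamper_events(events)])
```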

The structure is what makes investigation at scale possible. Because every event shares the same shape, a question like "which identities called IAM write actions from outside our IP ranges in the last week" is a single query against the whole organization, not a manual sweep through separate logs. On-premises, answering that means collecting from many systems, normalizing different formats, and reconciling clocks that do not agree.
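To show how directly that question maps onto the schema, here is a sketch in Python over already-parsed records. It is illustrative only: the function names, the trusted range 10.20.0.0/16, and the sample events are hypothetical, and the same logic would normally be written in whatever query language sits on top of your central log store. Two assumptions are labeled in the code: records without a readOnly flag are treated as writes, and callers identified by a service name rather than an IP address are treated as internal.

```python
import ipaddress
from datetime import datetime, timedelta, timezone


def parse_time(value: str) -> datetime:
    """Parse a CloudTrail eventTime such as 2026-05-10T19:31:08Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_external(source: str, trusted: list) -> bool:
    """True when `source` is an IP address outside every trusted network."""
    try:
        address = ipaddress.ip_address(source)
    except ValueError:
        # Assumption: non-IP values (AWS service names) are treated as internal.
        return False
    return not any(address in network for network in trusted)


def external_iam_writes(events, trusted_cidrs, now, days=7):
    """Return sorted ARNs that made IAM write calls from outside the trusted ranges."""
    trusted = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]
    cutoff = now - timedelta(days=days)
    hits = set()
    for e in events:
        if e.get("eventSource") != "iam.amazonaws.com":
            continue
        if e.get("readOnly", False):  # assumption: missing flag means write
            continue
        if parse_time(e["eventTime"]) < cutoff:
            continue
        if is_external(e.get("sourceIPAddress", ""), trusted):
            hits.add(e["userIdentity"]["arn"])
    return sorted(hits)


chen = {"arn": "arn:aws:iam::444444444444:user/m.chen-dev"}
webb = {"arn": "arn:aws:iam::333333333333:user/m.webb"}
events = [
    {"eventSource": "iam.amazonaws.com", "eventName": "CreateUser", "readOnly": False,
     "eventTime": "2026-05-10T19:40:00Z", "sourceIPAddress": "203.0.113.198",
     "userIdentity": chen},
    {"eventSource": "iam.amazonaws.com", "eventName": "AttachUserPolicy", "readOnly": False,
     "eventTime": "2026-05-09T10:00:00Z", "sourceIPAddress": "10.20.4.17",
     "userIdentity": webb},
    {"eventSource": "iam.amazonaws.com", "eventName": "ListUsers", "readOnly": True,
     "eventTime": "2026-05-10T19:31:08Z", "sourceIPAddress": "203.0.113.198",
     "userIdentity": webb},
]
now = datetime(2026, 5, 12, tzinfo=timezone.utc)
print(external_iam_writes(events, ["10.20.0.0/16"], now))
```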

In AWS, the platform has already done the normalizing: one schema, one clock, one place. The investigative skill shifts away from gathering evidence and toward asking the right question of evidence you already have. That is why this course spends so much of its time on how to query well. The evidence is rarely the bottleneck. The question is.

This is the thread running under the whole course: almost every action is an API call, CloudTrail records it with who, what, when, where, and the result, and a responder who can read that record can investigate an incident whose machine no longer exists.

The control plane gives you the record. What it does not give you is a stable set of things to investigate, because in AWS the resources themselves come and go in minutes. AWS0.4 takes that on: why the environment is ephemeral by design, and why that makes the log the only durable evidence you have.
