
AWS Ephemeral Infrastructure: Why the Log Is the Durable Evidence

Module 0

You are investigating an alert tied to a server that handled a suspicious request at 14:30. You go to look at the server. There are now six servers behind that name, none of them the one from 14:30, because the application scaled out during a traffic spike and scaled back in an hour later. The instance that handled the request was terminated just after 15:10. Its private IP has already been handed to a different instance. Its instance ID will never be reused, but the instance it named is gone. Where do you start?

You start where the evidence actually is, which is not the server. In AWS, infrastructure is built to be temporary, and an investigation that depends on the resource still existing is an investigation that fails most of the time. The durable thing is the log.

Nothing here is built to last

On-premises, a server is a fixed object. It has a name, an address, and a physical existence that persists until someone decommissions it. You can walk up to it weeks later and image its disk. AWS inverts this. Resources are created and destroyed by API call, constantly, as a normal part of operation, and the whole platform is designed to make that cheap and routine.

Auto Scaling groups add and remove EC2 instances to match load, so the number of servers behind an application changes through the day with no human involved. Spot instances can be reclaimed by AWS with two minutes of notice. A Lambda function spins up, runs for a few hundred milliseconds, and disappears, leaving no host to examine at all. Containers start and stop in seconds.

Private IP addresses are pulled from a pool and reassigned to whatever launches next, so an address that belonged to a compromised instance this morning may belong to an innocent one this afternoon. Even the identifiers move: an instance ID is unique and never reused, which is helpful, but it points at something that no longer exists.

The consequence for a responder is direct. The artifact you would most like to examine, the running machine, is often gone before you are even assigned the case, and even when it is still alive it may be one of a fleet of identical instances where you cannot easily tell which one did the thing you care about. Building your method around the host means building it around something AWS treats as disposable.

The fleet problem deserves a moment, because it trips up responders who expect one server per role. When an application runs behind an Auto Scaling group, "the web server" is not a machine; it is a population of interchangeable machines that grows and shrinks. An alert that says "suspicious activity from the web tier" does not point at a box. To find which instance acted, you work from the identifiers the log gives you.

The instance ID is the reliable anchor: it is globally unique and never reused, so an instance ID in a CloudTrail record, or in a VPC Flow Logs record whose custom format includes the instance-id field, points at exactly one instance for all time, even after it is terminated. The private IP address is the unreliable one: it returns to a pool and gets handed to the next instance that launches, so the same address can belong to three different instances across a single day.

A responder who pivots on IP address without pinning it to a time window will cheerfully merge the activity of three unrelated instances into one false story. You correlate on instance ID and timestamp, not on address, and the course drills that habit because it is where cloud correlation quietly goes wrong.
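The habit of pinning an address to a time window before trusting it can be sketched in a few lines of Python. This is a minimal illustration, not a tool from the course: the `IpLease` structure, the `instance_for_ip` helper, the instance IDs, and the lease times are all hypothetical, and in practice the intervals would be rebuilt from launch and termination records.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class IpLease(NamedTuple):
    """One period during which an instance held a private IP (hypothetical structure)."""
    instance_id: str
    private_ip: str
    start: datetime           # when the instance came to hold the address
    end: Optional[datetime]   # None means the instance still holds it


def instance_for_ip(leases, private_ip, at):
    """Return the instance ID that held private_ip at time `at`, or None."""
    for lease in leases:
        if lease.private_ip != private_ip:
            continue
        still_held = lease.end is None or at < lease.end
        if lease.start <= at and still_held:
            return lease.instance_id
    return None


def utc(hour, minute):
    # Helper for the illustrative data below; the date is arbitrary.
    return datetime(2026, 5, 12, hour, minute, tzinfo=timezone.utc)


# Illustrative data only: one address, three unrelated instances in one day.
LEASES = [
    IpLease("i-aaa", "10.0.1.25", utc(8, 0), utc(11, 30)),
    IpLease("i-bbb", "10.0.1.25", utc(12, 5), utc(15, 10)),
    IpLease("i-ccc", "10.0.1.25", utc(16, 45), None),
]

print(instance_for_ip(LEASES, "10.0.1.25", utc(14, 30)))  # i-bbb
print(instance_for_ip(LEASES, "10.0.1.25", utc(11, 45)))  # None: nobody held it
```

The lookup refuses to answer "who is 10.0.1.25" without a timestamp, which is the point: the address alone names nothing.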

"The web server" behind an Auto Scaling group is a population, not a box. The instance ID survives the instance and pins activity to one machine for all time; the address does not, which is where cloud correlation quietly goes wrong.

The log outlives the resource

What persists is the record of the resource's life. The instance no longer exists, but CloudTrail recorded its creation, including who launched it and what role it was given, and later recorded its termination. Read this birth certificate.

CloudTrail Management Event

{
  "eventTime": "2026-05-12T14:02:55Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "RunInstances",
  "awsRegion": "eu-west-2",
  "sourceIPAddress": "10.20.4.17",
  "userIdentity": { "type": "IAMUser", "userName": "m.webb",
    "arn": "arn:aws:iam::333333333333:user/m.webb" },
  "requestParameters": {
    "instanceType": "t3.medium",
    "imageId": "ami-0prodbase42",
    "iamInstanceProfile": { "name": "app-server-role" }
  },
  "responseElements": {
    "instancesSet": { "items": [ { "instanceId": "i-0a8f3c2prodweb" } ] }
  },
  "eventType": "AwsApiCall"
}

The instance i-0a8f3c2prodweb may be long gone, but this record tells you it existed, when it was launched, who launched it, and the one detail that matters most for what comes later: it was given the role app-server-role through its instance profile. That role is a set of permissions the instance carries, and an attacker who lands on the instance can steal those permissions through the metadata service, which is the entire subject of the compute-compromise module.

You will reconstruct that attack from records like this one, long after the instance itself has been terminated. The host being gone costs you nothing here, because the host was never going to tell you which role it held. CloudTrail did.
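As a rough sketch of reading that birth certificate programmatically, the Python below pulls the launch facts out of the RunInstances record shown above. The `launch_facts` function name and the shape of its return value are assumptions made for illustration; the field names come from the record itself.

```python
import json

# The RunInstances record from the text, trimmed to the fields used here.
RECORD = json.loads("""
{
  "eventTime": "2026-05-12T14:02:55Z",
  "eventName": "RunInstances",
  "userIdentity": { "type": "IAMUser", "userName": "m.webb",
    "arn": "arn:aws:iam::333333333333:user/m.webb" },
  "requestParameters": {
    "instanceType": "t3.medium",
    "iamInstanceProfile": { "name": "app-server-role" }
  },
  "responseElements": {
    "instancesSet": { "items": [ { "instanceId": "i-0a8f3c2prodweb" } ] }
  }
}
""")


def launch_facts(record):
    """Summarise a RunInstances record: who launched what, when, with which role."""
    if record.get("eventName") != "RunInstances":
        raise ValueError("not a RunInstances record")
    identity = record.get("userIdentity") or {}
    params = record.get("requestParameters") or {}
    # responseElements can be null when the call failed, hence the fallbacks.
    response = record.get("responseElements") or {}
    items = (response.get("instancesSet") or {}).get("items", [])
    profile = params.get("iamInstanceProfile") or {}
    return {
        "launched_at": record["eventTime"],
        "launched_by": identity.get("arn") or identity.get("userName"),
        "instance_ids": [item["instanceId"] for item in items],
        "instance_profile": profile.get("name") or profile.get("arn"),
    }


print(launch_facts(RECORD))
```

Everything in the result comes from the log; nothing requires the instance to exist.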

This is the shift in habit the whole module is pushing toward. When you are handed an incident, you do not ask "can I get on the box." You ask "what does the record say the box was, what did it carry, and what did it do," and you answer that from CloudTrail whether or not the box still exists.

The record also gives you the resource's full arc, well beyond its creation. The same trail that holds this RunInstances event holds the TerminateInstances that ended the instance, the calls that attached or changed its security groups, and any role changes along the way. Laid out in time order, those events are the instance's life story: when it appeared, what it was allowed to do, what it actually did, and when it went away.

CloudTrail Management Event

{
  "eventTime": "2026-05-12T15:11:40Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "TerminateInstances",
  "awsRegion": "eu-west-2",
  "userIdentity": { "type": "AWSService", "invokedBy": "autoscaling.amazonaws.com" },
  "requestParameters": {
    "instancesSet": { "items": [ { "instanceId": "i-0a8f3c2prodweb" } ] }
  },
  "eventType": "AwsApiCall"
}

This is the same instance, i-0a8f3c2prodweb, being terminated, and the identity that did it is not a person but the Auto Scaling service acting on its own. The instance did not disappear because of anything an attacker did; it scaled in, exactly as designed. That distinction matters: a termination by the Auto Scaling service is routine, while a TerminateInstances from a user credential during an incident may be an attacker destroying evidence, and the only way to tell them apart is to read who made the call.
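Reading who made the call can be sketched as a small triage function. The heuristic here is an assumption, not an official rule: it treats a record as service-driven when `userIdentity.type` is `AWSService` or when `invokedBy` names an AWS service endpoint, and hands everything else to a human. The `termination_actor` name and the second sample record are hypothetical.

```python
def termination_actor(record):
    """Return (kind, name) for a TerminateInstances record.

    kind is "service" when an AWS service made the call and "principal"
    otherwise. Heuristic sketch only; review the raw record before concluding.
    """
    if record.get("eventName") != "TerminateInstances":
        raise ValueError("not a TerminateInstances record")
    identity = record.get("userIdentity") or {}
    invoked_by = identity.get("invokedBy", "")
    if identity.get("type") == "AWSService" or invoked_by.endswith(".amazonaws.com"):
        return ("service", invoked_by or "unknown AWS service")
    name = identity.get("arn") or identity.get("userName") or "unknown"
    return ("principal", name)


# The Auto Scaling termination from the text.
BY_SERVICE = {
    "eventName": "TerminateInstances",
    "userIdentity": {"type": "AWSService",
                     "invokedBy": "autoscaling.amazonaws.com"},
}

# Hypothetical variant: the same call made with a user credential.
BY_USER = {
    "eventName": "TerminateInstances",
    "userIdentity": {"type": "IAMUser", "userName": "m.webb",
                     "arn": "arn:aws:iam::333333333333:user/m.webb"},
}

print(termination_actor(BY_SERVICE))  # ('service', 'autoscaling.amazonaws.com')
print(termination_actor(BY_USER))     # ('principal', 'arn:aws:iam::333333333333:user/m.webb')
```

A "principal" result is not proof of evidence destruction; it only marks the termination as one a responder should explain rather than assume.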

AWS Config adds a second angle, recording the configuration state of resources over time, so you can ask what a security group or a role looked like at the moment of the incident rather than what it looks like now. Between CloudTrail's stream of actions and Config's snapshots of state, you can reconstruct a resource that no longer exists in enough detail to investigate it, which is the whole point: the platform keeps the description even when it discards the thing.
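The CloudTrail half of that reconstruction, laying an instance's events out in time order, might look like the sketch below. It covers CloudTrail records only and does not touch Config. The `life_story` and `mentions` helpers are hypothetical names, and the ModifyInstanceAttribute record and the second instance are invented for illustration; only the RunInstances and TerminateInstances entries mirror the records in the text.

```python
from datetime import datetime


def mentions(value, needle):
    """True if needle appears as a value anywhere inside nested dicts and lists."""
    if isinstance(value, dict):
        return any(mentions(v, needle) for v in value.values())
    if isinstance(value, list):
        return any(mentions(v, needle) for v in value)
    return value == needle


def life_story(records, instance_id):
    """Return (eventTime, eventName) pairs that reference instance_id, oldest first."""
    relevant = [
        r for r in records
        if mentions(r.get("requestParameters"), instance_id)
        or mentions(r.get("responseElements"), instance_id)
    ]
    relevant.sort(key=lambda r: datetime.strptime(r["eventTime"], "%Y-%m-%dT%H:%M:%SZ"))
    return [(r["eventTime"], r["eventName"]) for r in relevant]


TARGET = "i-0a8f3c2prodweb"

# Deliberately out of order, with one unrelated instance mixed in.
RECORDS = [
    {"eventTime": "2026-05-12T15:11:40Z", "eventName": "TerminateInstances",
     "requestParameters": {"instancesSet": {"items": [{"instanceId": TARGET}]}}},
    {"eventTime": "2026-05-12T14:10:00Z", "eventName": "RunInstances",
     "requestParameters": {"instanceType": "t3.medium"},
     "responseElements": {"instancesSet": {"items": [{"instanceId": "i-other"}]}}},
    # Hypothetical security group change, invented for this example.
    {"eventTime": "2026-05-12T14:20:00Z", "eventName": "ModifyInstanceAttribute",
     "requestParameters": {"instanceId": TARGET}},
    {"eventTime": "2026-05-12T14:02:55Z", "eventName": "RunInstances",
     "requestParameters": {"instanceType": "t3.medium"},
     "responseElements": {"instancesSet": {"items": [{"instanceId": TARGET}]}}},
]

for when, what in life_story(RECORDS, TARGET):
    print(when, what)
```

Searching both `requestParameters` and `responseElements` matters because a launch reports the new instance ID in the response, while later calls name it in the request.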

What you can still preserve

Ephemerality cuts both ways. The same churn that removes evidence also sets a clock on whatever is still running, and during a live incident you can act before that evidence disappears. If a compromised instance is still running, you do not have to accept losing it to the next scale-in. You can take an EBS snapshot of its disk and capture an AMI of the instance, both through API calls, and both produce durable copies that outlive the instance itself.

The same applies to preserving a forensic copy before you terminate or isolate a resource during containment. The instinct from on-premises forensics, capture before you change anything, still holds; it just executes through the API now. Module 10 teaches this preservation step as part of containment, where the hard part is doing it in the right order: preserve, then contain, so that stopping the attacker does not also destroy the evidence of what they did.

For orientation, the point is that ephemeral does not have to mean lost. It means you have a window, and acting inside it is a skill.
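As a hedged sketch of what acting inside that window can look like, the function below snapshots each attached EBS volume and captures an AMI. It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) and uses its `describe_instances`, `create_snapshot`, and `create_image` calls. The `preserve_instance` name, the `ir-case` tag key, and the naming scheme are hypothetical, and this is not the procedure Module 10 teaches.

```python
def preserve_instance(ec2, instance_id, case_id):
    """Snapshot every attached EBS volume and capture an AMI of the instance.

    `ec2` is assumed to be a boto3 EC2 client. Every copy is tagged with the
    case ID so it can be found once the instance itself is gone.
    """
    tags = [{"Key": "ir-case", "Value": case_id}]  # hypothetical tag key

    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    snapshot_ids = []
    for mapping in instance.get("BlockDeviceMappings", []):
        volume_id = mapping.get("Ebs", {}).get("VolumeId")
        if volume_id is None:
            continue  # not an EBS-backed device
        snapshot = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"{case_id} evidence: {instance_id} {mapping['DeviceName']}",
            TagSpecifications=[{"ResourceType": "snapshot", "Tags": tags}],
        )
        snapshot_ids.append(snapshot["SnapshotId"])

    # NoReboot=True leaves the running instance undisturbed, at the cost of
    # no guarantee that the captured file systems are consistent.
    image = ec2.create_image(
        InstanceId=instance_id,
        Name=f"{case_id}-{instance_id}",
        NoReboot=True,
        TagSpecifications=[{"ResourceType": "image", "Tags": tags}],
    )
    return {"snapshots": snapshot_ids, "image": image["ImageId"]}
```

Passing the client in as a parameter keeps the function easy to exercise against a stub before it is pointed at a real account. Creating an AMI of an EBS-backed instance also snapshots its volumes, so the explicit snapshots overlap with it; the sketch does both only because the text names both.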

Retention is the clock you are racing

If the log is the evidence, then how long the log survives is the single most important fact about your investigation, and it is not the same for every source. This is where ephemerality bites a second time: not the resource vanishing, but the evidence of it aging out.

The resource vanishes in minutes. Its evidence survives only as long as the source that holds it, and the windows differ by source. The trail delivered to S3 is the copy you can rely on.

CloudTrail's free Event history in the console holds the last 90 days of management events and nothing older, and it does not include data events at all. A trail that delivers to an S3 bucket, by contrast, keeps events for as long as your S3 lifecycle policy retains them, which can be years, and it is the authoritative copy you investigate against. GuardDuty retains its findings for 90 days.

VPC Flow Logs, S3 data events, and similar high-volume sources exist only if someone enabled them before the incident, and no investigation can retrieve a log that was never being written. Northgate sends every account's activity to one durable trail in the security account, which is why its investigations can reach back months rather than stopping at a 90-day console wall.

The reason the richest sources are the ones most often missing is cost. Data events and flow logs are high volume: an active account generates enormous numbers of object reads and network flows, and AWS charges to record and store them. Many organizations leave them off to save money, or enable them only on a few resources, which means the evidence that would prove exactly which objects an attacker read or exactly what their instance connected to is frequently the evidence that does not exist.

This is not a failure you can fix during the incident. It is a condition you inherit, and a large part of professional cloud IR is being precise about what you can and cannot prove given what was being logged.

That makes one move the right first move in any AWS investigation: check the clock before you commit.

Establish which sources cover the window you care about and which have already aged out or were never enabled. Confirm the trail's retention, whether data events and flow logs were on, and how far back GuardDuty reaches.

This decides everything that follows. A timeline you can build only for the last 90 days is a different investigation from one you can build for a year, and an exfiltration case where data events were off is a different case from one where they were on. Knowing which one you are in, early, stops you promising findings the evidence cannot support and points you at the sources that actually hold answers.
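Checking the clock can be reduced to a small comparison per source: when did it start recording, how long does it keep data, and does that reach the incident window. The sketch below is a simplified model under stated assumptions. The `Source` structure, the `coverage` function, and the dates are hypothetical, and the only retention figures taken from the text are the 90-day windows for Event history and GuardDuty findings.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional


class Source(NamedTuple):
    name: str
    enabled_since: Optional[datetime]  # None = never enabled
    retention_days: Optional[int]      # None = kept indefinitely


def coverage(source, window_start, window_end, now):
    """Classify how much of the incident window a source can still show."""
    if source.enabled_since is None:
        return "never enabled"
    oldest = source.enabled_since
    if source.retention_days is not None:
        # Anything older than the retention horizon has aged out.
        oldest = max(oldest, now - timedelta(days=source.retention_days))
    if oldest <= window_start:
        return "full"
    if oldest < window_end:
        return "partial"
    return "none"


def utc(*parts):
    return datetime(*parts, tzinfo=timezone.utc)


# Illustrative scenario: incident on 12 May, investigated on 1 September.
NOW = utc(2026, 9, 1)
START, END = utc(2026, 5, 12, 14), utc(2026, 5, 12, 16)

SOURCES = [
    Source("CloudTrail Event history", utc(2024, 1, 1), 90),
    Source("CloudTrail trail in S3", utc(2024, 1, 1), 730),   # hypothetical lifecycle
    Source("GuardDuty findings", utc(2024, 1, 1), 90),
    Source("S3 data events", utc(2026, 5, 12, 15), None),     # turned on mid-incident
    Source("VPC Flow Logs", None, None),
]

for source in SOURCES:
    print(f"{source.name}: {coverage(source, START, END, NOW)}")
```

In this scenario only the S3 trail covers the whole window, which is the situation the text describes: the console and GuardDuty have aged out, data events cover part of the window, and flow logs never existed.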

Anti-Pattern

Treating the resource as the source of truth.

Pinning the investigation to the compromised instance: requesting SSH access, waiting for a disk snapshot, trying to log in. In a fleet that scales on its own, the instance is often gone, and the answer was never on it anyway. The same habit shows up with logs: assuming VPC Flow or data events will be there to query, when they were never turned on. Build the method around the durable record in S3, confirm what exists before you commit to a line of inquiry, and treat any surviving resource as a bonus rather than the foundation.

Why design for this at all

It is worth understanding why AWS works this way, because it explains why the pattern will not change. Ephemeral infrastructure is what makes the cloud elastic and cheap: you run exactly the capacity you need for exactly as long as you need it, and you stop paying when you stop using it. The same property that frustrates a host-centric responder is the property the business is paying for.

It is not going away, and an investigator who fights it loses. The responders who do well in AWS are the ones who stopped mourning the disk image and learned to read the record instead, because the record is the part the platform keeps.

The mindset that follows from this is worth stating plainly, because it governs every investigation in the course. You treat the log as primary and the resource as secondary. You assume the resource may be gone and build your account of the incident from records that cannot be, while seizing any live resource as a bonus to be preserved quickly rather than a foundation to depend on.

You check retention before you commit to a line of inquiry, so you never promise a finding the evidence cannot support. And you keep a copy of what matters early, because in an environment built to discard, the responder's discipline is to preserve. None of this is harder than on-premises forensics. It is different, and the difference is the first thing an analyst arriving from a traditional background has to internalize.

AWS0.5 turns to the second pillar of the cloud mental model. If resources are temporary and the log is the evidence, what is the attacker actually attacking? The answer is identity, and the next section explains why identity, not the network, is the perimeter in AWS.
