I Tested AWS DevOps Agent—Here’s What Happened When I Broke My EC2 Instance


How well does AI-powered incident response actually work? I decided to find out by deliberately stress-testing an EC2 instance and letting AWS DevOps Agent investigate — with zero hints from me.


What is AWS DevOps Agent?

AWS DevOps Agent (currently in public preview) is Amazon's autonomous AI agent for incident response. It connects to CloudWatch, logs, code repos, and third-party observability tools, then correlates data across all of them to find root causes — like an always-on, AI-powered on-call engineer.

During the preview, it's free with limits: 20 investigation hours per month, up to 10 Agent Spaces, and availability in us-east-1 only.


Setting Up

Setup involves creating an Agent Space — a logical boundary that defines what the agent can access.

I configured the basics: connected my AWS account, enabled the web app, and left optional integrations (Datadog, Slack, GitHub, MCP Servers) empty to test out-of-box capabilities.

The agent also supports Skills (custom runbooks) and Prevention (weekly analysis of past incidents for proactive recommendations). I left Skills empty and enabled Prevention.


Creating Chaos

I launched a t2.medium EC2 instance ("demo") and connected via SSM Session Manager. Then I ran the stress utility to spike CPU:

# Maximum chaos first
stress --cpu 2 --timeout 300

# Then single CPU worker
stress --cpu 1 --timeout 300

A t2.medium has 2 vCPUs with a 20% baseline performance per vCPU. Running stress would push the instance well above baseline, burn through CPU credits, and produce a clear spike in CloudWatch.
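You can verify the spike yourself before (or alongside) the agent with a standard CloudWatch CLI query — a sketch, with a placeholder instance ID and time window:

```shell
# Pull CPUUtilization for the stressed instance in 5-minute buckets.
# Instance ID and timestamps are placeholders, not the real ones from this test.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2025-01-01T13:45:00Z \
  --end-time 2025-01-01T14:15:00Z \
  --period 300 \
  --statistics Average Maximum \
  --region us-east-1
```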


The Investigation

I ran two investigations. The first was broad: "Investigate high CPU utilization across my compute resources." The second was targeted: "High CPU Usage on one of the EC2 Instances in us-east-1" with the exact incident timestamp.

The agent immediately identified the instance, noted it launched just ~15 minutes before the incident, and began gathering metrics in parallel.


What the Agent Found

The agent generated detailed CPU charts showing the escalation: average CPU rose from 2.8% → 28.3% → 58.5% across three 5-minute windows. Maximum hit 100% at 14:05, confirming full vCPU saturation.

It also tracked CPU credit burn rate — peaking at 5.83 credits per 5-minute window, roughly 2.9x the earn rate.
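That ratio checks out: a t2.medium earns 24 CPU credits per hour, i.e. 2 credits per 5-minute window. A quick sanity check of the agent's 2.9x figure:

```shell
# t2.medium earns 24 credits/hour -> 24 / 12 = 2 credits per 5-minute window
earn=2
burn=5.83   # peak burn rate reported by the agent
awk "BEGIN { printf \"%.1fx\n\", $burn / $earn }"   # prints 2.9x
```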

But the agent didn't stop at CPU. It simultaneously analyzed:

  • Network I/O — heavy 75MB ingress during bootstrap, dropping before CPU spike → not network-driven
  • EBS Disk I/O — boot activity settling before CPU spike → disk I/O was a result, not a cause
  • Status checks — all passing, no hardware failures
  • CWAgent metrics — none found, flagging a visibility gap
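That visibility gap is fixable. As a sketch (standard CloudWatch agent install paths; the "stress" pattern is specific to this demo), the agent's procstat plugin can publish per-process CPU metrics:

```shell
# Minimal CloudWatch agent config collecting per-process CPU and memory
# via the procstat plugin, matching the "stress" process from this test.
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json > /dev/null <<'EOF'
{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        { "pattern": "stress", "measurement": ["cpu_usage", "memory_rss"] }
      ]
    }
  }
}
EOF

# Load the config and (re)start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json -s
```

With this in place, the agent's "CWAgent metrics — none found" finding would instead show which process was eating the CPU.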


The Mitigation Plan — This Is Where It Got Interesting

The Mitigation plan tab said: "No mitigation action can be identified."

That sounds like a failure. It's actually the smartest part. The agent explained it found a strong correlation between SSM sessions and the CPU spike, but couldn't safely recommend action because:

  1. SSM Session Manager logging wasn't configured — actual shell commands were invisible
  2. CloudWatch Agent wasn't publishing per-process metrics yet
  3. IAM permissions prevented running diagnostic commands via SSM RunCommand

Instead of blindly suggesting "kill the process" or "stop the instance," the agent gave concrete next steps: run top or ps aux --sort=-%cpu via the console, configure SSM logging, and coordinate with the root user (it even identified the IP address). That's real engineering judgment.
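Once the IAM gap is closed, the same diagnostic can be run remotely via SSM RunCommand — a sketch with a placeholder instance ID:

```shell
# Run the agent's suggested diagnostic remotely via SSM RunCommand.
# Instance ID is a placeholder; requires ssm:SendCommand on the instance.
aws ssm send-command \
  --instance-ids "i-0123456789abcdef0" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["ps aux --sort=-%cpu | head -15"]' \
  --region us-east-1

# Fetch the output afterwards with:
#   aws ssm get-command-invocation --command-id <id> --instance-id i-0123456789abcdef0
```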


The Verdict

What impressed me: The autonomous correlation across CPU, network, disk, credits, and status checks — building a coherent timeline without any manual intervention. The agent thinks like an engineer, distinguishing cause from effect and knowing when it doesn't have enough evidence to act.

Bottom line: AWS DevOps Agent is a genuine shift from observability dashboards to operational reasoning. It won't replace engineers, but it dramatically compresses the time between "something is wrong" and "here's exactly what happened." Worth trying during the free preview.

