Autonomous Remediation: Where I Think Agents Belong (And Where I Learned They Don't)

The goal of autonomous remediation is not to remove humans. It is to handle the failures humans are tired of handling. The danger is when it handles the failures humans should still be handling.

I am cautiously optimistic about autonomous remediation. The caution comes from watching demos that look impressive until you imagine them running at 3 AM on a production cluster. The optimism comes from believing that narrow, well-scoped automation can genuinely reduce incident load.

I built one. It worked for two weeks. Then it restarted a production database pod during a backup window because the pod was using more memory than usual. The backup was supposed to use more memory. The agent did not know that. I learned a lot that night.

This post covers where I think agents belong in remediation and where they do not. It is based on something I actually built and something I actually broke.

The OODA Loop Still Applies

Every remediation system, human or agent, follows the same loop:

Observe: collect signals.
Orient: figure out what is wrong.
Decide: pick an action.
Act: execute and verify.

The difference with agents is speed and scale. An agent can observe hundreds of signals at once and respond in seconds. But the same loop also means an agent can misorient and act wrongly in seconds. My database pod restart took four seconds from alert to action. Four seconds to turn a routine backup into an incident.
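
To make the loop concrete, here is a minimal sketch in Python. It is an illustration, not the agent I built: the `Signal` type, the `orient` and `decide` functions, and the one-entry playbook are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Signal:
    pod: str
    memory_gb: float


def orient(signal: Signal, threshold_gb: float = 5.0) -> Optional[str]:
    """Orient: turn a raw signal into a diagnosis, or None if nothing looks wrong."""
    return "high_memory" if signal.memory_gb > threshold_gb else None


def decide(diagnosis: Optional[str]) -> Optional[str]:
    """Decide: map a diagnosis to an action name; unknown diagnoses get no action."""
    playbook = {"high_memory": "restart_pod"}
    return playbook.get(diagnosis) if diagnosis is not None else None


def run_once(signal: Signal, act: Callable[[str, str], bool]) -> str:
    """One pass of observe -> orient -> decide -> act, returning what happened."""
    action = decide(orient(signal))
    if action is None:
        return "no_action"
    # act() executes the action and reports whether verification passed.
    verified = act(action, signal.pod)
    return f"{action}:{'verified' if verified else 'unverified'}"
```

Nothing in this loop knows why memory is high. With memory as the only input, a backup and a leak produce the same diagnosis and the same action.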

The Four Layers I Would Build (Now)

Observability. Prometheus, Loki, PagerDuty, whatever you already trust. If the agent cannot see clearly, it cannot decide safely. My agent saw memory usage but not the backup schedule. That was the gap.
MCP server. A scoped translation layer between the agent and Kubernetes. Read-only by default. Write actions require approval or narrow policy. I did not have this when the database incident happened. I do now.
Agent runtime. The LLM and orchestration layer that runs the OODA loop.
Governance. Audit logs, rate limits, approval queues, and rollback paths. I had audit logs. I did not have rate limits or approval queues. I do now.
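
The "read-only by default" idea for the translation layer can be sketched as a small gate that every tool call passes through. This is a hypothetical illustration, not an MCP or Kubernetes API: `ToolGate`, `READ_VERBS`, and the `verb:namespace` allowlist format are names I made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Verbs treated as read-only in this sketch (an assumption, not an official list).
READ_VERBS: Set[str] = {"get", "list", "watch", "logs"}


@dataclass
class ToolGate:
    """Sits between the agent and the cluster; writes are denied unless allowlisted."""
    allowed_writes: Set[str] = field(default_factory=set)  # e.g. {"restart:staging"}
    audit_log: List[str] = field(default_factory=list)

    def check(self, verb: str, namespace: str) -> str:
        if verb in READ_VERBS:
            decision = "allow"
        elif f"{verb}:{namespace}" in self.allowed_writes:
            decision = "allow"
        else:
            # Anything not explicitly permitted is routed to a human.
            decision = "needs_approval"
        self.audit_log.append(f"{verb} {namespace} -> {decision}")
        return decision
```

The point of the design is that the default branch is the restrictive one. Forgetting to list a write action sends it to the approval queue instead of letting it through.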

The architecture is simple. The discipline to keep it scoped is hard. I learned that the hard way.

What I Would Automate First (Revised)

I would only give an agent autonomous action for failures that are:

Well understood.
Reversible.
Low blast radius.
Frequent enough to matter.
Cheap to verify.
Backed by context: the agent knows why the failure is happening, not just what is happening.

The last one is new. My database pod restart went wrong because the agent knew the pod looked unhealthy but not that it was supposed to look that way during a backup. Context matters.

The classic example is restarting a crashed pod. The action is safe, the verification is simple, and the failure pattern is unambiguous. But even then, I now require the agent to check a calendar or schedule before acting. No restarts during known maintenance windows.
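
A schedule check like that can be very small. The sketch below assumes maintenance windows are stored as daily start and end times; `MaintenanceWindow` and `restart_allowed` are hypothetical names, and the 03:00 end of the backup window is an assumption for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    name: str
    start: time
    end: time

    def contains(self, moment: datetime) -> bool:
        t = moment.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        # Window wraps past midnight, e.g. 23:00 to 01:00.
        return t >= self.start or t < self.end


def restart_allowed(moment: datetime, windows: Iterable[MaintenanceWindow]) -> bool:
    """A restart is allowed only if no known window covers the given moment."""
    return not any(w.contains(moment) for w in windows)


# Nightly backup starting at 2 AM; the end time is assumed for this example.
backup = MaintenanceWindow("nightly-backup", time(2, 0), time(3, 0))
```

A real version would read windows from wherever the schedule actually lives, such as a CronJob definition or a shared calendar, so that the check cannot drift from the schedule it is meant to protect.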

What I Would Not Automate (Confirmed)

I would not let an agent autonomously:

Run schema migrations.
Change network policies.
Delete stateful resources.
Modify secrets or IAM.
Restart services during an active security incident.
Restart anything during a backup or maintenance window.

These are not routine failures to clean up. They are high-stakes decisions, and a human needs to be in the loop. The database incident confirmed this for me.

What I Actually Run Now

After the database incident, I rebuilt the agent with stricter rules. Here is what actually runs on my homelab cluster now:

Automated (no approval):

Read-only log queries
Read-only metric checks
Alert correlation (grouping related alerts)

Notification only (agent suggests, human decides):

Pod restarts in non-production namespaces
Service rollbacks to previous revision
ConfigMap updates for known-safe changes

Human approval required:

Anything in production namespaces
Anything touching stateful sets
Anything that modifies secrets or network policies
Anything the agent has not seen before

The agent has not taken an autonomous write action in three weeks. It has suggested five actions. I approved three and rejected two. That feels right.
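
The three tiers above can be expressed as a single classification function. This is a sketch under assumptions: the action names, the `Tier` enum, and the choice to treat read-only actions as automated even in production are mine for the sake of the example.

```python
from enum import Enum
from typing import Set


class Tier(Enum):
    AUTOMATED = "automated"   # no approval
    NOTIFY = "notify"         # agent suggests, human decides
    APPROVAL = "approval"     # human approval required


# Hypothetical action names standing in for the lists above.
READ_ONLY = {"query_logs", "check_metrics", "correlate_alerts"}
SUGGESTABLE = {"restart_pod", "rollback_service", "update_configmap"}
SENSITIVE_KINDS = {"StatefulSet", "Secret", "NetworkPolicy"}


def classify(action: str, namespace: str, kind: str, production: Set[str]) -> Tier:
    if action in READ_ONLY:
        return Tier.AUTOMATED
    if namespace in production or kind in SENSITIVE_KINDS:
        return Tier.APPROVAL
    if action in SUGGESTABLE:
        return Tier.NOTIFY
    # Anything the agent has not seen before falls into the strictest tier.
    return Tier.APPROVAL
```

The order of the checks matters. Production and sensitive kinds are tested before the suggestable list, so a pod restart in a production namespace can never slip into the lighter tier.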

The Database Incident: What Happened

Here is exactly what happened, because I think specifics matter more than general warnings.

My PostgreSQL pod runs a nightly backup at 2 AM. During the backup, the pod’s memory usage spikes from 2 GB to 6 GB. This is expected. The backup process loads data into memory before writing to S3.

My agent had a rule: if a pod’s memory usage exceeds 5 GB for more than 60 seconds, restart it. This rule was designed for memory leaks. It was not designed for backups.

At 2:17 AM, the agent observed the pod at 6.2 GB memory usage. It waited 60 seconds. The memory stayed high because the backup was still running. The agent restarted the pod. The backup was interrupted. The database had to be restored from the previous night’s backup. I lost 24 hours of data.

The fix was not better AI. The fix was better context. The agent needed to know about the backup schedule. I needed to add calendar awareness to the observability layer. I also needed to add a rate limit: max one restart per pod per hour, regardless of the signal.
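
To see why the original rule could not have behaved differently, it helps to replay it against a memory trace. The sketch below is a reconstruction of the rule as described, not the agent's actual code. The 5 GB threshold and 60 second hold come from the rule above; the function name and the sample trace are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (seconds since first sample, memory in GB)


def first_restart_time(samples: List[Sample], threshold_gb: float = 5.0,
                       hold_seconds: int = 60) -> Optional[int]:
    """Return the timestamp at which the naive rule would fire, or None."""
    breach_started: Optional[int] = None
    for ts, mem in samples:
        if mem > threshold_gb:
            if breach_started is None:
                breach_started = ts
            # Fire once memory has stayed over the threshold for the full hold.
            if ts - breach_started >= hold_seconds:
                return ts
        else:
            # Any dip below the threshold resets the timer.
            breach_started = None
    return None


# Made-up trace shaped like a backup: ramp from 2 GB, then hold near 6 GB.
backup_trace = [(0, 2.0), (15, 4.1), (30, 6.2), (45, 6.2),
                (60, 6.1), (75, 6.2), (90, 6.2), (105, 6.0)]
```

The function's only inputs are timestamps and memory readings. A leak with the same shape would fire at the same moment, so no amount of tuning the threshold separates the two cases. The missing input is the schedule.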

Governance Patterns I Actually Use Now

Dry-run mode: Test every action before applying it. I run each new rule in dry-run mode for a week before enabling it.
Approval gates: Destructive or irreversible actions require human sign-off. I use Slack notifications with approve/reject buttons.
Human-on-the-loop: The agent acts and notifies, with one-click revert. I have reverted two actions since implementing this.
Audit everything: Log the alert, the reasoning, the action, and the outcome. I review these weekly.
Rate limits: Cap how many actions an agent can take per window. Max 3 restarts per namespace per hour. No exceptions.
Calendar awareness: The agent checks for scheduled maintenance before acting. This is the one that would have prevented the database incident.
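
The rate limit is the easiest of these to sketch. The version below is a sliding-window limiter keyed by namespace, using the 3-per-hour cap mentioned above. `RestartLimiter` and `try_acquire` are hypothetical names, and state is kept in memory, so a real deployment would need to persist it across agent restarts.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RestartLimiter:
    """Sliding-window cap on restarts per key (here, per namespace)."""

    def __init__(self, max_actions: int = 3, window_seconds: float = 3600.0) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._history: Dict[str, Deque[float]] = defaultdict(deque)

    def try_acquire(self, key: str, now: float) -> bool:
        """Record and permit the action if the key is under its cap, else refuse."""
        history = self._history[key]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_actions:
            return False
        history.append(now)
        return True
```

Passing `now` in explicitly, instead of calling the clock inside the method, keeps the limiter easy to test and makes dry-run replays of old alerts straightforward.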

These patterns are not friction. They are trust infrastructure. I did not appreciate that until I lost data.

What I Learned About Signal Design

The memory threshold rule was wrong. It was too simple. A better rule would be:

Memory exceeds threshold for 60 seconds
AND the pod is not in a scheduled maintenance window
AND the pod has not been restarted in the last hour
AND the namespace is not production-critical

That is four conditions instead of one. It is more complex to maintain. It is also more correct. The simplicity of the original rule was not a virtue. It was a bug.
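
Written as code, the four conditions might look like the sketch below. The `PodContext` fields and the `should_restart` function are hypothetical; how each field gets populated (metrics, calendar, restart history, namespace labels) is left out on purpose.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PodContext:
    seconds_over_threshold: float
    in_maintenance_window: bool
    seconds_since_last_restart: float  # use float("inf") if never restarted
    namespace_is_production_critical: bool


def should_restart(ctx: PodContext, hold_seconds: float = 60.0,
                   cooldown_seconds: float = 3600.0) -> Tuple[bool, str]:
    """Return (decision, reason). All four conditions must pass."""
    checks = [
        (ctx.seconds_over_threshold >= hold_seconds,
         "memory not over threshold long enough"),
        (not ctx.in_maintenance_window,
         "scheduled maintenance window"),
        (ctx.seconds_since_last_restart >= cooldown_seconds,
         "restarted within the last hour"),
        (not ctx.namespace_is_production_critical,
         "production-critical namespace"),
    ]
    # The first failing condition becomes the reason, which goes in the audit log.
    for passed, reason in checks:
        if not passed:
            return (False, reason)
    return (True, "all conditions met")
```

Returning the reason alongside the decision is a deliberate choice. A refusal that says "scheduled maintenance window" is something I can review later, while a bare `False` is not.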

Conclusion

Autonomous remediation is not a replacement for SREs. It is a tool for reducing toil on the failures that are boring, repetitive, and safe to fix. The right architecture keeps humans in control of high-stakes decisions while letting agents handle the routine.

Start with one narrow failure. Measure outcomes. Expand scope only with evidence. And never let an agent act without context about what is supposed to be happening, not just what is currently happening.

The database incident cost me a Sunday and 24 hours of data. It was a cheap lesson. The next one might not be.