Building an AI SRE Agent with MCP: What I Actually Built
An AI SRE agent is not a chatbot with kubectl access. It is a reasoning layer on top of a tightly scoped tool layer. I learned this after giving one kubectl access and watching it suggest deleting my monitoring namespace.
I have been experimenting with MCP servers since they started getting attention in r/LocalLLaMA and the broader AI tooling community. The idea is simple: instead of giving an agent raw access to Kubernetes or AWS, you expose a small set of tools through a standard protocol. The agent discovers and calls those tools, but it cannot do anything the tools do not allow.
I built one. It reads logs and lists pods. That is it. Here is why that is the right scope for now.
Why MCP for Infrastructure Agents
The problem with most AI agent demos is scope. They connect the agent to a shell and hope the prompt is good enough. That works until it does not. I tried this. The agent suggested kubectl delete namespace monitoring because I said “clean up the old monitoring stuff.” It was technically correct. I had an old monitoring namespace. But I also had a current monitoring namespace with a similar name. The agent did not know the difference.
MCP fixes this by making the tool layer explicit. Each tool has a name, a description, a schema, and a bounded implementation. The agent can only invoke what is exposed. The operator controls what is exposed.
For infrastructure, this is exactly what I want. I do not want the agent to run arbitrary kubectl commands. I want it to call get_pod_logs, list_deployments, or restart_deployment if I have decided that is safe.
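To make "explicit tool layer" concrete, here is a minimal sketch in plain Python, without the MCP SDK. The `Tool` dataclass, the `EXPOSED` registry, and `call_tool` are hypothetical names for illustration, and the handler body is a placeholder rather than a real Kubernetes call. It only shows the shape of the idea: anything not registered cannot be invoked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    """One exposed capability: name, description, schema, bounded handler."""
    name: str
    description: str
    input_schema: dict
    handler: Callable[..., str]


def get_pod_logs(pod_name: str, namespace: str = "default") -> str:
    # Placeholder body; a real handler would call the Kubernetes API.
    return f"(logs for {namespace}/{pod_name})"


# The operator decides what goes in this registry. Nothing else is callable.
EXPOSED = {
    "get_pod_logs": Tool(
        name="get_pod_logs",
        description="Get recent logs from a Kubernetes pod.",
        input_schema={
            "type": "object",
            "properties": {
                "pod_name": {"type": "string"},
                "namespace": {"type": "string"},
            },
            "required": ["pod_name"],
        },
        handler=get_pod_logs,
    ),
}


def call_tool(name: str, arguments: dict) -> str:
    """Dispatch a tool call, refusing unknown tools and unexpected arguments."""
    tool = EXPOSED.get(name)
    if tool is None:
        return f"Unknown tool: {name}"
    unexpected = sorted(set(arguments) - set(tool.input_schema["properties"]))
    missing = sorted(set(tool.input_schema["required"]) - set(arguments))
    if unexpected or missing:
        return (
            f"Invalid arguments for {name}: "
            f"unexpected={unexpected}, missing={missing}"
        )
    return tool.handler(**arguments)
```

With this shape, a request for `delete_namespace` fails at the dispatcher no matter how the prompt was worded, because no such tool was ever registered.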
The Architecture I Actually Built
Four layers:
- Observability: Kubernetes API, metrics, logs. My existing Prometheus and Loki setup.
- MCP server: FastMCP, exposing two tools: get_pod_logs and list_pods.
- Agent runtime: Claude Code, connected to the MCP server.
- Governance: Audit logs, approval gates, policy enforcement. I have audit logs. I do not have approval gates yet. That is the next step.
Layer 4 is the one people skip. It is also the one that prevents incidents. I am adding it slowly because I want to get it right.
What the MCP Server Actually Looks Like
A minimal Kubernetes diagnostic server. This is the actual code I run:
```python
from kubernetes import client, config
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("k8s-diagnostics")
config.load_kube_config()
v1 = client.CoreV1Api()


@mcp.tool()
async def get_pod_logs(
    pod_name: str,
    namespace: str = "default",
    tail_lines: int = 100,
) -> str:
    """Get recent logs from a Kubernetes pod."""
    try:
        logs = v1.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            tail_lines=tail_lines,
        )
        return logs
    except client.exceptions.ApiException as e:
        return f"Kubernetes API error: {e.status}: {e.reason}"


@mcp.tool()
async def list_pods(namespace: str = "default") -> str:
    """List pods and their status in a namespace."""
    try:
        pods = v1.list_namespaced_pod(namespace=namespace)
        lines = [f"{'POD':<40} {'STATUS':<12} {'RESTARTS':<10}"]
        for pod in pods.items:
            restarts = sum(
                c.restart_count for c in (pod.status.container_statuses or [])
            )
            lines.append(
                f"{pod.metadata.name:<40} {pod.status.phase:<12} {restarts:<10}"
            )
        return "\n".join(lines)
    except client.exceptions.ApiException as e:
        return f"Kubernetes API error: {e.status}: {e.reason}"


if __name__ == "__main__":
    mcp.run(transport="stdio")
```

Read-only, scoped, with clear error handling. This is the only code the agent can run. It cannot delete anything. It cannot modify anything. It can only look.
Safety Defaults I Actually Use
My MCP server rules:
- Read-only by default. Only two tools, both read-only.
- Namespace allowlists. The server only connects to namespaces I explicitly configure.
- Every tool call logged with full context. I review these logs weekly.
- No write tools yet. I will add them when I have a use case that justifies the risk.
I would rather add a tool slowly than remove an incident later. The namespace deletion taught me that.
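The server listing earlier does not show the allowlist or the audit logging, so here is one way those two rules could be sketched. This is not my production code. `ALLOWED_NAMESPACES`, the `guarded` decorator, and the logger name `mcp.audit` are all hypothetical, and the wrapped tool is a synchronous placeholder called with keyword arguments only, so the example runs without a cluster.

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("mcp.audit")  # hypothetical logger name

# Hypothetical allowlist; in practice this would come from configuration.
ALLOWED_NAMESPACES = frozenset({"default", "production"})


def guarded(func):
    """Enforce the namespace allowlist and write one audit record per call."""

    @functools.wraps(func)
    def wrapper(**kwargs):
        namespace = kwargs.get("namespace", "default")
        record = {"ts": time.time(), "tool": func.__name__, "arguments": kwargs}

        if namespace not in ALLOWED_NAMESPACES:
            record["outcome"] = "denied"
            audit_log.info(json.dumps(record, default=str))
            return f"Namespace not allowed: {namespace}"

        try:
            result = func(**kwargs)
            record["outcome"] = "ok"
            return result
        except Exception as exc:
            record["outcome"] = f"error: {type(exc).__name__}"
            raise
        finally:
            # Runs for both the success and the error path.
            audit_log.info(json.dumps(record, default=str))

    return wrapper


@guarded
def list_pods(namespace: str = "default") -> str:
    # Placeholder body; a real tool would query the Kubernetes API here.
    return f"(pods in {namespace})"
```

Denied calls return an error string instead of raising, which matches how the tools above report Kubernetes API errors back to the agent. The denial still lands in the audit log, which is the part I care about when I review the logs.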
Connecting to an Agent
Kimi Code has native MCP support. You add the server to a config file, restart, and the agent discovers the tools. The connection looks like this:
```json
{
  "mcpServers": {
    "k8s-diagnostics": {
      "command": "python",
      "args": ["/path/to/mcp-server.py"]
    }
  }
}
```

Kimi Code does not need to know about Kubernetes. It needs to know about the tools. That decoupling is what makes MCP powerful. I can change the Kubernetes client version without touching Kimi Code. I can change Kimi Code without touching the server.
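For a sense of what "knowing about the tools" means on the wire, here is a rough sketch of the two JSON-RPC requests a client sends over stdio: `tools/list` to discover tools and `tools/call` to invoke one. The method names and the `name`/`arguments` parameters come from the MCP specification. The helper `jsonrpc_request` and the argument values are made up for illustration, and the initialization handshake that precedes these messages is left out.

```python
import json
from typing import Optional


def jsonrpc_request(request_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize one JSON-RPC 2.0 request as a single line of JSON."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Discovery: ask the server which tools it exposes.
discover = jsonrpc_request(1, "tools/list")

# Invocation: call one tool by name with arguments matching its schema.
invoke = jsonrpc_request(
    2,
    "tools/call",
    {
        "name": "get_pod_logs",
        "arguments": {
            "pod_name": "api-gateway",  # illustrative value
            "namespace": "production",  # illustrative value
            "tail_lines": 100,
        },
    },
)

print(discover)
print(invoke)
```

Nothing in either message mentions kubectl or the Kubernetes client. The agent side only ever sees tool names and arguments.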
What I Actually Use It For
I use this agent for exactly two things:
- Quick log checks. “Show me the last 100 lines from the api-gateway pod.” Faster than typing kubectl.
- Pod status overview. “List all pods in the production namespace.” Faster than kubectl + grep.
That is it. It does not fix anything. It does not restart anything. It just looks. This is deliberately limited. I am proving that the agent can be trusted with data before I give it any power to act.
What I Learned About Trust
The namespace incident taught me that trust is not binary. It is a ladder:
- Can I trust the agent with read-only data? I am testing this now.
- Can I trust the agent to suggest actions? Not tested yet.
- Can I trust the agent to act with approval? Not tested yet.
- Can I trust the agent to act autonomously? Not anytime soon.
I am on step 1. Most blog posts about AI agents pretend they are on step 4. They are not. They are on step 1 and calling it step 4.
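If I ever encode the ladder as policy, it might look something like the sketch below. I have not built this. `TrustLevel` and `may_execute` are hypothetical names, and the rule that read-only calls are always allowed is an assumption based on where I am today.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    """The four rungs of the ladder, lowest trust first."""
    READ_ONLY = 1
    SUGGEST = 2
    ACT_WITH_APPROVAL = 3
    AUTONOMOUS = 4


def may_execute(level: TrustLevel, writes: bool, approved: bool = False) -> bool:
    """Decide whether a tool call may run at the given trust level."""
    if not writes:
        # Read-only calls are allowed on every rung (assumption).
        return True
    if level >= TrustLevel.AUTONOMOUS:
        return True
    if level == TrustLevel.ACT_WITH_APPROVAL:
        # Writes run only after a human has approved this specific call.
        return approved
    # READ_ONLY and SUGGEST never execute writes, approved or not.
    return False
```

At step 1 every write is refused, even one somebody approved, because the approval path itself has not earned trust yet.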
Conclusion
Building an AI SRE agent is not about giving the AI more power. It is about giving it the right power, with the right boundaries. MCP servers make that possible.
Start with read-only diagnostics. Add reversible actions with approval. Measure what the agent does well and where it fails. Expand only when the data supports it.
My agent reads logs and lists pods. That is not impressive. But it is safe. And safe is the right place to start.