SRE Runbook Agent Skills
Portable Agent Skills for creating, reviewing, and safely executing production SRE runbooks.
Create runbooks. Review them with a clean context. Execute them safely, one step at a time.
This repository publishes reusable Agent Skills for operational documentation and incident response workflows. The skills are designed for SREs, platform engineers, DevOps engineers, incident commanders, and on-call engineers who want AI-assisted runbooks that are clear, safe, and executable.
The skills are inspired by Google's SRE practices and follow widely used SRE conventions: actionable alerting, emergency response, safe mitigation, verification, rollback criteria, escalation, and clear ownership.
Included Skills
| Skill | Purpose |
|---|---|
| sre-runbook-author | Create execution-focused runbooks for alerts, incidents, mitigations, rollbacks, diagnostics, and escalation. |
| sre-runbook-reviewer | Review runbooks from a clean-context perspective and identify gaps, unsafe actions, hidden assumptions, and missing verification. |
| sre-runbook-executor | Walk through an existing runbook one step at a time with evidence capture, ambiguity stops, and approval gates for risky actions. |
Example Use Cases
- "Create a runbook for this high 5xx alert."
- "Review this runbook as if you had no prior context."
- "Execute this runbook safely and stop before risky actions."
- "Find hidden assumptions or missing steps in this rollback procedure."
- "Turn this incident response process into a clear agent-executable runbook."
Installation
Compatible tools generally discover a skill as a directory that contains a SKILL.md file.
Project-local install:
```shell
mkdir -p .agents/skills
cp -r skills/sre-runbook-* .agents/skills/
```
User-level install:
```shell
mkdir -p ~/.agents/skills
cp -r skills/sre-runbook-* ~/.agents/skills/
```
Some tools use their own skill or plugin paths, such as .gemini/skills/, ~/.gemini/skills/, or product-specific extension directories. Check your tool's current documentation before installing.
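To confirm that an install landed where a tool can find it, a small check like the following may help. It is a minimal sketch: `list_installed_skills` is a hypothetical helper name, not something this repository ships, and it assumes only what is stated above, namely that each skill is a directory containing SKILL.md.

```shell
# Sketch only: list_installed_skills is a hypothetical helper, not part of
# this repository. It reports which sre-runbook-* directories under a skills
# root contain a SKILL.md file.
list_installed_skills() {
  local root="${1:-.agents/skills}"
  local dir
  for dir in "$root"/sre-runbook-*/; do
    # When nothing matches, the glob stays literal; skip it.
    [ -d "$dir" ] || continue
    if [ -f "${dir}SKILL.md" ]; then
      echo "ok: $(basename "$dir")"
    else
      echo "missing SKILL.md: $(basename "$dir")"
    fi
  done
}

# Example call (defaults to the project-local path):
# list_installed_skills .agents/skills
```

Pass a different root, such as `~/.agents/skills` or a tool-specific path, to check other install locations. Empty output means no `sre-runbook-*` directories were found under that root.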
Safety Model
The skills are built around conservative incident response:
- Start with the incident input and the applicable runbook.
- Separate read-only diagnostics from low-risk and risky actions.
- Stop when steps are ambiguous, required access is missing, or observed state diverges from the runbook.
- Ask for explicit approval before risky actions such as deploys, rollbacks, restarts, scaling operations, failovers, config changes, data changes, permission changes, alert suppression, or resource deletion.
- Preserve evidence and compare actual results with expected results.
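The gating rules above can be sketched as two small shell functions. This is an illustration, not how the skills are implemented: the function names `decide_step` and `check_result`, the risk labels as literal strings, and the choice to let low-risk steps run without approval are all assumptions made for the example.

```shell
# Sketch only: names, labels, and messages are illustrative assumptions.

# decide_step prints what an executor should do with a step, given its risk
# label and whether a human has approved it ("yes" or anything else).
decide_step() {
  local risk="$1"
  local approved="${2:-no}"
  case "$risk" in
    read-only|low-risk)
      # Assumption: low-risk steps do not need explicit approval.
      echo "run" ;;
    risky)
      if [ "$approved" = "yes" ]; then
        echo "run"
      else
        echo "stop: approval required"
      fi ;;
    *)
      # Unlabelled or unrecognised steps are treated as ambiguous.
      echo "stop: ambiguous step" ;;
  esac
}

# check_result compares the observed result with the runbook's expected result.
check_result() {
  local expected="$1"
  local actual="$2"
  if [ "$actual" = "$expected" ]; then
    echo "continue"
  else
    echo "stop: observed state diverges from runbook"
  fi
}
```

For example, `decide_step risky` prints `stop: approval required`, while `decide_step risky yes` prints `run`. Treating any unknown label as a stop keeps the default conservative: a step that nobody classified is never executed automatically.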
Example Prompts
Create a production SRE runbook for alert <ALERT_NAME>. Use placeholders for missing dashboards, thresholds, owners, and commands.
Review this runbook from a clean-context perspective. Flag anything that would confuse an on-call engineer at 3 AM.
Execute this runbook one step at a time. Start with read-only diagnostics and stop before any risky action.
Examples
- Generic high HTTP 5xx error rate runbook
- Kafka consumer lag runbook
- Kubernetes deployment rollback runbook
License
This project is licensed under the MIT License. See LICENSE.