SRE Runbook Agent Skills

Portable Agent Skills for creating, reviewing, and safely executing production SRE runbooks.

Create runbooks. Review them with a clean context. Execute them safely, one step at a time.

This repository publishes reusable Agent Skills for operational documentation and incident response workflows. The skills are designed for SREs, platform engineers, DevOps engineers, incident commanders, and on-call engineers who want AI-assisted runbooks that are clear, safe, and executable.

These skills draw on Google SRE practices and common SRE guidance: actionable alerting, emergency response, safe mitigation, verification, rollback criteria, escalation, and clear ownership.

Included Skills

Skill	Purpose
`sre-runbook-author`	Create execution-focused runbooks for alerts, incidents, mitigations, rollbacks, diagnostics, and escalation.
`sre-runbook-reviewer`	Review runbooks from a clean-context perspective and identify gaps, unsafe actions, hidden assumptions, and missing verification.
`sre-runbook-executor`	Walk through an existing runbook one step at a time with evidence capture, ambiguity stops, and approval gates for risky actions.

Example Use Cases

"Create a runbook for this high 5xx alert."
"Review this runbook as if you had no prior context."
"Execute this runbook safely and stop before risky actions."
"Find hidden assumptions or missing steps in this rollback procedure."
"Turn this incident response process into a clear agent-executable runbook."

Installation

Compatible tools generally discover skills as directories that each contain a SKILL.md file.

Project-local install:

mkdir -p .agents/skills
cp -r skills/sre-runbook-* .agents/skills/

User-level install:

mkdir -p ~/.agents/skills
cp -r skills/sre-runbook-* ~/.agents/skills/

Some tools use their own skill or plugin paths, such as .gemini/skills/, ~/.gemini/skills/, or product-specific extension directories. Check your tool's current documentation before installing.
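
Since discovery depends on each skill directory containing SKILL.md, a quick check after copying can catch a partial install. The following is a minimal sketch, not part of this repository: `list_skills` is a hypothetical helper name, and it only assumes the directory layout described above.

```shell
# Hypothetical helper (not shipped with the skills): report which
# subdirectories of a skills root contain a SKILL.md file.
list_skills() {
  root="$1"
  for dir in "$root"/*/; do
    # Skip the unexpanded glob when the root is missing or empty.
    [ -d "$dir" ] || continue
    if [ -f "${dir}SKILL.md" ]; then
      echo "ok: $(basename "$dir")"
    else
      echo "missing SKILL.md: $(basename "$dir")"
    fi
  done
}

# Check the project-local install location; prints nothing if it does not exist.
list_skills .agents/skills
```

For a user-level install, pass `~/.agents/skills` instead. Tools with their own skill paths may apply additional rules that this check does not cover.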

Safety Model

The skills are built around conservative incident response:

Start with the incident input and the applicable runbook.
Separate read-only diagnostics from low-risk and risky actions.
Stop when steps are ambiguous, required access is missing, or observed state diverges from the runbook.
Ask for explicit approval before risky actions such as deploys, rollbacks, restarts, scaling operations, failovers, config changes, data changes, permission changes, alert suppression, or resource deletion.
Preserve evidence and compare actual results with expected results.
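
The separation of action tiers and the approval gate described above can be sketched as a small shell function. This is an illustration under stated assumptions, not how the skills are implemented: `classify_action` and `may_proceed` are hypothetical names, the verb lists are examples chosen for this sketch, and letting low-risk actions proceed without approval is an assumption made here.

```shell
# Hypothetical classifier: map an action verb to a risk tier.
# The verb lists are illustrative assumptions, not taken from the skills.
classify_action() {
  case "$1" in
    get|describe|logs|status)
      echo "read-only" ;;
    annotate|label)
      echo "low-risk" ;;
    deploy|rollback|restart|scale|failover|delete)
      echo "risky" ;;
    *)
      # Unknown verbs are treated as ambiguous: stop rather than guess.
      echo "ambiguous" ;;
  esac
}

# Hypothetical gate: risky actions need explicit approval ("yes" as the
# second argument); ambiguous actions never proceed.
may_proceed() {
  tier="$(classify_action "$1")"
  approved="${2:-no}"
  case "$tier" in
    read-only|low-risk) return 0 ;;
    risky) [ "$approved" = "yes" ] ;;
    *) return 1 ;;
  esac
}
```

For example, `may_proceed get` succeeds, `may_proceed rollback` fails until it is called as `may_proceed rollback yes`, and an unrecognized verb fails even with approval, which mirrors the rule to stop on ambiguity.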

Example Prompts

Create a production SRE runbook for alert <ALERT_NAME>. Use placeholders for missing dashboards, thresholds, owners, and commands.

Review this runbook from a clean-context perspective. Flag anything that would confuse an on-call engineer at 3 AM.

Execute this runbook one step at a time. Start with read-only diagnostics and stop before any risky action.

Examples

License

This project is licensed under the MIT License. See LICENSE.
