medzin/sre-runbook-agent-skills

Agent Skills for writing, reviewing, and safely executing production SRE runbooks.

Compatible platforms: Claude Code, Codex CLI, Cursor, Gemini CLI
npx add-skill https://github.com/medzin/sre-runbook-agent-skills/tree/main

SRE Runbook Agent Skills

Portable Agent Skills for creating, reviewing, and safely executing production SRE runbooks.

Create runbooks. Review them with a clean context. Execute them safely, one step at a time.

This repository publishes reusable Agent Skills for operational documentation and incident response workflows. The skills are designed for SREs, platform engineers, DevOps engineers, incident commanders, and on-call engineers who want AI-assisted runbooks that are clear, safe, and executable.

These skills draw on Google SRE practices and follow common SRE conventions: actionable alerting, emergency response, safe mitigation, verification, rollback criteria, escalation, and clear ownership.

Included Skills

| Skill | Purpose |
| --- | --- |
| sre-runbook-author | Create execution-focused runbooks for alerts, incidents, mitigations, rollbacks, diagnostics, and escalation. |
| sre-runbook-reviewer | Review runbooks from a clean-context perspective and identify gaps, unsafe actions, hidden assumptions, and missing verification. |
| sre-runbook-executor | Walk through an existing runbook one step at a time with evidence capture, ambiguity stops, and approval gates for risky actions. |
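
To make "evidence capture" concrete, here is a minimal sketch of what recording one step's result could look like. It is an illustration only and not how sre-runbook-executor is implemented: the function name `record_evidence`, the tab-separated log format, and the exact-string comparison are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): append one evidence
# record to a log file and report whether the observed result matched the
# result the runbook expected.
#   $1 = log file, $2 = step id, $3 = expected result, $4 = actual result
record_evidence() {
  evidence_log="$1"
  step_id="$2"
  expected="$3"
  actual="$4"

  # Assumption: results are compared as exact strings.
  if [ "$actual" = "$expected" ]; then
    verdict="match"
  else
    verdict="diverged"
  fi

  # One tab-separated line per step, prefixed with a UTC timestamp.
  printf '%s\tstep=%s\texpected=%s\tactual=%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$step_id" "$expected" "$actual" "$verdict" >> "$evidence_log"

  # Exit status 0 on match, 1 on divergence, so a caller can stop.
  [ "$verdict" = "match" ]
}
```

A caller could stop the walkthrough when the function returns non-zero, for example `record_evidence evidence.log step-3 "Running" "$observed" || echo "stop: state diverged"`.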

Example Use Cases

  • "Create a runbook for this high 5xx alert."
  • "Review this runbook as if you had no prior context."
  • "Execute this runbook safely and stop before risky actions."
  • "Find hidden assumptions or missing steps in this rollback procedure."
  • "Turn this incident response process into a clear agent-executable runbook."

Installation

Compatible tools generally discover skills as directories containing SKILL.md.

Project-local install:

mkdir -p .agents/skills
cp -r skills/sre-runbook-* .agents/skills/

User-level install:

mkdir -p ~/.agents/skills
cp -r skills/sre-runbook-* ~/.agents/skills/

Some tools use their own skill or plugin paths, such as .gemini/skills/, ~/.gemini/skills/, or product-specific extension directories. Check your tool's current documentation before installing.
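
Because tools generally look for a SKILL.md inside each skill directory, a quick check after copying can catch a partial install. The helper below is a sketch under that assumption; the name `check_skills` and its output format are hypothetical and not shipped with this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): list every
# sre-runbook-* directory under a skills root and report whether it
# contains a SKILL.md file.
#   $1 = skills root, e.g. .agents/skills
check_skills() {
  skills_root="$1"
  found=0
  missing=0

  for skill_dir in "$skills_root"/sre-runbook-*; do
    # When the glob matches nothing, the literal pattern comes back; skip it.
    [ -d "$skill_dir" ] || continue
    found=$((found + 1))
    if [ -f "$skill_dir/SKILL.md" ]; then
      echo "ok: $skill_dir"
    else
      echo "missing SKILL.md: $skill_dir"
      missing=1
    fi
  done

  if [ "$found" -eq 0 ]; then
    echo "no sre-runbook-* directories under $skills_root"
    return 1
  fi
  # Status 0 only if every skill directory had a SKILL.md.
  return "$missing"
}
```

Run it as `check_skills .agents/skills` for a project-local install or `check_skills "$HOME/.agents/skills"` for a user-level one.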

Safety Model

The skills are built around conservative incident response:

  • Start with the incident input and the applicable runbook.
  • Separate read-only diagnostics from low-risk and risky actions.
  • Stop when steps are ambiguous, required access is missing, or observed state diverges from the runbook.
  • Ask for explicit approval before risky actions such as deploys, rollbacks, restarts, scaling operations, failovers, config changes, data changes, permission changes, alert suppression, or resource deletion.
  • Preserve evidence and compare actual results with expected results.
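
The separation of read-only diagnostics from risky actions, with a stop on anything ambiguous, can be sketched as a small decision function. This is a rough illustration, not the skills' actual logic: `classify_action` and `gate_step` are hypothetical names, the sketch assumes the runbook steps are `kubectl` commands, the pattern list is far from complete, and the low-risk tier is omitted for brevity. Nothing here executes a command; it only prints a decision.

```shell
#!/bin/sh
# Hypothetical classifier (not part of the repository). Assumes steps are
# kubectl commands; the patterns are illustrative and incomplete.
classify_action() {
  case "$1" in
    "kubectl get "*|"kubectl describe "*|"kubectl logs "*|"kubectl top "*|"kubectl rollout status "*)
      echo "read-only" ;;
    "kubectl delete "*|"kubectl scale "*|"kubectl apply "*|"kubectl rollout "*)
      echo "risky" ;;
    *)
      # Anything unrecognized is treated as ambiguous.
      echo "unknown" ;;
  esac
}

# Print a decision for one step: run, needs-approval, or stop.
#   $1 = command text from the runbook
#   $2 = (optional) command text a human has explicitly approved
gate_step() {
  step_cmd="$1"
  approved_cmd="${2:-}"
  case "$(classify_action "$step_cmd")" in
    read-only)
      echo "run" ;;
    risky)
      # Approval must name the exact command, not a general "yes".
      if [ "$approved_cmd" = "$step_cmd" ]; then
        echo "run"
      else
        echo "needs-approval"
      fi ;;
    *)
      echo "stop" ;;
  esac
}
```

Tying approval to the exact command text is a deliberate choice in this sketch: an approval given for one rollback cannot be reused for a different risky step.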

Example Prompts

Create a production SRE runbook for alert <ALERT_NAME>. Use placeholders for missing dashboards, thresholds, owners, and commands.
Review this runbook from a clean-context perspective. Flag anything that would confuse an on-call engineer at 3 AM.
Execute this runbook one step at a time. Start with read-only diagnostics and stop before any risky action.

License

This project is licensed under the MIT License. See LICENSE.
