Community라이팅 & 에디팅github.com

alexbobkovv/db-migration-safe

Zero-downtime SQL migration safety as an Agent Skill: detect lock hazards, rewrite unsafe DDL, auto-generate rollbacks, gate execution (Postgres-first, wraps squawk + eugene).

지원 대상~Claude Code~Codex CLI~Cursor
npx skills add alexbobkovv/db-migration-safe

Ask in your favorite AI

Open a new chat with this agent skill pre-loaded.

문서

db-migration-safe

Make SQL schema migrations safe. This skill detects locking/blocking hazards in a migration, rewrites unsafe DDL into zero-downtime multi-step migrations, generates a rollback, and gates the actual run behind a plan → validate → execute workflow.

Postgres is the strong, deterministic path. MySQL is an honest hybrid (see references/mysql-catalog.md) — materially weaker, and labeled as such.

This skill shells out to external linters (squawk, eugene) and OSC tools (gh-ost/pt-osc). It does not bundle them. If they are missing, install per references/tool-setup.md. Static analysis (squawk + eugene lint) needs no DB; eugene trace and EXECUTE need a Postgres.

When to use / when NOT to use

Use for any DDL: ALTER TABLE, CREATE INDEX, ADD/DROP COLUMN, type changes, constraints (NOT NULL, CHECK, UNIQUE, foreign keys), renames, and backfills — whether hand-written SQL or generated by an ORM. Do not use for pure SELECT/DML reporting queries, or for declarative schema engines that own their own apply step (e.g. Atlas) — those have their own safety path.

Safety contract (read first)

  • PLAN and VALIDATE never touch the user's database. They are analysis only and may run freely.
  • eugene trace runs the migration against an ephemeral throwaway Postgres (or a disposable clone) and rolls back by default. It never touches production.
  • EXECUTE applies real DDL to a real database. Never run EXECUTE unless the user explicitly asks to apply / run / ship the migration. Always dry-run first, always wrap in lock_timeout, never run CONCURRENTLY inside a transaction. EXECUTE is procedural guidance the user drives — there is no auto-execute script.

Note on auto-invocation: this is a single skill, so its frontmatter governs the whole file. PLAN/VALIDATE are safe to auto-trigger, so model-invocation is left enabled. The EXECUTE phase is gated behaviorally by the rule above. If you would rather make the entire skill user-triggered only, add disable-model-invocation: true to the frontmatter — it will require explicit invocation for analysis too.


PLAN

Run the static analyzer on the migration file. No database is required.

python3 scripts/analyze.py path/to/migration.sql --dialect postgres
# add --pg-version 13.0 to gate version-specific rules
# add --json for machine-readable output (CI)

analyze.py runs squawk and eugene lint, normalizes both into one verdict, and exits nonzero if any error-level finding is present (so it doubles as a CI gate). For --dialect mysql it applies the built-in InnoDB Online DDL heuristics instead (no free static MySQL lock-linter exists; see references/mysql-catalog.md).

Read the merged verdict:

  1. Findings — each has a source (squawk/eugene/heuristic), a rule id, a level (error/warning), the line, the message, and a workaround.
  2. Interpretation — map rule ids to plain meaning via references/squawk-rules.md and references/eugene-hints.md.
  3. Table size — static linters cannot tell 100 rows from 100k. Probe real size with scripts/table_size.sql (needs DB access) before deciding whether a "slow" op is actually a problem:
    psql "$DATABASE_URL" -v tablename=public.orders -f scripts/table_size.sql
    
  4. Escalate to trace when the verdict involves a possible table/index rewrite (E5, E10), an FK without a covering index (E15), or any lock whose real duration depends on row count. trace observes the actual lock mode and how long it is held:
    python3 scripts/trace.py path/to/migration.sql            # ephemeral temp server
    python3 scripts/trace.py path/to/migration.sql --host db.staging --port 5432 \
        --user app --database app                              # disposable clone
    

Then produce three artifacts for the user:

  • a RISK REPORT (the verdict, in plain language),
  • a SAFE REWRITE (next section), and
  • a generated ROLLBACK (scripts/gen_rollback.py).

Safe rewrites

For every error/warning finding, replace the unsafe statement with the cataloged zero-downtime rewrite. The full table of 11 operations → rewrites is in references/postgres-catalog.md (MySQL: references/mysql-catalog.md). The two Postgres primitives behind almost every rewrite:

  • CONCURRENTLY — build/drop indexes without an ACCESS EXCLUSIVE write lock (must run outside a transaction).
  • NOT VALIDVALIDATE CONSTRAINT — add a constraint in two phases so the expensive full-table validation takes a weak SHARE UPDATE EXCLUSIVE lock instead of blocking writes.

Quick reference (see catalog for the exact multi-step SQL and PG-version nuances):

UnsafeSafe rewrite
CREATE INDEXCREATE INDEX CONCURRENTLY (outside txn)
ADD COLUMN ... NOT NULL [DEFAULT]nullable col → set default → batch backfill → NOT NULL via CHECK
SET NOT NULLADD CHECK (c IS NOT NULL) NOT VALIDVALIDATESET NOT NULL → drop check
ADD FOREIGN KEY... NOT VALIDVALIDATE CONSTRAINT
ADD CHECK... CHECK (...) NOT VALIDVALIDATE CONSTRAINT
ADD UNIQUECREATE UNIQUE INDEX CONCURRENTLYADD CONSTRAINT ... USING INDEX
ALTER COLUMN ... TYPEexpand-contract (new col → dual-write → backfill → swap → drop)
rename column/tableexpand-contract, or compatibility VIEW
backfill UPDATEbounded batches in short txns, throttled

Write the rewritten statements to a new file (e.g. migration.safe.sql) so it can be re-validated.

Rollback

Generate the reverse migration from the (rewritten) DDL:

python3 scripts/gen_rollback.py migration.safe.sql > migration.rollback.sql

gen_rollback.py emits the inverse of every recognized cataloged op (ADD COLUMNDROP COLUMN, CREATE INDEX CONCURRENTLYDROP INDEX CONCURRENTLY, …). Review the output:

  • Irreversible ops (DROP COLUMN, DROP TABLE, type narrowing) cannot be restored from DDL alone. The generator marks them and emits the backup-table SQL that the forward migration must run first (<table>_<col>_backup_<date>) plus a best-effort restore. It never emits a silently-wrong inverse.
  • Unknown statement shapes produce an explicit manual rollback required: <reason> comment, not a guess.

VALIDATE

This is the gate. Re-run the analyzer on the rewritten migration:

python3 scripts/analyze.py migration.safe.sql --dialect postgres

Only proceed when it passes with zero error-level findings (warnings are configurable via .squawk.toml / -- eugene: ignore). If it still flags errors, return to Safe rewrites and loop. Re-run trace if the rewrite changed which locks are taken.

EXECUTE (gated — user-triggered only)

Do not enter this phase unless the user explicitly asked to apply the migration.

Apply the validated migration with locking guardrails. Dry-run first.

  1. Bound every lock. Wrap blocking statements so they fail fast instead of queueing behind a long transaction and stalling all traffic:
    SET lock_timeout = '5s';
    SET statement_timeout = '0';  -- or a sane cap for the specific op
    
  2. CONCURRENTLY runs outside a transaction. Do not put CREATE/DROP INDEX CONCURRENTLY in a BEGIN/COMMIT block or a transactional migration runner. After a failed concurrent build, drop the leftover INVALID index before retrying.
  3. Two-phase constraints. Run the NOT VALID step, let it commit, then VALIDATE CONSTRAINT in a separate statement.
  4. Batched backfills. Update in bounded chunks in separate short transactions, with a short sleep between batches to let replication and autovacuum keep up:
    -- repeat until 0 rows affected
    UPDATE t SET col = expr
    WHERE id IN (SELECT id FROM t WHERE col IS NULL ORDER BY id LIMIT 5000)
    
  5. Dry-run the whole sequence against the ephemeral/clone via scripts/trace.py one last time, confirm passed_all_checks, then apply to the target.

For MySQL, force ALGORITHM=…, LOCK=… so the server errors instead of silently falling back to a COPY rebuild, and delegate large rewrites to gh-ost/pt-osc — see references/mysql-catalog.md.


Reference index

  • references/postgres-catalog.md — the 11 unsafe ops → safe rewrites (core knowledge).
  • references/mysql-catalog.md — InnoDB Online DDL matrix + OSC delegation (weaker path).
  • references/squawk-rules.md — squawk rule ids and what each catches.
  • references/eugene-hints.md — eugene E1–E15 / W12–W14 and their workarounds.
  • references/tool-setup.md — installing squawk / eugene / gh-ost; ephemeral Postgres.

Scripts

  • scripts/analyze.py — orchestrates squawk + eugene lint (or MySQL heuristics) → one verdict. Exits nonzero on error-level findings.
  • scripts/trace.py — runs eugene trace (ephemeral temp server or clone) → real lock report.
  • scripts/gen_rollback.py — generates the reverse migration; flags irreversible ops.
  • scripts/table_size.sqlpg_class.reltuples size probe.

All scripts are Python 3 standard library only — no pip install.

관련 스킬