remote-gpu-trainer — the DL Experiment Lifecycle

Overview

One skill for the whole arc of a DL experiment: RUN → VERIFY → DELIVER.

RUN — get a long GPU job to start, survive, and finish, then get the result off the box. On a machine you own there is no meter; on a rented box the core insight is that you are a short-term tenant on someone else's machine — so the job is to detach the work, make the result outlive the instance, and stop the meter safely, not to provision a cluster. Platform-specific at the edges (one profiles/<platform>.md owns every path, proxy, billing verb, and spot rule), invariant at the core.
VERIFY — is this number a bug, a real effect, or noise? A surprising result is a hypothesis, not a fact to report. Platform-agnostic.
DELIVER — organize the result so every shipped number/figure/table is a deterministic function of one immutable evidence layer; provenance and cross-document consistency are locked by mechanism, not by a human remembering to update three documents. Platform-agnostic.

Two stances run through VERIFY and DELIVER: user sovereignty (the science — seed count, which samples, whether an aux channel exists — is the user's call; the skill organizes and discloses a tradeoff once, then stops nagging) and audit → disclose, not enforce (the skill is an honest auditor, not a gate guard — an integrity issue must surface with the conclusion it affects, but the skill never blocks the user from shipping). Mantra: "disclose it, or don't claim it."

Route first

RUN — own the box or rent it?
- Local (a workstation/laptop you own, no meter) → references/run-local/ and profiles/local.md.
- Rented / remote (any metered or shared box you don't own) → references/run-remote/, and pick your profiles/<platform>.md FIRST (it owns every path/verb/proxy the phases delegate to).
Then ALWAYS → VERIFY the result (references/verifying/) → DELIVER it (references/delivering/). These two are platform-agnostic; they run the same whether the job trained locally or on a rental. Skip nothing here just because the run succeeded — a green run is not a real number, and a real number is not yet a clean deliverable.

Already debugging a model that won't converge / OOMs / hangs / NaNs, regardless of where it runs? Jump straight to references/training/ (the 8-file debug layer), then come back to VERIFY before you report.

Operating principles (the spine)

The load-bearing invariants. One line each; the full cross-platform set (10 invariants for the remote lifecycle) is in references/run-remote/principles.md — read it before Phase 0 of a remote run.

Checkpoint-to-durable + idempotent resume is the universal spine. File-checkpoint to the durable location + unconditional load-latest-on-startup is the one mechanism that survives an SSH drop, a Slurm walltime kill, a K8s reschedule, a spot preemption, a Colab disconnect. The detach primitive (tmux / sbatch / Job) is the swappable plug; this is the invariant.
Trust the artifact you loaded, not a log line that claims success. "synced / saved / done" lies under a silently-failed write; a watcher's own state is also a claim — reconcile it against the real process / artifact / pixels / bytes.
Cheap checks before expensive compute. A 1–2 batch CPU smoke (logger off) kills import/config/ shape/scale bugs for ~free, before they bill GPU-hours.
Cost and destructive actions are the user's call. Never auto-release/terminate, never delete durable files without confirmation; if cleanup can't free space, ask to expand the disk, don't silently shrink the experiment.
Audit → disclose, not enforce. What is mandatory is disclosure, not the fix. An integrity finding (no disjoint val, leakage, test touched during selection, a number you can't re-derive) must ride with the conclusion — but the skill discloses, it does not block.

RUN — local (a box you own)

No meter, no teardown clock — the risks move from money to resource contention and your machine's stability. The discipline that does not relax: env hygiene, resource awareness, artifact/checkpoint care, and "state the seed." Start at profiles/local.md, then the matching doc:

Env hygiene — never train/install in conda base on a persistent box; the 4-step gate (enumerate → pick the project env → confirm sys.executable → run) → references/run-local/env-hygiene.md.
Launch & detach — nohup/tmux, log + alive probe, don't foreground-block → references/run-local/launch.md.
Single-node multi-GPU — torchrun/accelerate DDP env contract, the first-run rank/hang basics → references/run-local/multi-gpu.md (multi-node → references/run-remote/multinode.md).
Local OOM — the fit-it ladder on hardware you can't rent bigger → references/run-local/local-oom.md.

RUN — remote (a box you rent)

Pick your profile FIRST — it binds every concrete path/proxy/credential/billing verb/spot rule the phases delegate to. Mental verb model (one API across platforms; the profile binds each verb to real commands): up (rent+reach) → push (code/data on) → run (detached + checkpointing) → watch (durable monitor) → pull (results off + verify) → down (stop the meter).

You're on…	Profile	Meter-stop verb (the trap)
AutoDL (deepest, battle-tested)	`profiles/autodl.md`	关机 stops meter, keeps disk (the AutoDL exception)
RunPod	`profiles/runpod.md`	terminate (stop still bills 2×; destroys volume disk)
vast.ai	`profiles/vastai.md`	destroy (stop bills disk forever)
Lambda	`profiles/lambda.md`	terminate (no stop state)
Paperspace	`profiles/paperspace.md`	destroy + release IP + delete storage
恒源云 / 矩池云 / Featurize / 揽睿星舟	`profiles/china.md`	per-platform (data disk often bills while stopped)
Bare SSH / Slurm / K8s / Colab	`profiles/generic-ssh.md`	manual (a forgotten box bills 24/7)

The 6-phase lifecycle (full per-platform checklist → references/run-remote/lifecycle_checklist.md): 0 env audit (df -i not just df -h, cgroup memory.max, checkpoint disk budget) · 1 SSH + credentials (the prebuilt image is the env — don't conda create on a rental; secrets via stdin) · 2 wrapper + CPU-smoke gate before renting · 3 detached launch (probe, then hand back — never a blocking sleep) · 4 durable monitoring (the four-layer architecture → references/run-remote/monitoring_patterns.md; a session-bound watcher dies with the session) · 5 aggregate + verify + teardown.

Iron Law — teardown gate: NO release / terminate / destroy / file-delete until checkpoints are pulled to local AND verified by load (scripts/verify_local.py), and the user has explicitly approved the cost-affecting action. "It looked done in the log" is not evidence. On most platforms the meter-stopping action is irreversible (deletes the disk) — confirmation matters more, not less.

Other remote references: ssh_transport.md (rsync/scp resumable, secrets-via-stdin, CRLF) · spot-resilience.md (preemption grace, Young/Daly cadence, atomic-write resume) · china-network.md (mirrors + HF_ENDPOINT + the no_proxy trap) · parallel_ablation.md (fan-out independence + reconciliation) · multinode.md (NCCL/fabric, advanced) · gotchas_universal.md (the full U1–U43 catalog with a grep index).

When training itself breaks (the model, not the platform)

Once the box runs, training breaks in its own ways — local or remote, the same debug layer (references/training/, 8 files; each entry symptom → root cause → fix with cited docs). Route by symptom:

OOM / won't fit (CUDA-VRAM or host-RAM, OOM-at-a-step, the fit-it ladder) → oom-memory.md.
Multi-GPU launch / HANGS (torchrun/accelerate/deepspeed env contract, DDP/FSDP/ZeRO) → distributed-launch.md.
NaN / Inf / loss spikes (fp16/bf16/tf32, AMP/GradScaler, LLM divergence) → precision-stability.md.
Too slow (GPU- vs data- vs comms-bound, dataloader knobs, torch.compile traps) → throughput-profiling.md.
Resume bugs (full-state + sharded save/resume; epoch restart, reshuffle, scaler/EMA dropped) → checkpoint-resume.md.
Per-domain gotchas (LLM, vision det/seg, diffusion, RL, multimodal/VLM) → by-domain.md.
Runs but won't learn (overfit-one-batch, params-not-updating, LR/schedule, loss-function footguns, freezing) → convergence-debugging.md.
Dataloader correctness (worker-RNG aug duplication, IterableDataset sharding, RGB-vs-BGR / ÷255 / set_epoch) → data-pipeline.md.

VERIFY — is the number real?

Before you trust or report any metric, ablation delta, or "it works now": classify it bug / effect / noise, hold a comparison to exactly one changed variable, and probe leakage / fair-comparison / variance / metric-direction. A number you can't re-derive from the saved artifact is not a result yet. Stance: audit → disclose — surface an integrity issue with the conclusion, never silently pass or hard-block.

Full methodology (the 14-section probe ladder + the 6 invariants) → references/verifying/methodology.md.
Constant / degenerate output, real == shuffle, model-ignores-input → references/verifying/representation-collapse.md.
A green smoke that hides undertraining vs a real bug; loss-low-but-samples-bad → references/verifying/smoke-hidden-failures.md.

State the metric's direction when comparing (PSNR/SSIM/mAP ↑ better; LPIPS/NMSE/loss ↓ better) — never assume. Tracker forensics / pruning duplicate runs → scripts/wandb_forensics.py.

DELIVER — organize → single source → figures

Make the deliverable a deterministic function of one immutable, versioned evidence layer; lock provenance and cross-document consistency by mechanism, not by hand. Eight principles, tiered (references/delivering/principles.md):

CORE (wire from the first real number) — P1 evidence/presentation layers cleanly separated (zero hand-typed numbers in the presentation layer) · P2 numbers generated from results.json, not transcribed (so a stale number is physically impossible) · P3 content-addressed + append-only immutable runs (re-run mints a new <run-id>, never overwrites) · P7 figure-chain traceability (results.json → source_data.csv → figure + .provenance sidecar) + the pixel re-open gate (a figure that saved is not a figure that's correct) · P8 delivery = a disclosure gate, not a blocking one.
Advanced (flip on near submission — YAGNI applies to provenance too) — P4 data/split versioned
- hash-pinned (leakage becomes machine-checkable) · P5 one-command repro from a clean clone (scripts/repro.sh.template).

Mechanics: the on-disk tree → references/delivering/data-architecture.md; the two manifest schemas → references/delivering/evidence-manifest-schema.md; the one-folder-per-figure convention + pixel gate → references/delivering/figures.md; the per-number disclosure checklist → references/delivering/delivery-gate.md.

EVIDENCE.json is the project-level single source of truth — a machine-readable claims↔evidence map (each claim ← the supporting exp-id / metric / figure + a paper anchor + the repo file:line). scripts/reconcile.py greps the whole repo against its authoritative values to catch cross-document drift; scripts/manifest_scaffold.py stamps the structure.

Companion skills (all OPTIONAL — this skill is standalone)

Recommended separate installs that deepen RUN / VERIFY / DELIVER; the skill needs none of them and works fully standalone. One-line-each list, what each adds, and the no-companion fallback → references/companions.md. In short: figure drawing (nature-figure / publication-chart / scipilot-figure), experiment verification (the experiment-verifier agent), parallel ablation (superpowers:dispatching-parallel-agents), HF transport + hosted tracker (huggingface-skills:hf-cli / huggingface-trackio).

Getting better over time

The skill is static, but every run can teach it a gotcha — without corrupting it. Protocol → references/self-improvement.md: only sediment a root-caused, reproduced, generalizable gotcha (a one-off flake is a hypothesis, not a gotcha); route user/project-specific facts to the host's memory and generalizable ones to a proposed catalog edit; never silently rewrite a skill file — draft the symptom → root cause → fix and let the user approve. Platform facts carry a verified <month> stamp — re-verify any teardown/billing fact against current docs before betting money or data (scripts/check_staleness.py).

Bundled resources

Load only what the current phase needs (the body sections above name the individual files).

references/run-local/ — own-a-box: env-hygiene · launch · multi-gpu · local-oom.
references/run-remote/ — rented-box: principles · lifecycle_checklist · monitoring_patterns · ssh_transport · spot-resilience · china-network · parallel_ablation · multinode · gotchas_universal (U1–U43).
references/training/ — the DL-training debug layer (8 files; local/remote-agnostic) — routed above.
references/verifying/ — is-the-number-real: methodology · representation-collapse · smoke-hidden-failures.
references/delivering/ — deliverable: principles · data-architecture (+EVIDENCE.json) · evidence-manifest-schema · figures · delivery-gate.
references/companions.md (optional skills + fallbacks) · references/self-improvement.md (capture-a-gotcha loop).
profiles/<platform>.md — per-platform substrate (7 rental profiles + local.md; _schema.md = the fields).
scripts/ — wrappers (run_one/run_queue), monitors (mem_monitor, gpu_health, health_patrol.sh.template), transfer (download_loop, aggregate_to_fs, setup-china-mirrors), verify_local.py, delivering (manifest_scaffold.py, reconcile.py, repro.sh.template), wandb_forensics.py, check_staleness.py.
examples/autodl_sweep/ — one runnable worked case · evals/ — the regression harness.

Hanyuyuan6/remote-gpu-trainer

Ask in your favorite AI

文档

remote-gpu-trainer — the DL Experiment Lifecycle

Overview

Route first

Operating principles (the spine)

RUN — local (a box you own)

RUN — remote (a box you rent)

When training itself breaks (the model, not the platform)

VERIFY — is the number real?

DELIVER — organize → single source → figures

Companion skills (all OPTIONAL — this skill is standalone)

Getting better over time

Bundled resources

相关技能

steipete/sag

steipete/oracle

steipete/peekaboo

obra/brainstorming

affaan-m/prisma-patterns

affaan-m/django-celery