Performance patterns skill

A growing catalog of well-known code patterns that cause performance problems, with detection signals and resolution playbooks for each. The core pattern catalog focuses on x86 CPU code, with platform routing for Windows, WSL Linux, native Linux, and CUDA/NVIDIA environments.

The optimization knowledge is portable where the hardware is the same. What changes by platform is the mechanics: profiler vocabulary, compiler flags, debug-info format, synchronization primitives, CPU feature detection, and whether the bottleneck is CPU host code or CUDA device work.

Step 0 — Route the platform

If the user mentions WSL, Linux, CUDA, NVIDIA, GPU, Nsight, a driver/toolkit mismatch, or containers, or if the platform is unclear, read `references/platform-routing.md` first.

Then load the platform-specific reference:

| Platform / symptom | Read |
| --- | --- |
| Windows native C/C++ CPU performance | `PORTING-NOTES.md` |
| WSL Linux CPU performance | `references/wsl-linux.md`, then `references/linux-native.md` |
| Native Linux CPU performance | `references/linux-native.md` |
| CUDA/NVIDIA setup or GPU performance | `references/cuda.md` |

Use `scripts/collect-perf-env.ps1` on Windows and `scripts/collect-perf-env.sh` inside WSL or native Linux when the environment itself is the suspected problem, or when the user has not provided enough toolchain or profiler context.
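
As an illustration of one routing signal, the sketch below distinguishes WSL from native Linux. It is not taken from `scripts/collect-perf-env.sh`, whose contents are not shown here; it assumes the commonly used heuristic that WSL kernels report a release string containing "microsoft" (in any letter case) in `/proc/sys/kernel/osrelease`. The function names `release_looks_like_wsl` and `running_under_wsl` are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: case-insensitive substring search
 * (a portable stand-in for the non-standard strcasestr). */
static int contains_ci(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0)
        return 1;
    for (; *haystack; ++haystack) {
        size_t i = 0;
        while (i < n && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
                   tolower((unsigned char)needle[i]))
            ++i;
        if (i == n)
            return 1;
    }
    return 0;
}

/* Returns 1 if a kernel release string looks like WSL, else 0.
 * Assumption: WSL1 and WSL2 kernels both embed "Microsoft"/"microsoft". */
int release_looks_like_wsl(const char *release)
{
    return contains_ci(release, "microsoft");
}

/* Reads the running kernel's release string.
 * Returns 1 for WSL, 0 for anything else or if the file is unreadable. */
int running_under_wsl(void)
{
    char buf[256] = {0};
    FILE *f = fopen("/proc/sys/kernel/osrelease", "r");
    if (!f)
        return 0;
    if (!fgets(buf, sizeof buf, f))
        buf[0] = '\0';
    fclose(f);
    return release_looks_like_wsl(buf);
}
```

A result of 1 would point to the WSL row of the table (`references/wsl-linux.md`, then `references/linux-native.md`); a result of 0 on a Linux host would point to `references/linux-native.md`. Treat this as a heuristic only, since a custom kernel can report any release string.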

How to use this skill

Step 1 — Load the right file for your context

| Context | Read this file |
| --- | --- |
| You have profiling output (VTune, AMD uProf, ETW/WPA, perf, flamegraph, Nsight summary, etc.) | `triggers/from-profile.md` |
| You are reading existing source code and have no profiling data yet | `triggers/from-source.md` |
| You are writing new performance-sensitive C/C++ or SIMD code | `guidelines/new-code.md` |

The two trigger files cover the same patterns; they are split by starting point so that you load only the one that matches your context. `guidelines/new-code.md` is a write-time checklist: load it instead of a trigger file when you are generating new code rather than reviewing existing code.

Step 2 — Identify the matching pattern

Each trigger file contains a compact table and brief descriptions — enough to decide whether the code or profile matches a known pattern.

Step 3 — Read the pattern detail file

When a pattern matches, read the corresponding file from `patterns/`. Do not attempt the fix from memory.

Step 4 — Apply the fix and verify

Follow the step-by-step instructions and verification method in the pattern file, using the platform mechanics from the routing file loaded in Step 0.

Several patterns can apply to the same code. Check every plausible match before settling on a fix.

Reusable library modules

These standalone implementation guides are available to any agent working in this skill, not only when following a specific pattern. Load the relevant file directly if the capability is needed.

| Module | What it provides |
| --- | --- |
| `library/cpu-dispatch.md` | Runtime CPU feature detection and variant selection on Windows. Manual function-pointer dispatch only, because `target_clones` is unavailable on the PE/COFF target (no IFUNC). Covers `__builtin_cpu_supports`, the explicit `__cpuid`/`_xgetbv` path, and per-variant `__attribute__((target(...)))`. Vendor-neutral (Intel + AMD). Use whenever a function has multiple performance-level implementations that need to be wired together at runtime. |
| `patterns/simd-upconversion-impl.md` | Full step-by-step zipper algorithm for doubling vector register width in intrinsics (SSE→AVX2 or AVX2→AVX-512); AVX-512 accumulator template; post-transformation checklist (CPUID guards, vzeroupper, target attributes). |
| `patterns/fast-crc32c-impl.md` | Drop-in CRC32C library: AVX-512 VPCLMULQDQ fusion, SSE4.2 + PCLMULQDQ multi-accumulator, plain C fallback. Runtime CPU dispatch wrapper included. Use whenever new CRC32C code is needed or an existing implementation is the bottleneck. |
| `references/platform-routing.md` | Route Windows / WSL / native Linux / CUDA requests before applying a pattern. |
| `references/cuda.md` | CUDA setup and performance triage: driver/toolkit visibility, Nsight Systems vs Nsight Compute, WSL GPU boundaries. |
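
To make the dispatch idea concrete, here is a minimal sketch of manual function-pointer dispatch applied to CRC32C, assuming GCC or Clang on x86. It is not the code from `library/cpu-dispatch.md` or `patterns/fast-crc32c-impl.md`: it uses only the byte-at-a-time SSE4.2 `crc32` instruction and a bitwise C fallback, not the PCLMULQDQ or VPCLMULQDQ variants, and the names `crc32c`, `crc32c_scalar`, `crc32c_sse42`, and `crc32c_resolve` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <nmmintrin.h> /* _mm_crc32_u8 (SSE4.2) */
#define CRC32C_X86_DISPATCH 1
#endif

/* Portable bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_scalar(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef CRC32C_X86_DISPATCH
/* Per-variant target attribute: only this function is compiled for
 * SSE4.2, so the rest of the file still runs on older CPUs. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_sse42(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

typedef uint32_t (*crc32c_fn)(const void *, size_t);

/* Picks the best variant the running CPU supports. */
static crc32c_fn crc32c_resolve(void)
{
#ifdef CRC32C_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_sse42;
#endif
    return crc32c_scalar;
}

/* Public entry point: resolves once, then calls through the pointer.
 * Assumption for this sketch: first use happens on a single thread.
 * Real code would use an atomic pointer or an init-once primitive. */
uint32_t crc32c(const void *data, size_t len)
{
    static crc32c_fn impl; /* NULL until the first call */
    if (!impl)
        impl = crc32c_resolve();
    return impl(data, len);
}
```

Calling `crc32c("123456789", 9)` should return the standard CRC-32C check value `0xE3069283` regardless of which variant is selected, which gives a quick way to confirm that every variant behind the pointer computes the same function.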

Cross-skill integration

Profiler/data-collection skills can invoke this skill after identifying a known pattern. This skill also works standalone from source code, profiler snippets, or environment probe output.

Skill status

Ported and reviewed for Windows native C/C++, then extended with routing notes for WSL Linux, native Linux, and CUDA/NVIDIA environments. The following core files are present: this `SKILL.md`, `PORTING-NOTES.md`, `triggers/from-source.md`, `triggers/from-profile.md`, `guidelines/new-code.md`, `design.md`, `library/cpu-dispatch.md`, CRC32C library sources, platform references, benchmark tests, and all pattern files:

parallel-accumulator, missing-vzeroupper, missing-restrict
ttas, false-sharing, per-cpu-stats, cold-path-annotation
cv-thundering-herd, mutex-to-rwlock, simd-sort
simd-upconversion + simd-upconversion-impl
fast-crc32c + fast-crc32c-impl
library-version-upgrade
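
For a sense of what one of these entries looks like in code, the sketch below illustrates the idea usually meant by "parallel accumulator". The pattern file itself is not reproduced here, so this is an assumption about its content: a reduction with a single accumulator forms one long dependency chain, and splitting it across several independent accumulators lets the CPU overlap the additions. The function names are hypothetical.

```c
#include <stddef.h>

/* Baseline: one accumulator, so each add waits for the previous one. */
double sum_single(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Four independent accumulators: the four adds in each iteration do not
 * depend on each other, so they can be in flight at the same time. */
double sum_parallel4(const double *x, size_t n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    double acc = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i) /* leftover elements when n is not a multiple of 4 */
        acc += x[i];
    return acc;
}
```

The two functions add the elements in a different order, so for general floating-point input they can differ in the last bits; they agree exactly only when every partial sum is exactly representable. Whether four accumulators is the right count depends on the target's add latency and throughput, which is the kind of detail to take from the pattern file and a measurement rather than from this sketch.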

`references/library-versions.md` intentionally keeps unverified third-party DLL entries in a TODO table; do not promote them to recommendations without primary-source verification and measurement on Windows.

2233admin/performance-patterns-skill
