name: performance-patterns
description: >-
  Detect, triage, and fix performance problems across Windows native C/C++ (clang-cl/MSVC; VTune, AMD uProf, ETW/WPA), WSL Linux, native Linux (perf/flamegraphs/GCC/Clang), and CUDA/NVIDIA environments (nvidia-smi, nvcc, Nsight Systems, Nsight Compute). Invoke when the user asks to optimize or debug performance, review SIMD/vectorized code, investigate WSL/Linux/Windows profiler output, troubleshoot CUDA driver/toolkit/GPU visibility issues, or decide whether a bottleneck is CPU host code, WSL boundary overhead, native Linux tooling, or CUDA kernels. Trigger on serial accumulator loops, narrow SIMD, _mm* intrinsics, HITM/false sharing, missing restrict/vzeroupper, condition-variable thundering herd, mutex-to-rwlock, hot library/DLL symbols, fast CRC32C, known algorithms, SIMD sort, WSL filesystem/perf issues, and CUDA launch/copy/kernel/occupancy/toolkit problems.
Performance patterns skill
A growing catalog of well-known code patterns that cause performance problems, with detection signals and resolution playbooks for each. The core pattern catalog focuses on x86 CPU code, with platform routing for Windows, WSL Linux, native Linux, and CUDA/NVIDIA environments.
The optimization knowledge is portable where the hardware is the same. What changes by platform is the mechanics: profiler vocabulary, compiler flags, debug-info format, synchronization primitives, CPU feature detection, and whether the bottleneck is CPU host code or CUDA device work.
Step 0 — Route the platform
If the user mentions WSL, Linux, CUDA, NVIDIA, GPU, Nsight, driver/toolkit
mismatch, containers, or the platform is unclear, read
references/platform-routing.md first.
Then load the platform-specific reference:
| Platform / symptom | Read |
|---|---|
| Windows native C/C++ CPU performance | PORTING-NOTES.md |
| WSL Linux CPU performance | references/wsl-linux.md, then references/linux-native.md |
| Native Linux CPU performance | references/linux-native.md |
| CUDA/NVIDIA setup or GPU performance | references/cuda.md |
Use scripts/collect-perf-env.ps1 on Windows and
scripts/collect-perf-env.sh inside WSL/native Linux when the environment is
the problem or the user has not provided enough toolchain/profiler context.
How to use this skill
Step 1 — Load the right file for your context
| Context | Read this file |
|---|---|
| You have profiling output (VTune, AMD uProf, ETW/WPA, perf, flamegraph, Nsight summary, etc.) | triggers/from-profile.md |
| You are reading existing source code and have no profiling data yet | triggers/from-source.md |
| You are writing new performance-sensitive C/C++ or SIMD code | guidelines/new-code.md |
The trigger files cover all the same patterns; they are separated so you only
load what is relevant. guidelines/new-code.md is a write-time checklist —
load it instead of a trigger file when generating new code, not reviewing it.
Step 2 — Identify the matching pattern
Each trigger file contains a compact table and brief descriptions — enough to decide whether the code or profile matches a known pattern.
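As an illustration of what a source-level match can look like, the sketch below shows the "serial accumulator loop" trigger named in the skill description next to a multi-accumulator rewrite. The function names sum_serial and sum_4acc are hypothetical and appear nowhere in this skill. The sketch only shows the shape to recognize; the pattern file in patterns/ remains the reference for the actual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Serial form (the trigger): each add depends on the previous add,
 * so the loop runs at floating-point add latency, not throughput. */
float sum_serial(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i];
    }
    return acc;
}

/* Rewrite with four independent accumulators (hypothetical sketch).
 * The four dependency chains can overlap in the pipeline. */
float sum_4acc(const float *x, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i + 0];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    float acc = (a0 + a1) + (a2 + a3);
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i) {
        acc += x[i];
    }
    return acc;
}
```

The rewrite changes the order of the floating-point additions, so its result can differ from the serial sum in the low bits for general inputs. This reordering is also why compilers typically do not perform the transformation on their own without relaxed floating-point flags.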
Step 3 — Read the pattern detail file
When a pattern matches, read the corresponding file from patterns/. Do not
attempt the fix from memory.
Step 4 — Apply the fix and verify
Follow the step-by-step instructions and verification method in the pattern file, using the platform mechanics from the routing file loaded in Step 0.
Multiple patterns can co-apply. Check all plausible matches before picking one.
Reusable library modules
These standalone implementation guides are available to any agent working in this skill, not only when following a specific pattern. Load the relevant file directly if the capability is needed.
| Module | What it provides |
|---|---|
| library/cpu-dispatch.md | Runtime CPU feature detection and variant selection on Windows. Manual function-pointer dispatch only — target_clones is unavailable on the PE/COFF target (no IFUNC). Covers __builtin_cpu_supports, the explicit __cpuid/_xgetbv path, and per-variant __attribute__((target(...))). Vendor-neutral (Intel + AMD). Use whenever a function has multiple performance-level implementations that need to be wired together at runtime. |
| patterns/simd-upconversion-impl.md | Full step-by-step zipper algorithm for doubling vector register width in intrinsics (SSE→AVX2 or AVX2→AVX-512); AVX-512 accumulator template; post-transformation checklist (CPUID guards, vzeroupper, target attributes). |
| patterns/fast-crc32c-impl.md | Drop-in CRC32C library: AVX-512 VPCLMULQDQ fusion, SSE4.2 + PCLMULQDQ multi-accumulator, plain C fallback. Runtime CPU dispatch wrapper included. Use whenever new CRC32C code is needed or an existing implementation is the bottleneck. |
| references/platform-routing.md | Route Windows / WSL / native Linux / CUDA requests before applying a pattern. |
| references/cuda.md | CUDA setup and performance triage: driver/toolkit visibility, Nsight Systems vs Nsight Compute, WSL GPU boundaries. |
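To make the manual function-pointer dispatch described for library/cpu-dispatch.md concrete, here is a minimal sketch. It assumes a GCC- or Clang-family compiler (including clang-cl) and does not reproduce the contents of library/cpu-dispatch.md. The kernel sum_u32 and all helper names are hypothetical and chosen only for illustration. The explicit __cpuid/_xgetbv path mentioned in the table is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

/* Hypothetical kernel signature: sum of 32-bit integers, wrapping mod 2^32. */
typedef uint32_t (*sum_u32_fn)(const uint32_t *data, size_t n);

/* Baseline variant: runs on any CPU. */
static uint32_t sum_u32_scalar(const uint32_t *data, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}

#ifdef SKETCH_HAVE_X86
/* AVX2 variant: the per-function target attribute lets this compile
 * without enabling AVX2 for the whole translation unit. */
__attribute__((target("avx2")))
static uint32_t sum_u32_avx2(const uint32_t *data, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(data + i));
        acc = _mm256_add_epi32(acc, v);
    }
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint32_t total = 0;
    for (int k = 0; k < 8; ++k) total += lanes[k];
    for (; i < n; ++i) total += data[i]; /* scalar tail */
    return total;
}
#endif

/* Pick the best variant the running CPU reports support for. */
static sum_u32_fn resolve_sum_u32(void) {
#ifdef SKETCH_HAVE_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_u32_avx2;
#endif
    return sum_u32_scalar;
}

/* Public entry point: resolves once, then calls through the pointer.
 * Assumption: single-threaded first call. A production version would
 * resolve at startup or publish the pointer atomically. */
uint32_t sum_u32(const uint32_t *data, size_t n) {
    static sum_u32_fn impl = NULL;
    if (impl == NULL) impl = resolve_sum_u32();
    return impl(data, n);
}
```

Callers use only sum_u32, so adding another variant (for example an AVX-512 one) would touch the resolver alone. The scalar fallback keeps the sketch buildable on non-x86 targets.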
Cross-skill integration
Profiler/data-collection skills can invoke this skill after identifying a known pattern. This skill also works standalone from source code, profiler snippets, or environment probe output.
Skill status
Ported and reviewed for Windows native C/C++, then extended with routing notes
for WSL Linux, native Linux, and CUDA/NVIDIA environments. The following core
files are present: this SKILL.md, PORTING-NOTES.md,
triggers/from-source.md, triggers/from-profile.md, guidelines/new-code.md,
design.md, library/cpu-dispatch.md, CRC32C library sources, platform
references, benchmark tests, and all pattern files:
- parallel-accumulator, missing-vzeroupper, missing-restrict
- ttas, false-sharing, per-cpu-stats, cold-path-annotation
- cv-thundering-herd, mutex-to-rwlock, simd-sort
- simd-upconversion + simd-upconversion-impl
- fast-crc32c + fast-crc32c-impl
- library-version-upgrade
references/library-versions.md intentionally keeps unverified third-party DLL
entries in a TODO table; do not promote those to recommendations without primary
source verification and Windows measurement.