2233admin/performance-patterns-skill

SKILL.md playbook that helps Codex and Claude route Windows, WSL, Linux, and CUDA performance problems.

Works with: Claude Code, Codex CLI, ~Cursor
npx add-skill 2233admin/performance-patterns-skill

name: performance-patterns
description: >-
  Detect, triage, and fix performance problems across Windows native C/C++
  (clang-cl/MSVC; VTune, AMD uProf, ETW/WPA), WSL Linux, native Linux
  (perf/flamegraphs/GCC/Clang), and CUDA/NVIDIA environments (nvidia-smi, nvcc,
  Nsight Systems, Nsight Compute). Invoke when the user asks to optimize or
  debug performance, review SIMD/vectorized code, investigate WSL/Linux/Windows
  profiler output, troubleshoot CUDA driver/toolkit/GPU visibility issues, or
  decide whether a bottleneck is CPU host code, WSL boundary overhead, native
  Linux tooling, or CUDA kernels. Trigger on serial accumulator loops, narrow
  SIMD, _mm* intrinsics, HITM/false sharing, missing restrict/vzeroupper,
  condition-variable thundering herd, mutex-to-rwlock, hot library/DLL symbols,
  fast CRC32C, known algorithms, SIMD sort, WSL filesystem/perf issues, and
  CUDA launch/copy/kernel/occupancy/toolkit problems.

Performance patterns skill

A growing catalog of well-known code patterns that cause performance problems, with detection signals and resolution playbooks for each. The core pattern catalog focuses on x86 CPU code, with platform routing for Windows, WSL Linux, native Linux, and CUDA/NVIDIA environments.

The optimization knowledge is portable where the hardware is the same. What changes by platform is the mechanics: profiler vocabulary, compiler flags, debug-info format, synchronization primitives, CPU feature detection, and whether the bottleneck is CPU host code or CUDA device work.


Step 0 — Route the platform

If the user mentions WSL, Linux, CUDA, NVIDIA, GPU, Nsight, driver/toolkit mismatch, containers, or the platform is unclear, read references/platform-routing.md first.

Then load the platform-specific reference:

| Platform / symptom | Read |
| --- | --- |
| Windows native C/C++ CPU performance | PORTING-NOTES.md |
| WSL Linux CPU performance | references/wsl-linux.md, then references/linux-native.md |
| Native Linux CPU performance | references/linux-native.md |
| CUDA/NVIDIA setup or GPU performance | references/cuda.md |

Use scripts/collect-perf-env.ps1 on Windows and scripts/collect-perf-env.sh inside WSL/native Linux when the environment is the problem or the user has not provided enough toolchain/profiler context.


How to use this skill

Step 1 — Load the right file for your context

| Context | Read this file |
| --- | --- |
| You have profiling output (VTune, AMD uProf, ETW/WPA, perf, flamegraph, Nsight summary, etc.) | triggers/from-profile.md |
| You are reading existing source code and have no profiling data yet | triggers/from-source.md |
| You are writing new performance-sensitive C/C++ or SIMD code | guidelines/new-code.md |

The trigger files cover the same patterns; they are split so that you load only what is relevant. guidelines/new-code.md is a write-time checklist: load it instead of a trigger file when you are generating new code rather than reviewing existing code.

Step 2 — Identify the matching pattern

Each trigger file contains a compact table and brief descriptions — enough to decide whether the code or profile matches a known pattern.

Step 3 — Read the pattern detail file

When a pattern matches, read the corresponding file from patterns/. Do not attempt the fix from memory.

Step 4 — Apply the fix and verify

Follow the step-by-step instructions and verification method in the pattern file, using the platform mechanics from the routing file loaded in Step 0.

Multiple patterns can co-apply. Check all plausible matches before picking one.


Reusable library modules

These standalone implementation guides are available to any agent working in this skill, not only when following a specific pattern. Load the relevant file directly if the capability is needed.

| Module | What it provides |
| --- | --- |
| library/cpu-dispatch.md | Runtime CPU feature detection and variant selection on Windows. Manual function-pointer dispatch only: target_clones is unavailable on the PE/COFF target (no IFUNC). Covers __builtin_cpu_supports, the explicit __cpuid/_xgetbv path, and per-variant __attribute__((target(...))). Vendor-neutral (Intel + AMD). Use whenever a function has multiple performance-level implementations that need to be wired together at runtime. |
| patterns/simd-upconversion-impl.md | Full step-by-step zipper algorithm for doubling vector register width in intrinsics (SSE→AVX2 or AVX2→AVX-512); AVX-512 accumulator template; post-transformation checklist (CPUID guards, vzeroupper, target attributes). |
| patterns/fast-crc32c-impl.md | Drop-in CRC32C library: AVX-512 VPCLMULQDQ fusion, SSE4.2 + PCLMULQDQ multi-accumulator, plain C fallback. Runtime CPU dispatch wrapper included. Use whenever new CRC32C code is needed or an existing implementation is the bottleneck. |
| references/platform-routing.md | Route Windows / WSL / native Linux / CUDA requests before applying a pattern. |
| references/cuda.md | CUDA setup and performance triage: driver/toolkit visibility, Nsight Systems vs Nsight Compute, WSL GPU boundaries. |
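
To make the manual function-pointer dispatch described for library/cpu-dispatch.md concrete, here is a minimal sketch. It is not taken from that file. The names sum_u32, sum_u32_scalar, sum_u32_avx2, and resolve_sum_u32 are hypothetical, and the sketch assumes a GCC-compatible compiler (GCC, Clang, or clang-cl) that provides __builtin_cpu_supports and __attribute__((target(...))). On non-x86 targets it falls back to the scalar variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

/* Common signature shared by every variant (hypothetical example function). */
typedef uint64_t (*sum_u32_fn)(const uint32_t *data, size_t n);

/* Baseline variant: portable scalar loop, always available. */
static uint64_t sum_u32_scalar(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#ifdef HAVE_X86_DISPATCH
/* AVX2 variant: the target attribute enables AVX2 for this function only,
   so the rest of the translation unit stays at the baseline ISA. */
static __attribute__((target("avx2")))
uint64_t sum_u32_avx2(const uint32_t *data, size_t n) {
    __m256i acc = _mm256_setzero_si256();        /* four 64-bit lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        acc = _mm256_add_epi64(acc, _mm256_cvtepu32_epi64(v));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) total += data[i];         /* scalar tail */
    return total;
}
#endif

/* Pick the best variant the running CPU supports. */
static sum_u32_fn resolve_sum_u32(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_u32_avx2;
#endif
    return sum_u32_scalar;
}

/* Cached function pointer. This lazy initialization is not synchronized;
   a real implementation would resolve once at startup or use an atomic. */
static sum_u32_fn g_sum_u32 = NULL;

uint64_t sum_u32(const uint32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    if (!g_sum_u32) g_sum_u32 = resolve_sum_u32();
    return g_sum_u32(data, n);
}
```

Callers only ever see sum_u32; adding an AVX-512 variant would mean one more function with its own target attribute and one more check in the resolver, ordered from widest to narrowest.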

Cross-skill integration

Profiler/data-collection skills can invoke this skill after identifying a known pattern. This skill also works standalone from source code, profiler snippets, or environment probe output.

Skill status

Ported and reviewed for Windows native C/C++, then extended with routing notes for WSL Linux, native Linux, and CUDA/NVIDIA environments. The following core files are present: this SKILL.md, PORTING-NOTES.md, triggers/from-source.md, triggers/from-profile.md, guidelines/new-code.md, design.md, library/cpu-dispatch.md, CRC32C library sources, platform references, benchmark tests, and all pattern files:

  • parallel-accumulator, missing-vzeroupper, missing-restrict
  • ttas, false-sharing, per-cpu-stats, cold-path-annotation
  • cv-thundering-herd, mutex-to-rwlock, simd-sort
  • simd-upconversion + simd-upconversion-impl
  • fast-crc32c + fast-crc32c-impl
  • library-version-upgrade
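
As an illustration of the kind of rewrite these pattern files describe, the sketch below shows the general idea behind parallel-accumulator: a floating-point reduction with one accumulator forms a single loop-carried dependency chain, and splitting it into several independent accumulators lets the CPU overlap the adds. This is a generic sketch, not the content of patterns/parallel-accumulator.md. The function names and the choice of four accumulators are assumptions, and the right count depends on the add latency and throughput of the target core.

```c
#include <assert.h>
#include <stddef.h>

/* Serial form: each add depends on the previous one, so throughput is
   bounded by floating-point add latency. */
double sum_serial(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) acc += x[i];
    return acc;
}

/* Four independent dependency chains. Note that this changes the order of
   summation, so results can differ from sum_serial in the last bits. */
double sum_parallel4(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    double acc = (a0 + a1) + (a2 + a3);   /* combine the partial sums */
    for (; i < n; ++i) acc += x[i];       /* scalar tail */
    return acc;
}
```

Because the rewrite reorders floating-point additions, verify it against the serial version with a tolerance appropriate to the data, and confirm the speedup with a profiler or benchmark rather than assuming it.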

references/library-versions.md intentionally keeps unverified third-party DLL entries in a TODO table; do not promote those to recommendations without primary source verification and Windows measurement.
