
Amal-David/mlx-porting-skill

Agent Skill for MLX model porting, validation, quantization, benchmarking, and optimization.

Compatible with: Claude Code, Codex CLI, Cursor
npx skills add Amal-David/mlx-porting-skill

Documentation

MLX model porting and optimization

Mission

Produce a correct, reproducible, architecture-aware MLX implementation. Correctness comes before speed. Every speed or memory claim must name the hardware, software versions, workload, baseline, and quality gate.
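
The claim requirement above can be checked mechanically. The following is a minimal sketch only: the field names and the `missing_claim_fields` helper are assumptions made for illustration, since the skill does not define a claim schema here.

```python
# Hypothetical field names mirroring the Mission statement; not a schema from the skill.
REQUIRED_CLAIM_FIELDS = (
    "hardware",
    "software_versions",
    "workload",
    "baseline",
    "quality_gate",
)


def missing_claim_fields(claim: dict) -> list:
    """Return the required fields that are absent or empty in a performance claim."""
    return [name for name in REQUIRED_CLAIM_FIELDS if not claim.get(name)]


# Placeholder values, not measurements.
claim = {
    "hardware": "example Mac (placeholder)",
    "workload": "prefill 512 tokens, decode 128 tokens",
    "baseline": "eager float16 port",
}
print(missing_claim_fields(claim))
```

A claim that returns a non-empty list here would not be reportable under the Mission statement.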

Non-negotiable rules

  1. Do not execute untrusted model code during intake. Inspect JSON, safetensors headers, source files, and licenses statically. Treat auto_map, custom modules, install hooks, and trust_remote_code as review gates.
  2. Pin the source. Record repository, revision, model files, tokenizer/processor revision, license, and checksum or artifact manifest.
  3. Build a source oracle before porting. Freeze deterministic fixtures and capture intermediate tensors at meaningful boundaries.
  4. Port the smallest eager path first. No quantization, compilation, custom kernels, batching, or speculative decoding until basic parity passes.
  5. Change one optimization dimension at a time. Keep a measurement and rollback record.
  6. Prefer native MLX operations. Try built-in fused operations, layout changes, cache design, and mx.compile before a custom Metal kernel.
  7. Do not translate CUDA folklore mechanically. A CUDA technique is only a research candidate until its Metal/MLX bottleneck and implementation are demonstrated.
  8. Never hide quality regressions behind throughput. For audio, language, vision, and generative models, use task-specific quality checks in addition to tensor tolerances.
  9. Do not publish converted weights without license and provenance checks. Preserve the original model card, attribution, generation config, tokenizer/processor files, and conversion recipe.
  10. Daily research automation is review-only. It may collect and rank candidates, but must not silently rewrite runbooks or merge recommendations.
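
Rule 1 asks for static inspection of safetensors headers. A safetensors file begins with an 8-byte little-endian length followed by a JSON header, so tensor names, dtypes, and shapes can be read without loading weights or running model code. The sketch below assumes that layout; the function names and the size cap are illustrative choices, not part of the skill's scripts.

```python
import json
import struct

MAX_HEADER_BYTES = 100_000_000  # illustrative cap to reject implausible headers


def read_safetensors_header(path: str) -> dict:
    """Parse only the JSON header of a .safetensors file; no tensor data is read."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 header size
        if header_len > MAX_HEADER_BYTES:
            raise ValueError(f"header length {header_len} exceeds cap")
        return json.loads(f.read(header_len).decode("utf-8"))


def tensor_table(header: dict) -> dict:
    """Map tensor name -> (dtype, shape), skipping the optional __metadata__ entry."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"
    }
```

Calling `tensor_table(read_safetensors_header(path))` on each shard gives a table of names, dtypes, and shapes that can feed the intake record.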

Workflow

1. Inspect and classify

Run:

python3 scripts/inspect_model.py MODEL_OR_DIRECTORY --output inspection.json
python3 scripts/make_port_plan.py inspection.json --output PORT_PLAN.md
python3 scripts/recommend_optimizations.py inspection.json --markdown OPTIMIZATIONS.md

Read the intake and routing guide. Confirm:

  • task and model domain;
  • architecture family and recurrent/cache state;
  • source framework and custom operations;
  • parameter count, dtypes, tied/shared weights, shards, and adapters;
  • preprocessing, tokenization, sampling, and postprocessing;
  • license and remote-code risk;
  • target Mac, memory budget, latency/throughput objective, and quality objective.

Do not begin implementation if the architecture, source revision, or evaluation target remains ambiguous. Record uncertainties in PORT_PLAN.md rather than guessing.
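
For the remote-code item, a parsed config can be scanned for entries that point at custom Python modules. This is a minimal sketch assuming a Hugging Face style `config.json` with an `auto_map` dictionary; `remote_code_gates` is a hypothetical helper, and how `scripts/inspect_model.py` performs this check is not shown in this document.

```python
import json


def remote_code_gates(config: dict) -> list:
    """List config entries that reference custom code and therefore need review."""
    gates = []
    auto_map = config.get("auto_map")
    if isinstance(auto_map, dict):
        for key in sorted(auto_map):
            gates.append(f"auto_map.{key} -> {auto_map[key]}")
    elif auto_map is not None:
        gates.append(f"auto_map has unexpected type {type(auto_map).__name__}")
    return gates


# Parsing JSON text never imports or executes the referenced modules.
config = json.loads(
    '{"model_type": "example", "auto_map": {'
    '"AutoModelForCausalLM": "modeling_example.ExampleForCausalLM", '
    '"AutoConfig": "configuration_example.ExampleConfig"}}'
)
for gate in remote_code_gates(config):
    print(gate)
```

Each reported gate would be recorded in PORT_PLAN.md as a file to review before any code from the source repository is run.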

2. Select the closest proven MLX reference

Consult the model support map and assets/architectures.yaml.

Use this order:

  1. official MLX / MLX-LM implementation;
  2. active MLX-VLM or MLX-Audio implementation with tests;
  3. active third-party MLX implementation;
  4. upstream source implementation plus architecture paper;
  5. research prototype requiring a new MLX implementation.

Reuse architectural patterns, not copied assumptions. Verify config semantics and tensor layouts against the pinned source.

3. Establish the source oracle

Follow the parity and testing guide:

  • set deterministic seeds and inference mode;
  • save exact inputs after preprocessing;
  • capture shape/dtype/statistics and selected tensors after embeddings/frontends, every block group, bottleneck, cache update, logits/latents, and decoder/vocoder output;
  • save source outputs in portable .npz, .json, or .wav fixtures;
  • record tolerances by dtype and operation;
  • include at least one minimal, one ordinary, and one boundary case.
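
The capture step above can be sketched as a per-tensor summary record. This standard-library version takes flattened values so it runs anywhere; in practice the values would come from the source framework's tensors. The record layout and the `tensor_summary` name are assumptions, not a fixture format defined by the skill.

```python
import json
import math


def tensor_summary(name: str, shape: tuple, dtype: str, values: list) -> dict:
    """Summarize one captured tensor; `values` is the flattened data."""
    assert values, f"{name}: no values captured"
    assert math.prod(shape) == len(values), f"{name}: shape {shape} vs {len(values)} values"
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {
        "name": name,
        "shape": list(shape),
        "dtype": dtype,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": math.sqrt(variance),
        "finite": all(math.isfinite(v) for v in values),
    }


# Toy values standing in for a captured block output.
record = tensor_summary("block_0.out", (2, 2), "float32", [1.0, 2.0, 3.0, 4.0])
print(json.dumps(record, sort_keys=True))
```

Comparing summaries first makes gross mismatches (wrong shape, NaNs, shifted mean) visible before any element-wise comparison.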

4. Implement the minimal eager MLX graph

Read the core porting method and the matching architecture runbook.

Initial implementation constraints:

  • floating-point weights only;
  • batch size one unless batching is intrinsic;
  • no compile decorator;
  • no custom kernels;
  • explicit state and cache objects;
  • shape assertions around every nontrivial transform;
  • reversible weight-map manifest.
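
As one way to apply the shape-assertion constraint, the sketch below checks a shape tuple against a pattern with named dimensions. It works with any array type that exposes a shape tuple; `assert_shape` and the pattern convention are hypothetical, not helpers shipped with the skill.

```python
def assert_shape(name: str, shape: tuple, pattern: tuple, dims: dict) -> None:
    """Check `shape` against `pattern`; string entries are named dims tracked in `dims`."""
    assert len(shape) == len(pattern), f"{name}: rank {len(shape)}, expected {len(pattern)}"
    for axis, (actual, expected) in enumerate(zip(shape, pattern)):
        if isinstance(expected, str):
            # The first sighting binds the named dimension; later uses must agree.
            bound = dims.setdefault(expected, actual)
            assert bound == actual, f"{name}: axis {axis} ({expected}) is {actual}, expected {bound}"
        else:
            assert actual == expected, f"{name}: axis {axis} is {actual}, expected {expected}"


# Illustrative attention shapes: batch, heads, tokens, head size.
dims = {}
assert_shape("q", (1, 8, 16, 64), ("B", "H", "T", 64), dims)
assert_shape("k", (1, 8, 16, 64), ("B", "H", "T", 64), dims)
print(dims)
```

Sharing one `dims` dictionary across a forward pass catches a transform that silently changes the sequence or head dimension.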

5. Convert weights deterministically

Create a weight map with source key, target key, source shape, target shape, transform, dtype, and tie/share rule. Validate it:

python3 scripts/validate_weight_map.py \
  --source source-manifest.json \
  --target target-manifest.json \
  --mapping WEIGHT_MAP.json

Never rely on load-time strict=False to conceal missing or extra weights. Categorize every exception as intentionally ignored, generated, shared, or unsupported.
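
The coverage rule can be illustrated with a small checker. The manifest and mapping layout below is an assumption made for this sketch and may differ from what `scripts/validate_weight_map.py` expects; only the four exception categories come from the text above.

```python
ALLOWED_EXCEPTIONS = {"intentionally ignored", "generated", "shared", "unsupported"}


def validate_weight_map(source: dict, target: dict, entries: list, exceptions: dict) -> list:
    """Return coverage and shape errors; an empty list means every key is accounted for."""
    errors = []
    for entry in entries:
        src_shape = source.get(entry["source"])
        tgt_shape = target.get(entry["target"])
        if src_shape is None or tgt_shape is None:
            errors.append(f"unknown key: {entry['source']} -> {entry['target']}")
            continue
        axes = entry.get("axes")  # optional axis permutation applied to the source tensor
        expected = [src_shape[a] for a in axes] if axes else list(src_shape)
        if expected != list(tgt_shape):
            errors.append(f"shape mismatch: {entry['target']} expected {expected}, got {list(tgt_shape)}")
    unmapped = (set(source) - {e["source"] for e in entries}) | (
        set(target) - {e["target"] for e in entries}
    )
    for key in sorted(unmapped):
        if exceptions.get(key) not in ALLOWED_EXCEPTIONS:
            errors.append(f"uncategorized key: {key}")
    return errors


# Illustrative manifests: one permuted convolution weight and one runtime-generated buffer.
source = {"conv.weight": [16, 4, 3], "rotary.inv_freq": [32]}
target = {"conv.weight": [16, 3, 4]}
entries = [{"source": "conv.weight", "target": "conv.weight", "axes": [0, 2, 1]}]
print(validate_weight_map(source, target, entries, {}))
print(validate_weight_map(source, target, entries, {"rotary.inv_freq": "generated"}))
```

The first call reports the uncategorized buffer; the second passes once that key is explicitly categorized.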

6. Pass the parity ladder

Use scripts/compare_tensors.py and pass, in order:

  1. config and preprocessing parity;
  2. weight coverage and transformed-shape parity;
  3. single primitive/block parity;
  4. staged intermediate parity;
  5. end-to-end deterministic parity;
  6. task-quality parity;
  7. cache/state and incremental-generation parity;
  8. boundary, long-input, streaming, and batch parity.

When parity fails, consult the failure atlas. Do not optimize a failing graph.
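
The ordering matters because a failure at an early rung makes later results uninterpretable. A minimal sketch of an ordered runner that stops at the first failing rung follows; the rung names, tolerance, and helper names are placeholders, and `scripts/compare_tensors.py` may work differently.

```python
def max_abs_error(reference: list, candidate: list) -> float:
    """Largest element-wise absolute difference between two flattened tensors."""
    assert len(reference) == len(candidate), "length mismatch"
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def run_ladder(rungs: list) -> tuple:
    """Run (name, check) pairs in order and stop at the first failing rung."""
    passed = []
    for name, check in rungs:
        if not check():
            return passed, name  # later rungs are not attempted
        passed.append(name)
    return passed, None


# Toy tensors; the 1e-5 tolerance is a placeholder, not a recommended value.
ref = [0.10, -0.25, 0.50]
out = [0.10, -0.25, 0.75]
rungs = [
    ("weight coverage", lambda: True),
    ("single block", lambda: max_abs_error(ref, ref) <= 1e-5),
    ("end-to-end", lambda: max_abs_error(ref, out) <= 1e-5),
    ("task quality", lambda: True),
]
passed, failed = run_ladder(rungs)
print(passed, failed)
```

Here the run stops at the end-to-end rung, so the task-quality rung is never reported as passing on top of a broken graph.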

7. Profile before choosing optimizations

Read the benchmarking guide. Measure separately:

  • prefill/encoder/frontend time;
  • per-token or per-frame decode time;
  • codec/vocoder/postprocess time;
  • compilation and first-run cost;
  • peak and steady-state memory;
  • data movement, synchronization, and Python overhead.

Use:

python3 scripts/benchmark_command.py --warmup 2 --runs 8 --output benchmark.json -- COMMAND ...
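
The warmup and run counts in the command above correspond to a simple measurement loop. The sketch below is an in-process illustration of that idea, not the implementation of `scripts/benchmark_command.py`. Because MLX evaluates lazily, the timed callable should force evaluation (for example with `mx.eval`) before returning, or the timing covers only graph construction.

```python
import statistics
import time


def benchmark(fn, warmup: int = 2, runs: int = 8) -> dict:
    """Time `fn` over `runs` calls after `warmup` discarded calls (requires runs >= 1)."""

    def timed() -> float:
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    warmup_s = [timed() for _ in range(warmup)]  # may include compilation and cache setup
    samples = [timed() for _ in range(runs)]
    return {
        "first_run_s": warmup_s[0] if warmup_s else None,
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": runs,
    }


# Stand-in workload; a real run would call the model and force evaluation.
result = benchmark(lambda: sum(range(10_000)))
print(sorted(result))
```

Reporting the first-run time separately keeps compilation cost from being averaged into steady-state latency.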

8. Apply the optimization ladder

Consult assets/optimization_guidance.yaml, assets/recommendation-taxonomy.yaml, and assets/techniques.yaml, and load only the relevant guides:

  1. MLX runtime, compilation, and kernels
  2. attention and KV cache
  3. decoding and serving
  4. quantization
  5. training and fine-tuning

Default order:

  1. remove unintended evaluations and host transfers;
  2. correct dtype and tensor layout;
  3. use native fused operations and fast SDPA;
  4. reduce allocations and make state/cache updates explicit;
  5. compile stable regions and control recompilation;
  6. chunk prefill/frontends or stream where the architecture permits;
  7. optimize KV/cache policy and batching;
  8. quantize weights, then KV/state if justified;
  9. add speculative or multi-token decoding only for compatible autoregressive paths;
  10. write a custom Metal kernel only after profiling proves a remaining kernel bottleneck.

For every change, record hypothesis, diff, correctness result, benchmark result, memory result, quality result, and keep/revert decision.
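
A per-change record might look like the following sketch. The `ChangeRecord` fields, the 2% gain threshold, and the keep rule are assumptions chosen for illustration; the templates in assets/ may define a different layout.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChangeRecord:
    hypothesis: str
    diff: str
    correctness_passed: bool
    quality_passed: bool
    baseline_ms: float
    candidate_ms: float
    baseline_peak_mb: float
    candidate_peak_mb: float

    def decision(self, min_gain: float = 1.02) -> str:
        """Keep only if both gates pass and latency or peak memory improves by min_gain."""
        if not (self.correctness_passed and self.quality_passed):
            return "revert"
        faster = self.baseline_ms / self.candidate_ms >= min_gain
        smaller = self.baseline_peak_mb / self.candidate_peak_mb >= min_gain
        return "keep" if (faster or smaller) else "revert"


# Placeholder numbers, not measurements.
record = ChangeRecord(
    hypothesis="weight quantization lowers decode latency",
    diff="commit placeholder",
    correctness_passed=True,
    quality_passed=False,
    baseline_ms=40.0,
    candidate_ms=25.0,
    baseline_peak_mb=9000.0,
    candidate_peak_mb=5000.0,
)
print(record.decision(), asdict(record)["hypothesis"])
```

In this example the change is reverted despite the latency gain because the quality gate failed, matching rule 8.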

9. Package and publish

Follow the packaging and publication guide. Include:

  • pinned source and conversion command;
  • compatible MLX and library versions;
  • model/config/tokenizer/processor files;
  • exact quantization recipe and exclusions;
  • deterministic smoke test;
  • benchmark protocol and raw results;
  • known limitations;
  • original license and attribution;
  • wording free of unsupported “faster” or “lossless” claims.
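
To support the pinned-source and provenance items, a package can ship an artifact manifest with per-file checksums. This is a minimal sketch using SHA-256; the manifest keys and the `build_manifest` name are hypothetical rather than a format defined by the skill.

```python
import hashlib
from pathlib import Path


def build_manifest(directory: str, source_repo: str, source_revision: str) -> dict:
    """Record size and SHA-256 for every file under `directory`, keyed by relative path."""
    root = Path(directory)
    files = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256()
        with path.open("rb") as f:
            # Hash in 1 MiB chunks so large weight shards are never fully in memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        files[path.relative_to(root).as_posix()] = {
            "bytes": path.stat().st_size,
            "sha256": digest.hexdigest(),
        }
    return {"source_repo": source_repo, "source_revision": source_revision, "files": files}
```

Serializing the returned dictionary with `json.dumps(..., indent=2, sort_keys=True)` gives a stable file that can be diffed between conversions.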

10. Return an engineering report

The final response must summarize:

  • architecture and selected runbook;
  • source revision and evidence references;
  • implementation and weight-map status;
  • parity matrix;
  • baseline and optimized metrics;
  • accepted and rejected optimizations;
  • remaining risks and unsupported paths;
  • reproducible commands and artifact locations.

Use the templates in assets/.

When to stop

Stop and report rather than improvising when:

  • the source license prohibits the requested distribution;
  • required source behavior only exists in unreviewed remote code;
  • parity cannot be localized after the staged checks;
  • an optimization improves a microbenchmark but worsens end-to-end latency, memory, or quality;
  • a custom kernel lacks a portable fallback or adequate tests;
  • hardware, MLX version, workload, or quality target is missing from a performance claim.

Maintenance

For source review, security, and the daily candidate pipeline, read the maintenance and provenance guide. Run scripts/audit_skill.py and scripts/validate_sources.py before distributing this skill.
