MLX model porting and optimization
Mission
Produce a correct, reproducible, architecture-aware MLX implementation. Correctness comes before speed. Every speed or memory claim must name the hardware, software versions, workload, baseline, and quality gate.
Non-negotiable rules
- Do not execute untrusted model code during intake. Inspect JSON, safetensors headers, source files, and licenses statically. Treat auto_map, custom modules, install hooks, and trust_remote_code as review gates.
- Pin the source. Record repository, revision, model files, tokenizer/processor revision, license, and checksum or artifact manifest.
- Build a source oracle before porting. Freeze deterministic fixtures and capture intermediate tensors at meaningful boundaries.
- Port the smallest eager path first. No quantization, compilation, custom kernels, batching, or speculative decoding until basic parity passes.
- Change one optimization dimension at a time. Keep a measurement and rollback record.
- Prefer native MLX operations. Try built-in fused operations, layout changes, cache design, and mx.compile before a custom Metal kernel.
- Do not translate CUDA folklore mechanically. A CUDA technique is only a research candidate until its Metal/MLX bottleneck and implementation are demonstrated.
- Never hide quality regressions behind throughput. For audio, language, vision, and generative models, use task-specific quality checks in addition to tensor tolerances.
- Do not publish converted weights without license and provenance checks. Preserve the original model card, attribution, generation config, tokenizer/processor files, and conversion recipe.
- Daily research automation is review-only. It may collect and rank candidates, but must not silently rewrite runbooks or merge recommendations.
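The static-inspection rule can be illustrated with a minimal sketch that reads only the JSON header of a safetensors file (an 8-byte little-endian length followed by that many bytes of JSON) and never touches tensor data or model code. The function names and the header size cap are assumptions made for this sketch; they are not taken from scripts/inspect_model.py.

```python
import json
import struct


def read_safetensors_header(path, max_header_bytes=100_000_000):
    """Return the JSON header of a .safetensors file without loading tensors.

    max_header_bytes is a defensive cap chosen for this sketch.
    """
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file too short to contain a header length")
        # The first 8 bytes hold an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > max_header_bytes:
            raise ValueError(f"implausible header length: {header_len}")
        raw = f.read(header_len)
        if len(raw) != header_len:
            raise ValueError("file truncated inside the header")
    return json.loads(raw.decode("utf-8"))


def summarize_header(header):
    """Map tensor name to (dtype, shape); the optional metadata entry is skipped."""
    return {
        name: (entry["dtype"], tuple(entry["shape"]))
        for name, entry in header.items()
        if name != "__metadata__"
    }
```

Because nothing is deserialized beyond JSON, this is safe to run on an unreviewed artifact, and the resulting name/dtype/shape table can feed the source manifest.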
Workflow
1. Inspect and classify
Run:
python3 scripts/inspect_model.py MODEL_OR_DIRECTORY --output inspection.json
python3 scripts/make_port_plan.py inspection.json --output PORT_PLAN.md
python3 scripts/recommend_optimizations.py inspection.json --markdown OPTIMIZATIONS.md
Read intake and routing. Confirm:
- task and model domain;
- architecture family and recurrent/cache state;
- source framework and custom operations;
- parameter count, dtypes, tied/shared weights, shards, and adapters;
- preprocessing, tokenization, sampling, and postprocessing;
- license and remote-code risk;
- target Mac, memory budget, latency/throughput objective, and quality objective.
Do not begin implementation if the architecture, source revision, or evaluation target remains ambiguous. Record uncertainties in PORT_PLAN.md rather than guessing.
2. Select the closest proven MLX reference
Consult model support map and assets/architectures.yaml.
Use this order:
- official MLX / MLX-LM implementation;
- active MLX-VLM or MLX-Audio implementation with tests;
- active third-party MLX implementation;
- upstream source implementation plus architecture paper;
- research prototype requiring a new MLX implementation.
Reuse architectural patterns, not copied assumptions. Verify config semantics and tensor layouts against the pinned source.
3. Establish the source oracle
Follow parity and testing:
- set deterministic seeds and inference mode;
- save exact inputs after preprocessing;
- capture shape/dtype/statistics and selected tensors after embeddings/frontends, every block group, bottleneck, cache update, logits/latents, and decoder/vocoder output;
- save source outputs in portable .npz, .json, or .wav fixtures;
- record tolerances by dtype and operation;
- include at least one minimal, one ordinary, and one boundary case.
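One way to freeze such fixtures is sketched below, assuming the source framework's tensors have already been converted to NumPy arrays. The helper names, the key prefixes, and the choice of statistics are illustrative assumptions, not the layout any script in this skill requires.

```python
import json

import numpy as np


def tensor_stats(x):
    """Shape, dtype, and summary statistics computed in float64."""
    x = np.asarray(x)
    as64 = x.astype(np.float64)
    return {
        "shape": list(x.shape),
        "dtype": str(x.dtype),
        "mean": float(as64.mean()),
        "std": float(as64.std()),
        "min": float(as64.min()),
        "max": float(as64.max()),
    }


def save_fixture(prefix, inputs, captures):
    """Write tensors to <prefix>.npz and their statistics to <prefix>.json."""
    arrays = {f"input__{k}": np.asarray(v) for k, v in inputs.items()}
    arrays.update({f"capture__{k}": np.asarray(v) for k, v in captures.items()})
    np.savez(f"{prefix}.npz", **arrays)
    stats = {name: tensor_stats(value) for name, value in arrays.items()}
    with open(f"{prefix}.json", "w") as f:
        json.dump(stats, f, indent=2, sort_keys=True)
    return stats
```

Saving the post-preprocessing inputs next to the captured intermediates lets the MLX port replay exactly the same case without re-running the source framework.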
4. Implement the minimal eager MLX graph
Read core porting method and the matching architecture runbook:
- dense decoder Transformer
- Mixture-of-Experts Transformer
- encoder Transformer
- encoder-decoder Transformer
- SSM, recurrent, and hybrid
- diffusion and flow
- vision-language and omni
- neural audio codec
- autoregressive audio LM / TTS
- flow or diffusion TTS
- vocoder
- ASR
- streaming speech
- separation and enhancement
Initial implementation constraints:
- floating-point weights only;
- batch size one unless batching is intrinsic;
- no compile decorator;
- no custom kernels;
- explicit state and cache objects;
- shape assertions around every nontrivial transform;
- reversible weight-map manifest.
5. Convert weights deterministically
Create a weight map with source key, target key, source shape, target shape, transform, dtype, and tie/share rule. Validate it:
python3 scripts/validate_weight_map.py --source source-manifest.json --target target-manifest.json --mapping WEIGHT_MAP.json
Never rely on load-time strict=False to conceal missing or extra weights. Categorize every exception as intentionally ignored, generated, shared, or unsupported.
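The coverage rule above can be sketched as a check that every source and target key is either mapped or carries an explicit exception category. The field names (source_key, target_key, source_shape, target_shape) and the function are hypothetical; the real schema is whatever scripts/validate_weight_map.py expects.

```python
ALLOWED_EXCEPTIONS = {"intentionally ignored", "generated", "shared", "unsupported"}


def check_weight_map(source_shapes, target_shapes, mapping, exceptions):
    """Return a list of problems; an empty list means explicit, full coverage.

    source_shapes / target_shapes: {key: shape as a list}
    mapping: list of dicts with source_key, target_key, source_shape, target_shape
    exceptions: {key: category} for keys deliberately left unmapped
    """
    problems = []
    mapped_sources, mapped_targets = set(), set()
    for entry in mapping:
        src, dst = entry["source_key"], entry["target_key"]
        mapped_sources.add(src)
        mapped_targets.add(dst)
        if source_shapes.get(src) != entry["source_shape"]:
            problems.append(f"source shape mismatch: {src}")
        if target_shapes.get(dst) != entry["target_shape"]:
            problems.append(f"target shape mismatch: {dst}")
    for key, category in exceptions.items():
        if category not in ALLOWED_EXCEPTIONS:
            problems.append(f"unknown exception category: {key} -> {category}")
    # Anything neither mapped nor categorized is a silent gap.
    for key in sorted(set(source_shapes) - mapped_sources - set(exceptions)):
        problems.append(f"unmapped source weight: {key}")
    for key in sorted(set(target_shapes) - mapped_targets - set(exceptions)):
        problems.append(f"unfilled target weight: {key}")
    return problems
```

Unlike strict=False, this makes every skipped tensor a recorded decision: a tied output head would appear as "shared" and a recomputed rotary table as "generated".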
6. Pass the parity ladder
Use scripts/compare_tensors.py and pass, in order:
- config and preprocessing parity;
- weight coverage and transformed-shape parity;
- single primitive/block parity;
- staged intermediate parity;
- end-to-end deterministic parity;
- task-quality parity;
- cache/state and incremental-generation parity;
- boundary, long-input, streaming, and batch parity.
When parity fails, use failure atlas. Do not optimize a failing graph.
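A staged comparison of the kind the ladder describes might look like the sketch below, which walks the stages in order and reports the first one that exceeds its tolerance. The tolerance values are placeholder starting points assumed for illustration and should be replaced by the per-dtype, per-operation values recorded with the oracle; this is not the implementation of scripts/compare_tensors.py.

```python
import numpy as np

# Hypothetical starting tolerances; tune per operation and record the final values.
TOLERANCES = {
    "float32": {"rtol": 1e-4, "atol": 1e-5},
    "float16": {"rtol": 1e-2, "atol": 1e-3},
    "bfloat16": {"rtol": 2e-2, "atol": 2e-2},
}


def compare(name, reference, candidate, dtype_name):
    """Compare one stage using |got - ref| <= atol + rtol * |ref|, in float64."""
    ref = np.asarray(reference, dtype=np.float64)
    got = np.asarray(candidate, dtype=np.float64)
    if ref.shape != got.shape:
        return {"name": name, "ok": False, "reason": f"shape {got.shape} != {ref.shape}"}
    tol = TOLERANCES[dtype_name]
    abs_err = np.abs(got - ref)
    # NaN compares false, so a NaN anywhere fails the stage.
    ok = bool(np.all(abs_err <= tol["atol"] + tol["rtol"] * np.abs(ref)))
    return {"name": name, "ok": ok, "max_abs_err": float(abs_err.max(initial=0.0))}


def first_failure(stages):
    """stages: ordered (name, reference, candidate, dtype_name) tuples."""
    for stage in stages:
        result = compare(*stage)
        if not result["ok"]:
            return result
    return None
```

Stopping at the first failing stage keeps the investigation local: everything upstream of it has already passed, so the defect lies in that stage or its weights.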
7. Profile before choosing optimizations
Read benchmarking. Separate:
- prefill/encoder/frontend time;
- per-token or per-frame decode time;
- codec/vocoder/postprocess time;
- compilation and first-run cost;
- peak and steady-state memory;
- data movement, synchronization, and Python overhead.
Use:
python3 scripts/benchmark_command.py --warmup 2 --runs 8 --output benchmark.json -- COMMAND ...
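The warmup-then-measure protocol behind that command can be sketched as follows. This is an illustrative stand-in written for this section, not the contents of scripts/benchmark_command.py, and the result field names are assumptions.

```python
import statistics
import subprocess
import time


def benchmark_command(command, warmup=2, runs=8):
    """Time a command; warmup runs execute but are excluded from the statistics."""
    for _ in range(warmup):
        subprocess.run(command, check=True, capture_output=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return {
        "command": list(command),
        "warmup": warmup,
        "runs": runs,
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "samples_s": samples,
    }
```

Whole-process timing like this includes interpreter startup, model loading, and any compilation, so it cannot separate prefill from decode; those phases need timers inside the program, with the lazy computation forced to finish before each timestamp is taken.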
8. Apply the optimization ladder
Consult assets/optimization_guidance.yaml, assets/recommendation-taxonomy.yaml, assets/techniques.yaml, and load only relevant guides:
- MLX runtime, compilation, and kernels
- attention and KV cache
- decoding and serving
- quantization
- training and fine-tuning
Default order:
- remove unintended evaluations and host transfers;
- correct dtype and tensor layout;
- use native fused operations and fast SDPA;
- reduce allocations and make state/cache updates explicit;
- compile stable regions and control recompilation;
- chunk prefill/frontends or stream where the architecture permits;
- optimize KV/cache policy and batching;
- quantize weights, then KV/state if justified;
- add speculative or multi-token decoding only for compatible autoregressive paths;
- write a custom Metal kernel only after profiling proves a remaining kernel bottleneck.
For every change, record hypothesis, diff, correctness result, benchmark result, memory result, quality result, and keep/revert decision.
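A per-change record could be kept in a small structure such as the sketch below. The field names and the keep/revert/review rule are assumptions made for illustration; the templates in assets/ remain the authoritative format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OptimizationRecord:
    hypothesis: str
    diff: str  # commit hash or patch reference
    correctness_passed: bool
    quality_passed: bool
    baseline_latency_s: float
    candidate_latency_s: float
    baseline_peak_memory_gb: float
    candidate_peak_memory_gb: float

    def decision(self):
        """Hypothetical rule: gates first, then require a gain with no regression."""
        if not (self.correctness_passed and self.quality_passed):
            return "revert"
        faster = self.candidate_latency_s < self.baseline_latency_s
        leaner = self.candidate_peak_memory_gb < self.baseline_peak_memory_gb
        no_worse = (
            self.candidate_latency_s <= self.baseline_latency_s
            and self.candidate_peak_memory_gb <= self.baseline_peak_memory_gb
        )
        if no_worse and (faster or leaner):
            return "keep"
        # Trade-offs and null results need a human decision.
        return "review"

    def to_json(self):
        return json.dumps({**asdict(self), "decision": self.decision()}, indent=2)
```

Putting the correctness and quality gates ahead of any speed comparison encodes the rule that throughput never excuses a regression.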
9. Package and publish
Follow packaging and publication. Include:
- pinned source and conversion command;
- compatible MLX and library versions;
- model/config/tokenizer/processor files;
- exact quantization recipe and exclusions;
- deterministic smoke test;
- benchmark protocol and raw results;
- known limitations;
- original license and attribution;
- no unsupported “faster” or “lossless” wording.
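The pinned-source and file-inventory items can be backed by a checksum manifest. The sketch below assumes a local directory of artifacts; the manifest layout and function names are hypothetical, not a format this skill prescribes.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight shards are never read whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, repository, revision):
    """List every file under root with its size and SHA-256, in sorted order."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "repository": repository,
        "revision": revision,
        "files": [
            {
                "path": p.relative_to(root).as_posix(),
                "bytes": p.stat().st_size,
                "sha256": sha256_file(p),
            }
            for p in files
        ],
    }
```

Sorting the file list makes the manifest deterministic, so two conversions of the same pinned revision can be compared with a plain diff.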
10. Return an engineering report
The final response must summarize:
- architecture and selected runbook;
- source revision and evidence references;
- implementation and weight-map status;
- parity matrix;
- baseline and optimized metrics;
- accepted and rejected optimizations;
- remaining risks and unsupported paths;
- reproducible commands and artifact locations.
Use the templates in assets/.
When to stop
Stop and report rather than improvising when:
- the source license prohibits the requested distribution;
- required source behavior only exists in unreviewed remote code;
- parity cannot be localized after the staged checks;
- an optimization improves a microbenchmark but worsens end-to-end latency, memory, or quality;
- a custom kernel lacks a portable fallback or adequate tests;
- hardware, MLX version, workload, or quality target is missing from a performance claim.
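The last stop condition lends itself to a mechanical check. The sketch below flags a performance claim that omits any required context; the field names are hypothetical labels for the items this document requires (hardware, software versions, workload, baseline, quality gate).

```python
REQUIRED_CLAIM_FIELDS = (
    "hardware",
    "mlx_version",
    "workload",
    "baseline",
    "quality_gate",
)


def missing_claim_fields(claim):
    """Return the required fields that are absent or blank in a claim dict."""
    missing = []
    for field in REQUIRED_CLAIM_FIELDS:
        value = claim.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing
```

A non-empty result means the claim should be withheld from the report until the missing context is supplied.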
Maintenance
For source review, security, and the daily candidate pipeline, read maintenance and provenance. Run scripts/audit_skill.py and scripts/validate_sources.py before distributing this skill.