/autoresearch — Self-Improving Prompt Optimization

Karpathy autoresearch pattern: Generate → Evaluate → Score → Keep/Discard → Mutate → Repeat

Architecture

Core engine: crates/sgr-agent/src/autoresearch.rs (Rust module, feature genai). Dashboard: skills/autoresearch/dashboard.py (Python, standalone viewer). Test cases: skills/autoresearch/test_cases/*.json (embedded at compile time via include_str!).

Targets

Target	What it optimizes	Eval metric
`tool-selection`	Tool descriptions in agent	% correct tool chosen for 66 test tasks
`system-prompt`	Agent system prompt (SOUL.md)	Decision quality: right tool + coherent reasoning (4 criteria)
`skill`	Any SKILL.md file	Binary criteria pass rate (4 criteria)
`decision-parser`	Structured output schema	Parse success + field quality (5 criteria)

Usage (Rust API)

use sgr_agent::autoresearch::{AutoResearch, Config, Target};

let config = Config {
    target: Target::ToolSelection,
    batch_size: 10,
    cycle_secs: 120,
    gen_model: "gemini-2.5-flash".into(),
    eval_model: "claude-sonnet-4-6".into(),
    data_dir: "autoresearch_data/tool-selection".into(),
};
let ar = AutoResearch::new(config);
ar.run(20).await?; // 20 cycles

Quick Start (CLI — when wired into rc-cli)

# Tool selection optimization (most impactful)
cargo run -- autoresearch tool-selection --cycles 20

# System prompt optimization
cargo run -- autoresearch system-prompt --cycles 10

# Optimize a specific skill
cargo run -- autoresearch skill --name delegate --cycles 15

# Optimize decision parsing
cargo run -- autoresearch decision-parser --cycles 10

# Dashboard (Python, standalone)
python3 skills/autoresearch/dashboard.py --target tool-selection --port 8501

How It Works

Each cycle:

Generate N outputs with current prompt (via Gemini — same model agent uses)
Evaluate each output against binary criteria (via Claude Sonnet — different model for objectivity)
Score = sum of passed criteria across all outputs
Keep if score > best_score, discard otherwise
Mutate the winning prompt to try improvements
Log to JSONL for dashboard tracking
Wait for next cycle (default 2 min)

Environment

GEMINI_API_KEY — for generation (tests with same model the agent uses)
ANTHROPIC_API_KEY — for evaluation and mutation (different model for objectivity)

Tips

Binary evals — yes/no criteria work best. Avoid Likert scales.
Don't over-constrain — too many narrow criteria → model games the eval.
10-20 cycles usually enough for significant improvement.
Apply results — best prompt saved to data/<target>/best_prompt.txt.

Cost

Tool selection: ~$0.05/cycle (cheap — one LLM call per test, string match eval)
System prompt: ~$0.20/cycle (gen + LLM eval per test)
Skill: ~$0.30/cycle (gen + LLM eval per test, longer outputs)
Decision parser: ~$0.15/cycle (gen + partial local eval)
20 cycles ≈ $1-6 depending on target. Permanent improvement.

fortunto2/rust-code

¿Qué es rust-code?

Preguntar en tu IA favorita

Documentación

/autoresearch — Self-Improving Prompt Optimization

Architecture

Targets

Usage (Rust API)

Quick Start (CLI — when wired into rc-cli)

How It Works

Environment

Tips

Cost

Skills relacionados

steipete/bluebubbles

steipete/eightctl

steipete/blucli

steipete/bear-notes

steipete/camsnap

steipete/gifgrep