
zhichao1208/reduce-hallucination

Audit LLM agents for hallucination using interrogation-science techniques. Ground, cite, abstain. An open agent skill.

Supported agents: Claude Code, Codex CLI, Cursor
npx skills add zhichao1208/reduce-hallucination


reduce-hallucination

Turn the verified techniques interrogation science uses to get a knowledgeable witness to tell the truth into a standard audit for LLM agents. Each LLM node is a witness. The handoff between nodes is the transcription of testimony. User corrections and task records are physical evidence.

Theory and evidence live in references/. Read references/research-map.md for the citations behind each claim.

Three iron rules

  1. No accusation without evidence (Wellman's rule). Every finding must quote the actual prompt, schema, or output it is about. A suspicion without a quote goes in a "to verify" list, not in the findings.
  2. Surface signals do not prove truth. Fluent, confident, detailed text is not more correct than terse text. People detect deception from behavior at roughly chance accuracy. Judge content only: grounding, consistency, and whether the cited span exists.
  3. Self-correction needs a new external signal. "Are you sure? Check again" with no new evidence degrades accuracy. Every re-check must inject retrieved results or validator output.
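
Rule 3 can be enforced where the re-check prompt is assembled. The following is a minimal sketch, assuming re-check prompts are built in code; `build_recheck_prompt` and its parameters are hypothetical names, not part of the skill.

```python
from typing import Optional, Sequence


def build_recheck_prompt(question: str, previous_answer: str,
                         new_evidence: Sequence[str]) -> Optional[str]:
    """Return a re-check prompt, or None when there is no new external signal."""
    evidence = [e.strip() for e in new_evidence if e and e.strip()]
    if not evidence:
        # Rule 3: a bare "are you sure?" is not a re-check, so do not send one.
        return None
    lines = [
        f"Question: {question}",
        f"Previous answer: {previous_answer}",
        "New evidence (retrieved results or validator output):",
    ]
    lines += [f"- {e}" for e in evidence]
    lines.append(
        "Using only the new evidence, confirm or revise the previous answer. "
        "Keeping the previous answer is an acceptable outcome."
    )
    return "\n".join(lines)
```

Returning `None` instead of a prompt makes the caller decide what to do when nothing new is available, for example skipping the retry or routing to a human.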

The three pillars of a good citation

Citation quality decomposes into three checks, in ascending order of difficulty and value:

| Pillar | Question | Status today |
|---|---|---|
| Faithful | Is the quote verbatim-real and the claim non-fabricated? | Largely solvable: verbatim string-match + abstention. |
| Relevant | Is the source on topic for the question? | Largely solvable: retrieval quality + consistency probing. |
| Supportive | Does the source support the claim, vs contradict it, vs stay neutral? | The frontier. Hard, and the rare "contradict" class is where systems are weakest. |

A citation that resolves to a real, on-topic passage but does not actually support the claim ("misgrounding") is worse than no citation, because it manufactures trust over a wrong claim and disarms the reviewer's only working defense (checking the content). Showing a citation without verifying it on the backend makes hallucination more dangerous, not less.

Audit protocol

Phase 1 — interrogate each witness (prompt + schema red flags)

For each LLM node, run the red-flag list. Each flag is one interrogation violation.

| # | Red flag (what to look for) | Why it matters | Harm |
|---|---|---|---|
| R1 | An enum field with no null / unknown option; a one-way "output a confirmation of X" instruction. | Forcing an answer produces compliance (false-confession analogue). | Forced guessing, confident fabrication. |
| R2 | Prose grants an abstention ("use null if missing") but the schema has no null. | Schema wins, the abstention is void. | Silent fabrication. |
| R3 | The prompt asks for evidence/citation, but nothing downstream checks it. | Telling without verifying teaches the model to fabricate more convincing citations. | Fake anchors. |
| R4 | Directional instructions: "be conservative", "assume", "infer from", "default to". | One-way presumption; combined with R1 it produces systematic bias. | Skewed output in one direction. |
| R5 | An aggregate field with no defined unit ("total years", "overall score"). | A measurement needs its unit declared before it can be locked down. | Definition mismatch read as an extraction error. |
| R6 | Few-shot examples that are off-task or that model citation-free reasoning. | Examples set the expectation for a "good" answer. | Teaches the wrong output style. |
| R7 | The prompt claims the node can do something it has no tool for. | False premise. | Fabricated process narration. |
| R8 | A "fact" handed in by the user or an upstream node, used as truth with no source tag. | Only confront with verified evidence. | Premise pollution. |
| R9 | A template with a mandatory slot and no empty branch. | The template version of forcing an answer. | A blank gets filled with the nearest value. |
| R10 | Re-check / retry written aggressively ("you were wrong, fix it") with no new evidence. | Sycophancy flips correct answers. | Right answers overturned. |

Record each as: node | R# | quoted text | one-line consequence.
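
Part of the red-flag list can be checked mechanically. The sketch below covers only R1, and assumes the node's output schema is a JSON-Schema-style dict whose top-level `properties` carry `enum` lists. `Finding`, `scan_enum_fields`, and the set of abstention spellings are hypothetical; a schema that expresses absence differently (for example through a nullable `type`) would need its own check.

```python
from dataclasses import dataclass
import json

ABSTAIN_VALUES = {None, "unknown", "null", "n/a"}  # assumed spellings of "no answer"


@dataclass
class Finding:
    node: str
    flag: str
    quoted_text: str
    consequence: str

    def as_row(self) -> str:
        # Matches the record format: node | R# | quoted text | one-line consequence
        return f"{self.node} | {self.flag} | {self.quoted_text} | {self.consequence}"


def scan_enum_fields(node: str, schema: dict) -> list[Finding]:
    """Flag R1: enum properties that offer no null / unknown option."""
    findings = []
    for name, spec in schema.get("properties", {}).items():
        options = spec.get("enum")
        if options is None:
            continue
        normalized = {o.lower() if isinstance(o, str) else o for o in options}
        if normalized & ABSTAIN_VALUES:
            continue
        # Quote the schema fragment itself, so the finding carries its evidence.
        quote = json.dumps({name: {"enum": options}})
        findings.append(
            Finding(node, "R1", quote, "Model must guess when the source is silent.")
        )
    return findings
```

Because each `Finding` quotes the schema fragment it is about, the output stays within the first iron rule: no accusation without evidence.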

Phase 2 — audit the transcription (node boundaries / data flow)

This is where most pipeline hallucination happens. Draw the data flow and check each edge:

  1. Does uncertainty cross the boundary? Upstream confidence / missing_fields / null / notes — does anything downstream consume them, or does a template flatten them into an assertion? Look for "dead-letter" uncertainty: produced but never read.
  2. Is provenance lost? A value the upstream reasoning admits is "assumed/inferred" becomes an unmarked fact downstream. The schema should carry stated | inferred | absent.
  3. Do fact categories cross wires? Configuration or requirement data rendered as if the user/document asserted it.
  4. Does an aggregate change unit? "total medical years" upstream becomes "years as title X" downstream.
  5. Is the LLM doing deterministic computation? Date math, sums, counts in a prompt should move to a code step.
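
Check 1 lends itself to a small script once the data flow is written down. This is a minimal sketch, assuming you can list which fields each node emits and which fields each downstream node actually reads; `dead_letter_fields` and the shape of its arguments are illustrative assumptions.

```python
UNCERTAINTY_FIELDS = {"confidence", "missing_fields", "notes"}  # named in check 1


def dead_letter_fields(produced: dict[str, set[str]],
                       consumed: dict[str, set[str]],
                       edges: list[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Return (upstream, downstream, field) for uncertainty that is produced but never read."""
    dead = []
    for upstream, downstream in edges:
        emitted = produced.get(upstream, set()) & UNCERTAINTY_FIELDS
        read = consumed.get(downstream, set())
        for field in sorted(emitted - read):
            dead.append((upstream, downstream, field))
    return dead


# Hypothetical two-node pipeline: the extractor emits missing_fields,
# but the renderer's template never reads it.
produced = {"extract": {"years", "confidence", "missing_fields"}}
consumed = {"render": {"years", "confidence"}}
print(dead_letter_fields(produced, consumed, [("extract", "render")]))
```

Each returned tuple names one edge where uncertainty is dropped, which is the place to look for a template flattening it into an assertion.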

Phase 3 — match physical evidence (when task records exist)

  1. Pull intermediate node outputs for a sample of tasks (prefer ones with user corrections or failures).
  2. Compare final output vs each node's intermediate output vs the raw input, field by field; locate which boundary first lost the truth.
  3. Find prior inconsistent statements: the same field contradicting itself across nodes.
  4. If corrections are directional (systematically one way), re-check Phase-1 R4 directional instructions; random error points more to input quality.
  5. Classify honestly: "information that was never in the source" (a process gap, not a hallucination) vs "information that was in the input but stated wrong" (a real hallucination).
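
Steps 2 and 5 can be sketched as two small helpers. The sketch assumes each task record gives an ordered list of node outputs as dicts and that a user correction supplies the expected value; `first_divergence` and `classify` are hypothetical names, and the substring test is a crude stand-in for "was in the source".

```python
from typing import Any, Optional


def first_divergence(field: str, expected: Any,
                     stages: list[tuple[str, dict]]) -> Optional[str]:
    """Return the name of the first stage whose value for `field` differs from `expected`."""
    for name, output in stages:
        if output.get(field) != expected:
            return name
    return None


def classify(expected_value: str, raw_input: str) -> str:
    """Process gap if the corrected value never appeared in the raw input, else hallucination."""
    # Assumption: a case-insensitive substring test approximates "was in the input".
    if expected_value.lower() in raw_input.lower():
        return "hallucination"
    return "process gap"
```

For example, if the corrected value is present in the extractor's output but changed in the next node, `first_divergence` returns that next node, which points at the boundary rather than at the final renderer.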

Phase 4 — ranked fixes (by ROI)

Layer 1 — prompt / schema (zero structural change)

  • Each R1/R2 → add a null/unknown enum + one de-stigmatizing line ("X is a correct, expected answer").
  • Each R9 → add an empty branch to the template (assertion becomes an open question).
  • Each R5 → split the unit, or output a range {floor, ceiling}.
  • Each R3 → add a {value, source, evidence_quote} triple to load-bearing fields.
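
The R1/R2 and R3 fixes are small schema edits. The following is a minimal sketch, assuming a JSON-Schema-style dict; `add_abstention` and `evidence_triple` are hypothetical helpers, and "unknown" is only one possible label for the abstention value.

```python
import copy


def add_abstention(schema: dict, field: str, label: str = "unknown") -> dict:
    """Return a copy of `schema` where `field`'s enum also accepts `label`."""
    patched = copy.deepcopy(schema)
    spec = patched["properties"][field]
    if label not in spec["enum"]:
        spec["enum"].append(label)
    # One de-stigmatizing line, so the model does not treat abstaining as failure.
    note = f'"{label}" is a correct, expected answer when the source does not say.'
    spec["description"] = (spec.get("description", "") + " " + note).strip()
    return patched


def evidence_triple(value_spec: dict) -> dict:
    """Wrap a load-bearing field as {value, source, evidence_quote}."""
    return {
        "type": "object",
        "properties": {
            "value": value_spec,
            "source": {"type": "string"},
            "evidence_quote": {"type": "string"},
        },
        "required": ["value", "source", "evidence_quote"],
    }
```

The triple only pays off if something downstream checks `evidence_quote` against the source; on its own it would reproduce red flag R3.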

Layer 2 — a deterministic code validator (zero LLM cost)

| Validator | Checks |
|---|---|
| Enum whitelist | Field value is in the allowed set. |
| Quote string-match | evidence_quote exists verbatim in the cited source; if not, reject or downgrade. |
| Cross-field consistency | Totals reconcile; missing_fields does not contradict filled values. |
| Unit computation | Date diffs / sums done in code; the LLM only does semantic extraction. |
| Exit contract | Every user-visible fact slot is non-empty and comes from an allowed source field. |

Place it before any node that produces user-visible content. On failure, write a fallback flag and route to an open question or a human.
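
Several of these validators fit in a few lines of plain Python. This is a sketch under stated assumptions: the record is a flat dict with hypothetical `status`, `evidence_quote`, and `missing_fields` keys, and the quote check is an exact substring match, so any whitespace normalization would be a separate design decision.

```python
from datetime import date


def check_enum(value, allowed: set) -> bool:
    """Enum whitelist: the value must be in the allowed set."""
    return value in allowed


def check_quote(evidence_quote: str, source_text: str) -> bool:
    """True only if the quote is non-empty and appears verbatim in the cited source."""
    return bool(evidence_quote) and evidence_quote in source_text


def check_missing_fields(record: dict) -> list[str]:
    """Return fields listed as missing that nevertheless carry a value."""
    return [f for f in record.get("missing_fields", []) if record.get(f) is not None]


def years_between(start: date, end: date) -> int:
    """Whole years elapsed, computed in code rather than by the LLM."""
    return end.year - start.year - ((end.month, end.day) < (start.month, start.day))


def validate(record: dict, source_text: str, allowed_status: set) -> list[str]:
    """Return failure codes; an empty list means the record may proceed."""
    failures = []
    if not check_enum(record.get("status"), allowed_status):
        failures.append("enum")
    if not check_quote(record.get("evidence_quote", ""), source_text):
        failures.append("quote")
    if check_missing_fields(record):
        failures.append("consistency")
    return failures
```

A non-empty result from `validate` is the point at which to write the fallback flag and route the task to an open question or a human.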

Layer 3 — an LLM cross-examiner node (LM-vs-LM)

  • A separate node that gets only the output under review plus the source document (not the polluting context).
  • Phrase the task to refute: "Try to refute: does every claim in this output trace to the source document? List the claims that do not."
  • Output {verdict, offending_claims[]}; for high-stakes flows, re-ask reordered and take a consistency vote.
  • Cost: +1 LLM call per task — gate it to user-visible exits or high-stakes decisions only.
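
The cross-examiner with a reordered consistency vote might look like the sketch below. It assumes the output under review has already been split into claims, and that `ask` is a hypothetical wrapper around the LLM call which returns the claims the model could not trace to the source. The majority threshold and the number of rounds are illustrative choices.

```python
import random
from collections import Counter
from typing import Callable

REFUTE_INSTRUCTION = (
    "Try to refute: does every claim in this output trace to the source document? "
    "List the claims that do not."
)


def build_prompt(claims: list[str], source_document: str) -> str:
    """The examiner sees only the source and the output, not the upstream context."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    return (f"{REFUTE_INSTRUCTION}\n\nSource document:\n{source_document}"
            f"\n\nOutput under review:\n{numbered}")


def cross_examine(claims: list[str], source_document: str,
                  ask: Callable[[str], list[str]],
                  rounds: int = 3, seed: int = 0) -> dict:
    """Ask `rounds` times with the claims reordered; keep claims flagged by a majority."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(rounds):
        shuffled = claims[:]
        rng.shuffle(shuffled)
        # `ask` is hypothetical: it sends the prompt and parses the offending claims.
        votes.update(set(ask(build_prompt(shuffled, source_document))))
    offending = [c for c in claims if votes[c] > rounds / 2]
    return {"verdict": "fail" if offending else "pass", "offending_claims": offending}
```

Note that `rounds` multiplies the cost stated above, so the vote is best reserved for the high-stakes flows and a single call used elsewhere.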

Phase 5 — the report

# Hallucination Audit — {agent}
## One-line verdict
## Evidence overview (nodes / version audited / task sample / correction sample + direction)
## Findings (each: location | R# | quoted text | consequence | fix #)
## Data-flow map and dead-letter channels
## Fix recommendations (Layer 1/2/3, each tagged with the findings it kills)
## To verify (suspicions without enough evidence + how to get it)
## Honest boundary (which corrections are process gaps, not hallucinations; what this audit did not cover)

Supportiveness as the frontier (optional extension)

Faithful + relevant are the floor. The open problem is supportive: does the cited source support, contradict, or stay neutral on the claim? Two notes from the research:

  • Model it as stance plus degree. Three-way stance (entail / contradict / neutral) misses partial or conditional support, which is most of regulatory and legal text. Add a "partial/conditional" flag that routes to human review.
  • Supportiveness is relative to intent. The same passage supports "we comply" and contradicts "we have a gap", depending on what the user is arguing. When stance or intent is uncertain, surface both a supporting and a contradicting source and let the user choose. Present direction as a flag inviting review, not a verdict.
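
One way to represent "stance plus degree" is a three-way stance with separate flags for partial support and for uncertainty. This is a sketch only; `Stance`, `SupportJudgment`, `route`, and the routing labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Stance(Enum):
    SUPPORT = "support"
    CONTRADICT = "contradict"
    NEUTRAL = "neutral"


@dataclass
class SupportJudgment:
    stance: Stance
    partial_or_conditional: bool = False  # support holds only in part or under a condition
    stance_uncertain: bool = False        # stance or user intent could not be settled


def route(j: SupportJudgment) -> str:
    """Decide how a citation is presented; the result is a flag for review, not a verdict."""
    if j.partial_or_conditional:
        return "human_review"
    if j.stance_uncertain:
        return "show_supporting_and_contradicting"
    return f"flag_{j.stance.value}"
```

Keeping the degree flag separate from the stance means a conditional clause in regulatory text is never silently collapsed into plain support.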

Reminders

  • If a prompt is "shockingly bad", first check whether it is auto-generated boilerplate (repeated Role/Objective/Task is the tell). "A systematic problem with this generator" beats listing the same fault per agent.
  • Contrast within one agent: which fields are clean and which fail. The difference is usually the answer (fields with an unknown option tend to be clean).
  • Run the report through a plain-language editing pass: a finding must be legible to someone who has not read the theory.

Related skills

This skill is part of a small family of open agent skills for trustworthy, clear communication.
