name: skill-chunk-md
description: "Transform Markdown into CtxFST documents — a semantic world model format with structured chunks, entity graphs, and operational metadata. Use when converting notes into agent-ready knowledge bases, building entity memory or per-entity dossiers (memory-shaped, one-file-one-entity convention), adding <Chunk> tags and YAML frontmatter, extracting canonical entities from text, or preparing documents for LanceDB, Lance Graph, HelixDB, LightRAG, and HippoRAG pipelines."
Skill Chunk MD
Transform Markdown into CtxFST documents with semantic <Chunk> tags, structured frontmatter, and an explicit entity layer.
Goal
Use this skill to produce documents that support both:
- Chunk retrieval for detailed context
- Entity retrieval for navigation, graph expansion, and related-concept discovery
Do not only split text. Also extract the important domain entities, normalize them, and link each chunk to the entities it actually discusses.
Choosing a Layout
Decide early whether the source is note-shaped or memory-shaped:
- Note-shaped (one profile, one benchmark, one meeting log): use the standard multi-entity-per-file layout described below.
- Memory-shaped (a knowledge base that grows over time, per-entity dossiers, agent memory): use the Entity-Centric convention, with one file per entity, a filename of the form entity-<id-suffix>.ctxfst.md, an entities: list with exactly one owner, and every chunk referencing the owner. Cross-references use entity IDs and resolve to other files.
Entity-centric is a convention, not a schema change. The same frontmatter rules apply; the difference is that len(entities) == 1 and the filename carries the owner. See assets/examples/entity-centric/ for a worked example, and validate with python3 scripts/validate_chunks.py <path> --entity-centric.
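The convention can be illustrated with a small check. This is a minimal sketch, not the logic of scripts/validate_chunks.py: the function name is hypothetical, the frontmatter is assumed to be already parsed into a dict, and the id-suffix is assumed to be the part of the owner ID after the entity: prefix.

```python
def check_entity_centric(filename: str, frontmatter: dict) -> list[str]:
    """Return a list of problems with the entity-centric convention (empty if none)."""
    entities = frontmatter.get("entities", [])
    if len(entities) != 1:
        return [f"expected exactly one entity, found {len(entities)}"]

    owner = entities[0]["id"]
    # Assumption: the id-suffix is whatever follows the "entity:" prefix.
    suffix = owner.partition(":")[2]
    problems = []

    expected = f"entity-{suffix}.ctxfst.md"
    if filename != expected:
        problems.append(f"filename should be {expected}, got {filename}")

    # Every chunk must reference the owner entity.
    for chunk in frontmatter.get("chunks", []):
        if owner not in chunk.get("entities", []):
            problems.append(f"{chunk['id']}: does not reference owner {owner}")
    return problems
```

An empty return value means the file follows the convention as interpreted here.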
Target Format
CtxFST documents should contain:
- YAML frontmatter
- Document-level entities catalog
- Document-level chunks catalog
- Body content wrapped in <Chunk> tags
---
title: "Document Title"
entities:
  - id: entity:python
    name: Python
    type: skill
    aliases: [python3]
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []
chunks:
  - id: skill:python-api
    tags: [Python, Backend, API]
    entities: [entity:python, entity:fastapi]
    context: "Python backend work focused on APIs built with FastAPI"
---
<Chunk id="skill:python-api">
## Python API Work
I use Python and FastAPI to build REST APIs...
</Chunk>
Core Principle
Use chunks as the content carrier and entities as the semantic index.
- Chunks answer: "What exact passage should be retrieved?"
- Entities answer: "What concept does this passage belong to?"
Tags are useful for broad filtering. Entities are the canonical graph nodes.
Core Workflow
Step 1: Analyze Document Structure
Identify semantic boundaries in the source Markdown:
- Headers (##, ###) that introduce a new topic
- Thematic shifts within long sections
- Lists that describe one coherent concept
- Code blocks plus their explanation when they should stay together
Step 2: Determine Chunk Boundaries
Each chunk should be:
- Self-contained: understandable when retrieved alone
- Focused: centered on one main topic or closely related subtopic
- Retrievable: useful as a standalone answer fragment
Size guidelines:
- Minimum: ~100 tokens
- Target: 300-800 tokens
- Maximum: ~1500 tokens
Split oversized chunks when the topic changes. Merge undersized chunks when they cannot stand on their own.
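The size guidelines can be turned into a quick triage helper. This is a rough sketch: the function name and return labels are hypothetical, and whitespace-separated words are used as a stand-in for tokens, which only approximates what a real tokenizer would count.

```python
def classify_chunk_size(text: str, min_tokens: int = 100, max_tokens: int = 1500) -> str:
    """Classify a chunk as a merge candidate, a split candidate, or acceptable."""
    # Assumption: one whitespace-separated word is roughly one token.
    approx_tokens = len(text.split())
    if approx_tokens < min_tokens:
        return "merge-candidate"
    if approx_tokens > max_tokens:
        return "split-candidate"
    return "ok"
```

The result is only a prompt for review: an oversized chunk should still be split where the topic changes, not at an arbitrary token count.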
Step 3: Extract Candidate Entities
Before writing frontmatter, extract the domain-specific entities from the document.
Look for:
- Hard skills
- Tools and libraries
- Frameworks
- Platforms
- Databases
- Protocols and standards
- Architectures and design patterns
- Named products or systems
Do not promote every noun into an entity. Prefer terms that would make sense as nodes in a knowledge graph.
Step 4: Normalize and Deduplicate Entities
Convert raw mentions into canonical entities.
Normalization rules:
- Use the most recognizable canonical name: PostgreSQL, not postgres
- Merge aliases into one entity: JS -> JavaScript, K8s -> Kubernetes
- Keep acronym + full name only when both are genuinely used as aliases
- Remove generic terms like system, project, tool, computer, work
- Remove incidental mentions that are not semantically important to retrieval
If unsure whether something is an entity, ask:
- Would this help navigate related knowledge?
- Would this deserve its own node in a skills graph?
- Would retrieving chunks by this term produce meaningful results?
If the answer is no, keep it out of the entity list.
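The mechanical part of these rules (alias merging, generic-term removal, deduplication) might look like the sketch below. The alias table, stop list, and function name are hypothetical and would need to be extended per domain; the judgment questions above still have to be answered by the person or agent doing the extraction.

```python
# Hypothetical alias table and stop list; extend both for your own domain.
ALIASES = {"js": "JavaScript", "k8s": "Kubernetes", "postgres": "PostgreSQL"}
GENERIC_TERMS = {"system", "project", "tool", "computer", "work"}


def normalize_entities(mentions: list[str]) -> list[str]:
    """Map raw mentions to canonical names, dropping generic terms and duplicates."""
    canonical: list[str] = []
    seen: set[str] = set()
    for mention in mentions:
        key = mention.strip().lower()
        if not key or key in GENERIC_TERMS:
            continue
        # Fall back to the mention itself when no alias is registered.
        name = ALIASES.get(key, mention.strip())
        if name.lower() not in seen:
            seen.add(name.lower())
            canonical.append(name)
    return canonical
```

For example, the mentions JS, JavaScript, K8s, system, postgres, and PostgreSQL collapse to JavaScript, Kubernetes, and PostgreSQL.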
Step 5: Generate Entity IDs
Use the format: entity:{canonical-name}
Examples:
- entity:python
- entity:fastapi
- entity:postgresql
- entity:event-driven-architecture
IDs must be lowercase kebab-case after the prefix.
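One way to derive such an ID from a canonical name is sketched here. The function name is hypothetical, and the rule of collapsing every run of non-alphanumeric characters into a single hyphen is an assumption; names that differ only in symbols (C++ and C#, for instance) would collide and need manual IDs.

```python
import re


def entity_id(canonical_name: str) -> str:
    """Build an entity ID with a lowercase kebab-case slug after the prefix."""
    # Assumption: any run of characters outside a-z and 0-9 becomes one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", canonical_name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive an ID from {canonical_name!r}")
    return f"entity:{slug}"
```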
Step 6: Generate Chunk IDs
Use the format: {category}:{topic}[-{subtopic}]
| Category | Use Case | Examples |
|---|---|---|
| skill: | Technical skills | skill:python, skill:react-hooks |
| about: | Personal or org info | about:background, about:mission |
| project: | Project descriptions | project:graphrag, project:api-v2 |
| principle: | Guidelines and values | principle:security-first |
| workflow: | Processes | workflow:deployment, workflow:review |
| reference: | Reference material | reference:api-auth, reference:schema |
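A format check for chunk IDs could be sketched as follows. Two assumptions go beyond the text: that the six categories in the table form a closed set, and that the topic part is lowercase kebab-case like the examples. Loosen either if your documents need more categories.

```python
import re

# Assumption: the categories from the table are the only allowed prefixes.
CATEGORIES = ("skill", "about", "project", "principle", "workflow", "reference")

# category, a colon, then lowercase alphanumeric words joined by single hyphens
CHUNK_ID = re.compile("(?:" + "|".join(CATEGORIES) + r"):[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_chunk_id(chunk_id: str) -> bool:
    """Return True when the ID matches {category}:{topic}[-{subtopic}]."""
    return CHUNK_ID.fullmatch(chunk_id) is not None
```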
Step 7: Create YAML Frontmatter
Define both entities and chunks.
---
title: "My Skills Document"
entities:
  - id: entity:python
    name: Python
    type: skill
    aliases: [python3]
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []
  - id: entity:pandas
    name: Pandas
    type: library
    aliases: []
chunks:
  - id: skill:python-api
    tags: [Python, Backend, API]
    entities: [entity:python, entity:fastapi]
    context: "Python backend skills focused on REST APIs and service implementation"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: high
    dependencies: []
  - id: skill:python-data
    tags: [Python, Data]
    entities: [entity:python, entity:pandas]
    context: "Python data-processing skills focused on tabular analysis and ETL work"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: medium
    dependencies: []
---
Step 8: Link Chunks to Entities
Every chunk should list the canonical entities it actually discusses.
Linking rules:
- Include entities that are central to the chunk
- Skip entities that appear only in passing
- Prefer 1-6 entities per chunk
- Do not copy all document entities into every chunk
- If two chunks mention the same entity for different reasons, differentiate that in context
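Two of the linking rules are countable and can be checked automatically. The sketch below is illustrative only: the function name and warning wording are hypothetical, and it works on chunk entries already parsed from the frontmatter. Whether an entity is central or merely mentioned in passing still requires reading the chunk.

```python
def check_entity_links(chunks: list[dict], all_entity_ids: list[str],
                       max_links: int = 6) -> list[str]:
    """Warn about chunks with too many links or wholesale copying of the catalog."""
    warnings = []
    for chunk in chunks:
        links = chunk.get("entities", [])
        if len(links) > max_links:
            warnings.append(f"{chunk['id']}: {len(links)} links, prefer at most {max_links}")

    # Detect the case where every chunk simply repeats the full entity catalog.
    full_set = set(all_entity_ids)
    copies = [c for c in chunks if set(c.get("entities", [])) == full_set]
    if len(chunks) > 1 and len(full_set) > 1 and len(copies) == len(chunks):
        warnings.append("every chunk links every document entity")
    return warnings
```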
Step 9: Wrap Content with <Chunk> Tags
Apply <Chunk> tags that match the frontmatter chunk IDs.
<Chunk id="skill:python-api">
## Python API Work
I use Python and FastAPI to build REST APIs and internal services.
</Chunk>
Step 10: Validate and Export
Use the included scripts:
python3 scripts/validate_chunks.py document.md
python3 scripts/export_to_lancedb.py document.md --output chunks.json
python3 scripts/diagnose_chunks.py document.md --level suggest
Entity Rules
What counts as an entity
Prefer entities that are:
- specific
- reusable across documents
- meaningful as graph nodes
- useful for retrieval or expansion
Good examples:
- Python
- FastAPI
- Docker
- Kubernetes
- CI/CD
- Event-Driven Architecture
- JWT
Usually not entities:
- experience
- technology
- feature
- application
- computer
- problem
- task
Entity types
Use one of these default types:
- skill
- tool
- library
- framework
- platform
- database
- architecture
- protocol
- concept
- domain
- product
If none fit well, use concept instead of inventing many custom types.
World Model entity types
When building documents that participate in a world model or agent loop, these additional types are available:
- state: world state node (e.g., entity:resume-parsed)
- action: executable action (e.g., entity:analyze-resume)
- goal: task objective (e.g., entity:learn-kubernetes-path)
- agent: actor or user (e.g., entity:ian-chou)
- evidence: observation result (e.g., entity:docker-3yr-experience)
World Model YAML fields
SKILL.md files that participate in a world model may include these optional YAML frontmatter fields:
---
name: analyze-resume
description: "Parse raw resume and extract skill evidence"
# === World Model Fields (all optional) ===
preconditions:
  - "entity:has-raw-resume"
  - "NOT entity:has-parsed-resume"
postconditions:
  - "entity:has-parsed-resume"
  - "entity:has-skill-evidence"
related_nodes:
  - "entity:resume-parsing"
  - "entity:skill-extraction"
related_skills:
  - "career-mapping"
  - "skill-gap-analysis"
cost: low
idempotent: true
---
| Field | Type | Description |
|---|---|---|
| preconditions | string[] | State entities that must exist before this skill can execute |
| postconditions | string[] | State entities created or updated after execution |
| related_nodes | string[] | Anchor points in the semantic graph |
| related_skills | string[] | Sequential or complementary skill names |
| cost | enum | Estimated execution cost: low, medium, or high |
| idempotent | bool | Whether the skill is safe to re-run without side effects |
Preconditions use a NOT prefix for negation (e.g., "NOT entity:has-parsed-resume" means that state must not exist).
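How an agent loop might evaluate these preconditions is sketched below. Representing the world state as a plain set of entity IDs is an assumption made for illustration, and the function name is hypothetical; the format itself does not prescribe an evaluator.

```python
def preconditions_met(preconditions: list[str], world_state: set[str]) -> bool:
    """Check each precondition against the set of state entities that currently exist."""
    for condition in preconditions:
        if condition.startswith("NOT "):
            # Negated condition: the named state must be absent.
            if condition[len("NOT "):].strip() in world_state:
                return False
        elif condition not in world_state:
            # Plain condition: the named state must be present.
            return False
    return True
```

With the analyze-resume example, the skill is runnable while entity:has-raw-resume exists and entity:has-parsed-resume does not, and stops being runnable once its own postconditions have been added to the state.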
Tags vs entities
Use tags for:
- broad classification
- filtering
- document organization
Use entities for:
- canonical concept identity
- chunk-to-graph linking
- similarity and traversal workflows
- cross-document concept reuse
Example:
- Tag: Backend
- Entity: entity:fastapi
Frontmatter Schema
Entity schema
entities:
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []
Required fields:
- id
- name
- type
Optional field:
- aliases
Chunk schema
chunks:
  - id: skill:python-api
    tags: [Python, Backend, API]
    entities: [entity:python, entity:fastapi]
    context: "Python backend skills focused on REST APIs built with FastAPI"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: high
    dependencies: []
Recommended fields:
- id
- entities
- context
Optional fields:
- tags
- created_at
- version
- type
- priority
- dependencies
Context Writing Rules
Each chunk context should:
- explain the chunk's role in the document
- mention the distinguishing use case
- reflect the linked entities naturally
- avoid copying the first sentence verbatim
Good:
"Python backend skills focused on REST APIs, async services, and FastAPI-based implementation"
Weak:
"This chunk talks about Python"
Example Transformation
Before
## About Me
I'm a backend engineer focused on APIs and distributed systems.
## Python
I use Python for REST APIs and internal tools.
### Key Libraries
- FastAPI for web services
- Pandas for data processing
After
---
title: "Profile"
entities:
  - id: entity:python
    name: Python
    type: skill
    aliases: [python3]
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []
  - id: entity:pandas
    name: Pandas
    type: library
    aliases: []
chunks:
  - id: about:background
    tags: [About, Experience]
    entities: []
    context: "Professional background as a backend engineer working on APIs and distributed systems"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: medium
    dependencies: []
  - id: skill:python
    tags: [Python, Backend]
    entities: [entity:python, entity:fastapi, entity:pandas]
    context: "Python skills for API development, service work, and data processing"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: high
    dependencies: [about:background]
---
<Chunk id="about:background">
## About Me
I'm a backend engineer focused on APIs and distributed systems.
</Chunk>
<Chunk id="skill:python">
## Python
I use Python for REST APIs and internal tools.
### Key Libraries
- FastAPI for web services
- Pandas for data processing
</Chunk>
GraphRAG Guidance
When preparing documents for graph-oriented retrieval:
- Treat entities as the canonical node inventory
- Treat chunk entities arrays as chunk-to-entity edges
- Treat dependencies as chunk-to-chunk prerequisite links
- Keep tags broad and entities precise
Do not rely on tags alone when the goal is entity-based navigation.
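The mapping from frontmatter to graph edges can be sketched in a few lines. The edge labels (mentions, depends_on) and the function name are hypothetical, and the frontmatter is assumed to be already parsed into a dict; a real pipeline would translate these pairs into whatever edge model its graph store uses.

```python
def build_edges(frontmatter: dict) -> dict[str, list[tuple[str, str]]]:
    """Derive chunk-to-entity and chunk-to-chunk edges from parsed frontmatter."""
    mentions: list[tuple[str, str]] = []
    depends_on: list[tuple[str, str]] = []
    for chunk in frontmatter.get("chunks", []):
        # Each entry in a chunk's entities array is one chunk-to-entity edge.
        for entity in chunk.get("entities", []):
            mentions.append((chunk["id"], entity))
        # Each dependency is one chunk-to-chunk prerequisite edge.
        for dependency in chunk.get("dependencies", []):
            depends_on.append((chunk["id"], dependency))
    return {"mentions": mentions, "depends_on": depends_on}
```

Tags are deliberately left out of the edge list, since they are meant for filtering and not for traversal.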
When Not to Extract Many Entities
Be conservative when:
- the document is very short
- the document is mostly narrative with few domain terms
- the same concept is repeated with no useful distinctions
- the source is noisy and entity extraction would create junk nodes
In these cases, return a smaller, cleaner entity set.
Validation
After conversion, verify:
- Every <Chunk> ID exists in chunks
- Every chunks[].entities reference exists in entities
- Entity IDs are unique and canonical
- Chunk IDs are unique
- No nested chunks exist
- Context fields differentiate similar chunks
- Generic or noisy entities have been removed
python3 scripts/validate_chunks.py path/to/document.md
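For readers who want to see what the structural checks amount to, here is an independent sketch. It is not the implementation of scripts/validate_chunks.py: the function name and error messages are hypothetical, the frontmatter is assumed to be already parsed into a dict, and the tag pattern assumes the exact `<Chunk id="...">` form used in this document. The judgment-based checks (canonical IDs, context quality, noisy entities) are out of scope.

```python
import re

# Matches an opening tag (capturing its id) or a closing tag.
CHUNK_TAG = re.compile(r'<Chunk id="([^"]+)">|</Chunk>')


def validate(frontmatter: dict, body: str) -> list[str]:
    """Return structural errors found in the frontmatter and the chunk-tagged body."""
    errors = []
    entity_ids = [e["id"] for e in frontmatter.get("entities", [])]
    chunk_ids = [c["id"] for c in frontmatter.get("chunks", [])]

    if len(set(entity_ids)) != len(entity_ids):
        errors.append("duplicate entity IDs")
    if len(set(chunk_ids)) != len(chunk_ids):
        errors.append("duplicate chunk IDs")

    # Every chunks[].entities reference must exist in the entities catalog.
    for chunk in frontmatter.get("chunks", []):
        for ref in chunk.get("entities", []):
            if ref not in entity_ids:
                errors.append(f"{chunk['id']}: unknown entity {ref}")

    # Walk the body tags in order, tracking nesting depth.
    depth = 0
    for match in CHUNK_TAG.finditer(body):
        tag_id = match.group(1)
        if tag_id is not None:  # opening tag
            depth += 1
            if depth > 1:
                errors.append(f"nested chunk: {tag_id}")
            if tag_id not in chunk_ids:
                errors.append(f"body chunk not in frontmatter: {tag_id}")
        else:  # closing tag
            depth = max(depth - 1, 0)
    return errors
```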
Diagnostics
When the user asks to diagnose, review both chunk quality and entity quality.
Check categories
- Chunk similarity
  - Flag chunks that are too similar to distinguish during retrieval
- Context quality
  - Flag vague, repetitive, or overly short context fields
- Tag overlap
  - Flag tags that appear everywhere and no longer help filtering
- Entity noise
  - Flag generic or low-value entities
- Entity duplication
  - Flag aliases that should be merged into one canonical node
- Chunk-to-entity linking
  - Flag chunks with missing, excessive, or irrelevant entity links
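Two of these categories lend themselves to simple heuristics, sketched below. Both function names are hypothetical, the six-word threshold for a short context is an arbitrary assumption, and neither function reflects how scripts/diagnose_chunks.py works. Vague but long contexts, chunk similarity, and entity noise need a closer reading or embedding-based comparison.

```python
from collections import Counter


def ubiquitous_tags(chunks: list[dict]) -> list[str]:
    """Return tags that appear on every chunk and so no longer help filtering."""
    if len(chunks) < 2:
        return []
    # Count each tag once per chunk, even if it is listed twice.
    counts = Counter(tag for chunk in chunks for tag in set(chunk.get("tags", [])))
    return sorted(tag for tag, n in counts.items() if n == len(chunks))


def short_contexts(chunks: list[dict], min_words: int = 6) -> list[str]:
    """Return IDs of chunks whose context has fewer than min_words words."""
    # Assumption: word count is a crude proxy for an underspecified context.
    return [c["id"] for c in chunks if len(c.get("context", "").split()) < min_words]
```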
Diagnostic prompts
Diagnose only:
Check this document for chunk and entity quality issues
Diagnose with suggestions:
Diagnose this document's chunk and entity structure and suggest fixes
Fix-oriented review:
Rewrite the frontmatter so the entities are canonical and each chunk links only to the right entities
Resources
- Chunk syntax details: See references/chunk-syntax.md
- Entity format reference: See references/entity-format.md
- Semantic chunking theory: See references/semantic-chunking.md
- Examples: See assets/examples/ for before/after samples