Skill Chunk MD

Transform Markdown into CtxFST documents with semantic <Chunk> tags, structured frontmatter, and an explicit entity layer.

Goal

Use this skill to produce documents that support both:

Chunk retrieval for detailed context
Entity retrieval for navigation, graph expansion, and related-concept discovery

Do not just split text. Also extract the important domain entities, normalize them, and link each chunk to the entities it actually discusses.

Choosing a Layout

Decide early whether the source is note-shaped or memory-shaped:

Note-shaped (one profile, one benchmark, one meeting log): use the standard multi-entity-per-file layout described below.
Memory-shaped (a knowledge base that grows over time, per-entity dossiers, agent memory): use the Entity-Centric convention: one file per entity, a filename of the form entity-<id-suffix>.ctxfst.md, an entities: list with exactly one owner, and every chunk referencing that owner. Cross-references use entity IDs and resolve to other files.

Entity-centric is a convention, not a schema change. The same frontmatter rules apply; the difference is that len(entities) == 1 and the filename carries the owner. See assets/examples/entity-centric/ for a worked example, and validate with python3 scripts/validate_chunks.py <path> --entity-centric.
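
As a rough illustration of what the convention implies, the sketch below checks the three constraints stated above against frontmatter that has already been parsed into a dict. The function name `check_entity_centric` and the dict shape are assumptions made for this example; the bundled scripts/validate_chunks.py may work differently.

```python
from pathlib import PurePath


def check_entity_centric(filename: str, frontmatter: dict) -> list[str]:
    """Return a list of problems; an empty list means the file follows the convention."""
    problems = []
    entities = frontmatter.get("entities", [])
    if len(entities) != 1:
        problems.append(f"expected exactly 1 entity, found {len(entities)}")
        return problems

    owner_id = entities[0]["id"]                  # e.g. "entity:fastapi"
    suffix = owner_id.partition(":")[2]           # e.g. "fastapi"
    expected_name = f"entity-{suffix}.ctxfst.md"
    if PurePath(filename).name != expected_name:
        problems.append(f"filename should be {expected_name}")

    # Every chunk must reference the single owner entity.
    for chunk in frontmatter.get("chunks", []):
        if owner_id not in chunk.get("entities", []):
            problems.append(f"chunk {chunk['id']} does not reference {owner_id}")
    return problems
```

A chunk may still list other entity IDs next to the owner; those are the cross-references that resolve to other files.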

Target Format

CtxFST documents should contain:

YAML frontmatter
Document-level entities catalog
Document-level chunks catalog
Body content wrapped in <Chunk> tags

---
title: "Document Title"
entities:
  - id: entity:python
    name: Python
    type: skill
    aliases: [python3]
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []
chunks:
  - id: skill:python-api
    tags: [Python, Backend, API]
    entities: [entity:python, entity:fastapi]
    context: "Python backend work focused on APIs built with FastAPI"
---

<Chunk id="skill:python-api">
## Python API Work
I use Python and FastAPI to build REST APIs...
</Chunk>
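
To make the body format concrete, here is a minimal sketch of reading `<Chunk>` blocks back out of a document body with a regular expression. It assumes tags are written exactly as in the example above (a double-quoted `id` attribute and no nesting); `extract_chunks` is a hypothetical helper, not part of the bundled scripts.

```python
import re

# Matches <Chunk id="...">...</Chunk>; DOTALL lets the body span multiple lines.
CHUNK_RE = re.compile(r'<Chunk id="([^"]+)">\n?(.*?)\n?</Chunk>', re.DOTALL)


def extract_chunks(body: str) -> dict[str, str]:
    """Map each chunk ID to the text between its opening and closing tags."""
    # Note: if an ID appears twice, the later block silently overwrites the earlier one.
    return {chunk_id: text.strip() for chunk_id, text in CHUNK_RE.findall(body)}
```

The IDs returned here are the ones that should also appear in the frontmatter chunks catalog.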

Core Principle

Use chunks as the content carrier and entities as the semantic index.

Chunks answer: "What exact passage should be retrieved?"
Entities answer: "What concept does this passage belong to?"

Tags are useful for broad filtering. Entities are the canonical graph nodes.

Core Workflow

Step 1: Analyze Document Structure

Identify semantic boundaries in the source Markdown:

Headers (##, ###) that introduce a new topic
Thematic shifts within long sections
Lists that describe one coherent concept
Code blocks plus their explanation when they should stay together

Step 2: Determine Chunk Boundaries

Each chunk should be:

Self-contained: understandable when retrieved alone
Focused: centered on one main topic or closely related subtopic
Retrievable: useful as a standalone answer fragment

Size guidelines:

Minimum: ~100 tokens
Target: 300-800 tokens
Maximum: ~1500 tokens

Split oversized chunks when the topic changes. Merge undersized chunks when they cannot stand on their own.
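
The size guidelines can be applied mechanically as a first pass. The sketch below uses a crude estimate of about 1.3 tokens per whitespace-separated word, which is an assumption rather than a property of any particular tokenizer; the function name and the returned labels are illustrative only.

```python
def classify_chunk_size(text: str, tokens_per_word: float = 1.3) -> str:
    """Classify a chunk against the size guidelines using a rough token estimate."""
    # Assumption: ~1.3 tokens per word; swap in a real tokenizer if one is available.
    estimated_tokens = int(len(text.split()) * tokens_per_word)
    if estimated_tokens < 100:
        return "too-small"      # candidate for merging with a neighbor
    if estimated_tokens > 1500:
        return "too-large"      # candidate for splitting at a topic change
    if 300 <= estimated_tokens <= 800:
        return "on-target"
    return "acceptable"
```

The labels only flag candidates. Whether to split or merge still depends on where the topic actually changes.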

Step 3: Extract Candidate Entities

Before writing frontmatter, extract the domain-specific entities from the document.

Look for:

Hard skills
Tools and libraries
Frameworks
Platforms
Databases
Protocols and standards
Architectures and design patterns
Named products or systems

Do not promote every noun into an entity. Prefer terms that would make sense as nodes in a knowledge graph.

Step 4: Normalize and Deduplicate Entities

Convert raw mentions into canonical entities.

Normalization rules:

Use the most recognizable canonical name: PostgreSQL, not postgres
Merge aliases into one entity: JS -> JavaScript, K8s -> Kubernetes
Keep both the acronym and the full name as aliases only when both are genuinely used
Remove generic terms like system, project, tool, computer, work
Remove incidental mentions that are not semantically important to retrieval

If unsure whether something is an entity, ask:

Would this help navigate related knowledge?
Would this deserve its own node in a skills graph?
Would retrieving chunks by this term produce meaningful results?

If the answer is no, keep it out of the entity list.
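
The mechanical part of normalization (alias merging, generic-term removal, de-duplication) might look like the sketch below. The `ALIASES` and `GENERIC_TERMS` tables are hypothetical and deliberately tiny; judging whether a mention is semantically important still needs a human or model pass.

```python
# Hypothetical lookup tables; a real project would maintain its own.
ALIASES = {"postgres": "PostgreSQL", "js": "JavaScript", "k8s": "Kubernetes"}
GENERIC_TERMS = {"system", "project", "tool", "computer", "work"}


def normalize_entities(mentions: list[str]) -> list[str]:
    """Map raw mentions to canonical names, dropping generic terms and duplicates."""
    canonical = []
    seen = set()
    for mention in mentions:
        key = mention.strip().lower()
        if key in GENERIC_TERMS:
            continue
        name = ALIASES.get(key, mention.strip())
        if name.lower() not in seen:      # case-insensitive de-duplication
            seen.add(name.lower())
            canonical.append(name)
    return canonical
```

For example, `normalize_entities(["postgres", "PostgreSQL", "K8s", "system"])` yields `["PostgreSQL", "Kubernetes"]`.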

Step 5: Generate Entity IDs

Use the format: entity:{canonical-name}

Examples:

entity:python
entity:fastapi
entity:postgresql
entity:event-driven-architecture

IDs must be lowercase kebab-case after the prefix.
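
One way to derive such IDs from canonical names is sketched below. `make_entity_id` is a hypothetical helper, and collapsing every run of non-alphanumeric characters into a single hyphen is an assumption that goes beyond the stated rule.

```python
import re


def make_entity_id(canonical_name: str) -> str:
    """Build an entity ID: 'entity:' plus the lowercase kebab-case name."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", canonical_name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive an ID from {canonical_name!r}")
    return f"entity:{slug}"
```

Names whose meaning lives in symbols (C++ would become `entity:c`) need a hand-picked ID instead.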

Step 6: Generate Chunk IDs

Use the format: {category}:{topic}[-{subtopic}]

| Category | Use Case | Examples |
| --- | --- | --- |
| `skill:` | Technical skills | `skill:python`, `skill:react-hooks` |
| `about:` | Personal or org info | `about:background`, `about:mission` |
| `project:` | Project descriptions | `project:graphrag`, `project:api-v2` |
| `principle:` | Guidelines and values | `principle:security-first` |
| `workflow:` | Processes | `workflow:deployment`, `workflow:review` |
| `reference:` | Reference material | `reference:api-auth`, `reference:schema` |
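
A quick format check for chunk IDs could look like the following sketch. It assumes the six categories listed above are the only ones allowed and that the topic part is lowercase kebab-case like the examples; the source states neither as a hard rule, so treat both as assumptions.

```python
import re

CATEGORIES = ("skill", "about", "project", "principle", "workflow", "reference")

# A category, a colon, then lowercase kebab-case words (topic plus optional subtopic).
CHUNK_ID_RE = re.compile(rf"^({'|'.join(CATEGORIES)}):[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_chunk_id(chunk_id: str) -> bool:
    """Return True when the ID matches {category}:{topic}[-{subtopic}]."""
    return CHUNK_ID_RE.match(chunk_id) is not None
```

Under these assumptions `skill:react-hooks` passes, while `Skill:Python` and `misc:notes` do not.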

Step 7: Create YAML Frontmatter

Define both entities and chunks.

---
title: "My Skills Document"
entities:
  - id: entity:python
    name: Python
    type: skill
    aliases: [python3]
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []
  - id: entity:pandas
    name: Pandas
    type: library
    aliases: []
chunks:
  - id: skill:python-api
    tags: [Python, Backend, API]
    entities: [entity:python, entity:fastapi]
    context: "Python backend skills focused on REST APIs and service implementation"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: high
    dependencies: []
  - id: skill:python-data
    tags: [Python, Data]
    entities: [entity:python, entity:pandas]
    context: "Python data-processing skills focused on tabular analysis and ETL work"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: medium
    dependencies: []
---

Step 8: Link Chunks to Entities

Every chunk should list the canonical entities it actually discusses.

Linking rules:

Include entities that are central to the chunk
Skip entities that appear only in passing
Prefer 1-6 entities per chunk
Do not copy all document entities into every chunk
If two chunks mention the same entity for different reasons, differentiate that in context
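
The countable parts of these rules can be checked automatically, as in the sketch below. `check_entity_links` is a hypothetical helper operating on parsed frontmatter, and its output should be read as review prompts rather than errors: a purely narrative chunk may legitimately link no entities.

```python
def check_entity_links(frontmatter: dict, max_links: int = 6) -> list[str]:
    """Flag chunks whose entity links look missing, excessive, or copied wholesale."""
    all_ids = {e["id"] for e in frontmatter.get("entities", [])}
    chunks = frontmatter.get("chunks", [])
    warnings = []
    for chunk in chunks:
        count = len(set(chunk.get("entities", [])))
        if count == 0:
            warnings.append(f"{chunk['id']}: no entities linked")
        elif count > max_links:
            warnings.append(f"{chunk['id']}: {count} entities linked (prefer 1-{max_links})")
    # Document-level check: every chunk carrying the full entity list
    # suggests the list was copied rather than chosen per chunk.
    if len(chunks) > 1 and len(all_ids) > 1 and all(
        set(c.get("entities", [])) == all_ids for c in chunks
    ):
        warnings.append("every chunk links every document entity")
    return warnings
```

Whether an entity is central or only mentioned in passing cannot be decided this way and still needs a read of the chunk itself.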

Step 9: Wrap Content with `<Chunk>` Tags

Apply <Chunk> tags that match the frontmatter chunk IDs.

<Chunk id="skill:python-api">
## Python API Work

I use Python and FastAPI to build REST APIs and internal services.
</Chunk>

Step 10: Validate and Export

Use the included scripts:

python3 scripts/validate_chunks.py document.md
python3 scripts/export_to_lancedb.py document.md --output chunks.json
python3 scripts/diagnose_chunks.py document.md --level suggest

Entity Rules

What counts as an entity

Prefer entities that are:

specific
reusable across documents
meaningful as graph nodes
useful for retrieval or expansion

Good examples:

Python
FastAPI
Docker
Kubernetes
CI/CD
Event-Driven Architecture
JWT

Usually not entities:

experience
technology
feature
application
computer
problem
task

Entity types

Use one of these default types:

skill
tool
library
framework
platform
database
architecture
protocol
concept
domain
product

If none fit well, use concept instead of inventing many custom types.

World Model entity types

When building documents that participate in a world model or agent loop, these additional types are available:

state — world state node (e.g., entity:resume-parsed)
action — executable action (e.g., entity:analyze-resume)
goal — task objective (e.g., entity:learn-kubernetes-path)
agent — actor or user (e.g., entity:ian-chou)
evidence — observation result (e.g., entity:docker-3yr-experience)

World Model YAML fields

SKILL.md files that participate in a world model may include these optional YAML frontmatter fields:

---
name: analyze-resume
description: "Parse raw resume and extract skill evidence"
# === World Model Fields (all optional) ===
preconditions:
  - "entity:has-raw-resume"
  - "NOT entity:has-parsed-resume"
postconditions:
  - "entity:has-parsed-resume"
  - "entity:has-skill-evidence"
related_nodes:
  - "entity:resume-parsing"
  - "entity:skill-extraction"
related_skills:
  - "career-mapping"
  - "skill-gap-analysis"
cost: low
idempotent: true
---

| Field | Type | Description |
| --- | --- | --- |
| `preconditions` | `string[]` | State entities that must exist before this skill can execute |
| `postconditions` | `string[]` | State entities created or updated after execution |
| `related_nodes` | `string[]` | Anchor points in the semantic graph |
| `related_skills` | `string[]` | Sequential or complementary skill names |
| `cost` | `enum` | Estimated execution cost: `low`, `medium`, or `high` |
| `idempotent` | `bool` | Whether the skill is safe to re-run without side effects |

Preconditions use a NOT prefix for negation (e.g., "NOT entity:has-parsed-resume" means that state must not exist).
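
A planner or agent loop could evaluate these fields roughly as sketched below. Representing world state as a set of entity IDs is an assumption made for illustration, and both function names are hypothetical.

```python
def preconditions_met(preconditions: list[str], world_state: set[str]) -> bool:
    """Return True when every precondition holds for the given set of state entity IDs."""
    for condition in preconditions:
        if condition.startswith("NOT "):
            # Negated: the named state must be absent.
            if condition[4:].strip() in world_state:
                return False
        elif condition not in world_state:
            return False
    return True


def apply_postconditions(postconditions: list[str], world_state: set[str]) -> set[str]:
    """Return a new state with the postcondition entities added."""
    return world_state | set(postconditions)
```

With the analyze-resume example, a state of `{"entity:has-raw-resume"}` satisfies the preconditions; once the postconditions are applied, the `NOT entity:has-parsed-resume` condition no longer holds.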

Tags vs entities

Use tags for:

broad classification
filtering
document organization

Use entities for:

canonical concept identity
chunk-to-graph linking
similarity and traversal workflows
cross-document concept reuse

Example:

Tag: Backend
Entity: entity:fastapi

Frontmatter Schema

Entity schema

entities:
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []

Required fields:

id
name
type

Optional field:

aliases

Chunk schema

chunks:
  - id: skill:python-api
    tags: [Python, Backend, API]
    entities: [entity:python, entity:fastapi]
    context: "Python backend skills focused on REST APIs built with FastAPI"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: high
    dependencies: []

Recommended fields:

id
entities
context

Optional fields:

tags
created_at
version
type
priority
dependencies

Context Writing Rules

Each chunk context should:

explain the chunk's role in the document
mention the distinguishing use case
reflect the linked entities naturally
avoid copying the first sentence verbatim

Good:

"Python backend skills focused on REST APIs, async services, and FastAPI-based implementation"

Weak:

"This chunk talks about Python"

Example Transformation

Before

## About Me

I'm a backend engineer focused on APIs and distributed systems.

## Python

I use Python for REST APIs and internal tools.

### Key Libraries
- FastAPI for web services
- Pandas for data processing

After

---
title: "Profile"
entities:
  - id: entity:python
    name: Python
    type: skill
    aliases: [python3]
  - id: entity:fastapi
    name: FastAPI
    type: framework
    aliases: []
  - id: entity:pandas
    name: Pandas
    type: library
    aliases: []
chunks:
  - id: about:background
    tags: [About, Experience]
    entities: []
    context: "Professional background as a backend engineer working on APIs and distributed systems"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: medium
    dependencies: []
  - id: skill:python
    tags: [Python, Backend]
    entities: [entity:python, entity:fastapi, entity:pandas]
    context: "Python skills for API development, service work, and data processing"
    created_at: "2026-03-08"
    version: 1
    type: text
    priority: high
    dependencies: [about:background]
---

<Chunk id="about:background">
## About Me

I'm a backend engineer focused on APIs and distributed systems.
</Chunk>

<Chunk id="skill:python">
## Python

I use Python for REST APIs and internal tools.

### Key Libraries
- FastAPI for web services
- Pandas for data processing
</Chunk>

GraphRAG Guidance

When preparing documents for graph-oriented retrieval:

Treat entities as the canonical node inventory
Treat chunk entities arrays as chunk-to-entity edges
Treat dependencies as chunk-to-chunk prerequisite links
Keep tags broad and entities precise

Do not rely on tags alone when the goal is entity-based navigation.
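
Read this way, the frontmatter already encodes a small graph. The sketch below derives both edge sets from parsed frontmatter; the edge labels `mentions` and `depends_on` are hypothetical names chosen for this example, not part of the format.

```python
def build_edges(frontmatter: dict) -> dict[str, list[tuple[str, str]]]:
    """Derive chunk-to-entity and chunk-to-chunk edges from parsed frontmatter."""
    edges = {"mentions": [], "depends_on": []}
    for chunk in frontmatter.get("chunks", []):
        for entity_id in chunk.get("entities", []):
            edges["mentions"].append((chunk["id"], entity_id))
        for dep_id in chunk.get("dependencies", []):
            edges["depends_on"].append((chunk["id"], dep_id))
    return edges
```

Tags are deliberately left out of the edge lists, since they serve filtering rather than navigation.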

When Not to Extract Many Entities

Be conservative when:

the document is very short
the document is mostly narrative with few domain terms
the same concept is repeated with no useful distinctions
the source is noisy and entity extraction would create junk nodes

In these cases, return a smaller, cleaner entity set.

Validation

After conversion, verify:

Every <Chunk> ID exists in chunks
Every chunks[].entities reference exists in entities
Entity IDs are unique and canonical
Chunk IDs are unique
No nested chunks exist
Context fields differentiate similar chunks
Generic or noisy entities have been removed

python3 scripts/validate_chunks.py path/to/document.md
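
For orientation, the structural checks in the list above can be expressed in a few lines. This is a minimal sketch under the assumption that the frontmatter has already been parsed into a dict and the body is available as a string; it is not the implementation of scripts/validate_chunks.py, and it does not cover the judgment-based checks (context quality, noisy entities).

```python
import re
from collections import Counter


def validate_document(frontmatter: dict, body: str) -> list[str]:
    """Check ID uniqueness, cross-references, and nesting; return a list of error strings."""
    errors = []
    entity_ids = [e["id"] for e in frontmatter.get("entities", [])]
    chunk_ids = [c["id"] for c in frontmatter.get("chunks", [])]

    for kind, ids in (("entity", entity_ids), ("chunk", chunk_ids)):
        for dup, n in Counter(ids).items():
            if n > 1:
                errors.append(f"duplicate {kind} ID: {dup}")

    for chunk in frontmatter.get("chunks", []):
        for ref in chunk.get("entities", []):
            if ref not in entity_ids:
                errors.append(f"{chunk['id']} references unknown entity {ref}")

    # Walk opening and closing tags in order to detect nesting and unknown IDs.
    depth = 0
    for match in re.finditer(r'<Chunk id="([^"]+)">|</Chunk>', body):
        if match.group(1) is None:        # closing tag
            depth = max(depth - 1, 0)
            continue
        if depth > 0:
            errors.append(f"nested chunk: {match.group(1)}")
        if match.group(1) not in chunk_ids:
            errors.append(f"<Chunk> ID not in frontmatter: {match.group(1)}")
        depth += 1
    return errors
```

An empty result means the structural rules hold.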

Diagnostics

When the user asks to diagnose, review both chunk quality and entity quality.

Check categories

Chunk similarity
- Flag chunks that are too similar to distinguish during retrieval
Context quality
- Flag vague, repetitive, or overly short context fields
Tag overlap
- Flag tags that appear everywhere and no longer help filtering
Entity noise
- Flag generic or low-value entities
Entity duplication
- Flag aliases that should be merged into one canonical node
Chunk-to-entity linking
- Flag chunks with missing, excessive, or irrelevant entity links
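
Some of these checks lend themselves to simple heuristics. As one example, the sketch below flags tags that appear in most chunks and therefore no longer help filtering. The 0.8 threshold is an arbitrary assumption, `overused_tags` is a hypothetical helper, and scripts/diagnose_chunks.py may use a different method.

```python
from collections import Counter


def overused_tags(frontmatter: dict, threshold: float = 0.8) -> list[str]:
    """Return tags that appear in at least `threshold` of all chunks, sorted alphabetically."""
    chunks = frontmatter.get("chunks", [])
    if len(chunks) < 2:
        return []                         # overlap is meaningless for a single chunk
    # Count each tag at most once per chunk.
    counts = Counter(tag for c in chunks for tag in set(c.get("tags", [])))
    return sorted(tag for tag, n in counts.items() if n / len(chunks) >= threshold)
```

On very small documents any shared tag crosses the threshold, so the result is most useful once a document has more than a handful of chunks.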

Diagnostic prompts

Diagnose only:

Check this document for chunk and entity quality issues

Diagnose with suggestions:

Diagnose this document's chunk and entity structure and suggest fixes

Fix-oriented review:

Rewrite the frontmatter so the entities are canonical and each chunk links only to the right entities

Resources

Chunk syntax details: See references/chunk-syntax.md
Entity format reference: See references/entity-format.md
Semantic chunking theory: See references/semantic-chunking.md
Examples: See assets/examples/ for before/after samples
