Book SFT Pipeline

A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.

When to Activate

Activate this skill when:

Building fine-tuning datasets from literary works
Creating author-voice or style-transfer models
Preparing training data for Tinker or similar SFT platforms
Designing text segmentation pipelines for long-form content
Training small models (8B parameters or fewer) on limited data

Core Concepts

The Three Pillars of Book SFT

1. Intelligent Segmentation: Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.

2. Diverse Instruction Generation: Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

3. Style Over Content: The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR AGENT                           │
│  Coordinates pipeline phases, manages state, handles failures   │
└──────────────────────┬──────────────────────────────────────────┘
                       │
       ┌───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │  INSTRUCTION │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │ Chunks →     │ │ Pairs →      │
│              │ │ 150-400 words│ │ Prompts      │ │ JSONL        │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                       │
       ┌───────────────┴───────────────┐
       ▼                               ▼
┌──────────────┐               ┌──────────────┐
│   TRAINING   │               │  VALIDATION  │
│    AGENT     │               │    AGENT     │
│ LoRA on      │               │ AI detector  │
│ Tinker       │               │ Originality  │
└──────────────┘               └──────────────┘

Phase 1: Text Extraction

Critical Rules

Always source ePub over PDF - OCR errors become learned patterns
Use paragraph-level extraction - Extract from <p> tags to preserve breaks
Remove front/back matter - Copyright and TOC pollute the dataset

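# Extract text from ePub paragraphs.
# Sketch using ebooklib (an assumption; swap in whichever ePub reader you prefer).
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = epub.read_epub(path)
    chapters = []
    # ITEM_DOCUMENT selects the XHTML content documents; they are usually
    # returned in reading order, but verify this for your book.
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        text = '\n\n'.join(p for p in paragraphs if p)
        if text:  # skip documents with no paragraph text (covers, image pages)
            chapters.append(text)
    return '\n\n'.join(chapters)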
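The "remove front/back matter" rule can be approximated with a simple filter applied to the per-chapter texts before they are joined. This is a minimal sketch, not the pipeline's actual method: the marker list, the `min_words` threshold, and the function names are all hypothetical and need tuning per book.

```python
import re

# Hypothetical marker list for publishing boilerplate; extend it per book.
BOILERPLATE_MARKERS = re.compile(
    r"all rights reserved|\bisbn\b|table of contents|library of congress",
    re.IGNORECASE,
)

def is_front_or_back_matter(chapter_text, min_words=200):
    """Heuristic: a short section that contains boilerplate markers is not prose."""
    word_count = len(chapter_text.split())
    has_marker = bool(BOILERPLATE_MARKERS.search(chapter_text))
    return has_marker and word_count < min_words

def keep_body_chapters(chapters):
    """Return only the chapters that do not look like front or back matter."""
    return [c for c in chapters if not is_front_or_back_matter(c)]
```

Long chapters that merely mention one of the markers are kept, because the word-count condition must also hold. Inspect whatever gets dropped by hand before trusting the filter.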

Phase 2: Intelligent Segmentation

Smaller Chunks + Overlap

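Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650 words). Each new chunk starts with the last paragraph of the previous one, which provides the overlap.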

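def segment(text, min_words=150, max_words=400):
    """Group paragraphs into chunks of roughly min_words to max_words words."""
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    chunks, buffer, buffer_words = [], [], 0

    for para in paragraphs:
        words = len(para.split())
        # Close the chunk only when this paragraph would overshoot max_words
        # and the buffer already meets min_words.
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words

    # Flush the remainder; it may be shorter than min_words.
    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks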

Expected Results

For an 86,000-word book:

Old method (250-650 words): ~150 chunks
New method (150-400 + overlap): ~300 chunks
With 2 variants per chunk: 600+ training examples
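To check these numbers against a real book, it helps to summarize the chunk list after segmentation. The helper below is a sketch; `chunk_stats` and its return keys are hypothetical names, and it only assumes that chunks are plain strings.

```python
from statistics import mean

def chunk_stats(chunks, variants=2):
    """Summarize chunk sizes and the number of training examples they yield."""
    counts = [len(c.split()) for c in chunks]
    if not counts:
        return {"chunks": 0, "min_words": 0, "max_words": 0,
                "mean_words": 0.0, "examples": 0}
    return {
        "chunks": len(counts),
        "min_words": min(counts),
        "max_words": max(counts),
        "mean_words": round(mean(counts), 1),
        # Each chunk is reused once per prompt variant.
        "examples": len(counts) * variants,
    }
```

With roughly 300 chunks and 2 variants per chunk, `examples` comes out at 600, matching the estimate above. A `max_words` far above 400 usually points to single oversized paragraphs.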

Phase 3: Diverse Instruction Generation

The Key Insight

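Using a single prompt template causes memorization. Diverse templates teach the underlying style. The lists below are a starting set: the template list shows 10 entries, so extend it to reach the recommended 15+.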

SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]

Instruction Generation

INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.

Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash).
# llm_call is a placeholder for your own client wrapper: prompt string in, response string out.
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
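Because instruction generation is the only phase that pays per call, it is worth making it resumable. The sketch below caches each description on disk keyed by a hash of the chunk text, so a rerun only calls the LLM for chunks it has not seen. `generate_instructions`, the cache layout, and the `llm_call` signature (prompt string in, response string out) are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

def generate_instructions(chunks, llm_call, prompt_template, cache_path):
    """Return one description per chunk, reusing cached results where present."""
    cache_file = Path(cache_path)
    cache = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))

    instructions = []
    for chunk in chunks:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = llm_call(prompt_template.format(text=chunk)).strip()
            # Persist after every call so an interrupted run loses nothing.
            cache_file.write_text(
                json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8"
            )
        instructions.append(cache[key])
    return instructions
```

Rewriting the whole cache file after each call is simple but slow for very large books; an append-only JSONL cache would be the natural next step.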

Phase 4: Dataset Construction

Message Format

{
    "messages": [
        {"role": "system", "content": "You are an expert creative writer..."},
        {"role": "user", "content": "Write in the style of Author: Scene description..."},
        {"role": "assistant", "content": "The actual book text from chunk..."}
    ]
}
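A small writer that validates each record before it reaches the JSONL file catches malformed examples early. This is a sketch under the assumption that every example uses exactly the system, user, assistant order shown above; `validate_example` and `write_jsonl` are hypothetical helper names.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_example(example):
    """Raise ValueError if the example does not match the message format."""
    messages = example.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")
    for m in messages:
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"empty content for role {m.get('role')}")

def write_jsonl(examples, path):
    """Write one validated JSON object per line and return the number written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            validate_example(example)
            # ensure_ascii=False keeps curly quotes and accents readable in the file.
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
            written += 1
    return written
```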

Multiple Variants Per Chunk

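def build_examples(chunk, instruction, author, variants=2):
    # chunk is expected to expose an integer `id` and the passage `text`
    examples = []
    for i in range(variants):
        # Offset both lookups by chunk.id so every system prompt and template
        # gets used across the dataset, not just the first `variants` entries
        system = SYSTEM_PROMPTS[(chunk.id + i) % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples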

Phase 5: LoRA Training on Tinker

Configuration

CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                      # 352MB adapter
    "learning_rate": 5e-4,                # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}

Why Base Model?

Use base (pretrained) models, not instruction-tuned versions:

Base models are more malleable for new styles
Instruct models have patterns that resist overwriting
Style is a low-level pattern that base models capture better

Training Loop

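import asyncio
import tinker
from tinker import types

async def train(batches):
    # Assumes Tinker credentials are already configured in the environment
    service_client = tinker.ServiceClient()
    training_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-8B-Base",
        rank=32
    )

    for epoch in range(3):
        for batch in batches:  # each batch: a list of Tinker Datum objects
            await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
            await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

    return await training_client.save_weights_for_sampler_async(name="final")

# result = asyncio.run(train(batches))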
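The loop iterates over `batches` without showing where they come from. The sketch below covers only the generic part: shuffling, grouping by `batch_size`, and estimating the number of optimizer steps. Converting each example into Tinker's datum format is not shown, and `make_batches` and `total_steps` are hypothetical helper names.

```python
import math
import random

def make_batches(examples, batch_size=4, seed=0):
    """Shuffle deterministically, then group into lists of up to batch_size items."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def total_steps(num_examples, batch_size=4, epochs=3):
    """Number of optimizer steps for the whole run (last partial batch included)."""
    return math.ceil(num_examples / batch_size) * epochs
```

With the configuration above (batch size 4, 3 epochs), 600 examples give 150 steps per epoch and 450 steps in total.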

Phase 6: Validation

Modern Scenario Test

Test with scenarios that couldn't exist in the original book:

TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]

If the model applies style markers to modern scenarios, it learned style, not content.
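Style markers are author-specific, so any automated check is only a rough proxy for reading the output. As a hedged sketch, a few surface statistics can be computed for both a source passage and a model output and compared side by side. The features chosen here (sentence length, vocabulary repetition, comma rate) and the name `style_features` are assumptions, not part of the original pipeline.

```python
import re

def style_features(text):
    """Compute a few surface statistics that loosely track prose style."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {"mean_sentence_len": 0.0, "type_token_ratio": 0.0, "comma_rate": 0.0}
    return {
        "mean_sentence_len": len(words) / len(sentences),
        # Lower values mean more repetition of the same words.
        "type_token_ratio": len(set(words)) / len(words),
        "comma_rate": text.count(",") / len(words),
    }
```

For a heavily repetitive author, a low `type_token_ratio` on the barista or text-message prompts would suggest the style carried over to new content.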

Originality Verification

# Search training data for output phrases (-F treats the phrase as a literal string)
grep -F "specific phrase from output" dataset.jsonl
# Expected: no output (grep exits with status 1 when nothing matches)
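Checking phrases one at a time does not scale to many outputs. A sketch of an automated variant: collect every word n-gram from the assistant messages in the dataset, then report which n-grams of a generated output also appear there. The choice of n=8 and the function names are assumptions; tokenization is plain whitespace splitting, so punctuation stays attached to words.

```python
import json

def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def load_training_ngrams(jsonl_path, n=8):
    """Collect n-grams from every assistant message in a messages-format JSONL file."""
    seen = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            example = json.loads(line)
            for message in example["messages"]:
                if message["role"] == "assistant":
                    seen |= ngrams(message["content"], n)
    return seen

def copied_ngrams(output, training_ngrams, n=8):
    """Return the n-grams of output that also occur in the training data."""
    return sorted(ngrams(output, n) & training_ngrams)
```

An empty result from `copied_ngrams` corresponds to the "no matches" outcome of the grep check. Short common phrases can still collide, which is why a fairly large n is used.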

AI Detector Testing

Test outputs with GPTZero, Pangram, or ZeroGPT.

Known Issues and Solutions

Character Name Leakage

Symptom: Model uses original character names in new scenarios.
Cause: Limited name diversity from one book.
Solution: Train on multiple books or add synthetic examples.

Model Parrots Exact Phrases

Symptom: Outputs contain exact sentences from training data.
Cause: Too few prompt variations or too many epochs.
Solution: Use 15+ templates, limit to 3 epochs.

Fragmented Outputs

Symptom: Sentences feel incomplete.
Cause: Poor segmentation breaking mid-thought.
Solution: Always break at paragraph boundaries.

Guidelines

Always source ePub over PDF - OCR errors become learned patterns
Never break mid-sentence - Boundaries must be grammatically complete
Use diverse prompts - 15+ templates, 5+ system prompts
Use base models - Not instruct versions
Use smaller chunks - 150-400 words for more examples
Reserve test set - 50 examples minimum
Test on modern scenarios - Proves style transfer vs memorization
Verify originality - Grep training data for output phrases
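The "reserve test set" guideline interacts with the multiple-variants design: if two variants of the same chunk land on opposite sides of the split, the test set contains text the model was trained on. A minimal sketch that splits by chunk instead of by example follows. `split_by_chunk` is a hypothetical helper, and it assumes the examples are grouped as one list per source chunk.

```python
import random

def split_by_chunk(examples_per_chunk, test_examples=50, seed=0):
    """Hold out whole chunks until the test set has at least test_examples items.

    examples_per_chunk: list of lists, one inner list per source chunk.
    """
    groups = list(examples_per_chunk)
    random.Random(seed).shuffle(groups)
    train, test = [], []
    for group in groups:
        # Fill the test set first, then send every remaining group to train.
        target = test if len(test) < test_examples else train
        target.extend(group)
    return train, test
```

Adjacent chunks still share one overlapping paragraph, so a stricter split would also drop the training chunks that neighbor each held-out chunk.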

Expected Results

Metric	Value
Training examples	500-1000 per book
Model	Qwen/Qwen3-8B-Base
LoRA rank	32
Adapter size	~350 MB
Training time	~15 min
Loss reduction	90%+
Style transfer success	~50% perfect

Cost Estimate

Component	Cost
LLM (instruction generation)	~$0.50
Tinker training (15 min)	~$1.50
Total	~$2.00

Integration with Context Engineering Skills

This example applies several skills from the Agent Skills for Context Engineering collection:

project-development

The pipeline follows the staged, idempotent architecture pattern:

Acquire: Extract text from ePub
Prepare: Segment into training chunks
Process: Generate synthetic instructions
Parse: Build message format
Render: Output Tinker-compatible JSONL
Train: LoRA fine-tuning
Validate: Modern scenario testing

Each phase is resumable and produces intermediate artifacts for debugging.

context-compression

Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity. Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.

The two-tier strategy mirrors context compression evaluation:

Tier 1: Fast, deterministic compression
Tier 2: LLM-assisted for edge cases

multi-agent-patterns

The pipeline uses the supervisor/orchestrator pattern:

Orchestrator coordinates phases and manages state
Specialized agents (Extraction, Segmentation, Instruction, Builder) have isolated contexts
Each agent receives only the information needed for its task

This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.

evaluation

Validation follows the end-state evaluation pattern:

Functional testing: Does output match expected style markers?
Originality verification: Is content genuinely generated?
External validation: AI detector scores

The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.

context-fundamentals

Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping. Diverse templates force attention across the style patterns themselves.

References

Internal references:

Segmentation Strategies - Text chunking patterns
Tinker Format Specification - Datum structure
Tinker API Documentation - Full API reference

Related skills from Agent Skills for Context Engineering:

project-development - Pipeline architecture patterns
context-compression - Compression strategies
multi-agent-patterns - Agent coordination
evaluation - Evaluation frameworks
context-fundamentals - Attention and information density

External resources:

Research Paper - Chakrabarty et al. 2025
Dataset on Hugging Face
Gertrude Stein Case Study - Complete working example

Skill Metadata

Created: 2025-12-26
Last Updated: 2025-12-28
Author: Muratcan Koylan
Version: 2.0.0
Standalone: Yes (separate from main context-engineering collection)
