Ingestion

InitRunner's ingestion pipeline extracts text from source files, splits it into chunks, generates embeddings, and stores vectors in a local SQLite database. Once ingested, an agent can search documents at runtime via the auto-registered search_documents tool.

Quick Start

apiVersion: initrunner/v1
kind: Agent
metadata:
  name: kb-agent
  description: Knowledge base agent
spec:
  role: |
    You are a knowledge assistant. Use search_documents to find relevant
    content before answering. Always cite your sources.
  model:
    provider: openai
    name: gpt-4o-mini
  ingest:
    sources:
      - "./docs/**/*.md"
      - "./knowledge-base/**/*.txt"
    chunking:
      strategy: fixed
      chunk_size: 512
      chunk_overlap: 50

# Ingest documents
initrunner ingest role.yaml

# Run the agent (search_documents is auto-registered)
initrunner run role.yaml -p "What does the onboarding guide say?"

Walkthrough: Build a Knowledge Base Agent

This walkthrough builds a complete RAG agent from scratch — set up docs, configure the agent, ingest, and query.

1. Set up your docs directory

mkdir -p docs
# Add your markdown files to ./docs/

2. Create the agent

apiVersion: initrunner/v1
kind: Agent
metadata:
  name: rag-agent
  description: Knowledge base Q&A agent with document ingestion
spec:
  role: |
    You are a helpful documentation assistant. You answer user questions
    using the ingested knowledge base.

    Rules:
    - ALWAYS call search_documents before answering a question
    - Base your answers only on information found in the documents
    - Cite the source document for each claim (e.g., "Per the Getting Started
      guide, ...")
    - If search_documents returns no relevant results, say so honestly rather
      than guessing
    - When a user asks about a topic covered across multiple documents,
      synthesize the information and cite all relevant sources
    - Use read_file to view a full document when the search snippet is not
      enough context
  model:
    provider: openai
    name: gpt-4o-mini
    temperature: 0.1
  ingest:
    sources:
      - ./docs/**/*.md
    chunking:
      strategy: paragraph
      chunk_size: 512
      chunk_overlap: 50
    embeddings:
      provider: openai
      model: text-embedding-3-small
  tools:
    - type: filesystem
      root_path: ./docs
      read_only: true
      allowed_extensions:
        - .md
  guardrails:
    max_tokens_per_run: 30000
    max_tool_calls: 15
    timeout_seconds: 120

Why paragraph chunking? It splits on double newlines first, then merges small paragraphs until chunk_size is reached. This preserves natural document structure — a paragraph about "installation" stays together instead of being split mid-sentence. Use fixed for code files and logs where structure doesn't matter.

3. Ingest the documents

initrunner ingest rag-agent.yaml

Resolving sources...
  ./docs/**/*.md → 4 files
Extracting text...
  docs/getting-started.md (2,847 chars)
  docs/faq.md (3,214 chars)
  docs/api-reference.md (5,102 chars)
  docs/changelog.md (1,456 chars)
Chunking (paragraph, size=512, overlap=50)...
  → 28 chunks
Embedding with openai:text-embedding-3-small...
  → 28 embeddings
Stored in ~/.initrunner/stores/rag-agent.db

4. Query the agent

initrunner run rag-agent.yaml -p "How do I create a database?"

The agent calls search_documents("create database"), gets matching chunks with source file names and similarity scores, then answers with citations.

5. Re-index when docs change

# Safe to re-run — deletes old chunks and re-inserts
initrunner ingest rag-agent.yaml

See the Examples page for the complete RAG agent with sample docs.

Pipeline

Resolve sources — Glob patterns are expanded into file paths relative to the role file's directory.
Extract text — Each file is passed through a format-specific extractor.
Chunk text — Extracted text is split into overlapping chunks.
Embed — Chunks are converted to vector embeddings.
Store — Embeddings and text are stored in SQLite backed by sqlite-vec.

Configuration

Field	Type	Default	Description
`sources`	`list[str]`	(required)	Glob patterns for source files
`watch`	`bool`	`false`	Reserved for future use
`chunking.strategy`	`str`	`"fixed"`	`"fixed"` or `"paragraph"`
`chunking.chunk_size`	`int`	`512`	Maximum chunk size in characters
`chunking.chunk_overlap`	`int`	`50`	Overlapping characters between chunks
`embeddings.provider`	`str`	`""`	Embedding provider (empty = derives from model)
`embeddings.model`	`str`	`""`	Embedding model (empty = provider default)
`store_backend`	`str`	`"sqlite_vec"`	Vector store backend
`store_path`	`str \| null`	`null`	Custom path (default: `~/.initrunner/stores/<agent-name>.db`)

Extension	Extractor
`.txt`	Plain text (UTF-8)
`.md`	Plain text (UTF-8)
`.rst`	Plain text (UTF-8)
`.csv`	CSV rows joined with commas and newlines
`.json`	Pretty-printed JSON
`.html`, `.htm`	HTML to Markdown (scripts/styles removed)

Optional Formats (`pip install initrunner[ingest]`)

Extension	Extractor	Library
`.pdf`	PDF to Markdown	`pymupdf4llm`
`.docx`	Paragraphs joined with double newlines	`python-docx`
`.xlsx`	Sheets as CSV with title headers	`openpyxl`

The `search_documents` Tool

When spec.ingest is configured, a search tool is auto-registered:

search_documents(query: str, top_k: int = 5) -> str

Creates an embedding from the query.
Searches the vector store for the most similar chunks.
Returns results with source attribution and similarity scores.

If no documents have been ingested, the tool returns a message directing you to run initrunner ingest.

Re-indexing

Running initrunner ingest again is safe and idempotent:

Resolves glob patterns to find current files.
Deletes all existing chunks from each source file.
Inserts new chunks from fresh extraction.

Files that no longer match the patterns keep their old chunks.

Embedding Models

Provider resolution priority:

ingest.embeddings.model — if set, used directly
ingest.embeddings.provider — used to look up the default
spec.model.provider — falls back to agent's model provider

Provider	Default Embedding Model
`openai`	`openai:text-embedding-3-small`
`anthropic`	`openai:text-embedding-3-small`
`google`	`google:text-embedding-004`

Scaffold

initrunner init --name kb-agent --template rag