InitRunner

Ingestion

InitRunner's ingestion pipeline extracts text from source files, splits it into chunks, generates embeddings, and stores vectors in a local SQLite database. Once ingested, an agent can search documents at runtime via the auto-registered search_documents tool.

Quick Start

apiVersion: initrunner/v1
kind: Agent
metadata:
  name: kb-agent
  description: Knowledge base agent
spec:
  role: |
    You are a knowledge assistant. Use search_documents to find relevant
    content before answering. Always cite your sources.
  model:
    provider: openai
    name: gpt-4o-mini
  ingest:
    sources:
      - "./docs/**/*.md"
      - "./knowledge-base/**/*.txt"
    chunking:
      strategy: fixed
      chunk_size: 512
      chunk_overlap: 50

Save the manifest as role.yaml, then:

# Ingest documents
initrunner ingest role.yaml

# Run the agent (search_documents is auto-registered)
initrunner run role.yaml -p "What does the onboarding guide say?"

Walkthrough: Build a Knowledge Base Agent

This walkthrough builds a complete RAG agent from scratch: set up a docs directory, configure the agent, ingest the documents, and query them.

1. Set up your docs directory

mkdir -p docs
# Add your markdown files to ./docs/

2. Create the agent

Save the following manifest as rag-agent.yaml:

apiVersion: initrunner/v1
kind: Agent
metadata:
  name: rag-agent
  description: Knowledge base Q&A agent with document ingestion
spec:
  role: |
    You are a helpful documentation assistant. You answer user questions
    using the ingested knowledge base.

    Rules:
    - ALWAYS call search_documents before answering a question
    - Base your answers only on information found in the documents
    - Cite the source document for each claim (e.g., "Per the Getting Started
      guide, ...")
    - If search_documents returns no relevant results, say so honestly rather
      than guessing
    - When a user asks about a topic covered across multiple documents,
      synthesize the information and cite all relevant sources
    - Use read_file to view a full document when the search snippet is not
      enough context
  model:
    provider: openai
    name: gpt-4o-mini
    temperature: 0.1
  ingest:
    sources:
      - ./docs/**/*.md
    chunking:
      strategy: paragraph
      chunk_size: 512
      chunk_overlap: 50
    embeddings:
      provider: openai
      model: text-embedding-3-small
  tools:
    - type: filesystem
      root_path: ./docs
      read_only: true
      allowed_extensions:
        - .md
  guardrails:
    max_tokens_per_run: 30000
    max_tool_calls: 15
    timeout_seconds: 120

Why paragraph chunking? It splits on double newlines first, then merges small paragraphs until chunk_size is reached. This preserves natural document structure — a paragraph about "installation" stays together instead of being split mid-sentence. Use fixed for code files and logs where structure doesn't matter.
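In essence, the merge loop looks something like this (a minimal sketch; overlap handling is omitted and InitRunner's actual implementation may differ):

```python
# Illustrative paragraph-chunking sketch -- not InitRunner's internal code.
def paragraph_chunks(text: str, chunk_size: int = 512) -> list[str]:
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):               # split on double newlines first
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)                # flush before exceeding chunk_size
            current = para                        # oversized paragraphs stay whole here
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```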

3. Ingest the documents

initrunner ingest rag-agent.yaml
Resolving sources...
  ./docs/**/*.md → 4 files
Extracting text...
  docs/getting-started.md (2,847 chars)
  docs/faq.md (3,214 chars)
  docs/api-reference.md (5,102 chars)
  docs/changelog.md (1,456 chars)
Chunking (paragraph, size=512, overlap=50)...
  → 28 chunks
Embedding with openai:text-embedding-3-small...
  → 28 embeddings
Stored in ~/.initrunner/stores/rag-agent.db

4. Query the agent

initrunner run rag-agent.yaml -p "How do I create a database?"

The agent calls search_documents("create database"), gets matching chunks with source file names and similarity scores, then answers with citations.

5. Re-index when docs change

# Safe to re-run — deletes old chunks and re-inserts
initrunner ingest rag-agent.yaml

See the Examples page for the complete RAG agent with sample docs.

Pipeline

  1. Resolve sources — Glob patterns are expanded into file paths relative to the role file's directory.
  2. Extract text — Each file is passed through a format-specific extractor.
  3. Chunk text — Extracted text is split into overlapping chunks.
  4. Embed — Chunks are converted to vector embeddings.
  5. Store — Embeddings and text are stored in SQLite backed by sqlite-vec.
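As a rough mental model, the whole pipeline compresses to a few lines. This is a self-contained toy whose stages mirror the list above; none of it is InitRunner's internal API, and the hash-based "embedding" is only a stand-in:

```python
import glob
import hashlib
from pathlib import Path

def toy_embed(chunk: str, dim: int = 8) -> list[float]:
    # Stand-in for a real embedding call such as text-embedding-3-small.
    return [b / 255 for b in hashlib.sha256(chunk.encode()).digest()[:dim]]

store: dict[str, list[tuple[str, list[float]]]] = {}
size, overlap = 512, 50

for path in glob.glob("./docs/**/*.md", recursive=True):      # 1. resolve sources
    text = Path(path).read_text(encoding="utf-8")             # 2. extract text
    chunks = [text[i:i + size]                                # 3. chunk text
              for i in range(0, max(len(text), 1), size - overlap)]
    vectors = [toy_embed(c) for c in chunks]                  # 4. embed
    store[path] = list(zip(chunks, vectors))                  # 5. store (in memory here)
```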

Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| sources | list[str] | (required) | Glob patterns for source files |
| watch | bool | false | Reserved for future use |
| chunking.strategy | str | "fixed" | "fixed" or "paragraph" |
| chunking.chunk_size | int | 512 | Maximum chunk size in characters |
| chunking.chunk_overlap | int | 50 | Overlapping characters between chunks |
| embeddings.provider | str | "" | Embedding provider (empty = derived from the model provider) |
| embeddings.model | str | "" | Embedding model (empty = provider default) |
| store_backend | str | "sqlite_vec" | Vector store backend |
| store_path | str \| null | null | Custom path (default: ~/.initrunner/stores/<agent-name>.db) |
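For reference, here is an ingest block with every configurable field spelled out (values are illustrative, not recommendations):

```yaml
ingest:
  sources:
    - "./docs/**/*.md"
  chunking:
    strategy: paragraph
    chunk_size: 512
    chunk_overlap: 50
  embeddings:
    provider: openai
    model: text-embedding-3-small
  store_backend: sqlite_vec
  store_path: ./my-agent.db   # overrides the default ~/.initrunner/stores/ location
```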

Chunking Strategies

Fixed (strategy: fixed)

Splits text into fixed-size character windows with overlap. Best for uniform document types, code files, and logs.
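Conceptually (a minimal sketch, not the actual implementation):

```python
# Fixed-size windows whose start positions are `size - overlap` characters apart.
def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    return [text[i:i + size] for i in range(0, max(len(text), 1), size - overlap)]
```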

Paragraph (strategy: paragraph)

Splits on double newlines first, then merges small paragraphs until chunk_size is reached. Preserves natural document structure. Best for prose, markdown, and documentation.

Supported File Formats

Core Formats (always available)

| Extension | Extractor |
|---|---|
| .txt | Plain text (UTF-8) |
| .md | Plain text (UTF-8) |
| .rst | Plain text (UTF-8) |
| .csv | CSV rows joined with commas and newlines |
| .json | Pretty-printed JSON |
| .html, .htm | HTML to Markdown (scripts/styles removed) |

Optional Formats (pip install initrunner[ingest])

| Extension | Extractor | Library |
|---|---|---|
| .pdf | PDF to Markdown | pymupdf4llm |
| .docx | Paragraphs joined with double newlines | python-docx |
| .xlsx | Sheets as CSV with title headers | openpyxl |
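Each optional format maps roughly to one library call. A hedged sketch of how the three might be invoked (the wrapper functions are hypothetical, but the library calls are real):

```python
def extract_pdf(path: str) -> str:
    import pymupdf4llm                        # pip install initrunner[ingest]
    return pymupdf4llm.to_markdown(path)      # PDF -> Markdown

def extract_docx(path: str) -> str:
    from docx import Document
    return "\n\n".join(p.text for p in Document(path).paragraphs)

def extract_xlsx(path: str) -> str:
    from openpyxl import load_workbook
    parts = []
    for ws in load_workbook(path, read_only=True).worksheets:
        rows = (",".join("" if c is None else str(c) for c in row)
                for row in ws.iter_rows(values_only=True))
        parts.append(f"# {ws.title}\n" + "\n".join(rows))  # sheet title as header
    return "\n\n".join(parts)
```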

The search_documents Tool

When spec.ingest is configured, a search tool is auto-registered:

search_documents(query: str, top_k: int = 5) -> str
  • Creates an embedding from the query.
  • Searches the vector store for the most similar chunks.
  • Returns results with source attribution and similarity scores.

If no documents have been ingested, the tool returns a message directing you to run initrunner ingest.
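Conceptually, the tool performs a nearest-neighbor search over the stored vectors. Here is a toy version using cosine similarity (illustrative only; in InitRunner the sqlite-vec store does this work):

```python
import numpy as np

def toy_search(query_vec: np.ndarray, chunks: list[str],
               vectors: np.ndarray, top_k: int = 5) -> list[tuple[str, float]]:
    # Cosine similarity between the query and every stored chunk vector.
    sims = (vectors @ query_vec) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec))
    best = np.argsort(sims)[::-1][:top_k]     # highest similarity first
    return [(chunks[i], float(sims[i])) for i in best]
```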

Re-indexing

Running initrunner ingest again is safe and idempotent:

  1. Resolves glob patterns to find current files.
  2. Deletes all existing chunks from each source file.
  3. Inserts new chunks from fresh extraction.

Files that no longer match the patterns keep their old chunks.
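The per-file upsert boils down to a delete followed by an insert. In SQL terms, something like this hypothetical sketch (the real schema, including the sqlite-vec index, is internal to InitRunner):

```python
import sqlite3

def reindex_file(db: sqlite3.Connection, source: str, chunks: list[str]) -> None:
    # Toy schema for illustration only.
    db.execute("CREATE TABLE IF NOT EXISTS chunks (source TEXT, text TEXT)")
    db.execute("DELETE FROM chunks WHERE source = ?", (source,))   # 2. drop stale chunks
    db.executemany("INSERT INTO chunks (source, text) VALUES (?, ?)",
                   [(source, c) for c in chunks])                  # 3. insert fresh ones
    db.commit()
```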

Embedding Models

Provider resolution priority:

  1. ingest.embeddings.model — if set, used directly
  2. ingest.embeddings.provider — used to look up the default
  3. spec.model.provider — falls back to agent's model provider

| Provider | Default Embedding Model |
|---|---|
| openai | openai:text-embedding-3-small |
| anthropic | openai:text-embedding-3-small |
| google | google:text-embedding-004 |
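Spelled out as code, the fallback chain reads as follows (a sketch; the function name and signature are illustrative):

```python
DEFAULT_EMBEDDINGS = {
    "openai": "openai:text-embedding-3-small",
    "anthropic": "openai:text-embedding-3-small",  # no native default; uses OpenAI's
    "google": "google:text-embedding-004",
}

def resolve_embedding_model(embeddings_model: str = "",
                            embeddings_provider: str = "",
                            agent_provider: str = "openai") -> str:
    if embeddings_model:                               # 1. explicit model wins
        return embeddings_model
    if embeddings_provider:                            # 2. provider's default model
        return DEFAULT_EMBEDDINGS[embeddings_provider]
    return DEFAULT_EMBEDDINGS[agent_provider]          # 3. agent's model provider
```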

Scaffold

initrunner init --name kb-agent --template rag
