# Ingestion

InitRunner's ingestion pipeline extracts text from source files, splits it into chunks, generates embeddings, and stores vectors in a local SQLite database. Once ingested, an agent can search documents at runtime via the auto-registered `search_documents` tool.

## Quick Start
```yaml
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: kb-agent
  description: Knowledge base agent
spec:
  role: |
    You are a knowledge assistant. Use search_documents to find relevant
    content before answering. Always cite your sources.
  model:
    provider: openai
    name: gpt-4o-mini
  ingest:
    sources:
      - "./docs/**/*.md"
      - "./knowledge-base/**/*.txt"
    chunking:
      strategy: fixed
      chunk_size: 512
      chunk_overlap: 50
```

```bash
# Ingest documents
initrunner ingest role.yaml

# Run the agent (search_documents is auto-registered)
initrunner run role.yaml -p "What does the onboarding guide say?"
```

## Walkthrough: Build a Knowledge Base Agent
This walkthrough builds a complete RAG agent from scratch — set up docs, configure the agent, ingest, and query.
### 1. Set up your docs directory

```bash
mkdir -p docs
# Add your markdown files to ./docs/
```

### 2. Create the agent
```yaml
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: rag-agent
  description: Knowledge base Q&A agent with document ingestion
spec:
  role: |
    You are a helpful documentation assistant. You answer user questions
    using the ingested knowledge base.

    Rules:
    - ALWAYS call search_documents before answering a question
    - Base your answers only on information found in the documents
    - Cite the source document for each claim (e.g., "Per the Getting Started
      guide, ...")
    - If search_documents returns no relevant results, say so honestly rather
      than guessing
    - When a user asks about a topic covered across multiple documents,
      synthesize the information and cite all relevant sources
    - Use read_file to view a full document when the search snippet is not
      enough context
  model:
    provider: openai
    name: gpt-4o-mini
    temperature: 0.1
  ingest:
    sources:
      - ./docs/**/*.md
    chunking:
      strategy: paragraph
      chunk_size: 512
      chunk_overlap: 50
    embeddings:
      provider: openai
      model: text-embedding-3-small
  tools:
    - type: filesystem
      root_path: ./docs
      read_only: true
      allowed_extensions:
        - .md
  guardrails:
    max_tokens_per_run: 30000
    max_tool_calls: 15
    timeout_seconds: 120
```

Why `paragraph` chunking? It splits on double newlines first, then merges small paragraphs until `chunk_size` is reached. This preserves natural document structure — a paragraph about "installation" stays together instead of being split mid-sentence. Use `fixed` for code files and logs where structure doesn't matter.
### 3. Ingest the documents
```bash
initrunner ingest rag-agent.yaml
```

```
Resolving sources...
  ./docs/**/*.md → 4 files
Extracting text...
  docs/getting-started.md (2,847 chars)
  docs/faq.md (3,214 chars)
  docs/api-reference.md (5,102 chars)
  docs/changelog.md (1,456 chars)
Chunking (paragraph, size=512, overlap=50)...
  → 28 chunks
Embedding with openai:text-embedding-3-small...
  → 28 embeddings
Stored in ~/.initrunner/stores/rag-agent.db
```

### 4. Query the agent

```bash
initrunner run rag-agent.yaml -p "How do I create a database?"
```

The agent calls `search_documents("create database")`, gets matching chunks with source file names and similarity scores, then answers with citations.

### 5. Re-index when docs change

```bash
# Safe to re-run — deletes old chunks and re-inserts
initrunner ingest rag-agent.yaml
```

See the Examples page for the complete RAG agent with sample docs.
## Pipeline
- Resolve sources — Glob patterns are expanded into file paths relative to the role file's directory.
- Extract text — Each file is passed through a format-specific extractor.
- Chunk text — Extracted text is split into overlapping chunks.
- Embed — Chunks are converted to vector embeddings.
- Store — Embeddings and text are stored in SQLite backed by sqlite-vec (see the sketch below).
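The storage step is easiest to picture in isolation. A minimal standalone sketch of the general sqlite-vec pattern, using a made-up 4-dimensional table (InitRunner's actual schema may differ):

```python
import sqlite3

import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)  # load the sqlite-vec extension
db.enable_load_extension(False)

# A vec0 virtual table stores one embedding per row.
db.execute("CREATE VIRTUAL TABLE chunks USING vec0(embedding float[4])")
db.execute(
    "INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
    (1, serialize_float32([0.1, 0.2, 0.3, 0.4])),
)

# KNN search: MATCH returns the rows nearest to the query vector.
rows = db.execute(
    "SELECT rowid, distance FROM chunks "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT 5",
    (serialize_float32([0.1, 0.2, 0.3, 0.4]),),
).fetchall()
print(rows)  # [(1, 0.0)]
```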
## Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| `sources` | list[str] | (required) | Glob patterns for source files |
| `watch` | bool | false | Reserved for future use |
| `chunking.strategy` | str | `"fixed"` | `"fixed"` or `"paragraph"` |
| `chunking.chunk_size` | int | 512 | Maximum chunk size in characters |
| `chunking.chunk_overlap` | int | 50 | Overlapping characters between chunks |
| `embeddings.provider` | str | `""` | Embedding provider (empty = derived from model) |
| `embeddings.model` | str | `""` | Embedding model (empty = provider default) |
| `store_backend` | str | `"sqlite_vec"` | Vector store backend |
| `store_path` | str \| null | null | Custom path (default: `~/.initrunner/stores/<agent-name>.db`) |
## Chunking Strategies

### Fixed (`strategy: fixed`)
Splits text into fixed-size character windows with overlap. Best for uniform document types, code files, and logs.
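The windowing is simple enough to show directly. An illustrative sketch (not InitRunner's code) of fixed-size chunking with overlap:

```python
def fixed_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    # Each window starts chunk_size - overlap characters after the previous
    # one, so with the defaults a new 512-char chunk begins every 462 chars.
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```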
### Paragraph (`strategy: paragraph`)

Splits on double newlines first, then merges small paragraphs until `chunk_size` is reached. Preserves natural document structure. Best for prose, markdown, and documentation.
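Again as an illustrative sketch of the merge rule described above (a paragraph longer than `chunk_size` becomes its own oversized chunk here; a real implementation would fall back to fixed splitting):

```python
def paragraph_chunks(text: str, chunk_size: int = 512) -> list[str]:
    """Merge paragraphs greedily until adding one would exceed chunk_size."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):  # split on double newlines first
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate  # small paragraph: merge into current chunk
        else:
            if current:
                chunks.append(current)
            current = para  # start a fresh chunk at the paragraph boundary
    if current:
        chunks.append(current)
    return chunks
```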
## Supported File Formats

### Core Formats (always available)

| Extension | Extractor |
|---|---|
| `.txt` | Plain text (UTF-8) |
| `.md` | Plain text (UTF-8) |
| `.rst` | Plain text (UTF-8) |
| `.csv` | CSV rows joined with commas and newlines |
| `.json` | Pretty-printed JSON |
| `.html`, `.htm` | HTML to Markdown (scripts/styles removed) |
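The HTML path is the only non-trivial core extractor, and it can be approximated with common libraries. A sketch assuming beautifulsoup4 and markdownify (InitRunner may use different libraries internally):

```python
from bs4 import BeautifulSoup
from markdownify import markdownify


def html_to_markdown(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop scripts and styles
        tag.decompose()
    return markdownify(str(soup))
```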
### Optional Formats (`pip install initrunner[ingest]`)

| Extension | Extractor | Library |
|---|---|---|
| `.pdf` | PDF to Markdown | pymupdf4llm |
| `.docx` | Paragraphs joined with double newlines | python-docx |
| `.xlsx` | Sheets as CSV with title headers | openpyxl |
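The optional extractors map onto the listed libraries' standard entry points. For example (file names are placeholders; this mirrors the table above, not InitRunner's exact code):

```python
import pymupdf4llm
from docx import Document

# PDF → Markdown in a single call.
md_text = pymupdf4llm.to_markdown("manual.pdf")

# DOCX → paragraphs joined with double newlines.
doc_text = "\n\n".join(p.text for p in Document("notes.docx").paragraphs)
```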
## The `search_documents` Tool

When `spec.ingest` is configured, a search tool is auto-registered:

```
search_documents(query: str, top_k: int = 5) -> str
```

- Creates an embedding from the query.
- Searches the vector store for the most similar chunks.
- Returns results with source attribution and similarity scores.

If no documents have been ingested, the tool returns a message directing you to run `initrunner ingest`.
## Re-indexing

Running `initrunner ingest` again is safe and idempotent:
- Resolves glob patterns to find current files.
- Deletes all existing chunks from each source file.
- Inserts new chunks from fresh extraction.
Files that no longer match the patterns keep their old chunks.
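A sketch of those delete-then-insert semantics against a hypothetical `chunks` table keyed by source path (illustrative, not InitRunner's actual schema):

```python
import sqlite3


def reindex_file(db: sqlite3.Connection, source: str, new_chunks: list[str]) -> None:
    # Re-ingesting a file replaces its chunks wholesale. Files that no longer
    # match the glob patterns are never visited, so their old chunks remain.
    db.execute("DELETE FROM chunks WHERE source = ?", (source,))
    db.executemany(
        "INSERT INTO chunks(source, text) VALUES (?, ?)",
        [(source, chunk) for chunk in new_chunks],
    )
    db.commit()
```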
## Embedding Models

Provider resolution priority:

1. `ingest.embeddings.model` — if set, used directly
2. `ingest.embeddings.provider` — used to look up the default
3. `spec.model.provider` — falls back to the agent's model provider
| Provider | Default Embedding Model |
|---|---|
| openai | `openai:text-embedding-3-small` |
| anthropic | `openai:text-embedding-3-small` |
| google | `google:text-embedding-004` |
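The priority chain amounts to a small fallback lookup. A hypothetical sketch (the function and parameter names are illustrative; the defaults follow the table above):

```python
DEFAULT_EMBEDDING_MODELS = {
    "openai": "openai:text-embedding-3-small",
    "anthropic": "openai:text-embedding-3-small",  # falls back to OpenAI embeddings
    "google": "google:text-embedding-004",
}


def resolve_embedding_model(embeddings_model: str, embeddings_provider: str,
                            agent_provider: str) -> str:
    if embeddings_model:  # 1. an explicit model always wins
        return embeddings_model
    # 2./3. otherwise resolve a provider and take its default model
    provider = embeddings_provider or agent_provider
    return DEFAULT_EMBEDDING_MODELS[provider]
```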
## Scaffold

```bash
initrunner init --name kb-agent --template rag
```