Reasoning Primitives
InitRunner's reasoning system gives agents structured cognitive tools and execution strategies. Two orthogonal layers compose naturally: cognitive tools (think, todo, spawn) that the LLM uses voluntarily within a turn, and reasoning strategies (react, todo_driven, plan_execute, reflexion) that orchestrate behavior across turns in Autonomous Mode.
All reasoning tools are run-scoped. They are built fresh per-run with isolated state, never leaking across REPL/daemon sessions.
Quick Start
Minimal autonomous agent with structured reasoning:
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: planner
  description: Autonomous planner with structured reasoning
spec:
  role: |
    You are a senior project planner. Break tasks into structured
    todo lists, research each item, and synthesize findings.
  model:
    provider: openai
    name: gpt-5.4-mini-2026-03-17
  tools:
    - type: think
      critique: true
    - type: todo
      max_items: 20
    - type: search
      provider: duckduckgo
  reasoning:
    pattern: todo_driven
    auto_plan: true
  autonomy:
    max_plan_steps: 20
  guardrails:
    max_iterations: 15
    autonomous_token_budget: 100000

Run it:
initrunner run planner.yaml -a -p "Research the top 3 Python web frameworks and compare them."
The agent will:
- Create a structured todo list (batch_add_todos)
- Work through each item (get_next_todo, update_todo)
- Auto-complete when all items reach terminal status
Think Tool
Gives the agent a scratchpad that accumulates reasoning as a numbered chain. Unlike a plain "thought recorded" response, the agent sees its full reasoning history on every call, so it survives context trimming.
Configuration
tools:
  - type: think
    critique: true # nudge self-critique every 5th thought
    max_thoughts: 30 # ring buffer capacity (default: 50)

| Field | Type | Default | Description |
|---|---|---|---|
| critique | bool | false | Append self-critique nudge every 5th thought |
| max_thoughts | int | 50 | Ring buffer capacity (1-200) |
How it works
Each think(thought) call appends the thought and returns the full numbered chain:
Thoughts (3):
1. The user wants a CLI tool, so startup time matters
2. Python is faster to develop but Rust compiles to a single binary
3. Distribution is the key differentiator here
With critique: true, every 5th thought appends:
You have recorded 5 thoughts. Before proceeding, critically evaluate your reasoning. What assumptions might be wrong? What have you missed?
The ring buffer evicts the oldest thought when full, bounding token overhead to ~3500 tokens at 50 thoughts.
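The ring buffer and critique cadence described above can be sketched in a few lines. This is a minimal illustration, not InitRunner's implementation: the class name `ThinkScratchpad` is hypothetical, and renumbering the chain from 1 after an eviction is an assumption.

```python
from collections import deque

CRITIQUE_EVERY = 5  # cadence taken from the text above


class ThinkScratchpad:
    """Hypothetical stand-in for the think tool's run-scoped state."""

    def __init__(self, max_thoughts: int = 50, critique: bool = False) -> None:
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self._thoughts: deque = deque(maxlen=max_thoughts)
        self._total = 0  # thoughts recorded so far, including evicted ones
        self._critique = critique

    def think(self, thought: str) -> str:
        self._thoughts.append(thought)
        self._total += 1
        # Assumption: the visible chain is renumbered from 1 after eviction.
        lines = [f"Thoughts ({len(self._thoughts)}):"]
        lines += [f"{i}. {t}" for i, t in enumerate(self._thoughts, start=1)]
        if self._critique and self._total % CRITIQUE_EVERY == 0:
            lines.append(
                f"You have recorded {self._total} thoughts. Before proceeding, "
                "critically evaluate your reasoning."
            )
        return "\n".join(lines)
```

Because every call returns the whole chain, the cost per call grows with the buffer, which is why a smaller max_thoughts helps on tight token budgets.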
When to use
- Always add type: think for agents doing multi-step reasoning
- Enable critique for complex tasks where self-correction matters
- Reduce max_thoughts for agents with tight token budgets
Modes
The think tool works in both single-shot and autonomous mode. In single-shot, the agent can call it multiple times within one run. In autonomous mode, thoughts persist across iterations through the run-scoped state.
Todo Tool
Priority-aware task management with dependency resolution. Operates on the agent's unified ReflectionState, giving a single source of truth for progress.
Configuration
tools:
  - type: todo
    max_items: 30 # max concurrent items (default: 30)
    shared: false # sub-agent visibility (default: false)
    shared_path: "" # SQLite path (required when shared: true)

| Field | Type | Default | Description |
|---|---|---|---|
| max_items | int | 30 | Maximum concurrent items (1-100) |
| shared | bool | false | Back state with SQLite for sub-agent access |
| shared_path | str | "" | SQLite file path (required when shared: true) |
Tool functions exposed to the LLM
| Tool | Description |
|---|---|
| add_todo(description, priority?, depends_on?) | Create an item. Returns its 8-char ID and the full formatted list. |
| batch_add_todos(items) | Create multiple items at once. Items can depend on other items in the same batch by 0-based index ("0", "1", ...). |
| update_todo(id, status?, notes?, priority?) | Update fields on an existing item. Returns the full formatted list. |
| remove_todo(id) | Remove an item and clean up dangling dependency references. |
| list_todos(status_filter?) | Show all items, or filter by status. |
| get_next_todo() | Return the highest-priority pending item whose dependencies are all in terminal status. |
| finish_task(summary, status) | Explicitly signal task completion (completed/blocked/failed). |
Statuses
| Status | Terminal? | Icon | Description |
|---|---|---|---|
| pending | No | [ ] | Not started |
| in_progress | No | [>] | Currently being worked on |
| completed | Yes | [x] | Successfully finished |
| failed | Yes | [!] | Failed |
| skipped | Yes | [-] | Intentionally skipped |
Priority ordering
critical > high > medium > low. get_next_todo() returns the highest-priority pending item whose dependencies are all in terminal status.
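The selection rule above can be sketched as follows. The `Todo` dataclass, the default priority of medium, and breaking ties by creation order are assumptions made for illustration; only the priority ordering and the terminal statuses come from this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional

TERMINAL = {"completed", "failed", "skipped"}
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Todo:
    """Hypothetical item shape; the real schema may carry more fields."""
    id: str
    description: str
    priority: str = "medium"  # assumption: default priority
    status: str = "pending"
    depends_on: List[str] = field(default_factory=list)


def get_next_todo(items: List[Todo]) -> Optional[Todo]:
    by_id = {t.id: t for t in items}
    ready = [
        t for t in items
        if t.status == "pending"
        # Every known dependency must be completed, failed, or skipped.
        and all(by_id[d].status in TERMINAL for d in t.depends_on if d in by_id)
    ]
    # min() returns the first of equally ranked items, so list order breaks ties.
    return min(ready, key=lambda t: PRIORITY_RANK[t.priority], default=None)
```

Note that a failed or skipped dependency still unblocks its dependents, because the rule only asks for a terminal status, not a successful one.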
Dependencies
Items can depend on other items by ID. The agent specifies dependencies when creating items:
add_todo("Deploy to staging", priority="high", depends_on=["abc12345"])
Or in batch, using 0-based batch indices:
batch_add_todos([
    {"description": "Write tests", "priority": "high"},
    {"description": "Run tests", "depends_on": ["0"]},
    {"description": "Deploy", "depends_on": ["1"]}
])
Cycles are detected via Kahn's algorithm and rejected immediately. When an item is removed, dangling dependency references in other items are cleaned up.
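For readers unfamiliar with Kahn's algorithm, here is a minimal sketch of cycle detection over a dependency map. The function name `has_cycle` and the dict-of-lists input shape are illustrative assumptions, not the todo tool's internal representation.

```python
from collections import deque
from typing import Dict, List


def has_cycle(depends_on: Dict[str, List[str]]) -> bool:
    """Return True if the dependency graph contains a cycle (Kahn's algorithm)."""
    # In-degree here means: how many unresolved dependencies an item has.
    indegree = {node: 0 for node in depends_on}
    dependents: Dict[str, List[str]] = {node: [] for node in depends_on}
    for node, deps in depends_on.items():
        for dep in deps:
            if dep in depends_on:  # ignore references to unknown items
                indegree[node] += 1
                dependents[dep].append(node)

    # Start from items with no dependencies and peel the graph layer by layer.
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in dependents[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Items that were never freed are part of (or blocked behind) a cycle.
    return visited != len(depends_on)
```

Running this check on the proposed graph before committing a new item is one way to reject the cycle immediately, as described above.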
Auto-completion
When every item in the list reaches a terminal status (completed, failed, or skipped), the autonomous loop automatically signals completion. The agent does not need to call finish_task explicitly, though it can do so at any time to override.
Shared mode
When shared: true, the todo list is backed by SQLite with WAL mode for concurrent access. Sub-agents spawned via the delegate or spawn tool can read and update the same list.
tools:
  - type: todo
    shared: true
    shared_path: ./.initrunner/shared_todo.db

Spawn Tool
Non-blocking parallel agent execution. Spawn sub-agents as background tasks, poll for results, and await completion, all within a single agent run.
Configuration
tools:
  - type: spawn
    max_concurrent: 3 # parallel task limit (default: 4, max: 16)
    timeout_seconds: 120 # per-task timeout (default: 300)
    agents:
      - name: researcher
        role_file: ./agents/researcher.yaml
        description: Researches a specific topic
      - name: coder
        role_file: ./agents/coder.yaml
        description: Writes and reviews code
    shared_memory: # optional shared memory for sub-agents
      store_path: ./.initrunner/shared.db
      max_memories: 1000

| Field | Type | Default | Description |
|---|---|---|---|
| agents | list | required | Agent refs with name, role_file or url, and description |
| max_concurrent | int | 4 | Maximum parallel tasks (1-16) |
| max_depth | int | 3 | Maximum delegation depth |
| timeout_seconds | int | 300 | Per-task wall-clock timeout |
| shared_memory | object | null | Shared LanceDB memory config |
Each agent ref needs either role_file (inline execution via InlineInvoker) or url (remote execution via McpInvoker).
Since v2026.6.5, max_depth is enforced across spawned sub-agents: delegation depth rides on context variables and is re-seeded across the spawn pool's thread boundary, so a recursive spawn topology can no longer exceed the limit.
Tool functions exposed to the LLM
| Tool | Description |
|---|---|
| spawn_agent(agent_name, prompt) | Submit a background task. Returns immediately with a task_id. |
| poll_tasks(task_ids?) | Check status of specific tasks or all. Returns a formatted status table. |
| await_tasks(task_ids) | Block until all specified tasks complete. Returns their results. |
| await_any(task_ids) | Block until any one task completes. Returns its result. |
| cancel_task(task_id) | Cancel a running background task. |
How it works
The spawn pool maintains a private asyncio event loop in a daemon thread. When the agent calls spawn_agent, the task is submitted via asyncio.run_coroutine_threadsafe(). The underlying invokers (InlineInvoker for local agents, McpInvoker for remote) run via asyncio.to_thread().
Task statuses: running, completed, failed, timeout.
The pool is cleaned up when the run ends. Remaining tasks are cancelled and the event loop is stopped.
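The threading model described above can be sketched with the standard library alone. `SpawnPool` and `fake_invoker` are hypothetical names for this illustration; the real pool also tracks task IDs, statuses, and concurrency limits, which are left out here.

```python
import asyncio
import threading
import time
from concurrent.futures import Future


class SpawnPool:
    """Hypothetical miniature: one private event loop running in a daemon thread."""

    def __init__(self, timeout_seconds: float = 300) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()
        self._timeout = timeout_seconds

    async def _run(self, invoker, prompt: str) -> str:
        # The blocking invoker runs in a worker thread. wait_for bounds how long
        # we wait for it; the worker thread itself cannot be interrupted.
        return await asyncio.wait_for(asyncio.to_thread(invoker, prompt), self._timeout)

    def spawn(self, invoker, prompt: str) -> Future:
        # Returns immediately with a concurrent.futures.Future.
        return asyncio.run_coroutine_threadsafe(self._run(invoker, prompt), self._loop)

    def close(self) -> None:
        # Stop the loop from the calling thread, then release its resources.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


def fake_invoker(prompt: str) -> str:
    time.sleep(0.05)  # stands in for a sub-agent run
    return f"result for: {prompt}"


pool = SpawnPool(timeout_seconds=5)
task_a = pool.spawn(fake_invoker, "Python adoption")
task_b = pool.spawn(fake_invoker, "Rust adoption")
results = [task_a.result(), task_b.result()]  # blocks, like await_tasks
pool.close()
```

Keeping the loop in its own thread is what lets a synchronous tool call return a handle right away while the sub-agents keep running in the background.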
Typical usage pattern
1. spawn_agent("researcher", "Find stats on Python adoption") -> task_a
2. spawn_agent("researcher", "Find stats on Rust adoption") -> task_b
3. await_tasks([task_a, task_b]) -> results
4. Synthesize results into final answer
Native Extended Thinking
Some reasoning-capable OpenAI models run an internal reasoning pass before they answer. spec.model.thinking turns that pass on and sets how much effort it gets. The value maps directly onto PydanticAI's ModelSettings['thinking'], so the same effort level you set here is what reaches the provider.
spec:
  model:
    provider: openai
    name: o3-mini
    thinking: high

Leave thinking unset to use the provider default. Set it to the YAML boolean false (not the string "false") to explicitly disable thinking on a model that would otherwise reason:
spec:
  model:
    provider: openai
    name: o3-mini
    thinking: false # YAML boolean, disables the internal reasoning pass

| Value | Effect |
|---|---|
| minimal | Smallest, fastest, cheapest reasoning budget |
| low | Light reasoning |
| medium | Balanced reasoning |
| high | Heavy reasoning for hard problems |
| xhigh | Maximum reasoning budget |
| false | Explicitly disable the internal reasoning pass |
Thinking is configured in YAML only. There is no --thinking CLI flag.
Supported models
thinking is only valid on reasoning-capable OpenAI models: the o-series (any model name starting with o) and the gpt-5 family, excluding any gpt-5-chat name. Setting it on any other provider or model raises a load-time error, so a misconfiguration surfaces before the agent runs rather than mid-run:
thinking is only supported on reasoning-capable OpenAI models
(the o-series and the gpt-5 family), not '{provider}:{name}'.
Remove the thinking field or switch to a supported model.
The newer gpt-5.1 and gpt-5.2 models accept thinking as well, since they are part of the gpt-5 family.
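The supported-model rule above amounts to a simple name check. The sketch below is an illustration under stated assumptions: the function names are hypothetical, `ValueError` stands in for whatever error the loader actually raises, and an explicit false is treated the same as any other set value.

```python
def supports_thinking(provider: str, name: str) -> bool:
    """Mirror the rule in the text: OpenAI o-series or gpt-5 family, minus gpt-5-chat."""
    if provider != "openai":
        return False
    if name.startswith("gpt-5-chat"):
        return False
    return name.startswith("o") or name.startswith("gpt-5")


def check_thinking(provider: str, name: str, thinking) -> None:
    # None means the field was left unset, which is always allowed.
    if thinking is None:
        return
    if not supports_thinking(provider, name):
        # Assumption: ValueError is a stand-in for the loader's real error type.
        raise ValueError(
            "thinking is only supported on reasoning-capable OpenAI models "
            f"(the o-series and the gpt-5 family), not '{provider}:{name}'. "
            "Remove the thinking field or switch to a supported model."
        )
```

Running the check when the role file loads, rather than at request time, is what makes the misconfiguration visible before any tokens are spent.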
Per-persona and per-flow-agent overrides
There is no standalone per-persona thinking field. A team persona overrides thinking by giving the persona its own model: block, which takes precedence over the team's spec.model. Put the thinking value inside that block:
personas:
  - name: reviewer
    model:
      provider: openai
      name: o3-mini
      thinking: high

See Team Mode for how personas resolve their model.
A flow agent references a role by name and has no inline model block, so it overrides thinking through that role's own YAML (spec.model.thinking). See Flow for how flow agents bind to roles.
The --model CLI flag swaps provider:model (or an alias) but preserves the thinking value set in YAML, so overriding the model on the command line keeps your configured effort level.
Relationship to spec.reasoning and the Thinking capability
model.thinking and spec.reasoning are orthogonal. model.thinking controls model-level effort inside a single request. spec.reasoning (react, todo_driven, plan_execute, reflexion) orchestrates behavior across turns in Autonomous Mode. They compose freely. A todo_driven agent can run with thinking: high, getting heavy per-request reasoning on top of cross-turn todo orchestration.
PydanticAI's Thinking capability (under spec.capabilities) reaches the same model setting at the capability layer. Prefer model.thinking. Declaring both a Thinking capability and model.thinking logs a warning advising you to keep model.thinking, and declaring a Thinking capability alongside spec.reasoning logs a warning that the two are orthogonal. The loader does not error in either case.
Cost and token usage
Every run records thinking_tokens separately on its RunResult, and the audit log persists a thinking_tokens column alongside tokens_in, tokens_out, and total_tokens. Existing audit databases gain the column automatically on first open, and rows written before the migration report 0.
This lets you see how much of a run's cost went to internal reasoning versus the visible answer, so you can tune effort down when the extra reasoning is not paying off. See Cost Tracking for how token counts roll up into spend.
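The automatic column migration mentioned above follows a common SQLite pattern, sketched here against an in-memory database. The table name `audit_log` and the `run_id` column are hypothetical; only the four token column names come from this page.

```python
import sqlite3


def ensure_thinking_tokens_column(conn: sqlite3.Connection, table: str = "audit_log") -> None:
    """Add thinking_tokens if it is missing; safe to call on every open."""
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "thinking_tokens" not in columns:
        # DEFAULT 0 makes rows written before the migration report 0.
        conn.execute(
            f"ALTER TABLE {table} ADD COLUMN thinking_tokens INTEGER NOT NULL DEFAULT 0"
        )
        conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (run_id TEXT, tokens_in INTEGER, "
    "tokens_out INTEGER, total_tokens INTEGER)"
)
conn.execute("INSERT INTO audit_log VALUES ('old-run', 100, 50, 150)")

ensure_thinking_tokens_column(conn)
ensure_thinking_tokens_column(conn)  # second call is a no-op

conn.execute("INSERT INTO audit_log VALUES ('new-run', 100, 80, 180, 30)")
rows = conn.execute(
    "SELECT run_id, thinking_tokens FROM audit_log ORDER BY rowid"
).fetchall()
```

With the column in place, a query that sums thinking_tokens against total_tokens per run gives the share of usage that went to internal reasoning.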
Reasoning Strategies
The spec.reasoning config controls how the autonomous runner orchestrates agent behavior across turns. Strategies operate in Autonomous Mode only (-a flag).
Configuration
spec:
  reasoning:
    pattern: todo_driven # react | todo_driven | plan_execute | reflexion
    auto_plan: true # prepend planning instructions to first turn
    reflection_rounds: 0 # post-completion self-critique rounds, 0-3 (reflexion only)
    success_criteria: # criteria an LLM judge verifies each round (reflexion only)
      - correctness
      - completeness
    auto_detect: true # infer pattern from tool/autonomy config

| Field | Type | Default | Description |
|---|---|---|---|
| pattern | string | "react" | Reasoning pattern to use |
| auto_plan | bool | false | Prepend "create a todo list" to first turn |
| reflection_rounds | int | 0 | Number of self-critique rounds after completion (0-3) |
| reflection_dimensions | list | null | Custom dimensions for reflexion self-critique, overrides defaults (max 3) |
| success_criteria | list | null | Criteria an LLM-as-judge verifies each reflexion round (reflexion only). Auto-derived from reflection_dimensions names when unset. Max 10. |
| auto_detect | bool | true | Infer pattern from tool/autonomy config |
Patterns
react (default)
Standard ReAct loop. The LLM decides when and how to use tools. No extra orchestration from the runner. This is the pattern every agent uses today.
reasoning:
  pattern: react

todo_driven
Plan-first execution. The runner prepends instructions to create a structured todo list on the first turn. Continuation prompts guide the agent: "Check your todo list. Get the next item and work on it."
Requires a todo tool in spec.tools.
tools:
  - type: todo
reasoning:
  pattern: todo_driven
  auto_plan: true # recommended

How it works:
- First turn: prompt is prefixed with "Before starting, create a structured todo list..."
- Subsequent turns: "Check your todo list. Call get_next_todo..."
- Loop exits when all items reach terminal status (auto-completion) or the agent calls finish_task
plan_execute
Two-phase execution. Phase 1 (planning): the agent creates a comprehensive plan without executing. Phase 2 (execution): the agent works through plan items. The agent explicitly calls finalize_plan() to transition from planning to execution.
Requires a todo tool in spec.tools.
tools:
  - type: todo
reasoning:
  pattern: plan_execute

How it works:
- First turn: "PHASE 1 - PLANNING: Analyze this task and create a comprehensive todo list. Focus only on planning. Do not execute yet."
- Planning continues until the agent calls finalize_plan() to signal the plan is complete. The tool rejects empty plans.
- Phase transition: "PHASE 2 - EXECUTION: Work through your plan."
- Execution continues until auto-completion or finish_task
The finalize_plan() tool is auto-registered when plan_execute is the active pattern. It takes no arguments and returns a confirmation that the phase has transitioned.
reflexion
Post-completion self-critique with dimension-specific evaluation. After the agent finishes (calls finish_task or todo auto-completes), the runner re-opens the state and injects structured critique prompts targeting specific quality dimensions.
Requires reflection_rounds > 0.
reasoning:
  pattern: reflexion
  reflection_rounds: 3 # one round per dimension

How it works:
- Agent works normally until completion
- Each reflection round focuses on a specific dimension. By default, the three dimensions are correctness, completeness, and clarity, each with a structured evaluation rubric. Example prompt: "REFLECTION (1/3) -- CORRECTNESS: Check for factual errors, logical flaws, or incorrect assumptions."
- Agent gets reflection_rounds additional turns to self-correct against each dimension
- Final output is from the last iteration
Reflexion is its own reasoning pattern, not a modifier on other patterns. Setting pattern: todo_driven with reflection_rounds does not add a critique pass; the runner picks the todo_driven strategy and ignores the reflexion rounds. To get self-critique alongside todo work, leave pattern unset and set reflection_rounds (or reflection_dimensions). Auto-detection then selects reflexion, and the agent can still use its todo tools during the run.
Configuring reflection dimensions
Override the default dimensions with custom evaluation criteria:
reasoning:
  pattern: reflexion
  reflection_rounds: 3
  reflection_dimensions:
    - name: correctness
      prompt: "Check for factual errors, logical flaws, and incorrect assumptions."
    - name: completeness
      prompt: "Are there missing sections, gaps in coverage, or unanswered questions?"
    - name: clarity
      prompt: "Is the structure logical and easy to follow? Is the language clear?"

When reflection_dimensions is set, each round uses the corresponding dimension's prompt. Both reflection_rounds and reflection_dimensions are capped at 3.
reflexion (with verification)
Basic reflexion trusts the agent to find and fix its own issues each round. Verified reflexion adds an LLM-as-judge that gates each round against explicit success_criteria. Before composing the next continuation prompt, the runner judges the latest summary against your criteria:
- A round that passes every criterion is recorded as verified. The loop advances to the next dimension, and once the last round verifies it marks the state complete and asks the agent to call finish_task, stopping early instead of burning the remaining rounds.
- A round that fails injects the per-criterion reasons into the next prompt so the agent can address them directly.
Enable it by listing criteria:
reasoning:
  pattern: reflexion
  reflection_rounds: 2
  success_criteria:
    - correctness
    - completeness

If you already define reflection_dimensions and leave success_criteria unset, the criteria are auto-derived from the dimension names:
reasoning:
  pattern: reflexion
  reflection_dimensions:
    - name: correctness
      prompt: "Check for factual errors, logical flaws, and incorrect assumptions."
    - name: completeness
      prompt: "Are there missing sections, gaps in coverage, or unanswered questions?"
  # success_criteria auto-derived as [correctness, completeness]

success_criteria must be non-empty when provided and is capped at 10 entries.
The judge model
The judge reuses the role's configured model when it is resolved, and falls back to openai:gpt-4o-mini when the role's model is not set. It runs at temperature 0.0 and returns a strict per-criterion pass or fail with a reason for each.
The judge call is best-effort. If it raises, that round falls back to the plain dimension prompt. After 2 consecutive judge failures the judge is disabled for the rest of the run, so the loop degrades to basic reflexion rather than stalling on a broken judge.
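The best-effort behaviour above can be pictured as a small wrapper around the judge call. `JudgeGate` is a hypothetical name, and resetting the failure count after a successful call is an assumption implied by the word "consecutive".

```python
from typing import Callable, Optional

MAX_CONSECUTIVE_FAILURES = 2  # threshold taken from the text above


class JudgeGate:
    """Hypothetical wrapper that disables a judge after repeated failures."""

    def __init__(self, judge: Callable[[str], dict]) -> None:
        self._judge = judge
        self._failures = 0
        self.enabled = True

    def evaluate(self, summary: str) -> Optional[dict]:
        """Return a verdict, or None when the round should use the plain prompt."""
        if not self.enabled:
            return None
        try:
            verdict = self._judge(summary)
        except Exception:
            self._failures += 1
            if self._failures >= MAX_CONSECUTIVE_FAILURES:
                self.enabled = False  # degrade to basic reflexion for this run
            return None
        self._failures = 0  # a success breaks the consecutive streak
        return verdict
```

A caller treats a None result as "no verdict this round" and composes the plain dimension prompt instead.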
Prompt tagging
A verified round is tagged in the next prompt, for example:
REFLECTION (2/2) -- COMPLETENESS [VERIFIED]:
A failed round uses the plain dimension prompt with the judge's reasons appended:
Judge feedback on the previous round:
- correctness: <reason the criterion failed>
Address these issues in your revision.
Verdict records
Each verdict is recorded on ReflectionState.judge_verdicts as a dict of {round, all_passed, criteria_results}, where each result carries criterion, passed, and reason. These verdicts are mirrored onto RunResult.judge_verdicts for the audit layer. The list stays empty when success_criteria is not configured, so basic reflexion carries no verdict overhead.
When to use verification
Reach for verified reflexion when "looks done" is not enough and you can state what done means as concrete criteria (for example: code compiles, every requirement is addressed, no contradictions). For open-ended drafting where success is subjective, basic reflexion is usually the right level. The judge shares the eval machinery described in Evals, and pairs well with Structured Output when the criteria check fields of a typed result.
Auto-detection
When auto_detect: true (the default) and no explicit pattern is set:
| Condition | Detected pattern |
|---|---|
| Has todo tool + spec.autonomy configured | todo_driven |
| Has reflection_rounds > 0 | reflexion |
| Everything else | react |
Explicit pattern setting always overrides auto-detection.
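The detection table can be read as a small decision function. This is a sketch only: the table does not state which row wins when both conditions hold, so checking reflection_rounds first is an assumption, chosen to match the earlier note that leaving pattern unset and setting reflection_rounds selects reflexion even when todo tools are present. Falling back to react when auto_detect is off is likewise an assumption based on the documented default.

```python
from typing import Iterable, Optional


def detect_pattern(
    explicit: Optional[str],
    tool_types: Iterable[str],
    has_autonomy: bool,
    reflection_rounds: int = 0,
    auto_detect: bool = True,
) -> str:
    # An explicit pattern always overrides auto-detection.
    if explicit is not None:
        return explicit
    if not auto_detect:
        return "react"  # assumption: the documented default pattern
    # Assumption: reflexion is checked before todo_driven (see lead-in).
    if reflection_rounds > 0:
        return "reflexion"
    if "todo" in set(tool_types) and has_autonomy:
        return "todo_driven"
    return "react"
```

For example, an agent with think and todo tools plus an autonomy block and no pattern resolves to todo_driven, while the same agent without the autonomy block resolves to react.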
Validation
The loader validates reasoning config at build time:
- todo_driven or plan_execute without a todo tool raises RoleLoadError
- reflexion with reflection_rounds == 0 raises RoleLoadError
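The two validation rules above can be expressed as a short check. The `RoleLoadError` class defined here is a local stand-in for the loader's error type, the function name is hypothetical, and the error messages are invented for illustration. The sketch covers only the two listed rules.

```python
from typing import Iterable


class RoleLoadError(Exception):
    """Local stand-in for the loader's error type named above."""


def validate_reasoning(pattern: str, tool_types: Iterable[str], reflection_rounds: int = 0) -> None:
    tools = set(tool_types)
    # Rule 1: todo-based patterns need a todo tool to operate on.
    if pattern in {"todo_driven", "plan_execute"} and "todo" not in tools:
        raise RoleLoadError(f"pattern '{pattern}' requires a todo tool in spec.tools")
    # Rule 2: reflexion without any rounds would never critique anything.
    if pattern == "reflexion" and reflection_rounds == 0:
        raise RoleLoadError("pattern 'reflexion' requires reflection_rounds > 0")
```

Failing at build time means a mismatched pattern and tool list is reported when the role loads, not partway through an autonomous run.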
Zero-Config Examples
You don't need to set spec.reasoning explicitly. Auto-detection picks the right pattern:
Minimal todo agent (auto-detects todo_driven)
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: task-agent
  description: Agent with structured task tracking
spec:
  role: You are a helpful assistant that plans work carefully.
  model:
    provider: openai
    name: gpt-5.4-mini-2026-03-17
  tools:
    - type: think
    - type: todo
  autonomy:
    max_plan_steps: 15
  guardrails:
    max_iterations: 10
    autonomous_token_budget: 50000

initrunner run task-agent.yaml -a -p "Summarize the key differences between REST and GraphQL"
Single-shot with think (auto-detects react)
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: reasoner
  description: Agent that thinks before answering
spec:
  role: |
    You are a careful analyst. Always use the think tool to reason
    step by step before giving your answer.
  model:
    provider: openai
    name: gpt-5.4-mini-2026-03-17
  tools:
    - type: think
      critique: true

initrunner run reasoner.yaml -p "Should we migrate from REST to GraphQL?"
Composing Primitives
The tools compose naturally through LLM reasoning. No special wiring is needed.
think + todo (structured reasoning)
The agent uses think to reason about each todo item before working on it:
tools:
  - type: think
    critique: true
  - type: todo
reasoning:
  pattern: todo_driven
  auto_plan: true

todo + spawn (parallel research)
The agent creates a todo list, spawns background agents for parallelizable items, awaits results, then updates statuses:
tools:
  - type: todo
  - type: spawn
    agents:
      - name: researcher
        role_file: ./agents/researcher.yaml
reasoning:
  pattern: todo_driven
  auto_plan: true

todo + reflexion (self-correcting planner)
To get self-critique with todo tooling, run the reflexion pattern and leave the todo tool available. The agent plans and works through its todo list, then takes one critique round. Because reflexion is its own pattern, do not set pattern: todo_driven here. Pairing todo_driven with reflection_rounds would silently drop the critique pass.
tools:
  - type: todo
  - type: think
    critique: true
reasoning:
  pattern: reflexion
  reflection_rounds: 1

Run-Scoped Tool Architecture
Reasoning tools carry per-run state (thought chains, todo lists, spawn pools). Standard tools are built once at agent-build time and reused across runs. Run-scoped tools are different: they are built fresh for each run with isolated state, preventing leaks across REPL/daemon sessions.
How it works
- Tool author marks a tool as run-scoped in the registration decorator:
@register_tool("todo", TodoToolConfig, run_scoped=True)
def build_todo_toolset(config, ctx, state):
    ...
- build_toolsets() automatically skips run-scoped tools during agent construction
- The runner calls build_run_scoped_toolsets() at the start of each run to construct them with fresh state
- Run-scoped toolsets are passed as extra_toolsets to execute_run()
Creating custom run-scoped tools
If you're building a custom tool that needs per-run state:
from typing import Literal

from initrunner.agent.tools._registry import register_tool, ToolBuildContext
from initrunner.agent.schema.tools import ToolConfigBase
from pydantic_ai.toolsets.function import FunctionToolset


class MyStatefulConfig(ToolConfigBase):
    type: Literal["my_stateful"] = "my_stateful"


@register_tool("my_stateful", MyStatefulConfig, run_scoped=True)
def build_my_toolset(config: MyStatefulConfig, ctx: ToolBuildContext):
    # The builder is called once per run, so this list is never shared across runs.
    state: list[str] = []
    toolset = FunctionToolset()

    @toolset.tool_plain
    def record(value: str) -> str:
        """Store a value and report how many have been recorded this run."""
        state.append(value)
        return f"Recorded {len(state)} values."

    return toolset

Full Example: Autonomous Research Lead
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: research-lead
  description: Autonomous research lead with parallel workers and self-critique
spec:
  role: |
    You are a research lead. Given a topic:
    1. Break it into research questions (todo list)
    2. Spawn researchers for parallelizable questions
    3. Synthesize findings into a structured report
    4. Self-critique before finalizing
  model:
    provider: openai
    name: gpt-5.4-mini-2026-03-17
  tools:
    - type: think
      critique: true
    - type: todo
      max_items: 15
    - type: spawn
      max_concurrent: 3
      agents:
        - name: web-researcher
          role_file: ./agents/web-researcher.yaml
          description: Searches the web and summarizes findings
        - name: data-analyst
          role_file: ./agents/data-analyst.yaml
          description: Analyzes data and produces charts
    - type: filesystem
      root_path: ./output
      read_only: false
  reasoning:
    pattern: reflexion
    reflection_rounds: 1
  autonomy:
    max_plan_steps: 20
  guardrails:
    max_iterations: 20
    autonomous_token_budget: 150000
    timeout_seconds: 600

The reflexion pattern runs the agent through its todo and spawn work, then takes one critique round before finalizing. The todo and spawn tools stay available throughout; reflexion only adds the post-completion critique on top of normal tool use.
initrunner run research-lead.yaml -a -p "Compare the top 3 vector databases for production RAG systems"