Testing

InitRunner includes built-in tools for testing agents before deploying them: schema validation, dry-run mode (no API calls), and an eval-style test suite runner.

Validation

Validate a role YAML against the schema without running the agent:

initrunner validate role.yaml

This checks:

YAML syntax and structure
Required fields (apiVersion, kind, metadata.name, spec.role)
Field types and value ranges (e.g. temperature between 0.0 and 2.0)
Tool configurations (valid types, required fields per type)
Skill references (file exists, frontmatter is valid)
Trigger configurations (valid cron expressions, valid paths)
Security policy structure

Validation exits with code 0 on success and non-zero on failure, making it suitable for CI pipelines.

Dry-Run Mode

Run an agent without making any LLM API calls:

initrunner run role.yaml --dry-run -p "Test prompt"

Dry-run mode replaces the configured model with a TestModel that returns deterministic placeholder responses. This lets you verify:

Tool registration and discovery
Trigger configuration and startup
Memory system initialization
Skill loading and merging
Guardrail enforcement logic
Sink configuration

No API keys are required and no tokens are consumed. Use dry-run mode during development to catch configuration errors before you pay for API calls.

Test Suites

The initrunner test command runs structured test suites against an agent using an eval framework.

initrunner test role.yaml -s test_suite.yaml

Test suite format

A test suite is a YAML file using the standard InitRunner envelope: apiVersion, kind, metadata, and a list of cases. Each case has a name, a prompt, and a list of assertions.

apiVersion: initrunner/v1
kind: TestSuite
metadata:
  name: support-agent-tests
cases:
  - name: answers_product_question
    prompt: "What is the return policy?"
    assertions:
      - type: contains
        value: "30 days"
      - type: contains
        value: "refund"

  - name: rejects_off_topic
    prompt: "What's the weather like?"
    assertions:
      - type: not_contains
        value: "forecast"
      - type: max_tokens
        limit: 200

  - name: uses_search_tool
    prompt: "Find articles about shipping delays"
    assertions:
      - type: tool_calls
        expected: ["search_documents"]
      - type: contains
        value: "shipping"

  - name: stays_within_budget
    prompt: "Write a comprehensive guide to our product line"
    assertions:
      - type: max_tokens
        limit: 4096
      - type: max_latency
        limit_ms: 30000

Top-level fields:

Field	Type	Default	Description
`apiVersion`	`string`	(required)	Must be `initrunner/v1`
`kind`	`string`	(required)	Must be `TestSuite`
`metadata.name`	`string`	(required)	Suite name shown in the results table
`cases`	`list`	`[]`	Test cases in the suite

Case fields:

Field	Type	Default	Description
`name`	`string`	(required)	Unique case name
`prompt`	`string`	(required)	Prompt sent to the agent
`assertions`	`list`	`[]`	Assertions to evaluate against the run
`tags`	`list[string]`	`[]`	Tags for `--tag` filtering
`expected_output`	`string`	`null`	Simulated model output, used only in `--dry-run`; ignored otherwise
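
One way to picture the two tables above is as a pair of Python dataclasses. This is only a mental model: the class names `SuiteSketch` and `CaseSketch` and the `problems()` helper are hypothetical, and InitRunner's own schema classes may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class CaseSketch:
    """Hypothetical mirror of one entry under `cases`."""

    name: str                                   # required, unique within the suite
    prompt: str                                 # required, sent to the agent
    assertions: list[dict[str, Any]] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    expected_output: Optional[str] = None       # only used with --dry-run


@dataclass
class SuiteSketch:
    """Hypothetical mirror of the top-level envelope."""

    api_version: str
    kind: str
    name: str                                   # corresponds to metadata.name
    cases: list[CaseSketch] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the envelope looks sane."""
        issues = []
        if self.api_version != "initrunner/v1":
            issues.append(f"unexpected apiVersion: {self.api_version}")
        if self.kind != "TestSuite":
            issues.append(f"unexpected kind: {self.kind}")
        names = [case.name for case in self.cases]
        for dupe in sorted({n for n in names if names.count(n) > 1}):
            issues.append(f"duplicate case name: {dupe}")
        return issues
```

The defaults follow the tables: `assertions` and `tags` start empty, and `expected_output` is `None` unless a case sets it.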

The same YAML runs unchanged on both run paths (see How suites run: pydantic-evals below), so you write a suite once regardless of which runner executes it.

Assertion types

Assertions are a discriminated union on type. There are eleven types. Output-based assertions check the final response; the timeline and span types check how the run unfolded.

Type	Key fields	Description
`contains`	`value`, `case_insensitive` (default `false`)	Output contains the substring
`not_contains`	`value`, `case_insensitive` (default `false`)	Output does not contain the substring
`regex`	`pattern`	`re.search` matches the pattern anywhere in the output
`tool_calls`	`expected`, `mode` (default `subset`)	Tools called during the run, compared as sets; message includes an F1 score
`max_tokens`	`limit`	Total tokens are `<= limit`
`max_latency`	`limit_ms`	Wall-clock duration in ms is `<= limit_ms`
`llm_judge`	`criteria`, `model` (default `openai:gpt-4o-mini`)	An LLM scores each criterion; skipped and marked failed in `--dry-run` on the default runner
`tool_order`	`sequence`, `strict` (default `false`)	Tool calls occur in the given order
`reasoning_budget`	`max_reasoning_tokens`	Reasoning tokens are `<= max_reasoning_tokens`
`memory_consulted`	`expected` (default `true`), `tools`	Whether a memory tool was called
`span`	`name_contains`, `attribute`, `attribute_value`, `count`	Matches a span (or a timeline entry)
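
The three text-matching rows of the table can be read as plain string and regex checks. The functions below are a minimal sketch of that reading, with hypothetical names; they are not InitRunner's implementation, and the real assertions may normalize text differently.

```python
import re


def check_contains(output: str, value: str, case_insensitive: bool = False) -> bool:
    """Substring test; lower-cases both sides when case_insensitive is set."""
    if case_insensitive:
        return value.lower() in output.lower()
    return value in output


def check_not_contains(output: str, value: str, case_insensitive: bool = False) -> bool:
    """Logical negation of check_contains."""
    return not check_contains(output, value, case_insensitive)


def check_regex(output: str, pattern: str) -> bool:
    """re.search scans the whole string; re.match would only try the start."""
    return re.search(pattern, output) is not None


reply = "Returns are accepted within 30 days for a full Refund."
print(check_contains(reply, "30 days"))                        # True
print(check_contains(reply, "refund"))                         # False: the reply has a capital R
print(check_contains(reply, "refund", case_insensitive=True))  # True
print(check_regex(reply, r"\d+ days"))                         # True: the match is mid-string
```

Because `case_insensitive` defaults to `false`, a suite that expects `refund` will fail on `Refund` unless the assertion opts in.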

The default mode for tool_calls is subset (all expected tools must appear, extras allowed); the other modes are exact (sets equal) and superset (no tools beyond expected). The last four types read the run-event timeline and are covered in Span and timeline assertions. For full per-field detail on every type, see Agent Evals.
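
The three modes reduce to set comparisons between the expected and observed tool names. The sketch below illustrates that logic under stated assumptions: `check_tool_calls` is a hypothetical name, and the F1 score is computed with the usual precision/recall definition over the two sets, which may not match how InitRunner derives the number it reports.

```python
def check_tool_calls(expected, actual, mode="subset"):
    """Return (passed, f1) for a set-based comparison of tool names."""
    exp, act = set(expected), set(actual)

    if mode == "subset":        # every expected tool was called; extras allowed
        passed = exp <= act
    elif mode == "exact":       # the two sets are identical
        passed = exp == act
    elif mode == "superset":    # nothing was called beyond the expected tools
        passed = act <= exp
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Assumed F1 definition: harmonic mean of precision and recall.
    overlap = len(exp & act)
    precision = overlap / len(act) if act else 0.0
    recall = overlap / len(exp) if exp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return passed, f1


called = ["search_documents", "summarize"]
print(check_tool_calls(["search_documents"], called))            # passes in subset mode
print(check_tool_calls(["search_documents"], called, "exact"))   # fails: one extra tool
```

In this sketch the F1 score is independent of the mode, so it can show how close a failing run came to the expected set.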

Running tests

# Run a test suite against the live model
initrunner test role.yaml -s test_suite.yaml

# Dry-run (no API calls, uses TestModel)
initrunner test role.yaml -s test_suite.yaml --dry-run

# Verbose output, with concurrency
initrunner test role.yaml -s test_suite.yaml -v -j 4

# Filter by tag (repeatable, values OR'd) and save JSON
initrunner test role.yaml -s test_suite.yaml --tag search -o results.json

PATH, the first positional argument (role.yaml in the examples above), may be an agent directory, a role YAML file, or an installed role name.

Flag	Type	Default	Description
`-s, --suite`	`str`	(required)	Path to the test suite YAML
`--dry-run`	`bool`	`false`	Use TestModel instead of real API calls
`-v, --verbose`	`bool`	`false`	Show assertion details for each case
`-j, --concurrency`	`int`	`1`	Number of concurrent workers (each builds its own agent)
`-o, --output`	`path`	(none)	Save results as JSON (schema is unchanged across run paths)
`--tag`	`list[str]`	`[]`	Filter cases by tag; repeatable, values OR'd
`--pydantic-evals`	`bool`	`false`	Run via pydantic-evals (needs the observability extra)
`--report`	`bool`	`false`	Print the native pydantic-evals report (per-evaluator scores, averages, span analyses); implies `--pydantic-evals`. Since v2026.6.4.
`--report-json`	`path`	(none)	Save the full native pydantic-evals report as JSON; implies `--pydantic-evals`. Since v2026.6.4.
`--model`	`str`	(role default)	Override the model used for the run

The command exits with code 0 when every case passes and code 1 on any failure or error, so you can use it in CI without extra wiring.
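
The `--tag` row in the table above says that repeated values are OR'd. A minimal sketch of that selection rule follows; `select_cases` is a hypothetical helper, cases are modeled as plain dicts, and treating an empty tag list as "run everything" is an assumption based on the flag's `[]` default.

```python
def select_cases(cases, tags):
    """Keep a case when it carries at least one of the requested tags."""
    if not tags:                      # assumed: no --tag means no filtering
        return list(cases)
    wanted = set(tags)
    return [case for case in cases if wanted & set(case.get("tags", []))]


cases = [
    {"name": "uses_search_tool", "tags": ["search", "tools"]},
    {"name": "rejects_off_topic", "tags": ["safety"]},
    {"name": "stays_within_budget"},  # no tags at all
]

# Equivalent in spirit to: --tag search --tag safety
for case in select_cases(cases, ["search", "safety"]):
    print(case["name"])
```

Under this rule an untagged case is skipped whenever any `--tag` is given.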

Test output

The command prints a header line, then a table of cases, then a summary. Pass -v to add a Details column with one line per assertion.

Running support-agent-tests (4 cases) against support-agent

                       Test Suite: support-agent-tests
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃ Case                      ┃ Status ┃ Duration ┃ Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│ answers_product_question  │ PASS   │ 1200ms   │    340 │
│ rejects_off_topic         │ PASS   │ 800ms    │     95 │
│ uses_search_tool          │ PASS   │ 2100ms   │    520 │
│ stays_within_budget       │ FAIL   │ 1800ms   │   4301 │
└───────────────────────────┴────────┴──────────┴────────┘

3/4 passed ✗ Some tests failed  5256 tokens | 5900ms total

With -v, the Details column shows each assertion result, for example ✗ Tokens 4301 exceeded limit 4096.

Looking for the full eval framework? See Agent Evals for LLM judge assertions, concurrent execution, tag-based filtering, JSON output, and more.

How suites run: pydantic-evals

You keep writing the exact same YAML. This release changes only how a suite runs, not what you put in it. The default runner stays in place, and adding --pydantic-evals opts a run into a second path.

uv pip install "initrunner[observability]"
initrunner test role.yaml -s test_suite.yaml --pydantic-evals -v

On this path, each case becomes a pydantic_evals.Case, and its assertions are translated into evaluators inside a Dataset. Running the dataset produces a native pydantic-evals EvaluationReport. The two runners share the same assertion logic, so for the same run output a case that passes on one path passes on the other.

The flag requires the observability extra, which bundles pydantic-evals. Without it, the run raises a MissingExtraError whose message ends in the install hint uv pip install initrunner[observability].

The LLM judge is reused on this path as a custom evaluator that calls the same judge code as the default runner. Span capture uses an in-memory OpenTelemetry exporter and does not need a network backend or Logfire, so a local, no-Logfire setup still works. See Observability for how spans are produced.

The output table and exit codes are identical to the default runner (exit 0 when all pass, exit 1 on any failure or error), so you can switch a CI job to --pydantic-evals without changing anything else.

Span and timeline assertions

Four assertion types describe how a run unfolded rather than what the final response said: tool_order, reasoning_budget, memory_consulted, and span. They read the structured run-event timeline, so they work on the default runner without Logfire or OTLP. The span type additionally queries a real OpenTelemetry span tree when you run with --pydantic-evals against an instrumented agent, and falls back to the timeline otherwise.

apiVersion: initrunner/v1
kind: TestSuite
metadata:
  name: process-checks
cases:
  - name: searches_then_summarizes
    prompt: "Research and summarize the latest on Docker volumes"
    assertions:
      - type: tool_order
        sequence: ["web_search", "summarize"]
        strict: false
      - type: span
        name_contains: "web_search"
        count: 1
      - type: reasoning_budget
        max_reasoning_tokens: 1000
      - type: memory_consulted
        expected: false

A run that reports zero reasoning tokens always passes reasoning_budget, so models that emit no thinking are never penalized. With strict: false, tool_order checks relative order and allows gaps; strict: true requires the observed tool-call sequence to equal sequence exactly. For the full field reference, see Agent Evals.
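
The difference between the two `tool_order` settings can be sketched as a subsequence test versus an equality test. `tool_order_ok` is a hypothetical name, and this is an illustration of the rule as described above rather than InitRunner's code.

```python
def tool_order_ok(observed, sequence, strict=False):
    """Check that `sequence` appears in `observed` in order.

    strict=False: relative order only, other tool calls may sit in between.
    strict=True:  the observed calls must equal `sequence` exactly.
    """
    if strict:
        return list(observed) == list(sequence)
    remaining = iter(observed)
    # `tool in remaining` advances the iterator, so each later match
    # has to occur after the previous one.
    return all(tool in remaining for tool in sequence)


observed = ["web_search", "fetch_page", "summarize"]
print(tool_order_ok(observed, ["web_search", "summarize"]))               # True: gap allowed
print(tool_order_ok(observed, ["web_search", "summarize"], strict=True))  # False: extra call
print(tool_order_ok(observed, ["summarize", "web_search"]))               # False: wrong order
```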

Reaching the EvaluationReport from Python

When you want aggregate metrics or your own reporting, call the pydantic-evals runner directly. It returns a PydanticEvalsResult with two attributes: .suite_result, the familiar SuiteResult whose to_dict() and JSON export are unchanged, and .report, the native pydantic-evals EvaluationReport.

from pathlib import Path

from initrunner.agent.loader import load_and_build
from initrunner.eval.runner import load_suite, run_suite_pydantic_evals

role, agent = load_and_build(Path("role.yaml"))
suite = load_suite(Path("test_suite.yaml"))

result = run_suite_pydantic_evals(agent, role, suite)

result.report.print()           # native pydantic-evals report
for case in result.report.cases:
    print(case.name, case.assertions)

print(result.suite_result.all_passed)  # same SuiteResult as the CLI

Since v2026.6.4, you do not need Python for this: initrunner test --report prints the same native report to the console, and --report-json <path> writes it to disk. Both imply --pydantic-evals and need the observability extra. See Agent Evals for the CLI details.

Testing Workflow

A practical workflow for developing and testing agents:

Validate. Catch schema errors early:
```
initrunner validate role.yaml
```
Dry-run. Verify tool registration and config without API calls:
```
initrunner run role.yaml --dry-run -p "Test prompt"
```
Interactive test. Manual testing in REPL mode:
```
initrunner run role.yaml -i
```
Suite test. Run automated assertions against real model output:
```
initrunner test role.yaml -s tests/regression.yaml
```

CI integration. Run validation and a dry-run suite on every pipeline run, and schedule the live suite tests separately:

# In CI pipeline
initrunner validate role.yaml
initrunner test role.yaml -s tests/smoke.yaml --dry-run
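
If your pipeline is driven from Python rather than a shell script, the same gate can be sketched with `subprocess`. The helper `run_steps` and the `CI_STEPS` list are hypothetical; the only behavior relied on is the one documented above, that both commands exit non-zero on failure.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        code = subprocess.run(cmd).returncode
        if code != 0:
            print(f"step failed ({code}): {' '.join(cmd)}", file=sys.stderr)
            return code
    return 0


# The two commands from the CI snippet above, as argument lists.
CI_STEPS = [
    ["initrunner", "validate", "role.yaml"],
    ["initrunner", "test", "role.yaml", "-s", "tests/smoke.yaml", "--dry-run"],
]

# In a real pipeline script you would finish with:
#     sys.exit(run_steps(CI_STEPS))
```

Stopping at the first failure keeps the dry-run suite from running against a role that did not validate.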

Async Tests

Tests for the async runtime use pytest-asyncio:

Test File	Coverage
`test_executor_async.py`	`execute_run_async`, `execute_run_stream_async`, async retry logic
`test_signal_async.py`	Async signal handler, double-Ctrl-C force exit

These tests use @pytest.mark.asyncio and mock PydanticAI's agent.run() / agent.run_stream() to avoid real LLM calls.
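
The mocking pattern can be sketched with the standard library alone. Here `run_with_retry` is a hypothetical stand-in for the retry logic under test, not InitRunner's `execute_run_async`, and `asyncio.run` drives the test directly; under pytest-asyncio the test function would instead carry `@pytest.mark.asyncio` and be collected by pytest.

```python
import asyncio
from unittest.mock import AsyncMock


async def run_with_retry(agent, prompt, attempts=3):
    """Hypothetical retry wrapper: retry agent.run() on ConnectionError."""
    last_error = None
    for _ in range(attempts):
        try:
            return await agent.run(prompt)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


async def test_retries_then_succeeds():
    # AsyncMock makes agent.run awaitable, so no real LLM call is made.
    agent = AsyncMock()
    agent.run.side_effect = [ConnectionError("transient"), "ok"]

    assert await run_with_retry(agent, "hi") == "ok"
    assert agent.run.await_count == 2   # one failure, one success


asyncio.run(test_retries_then_succeeds())
```

Giving `side_effect` a list lets one mock script a failure followed by a success, which is enough to exercise a retry path deterministically.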
