Testing
InitRunner includes built-in tools for testing agents before deploying them: schema validation, dry-run mode (no API calls), and an eval-style test suite runner.
Validation
Validate a role YAML against the schema without running the agent:
initrunner validate role.yaml
This checks:
- YAML syntax and structure
- Required fields (apiVersion, kind, metadata.name, spec.role)
- Field types and value ranges (e.g. temperature between 0.0 and 2.0)
- Tool configurations (valid types, required fields per type)
- Skill references (file exists, frontmatter is valid)
- Trigger configurations (valid cron expressions, valid paths)
- Security policy structure
Validation exits with code 0 on success and non-zero on failure, making it suitable for CI pipelines.
Dry-Run Mode
Run an agent without making any LLM API calls:
initrunner run role.yaml --dry-run -p "Test prompt"
Dry-run mode replaces the configured model with a TestModel that returns deterministic placeholder responses. This lets you verify:
- Tool registration and discovery
- Trigger configuration and startup
- Memory system initialization
- Skill loading and merging
- Guardrail enforcement logic
- Sink configuration
No API keys are required and no tokens are consumed. Use dry-run mode during development to catch configuration errors before spending on API calls.
Test Suites
The initrunner test command runs structured test suites against an agent using an eval framework.
initrunner test role.yaml -s test_suite.yaml
Test suite format
A test suite is a YAML file using the standard InitRunner envelope: apiVersion, kind, metadata, and a list of cases. Each case has a name, a prompt, and a list of assertions.
apiVersion: initrunner/v1
kind: TestSuite
metadata:
  name: support-agent-tests
cases:
  - name: answers_product_question
    prompt: "What is the return policy?"
    assertions:
      - type: contains
        value: "30 days"
      - type: contains
        value: "refund"
  - name: rejects_off_topic
    prompt: "What's the weather like?"
    assertions:
      - type: not_contains
        value: "forecast"
      - type: max_tokens
        limit: 200
  - name: uses_search_tool
    prompt: "Find articles about shipping delays"
    assertions:
      - type: tool_calls
        expected: ["search_documents"]
      - type: contains
        value: "shipping"
  - name: stays_within_budget
    prompt: "Write a comprehensive guide to our product line"
    assertions:
      - type: max_tokens
        limit: 4096
      - type: max_latency
        limit_ms: 30000
Top-level fields:
| Field | Type | Default | Description |
|---|---|---|---|
| apiVersion | string | (required) | Must be initrunner/v1 |
| kind | string | (required) | Must be TestSuite |
| metadata.name | string | (required) | Suite name shown in the results table |
| cases | list | [] | Test cases in the suite |
Case fields:
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | (required) | Unique case name |
| prompt | string | (required) | Prompt sent to the agent |
| assertions | list | [] | Assertions to evaluate against the run |
| tags | list[string] | [] | Tags for --tag filtering |
| expected_output | string | null | Simulated model output, used only in --dry-run; ignored otherwise |
This same YAML runs unchanged on both run paths (see How suites run below). You write one suite, and the choice of runner does not change what you write.
Assertion types
Assertions are a discriminated union on type. There are eleven types. Output-based assertions check the final response; the timeline and span types check how the run unfolded.
| Type | Key fields | Description |
|---|---|---|
| contains | value, case_insensitive (default false) | Output contains the substring |
| not_contains | value, case_insensitive (default false) | Output does not contain the substring |
| regex | pattern | re.search matches the pattern anywhere in the output |
| tool_calls | expected, mode (default subset) | Tools called during the run, compared as sets; message includes an F1 score |
| max_tokens | limit | Total tokens are <= limit |
| max_latency | limit_ms | Wall-clock duration in ms is <= limit_ms |
| llm_judge | criteria, model (default openai:gpt-4o-mini) | An LLM scores each criterion; skipped and marked failed in --dry-run on the default runner |
| tool_order | sequence, strict (default false) | Tool calls occur in the given order |
| reasoning_budget | max_reasoning_tokens | Reasoning tokens are <= max_reasoning_tokens |
| memory_consulted | expected (default true), tools | Whether a memory tool was called |
| span | name_contains, attribute, attribute_value, count | Matches a span (or a timeline entry) |
The default mode for tool_calls is subset (all expected tools must appear, extras allowed); the other modes are exact (sets equal) and superset (no tools beyond expected). The last four types read the run-event timeline and are covered in Span and timeline assertions. For full per-field detail on every type, see Agent Evals.
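The three tool_calls modes reduce to set comparisons. The sketch below illustrates those semantics only and is not InitRunner's implementation: the function name compare_tool_calls is hypothetical, and the F1 formula (harmonic mean of precision and recall over the two sets) is an assumption about how the reported score is derived.

```python
def compare_tool_calls(expected, actual, mode="subset"):
    """Return (passed, f1) for a set comparison of tool names.

    Hypothetical helper: illustrates the documented mode semantics only.
    """
    exp, act = set(expected), set(actual)
    if mode == "subset":      # every expected tool was called; extras allowed
        passed = exp <= act
    elif mode == "exact":     # the two sets are equal
        passed = exp == act
    elif mode == "superset":  # nothing was called beyond the expected tools
        passed = act <= exp
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Assumed F1 definition: harmonic mean of precision and recall.
    overlap = len(exp & act)
    if overlap == 0:
        return passed, 0.0
    precision = overlap / len(act)
    recall = overlap / len(exp)
    return passed, 2 * precision * recall / (precision + recall)
```

Under these assumptions, an agent that calls search_documents plus one extra tool passes in subset mode but fails in exact mode, and the F1 score drops below 1.0 because of the extra call.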
Running tests
# Run a test suite against the live model
initrunner test role.yaml -s test_suite.yaml
# Dry-run (no API calls, uses TestModel)
initrunner test role.yaml -s test_suite.yaml --dry-run
# Verbose output, with concurrency
initrunner test role.yaml -s test_suite.yaml -v -j 4
# Filter by tag (repeatable, values OR'd) and save JSON
initrunner test role.yaml -s test_suite.yaml --tag search -o results.json
PATH may be an agent directory, a role YAML, or an installed role name.
| Flag | Type | Default | Description |
|---|---|---|---|
| -s, --suite | str | (required) | Path to the test suite YAML |
| --dry-run | bool | false | Use TestModel instead of real API calls |
| -v, --verbose | bool | false | Show assertion details for each case |
| -j, --concurrency | int | 1 | Number of concurrent workers (each builds its own agent) |
| -o, --output | path | (none) | Save results as JSON (schema is unchanged across run paths) |
| --tag | list[str] | [] | Filter cases by tag; repeatable, values OR'd |
| --pydantic-evals | bool | false | Run via pydantic-evals (needs the observability extra) |
| --report | bool | false | Print the native pydantic-evals report (per-evaluator scores, averages, span analyses); implies --pydantic-evals. Since v2026.6.4. |
| --report-json | path | (none) | Save the full native pydantic-evals report as JSON; implies --pydantic-evals. Since v2026.6.4. |
| --model | str | (role default) | Override the model used for the run |
The command exits with code 0 when every case passes and code 1 on any failure or error, so you can use it in CI without extra wiring.
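To make the "repeatable, values OR'd" behavior of --tag concrete, here is a minimal sketch of OR-style tag filtering. It is an illustration, not InitRunner's code: filter_by_tags is a hypothetical name, cases are modeled as plain dicts mirroring the suite YAML, and running every case when no tag is given is an assumption based on the flag's empty default.

```python
def filter_by_tags(cases, tags):
    """Keep cases that carry at least one of the requested tags (OR semantics).

    Hypothetical helper; cases are dicts shaped like the YAML case entries.
    """
    if not tags:  # assumed: no --tag flag means every case runs
        return list(cases)
    wanted = set(tags)
    return [case for case in cases if wanted & set(case.get("tags", []))]


cases = [
    {"name": "uses_search_tool", "tags": ["search", "tools"]},
    {"name": "rejects_off_topic", "tags": ["guardrails"]},
    {"name": "stays_within_budget"},  # no tags field, so tags defaults to []
]

# Equivalent in spirit to: --tag search --tag guardrails
selected = [case["name"] for case in filter_by_tags(cases, ["search", "guardrails"])]
```

An untagged case is selected only when no tag filter is active, which is why tagging smoke-test cases explicitly tends to be worthwhile.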
Test output
The command prints a header line, then a table of cases, then a summary. Pass -v to add a Details column with one line per assertion.
Running support-agent-tests (4 cases) against support-agent
Test Suite: support-agent-tests
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃ Case                      ┃ Status ┃ Duration ┃ Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│ answers_product_question  │ PASS   │ 1200ms   │ 340    │
│ rejects_off_topic         │ PASS   │ 800ms    │ 95     │
│ uses_search_tool          │ PASS   │ 2100ms   │ 520    │
│ stays_within_budget       │ FAIL   │ 1800ms   │ 4301   │
└───────────────────────────┴────────┴──────────┴────────┘
3/4 passed ✗ Some tests failed 5256 tokens | 5900ms total
With -v, the Details column shows each assertion result, for example ✗ Tokens 4301 exceeded limit 4096.
Looking for the full eval framework? See Agent Evals for LLM judge assertions, concurrent execution, tag-based filtering, JSON output, and more.
How suites run: pydantic-evals
You keep writing the exact same YAML. This release changes only how a suite runs, not what you put in it. The default runner stays in place, and adding --pydantic-evals opts a run into a second path.
uv pip install "initrunner[observability]"
initrunner test role.yaml -s test_suite.yaml --pydantic-evals -v
On this path, each case becomes a pydantic_evals.Case, and its assertions are translated into evaluators inside a Dataset. Running the dataset produces a native pydantic-evals EvaluationReport. The two runners share the same assertion logic, so a case that passes on one path passes on the other.
The flag requires the observability extra, which bundles pydantic-evals. Without it, the run raises a MissingExtraError whose message ends in the install hint uv pip install initrunner[observability].
The LLM judge is reused on this path as a custom evaluator that calls the same judge code as the default runner. Span capture uses an in-memory OpenTelemetry exporter and does not need a network backend or Logfire, so a local, no-Logfire setup still works. See Observability for how spans are produced.
The output table and exit codes are identical to the default runner (exit 0 when all pass, exit 1 on any failure or error), so you can switch a CI job to --pydantic-evals without changing anything else.
Span and timeline assertions
Four assertion types describe how a run unfolded rather than what the final response said: tool_order, reasoning_budget, memory_consulted, and span. They read the structured run-event timeline, so they work on the default runner without Logfire or OTLP. The span type additionally queries a real OpenTelemetry span tree when you run with --pydantic-evals against an instrumented agent, and falls back to the timeline otherwise.
apiVersion: initrunner/v1
kind: TestSuite
metadata:
  name: process-checks
cases:
  - name: searches_then_summarizes
    prompt: "Research and summarize the latest on Docker volumes"
    assertions:
      - type: tool_order
        sequence: ["web_search", "summarize"]
        strict: false
      - type: span
        name_contains: "web_search"
        count: 1
      - type: reasoning_budget
        max_reasoning_tokens: 1000
      - type: memory_consulted
        expected: false
A run that reports zero reasoning tokens always passes reasoning_budget, so models that emit no thinking are never penalized. With strict: false, tool_order checks relative order and allows gaps; strict: true requires the observed tool-call sequence to equal sequence exactly. For the full field reference, see Agent Evals.
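The difference between the two tool_order settings is the difference between a subsequence check and an equality check. The following sketch shows that distinction under the semantics described above; tool_order_ok is a hypothetical name and this is not InitRunner's implementation.

```python
def tool_order_ok(sequence, observed, strict=False):
    """Check observed tool calls against an expected order.

    Hypothetical helper illustrating the documented semantics:
    - strict=True:  observed must equal sequence exactly.
    - strict=False: sequence must appear in order within observed, gaps allowed.
    """
    if strict:
        return list(observed) == list(sequence)
    remaining = iter(observed)
    # `name in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return all(name in remaining for name in sequence)


observed = ["web_search", "fetch_page", "summarize"]
relaxed = tool_order_ok(["web_search", "summarize"], observed)              # gap allowed
exact = tool_order_ok(["web_search", "summarize"], observed, strict=True)   # extra call breaks equality
```

Here the intervening fetch_page call is tolerated in the relaxed check but makes the strict check fail.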
Reaching the EvaluationReport from Python
When you want aggregate metrics or your own reporting, call the pydantic-evals runner directly. It returns a PydanticEvalsResult with two attributes: .suite_result, the familiar SuiteResult whose to_dict() and JSON export are unchanged, and .report, the native pydantic-evals EvaluationReport.
from pathlib import Path
from initrunner.agent.loader import load_and_build
from initrunner.eval.runner import load_suite, run_suite_pydantic_evals
role, agent = load_and_build(Path("role.yaml"))
suite = load_suite(Path("test_suite.yaml"))
result = run_suite_pydantic_evals(agent, role, suite)
result.report.print()  # native pydantic-evals report
for case in result.report.cases:
    print(case.name, case.assertions)
print(result.suite_result.all_passed)  # same SuiteResult as the CLI
Since v2026.6.4, you do not need Python for this: initrunner test --report prints the same native report to the console, and --report-json <path> writes it to disk. Both imply --pydantic-evals and need the observability extra. See Agent Evals for the CLI details.
Testing Workflow
A practical workflow for developing and testing agents:
- Validate. Catch schema errors early:
  initrunner validate role.yaml
- Dry-run. Verify tool registration and config without API calls:
  initrunner run role.yaml --dry-run -p "Test prompt"
- Interactive test. Manual testing in REPL mode:
  initrunner run role.yaml -i
- Suite test. Run automated assertions against real model output:
  initrunner test role.yaml -s tests/regression.yaml
- CI integration. Validate and dry-run in CI, suite tests on schedule:
  # In CI pipeline
  initrunner validate role.yaml
  initrunner test role.yaml -s tests/smoke.yaml --dry-run
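Because both validate and test signal failure through their exit codes, the CI step can be driven from a small script that stops at the first failing command. This is a sketch under assumptions: run_steps and CI_STEPS are hypothetical names, and the paths are the ones used in the workflow above.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        print("$", " ".join(cmd), flush=True)
        code = subprocess.run(cmd).returncode
        if code != 0:
            return code  # stop early so later steps do not mask the failure
    return 0


# The two CI commands from the workflow, expressed as argument lists.
CI_STEPS = [
    ["initrunner", "validate", "role.yaml"],
    ["initrunner", "test", "role.yaml", "-s", "tests/smoke.yaml", "--dry-run"],
]

# In a CI entry point you would end with: sys.exit(run_steps(CI_STEPS))
```

Propagating the original exit code, rather than collapsing it to 1, keeps the CI log closer to what the failing command reported.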
Async Tests
Tests for the async runtime use pytest-asyncio:
| Test File | Coverage |
|---|---|
| test_executor_async.py | execute_run_async, execute_run_stream_async, async retry logic |
| test_signal_async.py | Async signal handler, double-Ctrl-C force exit |
These tests use @pytest.mark.asyncio and mock PydanticAI's agent.run() / agent.run_stream() to avoid real LLM calls.
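The mocking pattern described above can be sketched with the standard library alone. Everything here is illustrative: run_once is a hypothetical stand-in for the code under test (it is not an InitRunner function), and the .output attribute on the result is an assumption about the shape of the object returned by agent.run(). Under pytest-asyncio, the check coroutine would be written as a test function decorated with @pytest.mark.asyncio instead of being driven by asyncio.run.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock


async def run_once(agent, prompt):
    """Hypothetical code under test: awaits agent.run() and returns its output."""
    result = await agent.run(prompt)
    return result.output  # assumed attribute name on the run result


async def check_run_once_uses_mocked_agent():
    # AsyncMock makes agent.run awaitable and records how it was awaited,
    # so no real LLM call is made.
    fake_result = SimpleNamespace(output="stubbed reply")
    agent = SimpleNamespace(run=AsyncMock(return_value=fake_result))

    reply = await run_once(agent, "hello")

    agent.run.assert_awaited_once_with("hello")
    return reply


reply = asyncio.run(check_run_once_uses_mocked_agent())
```

The same approach extends to failure paths: setting side_effect on the AsyncMock to an exception lets a test exercise retry logic without a network.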