Durability

A long multi-agent flow can be interrupted. The process gets killed, the host reboots, or a downstream agent errors out partway through. Without durability, resuming means starting over from the entry agent, paying for every sub-agent again even though most of them already finished.

Durability fixes this. When you enable it on a flow, InitRunner records each completed sub-agent into an append-only, HMAC-signed checkpoint journal keyed by flow_run_id. On resume, completed sub-agents are replayed from the journal and execution continues at the first one that did not finish.

The journal lives in the audit store. There is no broker, no worker pool, and no extra service to run. Durability reuses the same SQLite database and the same HMAC signing key as the audit trail, so it stays local-first and tamper-evident.

Durability is off by default. Single-shot agent runs and the REPL are never affected. Only flows that opt in pay the small cost of one checkpoint row per completed sub-agent.

Enabling durability

Add a durability block to your flow's spec. Setting enabled: true is all you need:

# flow.yaml
apiVersion: initrunner/v1
kind: Flow
metadata:
  name: my-pipeline
spec:
  durability:
    enabled: true
  agents:
    producer:
      role: roles/producer.yaml
      sink:
        type: delegate
        target: consumer
    consumer:
      role: roles/consumer.yaml
      needs:
        - producer

You do not have to set backend yourself. When enabled: true and backend is left at its default of none, a model validator upgrades it to journal automatically. Durability is active (checkpoints written and consulted) only when enabled is true and backend is journal.
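The upgrade rule can be sketched in a few lines. This is a minimal illustration that uses a plain dataclass instead of the real model validator; the class name `DurabilityConfig` and the `active` property are assumptions made for the example, not InitRunner's actual schema code.

```python
from dataclasses import dataclass


@dataclass
class DurabilityConfig:
    """Hypothetical stand-in for the spec.durability block."""

    enabled: bool = False
    backend: str = "none"  # "none" or "journal"

    def __post_init__(self) -> None:
        if self.backend not in ("none", "journal"):
            raise ValueError(f"unknown backend: {self.backend!r}")
        # enabled: true with the default backend implies the journal.
        if self.enabled and self.backend == "none":
            self.backend = "journal"

    @property
    def active(self) -> bool:
        # Checkpoints are written and consulted only in this state.
        return self.enabled and self.backend == "journal"
```

With this sketch, `DurabilityConfig(enabled=True)` ends up with `backend == "journal"` and `active == True`, while the default `DurabilityConfig()` stays inactive.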

Configuration

The durability block lives at spec.durability:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `bool` | `false` | Turn the audit-backed checkpoint journal on. Setting `true` alone is enough; the backend auto-upgrades to `journal`. |
| `backend` | `"none" \| "journal"` | `"none"` | `none` means no journaling (single-shot and REPL unaffected). `journal` is the audit-backed durable ledger. `enabled: true` implies `journal`. |
| `retry_policy` | `"exponential" \| "linear" \| "none"` | `"exponential"` | Reserved for retry tuning. Defined in the schema but not yet wired to behavior. |
| `max_retries` | `int` | `3` | Reserved for retry tuning. Not yet wired to behavior. |
| `retry_delay_seconds` | `int` | `1` | Reserved for retry tuning. Not yet wired to behavior. |

retry_policy, max_retries, and retry_delay_seconds are accepted by the schema today but nothing in the checkpoint or resume path reads them yet. Treat them as reserved. Set enabled: true and the journal works regardless of these values.

Running and resuming

Run the flow normally:

initrunner flow up flow.yaml

A durable run records each completed sub-agent and its flow_run_id into the audit store. The flow up command has no --resume flag; resume is a separate command.

To find the flow_run_id of a run you want to resume, query the delegate routing events in the audit trail:

initrunner flow events
initrunner flow events --run-id <flow_run_id>

Then resume the interrupted run by passing the flow file and the flow_run_id as two positional arguments:

initrunner flow resume flow.yaml <flow_run_id>

On resume the CLI prints how many checkpointed services exist for that run and lists which ones it is replaying.

flow resume accepts a few optional flags:

initrunner flow resume flow.yaml <flow_run_id> \
  --prompt "..." \
  --entry producer \
  --audit-db ./audit.db

| Flag | Description |
| --- | --- |
| `--prompt`, `-p` | Prompt for the entry agent if it never checkpointed. Rarely needed. |
| `--entry` | Override the entry agent for the resumed run. |
| `--audit-db` | Path to the audit database holding the journal. |

What happens on resume

Sub-agents that produced a successful checkpoint are replayed from the journal. Their recorded output flows downstream with no model call.
The first sub-agent that failed or was paused for approval is re-run, along with everything after it. A checkpoint is replayable only when the recorded run succeeded and was not paused.

A clean, fully successful run prunes its own checkpoints when it finishes, so the journal only retains rows for runs that still need resuming. A run counts as successful only when it did not time out and every step succeeded; a timeout or any failed step keeps the checkpoints.
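The replay and prune rules above can be sketched as two small functions. This is an illustration only: it assumes a simple linear order of services (a real flow is a graph), and `Checkpoint`, `plan_resume`, and `should_prune` are hypothetical names, not InitRunner APIs.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    service_name: str
    success: bool
    paused: bool = False  # True when the run was paused for approval

    @property
    def replayable(self) -> bool:
        # Replayable only when the recorded run succeeded and was not paused.
        return self.success and not self.paused


def plan_resume(order, checkpoints):
    """Split an ordered list of services into (replayed, rerun)."""
    by_name = {c.service_name: c for c in checkpoints}
    for i, name in enumerate(order):
        cp = by_name.get(name)
        if cp is None or not cp.replayable:
            # First service without a replayable checkpoint:
            # re-run it and everything after it.
            return order[:i], order[i:]
    return list(order), []


def should_prune(timed_out, step_results):
    """A run prunes its checkpoints only if it did not time out and every step succeeded."""
    return not timed_out and all(step_results)
```

For example, with the order `producer`, `transform`, `consumer` and a failed `transform` checkpoint, the sketch replays `producer` and re-runs `transform` and `consumer`, even if `consumer` had checkpointed earlier.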

Requirements

Resume needs two things, and the CLI exits with an error if either is missing:

- Durability enabled on the flow. If spec.durability is not active, the CLI tells you to add a durability: {enabled: true} block and exits.
- Audit logging on. The journal lives in the audit store, so resume always enables audit logging and errors if no audit logger is available. flow resume has no --no-audit flag, unlike flow up.

What gets recorded

Each checkpoint is keyed by (flow_run_id, service_name). Writing a new checkpoint for a service overwrites its prior row rather than duplicating it, so the journal holds at most one checkpoint per service per run. Each checkpoint stores:

- The delegation envelope: prompt, trace, original prompt, source service, the one-shot flag, and the topology index.
- The run result: output, token counts, tool-call names, success, status, and any pending approvals.
- The agent message history, serialized with PydanticAI's ModelMessagesTypeAdapter so message parts round-trip cleanly.
- A record_hash and prev_hash linking the row into an HMAC chain. The journal has its own chain, separate from the main audit_log chain, but signed with the same key. That makes the journal tamper-evident in the same way as the rest of the audit trail.
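To make the record_hash and prev_hash linkage concrete, here is a generic HMAC hash chain in Python. It is a sketch under stated assumptions, not InitRunner's implementation: the all-zero genesis value, the JSON canonicalization, the SHA-256 digest, and the function names are all choices made for the example.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder prev_hash for the first row


def sign_row(key: bytes, prev_hash: str, payload: dict) -> str:
    """HMAC over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    msg = prev_hash.encode() + b"\n" + body.encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def append(chain: list, key: bytes, payload: dict) -> None:
    """Append a row whose prev_hash points at the previous row's record_hash."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "record_hash": sign_row(key, prev_hash, payload),
    })


def verify(chain: list, key: bytes) -> bool:
    """Recompute every link; any edited row or wrong key breaks the chain."""
    prev_hash = GENESIS
    for row in chain:
        expected = sign_row(key, prev_hash, row["payload"])
        if row["prev_hash"] != prev_hash:
            return False
        if not hmac.compare_digest(expected, row["record_hash"]):
            return False
        prev_hash = row["record_hash"]
    return True
```

Because each record_hash depends on the previous one, changing any stored payload without the signing key causes `verify` to fail from that row onward, which is the property that makes a chain like this tamper-evident.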

Secrets are scrubbed from both the envelope and the result before they are written. Checkpoint writes never raise: a serialization or write error degrades durability for that run but does not crash the flow, matching the never-crash contract of audit.log().
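The scrub-then-write behavior, combined with one row per (flow_run_id, service_name), might look like the following sketch. The table layout, the `SECRET_KEYS` list, and the function names are hypothetical, and the hash chain is omitted for brevity; the real scrubbing rules and schema may differ.

```python
import json
import sqlite3

SECRET_KEYS = {"api_key", "token", "password"}  # hypothetical scrub list


def scrub(value):
    """Recursively mask values stored under secret-looking keys."""
    if isinstance(value, dict):
        return {
            k: "***" if k.lower() in SECRET_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value


def open_journal():
    """In-memory stand-in for the audit database (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE checkpoints ("
        " flow_run_id TEXT, service_name TEXT, envelope TEXT, result TEXT,"
        " PRIMARY KEY (flow_run_id, service_name))"
    )
    return db


def write_checkpoint(db, flow_run_id, service_name, envelope, result) -> bool:
    """Return True on success and False on any failure; never raise."""
    try:
        db.execute(
            # The composite primary key makes a second write replace the first.
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (
                flow_run_id,
                service_name,
                json.dumps(scrub(envelope)),
                json.dumps(scrub(result)),
            ),
        )
        db.commit()
        return True
    except Exception:
        # Durability degrades for this run, but the flow keeps going.
        return False
```

A result that cannot be serialized makes `write_checkpoint` return `False` instead of propagating the error, which mirrors the never-crash contract described above.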

Determinism and idempotency

Resume assumes your sub-agents are reasonably deterministic and that their tools are idempotent or side-effect aware.

A completed sub-agent is not re-run on resume, so its external side effects are not repeated. Conversely, a re-run sub-agent (the first incomplete one and everything after it) repeats whatever side effects it performed before the interruption. If a tool mutates external state, such as sending a message, writing a file, or calling a paid API, design it to be safe under at-least-once execution.
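One common way to make a side-effecting tool safe under at-least-once execution is an idempotency key. The sketch below is a generic pattern, not an InitRunner feature: the `Outbox` class is hypothetical, and its in-memory dict stands in for a store that would need to survive process restarts in real use.

```python
import hashlib


class Outbox:
    """Hypothetical message-sending tool that tolerates being re-run."""

    def __init__(self):
        # Idempotency key -> message. A real tool would keep this in
        # durable storage shared across the original run and the resume.
        self.sent = {}

    def send(self, flow_run_id: str, service_name: str, message: str) -> bool:
        """Send at most once per (run, service, message); return True if sent now."""
        raw = f"{flow_run_id}:{service_name}:{message}"
        key = hashlib.sha256(raw.encode()).hexdigest()
        if key in self.sent:
            return False  # already delivered before the interruption
        self.sent[key] = message
        return True
```

If a re-run sub-agent calls `send` again with the same flow_run_id, service name, and message, the second call is a no-op, so the resume does not deliver the message twice.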

Daemon flows and resume-after-failure

When you run a trigger-driven flow daemon (initrunner flow up with triggers) and durability is enabled, the daemon journals every triggered run. It builds a checkpoint journal only when durability is active and audit logging is on, so non-durable flows and runs with audit disabled get nothing extra.

The daemon prunes a run's checkpoints only when that run completes cleanly, meaning every sub-agent reported success and the run was not itself a resume. Anything else leaves the journal in place.

The v2026.5.5 fix

Before v2026.5.5, a daemon run could prune its checkpoints even when a sub-agent finished with success=False. A failed sub-agent does not raise, so the graph run returns normally, and the old prune logic treated that as a clean run. The journal was wiped and resume-after-failure was impossible.

The daemon now tracks run-local success through an on_service_complete callback. If any sub-agent returns success=False, the run is marked failed and its checkpoints are left in place. A daemon run that fails or crashes keeps its journal, so:

initrunner flow resume flow.yaml <flow_run_id>

replays the services that already completed and re-runs the first one that did not finish, along with everything after it.
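The run-local tracking described in this section can be sketched as a small object whose callback flips a flag on the first failure. The class name `RunTracker` and the callback signature shown here are assumptions for illustration; only the on_service_complete name comes from the text above.

```python
class RunTracker:
    """Hypothetical run-local success tracker for one triggered run."""

    def __init__(self, is_resume: bool = False):
        self.is_resume = is_resume
        self.failed = False

    def on_service_complete(self, service_name: str, success: bool) -> None:
        # A failed sub-agent does not raise, so failure has to be
        # recorded here rather than inferred from a normal return.
        if not success:
            self.failed = True

    def should_prune(self) -> bool:
        # Prune only after a clean run that was not itself a resume.
        return not self.failed and not self.is_resume
```

With this shape, a graph run that returns normally after a `success=False` sub-agent still reports `should_prune() == False`, so its checkpoints stay available for a later resume.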