Durability
The problem
An agent run is a long, fragile sequence of expensive, irreversible actions: call the model ($), charge a card ($$), send an email (can't unsend), spawn a sub-agent. Now the worker process dies halfway through: a deploy, an OOM kill, a spot-instance reclaim.
Two bad outcomes are possible:
- Lose the work — restart from scratch, re-pay for everything already done.
- Repeat the work — replay the run and charge the card again.
Durability is the machinery that avoids both: a crashed run resumes where it left off without re-executing the side-effects that already completed.
Two structures do all the work
Ravi achieves this with two append-only structures, both defined as kernel Protocols so they can be backed by memory (dev) or Postgres + Redis (production) without changing agent code.
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#E8EAF6','primaryTextColor': '#1A237E','primaryBorderColor': '#3949AB','lineColor': '#546E7A','fontSize': '13px'}}}%%
graph TB
classDef store fill:#F3E5F5,stroke:#6A1B9A,color:#4A148C,font-weight:bold
classDef desc fill:#FAFAFA,stroke:#BDBDBD,color:#424242
subgraph EL["Event Log — what happened"]
E0["seq 0: run.started"]:::store
E1["seq 1: tool.called"]:::store
E2["seq 2: tool.result"]:::store
E3["seq 3: run.completed"]:::store
E0 --> E1 --> E2 --> E3
end
subgraph JN["Journal — results of side-effects"]
J1["effect_id abc123 → {ok, value}"]:::store
J2["effect_id def456 → {ok, value}"]:::store
end
Event Log: the ordered, append-only history of a run. Every meaningful step appends a RunLogEntry with a monotonic seq and a kind (run.started, tool.called, tool.result, child.spawned, run.suspended, run.completed, …). The truth of a run is fold(entries from seq 0). A checkpoint is just a compaction snapshot, never the source of authority.
Journal: an idempotency cache. Before any external effect runs, the runtime computes a deterministic effect_id and asks the journal whether that effect already completed. If it did, the runtime returns the cached result instead of re-running the effect.
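To make fold(entries from seq 0) concrete, here is a minimal sketch of deriving run state from the log. Only seq and kind come from the text above. The payload field, the RunState class, and every status name other than PENDING are assumptions made for illustration and are not Ravi's actual types.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional


@dataclass(frozen=True)
class RunLogEntry:
    """Hypothetical shape: the source only names `seq` and `kind`."""
    seq: int
    kind: str
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class RunState:
    """Illustrative derived view of a run (an assumption, not Ravi's API)."""
    status: str = "PENDING"
    tool_calls: int = 0
    last_seq: int = -1


def apply(state: RunState, entry: RunLogEntry) -> RunState:
    # Entries must arrive in order with no gaps, or the fold cannot be trusted.
    if entry.seq != state.last_seq + 1:
        raise ValueError(f"expected seq {state.last_seq + 1}, got {entry.seq}")
    status, tool_calls = state.status, state.tool_calls
    if entry.kind == "run.started":
        status = "RUNNING"
    elif entry.kind == "tool.called":
        tool_calls += 1
    elif entry.kind == "run.suspended":
        status = "SUSPENDED"
    elif entry.kind == "run.completed":
        status = "COMPLETED"
    return RunState(status=status, tool_calls=tool_calls, last_seq=entry.seq)


def fold(entries: Iterable[RunLogEntry], initial: Optional[RunState] = None) -> RunState:
    state = RunState() if initial is None else initial
    for entry in entries:
        state = apply(state, entry)
    return state


entries = [
    RunLogEntry(0, "run.started"),
    RunLogEntry(1, "tool.called"),
    RunLogEntry(2, "tool.result"),
    RunLogEntry(3, "run.completed"),
]
state = fold(entries)

# A checkpoint is only a shortcut: folding the tail on top of it
# must give the same answer as folding the whole log from seq 0.
checkpoint = fold(entries[:2])
resumed = fold(entries[2:], initial=checkpoint)
```

Because the checkpoint is itself the result of a fold, discarding it loses nothing. The log alone is enough to rebuild the state.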
The at-most-once protocol
Every journaled step — every ctx.llm(), every ctx.tool() — follows the same three-step dance:
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#E3F2FD','primaryTextColor': '#0D47A1','primaryBorderColor': '#1565C0','lineColor': '#546E7A','fontSize': '13px'}}}%%
flowchart TD
classDef process fill:#E8EAF6,stroke:#3949AB,color:#1A237E
classDef decision fill:#FFF3E0,stroke:#E65100,color:#BF360C,font-weight:bold
classDef cache fill:#E8F5E9,stroke:#2E7D32,color:#1B5E20,font-weight:bold
START([Step begins]) --> ID["Compute effect_id =<br/>hash(run_id + step_seq + kind + args)"]:::process
ID --> LOOK{"Journal.lookup(effect_id)"}:::decision
LOOK -->|"HIT — already done"| RET["Return cached result<br/>(do NOT re-run)"]:::cache
LOOK -->|"MISS"| EXEC["Execute the real effect"]:::process
EXEC --> REC["Journal.record(result)"]:::process
REC --> RET2["Return result"]:::process
The effect_id is deterministic: the same logical step in the same run always hashes to the same id (Effect.make_id sorts the args before hashing, so argument order doesn't matter). This is the linchpin: on replay, step N computes the same id it did the first time, finds the journal hit, and skips the work.
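The lookup, execute, record flow can be sketched in a few lines. This is an illustration under assumptions, not Ravi's implementation: make_effect_id stands in for Effect.make_id, and the choice of SHA-256 over canonical JSON, the InMemoryJournal class, and the journaled_step helper are all hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Optional


def make_effect_id(run_id: str, step_seq: int, kind: str, args: dict[str, Any]) -> str:
    # sort_keys=True canonicalizes argument order, so {"a": 1, "b": 2}
    # and {"b": 2, "a": 1} hash to the same id.
    canonical = json.dumps(
        {"run_id": run_id, "step_seq": step_seq, "kind": kind, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryJournal:
    """Hypothetical journal: maps effect_id to a stored result record."""

    def __init__(self) -> None:
        self._results: dict[str, dict[str, Any]] = {}

    def lookup(self, effect_id: str) -> Optional[dict[str, Any]]:
        return self._results.get(effect_id)

    def record(self, effect_id: str, result: dict[str, Any]) -> None:
        self._results[effect_id] = result


def journaled_step(journal: InMemoryJournal, run_id: str, step_seq: int,
                   kind: str, args: dict[str, Any], effect: Callable[..., Any]) -> Any:
    effect_id = make_effect_id(run_id, step_seq, kind, args)
    cached = journal.lookup(effect_id)
    if cached is not None:
        return cached["value"]           # HIT: do not re-run the effect
    value = effect(**args)               # MISS: execute the real effect
    journal.record(effect_id, {"ok": True, "value": value})
    return value


calls = []

def charge_card(customer: str, amount_cents: int) -> str:
    calls.append((customer, amount_cents))
    return f"charged {customer} {amount_cents}"


journal = InMemoryJournal()
first = journaled_step(journal, "run-1", 0, "tool",
                       {"amount_cents": 500, "customer": "c1"}, charge_card)
# Same logical step, arguments given in a different order: journal hit.
second = journaled_step(journal, "run-1", 0, "tool",
                        {"customer": "c1", "amount_cents": 500}, charge_card)
```

The second call returns the cached value and charge_card runs only once. Changing step_seq, kind, or any argument yields a different id and therefore a fresh execution.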
What replay looks like
When a worker crashes, the run goes back to PENDING. Another worker leases it and replays:
%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#E8EAF6','actorBorder': '#3949AB','actorTextColor': '#1A237E','noteBkgColor': '#FFFDE7','noteBorderColor': '#F57F17','signalColor': '#546E7A','fontSize': '12px'}}}%%
sequenceDiagram
autonumber
participant W as Worker (new)
participant A as Agent.run()
participant J as Journal
participant EXT as External (LLM / tool)
Note over W,EXT: Run crashed after the LLM call + journal record,<br/>but before the tool finished.
W->>A: replay run from inbox
A->>J: lookup(effect_id for LLM call)
J-->>A: HIT → cached LLM response
Note right of A: Model is NOT called again — no second charge
A->>J: lookup(effect_id for tool call)
J-->>A: MISS
A->>EXT: execute tool for real
EXT-->>A: result
A->>J: record(tool result)
A-->>W: run completes
Completed effects are free on replay (journal hits). Only the work that hadn't finished re-runs. The run reaches the same end state as if it had never crashed.
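The scenario in the diagram can be reproduced with a toy context. Everything here is a stand-in invented for illustration: Ctx plays the role of RunContext, a plain dict plays the journal, a tuple key replaces the hashed effect_id, and a raised exception models the worker dying before the tool finishes.

```python
class WorkerCrash(Exception):
    """Models the worker process dying mid-run."""


class Ctx:
    """Hypothetical stand-in for RunContext: journals each step and bumps step_seq."""

    def __init__(self, run_id, journal):
        self.run_id = run_id
        self.journal = journal      # survives the crash, unlike the context
        self.step_seq = 0

    def step(self, kind, effect):
        key = (self.run_id, self.step_seq, kind)   # stands in for effect_id
        self.step_seq += 1
        if key in self.journal:
            return self.journal[key]               # HIT: replayed for free
        result = effect()                          # MISS: run for real
        self.journal[key] = result
        return result


executed = []   # records which effects actually ran


def call_model():
    executed.append("llm")
    return "plan: send report"


def make_tool(crash):
    def send_report():
        if crash:
            raise WorkerCrash("worker died before the tool finished")
        executed.append("tool")
        return "sent"
    return send_report


def agent_run(ctx, crash_in_tool):
    plan = ctx.step("llm", call_model)
    outcome = ctx.step("tool", make_tool(crash_in_tool))
    return plan, outcome


journal = {}

# First attempt: the LLM step is journaled, then the worker dies in the tool.
try:
    agent_run(Ctx("run-1", journal), crash_in_tool=True)
except WorkerCrash:
    pass

# Second attempt: a fresh context replays against the same journal.
result = agent_run(Ctx("run-1", journal), crash_in_tool=False)
```

After the replay, executed holds one "llm" entry and one "tool" entry: the model was not called a second time, and only the unfinished tool step ran.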
The honest caveat
There is one unavoidable window: if the worker dies after executing an effect but before recording it, the journal has no entry. On replay that step is a MISS and runs again.
Ravi chooses at-most-once, not at-least-once: it does not retry on that uncertainty, so you never double-charge, but in that rare window an effect can be silently lost rather than repeated. Tools that are genuinely idempotent (a GET, a Stripe charge with an idempotency key) can be safely retried, and should say so in their description so callers know they get effectively exactly-once behavior.
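One way to make a tool safe to retry is to pass a stable key to the downstream service and let that service deduplicate. The sketch below uses an in-memory fake provider, not a real payments API. Reusing the effect_id as the idempotency key is an assumption for illustration: the source does not say whether Ravi exposes effect_id to tools.

```python
class FakePaymentProvider:
    """In-memory stand-in for a provider that deduplicates on an idempotency key."""

    def __init__(self):
        self._receipts_by_key = {}
        self.charges_made = 0

    def charge(self, customer, amount_cents, idempotency_key):
        # A repeated key returns the original receipt without charging again.
        if idempotency_key in self._receipts_by_key:
            return self._receipts_by_key[idempotency_key]
        self.charges_made += 1
        receipt = {
            "id": f"ch_{self.charges_made}",
            "customer": customer,
            "amount_cents": amount_cents,
        }
        self._receipts_by_key[idempotency_key] = receipt
        return receipt


def charge_tool(provider, effect_id, customer, amount_cents):
    """Charge a customer. Idempotent: safe to retry with the same effect_id."""
    return provider.charge(customer, amount_cents, idempotency_key=effect_id)


provider = FakePaymentProvider()

# The first call executes, but suppose its result never reached the journal.
first = charge_tool(provider, "abc123", "cus_1", 500)
# A later attempt with the same effect_id is absorbed by the provider.
retry = charge_tool(provider, "abc123", "cus_1", 500)
```

The retry returns the original receipt and the provider records a single charge, which is the property a tool description should advertise before a caller treats it as retry-safe.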
Who drives all this: the Worker
You never invoke the protocol yourself. The Worker does, around every run:
- Lease a run from the Scheduler (with a heartbeat to keep long LLM calls from losing the lease).
- Append run.started to the event log, drain the agent's inbox.
- Build a fresh RunContext and call agent.run(ctx, inbox).
- On success → append run.completed, release the lease.
- On crash → append run.failed / run.cancelled, nack the messages so the run is retried.
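The steps above can be sketched as a single asyncio function. This is a rough illustration, not the code in agents/runtime/worker.py: InMemoryScheduler, its method names, the dict-based event log and inbox, and the plain dict used as a context are all hypothetical.

```python
import asyncio


class InMemoryScheduler:
    """Hypothetical scheduler; method names are illustrative, not Ravi's API."""

    def __init__(self, run_ids):
        self.pending = list(run_ids)
        self.released = []
        self.nacked = []
        self.heartbeats = 0

    async def lease(self):
        return self.pending.pop(0) if self.pending else None

    async def heartbeat(self, run_id):
        self.heartbeats += 1

    async def release(self, run_id):
        self.released.append(run_id)

    async def nack(self, run_id, messages):
        # A real scheduler would make the run leasable again; the demo just records it.
        self.nacked.append((run_id, messages))


async def keep_alive(scheduler, run_id, interval):
    # Renews the lease periodically until the task is cancelled.
    while True:
        await asyncio.sleep(interval)
        await scheduler.heartbeat(run_id)


async def work_one(scheduler, event_log, agent_run, inbox):
    run_id = await scheduler.lease()
    if run_id is None:
        return False
    beat = asyncio.create_task(keep_alive(scheduler, run_id, interval=0.01))
    log = event_log.setdefault(run_id, [])
    log.append("run.started")
    messages = inbox.pop(run_id, [])            # drain the inbox
    ctx = {"run_id": run_id}                    # stand-in for a fresh RunContext
    try:
        await agent_run(ctx, messages)
    except Exception:
        log.append("run.failed")
        await scheduler.nack(run_id, messages)  # hand the messages back for a retry
    else:
        log.append("run.completed")
        await scheduler.release(run_id)
    finally:
        beat.cancel()
        await asyncio.gather(beat, return_exceptions=True)
    return True


async def good_agent(ctx, inbox):
    await asyncio.sleep(0.03)                   # pretend to wait on an LLM call
    return "ok"


async def bad_agent(ctx, inbox):
    raise RuntimeError("tool blew up")


async def main():
    scheduler = InMemoryScheduler(["run-1", "run-2"])
    event_log = {}
    inbox = {"run-1": ["hello"], "run-2": ["hi"]}
    await work_one(scheduler, event_log, good_agent, inbox)
    await work_one(scheduler, event_log, bad_agent, inbox)
    return scheduler, event_log


scheduler, event_log = asyncio.run(main())
```

A process that is killed outright cannot append run.failed, so this sketch only covers failures that surface as exceptions. Recovering from a hard kill depends on the lease expiring, which is not modeled here.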
Inside agent.run, every ctx.llm() / ctx.tool() / ctx.spawn() call routes through _journaled(), which runs the at-most-once protocol above and bumps step_seq.
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#E8EAF6','primaryTextColor': '#1A237E','primaryBorderColor': '#3949AB','lineColor': '#546E7A','fontSize': '13px'}}}%%
graph LR
classDef runtime fill:#E3F2FD,stroke:#1565C0,color:#0D47A1,font-weight:bold
classDef agent fill:#E8EAF6,stroke:#3949AB,color:#1A237E,font-weight:bold
classDef store fill:#F3E5F5,stroke:#6A1B9A,color:#4A148C
SCH[Scheduler]:::runtime -->|"lease run"| WK[Worker]:::runtime
WK -->|"agent.run(ctx, inbox)"| CTX["RunContext"]:::agent
CTX -->|"journaled step"| J[("Journal")]:::store
CTX -->|"append entry"| EL[("Event Log")]:::store
WK -->|"release / heartbeat"| SCH
Dev and production are the same code
The whole point of putting these behind kernel Protocols is that the agent never knows which backend is live:
| Component | Dev (Stage 0) | Production (Stage 1) |
|---|---|---|
| Event Log | in-memory | Postgres append-only table, (run_id, seq) PK |
| Journal | in-memory dict | Redis keyed by effect_id |
| Scheduler / Inbox | in-memory asyncio | Postgres queue with leases |
# Dev — everything in-process, no infra needed
async with Runtime() as rt:
...
# Production — durable backends injected by the infrastructure factory
from ravi.infrastructure.runtime import build_postgres_runtime
async with build_postgres_runtime(postgres_url=..., redis_url=...) as rt:
...
Same agent. Same call site. The durability guarantees turn on with the backend swap.
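The backend swap works because callers depend only on a structural interface. Below is a minimal sketch using typing.Protocol. The method names lookup and record follow the flowchart earlier on this page, but the synchronous signatures, the InMemoryJournal class, and the JsonFileJournal class (a stdlib-only stand-in for a durable store such as Redis) are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Any, Callable, Optional, Protocol


class Journal(Protocol):
    """Assumed shape of the journal interface (the real one may be async)."""

    def lookup(self, effect_id: str) -> Optional[dict]: ...
    def record(self, effect_id: str, result: dict) -> None: ...


class InMemoryJournal:
    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def lookup(self, effect_id: str) -> Optional[dict]:
        return self._data.get(effect_id)

    def record(self, effect_id: str, result: dict) -> None:
        self._data[effect_id] = result


class JsonFileJournal:
    """Durable stand-in: state lives in a file, so it outlives the object."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _load(self) -> dict:
        if not os.path.exists(self._path):
            return {}
        with open(self._path, encoding="utf-8") as f:
            return json.load(f)

    def lookup(self, effect_id: str) -> Optional[dict]:
        return self._load().get(effect_id)

    def record(self, effect_id: str, result: dict) -> None:
        data = self._load()
        data[effect_id] = result
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(data, f)


def run_step(journal: Journal, effect_id: str, effect: Callable[[], Any]) -> Any:
    # Identical caller code for every backend.
    cached = journal.lookup(effect_id)
    if cached is not None:
        return cached["value"]
    value = effect()
    journal.record(effect_id, {"ok": True, "value": value})
    return value


calls = []

def effect() -> int:
    calls.append(1)
    return 42


path = os.path.join(tempfile.mkdtemp(), "journal.json")
for journal in (InMemoryJournal(), JsonFileJournal(path)):
    run_step(journal, "abc123", effect)

# A new object on the same file models a restarted worker: the step is a hit.
after_restart = run_step(JsonFileJournal(path), "abc123", effect)
```

The effect runs once per backend in the loop and not at all after the simulated restart, while run_step never learns which backend it was given.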
Where this lives
| Piece | Location |
|---|---|
| Effect, EffectResult, Journal Protocol | kernel/runtime/effects.py |
| RunLogEntry, EventLog Protocol | kernel/runtime/log_entry.py |
| _journaled() wrapper | agents/runtime/context.py |
| Worker run loop | agents/runtime/worker.py |
| In-memory backends | agents/runtime/backends/ |
| Postgres / Redis backends | infrastructure/runtime/ |
Next: Human-in-the-Loop — how a run pauses for a human and resumes later (durability is what makes the wait free).