Design: Dynamic Agent Spawning (M4.1b)

Status: In design (v0.4)
Author: Jeryn Mathew Varghese
Last updated: 2026-04


Motivation

Civitas topologies today are fully static: all agents and supervisors are declared in YAML (or code) before the runtime starts, and the supervision tree does not change shape at runtime. This is intentional — static trees are easy to reason about, and most production systems should be static.

However, LLM-driven orchestrators often need to create specialist agents on demand and decommission them when work is done. A research orchestrator might spawn one ResearchAgent per query; a paralleliser might spin up N workers for a batch job. Hardcoding the worker count defeats the purpose.

Dynamic agent spawning adds a controlled runtime escape hatch: a designated DynamicSupervisor node can accept new children at runtime, subject to capacity and governance constraints, while keeping the rest of the tree static and predictable.


OTP Analogy

Erlang separates Supervisor (static child spec, all three restart strategies) from DynamicSupervisor (starts empty, children added at runtime, ONE_FOR_ONE only). The separation exists because ONE_FOR_ALL and REST_FOR_ONE strategies depend on a fixed, ordered child list — they lose coherence when children arrive and leave dynamically. ONE_FOR_ONE is the only strategy where each child is fully independent and restart decisions never consult sibling state.

Civitas follows the same separation for the same reasons.

OTP                                    Civitas
Supervisor (static)                    Supervisor (static, all strategies)
DynamicSupervisor                      DynamicSupervisor (starts empty, ONE_FOR_ONE only)
DynamicSupervisor.start_child/2        self.spawn(AgentClass, name, ...)
DynamicSupervisor.terminate_child/2    self.despawn(name)

Design Decisions

D1 — DynamicSupervisor is a first-class node, not an extension of Supervisor

Supervisor keeps its fixed child spec and full strategy support. DynamicSupervisor is a separate class that starts with an empty child list and enforces ONE_FOR_ONE. This keeps both abstractions simple and avoids strategy carve-outs inside Supervisor.

D2 — DynamicSupervisor is declared as a static child in topology YAML

The supervisor itself is a fixed, named node in the tree. What changes at runtime is its children. This means:

  • The static tree structure is always visible in topology show
  • The DynamicSupervisor appears as a named node; dynamic children hang off it
  • Blast radius is contained — dynamic children cannot affect static siblings
  • Shutdown is clean — stopping the DynamicSupervisor stops all its children in one sweep
supervision:
  name: root
  strategy: ONE_FOR_ONE
  children:
    - name: orchestrator
      type: OrchestratorAgent
    - name: workers
      type: dynamic_supervisor
      max_children: 20
root (Supervisor, ONE_FOR_ONE)
├── orchestrator  (OrchestratorAgent, static)
└── workers       (DynamicSupervisor, static node — dynamic children)
    ├── researcher-1  (spawned at runtime)
    └── researcher-2  (spawned at runtime)

D3 — self.spawn() targets the nearest ancestor DynamicSupervisor

An agent does not name the target supervisor. The runtime walks up the ancestry chain from the calling agent and finds the first DynamicSupervisor. If none exists in the chain, self.spawn() raises SpawnError.

Rationale:

  • The topology YAML already makes the relationship explicit — the DynamicSupervisor is declared above the spawning agent, so reading the YAML tells you where children land
  • Naming the target at the call site would duplicate information already in the topology
  • Allowing agents to spawn into remote subtrees would couple unrelated parts of the tree; if cross-tree spawning is needed, the correct pattern is to send a message to the agent that owns that DynamicSupervisor

# Inside OrchestratorAgent.handle()
# Finds the nearest DynamicSupervisor ancestor ("workers") automatically
agent_name = await self.spawn(ResearchAgent, name="researcher-1", config={"topic": "..."})
await self.despawn("researcher-1")
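For illustration, the nearest-ancestor lookup could be implemented roughly as below. This is a hedged sketch, not the runtime's actual code: the parent attribute and the node model are assumptions about internals the document does not specify.

# Hypothetical sketch of nearest-ancestor resolution (not the actual runtime code).
# Assumes each tree node exposes a `parent` reference and a `name`.
def resolve_spawn_target(caller_node):
    """Walk up from the calling agent; return the first DynamicSupervisor ancestor."""
    node = caller_node.parent
    while node is not None:
        if isinstance(node, DynamicSupervisor):
            return node
        node = node.parent
    raise SpawnError(f"no DynamicSupervisor ancestor above {caller_node.name!r}")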

D4 — on_spawn_requested is a governance veto hook

Before a spawn is executed, the runtime calls on_spawn_requested(agent_class, name, config) on the DynamicSupervisor. The default implementation approves all requests. Subclassing allows governance logic: rate limits, allowlists, policy checks.

import logging

logger = logging.getLogger(__name__)

# Agent classes this pool is allowed to spawn (policy-defined).
ALLOWLIST: set[type] = {ResearchAgent}

class GovernedWorkerPool(DynamicSupervisor):
    async def on_spawn_requested(
        self, agent_class: type, name: str, config: dict
    ) -> bool:
        if agent_class not in ALLOWLIST:
            logger.warning("Spawn of %s denied by policy", agent_class.__name__)
            return False
        return True

D5 — Two decommission operations: despawn() and stop()

Decommissioning a dynamic child has two explicit operations with distinct semantics:

despawn(name) — hard stop. Cancels the agent's asyncio task immediately. on_stop() still fires. Callers with a pending ask() to the agent receive SpawnError("agent despawned"). The slot is freed immediately. Use when you need the capacity back now and don't care about in-flight work.

stop(name, drain, timeout) — soft stop. Awaitable — returns when the agent is fully stopped. The agent stops accepting new messages immediately (senders receive SpawnError("agent stopping")), then:

  • drain="current" — finishes the message currently being handled, runs on_stop(), stops
  • drain="all" — drains the full mailbox, runs on_stop(), stops
  • timeout (default 30.0s) — if drain isn't complete within the timeout, falls back to a hard stop, then returns
# Hard stop — immediate, slot freed now
await self.despawn("worker-1")

# Soft stop — finish current message, then stop
await self.stop("worker-1", drain="current")

# Soft stop — drain full mailbox, up to 60s, then hard stop if needed
await self.stop("worker-1", drain="all", timeout=60.0)

# Safe to spawn a replacement immediately after either
await self.spawn(ResearchAgent, name="worker-1", config={...})

D6 — Spawn requests are bus messages from day one

self.spawn() always sends a civitas.dynamic.spawn message to the DynamicSupervisor by name — even in-process. The message carries the dotted class path and serialised config rather than a class reference (classes cannot be serialised across process boundaries).

# Internal message shape — not part of public API
{
    "type": "civitas.dynamic.spawn",
    "class_path": "myapp.agents.ResearchAgent",  # dotted import path
    "name": "researcher-1",
    "config": {...},                              # must be JSON-serialisable
}

This means:

  • v0.4 (in-process): supervisor receives the message, imports the class locally, instantiates it. Call site: await self.spawn(ResearchAgent, name="researcher-1", config={...}) — the runtime resolves ResearchAgent to its dotted path automatically.
  • v0.5 (cross-process): same message, same path, routed over ZMQ/NATS to the worker process running the supervisor. The receiving worker imports and instantiates the class. Requires homogeneous deployments where all workers have the same codebase.

The public API (self.spawn(), self.despawn(), self.stop()) never changes between versions. The transport is the only difference.
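As a hedged illustration of the receiving side, the supervisor can resolve the dotted class_path with importlib. The helper name below is made up for this sketch and is not part of the Civitas API.

import importlib

# Sketch only: turn the serialised class_path from a civitas.dynamic.spawn
# message back into the agent class on the receiving side.
def resolve_class(class_path: str) -> type:
    """Resolve "myapp.agents.ResearchAgent" to the ResearchAgent class object."""
    module_path, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)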

D7 — Two independent capacity limits

DynamicSupervisor supports two optional, independent limits:

  • max_children — concurrent live children. A slot is freed when a child exits or is despawned. Prevents resource exhaustion.
  • max_total_spawns — lifetime spawn budget. Monotonically increasing, never resets. Useful for audit, billing, or security constraints.

self.spawn() raises SpawnError if either limit is reached, with a clear reason in the message. Both limits are in-memory — they reset if the DynamicSupervisor crashes and restarts. For durable budgets, track spawn counts in self.state on the orchestrator and enforce via on_spawn_requested.

- name: workers
  type: dynamic_supervisor
  max_children: 20        # at most 20 alive at once (default: unbounded)
  max_total_spawns: 1000  # at most 1000 spawns ever (default: unbounded)
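As one possible shape for the durable-budget pattern mentioned above, a governed pool could do its own counting in on_spawn_requested. The class below is a hypothetical sketch: where the counter is persisted (orchestrator self.state, a database) is left abstract, and the in-memory attribute shown here would not survive a restart on its own.

class BudgetedWorkerPool(DynamicSupervisor):
    """Sketch: veto spawns once a lifetime budget is spent.

    Persistence of `_approved` is application-specific and not shown here.
    """

    LIFETIME_BUDGET = 1000

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._approved = 0  # would need to be persisted to be truly durable

    async def on_spawn_requested(self, agent_class, name, config) -> bool:
        if self._approved >= self.LIFETIME_BUDGET:
            return False  # veto: budget exhausted
        self._approved += 1
        return True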

Restart Semantics (Q2 — resolved)

Dynamic children use transient restart mode by default. The restart field is configurable per DynamicSupervisor.

Exit type                permanent            transient (default)            never
Crash (abnormal exit)    Restart              Restart                        Remove, notify
Clean exit / despawn()   Restart              Remove                         Remove
Restarts exhausted       Escalate to parent   Remove, notify orchestrator    n/a

Key rule — no escalation on exhaustion. When a dynamic child exhausts its restarts, the DynamicSupervisor removes the child and fires on_child_terminated(name, reason) on the spawning agent. It does not escalate to its parent supervisor. Escalating would bring down the static tree over a transient worker failure, defeating the purpose of containment.

Notification hook on AgentProcess:

async def on_child_terminated(self, name: str, reason: str) -> None:
    """Called when a dynamically spawned child is permanently removed.

    reason is one of: "restarts_exhausted", "despawned", "clean_exit"
    Default implementation logs a warning. Override to re-spawn, alert, etc.
    """

Dynamic child list is in-memory only. If the DynamicSupervisor itself crashes and is restarted by its parent, it starts empty. Orchestrators that need durability must checkpoint spawned agent names via self.state.

- name: workers
  type: dynamic_supervisor
  max_children: 20
  restart: transient      # permanent | transient (default) | never
  max_restarts: 3
  restart_window: 60

Open Questions

The following questions are deferred to design sessions before implementation:

  • ~~Q3 — on_spawn_requested placement~~ → the governance hook lives on DynamicSupervisor; subclass it for policy checks (see D4)
  • ~~Q4 — max_children semantics~~ → both, as independent limits: max_children caps concurrent live children, max_total_spawns caps lifetime spawns (see D7)
  • ~~Q5 — Despawn semantics~~ → two explicit operations: despawn() (hard stop) and stop() (soft stop, awaitable, drain="current"|"all", timeout fallback)
  • ~~Q6 — Cross-process spawning~~ → bus message protocol from day one; in-process only in v0.4; v0.5 routes same message to remote worker (homogeneous deployments only)
  • ~~Q7 — topology show live state~~ → TopologyServer(GenServer) supervised HTTP endpoint (see D8); CLI pings GET /topology for live tree; falls back to static YAML if unreachable

D8 — TopologyServer(GenServer) exposes a JSON HTTP management endpoint

topology show gets live state by pinging a supervised TopologyServer GenServer running inside the runtime. It is declared as a normal child in topology YAML — optional, supervised, lifecycle-bound to the runtime.

supervision:
  name: root
  strategy: ONE_FOR_ONE
  children:
    - name: orchestrator
      type: OrchestratorAgent
    - name: workers
      type: dynamic_supervisor
      max_children: 20
    - name: topology_server
      type: topology_server       # built-in shorthand, like http_gateway
      config:
        host: 127.0.0.1           # default: localhost only
        port: 6789                # default

Endpoints (read-only, JSON):

GET /topology           → full supervision tree (static + live dynamic children)
GET /agents             → flat list of all running agents + status
GET /agents/{name}      → single agent: status, restart count, metrics
GET /health             → liveness check {"status": "ok"}

topology show behaviour:

  1. Reads topology YAML to find topology_server config (host + port)
  2. Attempts GET /topology — if reachable, renders full live tree with dynamic children populated and DynamicSupervisor nodes visually distinguished (dotted border / [dynamic] label)
  3. If not reachable (runtime not running), renders static YAML tree with (runtime not running) annotation on DynamicSupervisor nodes
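A minimal sketch of that fallback, using only the Python standard library (the function name and defaults here are illustrative, not the real CLI internals):

import json
import urllib.error
import urllib.request

def fetch_live_topology(host: str = "127.0.0.1", port: int = 6789):
    """Return the live tree from TopologyServer, or None if the runtime is down."""
    url = f"http://{host}:{port}/topology"
    try:
        with urllib.request.urlopen(url, timeout=2.0) as resp:
            return json.load(resp)  # reachable: render the live tree
    except (urllib.error.URLError, OSError):
        return None  # unreachable: caller falls back to the static YAML tree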

Why a GenServer, not a standalone HTTP server:

  • Supervised — crashes are restarted automatically by the parent supervisor
  • Lifecycle-bound — starts and stops with the runtime, no orphaned processes
  • Direct access to runtime internals — queries the supervisor tree and registry without IPC
  • Consistent — everything in Civitas is a supervised process; the management endpoint is no exception
  • Universal — JSON over HTTP works on all platforms, in containers, with curl, and with the Textual dashboard

Future — Textual dashboard:

civitas dashboard will be rebuilt on Textual (interactive Python TUI framework). It will consume TopologyServer endpoints for live tree rendering and per-agent metrics — the same JSON, the same endpoint, no additional protocol.


API Surface (provisional)

# DynamicSupervisor — topology YAML
# type: dynamic_supervisor
# max_children: 20
# max_total_spawns: 1000
# restart: transient
# max_restarts: 3
# restart_window: 60

# AgentProcess — spawn / decommission
await self.spawn(AgentClass, name="worker-1", config={...})  # -> str (agent name)
await self.despawn("worker-1")                               # hard stop
await self.stop("worker-1", drain="current")                 # soft stop, finish current message
await self.stop("worker-1", drain="all", timeout=60.0)       # soft stop, drain mailbox

# AgentProcess — lifecycle notifications
async def on_child_terminated(self, name: str, reason: str) -> None: ...
# reason: "restarts_exhausted" | "despawned" | "clean_exit"

# DynamicSupervisor — governance hook
async def on_spawn_requested(
    self, agent_class: type, name: str, config: dict[str, Any]
) -> bool: ...

# Runtime — external entry points (for non-agent callers)
await runtime.spawn("workers", ResearchAgent, name="researcher-1", config={...})
await runtime.despawn("workers", "researcher-1")
await runtime.stop("workers", "researcher-1", drain="all", timeout=30.0)

Non-Goals (v0.4)

  • Cross-process spawning (ZMQ / NATS) — deferred to v0.5
  • Spawning into a remote DynamicSupervisor by name from an unrelated subtree
  • Visual topology editor integration (M4.1 is deferred)
  • Per-agent spawn quotas (only global max_children per DynamicSupervisor in v0.4)
  • Textual dashboard — planned as follow-on; TopologyServer endpoints are the foundation