Design: EvalLoop (M2.5 + M2.6)¶
Status: M2.5 Implemented — v0.3 | M2.6 Planned — v0.4 Author: Jeryn Mathew Varghese Last updated: 2026-04
Motivation¶
LLM agents can go off-rails — they hallucinate, ignore constraints, loop, or produce unsafe outputs. Detecting this from inside the agent's own handle() is possible but conflates application logic with correctness monitoring. You also cannot catch patterns that only emerge across multiple messages.
EvalLoop introduces a dedicated, supervised EvalAgent process that sits alongside regular agents in the supervision tree. Agents emit observable events; the EvalAgent scores them and injects correction signals back. If needed, it halts the offending agent entirely.
There are two tiers:
| Tier | What | When |
|---|---|---|
| M2.5 — Local EvalLoop | In-process EvalAgent with real-time correction signals |
v0.3 |
| M2.6 — Remote Eval Exporters | Plugin adapters for Arize Phoenix, Fiddler, Langfuse, Braintrust, LangSmith | v0.4 |
Both tiers share the same EvalEvent schema. The local agent consumes it in-process; remote exporters translate it to OTEL GenAI spans and forward to external eval engines.
Architecture¶
AgentProcess
│
├── await self.emit_eval("llm_output", {"content": response, ...})
│ │
│ ├──▶ EvalAgent (local, in-process)
│ │ │
│ │ ├── on_eval_event(event) → CorrectionSignal | None
│ │ ├── rate limit check
│ │ └── send civitas.eval.correction / civitas.eval.halt
│ │
│ └──▶ EvalExporter (remote, M2.6)
│ ├── Arize Phoenix (OTEL GenAI spans)
│ ├── Fiddler (production guardrails)
│ ├── Langfuse (open-source)
│ ├── Braintrust (eval science)
│ └── LangSmith (LangChain ecosystem)
│
└── on_correction(message) ◀── civitas.eval.correction (nudge / redirect)
[auto-halt] ◀── civitas.eval.halt
Why a separate process: the evaluator is independently supervised, independently restartable, stateful (rate limit counters, violation history), and swappable without touching agent code. This follows the OTP design principle: concerns that can fail independently should be separate processes.
Scope boundary¶
| Concern | M2.5 (Local) | M2.6 (Remote) |
|---|---|---|
EvalAgent, EvalEvent, CorrectionSignal |
✅ | — |
emit_eval() on AgentProcess |
✅ | — |
on_correction() hook |
✅ | — |
| Rate limiting per target agent | ✅ | — |
Topology YAML (type: eval_agent) |
✅ | — |
EvalExporter protocol |
✅ (interface only) | — |
| Arize Phoenix plugin | — | ✅ |
| Fiddler plugin (two-way guardrails) | — | ✅ |
| Langfuse plugin | — | ✅ |
| Braintrust plugin | — | ✅ |
| LangSmith plugin | — | ✅ |
Core types¶
EvalEvent¶
Emitted by agents via await self.emit_eval(event_type, payload). Schema is aligned with OTEL GenAI Semantic Conventions so remote exporters can forward as standard spans.
@dataclass
class EvalEvent:
agent_name: str # who emitted it
event_type: str # e.g. "llm_output", "tool_call", "decision", "custom"
payload: dict[str, Any] # event data (model output, tool result, etc.)
trace_id: str = ""
message_id: str = ""
timestamp: float = field(default_factory=time.time)
Event type conventions:
| event_type | When to use |
|---|---|
llm_output |
After receiving an LLM response |
tool_call |
Before or after a tool invocation |
decision |
When the agent makes a routing or branching decision |
message_sent |
When the agent sends a message to another agent |
custom |
Any application-specific checkpoint |
CorrectionSignal¶
Returned by EvalAgent.on_eval_event(). Three severity levels:
| Severity | Meaning | Agent behaviour |
|---|---|---|
nudge |
Soft guidance — minor issue detected | Agent receives correction in on_correction(), continues running |
redirect |
Significant concern — approach needs to change | Agent receives correction in on_correction(), should alter course |
halt |
Critical violation | Agent's message loop is stopped via civitas.eval.halt |
@dataclass
class CorrectionSignal:
severity: Literal["nudge", "redirect", "halt"]
reason: str
payload: dict[str, Any] = field(default_factory=dict)
EvalAgent¶
class EvalAgent(AgentProcess):
def __init__(
self,
name: str,
max_corrections_per_window: int = 10,
window_seconds: float = 60.0,
**kwargs,
): ...
async def on_eval_event(self, event: EvalEvent) -> CorrectionSignal | None:
"""Override to implement eval logic. Return None to take no action."""
return None
EvalAgent.handle() receives civitas.eval.event messages, calls on_eval_event(), checks the rate limiter, then sends the correction. For halt, it sends a civitas.eval.halt message which breaks the target agent's message loop cleanly (same path as graceful shutdown — on_stop() still runs).
Rate limiting uses a sliding window keyed by target agent name. Once an agent has received max_corrections_per_window corrections in the last window_seconds, further corrections are dropped (and logged). This prevents correction storms against a broken agent.
AgentProcess integration¶
Two additions to AgentProcess:
async def emit_eval(
self,
event_type: str,
payload: dict[str, Any],
eval_agent: str = "eval_agent",
) -> None:
"""Send an EvalEvent to the named EvalAgent. No-op if bus not wired."""
async def on_correction(self, message: Message) -> None:
"""Called when this agent receives a civitas.eval.correction message.
Override to react to nudge/redirect signals. Default: no-op."""
civitas.eval.halt is handled in _message_loop() — it breaks the loop the same way _agency.shutdown does, ensuring on_stop() always runs.
Topology YAML¶
supervision:
name: root
strategy: ONE_FOR_ONE
children:
- type: civitas.evalloop.EvalAgent
name: eval_agent
- type: myapp.agents.ResearchAgent
name: researcher
type: eval_agent shorthand is also supported in Runtime.from_config().
EvalExporter protocol (M2.6)¶
Defined in M2.5 as an interface; implemented in M2.6. Remote eval engines receive the same EvalEvent objects, translated to their expected format.
class EvalExporter(Protocol):
async def export(self, event: EvalEvent) -> None:
"""Forward an EvalEvent to a remote eval engine."""
...
emit_eval() will forward to all registered exporters in addition to the local EvalAgent. Each exporter is responsible for translating EvalEvent to the target platform's format:
| Platform | Integration model | Notes |
|---|---|---|
| Arize Phoenix | OTEL GenAI spans via OTLP | Strongest OTEL support; instrument once |
| Fiddler | Fiddler SDK + fiddler-client |
Two-way: sends events, receives guardrail decisions |
| Langfuse | Langfuse Python SDK | Open-source; self-hostable |
| Braintrust | Braintrust Python SDK | Strong eval science focus |
| LangSmith | LangSmith SDK | LangChain ecosystem |
Fiddler is the most interesting integration: Fiddler can return a guardrail decision (block/allow) synchronously. The FiddlerExporter would surface this as a CorrectionSignal, making Fiddler a remote eval engine that drives local halt behaviour.
OTEL alignment¶
EvalEvent fields map to OTEL GenAI Semantic Conventions:
| EvalEvent field | OTEL GenAI attribute |
|---|---|
agent_name |
gen_ai.agent.name |
event_type |
gen_ai.operation.name |
payload["model"] |
gen_ai.request.model |
payload["input_tokens"] |
gen_ai.usage.input_tokens |
payload["output_tokens"] |
gen_ai.usage.output_tokens |
payload["content"] |
gen_ai.output.text |
trace_id |
OTEL trace context |
This alignment means a single emit_eval() call produces data consumable by any OTEL-native platform without transformation.
Open questions¶
-
Two-way Fiddler guardrails — should
FiddlerExporter.export()be async and block the agent until Fiddler responds? Fiddler claims sub-100ms latency; this is viable but adds per-eval latency to the hot path. Alternative: fire-and-forget export, Fiddler sends halt back asynchronously. -
EvalExporter registration — should exporters be registered on the
EvalAgentinstance or globally on theRuntime? Per-agent is more flexible but adds configuration surface. -
Sampling — high-throughput agents may emit thousands of eval events per second. Should
emit_eval()support a sampling rate, or should that be the exporter's concern? -
Correction acknowledgement — should agents be required to acknowledge corrections? Currently
on_correction()is a best-effort hook with no reply.
Acceptance criteria (M2.5)¶
-
EvalAgentcan receive eval events from any agent in the supervision tree -
on_eval_event()is the single override point — no other methods required -
nudgeandredirectsignals delivered viaon_correction()hook -
haltstops the target agent cleanly (on_stop()still runs) - Rate limiter drops excess corrections silently (logged at WARNING)
-
emit_eval()is a no-op when no bus is wired (safe to call in tests) -
type: eval_agentsupported in topology YAML - ≥ 12 unit tests; ≥ 1 integration test with a real supervision tree
-
EvalExporterprotocol defined and documented, not yet implemented