Agent evals and regression replay

Replay stored workflow runs against JSON, evidence, proposal, and approval assertions before changing policy or capability behavior.

Overview

Agent evals turn Synapsor's durable run graph into a regression suite. Before changing a policy chunk, capability plan, memory rule, or model lane, run previous decisions through a named eval and inspect what would fail.

The current controlled-beta MVP evaluates stored workflow run output, run-step metadata, and source-table dataset rows with explicit expected columns. Frozen replay is implemented for reproducibility checks. Branch and model replay are represented in the eval run metadata, but deeper model re-execution and semantic scoring remain follow-up work.
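
To make the idea of explicit assertions over stored runs concrete, here is a minimal Python sketch. It does not use Synapsor's API: the StoredRun shape, the field names, and the helper functions are all hypothetical, chosen only to illustrate checking stored output JSON and evidence/proposal links without re-executing a model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class StoredRun:
    """Hypothetical snapshot of one completed workflow run (not Synapsor's schema)."""
    run_id: str
    output: dict[str, Any]
    evidence_step_ids: list[str] = field(default_factory=list)
    proposal_step_ids: list[str] = field(default_factory=list)


# An assertion returns None when it passes, or a failure reason when it fails.
Assertion = Callable[[StoredRun], Optional[str]]


def json_equals(key: str, expected: Any) -> Assertion:
    """Build an assertion that a top-level output key has the expected value."""
    def check(run: StoredRun) -> Optional[str]:
        actual = run.output.get(key)
        if actual != expected:
            return f"output[{key!r}] is {actual!r}, expected {expected!r}"
        return None
    return check


def has_evidence(run: StoredRun) -> Optional[str]:
    return None if run.evidence_step_ids else "no evidence step linked"


def has_proposal(run: StoredRun) -> Optional[str]:
    return None if run.proposal_step_ids else "no proposal step linked"


def evaluate(runs: list[StoredRun], assertions: list[Assertion]) -> dict[str, list[str]]:
    """Return failure reasons keyed by run_id; runs that pass are omitted."""
    failures: dict[str, list[str]] = {}
    for run in runs:
        reasons = [r for a in assertions if (r := a(run)) is not None]
        if reasons:
            failures[run.run_id] = reasons
    return failures
```

Calling evaluate with a list of stored runs and assertions such as json_equals("decision", "approve"), has_evidence, and has_proposal yields assertion-level failure reasons per run, which mirrors the kind of result described in this section.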

Eval suites are tenant/principal scoped. A user can create and run evals, inspect failures, and diff eval runs only from the authenticated session that owns the run set.

Regression loop

Use stored run graphs as the test fixture. The eval reports which decisions lose evidence, proposals, or allowed output values.

Input
  • Completed workflow runs
  • Stored output JSON
  • Evidence/proposal step links
  • Target branch label
Eval result
  • Passed/failed cases
  • Assertion-level failure reasons
  • Diff handle per source run
  • Baseline vs candidate delta
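
The baseline vs candidate delta can be pictured as a comparison of two pass/fail maps keyed by source run. The sketch below is an illustration only: the eval_delta function and the bucket names "regressed", "fixed", and "missing" are assumptions, not part of the Synapsor eval result format.

```python
def eval_delta(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-source-run pass/fail results from two eval runs.

    Both inputs map a source run id to True (passed) or False (failed).
    """
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed in the baseline, fails in the candidate.
        "regressed": sorted(r for r in shared if baseline[r] and not candidate[r]),
        # Failed in the baseline, passes in the candidate.
        "fixed": sorted(r for r in shared if not baseline[r] and candidate[r]),
        # Present in the baseline but not evaluated in the candidate.
        "missing": sorted(baseline.keys() - candidate.keys()),
    }
```

A non-empty "regressed" list is the signal to inspect before shipping a policy or capability change.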

Developer notes

  • Use frozen replay first to prove stored evidence/proposal/run metadata remains reproducible.
  • Use branch replay labels when comparing a candidate policy branch, and treat MVP results as assertion checks over stored runs.
  • Store eval_run_id with deployment checks so you can diff baseline and candidate evals later.
  • Do not treat the current MVP as semantic LLM judging; add explicit JSON, evidence, and proposal assertions instead.
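
One way to reason about the frozen replay note above is as a fingerprint comparison: if the stored evidence, proposal, and run metadata serialize to the same canonical form before and after replay, the run is reproducible. This is a sketch under that assumption; how Synapsor implements frozen replay is not described here, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Any


def fingerprint(record: dict[str, Any]) -> str:
    """Hash a JSON-serializable record in a key-order-independent way."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def frozen_replay_matches(stored: dict[str, Any], replayed: dict[str, Any]) -> bool:
    """True when the replayed record is identical to the stored one."""
    return fingerprint(stored) == fingerprint(replayed)
```

Recording such a fingerprint next to the eval_run_id in a deployment check would give later diffs a stable reference point.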