Overview
Agent evals turn Synapsor's durable run graph into a regression suite. Before changing a policy chunk, capability plan, memory rule, or model lane, run previous decisions through a named eval and inspect what would fail.
The current controlled-beta MVP evaluates stored workflow run output, run-step metadata, and source-table dataset rows that carry explicit expected columns. Frozen replay is implemented for reproducibility checks. Branch and model replay are recorded in the eval run metadata, but deeper model re-execution and semantic scoring remain follow-up work.
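To illustrate what an explicit expected column can mean for dataset rows, here is a minimal sketch that compares an actual column against an expected column, row by row. The column names (decision, expected_decision) and the check_rows helper are hypothetical and are not part of Synapsor's API.

```python
from typing import Any


def check_rows(
    rows: list[dict[str, Any]], actual_col: str, expected_col: str
) -> list[dict[str, Any]]:
    """Return one failure record per row whose actual value differs from the expected one.

    Hypothetical helper: column names and the failure shape are assumptions.
    """
    failures = []
    for index, row in enumerate(rows):
        if expected_col not in row:
            # A row without an expected value cannot be asserted on.
            failures.append(
                {"row": index, "reason": f"missing expected column {expected_col!r}"}
            )
        elif row.get(actual_col) != row[expected_col]:
            failures.append(
                {
                    "row": index,
                    "reason": f"{actual_col}={row.get(actual_col)!r}, "
                    f"expected {row[expected_col]!r}",
                }
            )
    return failures


# Illustrative rows only; real dataset rows come from your source table.
rows = [
    {"decision": "approve", "expected_decision": "approve"},
    {"decision": "deny", "expected_decision": "approve"},
]
failures = check_rows(rows, "decision", "expected_decision")
```

Only the second row is reported here, because its decision value does not match its expected column.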
Eval suites are scoped to a tenant and principal. A user can create and run eval suites, inspect failures, and diff eval runs only from the authenticated session that owns the run set.
Regression loop
Use stored run graphs as the test fixture. The eval reports which decisions lose evidence, proposals, or allowed output values.
- Completed workflow runs
- Stored output JSON
- Evidence/proposal step links
- Target branch label
- Passed/failed cases
- Assertion-level failure reasons
- Diff handle per source run
- Baseline vs candidate delta
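The loop above can be sketched as assertion checks over stored runs. This is a simplified illustration, not Synapsor's implementation: the StoredRun shape, its field names, and the evaluate_run and run_eval helpers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass
class StoredRun:
    """Hypothetical stand-in for a completed workflow run in the run graph."""

    run_id: str
    output: dict[str, Any]
    evidence_step_ids: list[str] = field(default_factory=list)
    proposal_step_ids: list[str] = field(default_factory=list)


def evaluate_run(run: StoredRun, output_key: str, allowed_values: Sequence[Any]) -> list[str]:
    """Return assertion-level failure reasons for one stored run (empty means pass)."""
    reasons = []
    if not run.evidence_step_ids:
        reasons.append("no evidence step linked")
    if not run.proposal_step_ids:
        reasons.append("no proposal step linked")
    value = run.output.get(output_key)
    if value not in allowed_values:
        reasons.append(f"output {output_key}={value!r} not in allowed values")
    return reasons


def run_eval(runs: list[StoredRun], output_key: str, allowed_values: Sequence[Any]) -> dict[str, Any]:
    """Group stored runs into passed cases and failed cases with their reasons."""
    report: dict[str, Any] = {"passed": [], "failed": {}}
    for run in runs:
        reasons = evaluate_run(run, output_key, allowed_values)
        if reasons:
            report["failed"][run.run_id] = reasons
        else:
            report["passed"].append(run.run_id)
    return report


# Illustrative fixture: run-2 has no evidence link and a disallowed output value.
runs = [
    StoredRun("run-1", {"action": "refund"}, ["step-a"], ["step-b"]),
    StoredRun("run-2", {"action": "escalate"}, [], ["step-c"]),
]
report = run_eval(runs, "action", ("refund", "deny"))
```

Keeping the failure reasons per source run is what makes an assertion-level report possible; a single pass/fail flag would hide which of the three checks broke.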
Developer notes
- Use frozen replay first to prove stored evidence/proposal/run metadata remains reproducible.
- Use branch replay labels when comparing a candidate policy branch, and treat MVP results as assertion checks over stored runs.
- Store eval_run_id with deployment checks so you can diff baseline and candidate evals later.
- Do not treat the current MVP as semantic LLM judging; add explicit JSON, evidence, and proposal assertions instead.
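As a rough sketch of why keeping eval_run_id alongside deployment checks is useful, the example below diffs per-source-run pass/fail results from a baseline and a candidate eval run. The result shape (a mapping from source run ID to a pass flag) and the diff_eval_runs helper are assumptions for illustration; how results are fetched by eval_run_id is not shown.

```python
def diff_eval_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare per-source-run pass/fail results from two eval runs.

    Hypothetical helper: keys are source run IDs, values are True for a pass.
    """
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed on baseline, fails on candidate.
        "regressed": sorted(r for r in shared if baseline[r] and not candidate[r]),
        # Failed on baseline, passes on candidate.
        "fixed": sorted(r for r in shared if not baseline[r] and candidate[r]),
        # Source runs present in only one of the two eval runs.
        "only_in_baseline": sorted(baseline.keys() - candidate.keys()),
        "only_in_candidate": sorted(candidate.keys() - baseline.keys()),
    }


# Illustrative results; in practice these would be loaded using the stored eval_run_id values.
baseline = {"run-1": True, "run-2": True, "run-3": False}
candidate = {"run-1": True, "run-2": False, "run-3": True, "run-4": True}
delta = diff_eval_runs(baseline, candidate)
```

In this example run-2 is a regression, run-3 is fixed, and run-4 exists only in the candidate eval run, so it has no baseline to compare against.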