Agent evals and regression replay

Replay stored workflow runs against JSON, evidence, proposal, and approval assertions before changing policy or capability behavior.

Overview

Agent evals turn Synapsor's durable run graph into a regression suite. Before changing a policy chunk, capability plan, memory rule, or model lane, run previous decisions through a named eval and inspect what would fail.

The current controlled-beta MVP evaluates stored workflow run output, run-step metadata, and source-table dataset rows that carry explicit expected columns. Frozen replay is implemented for reproducibility checks. Branch and model replay are represented in the eval run metadata; deeper model re-execution and semantic scoring remain follow-up work.

Eval suites are scoped to a tenant and principal. A user can create and run evals, inspect failures, and diff eval runs only from the authenticated session that owns the run set.

Regression loop

Use stored run graphs as the test fixture. The eval reports which decisions would lose evidence, lose a proposal, or produce an output value outside the allowed set.

Input

Completed workflow runs
Stored output JSON
Evidence/proposal step links
Target branch label

Eval result

Passed/failed cases
Assertion-level failure reasons
Diff handle per source run
Baseline vs candidate delta
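
The loop above can be pictured as a plain assertion pass over stored run records. The following Python is a minimal sketch, not Synapsor's implementation: the record layout, the `evaluate_stored_run` helper, and the sample run ids are assumptions made for illustration. It shows how the three assertion kinds (evidence present, proposal present, output value in an allowed set) could produce per-case results with assertion-level failure reasons.

```python
import json
from dataclasses import dataclass, field

# Allowed decision values, mirroring the IN (...) assertion used in this page's SQL.
ALLOWED_DECISIONS = {"approved", "rejected", "held_for_review"}


@dataclass
class CaseResult:
    """Outcome for one stored run (hypothetical shape, for illustration only)."""
    run_id: str
    failures: list = field(default_factory=list)

    @property
    def passed(self):
        return not self.failures


def evaluate_stored_run(run):
    """Check one stored run record; no model call is made (frozen-style check)."""
    result = CaseResult(run_id=run["run_id"])
    if run.get("evidence_bundle_id") is None:
        result.failures.append("evidence_bundle_id IS NOT NULL")
    if run.get("proposal_id") is None:
        result.failures.append("proposal_id IS NOT NULL")
    output = json.loads(run["output"])  # stored output JSON
    if output.get("decision") not in ALLOWED_DECISIONS:
        result.failures.append("decision IN allowed values")
    return result


# Hypothetical stored runs: one healthy, one missing evidence with an unknown decision.
stored_runs = [
    {"run_id": "run_1", "evidence_bundle_id": "evb_1", "proposal_id": "prop_1",
     "output": '{"decision": "approved"}'},
    {"run_id": "run_2", "evidence_bundle_id": None, "proposal_id": "prop_2",
     "output": '{"decision": "escalated"}'},
]

results = [evaluate_stored_run(run) for run in stored_runs]
for case in results:
    print(case.run_id, "PASS" if case.passed else "FAIL", case.failures)
```

Each failure is recorded as the text of the assertion that did not hold, which is the kind of assertion-level reason listed under "Eval result".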

Developer notes

Use frozen replay first to prove stored evidence/proposal/run metadata remains reproducible.
Use branch replay labels when comparing a candidate policy branch, and treat MVP results as assertion checks over stored runs.
Store eval_run_id with deployment checks so you can diff baseline and candidate evals later.
Do not treat the current MVP as semantic LLM judging; add explicit JSON/evidence/proposal assertions.
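
To illustrate why keeping the eval_run_id for both baseline and candidate is useful, here is a minimal Python sketch of a baseline-versus-candidate comparison. It assumes each eval run can be reduced to a map from source run id to pass/fail; the `diff_eval_runs` helper and the sample ids are hypothetical and do not describe how DIFF AGENT EVAL is implemented.

```python
def diff_eval_runs(baseline, candidate):
    """Compare two eval runs given as {source_run_id: passed} maps.

    Only source runs present in both eval runs are compared.
    """
    delta = {"regressed": [], "fixed": [], "unchanged": []}
    for run_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[run_id], candidate[run_id]
        if before and not after:
            delta["regressed"].append(run_id)   # passed on baseline, fails on candidate
        elif after and not before:
            delta["fixed"].append(run_id)       # failed on baseline, passes on candidate
        else:
            delta["unchanged"].append(run_id)
    return delta


# Hypothetical per-run outcomes for a baseline and a candidate eval run.
baseline = {"run_1": True, "run_2": False, "run_3": True}
candidate = {"run_1": True, "run_2": True, "run_3": False}

delta = diff_eval_runs(baseline, candidate)
print(delta)
```

A deployment check built along these lines would typically block on a non-empty "regressed" list and treat "fixed" as informational.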

Eval SQL

SQL

-- Create a suite from stored workflow runs. billing.waiver_regression is your eval name.
CREATE AGENT EVAL billing.waiver_regression
DESCRIPTION 'Regression suite for late-fee waiver decisions'
SOURCE AGENT RUNS
WHERE workflow = 'billing.late_fee_waiver_flow' -- workflow name from CREATE AGENT WORKFLOW
AND status = 'completed'                         -- run status from END AGENT RUN
REPLAY MODE frozen                               -- no model call; inspect stored run state
ASSERT evidence_bundle_id IS NOT NULL            -- output or step evidence must exist
ASSERT proposal_id IS NOT NULL                   -- output or step proposal must exist
ASSERT JSON_VALUE(output, '$.decision') IN ('approved', 'rejected', 'held_for_review');

-- Run against main or a candidate branch label. Synapsor returns eval_run_id.
RUN AGENT EVAL billing.waiver_regression
AGAINST BRANCH policy_v2_candidate
REPLAY MODE branch;

-- Use the returned eval_run_id to inspect failures.
SHOW AGENT EVAL FAILURES
FOR EVAL RUN 'evalrun_...';

-- Compare two eval runs after a policy/model/capability change.
DIFF AGENT EVAL
BASELINE 'evalrun_baseline_...'
CANDIDATE 'evalrun_candidate_...';

Examples

SQL

-- Dataset-table evals are useful before you have a large stored run history.
CREATE TABLE billing_waiver_eval_cases (
  case_id VARCHAR PRIMARY KEY,
  tenant_id VARCHAR,
  decision VARCHAR,
  expected_decision VARCHAR,
  amount_cents INT,
  max_amount_cents INT,
  evidence_bundle_id VARCHAR
);

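-- Name the columns explicitly so each value is easy to match to its column.
INSERT INTO billing_waiver_eval_cases
  (case_id, tenant_id, decision, expected_decision, amount_cents, max_amount_cents, evidence_bundle_id)
VALUES ('case_pass', 'acme', 'approved', 'approved', 2500, 5000, 'evb_pass');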

CREATE AGENT EVAL billing.waiver_dataset
SOURCE TABLE billing_waiver_eval_cases
REPLAY MODE frozen
ASSERT evidence_bundle_id IS NOT NULL
ASSERT JSON_VALUE(output, '$.decision') = ROW expected_decision
ASSERT JSON_VALUE(output, '$.amount_cents') <= ROW max_amount_cents;

RUN AGENT EVAL billing.waiver_dataset AGAINST BRANCH main;
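
The row-level checks in the dataset example can be reasoned about with a small Python sketch. It assumes, for illustration only, that the row's `decision` and `amount_cents` columns stand in for the evaluated output; how Synapsor binds `output` to a dataset row is not specified here. The `check_row` helper and the `case_fail` row are hypothetical.

```python
def check_row(row):
    """Return the labels of the assertions that fail for one dataset row."""
    failures = []
    if row["evidence_bundle_id"] is None:
        failures.append("evidence_bundle_id IS NOT NULL")
    if row["decision"] != row["expected_decision"]:
        failures.append("decision = expected_decision")
    if row["amount_cents"] > row["max_amount_cents"]:
        failures.append("amount_cents <= max_amount_cents")
    return failures


rows = [
    # Same values as the INSERT in the example above.
    {"case_id": "case_pass", "tenant_id": "acme", "decision": "approved",
     "expected_decision": "approved", "amount_cents": 2500,
     "max_amount_cents": 5000, "evidence_bundle_id": "evb_pass"},
    # Hypothetical failing row, not part of the original example.
    {"case_id": "case_fail", "tenant_id": "acme", "decision": "approved",
     "expected_decision": "rejected", "amount_cents": 7500,
     "max_amount_cents": 5000, "evidence_bundle_id": None},
]

report = {row["case_id"]: check_row(row) for row in rows}
for case_id, failures in report.items():
    print(case_id, "PASS" if not failures else "FAIL", failures)
```

Adding at least one row that is expected to fail, as the hypothetical `case_fail` does here, helps confirm that the assertions actually reject bad cases.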