Evaluation

Evaluation is how Arivie proves v0.1 is production-ready, not merely demo-ready. RFC-002 §3.9–3.10 (REQ-30–32) and the pivot’s eval section define the suites, thresholds, and operational budgets that gate release.

Sprint 1 dogfood suite

The current dogfood suite is a 12-probe golden-SQL runner exercised against the orders semantic entity. Each probe checks that the agent emits SQL whose result set matches a hand-authored golden query, plus probe-specific validation rules (e.g. “must not claim zero revenue” for zero-row traps).

Composition today:

8 normal queries (revenue, top-N, time windows, GROUP BY)
2 ambiguous-question probes
2 zero-row traps
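The probe check described above can be sketched as follows. This is a minimal illustration, not the actual scorer: the `Probe` shape, `sameResultSet`, `scoreProbe`, and the zero-revenue regex are all hypothetical, and the sketch assumes result sets are compared without regard to row or column order (a top-N probe may need an order-sensitive comparison instead).

```typescript
// Hypothetical shapes; the real Probe type in @arivie/core/eval may differ.
type Row = Record<string, string | number | boolean | null>;

interface Probe {
  id: string;
  question: string;
  goldenRows: Row[]; // rows returned by the hand-authored golden query
  rules?: Array<(answer: string, rows: Row[]) => boolean>; // probe-specific checks
}

// Serialize a row with sorted keys so column order does not matter.
function canonicalRow(row: Row): string {
  const keys = Object.keys(row).sort();
  return JSON.stringify(keys.map((k) => [k, row[k]]));
}

// Compare two result sets as multisets: row order is ignored, duplicates count.
function sameResultSet(actual: Row[], golden: Row[]): boolean {
  if (actual.length !== golden.length) return false;
  const a = actual.map(canonicalRow).sort();
  const g = golden.map(canonicalRow).sort();
  return a.every((row, i) => row === g[i]);
}

// A probe passes only if the rows match AND every probe-specific rule holds.
function scoreProbe(probe: Probe, agentRows: Row[], agentAnswer: string): boolean {
  const rulesPass = (probe.rules ?? []).every((rule) => rule(agentAnswer, agentRows));
  return sameResultSet(agentRows, probe.goldenRows) && rulesPass;
}

// Zero-row trap: the golden query returns nothing, and the answer
// must not claim zero revenue (illustrative regex, an assumption).
const zeroRowTrap: Probe = {
  id: "zero-row-1",
  question: "What was revenue for a region with no orders?",
  goldenRows: [],
  rules: [(answer) => !/\bzero revenue\b|revenue (was|is) \$?0\b/i.test(answer)],
};
```

Keeping the rules as plain predicates means a zero-row trap and a normal query go through the same scoring path; only the rule list differs.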

Run locally with:

pnpm eval

The harness now uses an in-process PGlite database by default, so no Docker is required. Set USE_TESTCONTAINERS=1 to run against the original testcontainers Postgres path.
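The backend switch can be pictured as a small pure function. This is a sketch only: `selectBackend` and the `EvalBackend` type are hypothetical names, and it assumes that only the exact value `1` enables the testcontainers path.

```typescript
// Hypothetical names; the harness's real selection logic is not shown in this document.
type EvalBackend = "pglite" | "testcontainers";

// Takes the environment as a parameter so the choice is easy to unit-test.
// Assumption: only USE_TESTCONTAINERS=1 opts in; anything else falls back to PGlite.
function selectBackend(env: Record<string, string | undefined>): EvalBackend {
  return env["USE_TESTCONTAINERS"] === "1" ? "testcontainers" : "pglite";
}
```

In a Node process this would be called as `selectBackend(process.env)`.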

The runner is built on Mastra runEvals with a composite scorer exposed from @arivie/core/eval. The current pass threshold is 9/12 (75%), which is still below the ≥ 90% dogfood pass rate that REQ-32 budgets for launch.
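The threshold arithmetic is simple enough to state in code. The helpers below are hypothetical (they are not part of Mastra or @arivie/core/eval) and only show how a pass count relates to a pass rate for a 12-probe suite.

```typescript
// Hypothetical result shape for one probe run.
interface ProbeResult {
  id: string;
  passed: boolean;
}

// Fraction of probes that passed, in [0, 1]. An empty suite counts as 0.
function passRate(results: ProbeResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

// Smallest number of passing probes that satisfies a required rate.
function minPassesFor(requiredRate: number, totalProbes: number): number {
  return Math.ceil(requiredRate * totalProbes);
}

// Gate: true when at least `minPassed` probes passed.
function meetsThreshold(results: ProbeResult[], minPassed: number): boolean {
  return results.filter((r) => r.passed).length >= minPassed;
}
```

With 12 probes, `minPassesFor(0.75, 12)` is 9, matching the current threshold, while a 90% requirement would need `minPassesFor(0.9, 12)`, that is 11 passing probes.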

BIRD-Interact

REQ-31 requires the BIRD-Interact harness to be wired in for v0.1, with a publishable score in the launch README. The adapter maps BIRD questions to Arivie’s Probe type; CI runs a subset for speed, and release runs use the full suite (RFC-002 §9 validation sketch).

Quality attribute budgets

REQ-32 lists budgets that must hold at v0.1 launch:

Attribute	Budget
Latency p95 (first token)	< 5 s; cold instance < 1 s; warm < 200 ms
Bundle size	`@arivie/core` < 100 KB gz; `@arivie/react` < 30 KB gz
Test coverage	`@arivie/core`, `@arivie/db-postgres`, `@arivie/agent` ≥ 80%
Dogfood pass rate	≥ 90% across preload, browse, and rag
MCP parity	100% pass per release

Sprint 4 adds a Puppeteer quickstart test (pnpm test:quickstart) as the runtime gate: it checks that a new developer can follow the documented steps and receive a streamed answer within the documented 10-minute user-experience bound (RFC §9.3.8). Sprint 5 ships automated CI gates (check-bundle-size, coverage, eval pass-rate).
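A budget gate over the table above might look like the sketch below. Everything here is illustrative: the `Measurements` shape and `checkBudgets` are hypothetical, 1 KB is assumed to mean 1024 bytes, the dogfood budget is assumed to apply to each mode separately, and the cold/warm instance figures are left out because the table does not say how they are measured.

```typescript
// Hypothetical measurement shape that a CI step could assemble.
type ContextMode = "preload" | "browse" | "rag";

interface Measurements {
  firstTokenP95Ms: number;
  coreBundleGzBytes: number;
  reactBundleGzBytes: number;
  coveragePct: Record<string, number>; // keyed by package name
  dogfoodPassRate: Record<ContextMode, number>; // 0..1 per mode
  mcpParityPassRate: number; // 0..1
}

const KB = 1024; // assumption: budgets use binary kilobytes
const COVERED_PACKAGES = ["@arivie/core", "@arivie/db-postgres", "@arivie/agent"];
const MODES: ContextMode[] = ["preload", "browse", "rag"];

// Returns the names of every violated budget; an empty array means the gate passes.
function checkBudgets(m: Measurements): string[] {
  const failures: string[] = [];
  if (!(m.firstTokenP95Ms < 5000)) failures.push("latency p95 (first token)");
  if (!(m.coreBundleGzBytes < 100 * KB)) failures.push("@arivie/core bundle size");
  if (!(m.reactBundleGzBytes < 30 * KB)) failures.push("@arivie/react bundle size");
  for (const pkg of COVERED_PACKAGES) {
    const pct = m.coveragePct[pkg];
    // A missing coverage figure is treated as a failure rather than skipped.
    if (pct === undefined || pct < 80) failures.push(pkg + " coverage");
  }
  for (const mode of MODES) {
    if (m.dogfoodPassRate[mode] < 0.9) failures.push("dogfood pass rate (" + mode + ")");
  }
  if (m.mcpParityPassRate < 1) failures.push("MCP parity");
  return failures;
}
```

Returning the list of violated budgets, rather than a single boolean, lets a CI job report every failing attribute in one run.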

Mode coverage

The pivot requires mode-coverage probes: every context mode (preload, browse, rag) must be exercised in CI. Auto-detection via arivie lint chooses a default mode from token count; consumers override with semantic.mode when they know their catalog size.
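The interplay between auto-detection, the override, and the coverage requirement can be sketched as below. The token cut-offs and the function names are hypothetical, chosen only for illustration, as is the assumption that smaller catalogs map to preload and larger ones to browse and then rag; the actual thresholds used by arivie lint are not stated here.

```typescript
type ContextMode = "preload" | "browse" | "rag";

// Only the `mode` field comes from the text (semantic.mode); the interface name is hypothetical.
interface SemanticConfig {
  mode?: ContextMode;
}

// Hypothetical cut-offs, NOT the values arivie lint uses.
const PRELOAD_MAX_TOKENS = 8_000;
const BROWSE_MAX_TOKENS = 64_000;

// Pick a default mode from the catalog's token count.
function detectMode(catalogTokens: number): ContextMode {
  if (catalogTokens <= PRELOAD_MAX_TOKENS) return "preload";
  if (catalogTokens <= BROWSE_MAX_TOKENS) return "browse";
  return "rag";
}

// An explicit semantic.mode always wins over auto-detection.
function resolveMode(config: SemanticConfig, catalogTokens: number): ContextMode {
  return config.mode ?? detectMode(catalogTokens);
}

const ALL_MODES: readonly ContextMode[] = ["preload", "browse", "rag"];

// Mode-coverage check: which modes did the CI probes fail to exercise?
function missingModes(exercised: readonly ContextMode[]): ContextMode[] {
  return ALL_MODES.filter((m) => exercised.indexOf(m) === -1);
}
```

A CI job could fail whenever `missingModes` returns a non-empty list, which enforces the coverage requirement regardless of which mode auto-detection would pick for the test catalog.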