Model quality & drift monitoring

How we know the classifier is improving rather than drifting. We run a deterministic 5% sample of recent classifications through a second model nightly, compare the verdicts, and alert when disagreement crosses a fixed threshold.

Last updated: 6 June 2026

Why this page exists

A classifier that quietly drifts is worse than one that fails loudly. CoverProof’s classifier is pinned to a specific methodology version with a SHA-256 prompt fingerprint (see Methodology), but the provider, Anthropic, can still update the underlying model weights upstream. We mitigate this by running a nightly drift check against a separate judge model and surfacing the disagreement rate.

This page exists so a CCO can verify, before adopting CoverProof, that the monitoring is real — and so an auditor reviewing an evidence pack a year later can verify the classifier was being actively monitored at the time the pack was generated.

Current measured snapshot

Exact-match accuracy

53.3%

Against the 30-fixture gold roster for s250-v12.

Raw false-negative rate

50.0%

Exposed-class misses before the operating point and mandatory human-review gate are applied.

False-positive rate

4.5%

Not-exposed rows classified as exposed in the held-out evaluation snapshot.

Expected calibration error

0.245

Brier score 0.262; lower is better for both calibration measures.

Faithfulness

Pending

Pending Sonnet re-measurement before this becomes a public headline metric.

Cross-family agreement

Pending

No measured independent-judge rate is published yet.

Run: DAY90 · Snapshot date: 2026-06-06 · Methodology: s250-v12 · Model: claude-sonnet-4-6

These figures are read from the committed evaluation snapshot, not typed into this page. The current baseline is production-faithful to the shipped Sonnet classifier, but the local CLI evaluation path still has no temperature control, so the figures carry run-to-run noise, which we disclose here. The accuracy figure is exact-match against the evaluation roster, not a measure against solicitor-labelled legal truth. Faithfulness and cross-family agreement are withheld from headline metrics until the Sonnet re-measurement is complete.
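For readers unfamiliar with the two calibration measures, the sketch below shows one common way to compute them for a binary exposed / not-exposed label. It is illustrative only: this page does not state which ECE variant or binning scheme the evaluation uses, so the 10 equal-width bins and the function names are assumptions.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average, over bins, of |observed rate - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1] (an assumption); p == 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(observed - mean_prob)
    return ece
```

Both measures are 0 for a perfectly calibrated, perfectly confident classifier and grow as stated confidence departs from observed outcomes.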

How drift monitoring works

Deterministic sampling. A nightly background job selects approximately 5% of recently-confirmed gap classifications using a SHA-256 hash bucket. Same input set, same selection — auditable and reproducible.
Judge re-classification. Each sample is re-classified by a judge model against the same statute extract and verdict schema. The judge call is logged with its own model id, raw response, and timestamp.
Tiered disagreement detection. A row counts as a disagreement if either of the two independent verdicts (s.250 status, governance coverage) differs, or if the two confidence values differ by more than 30 percentage points.
Alert threshold. A rolling 30-day disagreement rate above 10pp triggers an internal alert. The threshold is fixed and is not configurable per firm.
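The steps above can be sketched in a few lines. This is a minimal illustration, not the production job. It assumes the hash is taken over a classification id, that the bucket is the hash modulo 100, that confidence is on a 0 to 100 scale, and that the threshold means more than 10% of sampled rows in the rolling window. All function and field names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List

SAMPLE_RATE_PERCENT = 5         # sampling rate stated on this page
CONFIDENCE_GAP_POINTS = 30      # confidence rule stated on this page
ALERT_THRESHOLD_PERCENT = 10.0  # alert threshold stated on this page


def in_sample(classification_id: str) -> bool:
    """Deterministic ~5% selection: same id always lands in the same bucket."""
    digest = hashlib.sha256(classification_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # assumed bucketing scheme
    return bucket < SAMPLE_RATE_PERCENT


@dataclass(frozen=True)
class Verdict:
    s250_status: str          # hypothetical field name
    governance_coverage: str  # hypothetical field name
    confidence: float         # assumed scale: 0-100


def disagreement_kinds(primary: Verdict, judge: Verdict) -> List[str]:
    """Return which of the three rules fired; an empty list means agreement."""
    kinds = []
    if primary.s250_status != judge.s250_status:
        kinds.append("s250_status")
    if primary.governance_coverage != judge.governance_coverage:
        kinds.append("governance_coverage")
    if abs(primary.confidence - judge.confidence) > CONFIDENCE_GAP_POINTS:
        kinds.append("confidence_gap")
    return kinds


def should_alert(disagreements: int, sampled: int) -> bool:
    """True when the window's disagreement rate is strictly above the threshold."""
    if sampled == 0:
        return False
    return 100.0 * disagreements / sampled > ALERT_THRESHOLD_PERCENT
```

Because selection depends only on the hashed input, anyone holding the same id set can recompute which rows were sampled on a given night.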

Audit-chain verification

The worker service also runs a daily audit-chain verification for every firm. It re-walks the hash chain using the same verifyAuditChainRows logic embedded in evidence packs, then appends either audit_chain.verification_passed or audit_chain.verification_failed to that firm’s audit trail.

A failed verification sends an internal alert with the first failure kind and failure count. The job records results by appending new audit events; it does not rewrite existing audit rows.
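To make the idea of re-walking a hash chain concrete, here is a small sketch. It is not the verifyAuditChainRows implementation, which this page does not reproduce. The row layout, the genesis value, the canonical JSON encoding, and the failure kind names are all assumptions chosen for illustration.

```python
import hashlib
import json
from typing import List, Optional, Tuple

GENESIS = "0" * 64  # hypothetical prev_hash for the first row in a chain


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash a row over its predecessor's hash plus a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_row(chain: List[dict], payload: dict) -> None:
    """Append-only write: new rows link to the last stored hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload, "hash": row_hash(prev, payload)})


def verify_chain(chain: List[dict]) -> Tuple[bool, Optional[str], int]:
    """Re-walk the chain; return (passed, first_failure_kind, failure_count)."""
    expected_prev = GENESIS
    first_kind: Optional[str] = None
    failures = 0
    for row in chain:
        kind = None
        if row["prev_hash"] != expected_prev:
            kind = "broken_link"      # hypothetical failure kind
        elif row["hash"] != row_hash(row["prev_hash"], row["payload"]):
            kind = "hash_mismatch"    # hypothetical failure kind
        if kind:
            failures += 1
            first_kind = first_kind or kind
        expected_prev = row["hash"]
    return failures == 0, first_kind, failures
```

In this sketch the verification outcome would itself be recorded with append_row, for example with a payload naming audit_chain.verification_passed, so existing rows are never rewritten.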

Judge model disclosure

Primary classifier: Anthropic claude-sonnet-4-6 (temperature 0, methodology version pinned per classification).

Judge model: Anthropic claude-haiku-4-5 (same family).

The judge is in the same model family as the primary classifier, which makes it an effective signal for prompt-level drift — exactly the kind of drift we expect from a methodology-versioned classifier. A genuinely independent cross-family check now runs as an additional layer (see below); we publish the judge model so your team can evaluate the strength of today’s signal directly.

Cross-family validation (independent judge)

A same-family judge can share the primary classifier’s blind spots. To get a genuinely independent signal, we also run an offline check with a different model family — a GPT-family model — over a held-out sample, and compare its verdict against the Claude classifier’s. The judge is held to structured output (its s.250 status and SM&CR coverage verdicts are constrained to the same fixed schema values, so they cannot drift outside the allowed set), and every run is recorded to the methodology audit trail.
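One way to hold a judge to a fixed schema is to reject any response whose verdict falls outside an enumerated set. The sketch below illustrates that idea only. The allowed values, key names, and function name are hypothetical, because this page does not list the actual schema values.

```python
import json
from enum import Enum


class S250Status(str, Enum):  # hypothetical value set
    EXPOSED = "exposed"
    NOT_EXPOSED = "not_exposed"


class Coverage(str, Enum):    # hypothetical value set
    COVERED = "covered"
    NOT_COVERED = "not_covered"


def parse_judge_verdict(raw: str) -> dict:
    """Parse a judge response; raises ValueError if a verdict is outside the schema."""
    data = json.loads(raw)
    return {
        "s250_status": S250Status(data["s250_status"]),
        "coverage": Coverage(data["coverage"]),
    }
```

A response such as {"s250_status": "maybe", ...} fails at parse time instead of entering the comparison with an unrecognised value.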

Status: Sonnet re-measurement pending. The cross-family judge is wired and runs over a held-out sample, but we publish an agreement rate only from a production-faithful Sonnet-parity run — never a placeholder or stale pre-parity number. The methodology is published here so it can be evaluated on its design until the measured rate ships.

What we publish here

The drift system is live today. Aggregate metrics ship as soon as the deployed sample is statistically informative; until then we publish the methodology so it can be evaluated on its design, not on a sample size that would over- or under-state the result.

What ships publicly: alert threshold, sampling rate, judge model, the three disagreement detection rules, and the methodology version this page was generated for.
What ships aggregated (post-launch): rolling 30-day disagreement rate across all firms, breakdown by disagreement kind, alert history. No firm-level data.
What never ships: firm-level drift samples and individual classification disagreements. These are firm-scoped data under Row-Level Security and stay inside the firm’s tenant boundary.

Scope and design choices

Same-family judge, by design for prompt drift. The current nightly judge is in the same model family as the primary classifier — strong at detecting prompt-driven drift, which is the regime methodology versioning is built around. A cross-family validator now runs offline as an additional, independent layer (see Cross-family validation above).
Nightly cadence with version pinning. Drift detection runs nightly, and every classification is pinned to a SHA-256 methodology fingerprint at the moment it was produced. An evidence pack generated today carries the methodology version it was generated against, so an auditor a year later can identify exactly which methodology version and prompt were in force when the pack was generated.
5% sample, 10pp threshold, conservative on purpose. A 5% deterministic sample paired with a 10pp disagreement threshold is calibrated to surface meaningful drift quickly without alerting on noise. Both numbers are published here so your team can evaluate the trade-off directly.
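The version pinning described above rests on a stable fingerprint. As a rough sketch of the idea, assuming the fingerprint covers a methodology version string and the prompt text (the exact inputs are not specified on this page, and the function name is hypothetical):

```python
import hashlib


def methodology_fingerprint(version: str, prompt_text: str) -> str:
    """SHA-256 over version and prompt, separated by a NUL byte to avoid ambiguity."""
    h = hashlib.sha256()
    h.update(version.encode("utf-8"))
    h.update(b"\x00")
    h.update(prompt_text.encode("utf-8"))
    return h.hexdigest()
```

Any change to the prompt, even a single character, yields a different fingerprint, so a stored fingerprint identifies the prompt a classification was produced under.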