Validation Review Incident Runbook
Objective
Provide deterministic incident handling for the validation review web program across the three blocking failure classes in scope for #282 (Classes A to C), plus the related deep-link failure class tracked in #279 (Class D).
Incident Class A: Render Failures (/render)
Trigger
POST /v2/validation-runs/{runId}/render returns repeated failures or render jobs remain non-completing.
Triage Commands
RUN_ID="<run-id>"
TOKEN="<bearer-token>"
API_BASE="https://api-nexus.lona.agency"
curl -sS "$API_BASE/v2/validation-runs/$RUN_ID" \
-H "Authorization: Bearer $TOKEN" \
-H "X-Request-Id: req-validation-render-status-001"
curl -sS "$API_BASE/v2/validation-runs/$RUN_ID/artifact" \
-H "Authorization: Bearer $TOKEN" \
-H "X-Request-Id: req-validation-render-artifact-001"
Containment And Recovery
- Keep JSON artifact workflow active; do not block reviewer decision on HTML/PDF.
- Re-submit render request with a fresh Idempotency-Key.
- Capture request/response payloads and backend logs in the issue evidence comment.
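The re-submit step can be sketched as below. This is a minimal sketch, not the program's actual tooling: the helper names new_idempotency_key and resubmit_render are hypothetical, and the empty JSON body is a placeholder assumption. Take the real request payload from the contract source listed under Traceability.

```shell
# Hypothetical helpers; names and the request body are assumptions.

# Print a fresh key per retry so the backend does not treat the request
# as a replay of the earlier failed submission.
new_idempotency_key() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr '[:upper:]' '[:lower:]'
  else
    # Fallback: 16 random bytes from /dev/urandom, hex encoded.
    od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
    echo
  fi
}

# Re-submit the render request. Expects RUN_ID, TOKEN and API_BASE to be set
# as in the triage commands. The '{}' body is a placeholder assumption.
resubmit_render() {
  _key="$(new_idempotency_key)"
  # Log the key to stderr so it can be copied into the evidence comment.
  echo "Idempotency-Key: $_key" >&2
  curl -sS -X POST "$API_BASE/v2/validation-runs/$RUN_ID/render" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Idempotency-Key: $_key" \
    -H "Content-Type: application/json" \
    -d '{}'
}
```

Calling resubmit_render after the triage variables are exported sends one retry. Each call generates a new key, so repeated calls are distinct submissions.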
Incident Class B: Auth Failures (401)
Trigger
Web proxy or direct API calls return 401 Unauthorized during run creation/review/render retrieval.
Triage Commands
TOKEN="<bearer-token>"
API_BASE="https://api-nexus.lona.agency"
curl -i "$API_BASE/v2/validation-runs/nonexistent" \
-H "Authorization: Bearer $TOKEN" \
-H "X-Request-Id: req-validation-auth-check-001"
curl -i "$API_BASE/v1/health"
Containment And Recovery
- Re-authenticate the reviewer session and retry via /validation.
- Verify proxy auth resolution behavior in /frontend/src/lib/validation/server/auth.ts.
- Confirm identity scope is auth-derived and no caller-supplied tenant/user override path is being used.
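To separate a token problem from a routing or network problem, the status code of the auth check can be classified as in the sketch below. The function names are hypothetical, and the mapping assumes that a valid token on the nonexistent-run lookup yields 404 rather than 401. Confirm that assumption against the contract before relying on it.

```shell
# Map the HTTP status of the auth check to a triage outcome.
# Assumption: with a valid token, the nonexistent-run lookup returns 404.
classify_auth_status() {
  case "$1" in
    401) echo "auth_failure" ;;      # token missing, expired, or rejected
    404) echo "token_accepted" ;;    # auth passed; the run simply does not exist
    000) echo "network_error" ;;     # curl reports 000 when no HTTP response arrived
    *)   echo "unexpected_$1" ;;
  esac
}

# Fetch only the status code of the auth check.
# Expects TOKEN and API_BASE to be set as in the triage commands.
auth_check_status() {
  curl -sS -o /dev/null -w '%{http_code}' \
    "$API_BASE/v2/validation-runs/nonexistent" \
    -H "Authorization: Bearer $TOKEN" \
    -H "X-Request-Id: req-validation-auth-check-001"
}
```

Usage: classify_auth_status "$(auth_check_status)". An auth_failure result points to re-authentication; any other result suggests checking the proxy or the API itself.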
Incident Class C: Regression Replay Failures (/validation-regressions/replay)
Trigger
Replay response returns gate-blocking decision (mergeGateStatus=blocked or releaseGateStatus=blocked), or policy checks fail.
Triage Commands
pytest backend/tests/contracts/test_validation_replay_policy.py
pytest backend/tests/contracts/test_validation_release_gate_check.py
PYTHONPATH=backend python -m src.platform_api.validation.release_gate_check
Containment And Recovery
- Freeze merge/release progression for the candidate change.
- Compare baseline and candidate evidence references from replay payload.
- Patch the determinism or policy drift, rerun the contract/replay tests, then rerun the replay.
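The trigger condition can be checked mechanically before freezing progression. The sketch below assumes the replay response is JSON in which mergeGateStatus and releaseGateStatus each appear on the same line as their value. The function name replay_is_blocked and the file name in the usage note are hypothetical.

```shell
# Succeeds (exit 0) when either gate field in the replay payload read from
# stdin has the value "blocked"; fails (exit 1) otherwise.
# Assumption: each field name and its value sit on one line of the payload.
replay_is_blocked() {
  grep -Eq '"(mergeGateStatus|releaseGateStatus)"[[:space:]]*:[[:space:]]*"blocked"'
}
```

Usage, with the replay response saved to a hypothetical replay-response.json: if replay_is_blocked < replay-response.json; then echo "freeze merge/release"; fi. A text match like this is only a quick triage aid; the contract tests and the release gate check remain the source of truth.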
Incident Class D: Deep-Link Run Load Failures (#279)
Trigger
Reviewer opens /validation?runId=<runId> but run detail/artifact does not resolve in web UI.
Triage Commands
RUN_ID="<run-id>"
TOKEN="<bearer-token>"
API_BASE="https://api-nexus.lona.agency"
curl -sS "$API_BASE/v2/validation-runs/$RUN_ID" \
-H "Authorization: Bearer $TOKEN" \
-H "X-Request-Id: req-validation-deeplink-run-001"
curl -sS "$API_BASE/v2/validation-runs/$RUN_ID/artifact" \
-H "Authorization: Bearer $TOKEN" \
-H "X-Request-Id: req-validation-deeplink-artifact-001"
Containment And Recovery
- Verify the deep link contains the exact runId returned by CLI/SDK output.
- If API calls pass, reload /validation?runId=<runId> with a fresh authenticated session.
- If API calls fail, treat as contract/auth incident and follow Class B/C flow.
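The first check, that the deep link carries the exact runId, can be sketched as below. The function name deeplink_has_run_id is hypothetical, and the sketch compares the raw query string without URL-decoding it, so an encoded ID would need decoding first.

```shell
# Succeeds when the deep link's query string contains a runId parameter
# exactly equal to the expected ID (no prefix or suffix match).
deeplink_has_run_id() {
  _url="$1"
  _run_id="$2"
  case "$_url" in
    *\?*) _query="${_url#*\?}" ;;   # keep everything after the first '?'
    *)    return 1 ;;               # no query string at all
  esac
  _query="${_query%%\#*}"           # drop any '#fragment'
  # Wrap in '&' so the parameter matches only as a whole key=value pair.
  case "&${_query}&" in
    *"&runId=${_run_id}&"*) return 0 ;;
    *)                      return 1 ;;
  esac
}
```

Usage: deeplink_has_run_id "$LINK" "$RUN_ID" || echo "runId mismatch", where LINK is a hypothetical variable holding the URL the reviewer opened.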
Governance And Review Findings Disposition
- Resolve Cursor and Greptile findings before merge when a fix is feasible in-scope.
- If a finding is intentionally deferred, post explicit disposition in the PR thread with rationale, owner, and follow-up issue.
- Keep review threads resolved before merge (review-governance).
Evidence Capture Template
Use this template in child issue #282 and mirror the summary in parent issue #288.
Validation Review incident update:
- Parent: #288
- Child: #282
- Incident class: render_failure | auth_failure | regression_failure
- Run ID / Replay ID: <id>
- Request IDs: <list>
- Impact: <scope + user effect>
- Containment: <actions completed>
- Recovery: <actions completed>
- CI checks: contracts-governance=<status>, docs-governance=<status>, llm-package-governance=<status>
- Cursor/Greptile findings: resolved | disposition linked
- Evidence links: <logs, artifacts, workflow runs, PR>
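Filling the template by hand invites drift between updates, so it can be generated. The sketch below is one way to do that. The function name incident_update and its argument order are hypothetical, and the fields it does not take as arguments are left as placeholders to complete manually.

```shell
# Print the incident update with the first three variable fields filled in.
# Arguments: incident class, run or replay ID, comma-separated request IDs.
incident_update() {
  cat <<EOF
Validation Review incident update:
- Parent: #288
- Child: #282
- Incident class: $1
- Run ID / Replay ID: $2
- Request IDs: $3
- Impact: <scope + user effect>
- Containment: <actions completed>
- Recovery: <actions completed>
- CI checks: contracts-governance=<status>, docs-governance=<status>, llm-package-governance=<status>
- Cursor/Greptile findings: resolved | disposition linked
- Evidence links: <logs, artifacts, workflow runs, PR>
EOF
}
```

Usage: incident_update render_failure "$RUN_ID" "req-validation-render-status-001" prints the update for pasting into #282. If the GitHub CLI is in use, the output could be piped to gh issue comment 282 --body-file - after the placeholders are completed; whether that fits this repository's workflow is an assumption.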
Traceability
- Child issue: #282
- Related deep-link issue: #279
- Parent issue: #288
- Contract source: /docs/architecture/specs/platform-api.openapi.yaml
- Replay gate check: /backend/src/platform_api/validation/release_gate_check.py
- Contract tests: /backend/tests/contracts/test_validation_replay_policy.py
- Governance workflows: /.github/workflows/contracts-governance.yml, /.github/workflows/docs-governance.yml, /.github/workflows/llm-package-governance.yml