Gate4 Incident Runbooks
Scope
This runbook covers:
- conversation and OpenClaw incident response,
- risk rejection and kill-switch incident response,
- audit evidence capture and review handoff.
Guardrails
- Clients must use Platform API routes only.
- No runbook step may call provider APIs directly from clients.
- Side-effecting commands remain blocked when risk controls fail.
- Every incident update must include correlation IDs and issue links.
Severity Levels
| Severity | Trigger | SLA Target |
|---|---|---|
SEV-1 | Side effects blocked across tenants or wrong-side execution risk | Immediate containment, 30-minute review handoff |
SEV-2 | Single-tenant contract failures, degraded suggestions, repeated kill-switch false positives | 2-hour mitigation |
SEV-3 | Isolated payload/client defects with known workaround | Next planned release |
Runbook A: Conversation and OpenClaw
Trigger Signals
- Elevated
4xx/5xxfor:POST /v2/conversations/sessionsGET /v2/conversations/sessions/{sessionId}POST /v2/conversations/sessions/{sessionId}/turns
- OpenClaw workflow breakage in contract/e2e checks.
- Missing
contextMemorySnapshotor notification metadata in turn responses.
Triage Checklist
- Capture correlation:
requestId,sessionId,tenant_id,user_id,channel. - Verify API contract path and auth headers with a minimal repro:
curl -sS -X POST "$PLATFORM_API_BASE/v2/conversations/sessions" \
-H "Authorization: Bearer $TOKEN" \
-H "X-API-Key: $API_KEY" \
-H "X-Tenant-Id: $TENANT_ID" \
-H "X-User-Id: $USER_ID" \
-H "Content-Type: application/json" \
-d '{"channel":"openclaw","topic":"incident-triage","metadata":{"notificationsOptIn":true}}'
- Re-run deterministic contract coverage:
uv run --directory backend --with pytest python -m pytest \
backend/tests/contracts/test_platform_api_v2_handlers.py \
backend/tests/contracts/test_conversation_context_memory.py \
backend/tests/contracts/test_conversation_proactive_notifications.py \
backend/tests/contracts/test_openclaw_client_integration.py \
backend/tests/contracts/test_openclaw_e2e_flow.py -q
- Validate boundary invariants:
- OpenClaw lane uses
channel=openclaw. - No provider-direct client calls.
- Errors are canonical contract errors with
requestId.
Decision Tree
flowchart TD
A["Conversation/OpenClaw incident detected"] --> B{"Platform API contract routes healthy?"}
B -->|No| C["Escalate API availability incident and contain client traffic"]
B -->|Yes| D{"Contract payload/headers valid?"}
D -->|No| E["Fix client payload/header contract mismatch and retest"]
D -->|Yes| F{"Runtime memory/notification semantics valid?"}
F -->|No| G["Contain by disabling proactive notifications and patch conversation runtime"]
F -->|Yes| H["Escalate downstream platform dependency issue with request IDs"]
Containment Actions
- Keep traffic on Platform API paths; do not introduce provider-direct fallback.
- For unstable notification behavior, force
notificationsOptIn=falseat client feature flag until recovery. - If OpenClaw skill lane is degraded, fail over to CLI-mediated lane (still Platform API only).
Recovery Checklist
- Contract tests pass for v2 conversation handlers and OpenClaw e2e flow.
- Repro requests return canonical success/error shapes with correlation IDs.
contextMemoryandcontextMemorySnapshotfields are deterministic across retries.- Incident summary is posted with root cause and prevention action.
Runbook B: Risk Runtime and Kill-Switch
Trigger Signals
- Sudden spikes in
422risk limit rejections on:POST /v1/deploymentsPOST /v1/orders
423kill-switch rejections on side-effecting commands.- Drawdown-triggered stop behavior firing unexpectedly or not firing when expected.
Triage Checklist
- Collect evidence:
requestId,deploymentId/orderId, policyversion, outcome code, and breach reason. - Validate risk controls with contract checks:
uv run --directory backend --with pytest python -m pytest \
backend/tests/contracts/test_risk_policy_schema.py \
backend/tests/contracts/test_risk_pretrade_checks.py \
backend/tests/contracts/test_risk_killswitch_drawdown.py \
backend/tests/contracts/test_risk_audit_trail.py \
backend/tests/contracts/test_execution_command_layer.py \
backend/tests/contracts/test_execution_command_idempotency.py -q
- Verify stop and rejection behavior through Platform API only:
GET /v1/deployments/{deploymentId}POST /v1/deployments/{deploymentId}/actions/stopPOST /v1/orders
- Confirm policy compatibility:
- accepted schema version is
risk-policy.v1, - invalid schema/version must fail closed.
Decision Tree
flowchart TD
A["Risk incident detected"] --> B{"Kill-switch currently triggered?"}
B -->|Yes| C{"Drawdown breach evidence valid?"}
C -->|Yes| D["Maintain block and execute controlled recovery plan"]
C -->|No| E["Treat as false positive, roll back invalid policy input, revalidate"]
B -->|No| F{"Pre-trade rejections increasing?"}
F -->|Yes| G["Inspect policy limits and audit outcomes, fix policy/config drift"]
F -->|No| H["Investigate execution/orchestrator path and retry budget handling"]
Containment Actions
- Preserve fail-closed behavior; do not bypass risk gates to force execution.
- Keep side-effecting operations blocked while policy validity is uncertain.
- Limit blast radius by tenant/account where configuration isolation is available.
Recovery Checklist
- Policy validation passes (
risk-policy.v1, limits coherent, enums valid). - Drawdown breach and kill-switch transitions are reproducible and deterministic.
- Risk audit trail contains both blocked and approved decisions with outcome codes.
- Execution command layer remains idempotent after incident patch deployment.
Evidence Capture
Collect and attach all of the following to the incident ticket:
- Reproduction command(s) and responses with redacted secrets.
- Correlation IDs (
requestId) and timeline. - Relevant policy snapshot and schema/version validation result.
- Contract test output used for verification.
- Mitigation, rollback, and follow-up action items.
Escalation Matrix
| Condition | Primary Team | Secondary Team | Escalate To |
|---|---|---|---|
| Conversation/OpenClaw contract drift | Team E | Team G | Parent #80, docs parent #106 |
| Runtime state/retry anomalies | Team B | Team F | Parent #77, reliability #81 |
| Risk rejection/kill-switch uncertainty | Team B | Team F | Parent #77, reliability #81 |
| Audit evidence gap | Team F | Team G | Docs parent #106, reliability #81 |
On-Call Handoff Template
Gate4 Incident Handoff
- Incident ID:
- Severity:
- Affected lane: conversation | openclaw | risk | runtime
- Start time (UTC):
- Current status: triage | contained | recovering | resolved
- Correlation IDs:
- Impact summary:
- Guardrails confirmed:
- Platform API only: yes/no
- No provider-direct client calls: yes/no
- Risk fail-closed preserved: yes/no
- Verification commands executed:
- Remaining risks:
- Owner team:
- Review handoff target issue(s):
- Next decision checkpoint time (UTC):
Traceability
- Conversation/OpenClaw docs:
/docs/portal/platform/gate4-conversation-contract.md/docs/portal/platform/gate4-openclaw-client-lane.md
- Runtime/risk docs:
/docs/portal/operations/gate4-orchestrator-controls.md/docs/portal/operations/gate4-risk-policy-controls.md
- Public API source:
/docs/architecture/specs/platform-api.openapi.yaml - Related issues:
#139,#81,#106