# rca-diagnostician (Unified Skill)

## Core Instructions (SKILL.md)

# RCA Diagnostician

Conduct or evaluate a root cause analysis using cross-disciplinary principles. Move from symptom to systemic cause, select the appropriate method, apply cognitive bias countermeasures, and produce corrective actions ranked by strength.

## Setup

- If an incident description or RCA report is provided, read it completely.
- If no input is provided, ask what incident or failure to investigate.
- Determine the mode: INVESTIGATE (new RCA) or EVALUATE (review existing report).

## Procedure

1. **Scope the problem.** Define WHAT/WHERE/WHEN/SEVERITY and the counterfactual. Use `references/problem-definition.md`.
2. **Collect and map evidence.** Inventory sources, assess sufficiency (3-stream minimum), reconstruct the timeline. Use `references/evidence-timeline.md`.
3. **Generate and test hypotheses.** Produce candidates at mechanism, process, and organizational levels. Select the appropriate RCA method for the domain. Use `references/hypothesis-methods.md`.
4. **Apply bias countermeasures.** Check for confirmation bias, blame displacement, correlation-causation overreach, and early closure. Use `references/bias-countermeasures.md`.
5. **Define corrective actions.** Rank by strength (strong/intermediate/weak). Require at least one strong action. Define verification metrics. Use `references/action-hierarchy.md`.
6. **Evaluate rigor.** Apply the minimum viable rigor checklist. If AI tools were used, apply the AI governance checklist. Use `references/rigor-checklist.md`.
7. **Produce the final report.** Use `references/report-template.md`.

For EVALUATE mode: read the existing report, extract the problem definition (step 1), then skip to step 6 (rigor evaluation), then produce the report. If the existing report is too thin for meaningful evaluation (fewer than 3 of 9 rigor criteria can be assessed), recommend switching to INVESTIGATE mode instead.

## Rules

- Never promote correlation to causation without a causal model or explicit uncertainty.
- Never accept a single-cause narrative without testing alternatives.
- Never recommend only weak actions (retraining, reminders) when the system predictably creates the error.
- Every action must have a verification metric and monitoring period.
- Reference exact evidence from the input. Do not fabricate findings.
- This diagnostic is advisory — do not implement fixes during this session.


---

## Reference: action-hierarchy.md

## Corrective Action Hierarchy and Verification

### Action Strength Classification

Rank every proposed corrective action by its ability to change the system:

**Strong (system redesign):**
- Architectural changes that eliminate the failure mode
- Forcing functions and interlocks (make the error physically/logically impossible)
- Automation of detection or prevention (the system catches or prevents without human action)
- Interface redesign that removes the ambiguity or error pathway
- Examples: circuit breakers, type-safe interfaces, automated rollback triggers, equipment redesign, workflow interlocks

**Intermediate (enhanced controls):**
- Improved monitoring and alerting (reduces detection time but doesn't prevent)
- Checklists and standardized procedures (reduces variation but depends on compliance)
- Staffing changes (reduces workload-driven errors but doesn't eliminate the mechanism)
- Process redesign (changes the workflow but doesn't add forcing functions)
- Examples: new dashboard alerts, pre-deployment checklists, on-call rotation changes, peer review gates

**Weak (awareness-only):**
- Retraining or education
- Policy memos and reminders
- "Be more careful" directives
- Documentation updates without verification that they are read or followed
- Examples: email reminders, updated wiki pages, all-hands announcements, annual training modules

### Minimum Strong Action Requirement

**Every RCA must include at least one strong action.** If only intermediate or weak actions are proposed:

1. Flag this explicitly: "Action portfolio contains no strong actions"
2. Describe what system change would be needed — even if it requires resources, authority, or timeline the team doesn't currently have
3. Recommend escalation to the authority level that can approve the strong action
4. Document the risk accepted by proceeding with only intermediate/weak actions
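As a concrete illustration of this rule, here is a minimal sketch that checks an action portfolio for the minimum strong action. It assumes actions are plain dicts with a `strength` field; the function name and schema are illustrative, not part of the skill.

```python
def check_action_portfolio(actions: list[dict]) -> dict:
    """Summarize the strength mix and apply the minimum-strong-action rule.

    Each action is a dict with at least a 'strength' field whose value is
    'strong', 'intermediate', or 'weak' (field and function names are
    illustrative assumptions, not part of the skill's schema).
    """
    mix = {"strong": 0, "intermediate": 0, "weak": 0}
    for action in actions:
        strength = action["strength"].lower()
        if strength not in mix:
            raise ValueError(f"unknown action strength: {strength!r}")
        mix[strength] += 1

    return {
        "mix": mix,
        # NOT MET means the portfolio must be flagged and escalated per steps 1-4 above.
        "minimum_strong_action": "MET" if mix["strong"] >= 1 else "NOT MET",
    }
```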

### Verification Plan

For each action, define:

| Element | Required |
|---------|----------|
| **Verification metric** | What measurable outcome confirms the action reduced risk? |
| **Monitoring period** | How long must the metric be tracked? (Minimum: 2x the mean time between previous occurrences. For novel incidents without recurrence history, use 90 days or one full operational cycle, whichever is longer.) |
| **Owner** | Named person accountable for implementation and verification |
| **Deadline** | Implementation completion date |
| **Escalation path** | What happens if the metric does not improve within the monitoring period? |
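The monitoring-period rule above is simple arithmetic; a minimal sketch follows, assuming occurrence gaps and the operational cycle are known in days. The function and parameter names are illustrative only.

```python
from datetime import timedelta

def min_monitoring_period(occurrence_gaps_days: list[float],
                          operational_cycle_days: float) -> timedelta:
    """Minimum monitoring period per the verification-plan rule above.

    occurrence_gaps_days: days between previous occurrences of this failure
    mode (empty for a novel incident with no recurrence history).
    operational_cycle_days: length of one full operational cycle.
    """
    if occurrence_gaps_days:
        # Recurring incident: track for at least twice the mean time between occurrences.
        mean_gap = sum(occurrence_gaps_days) / len(occurrence_gaps_days)
        return timedelta(days=2 * mean_gap)
    # Novel incident: 90 days or one full operational cycle, whichever is longer.
    return timedelta(days=max(90.0, operational_cycle_days))
```

For example, with previous gaps of 14, 30, and 21 days, the minimum monitoring period is roughly 43 days (twice the mean gap); a novel incident with a 120-day operational cycle gets 120 days.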

### Learning Loop

Identify how findings feed back into the organization:

- Standards or specifications to update
- Training content to modify
- Design review criteria to add
- Monitoring and alerting to change
- Audit or compliance checks to introduce

### Output

```text
## Corrective Actions

| # | Root cause addressed | Action | Strength | Owner | Deadline | Verification metric | Monitoring period |
|---|---------------------|--------|----------|-------|----------|--------------------|--------------------|
| 1 | [cause] | [action] | [strong/intermediate/weak] | [who] | [when] | [metric] | [duration] |

Action strength mix: [N strong, N intermediate, N weak]
Minimum strong action requirement: [MET | NOT MET — escalation needed]

## Learning Loop

- Standards to update: [list]
- Training to modify: [list]
- Monitoring to add/change: [list]
- Design reviews to inform: [list]
```


---

## Reference: bias-countermeasures.md

## Cognitive Bias Countermeasures

RCA fails more often from cognitive and organizational barriers than from lack of method. Apply these checks explicitly during hypothesis generation and evaluation.

### Check 1: Confirmation Bias

Question: Did you actively seek disconfirming evidence for your leading hypothesis?

Red flags:
- All cited evidence supports one narrative
- Alternative hypotheses were listed but not seriously tested
- Disconfirming data was explained away rather than weighted
- The investigation stopped at the first plausible story

Test: Can you name specific evidence that would weaken your leading hypothesis? If you cannot, the investigation has early-closure risk.

### Check 2: Blame Displacement

Question: Are you attributing to individuals what the system predictably creates?

Red flags:
- Root cause is stated as "operator error," "didn't follow procedure," or "lack of training"
- No analysis of why the system allowed or encouraged the error
- Corrective actions are person-only (retraining, disciplinary action, reminders)
- The same error has occurred before with different individuals

Test: If you replaced this person with a competent peer, would the system still create conditions for the same failure? If YES, the system is the root cause, not the individual.

### Check 3: Correlation-Causation Overreach

Question: Are you promoting a statistical association to a causal claim without a mechanistic explanation?

Red flags:
- "X happened before Y, therefore X caused Y" (temporal precedence alone)
- Pattern found in data without explanation of how X produces Y
- Confounding variables not considered (something else changed simultaneously)
- AI/ML tool surfaced a correlation and it was adopted as a root cause

Test: Can you explain the mechanism by which X causes Y? Can you identify confounders that might explain the association? If not, label this as "candidate association, not confirmed cause."

### Check 4: Early Closure

Question: Did the investigation stop at an organizationally convenient explanation?

Red flags:
- Only one root cause identified for a complex failure
- Investigation ended after a single pass of "5 Whys" without cross-checking
- The root cause conveniently avoids implicating leadership decisions, resource allocation, or organizational culture
- Timeline reconstruction was skipped or abbreviated

Test: Ask "who benefits from this being the root cause?" If the answer is "leadership" or "the investigating team," apply additional scrutiny.

### Output

```text
## Bias Check

- Confirmation bias: [CLEAR | FLAG — describe concern]
- Blame displacement: [CLEAR | FLAG — describe concern]
- Correlation-causation: [CLEAR | FLAG — describe concern]
- Early closure: [CLEAR | FLAG — describe concern]

Countermeasure actions taken: [what you did to mitigate flagged biases]
```


---

## Reference: evidence-timeline.md

## Evidence Collection and Timeline Reconstruction

Map what happened before hypothesizing why.

### Evidence Inventory

Classify each source into one of four types:

| Type | Examples | Reliability notes |
|------|----------|-------------------|
| **Records** | Logs, metrics, charts, audit trails | Machine-generated; check for gaps and clock skew |
| **Direct observation** | Inspections, screenshots, reproductions | Strongest when captured during/near the event |
| **Testimony** | Interviews, incident comms, retrospective accounts | Subject to hindsight bias and memory distortion |
| **Artifacts** | Config changes, code diffs, design docs, process maps | Check timestamps; distinguish planned vs. actual |

### Three-Stream Sufficiency Test

Do you have at least three independent evidence streams (e.g., logs + interviews + config history)? If not:

- Flag the gap explicitly
- Recommend what to gather before proceeding
- Note which hypotheses cannot be tested with current evidence
- Mark all subsequent outputs as **PROVISIONAL** until evidence gaps are filled — this must carry through to the rigor checklist and final report
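A minimal sketch of the sufficiency test, treating distinct source types as a proxy for independent streams. The field and function names are illustrative assumptions, not part of the skill's schema.

```python
STREAM_TYPES = {"record", "observation", "testimony", "artifact"}

def evidence_sufficiency(sources: list[dict]) -> dict:
    """Apply the three-stream test to an evidence inventory.

    Each source is a dict with a 'type' field drawn from STREAM_TYPES;
    distinct types stand in for independent evidence streams here.
    """
    streams = {s["type"] for s in sources if s["type"] in STREAM_TYPES}
    sufficient = len(streams) >= 3
    return {
        "streams_present": sorted(streams),
        "sufficiency": "MET" if sufficient else "GAP",
        # Any GAP means all downstream outputs must be marked PROVISIONAL.
        "provisional": not sufficient,
    }
```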

### Timeline Construction

Build chronologically from last known normal state through detection and resolution:

- **Decision points**: who decided what, with what information available at the time
- **Environmental context**: workload, staffing, concurrent changes, external events
- **Hindsight compression guard**: record what was known vs. not known at each point — do not project post-event knowledge backward

### Output

```text
## Evidence Inventory

| Source | Type | Reliability | Key facts |
|--------|------|-------------|-----------|
| [source] | [record/observation/testimony/artifact] | [high/medium/low] | [what it tells us] |

Evidence sufficiency: [MET — 3+ streams | GAP — need X]

## Timeline

| Time | Event | Source | Notes |
|------|-------|--------|-------|
| [when] | [what happened] | [evidence source] | [context] |
```


---

## Reference: hypothesis-methods.md

## Hypothesis Generation and Method Selection

### Multi-Level Hypothesis Generation

Produce at least three candidate root causes across different analytical levels:

- **Mechanism level**: What physical, logical, or behavioral process failed? (e.g., memory leak, O-ring degradation, medication dosage calculation error)
- **Process control level**: What check, barrier, or monitoring should have caught it? (e.g., missing alert threshold, no pre-deployment validation, absent second-check protocol)
- **Organizational level**: What policy, incentive, resource, or cultural factor enabled the failure? (e.g., staffing pressure, incentive misalignment, deferred maintenance, blame culture suppressing reports)

### Method Selection Guide

Select based on domain, evidence, and the question being asked:

| Method | Best when | Domain fit |
|--------|-----------|------------|
| **5 Whys** | Fast initial exploration; small team | Any — but stop only when you reach a system-modifiable cause |
| **Fishbone/Ishikawa** | Brainstorming across cause categories | Any — team-friendly for cross-functional groups |
| **FTA (Fault Tree)** | Combinations of failures matter; system architecture available | Engineering, manufacturing, safety |
| **FMEA/FMECA** | Preventive analysis of components/processes; need risk ranking | Engineering, manufacturing, design |
| **STAMP/STPA** | Complex sociotechnical systems with control interactions | Aviation, healthcare, autonomous systems |
| **Causal inference (DAG/SCM)** | Need to identify intervention effects formally; confounding is a concern | Social science, policy, epidemiology |
| **Qualitative inquiry** | Practices, incentives, or culture are the suspected drivers | Organizational, healthcare, education |
| **Bayesian networks** | Multiple uncertain evidence streams; need probabilistic diagnosis | Engineering, medical diagnostics, security |
| **Postmortem (structured)** | Software/infrastructure incidents; need detection-response-recovery analysis | Software, IT operations, security |

### Hypothesis Testing

For each candidate cause, answer:

1. **What evidence supports it?** (cite specific sources from evidence inventory)
2. **What evidence contradicts it?** (actively seek disconfirming data)
3. **What would falsify it?** (define the test that would eliminate this hypothesis)
4. **Is the mechanism plausible?** (can you explain how X causes Y, not just that X correlates with Y?)
5. **Causal role**: Is it necessary? Sufficient? Or a contributing factor?
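One way to keep these five questions attached to each candidate is a small record per hypothesis. The sketch below is illustrative only: the class, field names, and the crude status roll-up are assumptions, not the skill's required schema.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One row of the hypothesis table (all names are illustrative)."""
    level: str                         # 'mechanism' | 'process' | 'organizational'
    candidate_cause: str
    supporting_evidence: list[str] = field(default_factory=list)
    contradicting_evidence: list[str] = field(default_factory=list)
    falsification_test: str = ""
    mechanism_explained: bool = False  # can we say how X causes Y, not just that X precedes Y?

    def status(self) -> str:
        """Crude roll-up of the five testing questions into a table status."""
        if self.contradicting_evidence and not self.supporting_evidence:
            return "falsified"
        if self.contradicting_evidence or not self.mechanism_explained:
            return "weakened"
        return "supported" if self.supporting_evidence else "untested"
```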

### Differentiating Causal Levels

- **Contributing factor**: Exacerbated the outcome or increased its likelihood, but did not directly initiate it
- **Root cause**: The specific mechanism or broken process that, if removed, would have prevented the outcome
- **Latent/generic cause**: The overarching systemic flaw that allowed the root cause to exist (e.g., flawed policy, missing training program, cultural norm). Fixing these yields the highest ROI.

### Output

```text
## Hypotheses

| # | Level | Candidate cause | Supporting evidence | Contradicting evidence | Falsification test | Status |
|---|-------|-----------------|--------------------|-----------------------|-------------------|--------|
| 1 | [mechanism/process/org] | [hypothesis] | [evidence] | [evidence] | [what would disprove] | [supported/weakened/falsified] |

Method selected: [method] — Rationale: [why this method fits the domain and evidence]
```


---

## Reference: problem-definition.md

## Problem Definition

Define the outcome precisely before investigating causes.

### Required Elements

1. **WHAT** — Specific observable outcome (not interpretation)
2. **WHERE** — System, component, location, scope
3. **WHEN** — First detection, duration, resolution
4. **SEVERITY** — Impact on users, safety, business, compliance
5. **OUT OF SCOPE** — What this investigation does not cover

### Counterfactual

State the expected/normal behavior and what changed relative to that baseline. This anchors the investigation — without a counterfactual, you cannot distinguish cause from background condition.

### Red Flags

- Problem statement contains solutions ("we need to add...")
- Describes a symptom without measurable specificity ("the system is slow")
- No severity assessment — all problems feel urgent without scoping
- Scope is unbounded — investigation will grow without limit

### Output

```text
## Problem Definition

Outcome: [precise statement]
Measurement: [how detected/measured]
Severity: [impact assessment]
Counterfactual: [expected vs. actual]
Scope boundary: [in scope / out of scope]
```


---

## Reference: report-template.md

```text
# RCA Diagnostician Report

Date: [YYYY-MM-DD]
Mode: [INVESTIGATE | EVALUATE]
Domain: [discipline/context]

## Problem Definition

Outcome: [precise statement of what failed]
Measurement: [how detected/measured]
Severity: [impact assessment]
Counterfactual: [expected vs. actual behavior]
Scope boundary: [in scope / out of scope]

## Evidence Summary

Sources: [count] across [count] independent streams
Sufficiency: [MET | GAP — describe]

## Timeline (key events)

| Time | Event | Source |
|------|-------|--------|
| [when] | [what happened] | [evidence source] |

## Root Causes Identified

| # | Level | Root cause | Confidence | Key evidence |
|---|-------|-----------|------------|--------------|
| 1 | [mechanism/process/org] | [cause] | [high/medium/low] | [supporting evidence] |

## Bias Check Summary

- Confirmation bias: [CLEAR | FLAG]
- Blame displacement: [CLEAR | FLAG]
- Correlation-causation: [CLEAR | FLAG]
- Early closure: [CLEAR | FLAG]

Countermeasure actions taken: [what was done to mitigate flagged biases]

## Corrective Actions

| # | Root cause | Action | Strength | Owner | Deadline | Verification metric | Monitoring period |
|---|-----------|--------|----------|-------|----------|--------------------|--------------------|
| 1 | [cause] | [action] | [strong/intermediate/weak] | [who] | [when] | [metric] | [duration] |

Action strength mix: [N strong, N intermediate, N weak]
Minimum strong action: [MET | NOT MET]

## Rigor Assessment

Overall: [STRONG | ADEQUATE | WEAK | INSUFFICIENT]
Key gaps: [list any PARTIAL or NOT MET criteria]

## Recommendations

1. [Most critical action — with owner and deadline]
2. [Next priority]
3. [Follow-up or monitoring action]

## Open Questions

- [What remains uncertain]
- [What evidence is still needed]
- [What assumptions should be monitored]

## Learning Loop

- Standards to update: [list]
- Training to modify: [list]
- Design reviews to inform: [list]
- Monitoring to add/change: [list]
- Audit/compliance to introduce: [list]
```


---

## Reference: rigor-checklist.md

## Rigor Evaluation Checklist

Apply to your own RCA (self-check) or to an existing report (EVALUATE mode).

### Core Criteria (always apply)

| # | Criterion | What to check | MET when |
|---|-----------|---------------|----------|
| 1 | **Problem definition** | Is the outcome measurable, time-bounded, and severity-scoped? | WHAT/WHERE/WHEN/SEVERITY all specified; no solution language embedded |
| 2 | **Counterfactual** | Is the expected/normal behavior stated? | Explicit baseline; change from normal identified |
| 3 | **Evidence sufficiency** | Are there at least three independent evidence streams? | 3+ distinct source types (records, observation, testimony, artifacts) |
| 4 | **Hypothesis discipline** | Were alternative hypotheses documented with falsification criteria? | 3+ candidates at different levels; disconfirming evidence sought for each |
| 5 | **Mechanism plausibility** | Is each claimed cause explained by mechanism, not just correlation? | "How X causes Y" stated; not just "X preceded Y" |
| 6 | **Action quality** | Do actions materially change system constraints? | At least 1 strong action; weak-only portfolios flagged |
| 7 | **Ownership** | Is there a named owner, deadline, and authority for each action? | Every action row complete |
| 8 | **Effectiveness verification** | Are verification metrics and monitoring periods defined? | Metric + period + escalation path for each action |
| 9 | **Learning loop** | Does the RCA feed back into standards, training, and monitoring? | At least one organizational update identified |

### AI Governance Criteria (apply when AI tools were used in the investigation)

| # | Criterion | What to check | MET when |
|---|-----------|---------------|----------|
| 10 | **Provenance** | Does every AI-produced claim link to evidence artifacts? | Each AI output traceable to source data |
| 11 | **Explainability** | Are AI outputs interpretable and connected to interventions? | Explanations are meaningful to domain practitioners |
| 12 | **Causal guardrails** | Were associations distinguished from causal claims? | Uncertainty stated; no bare "AI found the root cause" |
| 13 | **Human decision rights** | Did accountable humans review and approve findings? | Named reviewer signed off on AI-informed conclusions |

### Scoring

- **MET**: Criterion fully satisfied with evidence
- **PARTIAL**: Criterion addressed but with gaps or weak evidence
- **NOT MET**: Criterion absent or inadequate

Overall rigor (based on 9 core criteria):
- **STRONG**: All 9 core criteria MET
- **ADEQUATE**: No more than 2 PARTIAL, zero NOT MET
- **WEAK**: 1-2 NOT MET or 3+ PARTIAL
- **INSUFFICIENT**: 3+ NOT MET

AI governance criteria do not affect the core rigor score but are reported separately. If any AI governance criterion is NOT MET, append "(AI governance gaps)" to the overall rigor rating.
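The scoring rules above reduce to a small roll-up function; this is a minimal sketch, with illustrative names, assuming statuses are passed as plain strings.

```python
def overall_rigor(core_statuses: list[str], ai_statuses: list[str] | None = None) -> str:
    """Roll the 9 core criteria into an overall rating per the scoring rules above.

    core_statuses: one of 'MET', 'PARTIAL', 'NOT MET' per core criterion.
    ai_statuses: optional AI-governance statuses; they never change the core
    rating, only append a suffix when any is NOT MET.
    """
    not_met = core_statuses.count("NOT MET")
    partial = core_statuses.count("PARTIAL")

    if not_met >= 3:
        rating = "INSUFFICIENT"
    elif not_met >= 1 or partial >= 3:
        rating = "WEAK"
    elif partial >= 1:
        rating = "ADEQUATE"
    else:
        rating = "STRONG"

    if ai_statuses and "NOT MET" in ai_statuses:
        rating += " (AI governance gaps)"
    return rating
```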

### Output

```text
## Rigor Evaluation

| # | Criterion | Status | Evidence/Gap |
|---|-----------|--------|-------------|
| 1 | Problem definition | [MET/PARTIAL/NOT MET] | [detail] |
| 2 | Counterfactual | [MET/PARTIAL/NOT MET] | [detail] |
| 3 | Evidence sufficiency | [MET/PARTIAL/NOT MET] | [detail] |
| 4 | Hypothesis discipline | [MET/PARTIAL/NOT MET] | [detail] |
| 5 | Mechanism plausibility | [MET/PARTIAL/NOT MET] | [detail] |
| 6 | Action quality | [MET/PARTIAL/NOT MET] | [detail] |
| 7 | Ownership | [MET/PARTIAL/NOT MET] | [detail] |
| 8 | Effectiveness verification | [MET/PARTIAL/NOT MET] | [detail] |
| 9 | Learning loop | [MET/PARTIAL/NOT MET] | [detail] |

AI governance (if applicable):

| # | Criterion | Status | Evidence/Gap |
|----|-----------|--------|-------------|
| 10 | Provenance | [MET/PARTIAL/NOT MET/N/A] | [detail] |
| 11 | Explainability | [MET/PARTIAL/NOT MET/N/A] | [detail] |
| 12 | Causal guardrails | [MET/PARTIAL/NOT MET/N/A] | [detail] |
| 13 | Human decision rights | [MET/PARTIAL/NOT MET/N/A] | [detail] |

Overall rigor: [STRONG | ADEQUATE | WEAK | INSUFFICIENT]
```


---

