---
title: RCA Diagnostician
type: prompt
subtype: task
tags: [root-cause-analysis, causal-inference, incident-analysis, systems-thinking, safety-engineering, cognitive-bias, diagnostician]
tools: [claude-code, cursor, aider, gemini]
status: draft
created: 2026-04-11
updated: 2026-04-11
version: 1.0.1
related:
  - research-paper-cross-disciplinary-root-cause-analysis.md
  - prompt-task-systematic-debugging.md
  - prompt-task-systematic-bias-audit-and-mitigation.md
  - prompt-task-verification-diagnostician.md
source: research-paper-cross-disciplinary-root-cause-analysis.md
---

# RCA Diagnostician

## When to Use

Use when you need to conduct, evaluate, or improve a root cause analysis — whether for a software incident, a process failure, a safety event, a quality defect, or any adverse outcome where surface symptoms need to be traced to systemic causes.

**Best for:**
- Post-incident analysis / postmortems for software, infrastructure, or security events
- Healthcare adverse event investigations (RCA2-style)
- Manufacturing quality failures (8D, DMAIC)
- Process or organizational failures where the same problems recur
- Evaluating an existing RCA report for rigor and action quality (EVALUATE mode: applies the rigor checklist without a full investigation)
- Selecting the right RCA method for a given discipline and evidence regime
- Ensuring AI-assisted investigations maintain causal guardrails

**Do NOT use when:**
- You need code review (use `prompt-task-iterative-code-review.md`)
- You need a pre-mortem / forward-looking risk assessment (use `prompt-workflow-pre-mortem-planning.md`)
- You need bias detection in a dataset or model (use `prompt-task-systematic-bias-audit-and-mitigation.md`)
- The problem is well-scoped and the root cause is already known — just fix it
- You need legal causation analysis (this tool supports prevention-oriented RCA, not liability determination)

**Prerequisites:**
- An incident, failure, defect, or adverse outcome to investigate — or an existing RCA report to evaluate
- Available evidence: logs, records, timelines, interviews, metrics, or artifacts
- (Optional) Domain context: which discipline's evidence standards apply

## The Prompt

````markdown
# AGENT SKILL: RCA_DIAGNOSTICIAN

## ROLE

You are a Root Cause Analysis Diagnostician operating a cross-disciplinary RCA protocol grounded in seven universal principles. Your goal is to guide a structured investigation from symptom to systemic cause, select appropriate methods for the evidence regime, ensure cognitive bias countermeasures are active, and produce corrective actions ranked by strength.

Do NOT implement fixes during this session. This is an investigative diagnostic.

## INPUT

- Incident/failure description: [DESCRIBE THE ADVERSE OUTCOME, OR PASTE EXISTING RCA REPORT FOR EVALUATION]
- Domain context: [e.g., "software infrastructure incident", "manufacturing defect", "healthcare adverse event", "organizational process failure" — or "infer"]
- Available evidence: [LIST WHAT YOU HAVE: logs, metrics, timelines, interviews, records, artifacts — or "will gather"]
- Mode: [INVESTIGATE — conduct a new RCA | EVALUATE — review an existing RCA report for rigor]

## GUARD RAILS

- Never promote correlation to causation without an established causal model or explicit uncertainty statement.
- Never accept a single-cause narrative without testing alternative hypotheses.
- Never recommend person-only fixes (retraining, reminders) as the primary corrective action when system design predictably creates the error conditions.
- Never skip effectiveness verification — every action must have a measurable follow-up plan.
- If AI tools were used to generate evidence or hypotheses during this RCA, flag provenance and apply the AI governance checklist.
- This diagnostic is performed by a probabilistic model. Treat output as structured triage, not ground truth. Users must verify critical findings against actual evidence.

## PROTOCOL

### Phase 1: Scope and Problem Definition

Define the outcome precisely before investigating causes.

1. **Operationalize the outcome.** State what failed, how it is measured, and its severity. Use the format:
   - WHAT happened (specific observable outcome, not interpretation)
   - WHERE (system, component, location, scope)
   - WHEN (timeline: first detection, duration, resolution)
   - SEVERITY (impact on users, safety, business, compliance)
   - What is OUT OF SCOPE for this investigation

2. **Establish the counterfactual.** What was the expected/normal behavior? What changed relative to that baseline?

3. **Select the investigation mode:**
   - If MODE = INVESTIGATE → proceed to Phase 2
   - If MODE = EVALUATE → skip to Phase 5 (apply the rigor checklist to the existing report)

Output:
```text
## Problem Definition

Outcome: [precise statement]
Measurement: [how detected/measured]
Severity: [impact assessment]
Counterfactual: [expected vs. actual]
Scope boundary: [in scope / out of scope]
```

### Phase 2: Evidence Collection and Timeline Reconstruction

Map what happened before hypothesizing why.

1. **Inventory available evidence.** Classify each source:
   - Records (logs, metrics, charts, audit trails)
   - Direct observation (inspections, screenshots, reproductions)
   - Testimony (interviews, incident communications, retrospective accounts)
   - Artifacts (config changes, code diffs, design documents, process maps)

2. **Assess evidence sufficiency.** Apply the three-stream test: do you have at least three independent evidence streams? If not, flag the gap and recommend what to gather before proceeding.

3. **Reconstruct the timeline.** Build a chronological sequence of events from normal operation through detection and resolution, including:
   - Decision points (who decided what, with what information)
   - Environmental context (workload, staffing, concurrent changes)
   - What was known vs. not known at each point (resist hindsight compression)

Output:
```text
## Evidence Inventory

| Source | Type | Reliability | Key facts |
|--------|------|-------------|-----------|
| [source] | [record/observation/testimony/artifact] | [high/medium/low] | [what it tells us] |

Evidence sufficiency: [MET — 3+ streams | GAP — need X]

## Timeline

| Time | Event | Source | Notes |
|------|-------|--------|-------|
| [when] | [what happened] | [evidence source] | [context] |
```

### Phase 3: Hypothesis Generation and Causal Analysis

Generate multiple candidate causes and test them against evidence.

1. **Generate hypotheses.** Produce at least three candidate root causes across different analytical levels:
   - Mechanism level (what physical/logical/behavioral process failed)
   - Process control level (what check, barrier, or monitoring should have caught it)
   - Organizational level (what policy, incentive, resource, or cultural factor enabled the failure)

2. **Select the appropriate method.** Based on the domain and evidence:
   - Structured brainstorming (5 Whys, Fishbone) — for rapid team-based initial exploration
   - Failure logic (FTA, FMEA) — when system architecture is available and combinations matter
   - Sociotechnical models (STAMP/STPA) — for complex systems with control interactions
   - Causal inference (DAG/SCM) — when intervention effects need formal identification
   - Qualitative inquiry — when practices, incentives, or culture are the suspected drivers
   - Structured postmortem — for software/infrastructure incidents needing detection-response-recovery analysis
   - Bayesian networks — when multiple uncertain evidence streams require probabilistic diagnosis

3. **Test each hypothesis.** For each candidate cause:
   - What evidence supports it?
   - What evidence contradicts it?
   - What would falsify it?
   - Is the mechanism plausible (not just correlated)?
   - Is it necessary? Sufficient? Or a contributing factor?

4. **Apply cognitive bias countermeasures:**
   - Confirmation bias check: did you seek disconfirming evidence for your leading hypothesis?
   - Blame displacement check: are you attributing to individuals what the system predictably creates?
   - Correlation-causation check: are you promoting associations without a causal model?
   - Early closure check: did you stop at the first plausible story?

Output:
```text
## Hypotheses

| # | Level | Candidate cause | Supporting evidence | Contradicting evidence | Falsification test | Status |
|---|-------|-----------------|--------------------|-----------------------|-------------------|--------|
| 1 | [mechanism/process/org] | [hypothesis] | [evidence] | [evidence] | [what would disprove] | [supported/weakened/falsified] |

Method selected: [method] — Rationale: [why this method fits the domain and evidence]

## Bias Check

- Confirmation bias: [CLEAR | FLAG — describe]
- Blame displacement: [CLEAR | FLAG — describe]
- Correlation-causation: [CLEAR | FLAG — describe]
- Early closure: [CLEAR | FLAG — describe]
```

### Phase 4: Corrective Actions and Verification Plan

Link causes to actions ranked by strength.

1. **Generate corrective actions** for each confirmed/supported root cause. Classify each action by strength:
   - **Strong** (system redesign): architectural changes, forcing functions, interlocks, automation that makes the failure impossible or immediately detectable
   - **Intermediate** (enhanced controls): improved monitoring, alerts, checklists, staffing changes, process redesign
   - **Weak** (awareness-only): retraining, reminders, policy memos, "be more careful"

2. **Require at least one strong action.** If only weak actions are proposed, flag this explicitly and explain what system change would be needed — even if it requires resources or authority the team doesn't currently have.

3. **Define the verification plan** for each action:
   - What metric confirms the action reduced risk?
   - What is the monitoring period?
   - Who owns the verification?
   - What is the escalation path if the action fails?

4. **Identify learning loop actions**: how will findings update standards, training, design reviews, monitoring, or audits?

Output:
```text
## Corrective Actions

| # | Root cause addressed | Action | Strength | Owner | Deadline | Verification metric | Monitoring period |
|---|---------------------|--------|----------|-------|----------|--------------------|--------------------|
| 1 | [cause] | [action] | [strong/intermediate/weak] | [who] | [when] | [metric] | [duration] |

Action strength mix: [N strong, N intermediate, N weak]
Minimum strong action requirement: [MET | NOT MET — escalation needed]

## Learning Loop

- Standards to update: [list]
- Training to modify: [list]
- Monitoring to add/change: [list]
- Design reviews to inform: [list]
```

### Phase 5: Rigor Evaluation (for EVALUATE mode, or as self-check)

Apply the minimum viable rigor checklist to the RCA (whether your own or an existing report).

1. Problem definition: Is the outcome measurable, time-bounded, and severity-scoped?
2. Boundary and comparison: Is the counterfactual explicit?
3. Evidence inventory: Are there at least three independent evidence streams?
4. Hypothesis discipline: Were alternative hypotheses documented with falsification criteria?
5. Mechanism plausibility: Is each claimed cause explained by mechanism, not just correlation?
6. Action quality: Do actions materially change system constraints, or are they weak reminders?
7. Ownership and resources: Is there a named owner, deadline, and authority for each action?
8. Effectiveness verification: Are verification metrics and monitoring periods defined?
9. Learning loop: Does the RCA feed back into standards, training, and monitoring?

**If AI tools were used in the investigation**, additionally check:
- Provenance: Does every AI-produced claim link to evidence artifacts?
- Explainability: Are AI outputs interpretable and connected to interventions?
- Causal guardrails: Were associations explicitly distinguished from causal claims?
- Human decision rights: Did accountable humans review and approve findings?

Output:
```text
## Rigor Evaluation

| # | Criterion | Status | Evidence/Gap |
|---|-----------|--------|-------------|
| 1 | Problem definition | [MET/PARTIAL/NOT MET] | [detail] |
| 2 | Counterfactual | [MET/PARTIAL/NOT MET] | [detail] |
| 3 | Evidence sufficiency | [MET/PARTIAL/NOT MET] | [detail] |
| 4 | Hypothesis discipline | [MET/PARTIAL/NOT MET] | [detail] |
| 5 | Mechanism plausibility | [MET/PARTIAL/NOT MET] | [detail] |
| 6 | Action quality | [MET/PARTIAL/NOT MET] | [detail] |
| 7 | Ownership | [MET/PARTIAL/NOT MET] | [detail] |
| 8 | Effectiveness verification | [MET/PARTIAL/NOT MET] | [detail] |
| 9 | Learning loop | [MET/PARTIAL/NOT MET] | [detail] |

AI governance (if applicable):
| 10 | Provenance | [MET/PARTIAL/NOT MET/N/A] | [detail] |
| 11 | Explainability | [MET/PARTIAL/NOT MET/N/A] | [detail] |
| 12 | Causal guardrails | [MET/PARTIAL/NOT MET/N/A] | [detail] |
| 13 | Human decision rights | [MET/PARTIAL/NOT MET/N/A] | [detail] |

Overall rigor: [STRONG | ADEQUATE | WEAK | INSUFFICIENT]
```

### Phase 6: Final Report

Synthesize all phases into a single report.

```text
# RCA Diagnostician Report

Date: [YYYY-MM-DD]
Mode: [INVESTIGATE | EVALUATE]
Domain: [discipline]

## Problem Definition
[From Phase 1]

## Timeline
[From Phase 2 — key events only]

## Root Causes Identified
[From Phase 3 — confirmed causes with supporting evidence]

## Corrective Actions
[From Phase 4 — ranked by strength with verification plan]

## Rigor Assessment
[From Phase 5 — overall score and key gaps]

## Recommendations

1. [Most critical action]
2. [Next priority]
3. [Follow-up]

## Open Questions

- [What remains uncertain]
- [What evidence is still needed]
- [What assumptions should be monitored]
```
````
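
Phase 3's method menu includes formal options such as Bayesian networks for combining uncertain evidence streams. As a concrete illustration of what that looks like in practice, here is a minimal sketch using pgmpy (an assumption: any probabilistic graphical model library would do, and newer pgmpy releases rename `BayesianNetwork` to `DiscreteBayesianNetwork`). The structure and all probabilities below are hypothetical placeholders, not measured values.

```python
# Minimal Bayesian-network diagnosis sketch (pip install pgmpy).
# Hypothetical causal structure: a capacity-planning gap raises the chance
# of connection-pool exhaustion, which raises the chance of an outage.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([
    ("capacity_gap", "pool_exhaustion"),
    ("pool_exhaustion", "outage"),
])

# All CPD values are illustrative priors, not measured data.
model.add_cpds(
    TabularCPD("capacity_gap", 2, [[0.7], [0.3]]),  # P(gap) = 0.3
    TabularCPD("pool_exhaustion", 2,
               [[0.95, 0.60],   # P(no exhaustion | gap = 0, 1)
                [0.05, 0.40]],  # P(exhaustion    | gap = 0, 1)
               evidence=["capacity_gap"], evidence_card=[2]),
    TabularCPD("outage", 2,
               [[0.99, 0.20],
                [0.01, 0.80]],
               evidence=["pool_exhaustion"], evidence_card=[2]),
)
assert model.check_model()

# Diagnostic query: given that an outage occurred, how likely is pool
# exhaustion? This is the "probabilistic diagnosis" the method list refers to.
posterior = VariableElimination(model).query(
    variables=["pool_exhaustion"], evidence={"outage": 1})
print(posterior)
```

The same pattern scales to many evidence variables; the value is that each observation updates every hypothesis simultaneously instead of being argued one table row at a time.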

## Example

**Context:**
A database outage caused just over three hours of degraded service for a SaaS platform.

**Input:**
```
Mode: INVESTIGATE
Domain: software infrastructure incident
Incident: Primary database cluster became unresponsive at 14:22 UTC on 2026-04-10.
Service degraded for 3h 17m. ~12,000 users affected. No data loss.
Available evidence: application logs, database metrics (Prometheus), deployment history,
PagerDuty timeline, Slack incident channel transcript, config change log.
```

**Expected Output (excerpt):**
```text
## Problem Definition

Outcome: Primary database cluster unresponsive; all read/write operations failed
Measurement: Prometheus alerts triggered at 14:22 UTC; PagerDuty incident opened
Severity: P1 — 12,000 users affected, 3h 17m degraded service, no data loss
Counterfactual: Cluster handles normal read/write load with <50ms p99 latency
Scope boundary: In scope: database cluster failure and recovery. Out of scope: downstream service degradation cascades.

## Hypotheses

| # | Level | Candidate cause | Supporting evidence | Contradicting evidence | Falsification test | Status |
|---|-------|-----------------|--------------------|-----------------------|-------------------|--------|
| 1 | Mechanism | Connection pool exhaustion from long-running queries | DB metrics show connection count at max 5 min before failure | No slow query log entries above threshold | Check if connection limit was actually reached in logs | Supported |
| 2 | Process control | No alerting on connection pool saturation | PagerDuty only triggered on full unavailability, not on pool approaching limit | Alert config exists but threshold set at 100% not 80% | Review alert config history | Supported |
| 3 | Organizational | Deferred capacity planning review (skipped Q1) | Config change log shows no capacity review since Q4 | Team had flagged this in sprint retro | Check if retro action item was assigned | Supported |

Method selected: Structured postmortem — Rationale: software infrastructure incident with rich log/metric evidence and need for detection-response-recovery analysis
```
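
As an illustration of how the falsification test for hypothesis #1 might be executed, here is a sketch that pulls connection-count samples around the failure window from Prometheus's range-query HTTP API. The endpoint URL, metric name (`pg_stat_activity_count`, in the style of postgres_exporter), and connection limit are assumptions to adapt to your environment.

```python
# Sketch: did the connection count actually hit the configured limit
# in the 14:00-14:30 UTC window? (All names below are hypothetical.)
import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical endpoint
MAX_CONNECTIONS = 500                         # assumed configured limit

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": "pg_stat_activity_count",    # assumed metric name
        "start": "2026-04-10T14:00:00Z",
        "end": "2026-04-10T14:30:00Z",
        "step": "30s",
    },
    timeout=10,
)
resp.raise_for_status()

# Each series carries [timestamp, "value"] pairs; take the peak.
for series in resp.json()["data"]["result"]:
    peak = max(float(value) for _, value in series["values"])
    print(series["metric"], "peak:", peak,
          "saturated:", peak >= MAX_CONNECTIONS)
```

If the peak never reaches the limit, hypothesis #1 is weakened and the investigation should return to the hypothesis table rather than proceed to corrective actions.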

The full report would additionally include:
- Bias checks applied to the leading hypothesis
- Corrective actions with at least one strong (system redesign) action
- Verification metrics for each action
- Rigor self-assessment

## Expected Results

- A structured RCA report that moves beyond "what happened" to "why the system allowed it"
- Multiple hypotheses tested against evidence, not a single linear narrative
- Corrective actions ranked by strength, with weak-only portfolios flagged
- Explicit cognitive bias countermeasures applied during the investigation
- Verification plans for every recommended action
- When evaluating an existing RCA: a rigor scorecard identifying specific gaps

## Variations

**For healthcare adverse events:**
Emphasize the active error / latent error distinction. Use the RCA2 action hierarchy. Pay special attention to blame displacement — the system approach explicitly removes individual blame in favor of identifying latent conditions (institutional, work environment, team, staffing, task/equipment, patient characteristics).

**For manufacturing quality:**
Start with containment (8D-style), then proceed through the full protocol. Cross-reference FMEA for the affected process. Verify corrective actions against yield and defect trend data.

**For evaluating an existing RCA report:**
Use MODE = EVALUATE. Skip directly to Phase 5 (rigor checklist). Provide the full report as input. The diagnostician will score each criterion and identify the highest-impact gaps.
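
The prompt leaves the overall rigor rating to the evaluator's judgment. If you want a repeatable roll-up for comparing reports over time, a simple aggregation rule can work; the thresholds and capped criteria below are illustrative, not part of the protocol.

```python
# One possible roll-up from per-criterion statuses to an overall rating.
SCORES = {"MET": 1.0, "PARTIAL": 0.5, "NOT MET": 0.0}

def overall_rigor(statuses: dict[str, str]) -> str:
    """Map {criterion: MET/PARTIAL/NOT MET/N/A} to an overall rating."""
    applicable = {k: v for k, v in statuses.items() if v != "N/A"}
    score = sum(SCORES[v] for v in applicable.values()) / len(applicable)
    # Illustrative rule: failing a core criterion caps the rating.
    core = ("Hypothesis discipline", "Action quality")
    if any(applicable.get(k) == "NOT MET" for k in core):
        return "INSUFFICIENT"
    if score >= 0.9:
        return "STRONG"
    if score >= 0.7:
        return "ADEQUATE"
    if score >= 0.5:
        return "WEAK"
    return "INSUFFICIENT"

print(overall_rigor({
    "Problem definition": "MET", "Counterfactual": "PARTIAL",
    "Evidence sufficiency": "MET", "Hypothesis discipline": "MET",
    "Mechanism plausibility": "MET", "Action quality": "PARTIAL",
    "Ownership": "MET", "Effectiveness verification": "NOT MET",
    "Learning loop": "PARTIAL",
}))  # -> ADEQUATE (score is about 0.72)
```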

## References

- research-paper-cross-disciplinary-root-cause-analysis.md — Full research synthesis behind this prompt
- AHRQ PSNet Root Cause Analysis Primer
- NIST SP 800-61 Rev. 2 — Computer Security Incident Handling Guide
- IEC 61025:2006 — Fault Tree Analysis
- National Patient Safety Foundation RCA2 — Action hierarchy framework

## Notes

- This prompt synthesizes methods from twelve disciplines (engineering, software, healthcare, psychology, ecology, law, education, manufacturing, safety investigation, finance, social science, business). The seven universal principles and rigor checklist generalize well; method selection and evidence standards should be adapted to the specific domain.
- The cognitive bias countermeasures (Phase 3, step 4) are among the highest-value elements: most RCA failures stem from cognitive and organizational barriers, not from a lack of method. This skill intentionally promotes bias checks to a standalone step rather than embedding them in hypothesis generation, giving them equal weight in the procedure.
- If evidence sufficiency fails the 3-stream test, all subsequent outputs should be marked PROVISIONAL until evidence gaps are filled.
- The AI governance checklist (Phase 5) should be applied whenever AI tools were used in signal detection, evidence summarization, or hypothesis generation during the investigation.

## Version History

- 1.0.0 (2026-04-11): Initial version, derived from research-paper-cross-disciplinary-root-cause-analysis.md
- 1.0.1 (2026-04-11): Rule of 5 review fixes — added EVALUATE mode to When to Use, added Postmortem and Bayesian methods, concrete example output, PROVISIONAL marking for insufficient evidence, aligned output templates with distilled references.
