The clinical observation: among ICU patients with sepsis, those who received vasopressors died at higher rates than those who did not. The naive interpretation: vasopressors are harmful in sepsis. The actual mechanism: clinicians give vasopressors to the patients with worse hemodynamics, and worse hemodynamics independently predict death. The treatment is correlated with the outcome through the prior state, not just through its causal effect.

This is confounding by indication, but it is worse than the static version. The decision is sequential — fluid bolus at hour 0, vasopressor at hour 4, antibiotic escalation at hour 8 — and at each step the patient's evolving state is both an outcome of prior treatment AND a driver of the next treatment decision. Adjusting for the state at each step blocks the legitimate causal pathway from earlier treatment to outcome. Adjusting only for baseline misses the time-varying confounder. Neither standard adjustment strategy works.

The structural problem

Reinforcement learning trained on ICU records without causal adjustment learns the historical policy, the one that produced the observed data. Because the historical policy was driven by the observed state, it is a function of the confounder. Deploying a policy learned this way is, formally, a no-op: it reproduces the observed outcomes because it was learned from them. Causal RL, using g-methods or an equivalent sequential identification strategy, instead learns the interventional value of alternative policies: the outcomes they would produce if actually deployed.

Figure: prior network state, no evidence set. Population baseline before any patient data is entered.

Dialog: "Sepsis happens too fast for a model to be useful." Answered: not for the bedside, but for the protocol. The dynamic-treatment-regime machinery evaluates a decision rule against the patients you have already treated, controlling for the time-varying confounding that makes the current policy look optimal in its own data. The question it answers is whether to change the rule, not what to do right now.

The model is a dynamic Bayesian network across three time slices: hour 0 (admission), hour 6, and hour 12. A 28-day mortality node depends on the cumulative trajectory.

Node (per slice) | States | Role
Severity_t (SOFA bin) | Mild · Moderate · Severe | Patient state; confounder at next step
Hemodynamics_t | MAP_low_lact_high · MAP_low_lact_normal · MAP_normal | Treatment-decision driver; time-varying confounder
FluidGiven_t | Yes · No | Treatment
PressorGiven_t | Yes · No | Treatment
Comorbidities (time-invariant) | Low · Moderate · High | Baseline confounder
Mortality_28d (terminal) | Alive · Deceased | Outcome

The structural commitments at each slice t:

  • Severity_t, Hemodynamics_t, Comorbidities → FluidGiven_t, PressorGiven_t (clinicians decide on observed state: the time-varying confounding mechanism)
  • Severity_t, FluidGiven_t, PressorGiven_t → Hemodynamics_{t+1}, Severity_{t+1} (treatments and prior state shape next state)
  • Final-slice state and cumulative treatments → Mortality_28d
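The structural commitments above can be written down as a parent map and checked mechanically. The following is a minimal sketch, not the Bayes Server encoding: the function names are hypothetical, and it assumes that the slice-0 state nodes have no modelled parents and that "cumulative treatments" means every FluidGiven_t and PressorGiven_t is a direct parent of Mortality_28d.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_parents(n_slices=3):
    """Return {node: set_of_parents} for the network unrolled over n_slices."""
    parents = {"Comorbidities": set()}
    treatments = []
    for t in range(n_slices):
        sev, hemo = f"Severity_{t}", f"Hemodynamics_{t}"
        if t == 0:
            # Assumption: admission state nodes are roots in this sketch.
            parents[sev], parents[hemo] = set(), set()
        else:
            # Treatments and prior state shape the next state.
            prev = {f"Severity_{t - 1}", f"FluidGiven_{t - 1}", f"PressorGiven_{t - 1}"}
            parents[sev], parents[hemo] = set(prev), set(prev)
        for decision in (f"FluidGiven_{t}", f"PressorGiven_{t}"):
            # Clinicians decide on the observed state.
            parents[decision] = {sev, hemo, "Comorbidities"}
            treatments.append(decision)
    last = n_slices - 1
    parents["Mortality_28d"] = {f"Severity_{last}", f"Hemodynamics_{last}", *treatments}
    return parents


def is_treatment(node):
    return node.startswith(("FluidGiven_", "PressorGiven_"))


def time_varying_confounders(parents):
    """State nodes that are a child of one treatment and a parent of another."""
    return sorted(
        node for node, pa in parents.items()
        if not is_treatment(node)
        and any(is_treatment(p) for p in pa)
        and any(is_treatment(c) and node in parents[c] for c in parents)
    )


PARENTS = build_parents()
# static_order() raises graphlib.CycleError if the map is not a DAG.
ORDER = list(TopologicalSorter(PARENTS).static_order())
```

Calling `time_varying_confounders(PARENTS)` lists the slice-1 and slice-2 state nodes, which are exactly the variables that cannot be handled by either conditioning on them or ignoring them.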

Identifiability

The interventional distribution P(Mortality | do(policy = π̂)) under a candidate dynamic policy π̂ is identifiable using Robins' g-formula, provided the standard conditions hold: sequential exchangeability (no unmeasured confounding of each treatment decision given the recorded history), positivity, and consistency. The computation iterates over time slices, taking expectations over the post-treatment state distribution at each step. This is operationally identical to causal reinforcement learning when the value function is computed under the interventional distribution rather than the observational one. The page-level message: the choice between "off-policy RL" and "g-methods" is partly a matter of nomenclature, because the formal content is the same, and the failure mode of deploying historical policies as causal recommendations is a well-defined statistical artifact.
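For reference, the g-formula for the three-slice model can be written out explicitly. The notation is introduced here for illustration and is not taken from the model file: C is Comorbidities, L_t is the pair (Severity_t, Hemodynamics_t), A_t is the pair (FluidGiven_t, PressorGiven_t), Y is Mortality_28d, and an overbar denotes history up to and including that slice (empty for t = -1).

```latex
\[
P\bigl(Y = y \mid do(\hat{\pi})\bigr)
  = \sum_{c} \sum_{l_0, l_1, l_2} \sum_{a_0, a_1, a_2}
    P(c)\,
    P\bigl(y \mid \bar{l}_2, \bar{a}_2, c\bigr)
    \prod_{t=0}^{2}
      P\bigl(l_t \mid \bar{l}_{t-1}, \bar{a}_{t-1}, c\bigr)\,
      \hat{\pi}\bigl(a_t \mid \bar{l}_t, \bar{a}_{t-1}, c\bigr)
\]
```

The state-transition and outcome factors are estimated from the observational data; only the treatment factor is replaced by the candidate policy. Substituting the historical clinician policy for π̂ recovers the observational mortality. If the arrows listed in the structural commitments are the only ones present, the transition factor reduces to P(l_t | Severity_{t-1}, a_{t-1}) and the policy factor to π̂(a_t | l_t, c).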

Three queries on the dynamic structure, each with a different identification status.

How to read the diagrams. An arrow shows the causal direction. An arrow from A to B means A causes an effect — a change — in B.

Two operators appear repeatedly below. obs(X = value) means we learned that X had this value — like filtering the chart-review down to only patients where X was that value. do(X = value) means we imposed this value — like a randomization in a trial, where we control X regardless of what the patient would naturally have. The difference matters: filtering down to "patients who got the drug" tells you something about which patients tend to receive it; imposing the drug tells you only what the drug does.
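The gap between the two operators can be seen in a single-decision toy model computed exactly. This is a sketch with made-up probabilities, not values from the sepsis network: a hypothetical two-level hemodynamic state drives both the pressor decision and mortality, and the pressor is mildly beneficial in the low-MAP stratum.

```python
from fractions import Fraction as F

# Illustrative numbers only (assumptions, not fitted to any data).
P_H = {"low": F(4, 10), "normal": F(6, 10)}              # P(Hemodynamics)
P_PRESSOR_GIVEN_H = {"low": F(9, 10), "normal": F(1, 10)}  # clinician behaviour
P_DEATH = {                                               # P(death | H, pressor)
    ("low", 1): F(4, 10), ("low", 0): F(5, 10),
    ("normal", 1): F(1, 10), ("normal", 0): F(1, 10),
}


def p_treat(h, a):
    """P(pressor = a | Hemodynamics = h) under the historical behaviour."""
    p_yes = P_PRESSOR_GIVEN_H[h]
    return p_yes if a == 1 else 1 - p_yes


def p_death_obs(a):
    """P(death | obs(pressor = a)): filter the records to patients who got a."""
    joint = sum(P_H[h] * p_treat(h, a) * P_DEATH[(h, a)] for h in P_H)
    marginal = sum(P_H[h] * p_treat(h, a) for h in P_H)
    return joint / marginal


def p_death_do(a):
    """P(death | do(pressor = a)): impose a on everyone, whatever their state."""
    return sum(P_H[h] * P_DEATH[(h, a)] for h in P_H)


print("obs:", float(p_death_obs(1)), "vs", float(p_death_obs(0)))
print("do: ", float(p_death_do(1)), "vs", float(p_death_do(0)))
```

In this toy model the filtered comparison shows treated patients dying more often (5/14 versus 37/290), while the imposed comparison shows the pressor lowering mortality (11/50 versus 13/50). The reversal comes entirely from who was selected for treatment.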

Rung 1: Association (Historical Policy)

Under the historical clinician policy, what fraction of patients with this baseline state survived to day 28?

Computed directly from the joint distribution of Comorbidities, Severity, Hemodynamics, the observed sequence of FluidGiven and PressorGiven decisions, and Mortality_28d. This is what observational survival curves report. Reproduces the data; does not predict the effect of changing clinical practice.

In plain language: Historical mortality numbers describe what happened under the past clinician policy. They cannot predict what would happen under a new policy that responds to Severity or Hemodynamics differently.

Rung 2: Intervention (Candidate Policy)

If we deployed an alternative policy — e.g., earlier pressor escalation when MAP drops below 65 — what would the 28-day mortality be?

P(Mortality_28d | do(policy = π̂)). Computed by the g-formula: at each time slice, take the candidate policy's action distribution conditional on the slice's observed Severity and Hemodynamics, and propagate forward through the structural equations. This is the quantity that distinguishes causal RL from off-policy RL with naïve weighting.

In plain language: The candidate policy's expected mortality under the structural model, which accounts for how earlier PressorGiven changes downstream Hemodynamics, is the right quantity to compare against the historical baseline before any deployment decision.

Figure: prior network state, no policy applied. Population baseline.
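The Rung 2 computation can be sketched on a reduced model with two decision points instead of three. Everything below is an illustrative assumption: a two-level hemodynamic state, a single pressor action, made-up probabilities, and hypothetical function names. The point is the shape of the calculation, in which the state-transition and outcome tables stay fixed and only the treatment rule is swapped.

```python
from fractions import Fraction as F
from itertools import product

STATES = ("low", "normal")  # hypothetical two-level Hemodynamics
ACTIONS = (0, 1)            # PressorGiven: 0 = No, 1 = Yes

# Illustrative tables (assumptions, not fitted to any data).
P_H0 = {"low": F(1, 2), "normal": F(1, 2)}
P_H1_LOW = {("low", 0): F(9, 10), ("low", 1): F(1, 2),       # P(H1 = low | H0, A0)
            ("normal", 0): F(1, 5), ("normal", 1): F(1, 10)}
P_DEATH = {("low", 0): F(3, 5), ("low", 1): F(2, 5),         # P(death | H1, A1)
           ("normal", 0): F(1, 10), ("normal", 1): F(1, 10)}


def p_h1(h1, h0, a0):
    p_low = P_H1_LOW[(h0, a0)]
    return p_low if h1 == "low" else 1 - p_low


# A policy returns P(action = a | state = h at slice t).
def historical(a, h, t):
    p_yes = F(4, 5) if h == "low" else F(1, 5)
    return p_yes if a == 1 else 1 - p_yes


def treat_if_low(a, h, t):
    return F(int(a == (1 if h == "low" else 0)))


def always_treat(a, h, t):
    return F(int(a == 1))


def never_treat(a, h, t):
    return F(int(a == 0))


def g_formula_mortality(policy):
    """Sum over all trajectories, with the policy replacing clinician behaviour."""
    total = F(0)
    for h0, a0, h1, a1 in product(STATES, ACTIONS, STATES, ACTIONS):
        total += (P_H0[h0] * policy(a0, h0, 0) * p_h1(h1, h0, a0)
                  * policy(a1, h1, 1) * P_DEATH[(h1, a1)])
    return total


for rule in (historical, treat_if_low, always_treat, never_treat):
    print(rule.__name__, float(g_formula_mortality(rule)))
```

Passing `historical` reproduces the Rung 1 number, which is the sense in which a policy learned from the data evaluates to the data. The other rules give genuinely different values because the earlier action changes the distribution of the later state.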

Rung 3: Counterfactual

This specific patient received fluid at hour 0 and pressor at hour 12 and died on day 6. Would they have survived if pressor had been started at hour 6?

Patient-specific counterfactual. The U-values at each slice — U_Severity, U_Hemodynamics, U_Mortality — are abducted from the factual observed trajectory; then the intervention do(PressorGiven_1 = Yes) is applied and the trajectory is replayed under the abducted U-values. This is the inference that supports retrospective M&M review and individual-level audit of clinical decisions.

In plain language: For this specific patient, the model can estimate whether earlier PressorGiven would have changed the Mortality_28d outcome, using their factual Severity and Hemodynamics trajectory to inform the counterfactual replay. That is what the M&M reviewer needs.

Figure: prior network state, no evidence set. Population baseline before any patient trajectory is entered.
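The abduct, intervene, replay sequence can be illustrated with a deliberately small structural model. This is a sketch under stated assumptions, not the network in the file: it uses only two exogenous variables (stand-ins for U_Hemodynamics and U_Mortality), deterministic structural equations, hypothetical category names, and made-up priors. It also assumes the U-values do not influence the treatment decision, so conditioning on the observed treatment is enough.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical priors over the exogenous noise terms.
P_U_HEMO = {"stable": F(1, 2), "responder": F(3, 10), "nonresponder": F(1, 5)}
P_U_MORT = {"frail": F(1, 2), "resilient": F(1, 2)}


def hemodynamics_2(pressor_1, u_hemo):
    """Structural equation for the final-slice hemodynamic state."""
    if u_hemo == "stable":
        return "MAP_normal"
    if u_hemo == "responder":
        return "MAP_normal" if pressor_1 == "Yes" else "MAP_low"
    return "MAP_low"  # nonresponder


def mortality(hemo_2, u_mort):
    """Structural equation for the outcome."""
    return "Deceased" if hemo_2 == "MAP_low" and u_mort == "frail" else "Alive"


def outcome(pressor_1, u_hemo, u_mort):
    return mortality(hemodynamics_2(pressor_1, u_hemo), u_mort)


def abduct(factual_pressor, factual_outcome):
    """Step 1: posterior over (U_hemo, U_mort) given the factual record."""
    weights = {}
    for (uh, ph), (um, pm) in product(P_U_HEMO.items(), P_U_MORT.items()):
        if outcome(factual_pressor, uh, um) == factual_outcome:
            weights[(uh, um)] = ph * pm
    z = sum(weights.values())  # zero if the record is impossible under the model
    return {u: w / z for u, w in weights.items()}


def p_counterfactual_survival(factual_pressor, factual_outcome, cf_pressor):
    """Steps 2 and 3: impose cf_pressor, replay under the abducted U-values."""
    posterior = abduct(factual_pressor, factual_outcome)
    return sum(p for (uh, um), p in posterior.items()
               if outcome(cf_pressor, uh, um) == "Alive")


print(p_counterfactual_survival("No", "Deceased", "Yes"))
```

For a patient who received no early pressor and died, this toy model gives a counterfactual survival probability of 3/5. The population-level survival under do(PressorGiven_1 = Yes) in the same model is 9/10, so the individual answer is not the Rung 2 answer: the factual death rules out the U-values under which the patient would have survived anyway.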

The Bayes Server file below encodes the DAG and the conditional probability tables described above. Each observable node has a corresponding U-node — the exogenous noise variable that absorbs residual variation — which is what makes Rung 3 counterfactual abduction possible. The CPTs are populated with clinically defensible illustrative priors; the qualitative behavior they encode is what makes the failure mode visible when running Rung 1 versus Rung 2 queries on the same data.

SepsisDynamicTreatment.bayes
Two-slice dynamic Bayesian network with state, treatment, and outcome at each step. Demonstrates the time-varying confounding pathway: hemodynamics drives both the next treatment decision AND the next state. Adjusting for hemodynamics as a covariate blocks a legitimate causal path; correct identification requires Robins' g-formula or sequential back-door adjustment.