STAR*D, the largest pragmatic trial of antidepressant sequencing, established that approximately one-third of patients remit after a first-line SSRI. Among those who do not, approximately a quarter remit at Level 2 (switch or augment), and fewer still remit at Level 3. Those are population averages: the marginal probability of remission at each step, integrated over the heterogeneous population that arrived at that step.

The patient on the table is not the population average. They failed sertraline at Level 1 and venlafaxine at Level 2. Those two specific failures are themselves diagnostic: they tell us something about which kind of depression this is. The history is not a covariate to be controlled for. It is evidence about an unobserved subtype that the next treatment decision needs to condition on.

The structural problem

A standard ML approach to "next-best treatment" would use prior treatments and prior responses as features in a predictive model. That produces an associational estimate that ignores why those features have the values they do. Treatment_L1 was chosen partly because of demographic and severity covariates and partly because of the latent type the clinician was already inferring. Response_L1 reflects both Treatment_L1 and the latent type. Conditioning on those features in a regression introduces collider bias on the latent, and the model gives wrong answers for the very decision it was built to support.

Objection: "An algorithm can't replace clinical judgment." Answer: it shouldn't try. A causal model encodes judgment so it can be applied consistently, audited, updated, and pressure-tested. The graph is drawn by experts; the parameters are calibrated on data; the queries are run by the model. The clinician remains the source of structural judgment and the final decision-maker.

The model is a sequential decision graph with an explicit latent type variable. The latent type is partially observed through baseline severity and comorbid anxiety; it is further constrained by each treatment failure observed in sequence.

| Node | States | Role |
|---|---|---|
| Demographics | Younger_Female · Younger_Male · Older_Female · Older_Male | Pre-treatment covariate |
| BaselineSeverity | Mild · Moderate · Severe | Symptom-derived; informs latent type |
| ComorbidAnxiety | Yes · No | Symptom-derived; informs latent type |
| LatentType | Anxious · Melancholic · Atypical | Latent; never directly observed |
| Treatment_L1 | SSRI · SNRI · Other | First-line agent |
| Response_L1 | Remit · Partial · Non-remit | Updates posterior on LatentType |
| Treatment_L2 | Switch_class · Augment · Other | Second-line agent |
| Response_L2 | Remit · Partial · Non-remit | Further updates posterior |
| Treatment_L3 | Augment · Switch_to_MAOI · Ketamine · Other | The decision under analysis |
| Response_L3 | Remit · Non-remit | Outcome of interest |

Edges encode the structural assumptions: Demographics, BaselineSeverity, ComorbidAnxiety → LatentType (partial observability of the type from observed symptoms). LatentType → all Response_t (response patterns are type-driven). Demographics + LatentType → all Treatment_t (clinicians choose partly based on inferred type). Treatment_t + Response_{t-1} → Response_t (the agent matters; prior failures don't directly cause the next response but their information flows through the latent). Response_{t-1} → Treatment_t (clinicians update).
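The edge list above can be written down as a parent map and checked for cycles. The following is a minimal sketch using only Python's standard library; the helper name `build_parents` and the constants `LEVELS`, `PARENTS`, and `ORDER` are illustrative and are not taken from the Bayes Server file.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

LEVELS = (1, 2, 3)


def build_parents():
    """Return {node: set(parents)} for the edge list described in the text."""
    parents = {
        "Demographics": set(),
        "BaselineSeverity": set(),
        "ComorbidAnxiety": set(),
        # Observed symptoms and demographics partially reveal the type.
        "LatentType": {"Demographics", "BaselineSeverity", "ComorbidAnxiety"},
    }
    for t in LEVELS:
        treatment, response = f"Treatment_L{t}", f"Response_L{t}"
        # Clinicians choose partly on demographics and the inferred type.
        parents[treatment] = {"Demographics", "LatentType"}
        # Response depends on the type and on the agent given.
        parents[response] = {"LatentType", treatment}
        if t > 1:
            previous = f"Response_L{t - 1}"
            parents[treatment].add(previous)  # clinicians update on the last response
            parents[response].add(previous)   # edge as listed in the text
    return parents


PARENTS = build_parents()
# static_order() raises graphlib.CycleError if the edge list is not a DAG.
ORDER = list(TopologicalSorter(PARENTS).static_order())
print(ORDER)
```

A topological order is also the order in which an ancestral sampler would draw the nodes, so the same map can be reused when simulating synthetic patients.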

Identifiability

The latent type makes the treatment-response system identifiable in a way that "control for prior treatments" cannot. The structural form lets the inference engine compute the correct posterior over LatentType after observing the treatment history, then marginalize that posterior when projecting forward. This is what STAR*D's protocol-driven recommendations cannot do — they treat the patient at L3 as a member of the L3 population, not as a specific posterior over types.
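To make the posterior update concrete, here is a small sketch of Bayes' rule applied to two observed failures. Every number in it is invented for illustration and does not come from STAR*D or from the model's CPTs. It also simplifies the graph: responses are collapsed to remit versus non-remit, they are treated as independent given the type and the agent, and the information carried by the clinician's treatment choice is ignored.

```python
# Hypothetical prior over the latent type (illustrative numbers only).
PRIOR = {"Anxious": 0.40, "Melancholic": 0.35, "Atypical": 0.25}

# Hypothetical P(remit | agent class, latent type); not calibrated on any data.
P_REMIT = {
    ("SSRI", "Anxious"): 0.45,
    ("SSRI", "Melancholic"): 0.30,
    ("SSRI", "Atypical"): 0.20,
    ("SNRI", "Anxious"): 0.35,
    ("SNRI", "Melancholic"): 0.40,
    ("SNRI", "Atypical"): 0.20,
}


def posterior_after_failures(prior, failed_agents):
    """Posterior over the latent type after non-remission on each listed agent."""
    unnormalized = {}
    for latent_type, p_type in prior.items():
        likelihood = 1.0
        for agent in failed_agents:
            likelihood *= 1.0 - P_REMIT[(agent, latent_type)]
        unnormalized[latent_type] = p_type * likelihood
    total = sum(unnormalized.values())
    return {z: w / total for z, w in unnormalized.items()}


# Sertraline is an SSRI and venlafaxine an SNRI, matching the example patient.
POSTERIOR = posterior_after_failures(PRIOR, ["SSRI", "SNRI"])
for latent_type, p in POSTERIOR.items():
    print(f"{latent_type}: prior {PRIOR[latent_type]:.2f} -> posterior {p:.3f}")
```

With these made-up numbers, two failures shift probability toward the type that responds poorly to both classes, which is the mechanism the text describes: the history acts as evidence about the subtype.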

The graph supports three sequential queries — each one a different formal question about the same patient.

How to read the diagrams. An arrow shows the causal direction. An arrow from A to B means A causes an effect — a change — in B.

Two operators appear repeatedly below. obs(X = value) means we learned that X had this value — like filtering the chart-review down to only patients where X was that value. do(X = value) means we imposed this value — like a randomization in a trial, where we control X regardless of what the patient would naturally have. The difference matters: filtering down to "patients who got the drug" tells you something about which patients tend to receive it; imposing the drug tells you only what the drug does.
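The gap between the two operators can be shown by exact enumeration on a three-variable toy model: a hidden type Z influences both whether a treatment is given and whether the patient remits. The model and all probabilities below are hypothetical and are far smaller than the graph in this article; the function names `p_remit_obs` and `p_remit_do` are illustrative.

```python
# Hypothetical hidden type and its prior.
P_Z = {"responsive": 0.5, "resistant": 0.5}

# Hypothetical clinician behavior: P(treated | Z). Resistant patients are
# treated far more often, which is what confounds the chart review.
P_T_GIVEN_Z = {"responsive": 0.2, "resistant": 0.8}

# Hypothetical P(remit | treated, Z). Treatment helps both types by 0.1.
P_Y_GIVEN_TZ = {
    (True, "responsive"): 0.6,
    (True, "resistant"): 0.2,
    (False, "responsive"): 0.5,
    (False, "resistant"): 0.1,
}


def p_remit_obs(treated):
    """P(remit | obs(T = treated)): filter to patients who actually got T."""
    numerator = denominator = 0.0
    for z, p_z in P_Z.items():
        p_t = P_T_GIVEN_Z[z] if treated else 1.0 - P_T_GIVEN_Z[z]
        numerator += p_z * p_t * P_Y_GIVEN_TZ[(treated, z)]
        denominator += p_z * p_t
    return numerator / denominator


def p_remit_do(treated):
    """P(remit | do(T = treated)): impose T, so Z keeps its prior distribution."""
    return sum(p_z * P_Y_GIVEN_TZ[(treated, z)] for z, p_z in P_Z.items())


for treated in (True, False):
    print(treated, round(p_remit_obs(treated), 3), round(p_remit_do(treated), 3))
```

In this toy model the filtered rates make treatment look harmful (0.28 treated versus 0.42 untreated), while the imposed rates show it helps (0.40 versus 0.30). The reversal comes entirely from which patients tend to receive the treatment.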

Rung 1 — Association

Among patients with the same demographics, baseline severity, and prior treatment history who received Augment at L3, what was the remission rate?

Computed as a conditional probability over the data — covariates like Demographics, BaselineSeverity, plus the observed sequence of Treatment_L1, Response_L1, Treatment_L2, Response_L2. Useful for descriptive epidemiology. Polluted by selection: clinicians chose Augment at L3 partly because of suspicions about the patient's LatentType — and that latent state is exactly what we cannot directly observe.

In plain language: The chart-review remission rate for Augment-at-L3 looks like an answer, but it isn't: clinicians who chose this treatment did so based on signs of LatentType we cannot see in the data. The conditional cannot tell you what would happen if a different decision had been made.

Rung 2 — Intervention

If a fresh clinician with no information about the patient's prior choices were forced to give Augment at L3, what would the remission rate be?

P(Response_L3 = Remit | do(Treatment_L3 = Augment), demographics, severity, anxiety, prior responses). The intervention severs Treatment_L3's dependence on the (unobserved) clinician's inference about LatentType — but the posterior on LatentType, updated by the observed prior responses Response_L1 and Response_L2, still conditions the outcome. This is the right quantity for guideline development.
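Under the graph as described, and assuming it has no hidden parents beyond LatentType, this quantity can be written as a sum over the latent type. The shorthand is introduced here for illustration: D, S, A for the three baseline covariates, T_t and R_t for treatment and response at level t, Z for LatentType, and e for the evidence observed before the L3 decision.

```latex
\begin{aligned}
e &= (D, S, A, T_1, R_1, T_2, R_2) \\
P\bigl(R_3 = \text{Remit} \mid do(T_3 = a),\, e\bigr)
  &= \sum_{z} P\bigl(R_3 = \text{Remit} \mid T_3 = a,\, R_2 = r_2,\, Z = z\bigr)\,
     P\bigl(Z = z \mid e\bigr)
\end{aligned}
```

The step relies on e containing no descendants of T_3, so imposing the treatment leaves the posterior P(Z = z | e) unchanged. The Rung 1 conditional differs in exactly that weight: it uses P(Z = z | e, T_3 = a), which also absorbs whatever the clinician's choice reveals about the type.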

In plain language: The unconfounded population effect of Augment at L3 — comparing it like-for-like against Switch_to_MAOI or Ketamine — gives the right number for treatment guidelines. It does not tell you what would have happened in this patient's actual sequence.

Rung 3 — Counterfactual

This patient remitted at L3 on Augment. Would they have remitted at L1 if mirtazapine had been given instead of sertraline?

The full counterfactual. Abduct the U-values consistent with the factual sequence (Treatment_L1, Response_L1, Treatment_L2, Response_L2, Treatment_L3, Response_L3) and the inferred LatentType. Intervene on Treatment_L1 to set it to mirtazapine; propagate forward; read the counterfactual response at L1 and beyond. This is the inference that lets the morbidity and mortality (M&M) conference, or the malpractice review, assess whether the original treatment plan was reasonable given what was knowable at the time.
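The abduct, intervene, propagate recipe can be sketched on a single treatment level. This toy version makes several assumptions that the text does not: the only evidence is the L1 agent and response, remission is generated as U < p(agent, type) with one shared uniform noise term U, mirtazapine is represented by the "Other" class, and every probability is invented. The function name `counterfactual_remit` is hypothetical.

```python
# Hypothetical prior over the latent type (illustrative numbers only).
PRIOR = {"Anxious": 0.40, "Melancholic": 0.35, "Atypical": 0.25}

# Hypothetical P(remit at L1 | agent class, latent type).
# "Other" stands in for mirtazapine in this sketch.
P_REMIT_L1 = {
    ("SSRI", "Anxious"): 0.45,
    ("SSRI", "Melancholic"): 0.30,
    ("SSRI", "Atypical"): 0.20,
    ("Other", "Anxious"): 0.40,
    ("Other", "Melancholic"): 0.45,
    ("Other", "Atypical"): 0.25,
}


def counterfactual_remit(factual_agent, factual_remit, counterfactual_agent):
    """P(remit under counterfactual_agent | factual agent and outcome).

    Assumed mechanism: remit iff U < p(agent, type), with U ~ Uniform(0, 1)
    shared between the factual and counterfactual worlds.
    """
    numerator = denominator = 0.0
    for latent_type, p_type in PRIOR.items():
        p_fact = P_REMIT_L1[(factual_agent, latent_type)]
        p_cf = P_REMIT_L1[(counterfactual_agent, latent_type)]
        # Abduction: the factual outcome confines U to an interval.
        low, high = (0.0, p_fact) if factual_remit else (p_fact, 1.0)
        denominator += p_type * (high - low)  # P(type, factual outcome)
        # Action and prediction: remit in the other world iff U < p_cf.
        overlap = max(0.0, min(high, p_cf) - low)
        numerator += p_type * overlap
    return numerator / denominator


P_CF = counterfactual_remit("SSRI", False, "Other")
print(f"P(remit on Other | failed SSRI) = {P_CF:.3f}")
```

A useful sanity check on any such mechanism is consistency: asking the counterfactual with the agent unchanged must reproduce the factual outcome, so `counterfactual_remit("SSRI", False, "SSRI")` returns 0. The shared-noise assumption is a modeling choice; a different noise structure would give a different counterfactual from the same observational data.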

In plain language: For this specific patient, the model can estimate what their L1 response would have been under Mirtazapine instead of Sertraline — using everything we now know about their LatentType from the full treatment sequence. That's the answer the retrospective review actually needs.

The Bayes Server file below encodes the DAG and the conditional probability tables described above. Each observable node has a corresponding U-node — the exogenous noise variable that absorbs residual variation — which is what makes Rung 3 counterfactual abduction possible. The CPTs are populated with clinically defensible illustrative priors; the qualitative behavior they encode is what makes the failure mode visible when running Rung 1 versus Rung 2 queries on the same data.

TRDepressionSequencing.bayes
Sequential model with explicit latent type node. Each treatment failure updates the posterior on the latent depression subtype, which in turn changes the recommended next agent. The standard ML approach of treating prior treatments as covariates introduces collider bias on the latent — the SCM structure makes the correct computation tractable.