The same dataset. The same question. Three rungs of Pearl's Ladder give three different answers — and the Rung 1 answer, the one that every predictive model produces, is sometimes exactly backwards. The difference is not which tool is more sophisticated. It is which question each tool is capable of answering.
The Three Rungs
Seeing — Association
“Given that I observe X, what can I expect Y to be?”
Formally: P(Y | X). The conditional probability of Y given that X is observed in the data. This is the question every statistical and machine learning model answers. Dashboards, regressions, neural networks, and language models all operate here. The answer is a statement about patterns in historical data — what tends to co-occur with what.
Requires: data. Nothing more.
Doing — Intervention
“What would happen to Y if I changed X?”
Formally: P(Y | do(X)). The probability of Y given that X is set to a specific value — forced, regardless of the natural process that normally determines it. The do-operator is Pearl's notation for this: it severs all causal arrows into X, simulating a perfect intervention. The result is the distribution of Y in the world where X has been deliberately controlled, not merely observed.
Requires: a causal model that encodes the mechanism connecting X to Y — not just the correlation. Data alone cannot provide this.
Imagining — Counterfactual
“Given that Y occurred, would it have been different if X had been different?”
Formally: P(Y_{x′} = y′ | X = x, Y = y). The probability that the outcome would have been y′ had X been set to x′, given that outcome y was actually observed under condition x. This is the counterfactual: reasoning about a world that did not occur, while conditioning on a world that did. Legal attribution, post-mortem accountability, and individual-level causal inference all require Rung 3.
Requires: a causal model plus the individual's unobserved factors — the fingerprint abducted from their actual outcome. Population-level causal models are insufficient.
| Rung | Question type | Formal expression | What it requires | What cannot answer it |
|---|---|---|---|---|
| 1 — Seeing | Association / prediction | P(Y \| X) | Data | Nothing — any model reaches Rung 1 |
| 2 — Doing | Intervention / policy | P(Y \| do(X)) | Causal model (graph + CPTs) | Any purely statistical or ML model |
| 3 — Imagining | Counterfactual / attribution | P(Y_{x′} = y′ \| X = x, Y = y) | Causal model + individual fingerprint (abduction) | Any model without unobserved factors |
Rung 1 — and Why It Misleads
A bank is reviewing its churn model. The model flags customers who have made support calls as higher churn risk: 38.3% of callers churn, versus a 25% baseline. The natural inference is that support calls are a warning signal — or worse, that they cause churn. A product team proposes reducing support accessibility to lower churn.
The causal model reveals the mechanism. Dissatisfied customers are both more likely to call support (75% vs 30% of satisfied customers) and more likely to churn. The correlation between support calls and churn is not causal — it is a consequence of a shared upstream cause, dissatisfaction. The support call is not the problem; it is a symptom of the problem.
P(Churn | Support Call = true) = 38.3%. Customers who call are more likely to churn. This is a true statement about the data. Treated as a causal claim — which Rung 1 cannot justify — it produces a recommendation that would accelerate churn by eliminating the one intervention that is actually reducing it.
P(Churn | do(Support Call = true)) = 17.1%. The do-operator severs the causal arrow from Dissatisfaction to Support Call: it simulates a world where support calls are assigned independently of customer satisfaction, as in a randomised trial. Dissatisfaction reverts to its population prior of 30%, and churn propagates through the mechanism to 17.1%. Measured against the 25% baseline, the causal effect of support calls is −7.9 percentage points: protective, not harmful. The 21-point gap between the Rung 1 figure (38.3%) and the Rung 2 figure (17.1%) is pure confounding, and it reverses the business decision.
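The two computations can be sketched on a three-node model (Dissatisfaction → Support Call, Dissatisfaction → Churn, Support Call → Churn). The 30% dissatisfaction prior and the 75% / 30% call rates come from the text above. The churn table `P_CHURN` is hypothetical, since the article does not list it, so this sketch illustrates the sign flip and does not reproduce the 38.3% and 17.1% figures.

```python
P_DISSATISFIED = 0.30                   # population prior (from the text)
P_CALL = {True: 0.75, False: 0.30}      # P(call | dissatisfied?) (from the text)

# ASSUMPTION: illustrative churn probabilities keyed by (dissatisfied, called).
# Calling lowers churn within each satisfaction group.
P_CHURN = {
    (True, True): 0.50, (True, False): 0.70,
    (False, True): 0.05, (False, False): 0.10,
}


def p_dissatisfied(d: bool) -> float:
    return P_DISSATISFIED if d else 1.0 - P_DISSATISFIED


def p_call(c: bool, d: bool) -> float:
    return P_CALL[d] if c else 1.0 - P_CALL[d]


def churn_given_call(c: bool) -> float:
    """Rung 1: P(churn | call = c), conditioning on the observed joint."""
    num = sum(p_dissatisfied(d) * p_call(c, d) * P_CHURN[(d, c)] for d in (True, False))
    den = sum(p_dissatisfied(d) * p_call(c, d) for d in (True, False))
    return num / den


def churn_do_call(c: bool) -> float:
    """Rung 2: P(churn | do(call = c)).

    The factor P(call | dissatisfied) is dropped, so dissatisfaction
    keeps its population prior instead of being updated by the call.
    """
    return sum(p_dissatisfied(d) * P_CHURN[(d, c)] for d in (True, False))


print(f"P(churn | call)       = {churn_given_call(True):.3f}")
print(f"P(churn | no call)    = {churn_given_call(False):.3f}")
print(f"P(churn | do(call))   = {churn_do_call(True):.3f}")
print(f"P(churn | do(no call)) = {churn_do_call(False):.3f}")
```

With these assumed values, callers churn more often than non-callers in the observed data, while forcing a call lowers churn relative to forcing no call. The only difference between the two functions is whether the call-assignment factor stays in the sum.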
When the sign flips: Simpson's Paradox
This pattern — confounding that reverses the apparent effect — is common enough to have a name. Simpson's Paradox occurs when a trend present in each subgroup of a dataset disappears or reverses when the groups are combined. A healthcare dataset shows patients receiving a drug have lower recovery rates overall. The Rung 1 conclusion: the drug is harmful. The causal reality: sicker patients are more likely to be prescribed the drug, which biases the unconditional association. When the confounder is blocked — when the analysis uses do(Drug) rather than observing Drug — the true causal effect is positive.
| Analysis | Method | Measured effect of drug on recovery |
|---|---|---|
| Rung 1 — Association | P(Recovery \| Drug = true) | −5% — drug appears harmful |
| Rung 2 — Intervention | P(Recovery \| do(Drug = true)) | +5% — drug is beneficial |
Same data. Same question. Opposite sign. A non-causal system that optimises on the Rung 1 answer recommends withholding the drug. The causal model recommends administering it. The only difference is whether the analysis blocks the backdoor path through the confounder.
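A minimal sketch of how blocking the backdoor path changes the sign, assuming illness severity is the only confounder. All probabilities below are hypothetical, chosen so that the output mirrors the −5 and +5 point figures in the table; they are not the dataset behind it. The Rung 2 function uses the standard backdoor adjustment, averaging the within-stratum recovery rates over the population distribution of severity.

```python
# ASSUMPTION: hypothetical probabilities, not the article's data.
P_SEVERITY = {"mild": 0.5, "severe": 0.5}
P_DRUG = {"mild": 0.375, "severe": 0.625}   # sicker patients get the drug more often
P_RECOVER = {                                # P(recovery | severity, drug)
    ("mild", True): 0.85, ("mild", False): 0.80,
    ("severe", True): 0.45, ("severe", False): 0.40,
}


def recovery_given_drug(drug: bool) -> float:
    """Rung 1: P(recovery | drug), pooled over severity as observed."""
    weight = {
        s: P_SEVERITY[s] * (P_DRUG[s] if drug else 1.0 - P_DRUG[s])
        for s in P_SEVERITY
    }
    total = sum(weight.values())
    return sum(weight[s] * P_RECOVER[(s, drug)] for s in weight) / total


def recovery_do_drug(drug: bool) -> float:
    """Rung 2 via backdoor adjustment: sum over s of P(recovery | s, drug) * P(s)."""
    return sum(P_SEVERITY[s] * P_RECOVER[(s, drug)] for s in P_SEVERITY)


associational = recovery_given_drug(True) - recovery_given_drug(False)
causal = recovery_do_drug(True) - recovery_do_drug(False)
print(f"Rung 1 difference: {associational:+.3f}")   # -0.050
print(f"Rung 2 difference: {causal:+.3f}")          # +0.050
```

Within each severity stratum the drug adds five points of recovery, yet the pooled comparison shows a five-point deficit because the treated group is weighted toward severe cases.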
Rung 2 — What It Enables
The do-operator is not a statistical adjustment. It is a structural operation on the causal graph. When you write do(X = x), you are specifying that X is set to x by an external intervention — regardless of the values of the variables that normally determine X. In the graph, this means removing all arrows into X and fixing its value. The downstream variables then propagate according to the remaining causal structure.
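The structural half of this operation, removing the arrows into X, can be sketched on a graph stored as a mapping from each node to its parents. The representation and the function name `do` are illustrative choices, not any particular library's API; the node names follow the churn example earlier in the article.

```python
from typing import Dict, Set

# A DAG stored as child -> set of parents.
Graph = Dict[str, Set[str]]


def do(graph: Graph, target: str) -> Graph:
    """Return the mutilated graph for do(target = value).

    Every arrow into `target` is removed; all other parent sets are
    copied unchanged, so the original graph is not modified.
    """
    if target not in graph:
        raise KeyError(f"unknown node: {target}")
    return {
        node: (set() if node == target else set(parents))
        for node, parents in graph.items()
    }


churn_graph: Graph = {
    "Dissatisfaction": set(),
    "SupportCall": {"Dissatisfaction"},
    "Churn": {"Dissatisfaction", "SupportCall"},
}

mutilated = do(churn_graph, "SupportCall")
print(mutilated["SupportCall"])       # set()
print(sorted(mutilated["Churn"]))     # ['Dissatisfaction', 'SupportCall']
```

Only the intervened node loses its parents. The arrow from SupportCall to Churn survives, which is why the intervention's effect still propagates downstream.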
This models what actually happens when an organization implements a policy. A policy does not merely observe the world at a particular value of X. It imposes that value, overriding the natural process that would otherwise determine it. A predictive model is calibrated on a world where the policy was not in place. The do-operator computes the distribution in the world where it is.
Control effectiveness: not “do losses correlate with the absence of this control” but “what is P(loss | do(control = implemented))” — the expected loss in the world where the control is forced in, blocking the confounders that determine which risks receive which controls.
Policy comparison: not “which customers who received this offer had the best outcomes” but “what is P(outcome | do(offer = A))” versus “P(outcome | do(offer = B))” — comparing the outcomes in the two worlds the policies would create, not the customers who happened to receive each one.
Risk attribution: not “which factors are correlated with this loss” but “what is the expected loss attributable to each factor” — how much of the outcome would change if that factor were set to its baseline value, with all other factors held at their natural values.
Nearly every consequential business decision is a Rung 2 question. Should we raise prices? Should we implement this control? Should we enter this market? Each is asking about the world that would result from an action — not about the patterns in the world as it has been observed.
| Decision question | Rung required | Why Rung 1 fails |
|---|---|---|
| Should we implement this control? | 2 — Doing | Controls are not randomly assigned — the correlation between controls and outcomes is confounded by risk severity. |
| Will this pricing change improve the loss ratio? | 2 — Doing | Pricing correlates with risk selection — the observed relationship between price and outcome is not the causal effect of changing price. |
| What happens to churn if we change the onboarding process? | 2 — Doing | Customers who go through different onboarding paths differ in ways that confound the association with retention. |
| How much does this risk factor contribute to expected loss? | 2 — Doing | Attributing expected loss to a specific factor requires P(loss \| do(factor = baseline)) — not the correlation between factor and loss. |
Rung 3 — What It Uniquely Requires
The Rung 3 computation requires an additional step that Rung 2 does not: abduction. Before asking what would have happened under a different condition, the model must determine who this individual is — what values of the unobserved variables are consistent with their actual observed outcome. These unobserved variables are their “fingerprint”: the individual-specific information that is not in the data but is implicitly encoded in what actually happened to them.
The three-step Rung 3 procedure:
Abduct
Enter the individual's observed evidence. The model updates the posterior over unobserved variables — inferring the individual's fingerprint from what is known about them. A customer who churned despite being satisfied is assigned a different unobserved fingerprint than a customer who churned from dissatisfaction.
Intervene
Apply the counterfactual intervention — set the hypothetical condition. Sever the causal arrows into the intervened variable. The individual's fingerprint (the abducted unobserved values) remains fixed. The intervention changes the observable condition; it does not change who the person is.
Predict
Compute the outcome under the counterfactual condition with the fingerprint held. The result is the probability that the outcome would have been different under the alternative history — not for a typical person with this individual's observed characteristics, but for this specific individual, given everything inferred about their unobserved state.
Why Rung 2 and Rung 3 give different numbers
A rejected loan applicant asks: “If my income had been $80,000 instead of $60,000, would I have been approved?”
| Rung | What it computes | Credit score | Change from actual |
|---|---|---|---|
| Rung 2 — Intervention | Average score for people like this applicant with $80K income. Unobserved factors free — averaged over the population. | 534 | +84 |
| Rung 3 — Counterfactual | Score for this applicant with $80K income. Unobserved factors fixed at their abducted values — this individual's fingerprint. | 476 | +26 |
The 58-point gap exists because this applicant has above-average debt and below-average payment history. Rung 2 averages these out across the population of people with this income level — it answers “what happens to a typical person at $80K.” Rung 3 holds them fixed — it answers “what happens to this person at $80K.” The Rung 2 answer suggests income growth alone would secure approval. The Rung 3 answer reveals it would not — the debt and payment history are the binding constraints, and the actionable advice is different.
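The gap can be reproduced with a one-equation linear SCM, score = INTERCEPT + SLOPE × income + U, where U lumps together debt, payment history, and anything else unobserved. The intercept (430) and slope (1.3 points per $1,000) are assumptions, reverse-engineered so that this sketch returns the figures in the table; the article does not state the underlying model. U is assumed to have population mean zero and to be independent of income. The applicant's actual score of 450 follows from the table (534 − 84 = 476 − 26 = 450).

```python
# ASSUMPTION: a linear, additive-noise structural equation with made-up
# coefficients chosen to match the table above:
#     score = INTERCEPT + SLOPE_PER_K * income_k + u
INTERCEPT = 430.0
SLOPE_PER_K = 1.3    # score points per $1,000 of income
MEAN_U = 0.0         # population mean of the unobserved term


def score(income_k: float, u: float) -> float:
    """Structural equation for the credit score."""
    return INTERCEPT + SLOPE_PER_K * income_k + u


def rung2_average_score(income_k: float) -> float:
    """E[score | do(income)]: U is averaged over the population."""
    return score(income_k, MEAN_U)


def rung3_counterfactual_score(
    actual_income_k: float, actual_score: float, new_income_k: float
) -> float:
    # 1. Abduct: solve the structural equation for this applicant's u.
    u = actual_score - score(actual_income_k, 0.0)
    # 2. Intervene: replace income with the hypothetical value.
    # 3. Predict: re-evaluate the equation with u held fixed.
    return score(new_income_k, u)


rung2 = rung2_average_score(80.0)
rung3 = rung3_counterfactual_score(60.0, 450.0, 80.0)
print(round(rung2), round(rung3), round(rung2 - rung3))   # 534 476 58
```

Under these assumptions the abducted U is −58: the applicant sits 58 points below the population average at any income, and that deficit is carried unchanged into the counterfactual world.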
Many explainable AI tools use the term “counterfactual” for a perturbation technique: tweak inputs until the model's output flips, and report the smallest change required. “If your income were $5,000 higher, the model would approve you.”
This is not a Rung 3 counterfactual. It is a Rung 1 observation about the model's input–output surface. It does not ask whether income causes approval or merely correlates with something that does. It does not hold the individual's fingerprint fixed. It does not compute what would actually have happened. It tells you how to game the model — which may have nothing to do with how to change the outcome in reality. The distinction matters in any context where the counterfactual claim has legal or governance consequences.
The Gap Is Structural
Pearl's result is precise: no statistical procedure applied to observational data can compute interventional or counterfactual distributions without additional assumptions about the causal structure. Those assumptions must come from outside the data — from domain knowledge, from randomised experiments, or from an explicit causal model. They cannot be learned from the data itself, because the data is a sample from the observational distribution, not the interventional one.
This has a direct implication for AI strategy:
- A larger language model is a larger Rung 1 system. It predicts text — it learns the conditional distribution of tokens given context. It cannot compute P(Y | do(X)) because it has no causal graph, no do-operator, and no mechanism connecting its internal representations to the causal structure of the world.
- A more accurate gradient boosted tree is a more accurate Rung 1 system. Higher AUC on a holdout set means better discrimination in the observational distribution. It does not mean better answers to interventional questions.
- An explanation layer on a Rung 1 model is a Rung 1 explanation. SHAP values tell you which features the model weighted most heavily. They do not tell you which variables causally influence the outcome, because the model they explain is a correlation model.
Ask: does this system answer P(Y | X) or P(Y | do(X))? If the system does not have a causal graph and a do-operator, it is computing the first — regardless of how it is marketed. If it cannot answer that question about itself, it is computing the first.
Rung 1 answers are appropriate for Rung 1 questions. They are wrong for Rung 2 and Rung 3 questions — not approximately wrong, not conservative estimates, but answers to a different question that can point in the opposite direction when confounding is present.
The Mathematical Object Behind the Ladder
Pearl's Ladder is a classification of questions. The question of what mathematical object satisfies the requirements at each rung is separate — and the answer is a Structural Causal Model.
An SCM has three components: a set of variables (endogenous variables V, determined within the model, and exogenous variables U, the background noise and individual variation); a directed acyclic graph encoding which variables cause which; and structural equations — one per endogenous variable — of the form Vi = fi(Pai, Ui), specifying the mechanism by which each variable is produced from its direct causes and its individual noise term.
The exogenous variables U are what makes Rung 3 computable. They encode everything the model does not explicitly represent — the individual fingerprint, the unobserved factors that determined what actually happened to this specific person, in this specific situation. A Bayesian network has no equivalent: it has conditional probability tables, not structural equations, and no explicit representation of background noise.
Rung 1: Marginalize or condition the joint distribution. Standard inference on the Bayesian network induced by the SCM. No special SCM machinery required.
Rung 2: Apply the do-operator — remove the structural equation for the intervened variable and replace it with a constant. The remaining equations propagate the effect. This is graph surgery: formally justified by the SCM, executable on the induced Bayesian network.
Rung 3: Abduct the individual's exogenous variables U from their observed outcome. Apply the counterfactual intervention. Predict the outcome with U held fixed. This requires the structural equations explicitly — a Bayesian network without SCM structure cannot do it.
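A minimal sketch of an SCM as a data structure, assuming deterministic structural equations of the form V_i = f_i(Pa_i, U_i) evaluated in topological order. The class, its method names, and the toy model X → Y are all hypothetical. Abduction is done by hand here by inverting the additive equations; in a general model it is a posterior inference over U. Rung 1 is omitted because it needs only the induced joint distribution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Values = Dict[str, float]
Equation = Callable[[Values, float], float]   # f_i(values so far, u_i)


@dataclass(frozen=True)
class SCM:
    order: List[str]                 # endogenous variables, topologically sorted
    equations: Dict[str, Equation]   # one structural equation per variable

    def solve(self, u: Values) -> Values:
        """Evaluate every V_i = f_i(Pa_i, U_i) for one unit's exogenous values."""
        v: Values = {}
        for name in self.order:
            v[name] = self.equations[name](v, u[name])
        return v

    def do(self, name: str, value: float) -> "SCM":
        """Rung 2 surgery: replace the equation for `name` with a constant."""
        patched = dict(self.equations)
        patched[name] = lambda _v, _u: value
        return SCM(self.order, patched)


# ASSUMPTION: a toy two-variable model, X -> Y, with additive noise.
toy = SCM(
    order=["X", "Y"],
    equations={
        "X": lambda v, u: u,
        "Y": lambda v, u: 2.0 * v["X"] + u,
    },
)

# Rung 3 for one unit observed at X = 1, Y = 5.
observed = {"X": 1.0, "Y": 5.0}
u_unit = {"X": observed["X"], "Y": observed["Y"] - 2.0 * observed["X"]}  # abduct
replayed = toy.solve(u_unit)                       # abducted U reproduces the facts
counterfactual = toy.do("X", 2.0).solve(u_unit)    # intervene, then predict
print(replayed["Y"], counterfactual["Y"])          # 5.0 7.0
```

The same `u_unit` is passed to both the original and the intervened model. That reuse of U across worlds is what a structure holding only conditional probability tables has no slot for.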
The practical boundary matters for risk decisions. Bayesian networks whose arrows carry a causal interpretation are sufficient for Rung 2: intervention planning, policy evaluation, and expected loss attribution. They are the working tool for most risk management applications. SCMs are required for Rung 3: individual attribution, regulatory defense of a specific decision, and post-mortem accountability. These are precisely the questions that regulators and courts ask.
The implication is not that every risk model must be rebuilt as a full SCM. It is that the risk function must know where the boundary is — and have the capability to cross it when the question requires it.
Grammatical mood encodes exactly this hierarchy. The indicative states facts: smoking is associated with cancer. The imperative commands or intervenes: reduce the dose. The subjunctive reasons about alternate histories: had the treatment been given earlier, the patient might have survived. Each mood corresponds precisely to a rung — and each requires a different mathematical object to formalize.
Question: What do we observe? · Formal object: P(Y | X)
Describes statistical relationships. Cannot distinguish correlation from causation. Where most machine learning — including large language models — operates.
Question: What happens if we act? · Formal object: P(Y | do(X))
The observational statement is “people who exercise have lower blood pressure”, a Rung 1 fact. The interventional question is “if we make someone exercise, does their blood pressure fall?”, a Rung 2 computation. The answers can differ, sometimes dramatically. Rung 2 questions are addressed by randomised controlled trials or by a causal model with the do-operator.
Question: What would have happened under a different history? · Formal object: P(Y_{x′} = y′ | X = x, Y = y)
A patient did not receive the drug and died. Would they have lived if they had received it? This question cannot be answered from population statistics alone — it requires a structural causal model that can replay an individual history under an alternative.
The same structure appears in AI. Statistical learning — neural networks, LLMs — operates on Rung 1. Reinforcement learning moves to Rung 2: it reasons about the effects of actions. Structural causal models occupy Rung 3: they can answer counterfactual questions about individuals and alternate histories. Most deployed AI systems remain at Rung 1.
Every model in production in your organization answers Rung 1 questions. The decisions those models inform require Rung 2 answers. The gap between them is computable — and consequential.
Pearl, J., 2009, Causality: Models, Reasoning, and Inference (2nd ed.), Cambridge University Press · Pearl, J. & Mackenzie, D., 2018, The Book of Why, Basic Books · Pearl, J., Glymour, M. & Jewell, N.P., 2016, Causal Inference in Statistics: A Primer, Wiley