Pearl's Ladder of Causality

The same dataset. The same question. Three rungs of Pearl's Ladder give three different answers — and the Rung 1 answer, the one that every predictive model produces, is sometimes exactly backwards. The difference is not which tool is more sophisticated. It is which question each tool is capable of answering.

The Three Rungs

Seeing — Association

“Given that I observe X, what can I expect Y to be?”

Formally: P(Y | X). The conditional probability of Y given that X is observed in the data. This is the question every statistical and machine learning model answers. Dashboards, regressions, neural networks, and language models all operate here. The answer is a statement about patterns in historical data — what tends to co-occur with what.

Requires: data. Nothing more.

Doing — Intervention

“What would happen to Y if I changed X?”

Formally: P(Y | do(X)). The probability of Y given that X is set to a specific value — forced, regardless of the natural process that normally determines it. The do-operator is Pearl's notation for this: it severs all causal arrows into X, simulating a perfect intervention. The result is the distribution of Y in the world where X has been deliberately controlled, not merely observed.

Requires: a causal model that encodes the mechanism connecting X to Y, not just the correlation. Observational data alone cannot provide this.

Imagining — Counterfactual

“Given that Y occurred, would it have been different if X had been different?”

Formally: P(Y_{x′} = y′ | X = x, Y = y). The probability that the outcome would have been y′ had X been x′, given that outcome y was actually observed under condition x. This is the counterfactual: reasoning about a world that did not occur, while conditioning on a world that did. Legal attribution, post-mortem accountability, and individual-level causal inference all require Rung 3.

Requires: a causal model plus the individual's unobserved factors — the fingerprint abducted from their actual outcome. Population-level causal models are insufficient.

Rung	Question type	Formal expression	What it requires	What cannot answer it
1 · Seeing	Association / prediction	P(Y | X)	Data	Nothing; any model reaches Rung 1
2 · Doing	Intervention / policy	P(Y | do(X))	Causal model (graph + CPTs)	Any purely statistical or ML model
3 · Imagining	Counterfactual / attribution	P(Y_{x′} | X = x, Y = y)	Causal model + individual fingerprint (abduction)	Any model without unobserved factors

Rung 1 — and Why It Misleads

A bank is reviewing its churn model. The model flags customers who have made support calls as higher churn risk: 38.3% of callers churn, versus a 25% baseline. The natural inference is that support calls are a warning signal — or worse, that they cause churn. A product team proposes reducing support accessibility to lower churn.

The causal model reveals the mechanism. Dissatisfied customers are both more likely to call support (75% vs 30% of satisfied customers) and more likely to churn. The correlation between support calls and churn is not causal — it is a consequence of a shared upstream cause, dissatisfaction. The support call is not the problem; it is a symptom of the problem.

The Rung 1 answer, precisely

P(Churn | Support Call = true) = 38.3%. Customers who call are more likely to churn. This is a true statement about the data. Treated as a causal claim — which Rung 1 cannot justify — it produces a recommendation that would accelerate churn by eliminating the one intervention that is actually reducing it.

The Rung 2 answer, precisely

P(Churn | do(Support Call = true)) = 17.1%. The do-operator severs the causal arrow from Dissatisfaction to Support Call: it simulates a world where support calls are assigned independently of customer satisfaction, as in a randomised trial. Dissatisfaction reverts to its population prior of 30%, and churn propagates through the mechanism to 17.1%. Measured against the 25% baseline, the causal effect of support calls is −7.9 percentage points: protective, not harmful. The 21-point gap between the Rung 1 figure (38.3%) and the Rung 2 figure (17.1%) is pure confounding, and it reverses the business decision.
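The difference between conditioning and intervening can be sketched on a three-node graph (Dissatisfaction → Support Call, Dissatisfaction → Churn, Support Call → Churn). The 30% dissatisfaction prior and the 75% / 30% call rates come from the text above. The churn probabilities in `P_CHURN` are hypothetical values chosen for illustration, so the output shows the same sign reversal but does not reproduce the 38.3% and 17.1% figures.

```python
# Hypothetical three-node model: D (dissatisfied) -> S (support call),
# D -> C (churn), S -> C.
P_D = 0.30                               # P(dissatisfied), from the text
P_CALL = {True: 0.75, False: 0.30}       # P(call | dissatisfied?), from the text
# ASSUMPTION: illustrative values for P(churn | dissatisfied, called).
P_CHURN = {
    (True, True): 0.50, (True, False): 0.70,
    (False, True): 0.05, (False, False): 0.10,
}


def p_d(d: bool) -> float:
    return P_D if d else 1.0 - P_D


def p_call(s: bool, d: bool) -> float:
    return P_CALL[d] if s else 1.0 - P_CALL[d]


def churn_given_call(s: bool) -> float:
    """Rung 1: P(churn | call = s). Conditioning also updates belief about D."""
    joint = {d: p_d(d) * p_call(s, d) for d in (True, False)}
    total = sum(joint.values())
    return sum(joint[d] / total * P_CHURN[(d, s)] for d in (True, False))


def churn_do_call(s: bool) -> float:
    """Rung 2: P(churn | do(call = s)). The arrow D -> S is cut, so D keeps its prior."""
    return sum(p_d(d) * P_CHURN[(d, s)] for d in (True, False))


obs_gap = churn_given_call(True) - churn_given_call(False)
do_gap = churn_do_call(True) - churn_do_call(False)
print(f"observed: {churn_given_call(True):.3f} vs {churn_given_call(False):.3f} (gap {obs_gap:+.3f})")
print(f"do():     {churn_do_call(True):.3f} vs {churn_do_call(False):.3f} (gap {do_gap:+.3f})")
```

Under these assumed numbers callers churn more often than non-callers in the observed data, while forcing a call lowers churn. The only change between the two functions is whether the distribution over D is updated by the call or left at its prior.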

When the sign flips: Simpson's Paradox

This pattern — confounding that reverses the apparent effect — is common enough to have a name. Simpson's Paradox occurs when a trend present in each subgroup of a dataset disappears or reverses when the groups are combined. A healthcare dataset shows patients receiving a drug have lower recovery rates overall. The Rung 1 conclusion: the drug is harmful. The causal reality: sicker patients are more likely to be prescribed the drug, which biases the unconditional association. When the confounder is blocked — when the analysis uses do(Drug) rather than observing Drug — the true causal effect is positive.

Analysis	Method	Measured effect of drug on recovery
Rung 1 · Association	P(Recovery | Drug = true)	−5% (drug appears harmful)
Rung 2 · Intervention	P(Recovery | do(Drug = true))	+5% (drug is beneficial)

Same data. Same question. Opposite sign. A non-causal system that optimises on the Rung 1 answer recommends withholding the drug. The causal model recommends administering it. The only difference is whether the analysis blocks the backdoor path through the confounder.
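The reversal can be reproduced with stratified counts. The numbers below are hypothetical (1,000 patients split by a single severity flag) and are chosen so that the drug adds five points of recovery within each stratum. They are not the dataset behind the table above, and the naive gap they produce is larger than −5%. The adjustment is valid only under the assumption that severity is the sole confounder.

```python
# ASSUMPTION: hypothetical counts, keyed by (severity, took_drug) -> (patients, recovered).
COUNTS = {
    ("mild", True): (100, 85),
    ("mild", False): (400, 320),
    ("severe", True): (400, 140),
    ("severe", False): (100, 30),
}
STRATA = ("mild", "severe")


def naive_rate(drug: bool) -> float:
    """Rung 1: pooled recovery rate among patients who did / did not take the drug."""
    n = sum(COUNTS[(s, drug)][0] for s in STRATA)
    r = sum(COUNTS[(s, drug)][1] for s in STRATA)
    return r / n


def adjusted_rate(drug: bool) -> float:
    """Rung 2 via backdoor adjustment: weight each stratum's rate by P(severity)."""
    total = sum(n for n, _ in COUNTS.values())
    rate = 0.0
    for s in STRATA:
        stratum_size = COUNTS[(s, True)][0] + COUNTS[(s, False)][0]
        n, r = COUNTS[(s, drug)]
        rate += (stratum_size / total) * (r / n)
    return rate


naive_effect = naive_rate(True) - naive_rate(False)
causal_effect = adjusted_rate(True) - adjusted_rate(False)
print(f"naive effect:    {naive_effect:+.2f}")
print(f"adjusted effect: {causal_effect:+.2f}")
```

In this sketch severe cases receive the drug four times as often as mild ones, which is what drags the pooled rate for treated patients below the untreated rate even though the drug helps in both strata.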

Rung 1 answers are not less accurate versions of Rung 2 answers. They are answers to a different question — one that can point in the opposite direction when confounding is present.

Rung 2 — What It Enables

The do-operator is not a statistical adjustment. It is a structural operation on the causal graph. When you write do(X = x), you are specifying that X is set to x by an external intervention — regardless of the values of the variables that normally determine X. In the graph, this means removing all arrows into X and fixing its value. The downstream variables then propagate according to the remaining causal structure.
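In a causal Bayesian network the surgery has a compact algebraic form, the truncated factorization given in Pearl (2009). Here pa_i denotes the values taken by the parents of V_i in the graph. When a set Z of observed variables satisfies the backdoor criterion for (X, Y), the same operation reduces to the adjustment formula on the second line.

```latex
P(v_1, \dots, v_n \mid do(X = x)) \;=\; \prod_{i \,:\, V_i \neq X} P(v_i \mid pa_i)
\quad \text{for } v \text{ consistent with } X = x, \text{ and } 0 \text{ otherwise}

P(y \mid do(X = x)) \;=\; \sum_{z} P(y \mid x, z)\, P(z)
```

The factor for X itself is the one that disappears from the product: that is the algebraic counterpart of deleting the arrows into X.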

This models what actually happens when an organization implements a policy. A policy does not merely observe the world at a particular value of X. It imposes that value, overriding the natural process that would otherwise determine it. A predictive model is calibrated on a world where the policy was not in place. The do-operator computes the distribution in the world where it is.

What Rung 2 makes computable that Rung 1 cannot

Control effectiveness: not “do losses correlate with the absence of this control” but “what is P(loss | do(control = implemented))” — the expected loss in the world where the control is forced in, blocking the confounders that determine which risks receive which controls.

Policy comparison: not “which customers who received this offer had the best outcomes” but “what is P(outcome | do(offer = A))” versus “P(outcome | do(offer = B))” — comparing the outcomes in the two worlds the policies would create, not the customers who happened to receive each one.

Risk attribution: not “which factors are correlated with this loss” but “what is the expected loss attributable to each factor” — how much of the outcome would change if that factor were set to its baseline value, with all other factors held at their natural values.

Nearly every consequential business decision is a Rung 2 question. Should we raise prices? Should we implement this control? Should we enter this market? Each is asking about the world that would result from an action — not about the patterns in the world as it has been observed.

Decision question	Rung required	Why Rung 1 fails
Should we implement this control?	2 — Doing	Controls are not randomly assigned — the correlation between controls and outcomes is confounded by risk severity.
Will this pricing change improve the loss ratio?	2 — Doing	Pricing correlates with risk selection — the observed relationship between price and outcome is not the causal effect of changing price.
What happens to churn if we change the onboarding process?	2 — Doing	Customers who go through different onboarding paths differ in ways that confound the association with retention.
How much does this risk factor contribute to expected loss?	2 — Doing	Attributing expected loss to a specific factor requires P(loss | do(factor = baseline)), not the correlation between factor and loss.

Q: Counterfactuals aren't testable. Isn't this philosophy, not science?
A: The counterfactual outcome itself isn't observable, by definition. The model that produces counterfactual estimates is fully testable: every prediction it makes about observable quantities can be checked against data.

Rung 3 — What It Uniquely Requires

The Rung 3 computation requires an additional step that Rung 2 does not: abduction. Before asking what would have happened under a different condition, the model must determine who this individual is — what values of the unobserved variables are consistent with their actual observed outcome. These unobserved variables are their “fingerprint”: the individual-specific information that is not in the data but is implicitly encoded in what actually happened to them.

The three-step Rung 3 procedure:

Abduct

Enter the individual's observed evidence. The model updates the posterior over unobserved variables — inferring the individual's fingerprint from what is known about them. A customer who churned despite being satisfied is assigned a different unobserved fingerprint than a customer who churned from dissatisfaction.

Intervene

Apply the counterfactual intervention — set the hypothetical condition. Sever the causal arrows into the intervened variable. The individual's fingerprint (the abducted unobserved values) remains fixed. The intervention changes the observable condition; it does not change who the person is.

Predict

Compute the outcome under the counterfactual condition with the fingerprint held. The result is the probability that the outcome would have been different under the alternative history — not for a typical person with this individual's observed characteristics, but for this specific individual, given everything inferred about their unobserved state.
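The three steps can be sketched on a deliberately small model in which the unobserved fingerprint U is one of four response types. The type names, their prior probabilities, and the assumption that the support call X is independent of U are all hypothetical choices made for illustration.

```python
from typing import Dict

# ASSUMPTION: hypothetical prior over four response types (the "fingerprint" U).
PRIOR_U = {"never": 0.50, "helped": 0.20, "hurt": 0.05, "always": 0.25}


def churns(call: bool, u: str) -> bool:
    """Structural equation Y = f(X, U): whether a customer of type u churns."""
    if u == "always":
        return True
    if u == "never":
        return False
    if u == "helped":          # churns only when no call takes place
        return not call
    return call                # "hurt": churns only when a call takes place


def abduct(call: bool, churned: bool) -> Dict[str, float]:
    """Step 1: posterior over U given the observed (X, Y)."""
    weights = {u: p for u, p in PRIOR_U.items() if churns(call, u) == churned}
    total = sum(weights.values())
    return {u: p / total for u, p in weights.items()}


def counterfactual_churn(obs_call: bool, obs_churned: bool, cf_call: bool) -> float:
    """Steps 2 and 3: force X = cf_call, keep the abducted U, and predict Y."""
    posterior = abduct(obs_call, obs_churned)
    return sum(p for u, p in posterior.items() if churns(cf_call, u))


# A customer who did not call and churned: would they have churned had they called?
p_still_churns = counterfactual_churn(obs_call=False, obs_churned=True, cf_call=True)
print(f"P(churn had they called | no call, churned) = {p_still_churns:.3f}")
```

Under the same assumed prior, the population-level P(churn | do(call)) is 0.30 (the "always" and "hurt" types), while the counterfactual for this customer is about 0.56. The evidence rules out two of the four types, and that is the whole effect of abduction in this sketch.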

Why Rung 2 and Rung 3 give different numbers

A rejected loan applicant asks: “If my income had been $80,000 instead of $60,000, would I have been approved?”

Rung	What it computes	Credit score	Change from actual
Rung 2 — Intervention	Average score for people like this applicant with $80K income. Unobserved factors free — averaged over the population.	534	+84
Rung 3 — Counterfactual	Score for this applicant with $80K income. Unobserved factors fixed at their abducted values — this individual's fingerprint.	476	+26

The 58-point gap exists because this applicant has above-average debt and below-average payment history. Rung 2 averages these out across the population of people with this income level — it answers “what happens to a typical person at $80K.” Rung 3 holds them fixed — it answers “what happens to this person at $80K.” The Rung 2 answer suggests income growth alone would secure approval. The Rung 3 answer reveals it would not — the debt and payment history are the binding constraints, and the actionable advice is different.
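One way to see where the two numbers come from is a linear structural equation, score = β · income + U, where U bundles the applicant's unobserved factors such as debt and payment history. The article does not state its model. The slope of 1.3 points per $1,000 and the population mean E[U] = 430 used below are hypothetical values back-solved so that the arithmetic reproduces the table; the actual score of 450 is implied by the table (534 − 84).

```python
BETA = 1.3          # ASSUMPTION: score points per $1,000 of income
MEAN_U = 430.0      # ASSUMPTION: population average of the unobserved term U


def score(income_k: float, u: float) -> float:
    """Structural equation: score = BETA * income (in $1,000s) + U."""
    return BETA * income_k + u


actual_income_k, actual_score = 60.0, 450.0
cf_income_k = 80.0

# Rung 2: set income to 80 and average over the population's U.
rung2 = score(cf_income_k, MEAN_U)

# Rung 3, step 1 (abduct): recover this applicant's U from what was observed.
u_applicant = actual_score - BETA * actual_income_k
# Steps 2 and 3 (intervene, predict): set income to 80, keep U fixed.
rung3 = score(cf_income_k, u_applicant)

print(f"Rung 2: {rung2:.0f} ({rung2 - actual_score:+.0f})")
print(f"Rung 3: {rung3:.0f} ({rung3 - actual_score:+.0f})")
print(f"applicant U vs population mean: {u_applicant - MEAN_U:+.0f}")
```

In this sketch the whole 58-point gap is the distance between the applicant's abducted U and the population mean. A real scoring model would rarely be linear, but the division of labour is the same: Rung 2 averages over U, Rung 3 pins it.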

What counts as a counterfactual — and what does not

Many explainable AI tools use the term “counterfactual” for a perturbation technique: tweak inputs until the model's output flips, and report the smallest change required. “If your income were $5,000 higher, the model would approve you.”

This is not a Rung 3 counterfactual. It is a Rung 1 observation about the model's input–output surface. It does not ask whether income causes approval or merely correlates with something that does. It does not hold the individual's fingerprint fixed. It does not compute what would actually have happened. It tells you how to game the model — which may have nothing to do with how to change the outcome in reality. The distinction matters in any context where the counterfactual claim has legal or governance consequences.

The Gap Is Structural

Pearl's result is precise: no statistical procedure applied to observational data can compute interventional or counterfactual distributions without additional assumptions about the causal structure. Those assumptions must come from outside the data — from domain knowledge, from randomised experiments, or from an explicit causal model. They cannot be learned from the data itself, because the data is a sample from the observational distribution, not the interventional one.

This has a direct implication for AI strategy:

A larger language model is a larger Rung 1 system. It predicts text — it learns the conditional distribution of tokens given context. It cannot compute P(Y | do(X)) because it has no causal graph, no do-operator, and no mechanism connecting its internal representations to the causal structure of the world.
A more accurate gradient boosted tree is a more accurate Rung 1 system. Higher AUC on a holdout set means better discrimination in the observational distribution. It does not mean better answers to interventional questions.
An explanation layer on a Rung 1 model is a Rung 1 explanation. SHAP values tell you which features the model weighted most heavily. They do not tell you which variables causally influence the outcome, because the model they explain is a correlation model.

The test for any AI system

Ask: does this system answer P(Y | X) or P(Y | do(X))? If the system does not have a causal graph and a do-operator, it is computing the first — regardless of how it is marketed. If it cannot answer that question about itself, it is computing the first.

Rung 1 answers are appropriate for Rung 1 questions. They are wrong for Rung 2 and Rung 3 questions — not approximately wrong, not conservative estimates, but answers to a different question that can point in the opposite direction when confounding is present.

Most organizational AI investment is in Rung 1. Most consequential decisions require Rung 2. Most governance accountability requires Rung 3. The gap between where investment sits and where questions live is structural — and it will not close as models get larger.

Professor Norman Fenton addresses the same gap in the video “Machine learning with ‘big data’: fundamental limitations and Pearl’s ‘ladder of causation’” (YouTube): machine learning trained on observational data, however large, is structurally confined to Rung 1, and Pearl’s ladder is the right map of why the higher rungs require a different kind of model.

The Mathematical Object Behind the Ladder

Pearl's Ladder is a classification of questions. The question of what mathematical object satisfies the requirements at each rung is separate — and the answer is a Structural Causal Model.

An SCM has three components: a set of variables (endogenous variables V, determined within the model, and exogenous variables U, the background noise and individual variation); a directed acyclic graph encoding which variables cause which; and structural equations — one per endogenous variable — of the form V_i = f_i(Pa_i, U_i), specifying the mechanism by which each variable is produced from its direct causes and its individual noise term.
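The three components can be sketched as a small data structure. The `SCM` class, its method names, and the three-variable example (Z → X, Z → Y, X → Y with uniform noise) are hypothetical and exist only to show how structural equations, exogenous noise, and graph order fit together.

```python
import random
from typing import Callable, Dict, List, Tuple

# Each endogenous variable maps to (parents, f) with V_i = f(parent_values, u_i).
Equation = Tuple[List[str], Callable[[Dict[str, float], float], float]]


class SCM:
    def __init__(self, equations: Dict[str, Equation]):
        # ASSUMPTION: the dict is listed in topological order (parents before children).
        self.equations = dict(equations)

    def sample_noise(self, rng: random.Random) -> Dict[str, float]:
        """Draw one exogenous term U_i per endogenous variable (uniform on [0, 1))."""
        return {name: rng.random() for name in self.equations}

    def evaluate(self, noise: Dict[str, float]) -> Dict[str, float]:
        """Solve the structural equations for a fixed noise vector U."""
        values: Dict[str, float] = {}
        for name, (parents, f) in self.equations.items():
            values[name] = f({p: values[p] for p in parents}, noise[name])
        return values

    def do(self, name: str, value: float) -> "SCM":
        """Graph surgery: replace one equation with a constant and drop its parents."""
        surgered = dict(self.equations)
        surgered[name] = ([], lambda _pa, _u: value)
        return SCM(surgered)


model = SCM({
    "Z": ([], lambda pa, u: float(u < 0.3)),
    "X": (["Z"], lambda pa, u: float(u < (0.75 if pa["Z"] else 0.30))),
    "Y": (["Z", "X"], lambda pa, u: 2.0 * pa["Z"] - 1.0 * pa["X"] + u),
})

noise = model.sample_noise(random.Random(0))
factual = model.evaluate(noise)
forced = model.do("X", 1.0).evaluate(noise)   # same U, X set by intervention
```

Keeping the noise vector separate from the equations is the design choice that matters here: the same U can be replayed through a modified set of equations, which a table of conditional probabilities does not allow.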

The exogenous variables U are what makes Rung 3 computable. They encode everything the model does not explicitly represent — the individual fingerprint, the unobserved factors that determined what actually happened to this specific person, in this specific situation. A Bayesian network has no equivalent: it has conditional probability tables, not structural equations, and no explicit representation of background noise.

How each rung maps to SCM operations

Rung 1: Marginalize or condition the joint distribution. Standard inference on the Bayesian network induced by the SCM. No special SCM machinery required.

Rung 2: Apply the do-operator — remove the structural equation for the intervened variable and replace it with a constant. The remaining equations propagate the effect. This is graph surgery: formally justified by the SCM, executable on the induced Bayesian network.

Rung 3: Abduct the individual's exogenous variables U from their observed outcome. Apply the counterfactual intervention. Predict the outcome with U held fixed. This requires the structural equations explicitly — a Bayesian network without SCM structure cannot do it.
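The Rung 3 recipe can be written in one line. Here e is the observed evidence about the individual, and Y_{x'}(u) is the value Y takes when the equation for X is replaced by X = x' and the exogenous variables are fixed at u. This is the standard formulation in Pearl (2009), stated for discrete U.

```latex
P(Y_{x'} = y' \mid e) \;=\; \sum_{u}
\underbrace{P(u \mid e)}_{\text{abduction}} \;
\underbrace{\mathbf{1}\!\left[\, Y_{x'}(u) = y' \,\right]}_{\text{action + prediction}}
```

The indicator is deterministic because, once u is fixed, the structural equations leave nothing random. Replacing P(u | e) with the prior P(u) recovers the Rung 2 quantity P(y' | do(X = x')).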

The practical boundary matters for risk decisions. Bayesian networks whose arrows encode causal direction are sufficient for Rung 2: intervention planning, policy evaluation, and expected loss attribution. They are the working tool for most risk management applications. SCMs are required for Rung 3: individual attribution, regulatory defense of a specific decision, and post-mortem accountability. These are precisely the questions that regulators and courts ask.

The implication is not that every risk model must be rebuilt as a full SCM. It is that the risk function must know where the boundary is — and have the capability to cross it when the question requires it.

Grammatical mood encodes exactly this hierarchy. The indicative states facts: smoking is associated with cancer. The imperative commands or intervenes: reduce the dose. The subjunctive reasons about alternate histories: had the treatment been given earlier, the patient might have survived. Each mood corresponds precisely to a rung — and each requires a different mathematical object to formalize.

Rung 1 · Association

Question: What do we observe? · Formal object: P(Y | X)
Describes statistical relationships. Cannot distinguish correlation from causation. Where most machine learning — including large language models — operates.

Rung 2 · Intervention

Question: What happens if we act? · Formal object: P(Y | do(X))
The observational finding “people who exercise have lower blood pressure” is a Rung 1 fact. The interventional question “if we make someone exercise, does their blood pressure fall?” is a Rung 2 computation. The answers can differ, sometimes dramatically. Rung 2 questions are addressed by randomised controlled trials or by a causal model with the do-operator.

Rung 3 · Counterfactual

Question: What would have happened under a different history? · Formal object: P(Y_{x′} | X = x, Y = y)
A patient did not receive the drug and died. Would they have lived if they had received it? This question cannot be answered from population statistics alone — it requires a structural causal model that can replay an individual history under an alternative.

The same structure appears in AI. Statistical learning — neural networks, LLMs — operates on Rung 1. Reinforcement learning moves to Rung 2: it reasons about the effects of actions. Structural causal models occupy Rung 3: they can answer counterfactual questions about individuals and alternate histories. Most deployed AI systems remain at Rung 1.

In the cases

Insurance · Property Insurance: The risk register operates at Rung 1. The board's question, which lever to pull, requires Rung 2. The rate increase question requires Rung 3.

Financial · Bank Churn: Churn analytics answers Rung 1. Whether the campaign prevented exits is a Rung 3 counterfactual, the distinction the page is built around.

Compliance · NIST CSF 2.0: The $4.2M allocation between Protect and Detect is a Rung 2 intervention query. The maturity assessment stays at Rung 1 and cannot answer it.

Next Step

Every model in production in your organization answers Rung 1 questions. The decisions those models inform require Rung 2 answers. The gap between them is computable — and consequential.

Pearl, J., 2009, Causality: Models, Reasoning, and Inference (2nd ed.), Cambridge University Press · Pearl, J. & Mackenzie, D., 2018, The Book of Why, Basic Books · Pearl, J., Glymour, M. & Jewell, N.P., 2016, Causal Inference in Statistics: A Primer, Wiley
