Confounders: adjust. Mediators: do not adjust. Colliders: never condition on. Moderators: interact.
The correct choice requires a causal graph. The wrong choice produces a precisely wrong conclusion — not an approximately right one.
Confounding
A confounder is a variable Z that causally affects both X (the treatment or intervention) and Y (the outcome). The classic structure: Z → X and Z → Y. Because Z causes both, X and Y will be correlated even if X has no causal effect on Y whatsoever.
Example: ice cream sales and drowning deaths are correlated. Neither causes the other. Temperature — the confounder — causes both. Any regression of drowning on ice cream sales that omits temperature will find a spurious positive coefficient.
In a causal graph, a confounder creates a back-door path from X to Y: a path that begins with an arrow pointing into X rather than out of it. The back-door criterion identifies sets of variables that, when adjusted for, block every such path without including any descendant of X. Such a set need not be unique or minimal, but any set that satisfies the criterion recovers the true causal effect.
Include Z in a regression, or stratify by Z, or use it to reweight the sample. Any of these closes the back-door path and removes the spurious correlation. Provided every back-door path has been closed and the adjustment model is correctly specified, the residual association between X and Y is the causal effect.
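The stratification route can be illustrated with a small simulation. This is a minimal sketch under an assumed data-generating process (the probabilities, effect sizes, and sample size are invented for illustration): Z drives both X and Y, and X has no effect on Y at all. Under these assumptions the naive comparison should land near 1.2 and the stratified one near zero.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical process: Z -> X and Z -> Y, while X has NO effect on Y.
rows = []
for _ in range(20000):
    z = random.random() < 0.5                   # confounder (e.g. a hot day)
    x = random.random() < (0.8 if z else 0.2)   # treatment depends on Z only
    y = 2.0 * z + random.gauss(0, 1)            # outcome depends on Z only
    rows.append((z, x, y))

def diff_in_means(data):
    """Mean of Y among X=1 minus mean of Y among X=0."""
    treated = [y for _, x, y in data if x]
    control = [y for _, x, y in data if not x]
    return mean(treated) - mean(control)

# Naive comparison: ignores Z, so the back-door path stays open.
naive = diff_in_means(rows)

# Adjusted comparison: stratify by Z, then average the stratum-specific
# differences, weighting each stratum by its share of the sample.
adjusted = 0.0
for level in (False, True):
    stratum = [r for r in rows if r[0] == level]
    adjusted += diff_in_means(stratum) * len(stratum) / len(rows)

print(f"naive: {naive:.2f}  adjusted for Z: {adjusted:.2f}")
```

The same weighted-average logic is what a regression with Z as a covariate approximates when the model is correctly specified.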
Mediation
A mediator is a variable M that lies on a causal path from X to Y: X → M → Y. The mediator transmits — or partially transmits — the effect of X on Y. If you include M in a regression, you block the indirect pathway and your estimate of X’s effect will be attenuated or eliminated — even though X genuinely causes Y.
Example: a training program (X) improves job performance (Y) partly by increasing employee confidence (M). A regression that controls for confidence will underestimate the training effect, because confidence is the mechanism by which training works. The “total effect” of training flows partly through confidence; controlling for confidence estimates only the “direct effect” that bypasses it.
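The gap between total and direct effect can be seen in a simulation of the training example. This is a sketch with invented numbers: training is randomised, it raises the chance of high confidence from 0.3 to 0.7, and the outcome gains 1.0 directly from training and 2.0 from confidence. Under these assumptions the total effect is 1.0 + 2.0 × 0.4 = 1.8, while conditioning on confidence leaves only the direct 1.0.

```python
import random
from statistics import mean

random.seed(3)

# Hypothetical process: X -> M -> Y plus a direct X -> Y arrow.
rows = []
for _ in range(20000):
    x = random.random() < 0.5                    # training, randomised
    m = random.random() < (0.7 if x else 0.3)    # confidence, raised by training
    y = 1.0 * x + 2.0 * m + random.gauss(0, 1)   # performance
    rows.append((m, x, y))

def diff_in_means(data):
    """Mean of Y among X=1 minus mean of Y among X=0."""
    treated = [y for _, x, y in data if x]
    control = [y for _, x, y in data if not x]
    return mean(treated) - mean(control)

# Total effect: do not condition on the mediator.
total = diff_in_means(rows)

# "Controlling for" the mediator: stratify by M and average the strata.
direct = 0.0
for level in (False, True):
    stratum = [r for r in rows if r[0] == level]
    direct += diff_in_means(stratum) * len(stratum) / len(rows)

print(f"total effect: {total:.2f}  after conditioning on M: {direct:.2f}")
```

The stratified number equals the direct effect here only because the sketch assumes nothing else confounds confidence and performance; with such a confounder, conditioning on M would also open a collider path.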
Mediation analysis — decomposing total effects into direct and indirect components — is valuable and legitimate. But it requires an explicit decision to decompose, not an inadvertent inclusion of mediators in a control set.
Most regression modellers include “relevant” variables without distinguishing confounders from mediators. If a variable is caused by X, it is a potential mediator or collider — not a confounder. Including it does not improve the estimate; it distorts it. The distinction between “caused by X” and “causes X” is structural, not statistical, and cannot be inferred from correlation coefficients.
Colliders: the hidden danger
Of the four causal structures, colliders are the most dangerous and the least intuitive. Confounders and mediators produce biased estimates if you get them wrong. A collider, conditioned on, creates an association that does not exist in the population — and every additional control variable added to a regression is a potential collider.
A collider is a variable C that is caused by both X and Y: X → C ← Y. Unlike confounders, colliders do not create an association between X and Y in the full population. But if you condition on C — by including it in a regression, or by selecting a sample where C has a particular value — you open a spurious path between X and Y.
The intuition: if you know that C occurred, and C is caused by both X and Y, then observing that X did not occur raises the probability that Y occurred, because something has to explain C. When X and Y both make C more likely, this creates a negative correlation between X and Y within the conditioned sample, even if X and Y are independent unconditionally.
Berkson’s Bias — a collider in clinical data
A hospital study finds that among hospitalised patients, patients with disease A are less likely to have disease B — suggesting A protects against B. But hospitalisation is a collider: you are hospitalised if you have A or B (or both). Conditioning on hospitalised patients opens the A → Hospitalisation ← B path and creates a spurious negative association. In the general population, A and B may be independent.
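A simulation makes the selection effect concrete. The sketch below assumes invented prevalences (10% for each disease, drawn independently) and a hypothetical admission rule: anyone with A or B is hospitalised, plus 2% of everyone else for unrelated reasons.

```python
import random

random.seed(1)

# Hypothetical population: diseases A and B are independent by construction.
people = []
for _ in range(100000):
    a = random.random() < 0.1
    b = random.random() < 0.1
    hospitalised = a or b or random.random() < 0.02   # collider: A -> H <- B
    people.append((a, b, hospitalised))

def rate_of_b(group, has_a):
    """Share of people with disease B among those with (or without) A."""
    selected = [b for a, b, _ in group if a == has_a]
    return sum(selected) / len(selected)

def risk_difference(group):
    """P(B | A) minus P(B | not A) within the group."""
    return rate_of_b(group, True) - rate_of_b(group, False)

population_diff = risk_difference(people)
hospital_diff = risk_difference([p for p in people if p[2]])

print(f"whole population: {population_diff:+.2f}")
print(f"hospitalised only: {hospital_diff:+.2f}")
```

In the full population the difference should sit near zero; among hospitalised patients it turns strongly negative, because a patient without A almost always needed B to be admitted.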
Never condition on a collider. Never include a variable in a regression or a filter if it is caused by both the treatment and the outcome, or by one cause of the treatment and one cause of the outcome. Whether a variable is a collider cannot be determined from data: it requires the causal graph.
Moderation
A moderator (also called an effect modifier) is a variable W that changes the magnitude of X’s effect on Y. The causal structure is: W modifies the X → Y relationship. This is represented in a regression as an interaction term: Y = β₀ + β₁X + β₂W + β₃XW + ε.
Example: a marketing campaign (X) increases sales (Y), but the effect is larger among customers under 35 (W=young) than among customers over 55. Age moderates the treatment effect. Unlike confounding, where omitting the confounder biases the average estimate, omitting a moderator averages over heterogeneous effects: the average estimate may be correct, but it can mislead you about whom to target.
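The interaction model can be checked against simulated data. In this sketch the coefficients are assumptions chosen for illustration (β₁ = 1.0, β₂ = 0.5, β₃ = 2.0), with a randomised campaign and a binary age flag. The effect of X is then β₁ = 1.0 when W = 0 and β₁ + β₃ = 3.0 when W = 1, and the pooled estimate should fall near their average of 2.0.

```python
import random
from statistics import mean

random.seed(2)

# Hypothetical process: Y = 0.5*W + 1.0*X + 2.0*X*W + noise.
rows = []
for _ in range(40000):
    w = random.random() < 0.5      # moderator: True = younger customer
    x = random.random() < 0.5      # campaign exposure, randomised
    y = 0.5 * w + 1.0 * x + 2.0 * x * w + random.gauss(0, 1)
    rows.append((w, x, y))

def effect(data):
    """Mean of Y among exposed minus mean of Y among unexposed."""
    exposed = [y for _, x, y in data if x]
    unexposed = [y for _, x, y in data if not x]
    return mean(exposed) - mean(unexposed)

pooled = effect(rows)                                   # averages over W
effect_older = effect([r for r in rows if not r[0]])    # estimates beta_1
effect_younger = effect([r for r in rows if r[0]])      # estimates beta_1 + beta_3

print(f"pooled: {pooled:.2f}  W=0: {effect_older:.2f}  W=1: {effect_younger:.2f}")
```

The pooled figure is a fair average but hides the threefold difference between the two groups, which is the information a targeting decision needs.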
In a Bayesian network, moderation is encoded directly in the conditional probability table: P(Y | X, W) will show different conditional distributions for different values of W. The BN naturally represents effect heterogeneity that a regression model requires explicit interaction terms to capture.
The practical rule
The practical rule is disarmingly simple: draw the causal graph before choosing your control set. The graph makes the distinction between confounders, mediators, and colliders unambiguous. Without it, the choice of what to adjust for is a guess — and a wrong guess produces conclusions that are precisely wrong rather than approximately right.
The formal tool is d-separation: given a causal DAG, two variables are d-separated by a set Z if and only if every path between them is blocked by Z, and variables that are d-separated by Z are independent conditional on Z. A path is blocked if it contains a non-collider that is in Z, or a collider that is not in Z and has no descendants in Z. The back-door criterion uses d-separation to identify valid adjustment sets.
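A d-separation check is short enough to write by hand. The sketch below uses the standard ancestral moral graph construction, which is equivalent to the path-blocking definition, rather than enumerating paths. The function name, the `parents` dictionary layout, and the toy graphs are assumptions made for illustration, and the function expects x and y to be outside the conditioning set.

```python
from itertools import combinations

def d_separated(parents, x, y, given=frozenset()):
    """Return True if x and y are d-separated by `given` in the DAG.

    `parents` maps each node to the set of its direct causes.
    """
    given = set(given)
    # 1. Keep only x, y, the conditioning set, and all of their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralise: link each node to its parents and link co-parents together.
    neighbours = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for a, b in combinations(ps, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    # 3. Treat conditioned nodes as removed, then search for any route x -> y.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in given:
            continue
        seen.add(node)
        stack.extend(neighbours[node])
    return True

confounder = {"X": {"Z"}, "Y": {"Z"}}           # Z -> X, Z -> Y
collider = {"C": {"X", "Y"}, "D": {"C"}}        # X -> C <- Y, C -> D

print(d_separated(confounder, "X", "Y"))           # open back-door path
print(d_separated(confounder, "X", "Y", {"Z"}))    # closed by adjusting for Z
print(d_separated(collider, "X", "Y"))             # blocked by the collider
print(d_separated(collider, "X", "Y", {"D"}))      # opened via a descendant
```

Note that this tests every path, causal ones included. To apply the back-door criterion with it, one would first delete the arrows leaving X, so that only back-door paths remain, and also check that the adjustment set contains no descendant of X.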
For practitioners who cannot build a full causal graph, the minimum viable version is a list of variables and a few questions for each: does this variable cause the treatment, does it cause the outcome, is it caused by the treatment, and is it caused by the outcome? The answers distinguish confounders (cause both treatment and outcome), mediators (caused by the treatment, cause the outcome), and colliders (caused by both treatment and outcome) well enough to avoid the most damaging errors.
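The checklist can be written down as a small function. This is only a sketch: the function name, the four yes/no arguments, and the wording of the advice are assumptions, and the answers themselves still have to come from domain knowledge rather than from the data.

```python
def classify(causes_treatment, causes_outcome,
             caused_by_treatment, caused_by_outcome):
    """Map yes/no answers about one variable to a role and a default action."""
    if caused_by_treatment and caused_by_outcome:
        return "collider: do not condition on it"
    if caused_by_treatment and causes_outcome:
        return "mediator: leave out unless decomposing the effect"
    if causes_treatment and causes_outcome:
        return "confounder: adjust for it"
    return "unclassified: consult the causal graph"

# The three running examples from the text.
examples = {
    "temperature": classify(True, True, False, False),
    "confidence": classify(False, True, True, False),
    "hospitalisation": classify(False, False, True, True),
}

for name, role in examples.items():
    print(f"{name}: {role}")
```

Anything that falls through to the last branch, such as a variable that causes only the outcome, is where the full graph earns its keep.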
Adjust for the right variables or introduce bias that no amount of data can correct. A structured elicitation session maps your model’s causal structure.
info@rung3.ai