Why Current Tools Fail

The standard analytical tools produce outputs that look like causal claims. The text shaped as a recommendation, the heat-map that ranks risks, the score that says “Customer A is more likely to churn than Customer B” — each is delivered in the format a decision-maker would expect from a causal answer. The structural problem is that none of them is one.

Each is a Rung 1 object — a description of the data’s associational structure — offered in response to a Rung 2 or Rung 3 question. The reader cannot tell, from the output, that the substitution has occurred. That invisibility is the failure mode this hub names.

The substitution problem

The pattern is consistent across tooling categories and decision contexts:

A stakeholder asks a Rung 2 question: “What should we do about churn?”
The team responds with a Rung 1 tool: a churn classifier trained on historical data.
The classifier predicts which customers are likely to churn — an associational pattern.
The output is read as a recommendation: focus retention efforts on the predicted-churn customers.
The interventions launched on that basis often fail to reduce churn, because the classifier identified who tends to churn, not which interventions reduce churn.

The classifier’s answer and the causal answer are different objects. The first describes what tends to happen in observed data, P(Y | X); the second predicts what happens under an action, P(Y | do(X)). They are formally different distributions and cannot be substituted for each other, even when the surface text reads the same. See Pearl’s ladder of causation for the foundational distinction; see What Data Cannot Tell You for the gap that no amount of data closes.
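The difference between the two objects can be made concrete with a toy model. The sketch below is illustrative only: the variables (`disengaged`, `outreach`, `churn`) and every probability in it are assumptions invented for the example, not figures from any real dataset. It computes both quantities exactly by enumeration, so no sampling noise is involved.

```python
# Toy structural model. All names and numbers are hypothetical.
P_DISENGAGED = 0.3
P_OUTREACH = {True: 0.8, False: 0.1}  # P(outreach=1 | disengaged)
P_CHURN = {                            # P(churn=1 | disengaged, outreach)
    (True, 0): 0.70, (True, 1): 0.60,
    (False, 0): 0.10, (False, 1): 0.05,
}

def p_disengaged(d):
    return P_DISENGAGED if d else 1 - P_DISENGAGED

def p_outreach(x, d):
    return P_OUTREACH[d] if x == 1 else 1 - P_OUTREACH[d]

def churn_given_outreach(x):
    """Rung 1: P(churn=1 | outreach=x), the quantity a classifier estimates."""
    joint = sum(p_disengaged(d) * p_outreach(x, d) * P_CHURN[(d, x)]
                for d in (True, False))
    marginal = sum(p_disengaged(d) * p_outreach(x, d) for d in (True, False))
    return joint / marginal

def churn_do_outreach(x):
    """Rung 2: P(churn=1 | do(outreach=x)), outreach set by decision."""
    return sum(p_disengaged(d) * P_CHURN[(d, x)] for d in (True, False))

assoc_gap = churn_given_outreach(1) - churn_given_outreach(0)
causal_gap = churn_do_outreach(1) - churn_do_outreach(0)
print(f"observed gap:       {assoc_gap:+.3f}")
print(f"interventional gap: {causal_gap:+.3f}")
```

In this toy world the observed gap is about +0.32 while the interventional gap is -0.065: customers who received outreach churn more often in the data, yet outreach lowers churn for every customer. The sign flips because outreach was mostly sent to disengaged customers.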

The substitution is silent because the output text is identical. The stakeholder hears “focus retention efforts on these customers” from both a causal model and a churn classifier. To the team, that text means whatever the underlying model computed. To the stakeholder, it means a recommendation. The two readings diverge invisibly.

Why predictive ML fails harder than statistics

Statistical inference has at least a century of awareness of the causal-vs-associational distinction. A statistician applying regression to observational data knows the result is an association unless the design supports a causal interpretation. Even when the warning is ignored, the awareness is present in the field’s vocabulary.

Predictive machine learning has fewer such guardrails. Three reasons:

The framing is predictive by default. “Predict whether the customer will churn” is a Rung 1 task; ML treats it as the canonical task. The stakeholder’s actual question — what should we do? — is reframed as predict the outcome under the status quo, and the substitution is built into the framing.
Feature importance is read as causal. SHAP values, gradient importance, partial dependence plots — each tool produces output that looks like a causal attribution. A feature with high importance is read as “this drives the outcome.” The feature is a predictor, not a cause; the framing is misread by stakeholders and increasingly by practitioners.
Counterfactual recommendations are inferred from associational models. “If we change this customer’s tenure, the prediction changes” is treated as “if we change this customer’s tenure, the outcome changes.” The first is a property of the model; the second is a property of the world. Predictive ML does not distinguish them.
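The gap between a property of the model and a property of the world can be sketched in a few lines. The example assumes a hypothetical world in which dissatisfaction causes both support tickets and churn, while tickets have no effect on churn; the feature name `tickets`, the probabilities, and the per-group churn rate standing in for a fitted predictor are all illustrative assumptions.

```python
import random

def simulate(n, seed, force_tickets=None):
    """Draw (tickets, churn) pairs from a hypothetical world.

    Dissatisfaction causes both support tickets and churn;
    tickets themselves have no effect on churn.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        dissatisfied = rng.random() < 0.3
        u_tickets, u_churn = rng.random(), rng.random()
        tickets = int(u_tickets < (0.8 if dissatisfied else 0.1))
        if force_tickets is not None:
            tickets = force_tickets  # intervention: overwrite the feature
        churn = int(u_churn < (0.6 if dissatisfied else 0.1))
        rows.append((tickets, churn))
    return rows

def churn_rate(rows, tickets=None):
    """Mean churn, optionally restricted to one value of the feature."""
    selected = [c for t, c in rows if tickets is None or t == tickets]
    return sum(selected) / len(selected)

observed = simulate(20_000, seed=0)

# Property of the model: the simplest predictor fitted to observed data
# (the per-group churn rate) changes its output when the feature changes.
model_gap = churn_rate(observed, tickets=1) - churn_rate(observed, tickets=0)

# Property of the world: forcing the feature leaves churn untouched.
world_gap = (churn_rate(simulate(20_000, seed=0, force_tickets=1))
             - churn_rate(simulate(20_000, seed=0, force_tickets=0)))

print(f"prediction changes by {model_gap:+.3f}; outcome changes by {world_gap:+.3f}")
```

The feature is a strong predictor, so any importance measure computed on the predictor would rank it highly, yet setting it by intervention changes nothing in this world. A more flexible model would reproduce the same predictive gap more accurately.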

The result is that organizations deploy predictive models against questions the models cannot answer, receive outputs that look like answers, and run interventions that fail. The failure is then attributed to data quality or model performance, when the issue was structural from the outset. See What Data Cannot Tell You for the full treatment of why no amount of data fixes the substitution.

The risk matrix as a worked example

Risk matrices are among the most widely deployed instruments for what is treated as causal reasoning in operational risk, audit, and governance. They are also among the least examined. The standard risk matrix scores each risk on likelihood and impact, ranks them, and presents the rank-ordered list to the board.

The instrument has the structural properties of a causal claim — risks named, prioritized, recommended for action — without any of the machinery. Specifically:

Independent scoring. Each risk is scored without reference to the others. Cascading risks — where one event triggers a second, which triggers a third — are scored as if they were unconnected. The matrix does not represent the causal connections between rows.
No mechanism representation. The matrix records that a risk exists and rates it. It does not record how the risk is produced, by what controls it is moderated, or what other variables affect its likelihood. A board asking “which controls reduce which risks?” has no answer in the matrix — the structure is not there.
No interaction effects. Two risks that are individually mild but compound under specific conditions are invisible to the matrix. Risk-of-risk is not a row in the matrix; it is a property of the (un-modeled) interaction.
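The three omissions above can be illustrated with two hypothetical worlds that produce the same matrix. The sketch assumes likelihood is expressed as a probability and the score is likelihood times impact; real matrices often use ordinal scales, and the risk names and numbers here are invented for illustration.

```python
def matrix_row(name, likelihood, impact):
    """One risk-matrix row: the score is likelihood times impact."""
    return (name, likelihood, impact, likelihood * impact)

def world(p_a, p_b_given_a, p_b_given_not_a, impact_a=4, impact_b=4):
    """Return the matrix rows and P(A and B) for a two-risk world."""
    # The matrix only ever sees the marginal likelihood of B.
    p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a
    rows = [matrix_row("A", p_a, impact_a), matrix_row("B", p_b, impact_b)]
    p_both = p_a * p_b_given_a
    return rows, p_both

# Independent risks: B occurs 10% of the time regardless of A.
rows_indep, both_indep = world(0.1, 0.1, 0.1)

# Cascading risks: A almost always triggers B; B is rare otherwise.
rows_cascade, both_cascade = world(0.1, 0.9, 0.01 / 0.9)

for (name, lik, imp, score), (_, lik2, _, score2) in zip(rows_indep, rows_cascade):
    print(f"risk {name}: independent score {score:.2f}, cascading score {score2:.2f}")
print(f"P(both): independent {both_indep:.3f}, cascading {both_cascade:.3f}")
```

Both worlds yield identical rows, so the matrix cannot tell them apart, while the probability of both risks materializing together differs by roughly a factor of nine. The dependency lives in the conditional probabilities, which the matrix has no place to record.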

For the worked failure modes and the cascading-failure example, see The Risk Matrix Problem. The matrix is the canonical example of an instrument that looks like causal reasoning to a board because the framing is causal, but that is, structurally, a list ordered by two scores. The board’s questions about which controls to fund, in which order, for which risks, are not answerable from the matrix — not because the matrix is poorly built, but because the answers require structure the matrix does not contain.

Why scale doesn’t fix it

One response to the critique is that the tools work in principle but currently lack scale — better predictive models, larger datasets, more refined dashboards will close the gap. This response misreads the failure mode.

The substitution problem is not that the Rung 1 answer approximates the Rung 2/3 answer and larger models would close the remaining distance. The Rung 1 answer is a different object from the Rung 2/3 answer. The gap is mathematical, not a matter of approximation error. Larger models compute the Rung 1 object more accurately; they do not converge to the Rung 2/3 object as they scale.
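One standard way to state the gap uses the back-door adjustment formula. The symbol Z is introduced here for illustration: it is assumed to be a set of observed variables that satisfies the back-door criterion relative to X and Y, which is a best case, since the interventional quantity may not be identifiable at all when no such set is observed.

```latex
\begin{align*}
P(Y \mid X = x) &= \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z \mid X = x) \\
P(Y \mid \mathrm{do}(X = x)) &= \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)
\end{align*}
```

The two expressions differ only in the weighting term, P(Z = z | X = x) versus P(Z = z). More data sharpens the estimate of every term on the right-hand side, but which weighting applies is settled by the causal structure, and that structure is an assumption supplied from outside the data.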

This applies equally to large language models. The argument is developed in detail on the Why not an LLM? page: a language model learns P(token | context), which is a Rung 1 distribution; P(Y | do(X)) — the intervention distribution — is not derivable from it, no matter how much text the model is trained on. A larger language model is a larger Rung 1 system.

The same structural argument applies to predictive ML, risk matrices, scoring models, and dashboards. Each is a Rung 1 instrument. Each can be improved at its native task — better predictions, sharper scores, cleaner visualizations — without becoming any more capable of answering the Rung 2 or Rung 3 question the stakeholder originally asked. The substitution is invisible at every scale.

What the alternative looks like

The alternative is not a better predictive model. It is a different category of artefact: a structural causal model that represents the mechanism explicitly, supports intervention queries by construction, and produces answers that can be defended against challenge.
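A minimal sketch of what “supports intervention queries by construction” can mean is given below. It is not the architecture described on this page: the `SCM` class, its `do` and `mean` methods, and the churn model with its probabilities are all hypothetical, chosen only to show an intervention implemented as the replacement of one structural equation.

```python
import random

def _constant(value):
    """Structural equation that ignores its parents and returns a fixed value."""
    return lambda values, rng: value

class SCM:
    """Tiny structural causal model (illustrative only).

    `equations` maps each variable name to a function of the values
    computed so far and a random source, listed in causal order.
    """
    def __init__(self, equations):
        self.equations = dict(equations)

    def do(self, **fixed):
        """Return a new model in which the fixed variables are set by intervention."""
        equations = dict(self.equations)
        for name, value in fixed.items():
            equations[name] = _constant(value)
        return SCM(equations)

    def sample(self, rng):
        values = {}
        for name, equation in self.equations.items():
            values[name] = equation(values, rng)
        return values

    def mean(self, target, n=20_000, seed=0):
        rng = random.Random(seed)
        return sum(self.sample(rng)[target] for _ in range(n)) / n

# Hypothetical mechanism: engagement drives both outreach and churn,
# and outreach lowers churn by ten percentage points.
model = SCM({
    "engaged": lambda v, rng: int(rng.random() < 0.5),
    "outreach": lambda v, rng: int(rng.random() < (0.1 if v["engaged"] else 0.7)),
    "churn": lambda v, rng: int(
        rng.random() < 0.5 - 0.3 * v["engaged"] - 0.1 * v["outreach"]
    ),
})

effect = model.do(outreach=1).mean("churn") - model.do(outreach=0).mean("churn")
print(f"estimated effect of outreach on churn: {effect:+.3f}")
```

Because the mechanism is written down explicitly, the intervention query is answered by editing the model rather than by filtering data, and each structural equation is a claim that can be inspected and challenged on its own.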

The architecture has three parts:

Causal memory. The organization’s causal knowledge, made explicit and structured. See the Causal Memory hub for the framing.
The Library. The system that holds the structural models, supports composition, tracks scope, and audits provenance. See the Library hub for the architecture.
The procedure. The discovery process that gets an engagement started — auditing data, categorizing questions by rung, beginning the SCM development cycle. See the Procedure page for the three-phase process.

For the deeper position on why no amount of better predictive modeling closes the gap, and what specifically does, see What Data Cannot Tell You. For the LLM-specific version of the argument, see Why not an LLM? The position is the same in each case: the substitution is structural; the alternative is structural.

Next Step

If your team is using a predictive model, scoring system, or risk matrix to answer questions that turn out to require causal reasoning, the discovery procedure surfaces which of your current outputs actually answer the question that was asked and which substitute a Rung 1 answer for it. A half-day workshop completes the audit for one decision area.

info@rung3.ai
