The Procedure
Three steps. Each corresponds to a rung of Pearl’s Ladder of Causation. The standard KPI process stops at step one, then uses the result to make step-two decisions. That is not merely a limitation; it is a systematic source of incorrect outcomes. Goodhart’s Law names the mechanism precisely: optimizing a Rung 1 proxy is not the same as intervening on the Rung 2 cause.
Start at Rung 2, not Rung 1
The standard process selects a KPI by asking: what correlates with the outcome we want? That is a Rung 1 question. The right question is: what can we intervene on that causes the outcome? That requires drawing the causal graph before selecting a metric.
For every candidate KPI, trace the causal path from the metric to the outcome. If no such path exists — if the metric and the outcome share a common cause that the intervention does not reach — the KPI is a Rung 1 proxy dressed as a Rung 2 target. It will break.
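As a minimal sketch of the path-tracing step, the causal graph can be written as an adjacency map and each candidate checked for a directed path to the outcome. The graph, the node names (`customer_need`, `weekly_logins`, `onboarding_completed`, `product_usage`, `retention`), and the helper `has_directed_path` are all hypothetical illustrations, not taken from any real system.

```python
from collections import deque

def has_directed_path(graph, source, target):
    """Return True if `target` is reachable from `source` along cause -> effect edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False

# Hypothetical causal graph; each key lists the variables it directly causes.
causal_graph = {
    "customer_need": ["weekly_logins", "retention"],  # common cause of metric and outcome
    "onboarding_completed": ["product_usage"],
    "product_usage": ["retention"],
}

candidates = ["weekly_logins", "onboarding_completed"]
verdicts = {
    kpi: has_directed_path(causal_graph, kpi, "retention") for kpi in candidates
}

for kpi, causal in verdicts.items():
    label = "has a causal path to the outcome" if causal else "Rung 1 proxy only"
    print(f"{kpi}: {label}")
```

In this toy graph `weekly_logins` correlates with retention only through the shared cause, so pushing logins up would not be expected to move retention. The check is only as good as the graph it is given; drawing that graph is the real work.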
Test whether the proxy survives intervention
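For any candidate metric, ask: can an agent hit this metric without producing the outcome? If there is any mechanism by which the metric moves while the outcome does not, that mechanism will be found under optimization pressure. The back-door criterion formalizes this: the observed association between metric and outcome can be read as a causal effect only if every back-door path between them (every path that enters the metric through one of its own causes) is blocked. An unblocked back-door path means part of the association comes from a shared cause, so the metric can be moved directly while the outcome stays where it is.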
In practice: ask the people closest to the work to describe how they would hit the target if they had to. If the first answer they give does not produce the outcome, the KPI is fragile.
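The same test can be sketched as a small simulation, under invented assumptions: a hidden common cause (`skill`) drives both a metric and an outcome, and there is no arrow from metric to outcome. All names and noise levels here are hypothetical. Observationally the two move together; when the metric is forced to a value, the outcome does not follow.

```python
import random

def simulate(n, force_metric=None, seed=0):
    """Draw (metric, outcome) pairs from a toy model with a hidden common cause."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        skill = rng.gauss(0, 1)                # hidden common cause
        metric = skill + rng.gauss(0, 0.5)     # e.g. activity logged
        if force_metric is not None:
            metric = force_metric              # do(metric = value): hitting the number directly
        outcome = skill + rng.gauss(0, 0.5)    # depends on skill only, not on metric
        rows.append((metric, outcome))
    return rows

def mean(values):
    return sum(values) / len(values)

def correlation(rows):
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Rung 1: what the dashboard sees.
observed_corr = correlation(simulate(5000))

# Rung 2: what happens to the outcome when the metric is pushed from 0 to 3.
outcome_high = mean([y for _, y in simulate(5000, force_metric=3.0)])
outcome_low = mean([y for _, y in simulate(5000, force_metric=0.0)])
forcing_effect = outcome_high - outcome_low

print(f"observational correlation: {observed_corr:.2f}")
print(f"effect of forcing the metric: {forcing_effect:.2f}")
```

A strong correlation alongside a zero forcing effect is the signature of a fragile KPI. In real settings the forcing step is an experiment rather than a simulation, because the true structure is not known in advance.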
Specify the counterfactual audit condition
A KPI that cannot be audited counterfactually cannot be used to hold anyone accountable. Before locking a KPI, specify the Rung 3 question that will be asked at review: would the outcome have been better if we had intervened differently? That question should be answerable, which means the KPI design must be specific enough about the mechanism that the counterfactual is computable.
If the counterfactual question cannot be stated precisely, the KPI is not specific enough to be useful as an accountability measure.
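To make "computable" concrete, here is a minimal sketch of the standard three-step counterfactual calculation (abduction, action, prediction) on a deliberately simple model. It assumes a linear structural equation, outcome = baseline + effect_per_unit × treatment + noise, with known coefficients; the function name, the coefficients, and the numbers are hypothetical.

```python
def counterfactual_outcome(observed_treatment, observed_outcome,
                           alt_treatment, baseline, effect_per_unit):
    """Counterfactual outcome for one unit under an assumed linear structural model."""
    # Abduction: recover the unit-specific noise consistent with what was observed.
    noise = observed_outcome - (baseline + effect_per_unit * observed_treatment)
    # Action and prediction: swap in the alternative treatment, keep the same noise.
    return baseline + effect_per_unit * alt_treatment + noise

# Hypothetical review: an account received 10 units of the intervention and produced 130.
# Audit question: what would it have produced with 15 units?
actual = 130.0
what_if = counterfactual_outcome(
    observed_treatment=10.0,
    observed_outcome=actual,
    alt_treatment=15.0,
    baseline=100.0,
    effect_per_unit=2.0,
)
print(f"actual: {actual}, counterfactual: {what_if}, difference: {what_if - actual}")
```

The calculation is trivial once the mechanism is written down. If the structural equation cannot be written down at all, the audit question has no answer, which is the point of specifying it before the KPI is locked.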
KPIs by Rung
Every KPI belongs to exactly one rung. Treating a Rung 1 KPI as if it were Rung 2 is the most common failure in performance management.
| | Rung 1: Association | Rung 2: Intervention | Rung 3: Counterfactual |
|---|---|---|---|
| Question | What is happening? | What works? | What works for whom? |
| Examples | Conversion rate, churn rate, loss ratio, DAU/MAU | Incremental lift from campaign, ATE from A/B test, price elasticity | Uplift per customer segment, individual treatment effect, policy value |
| Valid use | Monitoring, anomaly detection, hypothesis generation | Program evaluation, resource allocation, budget decisions | Targeting, personalization, individual accountability |
| What it requires | Historical data | A/B test, natural experiment, or causal model | Structural causal model with individual-level U variables |
| Breaks when | Used as a target | Distribution shifts outside test conditions | Causal structure changes |
The practical implication: a KPI read directly off a dashboard is Rung 1. It is appropriate for monitoring. It is not appropriate for evaluation, that is, for answering whether a program, a team, or an intervention is working.
The Pipeline
A well-designed KPI system is not three separate dashboards. It is a closed-loop decision system in which each rung feeds the next.
The loop closes at Rung 3: each Rung 3 decision produces new observations that update the Rung 1 baseline, which generates new hypotheses for Rung 2. A KPI system that operates only at Rung 1 produces dashboards. One that reaches Rung 3 produces decisions, and it compounds over time as the causal model improves.
When KPIs Break
A KPI that was correctly designed can still break — not through gaming, but through structural change. Three mechanisms:
The common cause that held the metric and outcome together stops holding. A market shift, a regulatory change, a competitive move — any of these can sever the relationship between a metric and the outcome it was tracking. The metric continues to move. The outcome no longer follows. The KPI looks healthy. The business isn’t.
A Rung 2 KPI validated in an A/B test may stop being valid when rolled out at scale — because the test population was not representative, or because the mechanism operates differently at different volumes. The causal effect estimated in the test does not hold in the deployment distribution.
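A rough way to see the representativeness problem is to reweight segment-level effects by the population mix. The sketch below assumes, hypothetically, that each segment's effect is stable and only the mix changes between test and rollout; the segment names, effects, and proportions are invented. It does not cover the second mechanism, where the effect itself changes with volume.

```python
def average_effect(segment_effects, population_mix):
    """Population-average effect as a mix-weighted sum of per-segment effects."""
    total = sum(population_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("population mix must sum to 1")
    return sum(share * segment_effects[seg] for seg, share in population_mix.items())

# Hypothetical per-segment effects on conversion (absolute change).
segment_effects = {"smb": 0.05, "enterprise": -0.02}

test_mix = {"smb": 0.8, "enterprise": 0.2}     # who was in the A/B test
deploy_mix = {"smb": 0.2, "enterprise": 0.8}   # who the rollout actually reaches

effect_in_test = average_effect(segment_effects, test_mix)
effect_in_deployment = average_effect(segment_effects, deploy_mix)

print(f"effect estimated in test: {effect_in_test:+.3f}")
print(f"effect in deployment:     {effect_in_deployment:+.3f}")
```

With these made-up numbers the test reports a positive effect and the rollout delivers a negative one, without any segment behaving differently from how it behaved in the test.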
The organization changes — a new product, a new channel, an acquisition, a team reorganization — and the causal graph that justified the KPI no longer reflects the system being managed. The KPI was valid for the old structure. It is measuring something different in the new one.
The governance question
KPI maintenance requires someone to own the causal graph — to be accountable for knowing whether the causal structure that justified each metric still holds. In most organizations that person does not exist or is not named. The practical consequence is that KPIs accumulate without review, metrics that no longer track outcomes are optimized by teams who have no way of knowing they’ve drifted, and the gap between performance and reported performance widens invisibly.
The minimum viable governance structure is: for each KPI, name the causal claim it embeds, the conditions under which that claim holds, and the person responsible for reviewing it when those conditions change. That is not a technical exercise. It is a management one.
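One possible shape for that record, sketched as a small data structure. The field names, the example KPI, and the 180-day review interval are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class KpiRecord:
    name: str
    causal_claim: str          # what the KPI asserts causes what
    holds_while: list[str]     # conditions under which the claim is believed to hold
    owner: str                 # person accountable for reviewing the claim
    last_reviewed: date

    def needs_review(self, today, changed_conditions, max_age_days=180):
        """Flag the KPI if it is stale or if any supporting condition has changed."""
        stale = (today - self.last_reviewed).days > max_age_days
        triggered = any(cond in changed_conditions for cond in self.holds_while)
        return stale or triggered

# Hypothetical entry in a KPI register.
record = KpiRecord(
    name="onboarding completion rate",
    causal_claim="completing onboarding increases 90-day retention",
    holds_while=["pricing model unchanged", "single product line"],
    owner="head of customer success",
    last_reviewed=date(2024, 1, 10),
)

flag_after_change = record.needs_review(date(2024, 3, 1), {"pricing model unchanged"})
flag_without_change = record.needs_review(date(2024, 3, 1), set())
print(flag_after_change, flag_without_change)
```

The code is the easy part. The register is only useful if someone actually reports when a listed condition has changed, which is the management exercise described above.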
Anti-Patterns
Four failure modes, each with a recognition criterion:
| Anti-pattern | What it looks like | Recognition criterion |
|---|---|---|
| Attribution as causation | A campaign’s “attributed revenue” is used to evaluate whether the campaign worked | You cannot say what revenue would have occurred without the campaign |
| Proxy optimization | The metric moves but the outcome doesn’t follow — and no one knows why | The people hitting the target can describe how they did it without mentioning the underlying outcome |
| Selection bias in evaluation | Program effectiveness is measured on participants — who were selected because they were likely to succeed | The comparison group does not exist or was not randomly assigned |
| Average effect applied to individuals | An average treatment effect from an A/B test is used to make decisions about specific customers or accounts | The decision does not distinguish between customers who benefit from the intervention and those who don’t |
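The last anti-pattern can be illustrated with invented A/B test counts, split by a hypothetical segment label. The overall average effect is positive, yet one segment is harmed. Segment-level uplift from a randomized test is still an average within each segment, not an individual treatment effect, so this is only a first step toward the Rung 3 column.

```python
# Hypothetical A/B results as (conversions, users) per arm and segment.
results = {
    "new_customers":      {"treated": (120, 1000), "control": (80, 1000)},
    "existing_customers": {"treated": (90, 1000),  "control": (100, 1000)},
}

def rate(arm):
    conversions, users = arm
    return conversions / users

def pooled(arm_name):
    """Combine one arm across all segments into a single (conversions, users) pair."""
    conversions = sum(seg[arm_name][0] for seg in results.values())
    users = sum(seg[arm_name][1] for seg in results.values())
    return conversions, users

# Average treatment effect across everyone in the test.
overall_effect = rate(pooled("treated")) - rate(pooled("control"))

# Uplift within each segment.
segment_uplift = {
    name: rate(seg["treated"]) - rate(seg["control"]) for name, seg in results.items()
}

print(f"overall effect: {overall_effect:+.3f}")
for name, uplift in segment_uplift.items():
    print(f"{name}: {uplift:+.3f}")
```

Rolling the intervention out to every account on the strength of the overall figure would, in this toy data, lower conversion among existing customers.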
A KPI audit maps each of your current metrics to a rung, identifies which ones embed causal claims they can’t support, and specifies the causal model needed to replace them with measures that hold under optimization. Thirty minutes to identify the first one.