Three steps. Each corresponds to a rung of Pearl’s Ladder. The standard KPI process stops at the first rung and then uses the result to make second-rung decisions. That is not a minor limitation. It is a systematic source of incorrect outcomes. Goodhart’s Law describes the mechanism: when a measure becomes a target, it ceases to be a good measure. In causal terms, optimizing a Rung 1 proxy is not the same as intervening on the Rung 2 cause.

1. Start at Rung 2, not Rung 1

The standard process selects a KPI by asking: what correlates with the outcome we want? That is a Rung 1 question. The right question is: what can we intervene on that causes the outcome? That requires drawing the causal graph before selecting a metric.

For every candidate KPI, trace the causal path from the metric to the outcome. If no such path exists (for example, because the metric and the outcome are linked only through a common cause that the intervention does not reach), the KPI is a Rung 1 proxy dressed as a Rung 2 target. It will break under optimization.
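
The path check can be sketched in a few lines once the graph is written down. This is a minimal illustration, not a full identification procedure: the graph, the node names, and the helper names `has_directed_path` and `classify_kpi` are all hypothetical, and the graph is assumed to be a plain mapping from each cause to its direct effects.

```python
from collections import deque

# Hypothetical causal graph: each key is a cause, each list holds its direct effects.
CAUSAL_GRAPH = {
    "customer_need": ["support_tickets_closed", "retention"],
    "onboarding_quality": ["retention"],
    "support_tickets_closed": [],
    "retention": [],
}


def has_directed_path(graph, source, target):
    """Return True if following cause -> effect edges leads from source to target."""
    queue = deque([source])
    seen = {source}
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False


def classify_kpi(graph, metric, outcome):
    """Label a candidate KPI by whether a causal path reaches the outcome."""
    if has_directed_path(graph, metric, outcome):
        return "rung 2 candidate"
    return "rung 1 proxy"


tickets_label = classify_kpi(CAUSAL_GRAPH, "support_tickets_closed", "retention")
onboarding_label = classify_kpi(CAUSAL_GRAPH, "onboarding_quality", "retention")
```

In this made-up graph, tickets closed and retention share a common cause but no arrow connects them, so the first call labels the metric a proxy. The check is only as good as the graph it is given.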

2. Test whether the proxy survives intervention

For any candidate metric, ask: can an agent hit this metric without producing the outcome? If there is any mechanism by which the metric moves while the outcome does not, that mechanism will be found under optimization pressure. The back-door criterion formalizes this: the observed association between metric and outcome reflects a causal effect only when every back-door path between them (every path that enters the metric through an arrow pointing into it, such as a shared common cause) is blocked. If such a path stays open, moving the metric directly can leave the outcome untouched.

In practice: ask the people closest to the work to describe how they would hit the target if they had to. If the first answer they give does not produce the outcome, the KPI is fragile.
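
A small simulation makes the gap between observing and intervening concrete. It is a sketch under stated assumptions: a hypothetical `service_quality` variable drives both the metric and the outcome, the metric has no arrow into the outcome, and all coefficients and noise levels are invented for illustration.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate(n, intervene, seed=0):
    """Return the metric -> outcome slope in a world with a shared common cause.

    Hypothetical structure: service_quality -> metric, service_quality -> outcome,
    and no arrow from metric to outcome.
    """
    rng = random.Random(seed)
    metric, outcome = [], []
    for _ in range(n):
        service_quality = rng.gauss(0, 1)
        if intervene:
            # do(metric): an agent sets the metric directly, bypassing quality.
            m = rng.gauss(0, 1)
        else:
            m = service_quality + rng.gauss(0, 0.5)
        y = service_quality + rng.gauss(0, 0.5)
        metric.append(m)
        outcome.append(y)
    return slope(metric, outcome)


observed_slope = simulate(5000, intervene=False)
intervened_slope = simulate(5000, intervene=True)
```

Under these assumptions the observational slope is strongly positive while the slope under intervention sits near zero. That is the signature of a proxy that will not survive being targeted.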

3. Specify the counterfactual audit condition

A KPI that cannot be audited counterfactually cannot be held accountable. Before locking a KPI, specify the Rung 3 question that will be asked at review: would the outcome have been better if we had intervened differently? That question should be answerable — which means the KPI design must be specific enough about the mechanism that the counterfactual is computable.

If the counterfactual question cannot be stated precisely, the KPI is not specific enough to be useful as an accountability measure.
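
What "computable" means can be shown with the standard three-step counterfactual recipe: abduction, action, prediction. The sketch below assumes a deliberately simple one-equation linear model. The class name `LinearSCM`, its coefficients, and the observed values are hypothetical. A real audit would need a model that has been validated against interventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearSCM:
    """Hypothetical one-equation model: outcome = baseline + effect * action + u."""

    baseline: float
    effect: float

    def abduct(self, action, outcome):
        """Step 1 (abduction): recover the unit-specific noise u from what was observed."""
        return outcome - self.baseline - self.effect * action

    def predict(self, action, u):
        """Step 3 (prediction): evaluate the outcome equation with u held fixed."""
        return self.baseline + self.effect * action + u

    def counterfactual(self, observed_action, observed_outcome, alt_action):
        """Outcome this unit would have had under alt_action.

        Step 2 (action) is the swap of observed_action for alt_action.
        """
        u = self.abduct(observed_action, observed_outcome)
        return self.predict(alt_action, u)


# Invented numbers: a 10% discount was given and a retention score of 0.62 was seen.
model = LinearSCM(baseline=0.5, effect=0.3)
would_have_been = model.counterfactual(
    observed_action=0.1, observed_outcome=0.62, alt_action=0.2
)
```

If the KPI design does not pin down the equivalent of `baseline`, `effect`, and the action variable, there is nothing to abduct from and the review question has no answer.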

Every KPI belongs to exactly one rung. Treating a Rung 1 KPI as if it were Rung 2 is the most common failure in performance management.

| | Rung 1: Association | Rung 2: Intervention | Rung 3: Counterfactual |
|---|---|---|---|
| Question | What is happening? | What works? | What works for whom? |
| Examples | Conversion rate, churn rate, loss ratio, DAU/MAU | Incremental lift from campaign, ATE from A/B test, price elasticity | Uplift per customer segment, individual treatment effect, policy value |
| Valid use | Monitoring, anomaly detection, hypothesis generation | Program evaluation, resource allocation, budget decisions | Targeting, personalisation, individual accountability |
| What it requires | Historical data | A/B test, natural experiment, or causal model | Structural causal model with individual-level U variables |
| Breaks when | Used as a target | Distribution shifts outside test conditions | Causal structure changes |

The practical implication: Every dashboard KPI is Rung 1. It is appropriate for monitoring. It is not appropriate for evaluation — for answering whether a program, a team, or an intervention is working. Using it for evaluation is the most common structural error in performance management.

A well-designed KPI system is not three separate dashboards. It is a closed-loop decision system in which each rung feeds the next.

Rung 1 — Observe
Detect the opportunity

Churn spikes in segment A. Loss ratio deteriorates in coastal accounts. Conversion drops in a specific funnel step.

Output: a hypothesis worth testing

Rung 2 — Intervene
Validate the intervention

Test the retention offer. Adjust the underwriting guideline. Change the funnel step. Measure incremental effect, not level.

Output: a causal estimate of the effect
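
"Incremental effect, not level" can be illustrated with a difference in rates between randomized arms. This is a minimal sketch: the function name `incremental_lift` and the counts are made up, and the interval uses a plain normal approximation that assumes large, randomly assigned groups.

```python
import math


def incremental_lift(control_conversions, control_n, treated_conversions, treated_n):
    """Difference in conversion rates with a normal-approximation 95% interval."""
    p_c = control_conversions / control_n
    p_t = treated_conversions / treated_n
    lift = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treated_n)
    return lift, (lift - 1.96 * se, lift + 1.96 * se)


# Made-up counts for illustration only.
lift, (low, high) = incremental_lift(
    control_conversions=400,
    control_n=5000,
    treated_conversions=460,
    treated_n=5000,
)
```

The treated arm's level (9.2% here) says nothing on its own. The quantity that supports a Rung 2 decision is the lift over control and whether its interval excludes zero.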

Rung 3 — Personalise
Optimize targeting

Only offer the discount to customers with positive expected uplift. Only apply the control to the accounts where it changes the outcome for that specific risk profile.

Output: individual-level decisions
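
A targeting rule of this kind might look like the sketch below. It estimates uplift per segment from a randomized test and keeps only segments with positive uplift. The record layout, segment names, and counts are assumptions for illustration. Segment-level uplift is a conditional average, so it only approximates the individual-level quantity Rung 3 asks about.

```python
def segment_uplift(records):
    """Estimate uplift per segment as treated rate minus control rate.

    records: iterable of (segment, treated, retained) tuples from a
    randomized test. The field layout is an assumption for this sketch.
    """
    counts = {}
    for segment, treated, retained in records:
        arms = counts.setdefault(segment, {True: [0, 0], False: [0, 0]})
        arms[treated][0] += int(retained)  # retained count
        arms[treated][1] += 1              # arm size
    uplift = {}
    for segment, arms in counts.items():
        (t_hits, t_n), (c_hits, c_n) = arms[True], arms[False]
        if t_n and c_n:  # both arms are needed for a comparison
            uplift[segment] = t_hits / t_n - c_hits / c_n
    return uplift


def segments_to_target(uplift, min_uplift=0.0):
    """Keep only segments whose estimated uplift exceeds the threshold."""
    return sorted(s for s, u in uplift.items() if u > min_uplift)


def make_records(segment, treated_retained, control_retained, n=10):
    """Build toy rows: n treated and n control customers per segment."""
    rows = []
    for treated, hits in ((True, treated_retained), (False, control_retained)):
        rows += [(segment, treated, i < hits) for i in range(n)]
    return rows


records = (
    make_records("price_sensitive", 8, 5)
    + make_records("loyal", 9, 9)
    + make_records("discount_averse", 6, 8)
)
uplift = segment_uplift(records)
targets = segments_to_target(uplift)
```

In this toy data only the price-sensitive segment is targeted. The loyal segment would have stayed anyway, and the discount-averse segment does worse when offered the discount.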

The pipeline is closed-loop: each Rung 3 decision produces new observations that update the Rung 1 baseline, which generates new hypotheses for Rung 2. A KPI system that operates only at Rung 1 produces dashboards. One that reaches Rung 3 produces decisions — and compounds over time as the causal model improves.

A KPI that was correctly designed can still break — not through gaming, but through structural change. Three mechanisms:

Confounder drift

The common cause that held the metric and outcome together stops holding. A market shift, a regulatory change, a competitive move — any of these can sever the relationship between a metric and the outcome it was tracking. The metric continues to move. The outcome no longer follows. The KPI looks healthy. The business isn’t.
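
One way to notice this kind of drift is to compare the metric-to-outcome relationship in a recent window against a baseline window. The sketch below is a rough heuristic, not a statistical test: the function name `relationship_drifted`, the tolerance, and the data are hypothetical, and it assumes paired metric and outcome observations are available for both windows.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def relationship_drifted(baseline, recent, tolerance=0.5):
    """Flag drift when the recent slope differs from the baseline slope
    by more than `tolerance` times the baseline slope's magnitude.

    baseline, recent: lists of (metric, outcome) pairs.
    """
    base = slope(*zip(*baseline))
    now = slope(*zip(*recent))
    return abs(now - base) > tolerance * abs(base)


# Invented data: the outcome used to follow the metric, then stopped.
baseline_window = [(x, 2.0 * x) for x in range(1, 6)]
stable_window = [(x, 2.0 * x + 1.0) for x in range(1, 6)]
drifted_window = [(x, 0.1 * x + 8.0) for x in range(1, 6)]

still_holds = not relationship_drifted(baseline_window, stable_window)
has_drifted = relationship_drifted(baseline_window, drifted_window)
```

A check like this needs the outcome to be measured independently of the metric. If only the metric is on the dashboard, the drift stays invisible.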

Distribution shift

A Rung 2 KPI validated in an A/B test may stop being valid when rolled out at scale — because the test population was not representative, or because the mechanism operates differently at different volumes. The causal effect estimated in the test does not hold in the deployment distribution.
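
The representativeness problem can be worked through with simple arithmetic. The sketch assumes segment-level effects are themselves stable and only the segment mix differs between test and rollout. The segment names, effects, and shares are invented. It does not cover the case where the mechanism itself changes with volume.

```python
def reweighted_effect(segment_effects, population_shares):
    """Average segment-level effects using a population's segment mix."""
    total = sum(population_shares.values())
    return sum(
        segment_effects[segment] * share / total
        for segment, share in population_shares.items()
    )


# Hypothetical per-segment lifts measured in the test.
segment_effects = {"new": 0.06, "established": 0.01}

# Hypothetical segment mixes: the test over-sampled new customers.
test_mix = {"new": 0.7, "established": 0.3}
rollout_mix = {"new": 0.2, "established": 0.8}

effect_in_test = reweighted_effect(segment_effects, test_mix)
effect_at_rollout = reweighted_effect(segment_effects, rollout_mix)
```

With these numbers the average effect falls from about 4.5 points in the test to about 2 points at rollout, with no change in how the intervention works for any individual segment.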

Structural change

The organization changes — a new product, a new channel, an acquisition, a team reorganization — and the causal graph that justified the KPI no longer reflects the system being managed. The KPI was valid for the old structure. It is measuring something different in the new one.

The governance question

KPI maintenance requires someone to own the causal graph — to be accountable for knowing whether the causal structure that justified each metric still holds. In most organizations that person does not exist or is not named. The practical consequence is that KPIs accumulate without review, metrics that no longer track outcomes are optimized by teams who have no way of knowing they’ve drifted, and the gap between performance and reported performance widens invisibly.

The minimum viable governance structure is: for each KPI, name the causal claim it embeds, the conditions under which that claim holds, and the person responsible for reviewing it when those conditions change. That is not a technical exercise. It is a management one.
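
The three items can be held in a record as small as the one sketched below. The field names, the 180-day review window, and the example entry are assumptions for illustration. The point is only that the causal claim, its conditions, and its owner are written down somewhere that can be checked.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class KpiRecord:
    """One governed KPI: the claim it embeds, when it holds, and who owns it."""

    name: str
    causal_claim: str
    holds_while: List[str]
    owner: str
    last_reviewed: date


def needs_review(record, today, max_age_days=180):
    """Flag a KPI with no named owner or a stale review date."""
    if not record.owner.strip():
        return True
    return today - record.last_reviewed > timedelta(days=max_age_days)


# Hypothetical registry entry.
retention_offer_lift = KpiRecord(
    name="retention_offer_lift",
    causal_claim="Offering the discount raises 90-day retention in segment A",
    holds_while=["pricing unchanged", "segment A definition unchanged"],
    owner="Head of Retention",
    last_reviewed=date(2024, 1, 15),
)

fresh = needs_review(retention_offer_lift, today=date(2024, 3, 1))
stale = needs_review(retention_offer_lift, today=date(2025, 1, 15))
```

The age check is a backstop. The review that matters is triggered when one of the `holds_while` conditions changes, and that still requires a person who is watching for it.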

Four failure modes, each with a recognition criterion:

| Anti-pattern | What it looks like | Recognition criterion |
|---|---|---|
| Attribution as causation | A campaign’s “attributed revenue” is used to evaluate whether the campaign worked | You cannot say what revenue would have occurred without the campaign |
| Proxy optimization | The metric moves but the outcome doesn’t follow, and no one knows why | The people hitting the target can describe how they did it without mentioning the underlying outcome |
| Selection bias in evaluation | Program effectiveness is measured on participants, who were selected because they were likely to succeed | The comparison group does not exist or was not randomly assigned |
| Average effect applied to individuals | An average treatment effect from an A/B test is used to make decisions about specific customers or accounts | The decision does not distinguish between customers who benefit from the intervention and those who don’t |

The Engagement

A KPI audit maps each of your current metrics to a rung, identifies which ones embed causal claims they can’t support, and specifies the causal model needed to replace them with measures that hold under optimization. Thirty minutes to identify the first one.

info@rung3.ai