For Executives — When the analysis your decision depends on is answering the wrong question

If you have 90 seconds

Your firm asks three categories of question: what tends to happen, what happens if we act, and what would have happened if we’d acted differently. Almost every analytics tool you have answers only the first, and most of the time it answers it well. The trouble comes when that answer is used for a high-stakes decision that rests on observational data, which is exactly when it matters most.

Below: which question yours is actually answering, and three short tracks for the kind of reader you are.

Rung3.ai is causal AI for executive decision support — the kind that computes causal quantities rather than fitting patterns to history. The problem it’s built to address is older than the current AI debate:

“Decision makers in all areas of life (including physicians, generals, scientists, bankers and politicians) must often assess and manage risk when there is little or no direct historical data to draw upon, or where relevant data is difficult to identify. The international credit crisis was not predicted by the world’s leading financial analysts because they relied on models based on historical statistical data that could not adapt to new circumstances — even when those circumstances (in this case the collapse of the mortgage sub-prime market) were foreseeable by experts with more intimate knowledge of the market place. The challenges are similarly acute when the source of the risk is novel: terrorist attacks, ecological disasters, major project failures, and more general failures of novel systems, market-places and business models.”

— Fenton, N. & Neil, M., Risk Assessment and Decision Analysis with Bayesian Networks, 2nd ed., CRC Press, 2018.

The three questions, and which yours answers

Pearl's framework¹ identifies three categories of causal question. They look interchangeable; they are not. A tool built for a lower rung cannot answer a higher one, whether with more data or with better tuning. The gap is logical, not technical.

Rung 1 — Association. What tends to happen? This is what observational analytics — descriptive statistics, predictive ML, BI dashboards — was built for. Patients with this biomarker profile tend to relapse. Borrowers with this debt-to-income ratio tend to default. Suppliers in this region tend to delay. All true. None of it answers a decision.

Rung 2 — Intervention. What happens if we act? The decision question. If we treat this patient, what will happen? If we extend this borrower, what will happen? If we shift this supplier, what will happen? The Rung 1 answer does not transfer to Rung 2 without an additional assumption — that the conditions under which we observed the pattern are the conditions under which we are intervening. That assumption is frequently false, and the failure mode is not detectable from the Rung 1 data.

Rung 3 — Counterfactual. What would have happened if we’d acted differently, for this specific case? The counterfactual question. Did this drug cause this patient's injury, or did the underlying condition? Would this borrower have defaulted if we had not intervened? Did the resilience investment save us, or did we just not get tested? This is the question regulators ask, the question boards ask, the question every post-mortem asks. It cannot be answered by any tool that stops at Rung 1.

The point here is operational: identify which rung your highest-stakes analyses are operating on, and which rung the question they are being used to answer actually requires. The two often differ. When they do, the answer is structurally wrong, not approximately right.
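For readers who want the formal version, the three rungs correspond to three different mathematical objects in Pearl's notation. The block below is a standard textbook rendering, not something specific to the case studies on this site. Here X is the action, Y the outcome, x' and y' the action and outcome actually observed, and Z an assumed set of measured confounders.

```latex
\begin{align*}
\text{Rung 1 (association):}    &\quad P(Y = y \mid X = x) \\
\text{Rung 2 (intervention):}   &\quad P(Y = y \mid \mathrm{do}(X = x)) \\
\text{Rung 3 (counterfactual):} &\quad P(Y_{x} = y \mid X = x',\, Y = y') \\[6pt]
\text{Back-door adjustment:}    &\quad P(Y = y \mid \mathrm{do}(X = x))
    = \sum_{z} P(Y = y \mid X = x,\, Z = z)\, P(Z = z)
\end{align*}
```

The last line holds only if Z blocks every confounding path between X and Y. That condition is a statement about the structure of the domain, which is why it cannot be checked from Rung 1 data alone.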

What the AI initiative you authorize should look like

Most AI initiatives an executive authorizes today are organized around a single tool: a language model, a predictive engine, a forecasting platform. The choice of tool answers the wrong question. The decision is not which AI system. The decision is which kind of intelligence each kind of question requires, and how those parts fit together.

For Rung 1 questions — descriptive, observational, “what tends to happen” — a language model or a predictive engine is the right tool. They are trained on the world’s aggregate experience and they answer accordingly. For the Rung 2 and Rung 3 questions your highest-stakes decisions depend on, no amount of training data substitutes for a structural account of how your system works. That account — the cause-and-effect graph that your subject-matter experts already carry in their heads — is what a structural causal model holds.

The architecture worth authorizing is therefore hybrid: a structural causal model of your domain, built by your experts and owned by your organization, with a language model as the interface through which non-technical decision-makers query it. The model itself is an explicit account of which variables cause which others, under what conditions, with what strength — operational, not statistical. The LLM brings linguistic fluency and general priors. The SCM brings grounded specificity and audit-ready reasoning. The library of models you accumulate over engagements is the artifact that compounds — each new model uses the structure of the last, and your experts continue to refine each one as evidence accumulates and the domain changes. The discipline transfers to your team rather than departing with a vendor. The half-day audit (below) is how that capability begins.
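To make "an explicit account of which variables cause which others" concrete, here is a minimal sketch of a structural causal model as a data structure. Everything in it is hypothetical: the variable names, the coefficients, and the `evaluate` helper are illustrative assumptions, not the Rung3.ai implementation or any library's API. It shows only the core idea that an intervention replaces a variable's own equation with a fixed value and lets the consequences flow downstream.

```python
from typing import Callable, Dict, Optional

Values = Dict[str, float]
Equation = Callable[[Values], float]

# Hypothetical structural equations, listed parents-before-children.
# All names and coefficients are made up for illustration.
SCM: Dict[str, Equation] = {
    "maintenance_deferred": lambda v: 0.0,  # decision: 0 = on schedule, 1 = deferred
    "dry_season": lambda v: 1.0,            # external condition, fixed here
    "fault_rate": lambda v: 0.02 + 0.03 * v["maintenance_deferred"],
    "ignition_probability": lambda v: v["fault_rate"] * (0.1 + 0.4 * v["dry_season"]),
}


def evaluate(model: Dict[str, Equation], do: Optional[Values] = None) -> Values:
    """Compute every variable; variables named in `do` are set, not computed."""
    do = do or {}
    values: Values = {}
    for name, equation in model.items():
        # An intervention replaces the variable's own equation with a fixed value.
        values[name] = do[name] if name in do else equation(values)
    return values


baseline = evaluate(SCM)
deferred = evaluate(SCM, do={"maintenance_deferred": 1.0})
print(f"ignition probability, on schedule: {baseline['ignition_probability']:.3f}")
print(f"ignition probability, deferred:    {deferred['ignition_probability']:.3f}")
```

The point of the design is that the question "what happens if we defer maintenance?" is asked of the model's structure, not of a historical table. A realistic model would carry probability distributions and expert-elicited parameters in place of these fixed numbers.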

For the technical version of this argument (why the hybrid is measurably more accurate, separate from architectural elegance), see Why SCMs Improve LLM Accuracy. It covers the RAG ceiling, transferable causal topology, sparse-data robustness, and what each component contributes, and it is useful to send to the technical lead who will vet this.

For technical leadership (CDS, Head of AI/ML, quants, econometricians)

You already know correlation isn't causation. The question for you is more pointed: of the board-level decisions your team's analytics stack informs (the predictive models, the DiD studies², the propensity-scored evaluations), what fraction rest on Rung 1 answers being read as Rung 2 or Rung 3? Three concrete patterns from the case studies on this site:

Selection on the outcome. A bank measured a credit-rehabilitation program by looking at enrolled borrowers. But enrollment was already a selection: the program only took borrowers it thought it could save. The data answers “how do enrolled borrowers do?”, not “what is the program's effect?”

Credit Risk →

Treatment confounded by indication. Observational ICU data shows that patients who got vasopressors died at higher rates — so the model concludes vasopressors kill. But vasopressors were given to the patients who needed them most. The covariates that selected for treatment also selected for outcome; the standard adjustment cannot reach the counterfactual.

Sepsis Dynamic Treatment Regimes →

Programs that look like they worked. A retention program's churn rate is 11%; the comparable non-program churn rate is 18%. The program "saved" 7 points. But customers opted into the program based on their intent to stay. The observed effect is mostly the selection, not the program.

Bank Churn →
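The arithmetic behind these patterns can be sketched with a small invented cohort. The counts below are hypothetical, chosen only so that the headline rates match the 11% and 18% in the churn example; they are not data from any case study. The sketch also assumes that "intent to stay" was recorded for every customer, which in practice it usually is not. It compares the raw gap with a stratified gap that compares like with like.

```python
from fractions import Fraction

# Hypothetical counts chosen so the headline rates match the text (11% vs 18%).
# Key: (arm, intent stratum) -> (customers, churners).
# Assumption: "intent to stay" was recorded; in practice it is often unobserved.
COHORT = {
    ("program", "high_intent"): (1400, 126),
    ("program", "low_intent"): (200, 50),
    ("no_program", "high_intent"): (800, 80),
    ("no_program", "low_intent"): (800, 208),
}
STRATA = ("high_intent", "low_intent")


def churn_rate(arm, stratum=None):
    """Churn rate for an arm, optionally restricted to one intent stratum."""
    cells = [
        counts for (a, s), counts in COHORT.items()
        if a == arm and (stratum is None or s == stratum)
    ]
    customers = sum(n for n, _ in cells)
    churners = sum(c for _, c in cells)
    return Fraction(churners, customers)


def stratum_share(stratum):
    """Share of all customers (both arms) who fall in this stratum."""
    in_stratum = sum(n for (_, s), (n, _) in COHORT.items() if s == stratum)
    total = sum(n for n, _ in COHORT.values())
    return Fraction(in_stratum, total)


# Rung 1 comparison: raw difference between the two groups.
naive_gap = churn_rate("no_program") - churn_rate("program")

# Stratified comparison: compare within each stratum, then reweight by stratum size.
adjusted_gap = sum(
    stratum_share(s) * (churn_rate("no_program", s) - churn_rate("program", s))
    for s in STRATA
)

print(f"naive gap:    {float(naive_gap):.1%}")
print(f"adjusted gap: {float(adjusted_gap):.1%}")
```

With these made-up counts the raw gap is 7 points while the within-stratum gap is 1 point, so most of the apparent saving comes from who enrolled. The 1-point figure is a property of the invented data, not an estimate for any real program, and the adjustment is only as good as the assumption that the stratifying variable captures the selection.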

The constructive answer is for your team to build and own a structural causal model library: auditable, queryable, and composable with the LLM tooling they’re already deploying. The work transfers the discipline; the artifact stays with you. The technical entry point is a half-day audit on one decision domain; see The smallest commitment.

For deeper technical reading: Why SCMs Improve LLM Accuracy · Why Structured Causal Models? · The SCM Library · Construction walkthroughs: commercial auto reserving, FAIR cybersecurity risk (how an engagement actually proceeds)

For strategy and operations (COO, head of strategy, ops leadership)

Your problem isn't analytics. Your problem is that the decisions you own depend on cause-and-effect claims, and the analyses they rest on were not built to make them. Three patterns:

The score that didn't make a decision. A utility had wildfire as a top risk, with a score and dashboards and a regulatory framework. None of it connected an equipment-deferral decision to a fire-probability change. The risk register described what was happening; it did not describe what to do, and it could not be queried that way.

Utility Wildfire Risk →

The metric that was unmeasured. A training program ran at 94% completion. The performance effect of the training — the actual reason the program existed — was unmeasured. Completion was answerable from existing systems; effect required isolating the training from everything else that affects performance. The cheaper metric won by default.

Training Effectiveness →

The risk that wasn't visible. A supply chain risk model captured single-step disruptions. Two-step interactions — a tier-2 supplier and a tier-3 logistics shift compounding — were invisible to the model. The disruption that hit was the one nobody had modeled, not because they hadn't thought of it but because the modeling shape didn't accommodate it.

Supply Chain Risk →

The remedy is not “more analytics.” It is a model your team can interrogate as a decision tool, not just as a description — built by your domain experts, owned by your organization, extensible without me in the room. The smallest commitment is a half-day audit on one decision your team is currently making — see The smallest commitment. For the work-shape of the full engagement, see the procedure and the engagement page.

For risk-sensitive industries (CRO, CFO, chief actuary; healthcare, insurance, finance, regulated industries)

When a decision you signed off on goes wrong, the question isn't what the model predicted. The question is why. Predictive models don't answer that question — they were not built to. Structural causal models do, and they were. Three patterns:

The pattern that wasn't the cause. Chain-ladder reserving describes the development pattern of historical claims. It cannot tell you why the pattern is changing — whether the shift is social inflation, attorney involvement, severity creep, or a mix — and those answers carry different reserve implications. Regulators are increasingly asking for the why, and the standard tooling cannot supply it.

Insurance Reserving →

The harm that needs causal evidence. A regulator does not ask “is exposure to X associated with harm?” The regulator asks “did X cause this harm?” Association is not the answer. A structural causal model is the artifact that supplies the answer in defensible form, with explicit assumptions, identifiable counterfactuals, and an audit trail.

Causal Evidence →

The drug fighting the medication. An iatrogenic-medication audit found that a large share of adverse events attributed to underlying conditions were actually drug-drug interactions in the prescribing cascade. The reserve was overstated; the clinical liability was understated; both came from attributing the event to the wrong cause.

Iatrogenic Medications →

A model that goes to a board, a regulator, or a court has to pass one test: can a third party reconstruct exactly how you arrived at the answer? A structural causal model passes it — the graph, the assumptions, and the math are all explicit and inspectable. A predictive model does not.

What this looks like in practice is documented in the case studies. The pre-decision conversation is the half-day audit — see The smallest commitment.

For the audit-defensibility apparatus in detail: Provenance & Audit · About Risk (case studies by risk type)

The smallest commitment

The first step is a half-day audit on one decision domain. The audit names which of your decision-relevant analyses are answering the question they are being used to answer, and which aren't. Nothing more is committed at that stage.

If the audit finds something, what follows is capability-building, not consulting. Your analysts and domain experts build the model with me — the work is theirs, the artifact is yours, the craft transfers. At the end, your team owns the capability and the ongoing maintenance. There is no version of this engagement that leaves you dependent on an outside firm. The procedure page describes the full engagement shape; the half-day audit is the entry point. Typical engagement: eight to sixteen weeks across one decision domain — long enough for your team to build, query, and pressure-test their models end-to-end; short enough that you commit to one domain at a time.

The half-day audit takes thirty minutes to set up. Contact: info@rung3.ai. Bring one domain in which your team is currently making a decision worth getting right.

¹ Pearl, J. & Mackenzie, D., 2018, The Book of Why: The New Science of Cause and Effect, Basic Books. The three-rung framework is the book's central organizing argument; the formal treatment is in Pearl, J., 2009, Causality: Models, Reasoning, and Inference (2nd ed.), Cambridge University Press.

² DiD: Difference-in-Differences, a quasi-experimental method economists and policy analysts use when they can't run a randomized trial but want a causal estimate of a treatment or policy effect.
