The Setup
The case for elicitation is straightforward. The questions a board cares about — why is this happening, what would happen if we changed it, what would have happened if we’d acted differently — require knowledge of mechanism. Mechanism knowledge is not in the data. It lives in the heads of people who have spent twenty years watching how the system actually behaves: the senior underwriter, the chief actuary, the operating-line CISO, the maintenance engineer who has seen the failure mode three times before. Those people are the only available source for the structure and the conditional probabilities a causal model needs.
The case is honest. The procedure is hard.
This page names the difficulties. The literature is clear about most of them — Spetzler & Staël von Holstein in 1975, Tversky & Kahneman through the 1970s and 80s, Cooke’s performance-weighted aggregation in 1991, Gigerenzer on frequency formats in the mid-1990s, O’Hagan and colleagues collecting it all in Uncertain Judgments in 2006. The findings are not encouraging. They are, however, manageable, and managing them is most of what a competent elicitation engagement actually does.
The Cognitive Biases
Human probability judgment is systematically miscalibrated. The biases below are not character flaws — they are how the brain makes inferences fast under uncertainty. They survive into the elicitation room intact.
| Bias | What it does to elicited probabilities |
|---|---|
| Overconfidence | 90% credible intervals contain the true value about 50–60% of the time, not 90%. Experts compress uncertainty. |
| Anchoring | The first number mentioned in the room becomes the gravitational center of every subsequent estimate, even when irrelevant. |
| Availability | Recent or vivid events get higher probability than their base rate. The expert who handled the 2023 incident weights 2023‑like risks. |
| Hindsight | After an outcome is known, the expert reports the prior probability as if it had been higher. Useless for forward-looking models. |
| Conjunction | P(A and B) is reported higher than P(A). Tversky and Kahneman’s Linda problem; the effect reproduces in risk experts. |
| Reference class | Asked “what’s the probability of failure,” the expert imagines this failure rather than the class of all failures of this type. |
| Motivated reasoning | The expert whose program will be cut if the model says it’s working will not estimate that it’s working badly. |
Each bias is well-documented. None of them is fixable by simply asking the expert to be more careful. They require process.
The Process Pathologies
Even with cognitively-honest experts, the elicitation process introduces its own failure modes.
The senior person speaks. In a group session, the most senior expert anchors the room. The junior expert who works the data daily, and who may have a sharper view, defers. Group elicitation produces a consensus that looks more reliable than the underlying knowledge actually is.
Discretisation in the wrong place. A continuous variable becomes three discrete states for tractability — Low, Medium, High. The expert assigns probabilities to those bins. But the expert was thinking in dollars, days, or basis points; forcing the answer into Low/Medium/High loses the structure they had. Worse: the bin boundaries are themselves an elicitation problem, usually solved by guessing.
Elicitation fatigue. A model with thirty nodes and a handful of parents per node has hundreds of conditional probabilities to specify. By probability number one hundred and twenty, the expert is making numbers up. The model fits perfectly to numbers no one believes.
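A rough count makes the fatigue problem concrete. The sketch below assumes every node is discrete and that each row of a conditional probability table needs one fewer number than the node has states, because the row must sum to one. The function name and the example network shape are illustrative assumptions, not figures from any particular engagement.

```python
from math import prod

def free_parameters(node_states: int, parent_states: list[int]) -> int:
    """Number of probabilities an expert must supply for one node's CPT.

    There is one row per combination of parent states, and each row needs
    (node_states - 1) numbers because the row must sum to 1.
    """
    rows = prod(parent_states)  # prod([]) == 1, which covers root nodes
    return rows * (node_states - 1)

# Hypothetical network: 30 three-state nodes, each with 2 three-state parents.
per_node = free_parameters(3, [3, 3])
total = 30 * per_node
print(per_node, total)  # 18 540
```

Adding a third three-state parent triples the count for that node, which is why parent sets are usually pruned before elicitation starts.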
The wrong question gets answered. The expert is asked to estimate P(default | high leverage) and answers something closer to P(high leverage | default) — the inverse. Inverse probability error is rampant in elicitation; experts are not trained probabilists.
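Bayes’ rule shows how far apart the two quantities can be. In the sketch below all three input figures are hypothetical, chosen only to illustrate the gap between the probability the expert recalled and the one the model needs.

```python
def invert(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' rule: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical portfolio figures, for illustration only.
p_default = 0.02            # base rate of default
p_high_leverage = 0.20      # share of firms with high leverage
p_lev_given_default = 0.70  # what the expert actually had in mind

p_default_given_lev = invert(p_lev_given_default, p_default, p_high_leverage)
print(round(p_default_given_lev, 3))  # 0.07
```

An expert who reports 70% when the model needs 7% has answered the inverse question; asking for the base rates alongside the conditional is one way to catch the mismatch.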
Probabilities of zero and one. Experts assign 0% and 100% routinely. Both are catastrophic for a Bayesian network: 0% means no amount of contradictory evidence can ever shift the belief, and 100% does the same in reverse. A causal model with a single 0% in the wrong CPT can produce nonsense and refuse to recover.
Definition drift. “What’s the probability of a major incident?” means three different things to three experts in the same room. Without a shared, written, operational definition of the variable, the elicited probabilities are not comparable.
What Actually Works
None of these difficulties is novel. The probability-elicitation literature has fifty years of work on how to mitigate each one, and the practitioner techniques are well-understood. They are skipped routinely because they take time.
Operational definition first, probability second. Before any number is requested, the variable is defined in writing — what counts as a major incident, in this engagement, by this organization’s ledger. Five minutes of definition saves an hour of inconsistent answers.
Decomposition before aggregation. Rather than ask “what’s the probability of breach?”, decompose: probability of attempted intrusion, conditional probability of perimeter failure, conditional probability of detection failure, conditional probability of impact realisation. Experts are markedly better at the parts than the whole. The model multiplies them out; the expert never has to.
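The multiplication the model performs is just the chain rule. A minimal sketch, assuming each stage is conditioned on the previous one having occurred; the four values are hypothetical placeholders for what the experts would supply.

```python
from math import prod

# Hypothetical elicited values for each link in the chain.
chain = {
    "attempted intrusion": 0.60,
    "perimeter failure | attempt": 0.10,
    "detection failure | perimeter failure": 0.25,
    "impact realised | undetected": 0.40,
}

# Chain rule: the joint probability is the product of the conditionals.
p_breach = prod(chain.values())
print(f"{p_breach:.4f}")  # 0.0060
```

Each factor is a number an expert can reason about from experience, whereas the product, 0.6%, is not a figure most people can estimate directly.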
Calibration training and feedback. Before the substantive elicitation, the expert answers ten general-knowledge questions with 90% credible intervals. On the first round, their intervals contain the truth about 50% of the time. By the third round, they widen their intervals appropriately. The two hours spent on this are not wasted: they recover the calibration the rest of the elicitation depends on. Hubbard’s How to Measure Anything made this technique an operational discipline; it is the highest-leverage block of time in any elicitation engagement.
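Scoring a calibration round is simple enough to do on the spot. The sketch below assumes the expert gave a (low, high) interval for each question and that the true values are known to the facilitator; the ten answers are invented for the example.

```python
def hit_rate(intervals, truths):
    """Fraction of (low, high) intervals that contain the true value."""
    hits = sum(low <= t <= high for (low, high), t in zip(intervals, truths))
    return hits / len(truths)

# Hypothetical first-round answers to ten general-knowledge questions.
intervals = [(10, 20), (100, 150), (5, 8), (1900, 1920), (30, 40),
             (0, 3), (500, 600), (60, 70), (12, 14), (200, 300)]
truths = [15, 180, 6, 1912, 55, 2, 450, 65, 19, 250]

rate = hit_rate(intervals, truths)
print(rate)  # 0.6
```

A hit rate of 0.6 against a stated 90% is the feedback the expert needs to see; with only ten questions the estimate is noisy, which is one reason to run several rounds.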
Individual elicitation, then group reconciliation. Each expert assigns probabilities alone; only afterwards is disagreement surfaced and discussed. This catches the senior-anchoring problem and surfaces dissent that group sessions suppress.
Lottery comparisons rather than direct estimation. Asking “would you rather bet on the event occurring, or on a fair lottery with a 30% chance of paying out?” and then adjusting the lottery until the expert is indifferent reaches calibrated probabilities even from experts who cannot articulate them as numbers. Slow, but accurate.
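One way to run the comparison is as a bisection on the lottery’s win probability. This is a sketch under stated assumptions: `prefers_event` is a hypothetical callback standing in for the expert’s answer to each question, and the simulated expert holds a fixed hidden belief, which real experts do not.

```python
def elicit_by_lottery(prefers_event, rounds=7):
    """Bisect on the reference lottery's win probability.

    prefers_event(p) returns True if the expert would rather bet on the
    event than on a fair lottery that wins with probability p.
    """
    low, high = 0.0, 1.0
    for _ in range(rounds):
        p = (low + high) / 2
        if prefers_event(p):
            low = p   # the event seems more likely than p
        else:
            high = p  # the lottery seems at least as attractive
    return (low + high) / 2

# Simulated expert with a hidden belief of 0.12 (an assumption for the demo).
estimate = elicit_by_lottery(lambda p: 0.12 > p)
print(round(estimate, 3))  # 0.121
```

Seven questions narrow the interval to a width of 1/128, which is usually finer than the expert’s actual resolution; in practice the facilitator stops when the expert can no longer choose.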
Frequency formats rather than probability formats. Asking “out of 1,000 firms in this situation, how many would experience the failure?” produces more accurate elicitations than asking “what is the probability of failure?” The same expert who says 5% when asked for a probability may say “maybe 20 out of 1,000” when asked for a frequency, and the frequency answer is closer to the calibrated truth. The frequency-format effect, demonstrated by Gigerenzer and Hoffrage in 1995, underlies one of the most reliable interventions in the elicitation literature; the format change does most of the work that calibration training does, and it requires no training. The reason is structural: probabilities ask the expert to map a situation to an abstract scale, while frequencies ask them to imagine a population of cases, a cognitive task humans evolved to handle. Frequency formats also surface the conjunction-error problem (asking “how many bank tellers? how many bank tellers who are feminists?” rather than the abstract probability versions) and force the expert into the right reference class.
Bound the zeros and ones. Convert every elicited 0% to a small ε (say, 0.001) and every 100% to 1−ε. The substantive belief is preserved; the inference engine remains capable of updating on contradictory evidence.
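A two-state example shows why the bound matters. The sketch assumes a plain discrete Bayes update; renormalising after clamping is an added assumption so that each distribution still sums to one, and the likelihood values are hypothetical.

```python
def bound(probs, eps=0.001):
    """Clamp each probability into [eps, 1 - eps], then renormalise."""
    clamped = [min(max(p, eps), 1 - eps) for p in probs]
    total = sum(clamped)
    return [p / total for p in clamped]

def posterior(prior, likelihood):
    """Bayes update over a discrete set of states."""
    joint = [p * lik for p, lik in zip(prior, likelihood)]
    z = sum(joint)
    return [j / z for j in joint]

likelihood = [0.99, 0.01]  # hypothetical evidence strongly favouring state 0
raw = [0.0, 1.0]           # expert said "impossible" and "certain"

print(posterior(raw, likelihood))         # state 0 stays at exactly 0.0
print(posterior(bound(raw), likelihood))  # state 0 rises to roughly 0.09
```

With the raw prior, evidence that is 99 times more likely under state 0 changes nothing. With the bounded prior, the same evidence moves state 0 from 0.1% to about 9%, and further evidence can keep moving it.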
Sensitivity analysis as feedback. Once the first model runs, the elicitation isn’t finished — it’s tested. Vary each elicited probability across its plausible range; identify the variables the model is most sensitive to; go back to the experts and re-elicit those carefully. The variables that matter get the deepest attention; the ones that don’t move the answer are accepted as rough.
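The simplest version of this loop is a one-at-a-time sweep. The sketch below uses a toy three-input product as a stand-in for the real causal model; the input names, baselines, and ranges are all hypothetical. It evaluates only the endpoints of each range, which is adequate when the output is monotone in each input and an assumption otherwise.

```python
def model(p):
    """Toy stand-in for the causal model: product of three elicited inputs."""
    return p["attempt"] * p["perimeter_fail"] * p["detect_fail"]

# Hypothetical baseline values and plausible (low, high) ranges.
baseline = {"attempt": 0.60, "perimeter_fail": 0.10, "detect_fail": 0.25}
ranges = {
    "attempt": (0.50, 0.70),
    "perimeter_fail": (0.02, 0.30),
    "detect_fail": (0.20, 0.30),
}

def swings(model, baseline, ranges):
    """Output swing when one input moves across its range while the
    others stay at their baseline values."""
    out = {}
    for name, (low, high) in ranges.items():
        lo_out = model({**baseline, name: low})
        hi_out = model({**baseline, name: high})
        out[name] = abs(hi_out - lo_out)
    return out

ranked = sorted(swings(model, baseline, ranges).items(),
                key=lambda kv: kv[1], reverse=True)
for name, swing in ranked:
    print(f"{name:15s} {swing:.4f}")
```

Here `perimeter_fail` dominates because its plausible range is wide, so it is the input to re-elicit first; the other two can stay rough without changing the answer much.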
Performance-weighted aggregation when multiple experts must be combined. When the engagement requires a single number from a panel of disagreeing experts (for a regulator, a court, or an actuarial filing), equal-weight averaging is the wrong default. Cooke’s classical model assigns each expert a weight derived from how well they performed on a set of seed questions: questions whose answers are unknown to the expert at the time but known to the elicitor. Experts who are well-calibrated and informative on the seeds get higher weights on the substantive questions; experts who are confidently wrong get lower weights. The classical model has been the standard technique in serious nuclear-safety, environmental-risk, and aerospace elicitation studies since the 1990s, the kind of high-stakes settings where “we asked five experts and averaged” would not survive review. Cooke’s 1991 monograph and the subsequent literature with Goossens document its operational discipline; it requires more setup than equal weighting but produces aggregations that are themselves calibrated rather than just democratic.
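The calibration half of the weighting can be sketched in a few lines. This is a simplified illustration, not the full classical model: it assumes each expert gives 5%, 50%, and 95% quantiles for every seed question, scores calibration with a chi-square test on how the true values fall between those quantiles, and sets weights proportional to that score alone. The information score and the cutoff optimisation are omitted, and the seed data are synthetic.

```python
from math import erfc, exp, log, pi, sqrt

# Probability mass expected in the four bins cut by the 5/50/95% quantiles.
EXPECTED = (0.05, 0.45, 0.45, 0.05)

def chi2_sf_3df(x: float) -> float:
    """Survival function of the chi-square distribution with 3 degrees
    of freedom (closed form, so only the standard library is needed)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

def calibration_score(quantiles, truths):
    """quantiles holds one (q05, q50, q95) tuple per seed question."""
    counts = [0, 0, 0, 0]
    for (q05, q50, q95), t in zip(quantiles, truths):
        # Bin index = number of quantiles the true value exceeds.
        counts[(t > q05) + (t > q50) + (t > q95)] += 1
    n = len(truths)
    # Relative entropy of the observed bin frequencies against EXPECTED.
    kl = sum((c / n) * log((c / n) / p)
             for c, p in zip(counts, EXPECTED) if c > 0)
    return chi2_sf_3df(2 * n * kl)

def weights(scores):
    """Normalise calibration scores so the weights sum to 1."""
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

# Synthetic seed data: ten questions whose true value is 100.
truths = [100.0] * 10
expert_a = [(80, 110, 140)] * 5 + [(60, 90, 130)] * 5  # truths mid-range
expert_b = [(99, 101, 103)] * 4 + [(90, 95, 99)] * 6   # narrow, often missed

w = weights({"A": calibration_score(expert_a, truths),
             "B": calibration_score(expert_b, truths)})
print({name: round(v, 4) for name, v in w.items()})
```

Expert B’s intervals miss the truth on six of ten seeds, so the calibration score collapses and nearly all the weight goes to expert A. An equal-weight average would have given the two the same say.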
Document disagreement; do not average. When two experts disagree on a probability, the model encodes the disagreement as a wide prior or as two competing scenarios. When Experts Disagree covers this in detail. The instinct to average is wrong; the instinct to defer to seniority is wrong. The right move is to preserve both views in the formal structure — or, when a single number is required, to weight by demonstrated calibration rather than by reputation.
What Honest Elicitation Looks Like
An elicitation engagement done well is mostly not about asking for numbers. It is about defining variables operationally, decomposing complex probabilities into estimable parts, calibrating the experts before they answer the substantive questions, protecting against process pathologies, and iterating with sensitivity analysis. The actual probability elicitation is a smaller fraction of the total time than most clients expect, and a larger fraction than most consultants budget.
The deliverable that comes out the other side is a model with three properties most consulting outputs lack: every elicited probability has a written definition of what it means, a record of who was asked and how, and a sensitivity analysis showing how much the answer depends on each input. That documentation is what makes the model defensible to a regulator, durable across staff turnover, and revisable as new evidence arrives.
The site argues elsewhere that expert knowledge is the only available source for the structural-causal moves data alone cannot make. This page is the necessary other half of that argument: the source is real, but it is not free. Eliciting from it well is harder than the rest of the modeling work combined. The engagement budget reflects that.
If the elicitation procedure your team uses does not include calibration training, decomposition, individual-then-group elicitation, sensitivity feedback, and operational definitions for every variable, the resulting model is not capturing what your experts know. It is capturing what they said when asked.
Spetzler, C.S. & Staël von Holstein, C.A.S., 1975, “Probability Encoding in Decision Analysis,” Management Science 22(3) · Tversky, A. & Kahneman, D., 1974, “Judgment under Uncertainty: Heuristics and Biases,” Science 185 · Cooke, R.M., 1991, Experts in Uncertainty: Opinion and Subjective Probability in Science, Oxford University Press — the foundational treatment of performance-weighted aggregation; Cooke & Goossens 2008 (Reliability Engineering and System Safety 93) operationalises the technique across decades of high-stakes elicitation studies · Gigerenzer, G. & Hoffrage, U., 1995, “How to Improve Bayesian Reasoning Without Instruction: Frequency Formats,” Psychological Review 102(4) — the empirical case for frequency over probability framing · O’Hagan, A. et al., 2006, Uncertain Judgments: Eliciting Experts’ Probabilities, Wiley · Renooij, S., 2001, “Probability elicitation for belief networks: issues to consider,” The Knowledge Engineering Review 16(3) · Morgan, M.G. & Henrion, M., 1990, Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis, Cambridge University Press · Hubbard, D.W., 2014, How to Measure Anything: Finding the Value of Intangibles in Business (3rd ed.), Wiley — the practitioner-facing companion to the academic literature, with calibration training as its operational core.