Asset Reliability

Why a Causal Model

Asset reliability management has a fundamental measurement problem: the successes are invisible. A predictive maintenance program that detects a developing fault and triggers early intervention prevents a failure that never appears in the data. The board sees the cost of the program and the count of failures that occurred. It cannot see the failures that did not occur. Standard practice — comparing failure counts before and after, or against a benchmark fleet — contains a structural error: the fleets that adopt predictive maintenance tend to be higher-risk fleets, so any comparison conflates the effect of the program with the background risk that motivated the investment in the first place.

The same problem appears in interval decisions. When a board observes that fleets running extended overhaul intervals have higher failure rates, it is seeing two effects: the interval extension itself, and the elevated wear rates that tend to accompany the cost pressures that produce interval extensions. Separating those two effects requires a causal model that holds the confounder — fleet utilization pressure — at its natural level while forcing the interval to change.

Analysis Component	Standard Approach	Causal Approach
Program evaluation	Count failures that occurred	Estimate failures that would have occurred without the program
Interval extension effect	Correlated with failure rate (confounded by utilization pressure)	Isolated via do() — utilization pressure held at prior
Root cause of rising outage rate	Each team cites their metric	Posterior distribution over causes given the full evidence pattern
PdM selection bias	Not in model — high-risk fleets compared to low-risk benchmarks	Asset Risk Profile as explicit confounder
Counterfactual failures prevented	Not computable	Estimated from U-node-anchored individual-level model

The board approved the predictive maintenance program based on projected failure reductions. The only way to know whether it delivered is to estimate the failures that did not happen — and that requires a causal model, not a dashboard.

The Questions

3 Questions, 3 Rungs

We spent $8M on a predictive maintenance program and still had three unplanned outages — would those outages have occurred without the program? — Rung 3 (Counterfactual). Answering it requires abduction to anchor this specific fleet’s background failure risk via U nodes before removing the PdM program; without U node anchoring the counterfactual reflects a population average fleet, not this one.
If we extend gas turbine overhaul intervals from 24 to 36 months to reduce capex, what does that actually cause to the probability of failure before the next planned intervention? — Rung 2 (Intervention). A do() query severs the confounding path from Fleet Utilization Pressure through the interval decision, isolating the causal effect of the extension from the operational environment that tends to drive both utilization and maintenance deferrals simultaneously.
Our forced outage rate has risen 18% over three years — engineering says fleet aging, operations says increased load, finance says deferred maintenance. What does the asset data actually support? — Rung 1 (Association). The graph encodes which dependencies exist between Asset Age, Load Factor, Maintenance Spend, and outage outcomes; entering the observed trend updates every genuinely connected upstream node, separating what the data supports from what each team is asserting.

Reading the screenshots: a black check mark on a node means it has been set as observed evidence — a fact entered into the model, acting as a filter. A red check mark means it has been set as a do intervention — a decision applied to the model, severing the influence of its parents.

Reading the spec tables: each Run the Analysis block lists the exact steps to reproduce each screenshot in Bayes Server. The Obs / Do column uses three italic control tokens: clear — reset the model to a blank no-evidence state; abduction step — enter the factual observations that anchor the U nodes to this specific case; use abduction result — apply a do() intervention with the U nodes held from the abduction step.

Rung 3 — Counterfactual

Did the program prevent failures we cannot see?

“We spent $8M on a predictive maintenance program last year. We still had three unplanned outages. Would those outages have occurred without the program — or did it actually prevent worse failures that never appear in the data?”

This question conditions on what actually happened — three outages with PdM running — and asks what would have occurred in a world where the program had not been deployed. It cannot be answered by comparing this fleet to a non-PdM benchmark, because fleets that invest in predictive maintenance tend to be higher-risk fleets. The Asset Risk Profile confounder in the model captures this: high-risk fleets adopt PdM precisely because they fail more, so any naive comparison overstates the program's benefit. The model separates those two effects.

The U nodes (shown in orange) represent the unobserved background of this specific fleet — sensor placement quality, analyst expertise, the particular fault modes present this year, crew positioning when failures occurred. By entering the three observed outages as evidence, the model infers what those background factors must have been for this fleet in this year. It then holds those background factors fixed and asks: with everything else about this fleet unchanged, what would the failure count have been without the program?

Answer

The counterfactual comparison shows that removing PdM from this fleet's actual background has negligible effect on Outage Consequence — but that is because the U nodes are also anchored to the observed failure state. Step 2 (obs PdM=Not Implemented + Failure=Three to four) anchors the model: EFDR drops to 63.9% Low, Asset Risk shifts only modestly (28.8/55.4/15.8%), and Outage Consequence shifts to 38.5% Severe / 17.0% Contained. Step 3 shows the fleet without PdM evidence but holding the same failure event: EFDR rises to 47.7% Low vs 63.9% — confirming that the program's direct causal contribution to detection is real. Step 4, do(PdM=Implemented), severs the Asset Risk back-door: Asset Risk jumps to 47.4% High-risk (the model now correctly infers this is a high-risk fleet — only such fleets adopt PdM), EFDR improves to 46.5% High, and the failure distribution returns close to prior (24.3/34.8/30.2/10.7%). The core counterfactual insight: without PdM, this fleet's EFDR falls to 63.9% Low, which is the pathway through which unplanned failures compound.

AssetReliabilityCounterfactual.bayes

Image	Obs / Do	Node	Set	Result
ar-c-1-prior	—	PdM Program		36.4% Implemented / 63.6% Not Implemented
	—	Early Fault Detection Rate		22.6% High / 31.4% Moderate / 45.9% Low
	—	Unplanned Failure Events		24.7% Five+ / 35.1% Three-four / 29.8% One-two / 10.4% None
	—	Outage Consequence		35.3% Severe / 37.0% Moderate / 27.7% Contained
ar-c-2	obs	PdM Program	Not Implemented
	obs	Unplanned Failure Events	Three to four	Abduction — fleet background anchored
	—	Asset Risk Profile		28.8% High / 55.4% Moderate / 15.8% Low
	—	Early Fault Detection Rate		7.5% High / 28.7% Moderate / 63.9% Low
	—	Outage Consequence		38.5% Severe / 44.5% Moderate / 17.0% Contained
ar-c-3	obs	Unplanned Failure Events	Three to four	Failure only — PdM not set (compare to step 2)
	—	Asset Risk Profile		30.5% High / 54.0% Moderate / 15.5% Low
	—	Early Fault Detection Rate		20.2% High / 32.1% Moderate / 47.7% Low
	—	Outage Consequence		38.5% Severe / 44.5% Moderate / 17.0% Contained
ar-c-4	do	PdM Program	Implemented	Severs Asset Risk back-door
	—	Asset Risk Profile		47.4% High — correct inference: only high-risk fleets adopt PdM
	—	Early Fault Detection Rate		46.5% High / 36.5% Moderate / 17.0% Low
	—	Unplanned Failure Events		24.3% Five+ / 34.8% Three-four / 30.2% One-two / 10.7% None
	—	Outage Consequence		35.0% Severe / 37.0% Moderate / 28.0% Contained

Prior — no evidence set

Fleet-level baseline. PdM 36.4% Implemented / 63.6% Not Implemented. EFDR 45.9% Low / 22.6% High. Failure Events: 24.7% Five+ / 35.1% Three-four. Outage Consequence 35.3% Severe.

Rung 2 — Intervention

What does extending the overhaul interval actually cause?

“If we extend the overhaul interval on our gas turbines from 24 to 36 months to reduce capex, what does that actually do to the probability of failure before the next planned intervention?”

When a board observes that fleets running extended intervals have higher failure rates, it is seeing two effects at once. Fleet Utilization Pressure independently causes both the decision to extend the interval and the wear rate on the assets — fleets under cost pressure run harder and maintain less. Observing an extended interval tells the model that utilization pressure is probably high, which raises wear rate, which worsens component condition. Forcing the interval to extended via intervention holds utilization pressure at its natural level and asks only what the longer gap between overhauls causes, separate from the operating environment that tends to accompany that decision. The gap between those two numbers is what the board needs before it approves the deferral.

Answer

do(Overhaul = Extended) raises High failure probability from 34.6% to 49.1% — a 14.5-point increase from the interval decision alone, with Fleet Utilization Pressure held at its prior. obs(Overhaul = Extended) produces 51.8% High failure — the 2.7-point gap above do() is the confounding bias: observing an extended interval updates FUP toward High (58.6% vs 30.0% prior), which further elevates wear rate and component condition. Add obs(Operating Environment = Severe) to the do() and failure probability rises to 53.4% High — the environment interaction adds another 4.3 points on top of the interval causal effect. Unplanned Outage Major rises from 33.8% at prior to 43.0% under do(Extended) alone and 45.5% under do(Extended) + Severe. The board is not deciding whether to extend the interval in the abstract — it is deciding whether to extend it on assets already exposed to severe conditions, where the effects compound.

AssetReliabilityIntervention.bayes

Image	Obs / Do	Node	Set	Result
ar-i-1-prior	—	Fleet Utilization Pressure		30.0% High / 45.0% Moderate / 25.0% Low
	—	Overhaul Interval		28.1% Extended / 50.4% Standard / 21.5% Accelerated
	—	Component Condition		31.6% Degraded / 36.8% Fair / 31.6% Good
	—	Failure Probability		34.6% High / 33.3% Moderate / 32.1% Low
	—	Unplanned Outage		33.8% Major / 31.0% Partial / 35.2% Avoided
	—	SAIDI Contribution		32.2% Above / 28.2% Near / 39.5% Within
ar-i-3	do	Overhaul Interval	Extended	FUP stays at prior — back-door severed
	—	Fleet Utilization Pressure		30.0% High / 45.0% Moderate / 25.0% Low — unchanged
	—	Component Condition		54.8% Degraded / 31.9% Fair / 13.2% Good
	—	Failure Probability		49.1% High / 31.3% Moderate / 19.6% Low
	—	Unplanned Outage		43.0% Major / 30.6% Partial / 26.4% Avoided
	—	SAIDI Contribution		38.5% Above / 28.4% Near / 33.2% Within
ar-i-4	obs	Overhaul Interval	Extended	Compare to do() — back-door open
	—	Fleet Utilization Pressure		58.6% High — up from 30.0%; confound flows back
	—	Component Condition		59.9% Degraded — higher than do()
	—	Failure Probability		51.8% High — 2.7pp above do(); confounding bias
	—	Unplanned Outage		44.5% Major / 25.1% Avoided
ar-i-2	do	Overhaul Interval	Extended	+ environment stress
	obs	Operating Environment	Severe
	—	Wear Rate		49.1% Accelerated — environment drives wear
	—	Failure Probability		53.4% High — 4.3pp above do() alone
	—	Unplanned Outage		45.5% Major / 24.3% Avoided
	—	SAIDI Contribution		40.1% Above Threshold

Prior — no evidence set

Fleet baseline. FUP 30% High / 45% Moderate. Overhaul 28.1% Extended / 50.4% Standard. Failure Probability 34.6% High / 32.1% Low. Outage 33.8% Major / 35.2% Avoided. SAIDI 32.2% Above Threshold.

Rung 1 — Association with Causal Structure

Which team is right about the rising outage rate?

“Our forced outage rate has risen 18% over three years. Engineering says it’s fleet aging. Operations says it’s increased load. Finance says it’s deferred maintenance. What does the asset data actually support?”

At Rung 1 the model runs as a filter: enter what the data shows, read which root causes become more probable. Two confounders in the graph make the diagnostic discriminate between the three teams rather than updating each cause proportionally. Asset Program Maturity independently drives both Fleet Age Profile and Maintenance Deferrals — confirming Significant deferrals updates program maturity, which also shifts fleet age beliefs. Grid Demand Growth independently drives both Load Stress and Fleet Age Profile — confirming Above Rating load updates demand growth, which shifts fleet age beliefs through a different mechanism. Each team's evidence produces a different footprint across the graph, and those footprints tell the board whose hypothesis most fits the full pattern of evidence when all three data points are available.

Answer

The elevated outage rate alone cannot name a single cause — but the three teams' evidence produces meaningfully different diagnostic signatures, and the combination points most strongly to fleet aging amplified by demand growth. With only the elevated outage rate entered, all three root causes update modestly upward and no team's hypothesis dominates. Confirming Above Rating load pulls Grid Demand Growth sharply upward, which also shifts Fleet Age Profile — the operations team's evidence implicates engineering's cause through the shared confounder. Confirming Significant maintenance deferrals pulls Asset Program Maturity toward Mature, which shifts both Fleet Age and Deferrals upward — finance's evidence implicates engineering through a different route. The model does not resolve the debate; it shows that all three teams are describing the same underlying pressure through different observational windows, and that the right intervention is at the confounder level — program renewal and load growth management — rather than addressing any single symptom in isolation.

AssetReliabilityDiagnostic.bayes

Image	Obs / Do	Node	Set	Result
ar-d-1-prior	—	Forced Outage Rate		—
	—	Financial Impact		—
ar-d-2-elevated	obs	Forced Outage Rate	Elevated
	—	Fleet Age Profile		—
	—	Load Stress		—
	—	Maintenance Deferrals		—
ar-d-3-load	obs	Forced Outage Rate	Elevated
	obs	Load Stress	Above Rating
	—	Grid Demand Growth		—
	—	Fleet Age Profile		—
ar-d-4-deferrals	obs	Forced Outage Rate	Elevated
	obs	Maintenance Deferrals	Significant
	—	Asset Program Maturity		—
	—	Fleet Age Profile		—

Prior — no evidence set

Root causes at fleet prior. Forced Outage Rate, Fleet Age, Load Stress, and Maintenance Deferrals at their baseline marginals.

Download the Models

↓

Asset Reliability Counterfactual — PdM Program (Rung 3)

Full U-node complement. Set obs(PdM = Implemented) + obs(Failure Events = Three to four) to anchor to your fleet's actual year, then do(PdM = Not Implemented) for the counterfactual.

↓

Asset Reliability Intervention — Overhaul Interval (Rung 2)

Compare obs(Extended) vs do(Extended) to see the Fleet Utilization Pressure confound. Add obs(Operating Environment = Severe) for the compounded worst case.

↓

Asset Reliability Diagnostic — Forced Outage Rate (Rung 1)

Set obs(Forced Outage Rate = Elevated) then add load, maintenance, or age evidence one at a time to read which root cause the data most supports.

All models require Bayes Server (free edition available). See Download Models for the full library across all case studies.

Next Step

Your reliability engineers know which assets are symptomatic and which programs are under pressure. That knowledge needs to be in a causal model before the next capital allocation cycle — not discovered after the overhaul deferral produces a forced outage at peak demand.

The models are free. What I provide is the judgment to build the right structure for your specific fleet, encode your engineers’ knowledge into it, and turn the output into decisions your board can act on. The discipline stays with your team.

info@rung3.ai

This case study is a composite drawn from published asset reliability literature, utility and industrial maintenance program reviews, and infrastructure capital planning practice. Specific figures are representative. No individual organization or engagement is described. The Bayes Server models are working files: download, set evidence, and run inference.

The control that prevented the problem you can’t see

On this page

Why a Causal Model

The Questions

Did the program prevent failures we cannot see?

What does extending the overhaul interval actually cause?

Which team is right about the rising outage rate?

Download the Models