Why Structural Causal Models?

In one paragraph

Bayes Server is the modeling editor of choice during the engagement. The license is $742 per modeller-seat, perpetual, and it is optional after the engagement — your team can author in pgmpy (Python), pyAgrum, R (bnlearn / gRain), or GeNIe instead. The model itself exports to an open interchange format any compliant tool can read. Inference can run on Bayes Server’s runtime or any of the open-source alternatives. The LLM is whichever vendor you choose — Anthropic, OpenAI, or self-hosted. Ongoing license to me: none.

The engagement produces three things: a structural causal model of one decision domain, the capability in your team to refine and extend it, and a working integration with a language model interface so non-technical decision-makers can query it. Everything below is the technology layer that supports those three things.

The architecture is deliberately portable. The modeling environment, the inference engine, and the language-model layer are each interchangeable. The constant is the model itself: a structural causal graph and its parameters, serialized in an open interchange format that any compatible tool can load.

The modeling environment

Bayes Server is the editor of choice during the construction phase. It is the cleanest UI for the work of building a causal graph with subject-matter experts in the room: drawing nodes, defining states, parameterizing conditional probability tables, running sensitivity analyses, and inspecting belief propagation interactively. Most of the case-page screenshots on this site were made with it.

Bayes Server is not required after the engagement. The model exports to an open interchange format that any compliant Bayesian-network tool can read. If your team wants to continue authoring in Bayes Server, the license is $742 per modeller-seat, perpetual. That is a per-modeller cost, not a per-user-of-the-model cost. A free evaluation copy is available from bayesserver.com — your team can install it and inspect the engagement’s model directly before deciding whether to purchase any seats.

Open-source alternatives that read the same exported model:

pgmpy (Python, MIT) — full inference engine, query support, parameter learning. The strongest open-source choice for teams already on Python.
pyAgrum (Python/C++, LGPL) — broader graphical-model support, including influence diagrams and dynamic Bayesian networks. Faster for large models.
bnlearn and gRain (R) — mature R packages with structure-learning and inference. The right choice if your analytics team is R-native.
GeNIe / SMILE (commercial; free for academic use, paid licenses for production use). The closest match in editor experience to Bayes Server.

The choice is yours. The engagement’s output is portable; the runtime stack is your call.

The model artifact

What your team actually receives, and owns:

The model file in an open interchange format — the structural graph, the variable specifications, and the parameters (conditional probability tables, or structural equations for continuous variables).
The parameter-elicitation record — who provided which estimates, with what stated uncertainty, in which session. This is the audit trail behind the model’s numbers; it is the difference between “the model says X” and “here is why the model says X.”
The sensitivity analyses — which parameters drive which outputs, where the model is robust, where it is fragile. Useful both for prioritizing future refinement and for explaining the model’s outputs to regulators.
Validation cases — the test set used to verify the model against historical decisions or against reasoned expert judgment. These become regression tests as your team refines the model.
Construction notes — the modeling rationale, the structural choices made and why, the alternatives considered. This is the document your team turns to two years from now when they need to extend the model and want to know what was deliberate.

All of this is text-based and lives in version control alongside any other code asset. The model is a thing you diff, review, and ship like software.
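
A minimal sketch of what "a thing you diff" looks like in practice. This page does not name the interchange format, so JSON stands in for it here; the two-node network, its variable names, and its probabilities are invented for illustration only.

```python
import difflib
import json


def serialize(model: dict) -> str:
    """Render the model as stable, line-oriented text (sorted keys, fixed indent)."""
    return json.dumps(model, indent=2, sort_keys=True)


# Hypothetical two-node model: Rain -> WetGrass. Numbers are illustrative only.
model_v1 = {
    "nodes": {
        "Rain": {"states": ["no", "yes"], "parents": [], "cpt": {"": [0.8, 0.2]}},
        "WetGrass": {
            "states": ["no", "yes"],
            "parents": ["Rain"],
            "cpt": {"no": [0.9, 0.1], "yes": [0.2, 0.8]},
        },
    }
}

# A refinement: an expert revises the prior on Rain.
model_v2 = json.loads(serialize(model_v1))  # deep copy through the text form
model_v2["nodes"]["Rain"]["cpt"][""] = [0.7, 0.3]

# The review artifact: a unified diff of the two serialized versions.
diff = list(
    difflib.unified_diff(
        serialize(model_v1).splitlines(),
        serialize(model_v2).splitlines(),
        fromfile="model_v1.json",
        tofile="model_v2.json",
        lineterm="",
    )
)
print("\n".join(diff))
```

Sorting keys and fixing the indent keeps the text stable between saves, so a reviewer sees only the lines an expert actually changed.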

Inference engine for production queries

Production queries against the model fall into two patterns:

Interactive. A user asks a counterfactual question; the model returns a posterior. Latency budget: hundreds of milliseconds. Computation: one inference pass plus any sensitivity analysis you want to surface alongside the answer.
Batch. A pipeline scores or evaluates many cases against the model. Latency budget: minutes to hours, depending on volume. Useful for re-scoring a portfolio when the model is updated, or for running counterfactuals across an entire historical decision log.

Both patterns run on the same engine. Bayes Server has a runtime library that supports them; pgmpy handles them in pure Python; pyAgrum has the fastest large-model performance. The choice depends on your stack preferences and your scale.
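
To make "the same engine" concrete, here is a small sketch in plain Python, with no inference library involved. It uses brute-force enumeration over a hypothetical three-node network (Rain, Sprinkler, WetGrass) with invented probabilities, and it answers a plain posterior query rather than a counterfactual one. A production engine would use a more efficient algorithm; the point is only that the interactive and batch patterns are the same call.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler.
# variable -> (parents, table mapping parent values to P(variable=True)).
NETWORK = {
    "Rain": ((), {(): 0.2}),
    "Sprinkler": ((), {(): 0.3}),
    "WetGrass": (
        ("Rain", "Sprinkler"),
        {(False, False): 0.05, (False, True): 0.8, (True, False): 0.9, (True, True): 0.99},
    ),
}


def joint(world: dict) -> float:
    """Probability of one complete assignment, via the chain rule over the graph."""
    p = 1.0
    for var, (parents, table) in NETWORK.items():
        p_true = table[tuple(world[q] for q in parents)]
        p *= p_true if world[var] else 1.0 - p_true
    return p


def posterior(target: str, evidence: dict) -> float:
    """P(target=True | evidence), summing the joint over all consistent worlds."""
    names = list(NETWORK)
    num = den = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # this world contradicts the evidence
        p = joint(world)
        den += p
        if world[target]:
            num += p
    return num / den


# Interactive pattern: one question, one answer.
single = posterior("Rain", {"WetGrass": True})

# Batch pattern: the same call mapped over a log of cases.
cases = [{"WetGrass": True}, {"WetGrass": False}, {"WetGrass": True, "Sprinkler": True}]
batch = [posterior("Rain", case) for case in cases]
```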

The inference engine is the load-bearing component of the production system. It is also the smallest piece of code: a few thousand lines for the open-source options, well-tested and stable. The complexity of the system lives in the model, not the engine.

The LLM interface

The language model is the natural-language front door to the causal model. It does four jobs:

Parse the question. Recognize whether the user is asking a Rung 1 question (description), a Rung 2 question (intervention), or a Rung 3 question (counterfactual). Reject or clarify if ambiguous.
Translate to a formal causal query. Identify the variables and conditions in the question, map them to nodes in the causal graph, and produce a structured query the inference engine can execute — a do-operator expression for interventions, a counterfactual expression for Rung 3.
Invoke the inference engine with the formal query.
Translate the structured result back to prose, including the relevant uncertainty bounds and any caveats the sensitivity analysis surfaces.

Any language model that supports function-calling or tool use can drive this interface. Anthropic Claude, OpenAI GPT-4, locally-hosted Llama or Mistral with appropriate tool-use frameworks — all work. The interface contract between the LLM and the inference engine is stable; the LLM behind it is interchangeable. Your choice will turn on data-residency requirements, cost-per-query targets, and whichever vendor your platform team is already standardized on.
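
One way to picture the interface contract is a small structured payload that the LLM's tool call fills in and the engine validates and executes. The sketch below is an assumption, not the engagement's actual schema: `CausalQuery`, `execute`, and the confounded Season/Promo/Sales network with its probabilities are all hypothetical, the LLM call itself is left out, and Rung 3 is deliberately not implemented because it needs structural equations.

```python
from dataclasses import dataclass, field
from itertools import product

# Hypothetical confounded network: Season -> Promo, Season -> Sales, Promo -> Sales.
# variable -> (parents, table mapping parent values to P(variable=True)).
NETWORK = {
    "Season": ((), {(): 0.5}),
    "Promo": (("Season",), {(False,): 0.2, (True,): 0.8}),
    "Sales": (
        ("Season", "Promo"),
        {(False, False): 0.1, (False, True): 0.3, (True, False): 0.5, (True, True): 0.7},
    ),
}


@dataclass(frozen=True)
class CausalQuery:
    """Structured payload a tool-calling LLM would emit (hypothetical schema)."""
    rung: int                                      # 1 observe, 2 intervene, 3 counterfactual
    target: str
    evidence: dict = field(default_factory=dict)   # observed values
    do: dict = field(default_factory=dict)         # values forced by intervention


def _posterior(network: dict, target: str, evidence: dict) -> float:
    """P(target=True | evidence) by enumeration over all worlds."""
    names = list(network)
    num = den = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = 1.0
        for var, (parents, table) in network.items():
            p_true = table[tuple(world[q] for q in parents)]
            p *= p_true if world[var] else 1.0 - p_true
        den += p
        num += p if world[target] else 0.0
    return num / den


def execute(query: CausalQuery, network: dict = NETWORK) -> dict:
    """Validate the query against the graph, run it, return a structured result."""
    unknown = ({query.target} | set(query.evidence) | set(query.do)) - set(network)
    if unknown:
        raise ValueError(f"unknown variables: {sorted(unknown)}")
    if query.rung == 1:
        if query.do:
            raise ValueError("rung 1 queries cannot carry an intervention")
        model = network
    elif query.rung == 2:
        # Graph surgery: cut each intervened node from its parents and pin its value.
        model = dict(network)
        for var, value in query.do.items():
            model[var] = ((), {(): 1.0 if value else 0.0})
    else:
        raise NotImplementedError("rung 3 needs structural equations; not sketched here")
    evidence = {**query.evidence, **query.do}
    return {"target": query.target, "rung": query.rung,
            "p_true": _posterior(model, query.target, evidence)}


seen = execute(CausalQuery(rung=1, target="Sales", evidence={"Promo": True}))
forced = execute(CausalQuery(rung=2, target="Sales", do={"Promo": True}))
```

In this toy network, observing a promotion and forcing one give different answers because Season influences both Promo and Sales. That gap is why the question has to be classified by rung before it reaches the engine.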

The LLM is the only component of the architecture that improves automatically over time, as the vendor upgrades the underlying language model. The causal model itself improves because your experts continue to refine it, not because any vendor releases an update.

Deployment patterns

Three patterns cover the engagements the practice has supported:

Modeller-driven, on demand. Your analysts and domain experts run the model themselves during decision support cycles — reserve committee meetings, portfolio reviews, program-evaluation discussions. The model lives on a workstation or shared analytics environment. Appropriate for low-frequency, high-stakes decisions.
Embedded in existing analytics tooling. The model is loaded as a runtime component inside the dashboard, scoring pipeline, or risk-engine your team already operates. Used in scheduled cycles (monthly reserve runs, quarterly attribution analyses) and accessed through the surrounding application’s UI. Appropriate for medium-frequency decisions where the audience is operational, not technical.
Queryable via internal API. The model sits behind an internal endpoint that downstream systems can call. Latency is sub-second. Useful when the model is supporting many decision-makers at once, or when the LLM interface is integrated into a conversational tool the organization already deploys.

The engagement does not require any particular pattern. The choice depends on what your team already runs, what your security posture allows, and how often the decisions you’re supporting actually occur.

Day-2 maintenance

The model is a living artifact. It is refined as evidence accumulates and as your environment changes — not because the methodology drifts, but because the world the model represents drifts.

Day-2 maintenance has three components:

Version control. The model file, the elicitation record, and the construction notes all live in a repository alongside any other code asset. Changes are reviewed, diffed, and traceable.
Regression testing. The validation cases from the original engagement become a test suite. When the model is refined, the suite automatically detects any regression against the original cases, and every changed result requires an explicit explanation of why the new answer is correct.
Sensitivity-analysis baselines. The original sensitivity analysis becomes a reference. Refinements that materially shift which parameters drive which outputs are flagged for review, because they imply the model’s structure or its calibration has changed in a non-trivial way.

None of this requires me. Your team owns the maintenance practice. The point of the engagement is to leave you with that capability, not with a dependency.
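
The regression and sensitivity-baseline checks can be sketched in a few lines. Everything here is hypothetical: a two-parameter-family Defect/Alarm model with invented numbers, validation cases expressed as (observation, expected posterior, tolerance), and a crude one-way sensitivity measure that nudges each parameter by a fixed step. A real suite would run against the exported model file with whichever engine your team uses.

```python
def make_model(p_defect, p_alarm_given_defect, p_alarm_given_ok):
    """Return a query function for P(Defect=True | Alarm=alarm) via Bayes' rule."""
    def query(alarm: bool) -> float:
        like_defect = p_alarm_given_defect if alarm else 1 - p_alarm_given_defect
        like_ok = p_alarm_given_ok if alarm else 1 - p_alarm_given_ok
        num = p_defect * like_defect
        return num / (num + (1 - p_defect) * like_ok)
    return query


# Parameters at hand-over, and after a hypothetical refinement.
BASELINE = {"p_defect": 0.10, "p_alarm_given_defect": 0.90, "p_alarm_given_ok": 0.10}
REFINED = {**BASELINE, "p_alarm_given_ok": 0.05}

# Validation cases: (observed alarm, expected posterior, tolerance).
VALIDATION_CASES = [(True, 0.50, 0.02), (False, 0.012, 0.005)]


def run_regression(params: dict, cases: list) -> list:
    """Return every validation case whose answer moved outside its tolerance."""
    query = make_model(**params)
    failures = []
    for alarm, expected, tol in cases:
        got = query(alarm)
        if abs(got - expected) > tol:
            failures.append({"alarm": alarm, "expected": expected, "got": round(got, 4)})
    return failures


def top_driver(params: dict, alarm: bool = True, step: float = 0.01) -> str:
    """Name the parameter whose small nudge moves the posterior the most."""
    base = make_model(**params)(alarm)
    shifts = {
        name: abs(make_model(**{**params, name: value + step})(alarm) - base)
        for name, value in params.items()
    }
    return max(shifts, key=shifts.get)


baseline_failures = run_regression(BASELINE, VALIDATION_CASES)
refined_failures = run_regression(REFINED, VALIDATION_CASES)

# Flag for review if the dominant driver differs from the baseline's.
driver_shifted = top_driver(BASELINE) != top_driver(REFINED)
```

A failing case is not automatically a bug. It is a prompt for the reviewer to record why the refined answer is the correct one, or to revert the change.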

Total cost picture

The honest order-of-magnitude:

The engagement itself. Eight to sixteen weeks for one decision domain. See “The smallest commitment” for how the work starts (a half-day audit on one decision domain, no further obligation).
Bayes Server seats, optional. $742 per modeller-seat, perpetual. Pay this only if your team wants to continue authoring in Bayes Server after the engagement. A free evaluation copy is available from bayesserver.com for inspection before any purchase. If your team authors in pgmpy, pyAgrum, or R instead, the seat cost is zero.
LLM API costs, ongoing. Variable with query volume. Typical small-to-medium decision-support queries cost well under $0.01 each at current API pricing. A small department might spend $50–$500 per month; a large organization with high-volume integration might spend several thousand. Self-hosted models remove the per-query API fee and replace it with your own hosting and compute costs.
Compute for inference, ongoing. Small. Bayesian-network inference is computationally light at the scales most decisions occur at. A modest server, or part of an existing analytics environment, is sufficient.
Ongoing license to me. None. The engagement closes; your team runs the model.

The structural cost picture: a defined engagement, then your team operates and refines the artifact independently. No vendor relationship survives the engagement unless your team chooses to maintain one (e.g., the Bayes Server seats).
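
For a rough budget, the recurring items reduce to simple arithmetic. The sketch below uses the $742 seat price quoted on this page; the seat count, query volume, and per-query price are placeholder assumptions to replace with your own figures. It leaves out the engagement fee and inference compute.

```python
SEAT_PRICE_USD = 742  # per modeller-seat, perpetual (figure quoted on this page)


def first_year_cost(seats: int, queries_per_month: int, usd_per_query: float) -> dict:
    """Optional one-off seat cost plus twelve months of LLM API spend."""
    seats_total = seats * SEAT_PRICE_USD
    llm_monthly = queries_per_month * usd_per_query
    return {
        "seats_one_off": seats_total,
        "llm_monthly": llm_monthly,
        "first_year_total": seats_total + 12 * llm_monthly,
    }


# Assumed inputs: 3 modeller seats, 20,000 queries a month at $0.005 per query.
example = first_year_cost(seats=3, queries_per_month=20_000, usd_per_query=0.005)

# With open-source authoring tools the seat line drops to zero.
open_source = first_year_cost(seats=0, queries_per_month=20_000, usd_per_query=0.005)
```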

Next step

If this matches the picture you wanted to see, the entry point is the half-day audit. If you have a specific stack-related question this page didn’t answer — data-residency, an integration constraint, a regulatory requirement on model artifacts — drop a note. The conversation is simpler if the specifics are on the table.

You’ll leave with a model, not a vendor relationship.