The LLMs-and-causality literature is moving fast and decomposes into four paradigms with different bets. This page names the bets, walks one query through all four, and says explicitly where the methods arc sits — including what it commits to that the other three don't.

Causal reasoning has a structure that observation alone does not capture. A correlation between cigarette smoking and lung cancer is consistent with the causal claim, with the reverse claim, and with hidden-variable explanations involving stress or genetics. Pearl's program clarified this: the symbolic apparatus needed for causal answers — directed graphs, do-calculus, identification under assumptions, counterfactuals — is not in the data. It has to be brought to the data.

Large language models present a complication. They are trained on text in which causal claims are made constantly — in textbooks, in scientific papers, in news articles, in everyday writing. A natural question follows: can an LLM substitute for, augment, or replace the symbolic causal apparatus that previously had to be hand-built? The literature has answered this question four times, in four different directions.

The four paradigms below are not strict partitions; some papers occupy more than one. But they correspond to genuinely different bets about what the LLM is doing, what the symbolic side is doing, and what the human has to do. A reader of this site benefits from seeing the four laid out together, because the methods arc this site develops sits in the fourth — and the fourth's distinctness from the other three is most of what makes the methods arc's commitments load-bearing.

The deeper question

Can coherence across massive text substitute for experimental validation? Two of the paradigms below say partly; one says yes, eventually, at scale; the methods arc says no. The cost of saying no is that something else has to do the validating work. The methods arc names that something else: humans inhabiting a discipline.

Each paradigm below takes a different position on three questions: what is the LLM trusted to do? what is the symbolic side, and where does it come from? what does the human contribute? I name a representative paper for each. The list is illustrative, not exhaustive.

Paradigm 1
LLM as discovery assistant
Bet: text-derived knowledge improves data-driven causal discovery

The LLM proposes priors, constraints, or candidate edges; classical causal-discovery algorithms (PC, GES, LiNGAM) do the actual structure-learning work on observational or experimental data. The LLM is a knowledge source, not a reasoner. Statistical tests remain the grounding mechanism.

Representative work: Takayama et al. (2024) propose "statistical causal prompting" that synthesizes LLM-derived background knowledge with classical structure-learning. Cohrs et al. (2024) use the LLM as a conditional-independence oracle inside the PC algorithm.

Failure mode: Inherits all the failure modes of classical causal discovery — sensitivity to test choices, faithfulness assumptions, finite-sample noise — and adds the LLM's tendency to import correlation as causation. Stays scientifically honest; stays slow.
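
As a rough illustration of the division of labor in this paradigm, the sketch below runs a simplified PC-style skeleton search in which a set of forbidden pairs stands in for LLM-derived background knowledge. All names here (`pc_skeleton`, `ci_test`, `forbidden`, `chain_oracle`) are hypothetical and are not taken from Takayama et al. or Cohrs et al. A real pipeline would plug in a statistical conditional-independence test on data and would condition only on current neighbors rather than on all remaining variables.

```python
from itertools import combinations


def pc_skeleton(variables, ci_test, forbidden=frozenset(), max_cond=1):
    """Simplified PC skeleton phase (illustrative only).

    ci_test(x, y, cond) returns True when x and y are judged independent
    given the set cond. `forbidden` holds unordered pairs (frozensets)
    that a prior, here assumed to come from an LLM, rules out up front.
    """
    # Start from the complete undirected graph, minus pairs the prior excludes.
    edges = {frozenset(pair) for pair in combinations(variables, 2)} - set(forbidden)
    for size in range(max_cond + 1):
        # Iterate over a sorted copy so edges can be removed during the loop.
        for edge in sorted(edges, key=sorted):
            x, y = sorted(edge)
            others = [v for v in variables if v not in edge]
            for cond in combinations(others, size):
                if ci_test(x, y, set(cond)):
                    edges.discard(edge)  # the test, not the prior, removes it
                    break
    return edges


# Toy oracle for the chain A -> B -> C: A and C are independent given B.
def chain_oracle(x, y, cond):
    return {x, y} == {"A", "C"} and "B" in cond


skeleton = pc_skeleton(["A", "B", "C"], chain_oracle)
constrained = pc_skeleton(
    ["A", "B", "C"], chain_oracle, forbidden={frozenset({"A", "B"})}
)
```

The prior only narrows the search space; every remaining edge still has to survive the independence tests, which is the sense in which statistical tests stay the grounding mechanism.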

Paradigm 2
LLM as causal reasoner
Bet: LLMs already perform causal reasoning at useful accuracy

Direct queries to an LLM about pairwise causation, counterfactuals, or effect direction. No external causal apparatus; the LLM is asked to reason, and benchmark scores measure how well it does. The position is empirical: maybe LLMs have absorbed enough causal structure from training to be useful causal reasoners, even if the mechanism is unclear.

Representative work: Kıcıman, Ness, Sharma & Tan (2023) showed GPT-3.5/4 achieving high accuracy on pairwise discovery, counterfactual reasoning, and event-causality benchmarks, surpassing many existing methods.

Failure mode: Memorization vs reasoning is the central debate. A model that has read the smoking-and-lung-cancer literature can produce the right answer about smoking and lung cancer without doing causal reasoning at all. Robustness across novel domains is contested. The mechanism is opaque, which makes scope conditions ungrounded.

Paradigm 3
LLM as causal-knowledge generator
Bet: text contains enough causal structure to assemble large causal models from

The most ambitious paradigm. The LLM is asked to extract or generate causal claims about a domain at scale; those claims are normalized into causal triples or graphs; the result is a large causal model spanning many domains. The symbolic side is built from the LLM, not just informed by it.

Representative work: Mahadevan's DEMOCRITUS system (2025) exemplifies this paradigm — proposing topics, generating causal questions, extracting causal statements, normalizing into triples, integrating into a global causal model across archeology, biology, climate, economics, and medicine. Gendron et al. (2024) work in the same direction at smaller scale, building causal graphs from natural-language documents and conducting counterfactual inference on the resulting structure.

Failure mode: No ground-truth anchor. No interventions, no statistical tests on the resulting model. LLM hallucinations propagate when integrated; conflicting edges destabilize the graph. Validation is by coherence, not by experiment. Risks producing a coherent worldview that is also wrong.
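
To make the "conflicting edges" failure concrete, here is a minimal sketch of normalizing extracted triples into a graph and flagging pairs asserted in both directions. It is not the DEMOCRITUS pipeline or Gendron et al.'s method; the function names and the example triples are assumptions made up for illustration.

```python
from collections import defaultdict


def normalize(term):
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def build_graph(triples):
    """Map each normalized cause to the set of its normalized effects."""
    graph = defaultdict(set)
    for cause, _relation, effect in triples:
        graph[normalize(cause)].add(normalize(effect))
    return graph


def conflicting_pairs(graph):
    """Return unordered pairs that appear as edges in both directions."""
    conflicts = set()
    for cause, effects in graph.items():
        for effect in effects:
            # graph.get avoids creating new keys while we iterate.
            if cause != effect and cause in graph.get(effect, ()):
                conflicts.add(frozenset((cause, effect)))
    return conflicts


# Hypothetical extracted statements, as an LLM might emit them.
triples = [
    ("Minimum wage", "reduces", "Employment"),
    ("employment", "raises", "minimum  wage"),
    ("Minimum Wage", "raises", "labor cost"),
]
graph = build_graph(triples)
conflicts = conflicting_pairs(graph)
```

A check like this can detect that two claims disagree, but nothing inside the graph says which one is right. That is the gap the failure mode above describes.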

Paradigm 4 — what this work bets on
Human-curated Structural Causal Model (SCM) library, LLM mediation, audit-grounded validation
Bet: the symbolic side has to be human-curated; the LLM mediates access

The symbolic side is a curated library of structural causal models, each authored or sponsored by a human modeler with named provenance, declared scope, identification metadata, and a regime under which the model applies. The LLM does not generate the contents. It performs the surface work — recognizing the situation a user describes, finding a candidate model, instantiating variables, posing the structured query, translating the answer back. Validation is grounded in the library's contents and in the audit trail of how queries are answered, with humans (caretakers) doing the operations that the library cannot.

Representative work: The methods arc developed across the SCM library design, composition, scope, translation, provenance and audit, and caretakers pages on this site.

Failure mode: The library has to actually be built, by humans, with their time. There is no scaling shortcut. Three real models with declared scope and audit trails are worth more than thirty toy models, and producing the three takes domain-expert months. The bet is on curation, and curation is expensive.
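
One way to picture a library entry of the kind described above is as a record whose provenance, identification, and scope fields are mandatory. The sketch below is an assumption about shape only: the class names, field names, and placeholder strings are hypothetical, and the scope values are borrowed from the wage-elasticity example used later on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    level: str             # e.g. "state"
    min_change_pct: float  # smallest intervention magnitude covered
    max_change_pct: float  # largest intervention magnitude covered
    region: str
    years: tuple           # (first_year, last_year), inclusive


@dataclass(frozen=True)
class LibraryEntry:
    model_id: str
    author: str            # named human provenance
    identification: str    # how the causal effect is identified
    scope: Scope

    def missing_metadata(self):
        """Return the names of required text fields that are blank."""
        required = ("model_id", "author", "identification")
        return [name for name in required if not getattr(self, name).strip()]


wage_scope = Scope("state", 5.0, 25.0, "US", (2005, 2018))
entry = LibraryEntry(
    "wage-elasticity", "<modeler name>", "<identification strategy>", wage_scope
)
unsigned = LibraryEntry(
    "wage-elasticity", "", "<identification strategy>", wage_scope
)
```

Under this sketch, an entry with no named author is reported as incomplete and never silently accepted, which mirrors the requirement that every model carry human provenance.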

| Paradigm | What is the LLM trusted to do? | Where does symbolic content come from? | Validation grounded in |
| --- | --- | --- | --- |
| 1. Discovery assistant | Provide priors and candidate edges | Classical structure learning over data | Statistical tests on observational data |
| 2. Causal reasoner | Answer causal queries directly | Implicit in LLM weights | Benchmark accuracy |
| 3. Knowledge generator | Generate causal claims at scale | Extracted from LLM outputs | Internal coherence at scale |
| 4. Mediator (this work) | Translate, retrieve, instantiate, refuse | Human modelers, with provenance | Library scope + audit trail + caretakers |

Take the same query the methods arc has been working with: "What would happen to employment under a 50% national minimum-wage hike?" Each paradigm handles it differently. The differences are the point.

Paradigm 1: LLM as discovery assistant

An analyst with state-level minimum-wage and employment data uses an LLM to propose candidate causal edges for inclusion in a structure-learning algorithm. The LLM suggests common confounders (regional economic cycles, industry composition), and the algorithm runs structure learning constrained by those suggestions. The output is a graph fit to the available data. To answer the user's question, the analyst notes that the data covers 5–25% state-level hikes; a 50% national hike is out of distribution. The honest paradigm-1 answer is: the model fit to available data does not extrapolate to this regime, and the LLM's priors don't substitute for data we don't have.

Paradigm 2: LLM as causal reasoner

The query is asked directly to a state-of-the-art LLM. The model produces a fluent response — likely citing standard labor-economics arguments, possibly invoking elasticity estimates, possibly hedging. Whether the answer is correct depends on whether the LLM's training data contained correct reasoning about wage hikes at this magnitude. There is no mechanism in the system to detect that the user's 50% national hike is out of the regime where its absorbed knowledge was generated. The answer is fluent and ungrounded.

Paradigm 3: LLM as causal-knowledge generator

A pre-existing large causal model — extracted from text across many domains — contains nodes for "minimum wage," "employment," "labor demand," and edges between them. The user's query is mapped onto this graph, an effect path is identified, and a result is computed. The result reflects the aggregated causal structure that the LLM expressed across all the texts about wages and employment. Whether that aggregated structure is right at 50%, in particular, is unanswerable from inside the system — the graph's edges are not annotated with the regime conditions under which they hold.

Paradigm 4: this work

The query enters the library as natural language. The translation primitives from the Translation page back-translate it into a structured query. The mediator's first attempt — re-translating "50% national hike" as "12% aggregated state-level" — is flagged as a frame shift. The user is shown the transformation and given the opportunity to confirm or reject. The wage-elasticity model in the library has scope conditions declared: state-level, 5–25% magnitude, US 2005–2018. The 50% national hike is outside the regime scope. The library refuses, and refuses informatively: this query would require a model with national-magnitude regime scope; no such model exists in this library; here is why one would be hard to build. The audit trail records every check that fired and every check that didn't. A caretaker reviewing the trail later can adjudicate whether the refusal was correct.
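
The walk-through above can be sketched in a few lines, assuming a query is a small dictionary and the model's declared scope is another. The names `is_frame_shift`, `evaluate`, and `WAGE_MODEL_SCOPE` are hypothetical, and the period and country conditions are left out for brevity. The point is only that every check is recorded whether or not it fires, and that an out-of-scope query produces a refusal with reasons, not an answer.

```python
# Declared scope of the wage-elasticity model (level and magnitude only).
WAGE_MODEL_SCOPE = {"level": "state", "min_pct": 5.0, "max_pct": 25.0}


def is_frame_shift(original, translated):
    """A translation that changes any field of the query is a frame shift."""
    return original != translated


def evaluate(query, scope):
    """Run every scope check and record each outcome, fired or not."""
    checks = [
        ("level", query["level"] == scope["level"]),
        ("magnitude", scope["min_pct"] <= query["change_pct"] <= scope["max_pct"]),
    ]
    audit = [(name, "passed" if ok else "fired") for name, ok in checks]
    failed = [name for name, ok in checks if not ok]
    if failed:
        reason = "out of scope: " + ", ".join(failed)
        return {"answered": False, "reason": reason, "audit": audit}
    return {"answered": True, "reason": "in scope", "audit": audit}


asked = {"level": "national", "change_pct": 50.0}
retranslated = {"level": "state", "change_pct": 12.0}

shift_flagged = is_frame_shift(asked, retranslated)  # shown to the user
refusal = evaluate(asked, WAGE_MODEL_SCOPE)          # the user's actual query
```

Note that the retranslated query would pass both checks, which is why the frame shift has to be surfaced to the user before any scope check runs on it.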

Each paradigm produces a different output for the same query. Paradigm 1 declines to extrapolate because the data doesn't extend that far. Paradigm 2 says something fluent that may or may not be right. Paradigm 3 says something derived from a textual aggregate that has no regime annotation. Paradigm 4 refuses, and the refusal is itself a piece of information.

Paradigms 2 and 3 are willing to be wrong; the methods arc is willing to be silent.

Reading the contrast above, three things distinguish the methods arc from the other three paradigms.

1. What is curated, and by whom

Paradigm 1 curates priors; the data does the rest. Paradigm 2 curates nothing; the LLM is the entire symbolic system. Paradigm 3 curates the extraction pipeline; the LLM produces the contents. The methods arc curates the contents themselves — every model, every scope declaration, every identification claim, every parameter range, with named human authorship and provenance through the library's audit trail. The bet is that this curation work is the load-bearing work, and that no amount of LLM coherence substitutes for it.

2. What the LLM is asked to do

The LLM in this work performs four operations and only four: recognize a situation, locate a relevant model, instantiate variables, and translate between natural language and structured queries. It does not reason about causation. It does not generate causal structure. The reasoning is in the library; the generation is by humans. This is a deliberately narrow assignment and the Translation page is largely about not letting the LLM do more than it has been assigned.

3. What validation rests on

Paradigm 1 grounds validation in statistical tests. Paradigm 2 grounds validation in benchmark accuracy. Paradigm 3 grounds validation in coherence across the assembled graph. The methods arc grounds validation in three things together: the library's declared scope and identification metadata, the audit trail of how each query was answered, and the human caretakers who interpret the trail and vouch for models with their own standing. None of the three would be sufficient alone; together they constitute a different kind of validation than the other paradigms claim.

How this fits Marcus and Pearl

The methods arc is consistent with Marcus's neurosymbolic vision (Marcus, 2020) in its commitment to a hybrid architecture with explicit symbolic structure. It is narrower than Marcus's program: he advocates for symbolic reasoning and prior knowledge generally; this work commits specifically to causal models with declared scope and identification, which is one cut through his broader frame.

The methods arc takes Pearl's skepticism of LLMs more seriously than paradigms 2 and 3 do. Pearl has argued that generative AI has hindered the causal community by shifting attention away from the apparatus that causal answers require. The methods arc's response is to insist that the apparatus stays — every model in the library is a Pearl-style structural causal model with declared assumptions — and that the LLM is confined to mediation, not reasoning. Whether Pearl would accept this confinement as adequate is another question. The work commits to the bet that confined mediation is workable; Pearl might judge that any LLM mediation imports the failure modes he warns against. This page does not resolve that disagreement; it names it.

What the bet costs and pays

Paradigm 4 is the slowest. Three real models in one domain take months of human work. There is no scaling shortcut. What the bet pays for that cost: refusals are informative, errors are recoverable, scope is declared, and an answer signed by a caretaker is defensible to a regulator six months later. The methods arc says that trade is worth making.

Companion arguments

This page is the positioning argument. Two adjacent pages make the architectural case directly:

Why Not Use an LLM? — the negative case for the architecture, with the LLM failure modes named in detail.

LLM-Mediated SCM Libraries — the frontier framing. Five-layer architecture, current research, and what an engagement actually builds today.

The page makes four claims that are worth being explicit about — and four that it doesn't.

It doesn't claim that paradigms 1–3 are wrong

Each of the other paradigms is doing valuable work and answering questions paradigm 4 doesn't address. Paradigm 1 is the natural fit for domains where data is rich and the question is "what's the structure underneath what we measured." Paradigm 2 is the natural fit for cases where benchmark accuracy is the relevant criterion. Paradigm 3 is the natural fit for building wide, shallow causal landscapes where coverage matters more than depth. The methods arc is making a different bet, not a refutation.

It doesn't claim novelty for the architecture

An LLM-mediated retrieval over a curated knowledge base is a familiar architectural pattern. What's specific here is the contents of the knowledge base — structural causal models with declared scope, identification metadata, and provenance — and the operations defined on those contents in the methods arc. The architecture is conventional; the contents and the operations are the work.

It doesn't claim the bet is settled

Whether human-curated SCM libraries with LLM mediation can scale to enough domains to be useful is an empirical question the methods arc does not yet answer. Three or four real models in finance, marketing, or pricing would constitute a stronger answer than the present pages can give. The Caretakers page's ask is partly an ask for collaborators in those domains.

It doesn't claim immunity to its own failure modes

The library can be wrong. Caretakers can be wrong. Scope conditions can be misdeclared. Provenance can be incomplete. The methods arc develops machinery for surfacing these failures rather than preventing them — refusal-as-information, audit replay, the Caretakers' adjudication operation. The bet is not that paradigm 4 fails less than the others; the bet is that paradigm 4's failures are recoverable in a way that fluent confident wrong answers are not.

Discovery assistant
Paradigm in which the LLM provides priors or constraints to a classical causal-discovery algorithm. The data does the structure-learning work; the LLM informs it.
Causal reasoner (LLM-as)
Paradigm in which the LLM answers causal queries directly, without external causal apparatus. Validation is by benchmark accuracy.
Causal-knowledge generator
Paradigm in which the LLM extracts causal claims from text at scale, normalized into triples and integrated into a large causal graph. The symbolic side is built from the LLM rather than separately curated.
Mediator (LLM-as)
Paradigm where the LLM performs only surface operations — recognizing situations, retrieving models, instantiating variables, translating queries — over a separately curated library of causal models. The position the methods arc commits to.
Coherence-grounded validation
Validation by internal consistency of an assembled causal graph, without recourse to interventions, data, or human adjudication. The form of validation paradigm 3 most relies on.
Audit-grounded validation
Validation by combination of declared scope, recorded query trail, and human caretaker review. The form of validation the methods arc commits to.

Next Step

If the bet this page describes is the bet you'd want to make, the methods arc develops the operations: library design, composition, scope, translation, audit, and the people who do the work — caretakers.

info@rung3.ai

References

Cohrs, K.-H., Varando, G., Diaz, E., Sitokonstantinou, V., & Camps-Valls, G. (2024). Large language models for constrained-based causal discovery. arXiv:2406.07378.

Gendron, G., Witbrock, M., Rožanec, J. M., & Dobbie, G. (2024). Counterfactual causal inference in natural language with large language models. arXiv:2410.06392.

Kıcıman, E., Ness, R., Sharma, A., & Tan, C. (2023/2024). Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research. arXiv:2305.00050.

Mahadevan, S. (2025). Large causal models from large language models. arXiv:2512.07796. Adobe Research and University of Massachusetts, Amherst.

Marcus, G. (2020). The next decade in AI: Four steps towards robust artificial intelligence. arXiv:2002.06177.

Takayama, M., Okuda, T., Pham, T., Ikenoue, T., Fukuma, S., & Shimizu, S. (2024). Integrating large language models in causal discovery: A statistical causal approach. arXiv:2402.01454.
