The LLMs-and-causality literature is moving fast and decomposes into four paradigms with different bets. This page names the bets, walks one query through all four, and says explicitly where the methods arc sits — including what it commits to that the other three don't.
The Question Behind the Field
Causal reasoning has a structure that observation alone does not capture. A correlation between cigarette smoking and lung cancer is consistent with the causal claim, with the reverse claim, and with hidden-variable explanations involving stress or genetics. Pearl's program clarified this: the symbolic apparatus needed for causal answers — directed graphs, do-calculus, identification under assumptions, counterfactuals — is not in the data. It has to be brought to the data.
Large language models present a complication. They are trained on text in which causal claims are made constantly — in textbooks, in scientific papers, in news articles, in everyday writing. A natural question follows: can an LLM substitute for, augment, or replace the symbolic causal apparatus that previously had to be hand-built? The literature has answered this question four times, in four different directions.
The four paradigms below are not strict partitions; some papers occupy more than one. But they correspond to genuinely different bets about what the LLM is doing, what the symbolic side is doing, and what the human has to do. A reader of this site benefits from seeing the four laid out together, because the methods arc this site develops sits in the fourth — and the fourth's distinctness from the other three is most of what makes the methods arc's commitments load-bearing.
Can coherence across massive text substitute for experimental validation? Two of the paradigms below say partly; one says yes, eventually, at scale; the methods arc says no. The cost of saying no is that something else has to do the validating work. The methods arc names that something else: humans inhabiting a discipline.
Four Paradigms
Each paradigm below takes a different position on three questions: What is the LLM trusted to do? What is the symbolic side, and where does it come from? What does the human contribute? I name a representative paper for each. The list is illustrative, not exhaustive.
Paradigm 1: Discovery assistant
The LLM proposes priors, constraints, or candidate edges; classical causal-discovery algorithms (PC, GES, LiNGAM) do the actual structure-learning work on observational or experimental data. The LLM is a knowledge source, not a reasoner. Statistical tests remain the grounding mechanism.
Representative work: Takayama et al. (2024) propose "statistical causal prompting" that synthesizes LLM-derived background knowledge with classical structure-learning. Cohrs et al. (2024) use the LLM as a conditional-independence oracle inside the PC algorithm.
Failure mode: Inherits all the failure modes of classical causal discovery — sensitivity to test choices, faithfulness assumptions, finite-sample noise — and adds the LLM's tendency to import correlation as causation. Stays scientifically honest; stays slow.
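The division of labor in this paradigm can be sketched in a few lines. This is a minimal illustration, not the method of either paper cited above. It assumes the LLM's output has already been reduced to a set of forbidden variable pairs (`llm_forbidden`, hard-coded here rather than produced by a model), and it runs only marginal Fisher-z correlation tests, which is a simplification of the conditional-independence tests that PC performs. The variable names and the synthetic data are hypothetical.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def independence_p_value(xs, ys):
    """Two-sided p-value for zero correlation via the Fisher z-transform."""
    r = max(min(pearson(xs, ys), 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(len(xs) - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def constrained_skeleton(data, forbidden, alpha=0.01):
    """Keep an undirected edge only if the prior allows it AND the test rejects independence."""
    edges = set()
    for a, b in itertools.combinations(sorted(data), 2):
        pair = frozenset((a, b))
        if pair in forbidden:
            continue  # prior knowledge prunes the search space
        if independence_p_value(data[a], data[b]) < alpha:
            edges.add(pair)  # the statistical test still decides
    return edges


# Hypothetical synthetic data: employment depends on wage_floor; rainfall is unrelated.
rng = random.Random(0)
wage_floor = [rng.gauss(0, 1) for _ in range(500)]
employment = [2 * w + rng.gauss(0, 1) for w in wage_floor]
rainfall = [rng.gauss(0, 1) for _ in range(500)]
data = {"wage_floor": wage_floor, "employment": employment, "rainfall": rainfall}

# Assumed stand-in for LLM-derived background knowledge.
llm_forbidden = {frozenset(("rainfall", "wage_floor"))}

skeleton = constrained_skeleton(data, llm_forbidden)
print(sorted(sorted(edge) for edge in skeleton))
```

The prior only removes candidates from consideration. Every edge that survives has still passed a test on data, which is why the paradigm inherits the weaknesses of those tests.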
Paradigm 2: Causal reasoner
Direct queries to an LLM about pairwise causation, counterfactuals, or effect direction. No external causal apparatus; the LLM is asked to reason, and benchmark scores measure how well it does. The position is empirical: maybe LLMs have absorbed enough causal structure from training to be useful causal reasoners, even if the mechanism is unclear.
Representative work: Kıcıman, Ness, Sharma & Tan (2023) showed GPT-3.5/4 achieving high accuracy on pairwise discovery, counterfactual reasoning, and event-causality benchmarks, surpassing many existing methods.
Failure mode: Memorization vs reasoning is the central debate. A model that has read the smoking-and-lung-cancer literature can produce the right answer about smoking and lung cancer without doing causal reasoning at all. Robustness across novel domains is contested. The mechanism is opaque, which makes scope conditions ungrounded.
Paradigm 3: Knowledge generator
The most ambitious paradigm. The LLM is asked to extract or generate causal claims about a domain at scale; those claims are normalized into causal triples or graphs; the result is a large causal model spanning many domains. The symbolic side is built from the LLM, not just informed by it.
Representative work: Mahadevan's DEMOCRITUS system (2025) exemplifies this paradigm — proposing topics, generating causal questions, extracting causal statements, normalizing into triples, integrating into a global causal model across archeology, biology, climate, economics, and medicine. Gendron et al. (2024) work in the same direction at smaller scale, building causal graphs from natural-language documents and conducting counterfactual inference on the resulting structure.
Failure mode: No ground-truth anchor. No interventions, no statistical tests on the resulting model. LLM hallucinations propagate when integrated; conflicting edges destabilize the graph. Validation is by coherence, not by experiment. Risks producing a coherent worldview that is also wrong.
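The conflicting-edges problem is easy to see in miniature. The sketch below is an assumption-laden toy, not the pipeline of either system cited above: the triples are hypothetical, and `integrate` is an illustrative helper that merges `(cause, effect, sign)` claims and reports variable pairs on which the extracted claims disagree, either in sign or in direction.

```python
from collections import defaultdict

# Hypothetical extracted triples: (cause, effect, sign of effect).
triples = [
    ("minimum wage", "labor cost", "+"),
    ("minimum wage", "employment", "-"),
    ("minimum wage", "employment", "+"),   # same direction, opposite sign
    ("labor demand", "employment", "+"),
    ("employment", "labor demand", "+"),   # reversed direction
]


def integrate(triples):
    """Group claims by unordered variable pair; flag pairs with more than one distinct claim."""
    claims = defaultdict(set)
    for cause, effect, sign in triples:
        claims[frozenset((cause, effect))].add((cause, effect, sign))
    conflicts = {pair: sorted(c) for pair, c in claims.items() if len(c) > 1}
    return dict(claims), conflicts


claims, conflicts = integrate(triples)
for pair, versions in conflicts.items():
    print(sorted(pair), "->", versions)
```

The code can detect that two claims disagree. Nothing inside the system says which one is right, which is the sense in which validation here is by coherence rather than by experiment.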
Paradigm 4: Mediator (this work)
The symbolic side is a curated library of structural causal models, each authored or sponsored by a human modeler with named provenance, declared scope, identification metadata, and a regime under which the model applies. The LLM does not generate the contents. It performs the surface work: recognizing the situation a user describes, finding a candidate model, instantiating variables, posing the structured query, translating the answer back. Validation is grounded in the library's contents and in the audit trail of how queries are answered, with humans (caretakers) doing the operations that the library cannot.
Representative work: The methods arc developed across the SCM library design, composition, scope, translation, provenance and audit, and caretakers pages on this site.
Failure mode: The library has to actually be built, by humans, with their time. There is no scaling shortcut. Three real models with declared scope and audit trails are worth more than thirty toy models, and producing the three takes domain-expert months. The bet is on curation, and curation is expensive.
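What a curated entry has to carry can be made concrete with a small data structure. This is a sketch under stated assumptions, not the schema used on the SCM library design page: the class names, field names, author, and identification string are hypothetical, and the scope values mirror the wage-elasticity example used elsewhere on this page. The one behavior shown is that an entry without named authorship or identification metadata cannot be constructed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    level: str             # e.g. "state" or "national"
    region: str
    magnitude_pct: tuple   # (low, high) size of intervention covered
    years: tuple           # (first, last) year of the regime


@dataclass(frozen=True)
class LibraryEntry:
    name: str
    author: str            # named human provenance
    scope: Scope           # declared scope and regime
    identification: str    # how the causal effect is identified
    edges: tuple           # directed edges of the structural causal model

    def __post_init__(self):
        # Refuse entries that omit the metadata the library depends on.
        missing = [f for f in ("author", "identification") if not getattr(self, f).strip()]
        if missing:
            raise ValueError(f"entry {self.name!r} lacks required metadata: {missing}")


entry = LibraryEntry(
    name="wage_elasticity",
    author="Jane Modeler (hypothetical)",
    scope=Scope(level="state", region="US", magnitude_pct=(5.0, 25.0), years=(2005, 2018)),
    identification="difference-in-differences (assumed for illustration)",
    edges=(("minimum_wage", "employment"),),
)
print(entry.name, entry.scope)
```

Making the metadata mandatory at construction time is one way to express that curation, not generation, fills the library.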
| Paradigm | What is the LLM trusted to do? | Where does symbolic content come from? | Validation grounded in |
|---|---|---|---|
| 1. Discovery assistant | Provide priors and candidate edges | Classical structure learning over data | Statistical tests on observational data |
| 2. Causal reasoner | Answer causal queries directly | Implicit in LLM weights | Benchmark accuracy |
| 3. Knowledge generator | Generate causal claims at scale | Extracted from LLM outputs | Internal coherence at scale |
| 4. Mediator (this work) | Translate, retrieve, instantiate, refuse | Human modelers, with provenance | Library scope + audit trail + caretakers |
A Worked Contrast
Take the same query the methods arc has been working with: "What would happen to employment under a 50% national minimum-wage hike?" Each paradigm handles it differently. The differences are the point.
Paradigm 1: LLM as discovery assistant
An analyst with state-level minimum-wage and employment data uses an LLM to propose candidate causal edges for inclusion in a structure-learning algorithm. The LLM suggests common confounders (regional economic cycles, industry composition), and the algorithm runs structure learning constrained by those suggestions. The output is a graph fit to the available data. To answer the user's question, the analyst notes that the data covers 5–25% state-level hikes; a 50% national hike is out of distribution. The honest paradigm-1 answer is: the model fit to available data does not extrapolate to this regime, and the LLM's priors don't substitute for data we don't have.
Paradigm 2: LLM as causal reasoner
The query is asked directly to a state-of-the-art LLM. The model produces a fluent response — likely citing standard labor-economics arguments, possibly invoking elasticity estimates, possibly hedging. Whether the answer is correct depends on whether the LLM's training data contained correct reasoning about wage hikes at this magnitude. There is no mechanism in the system to detect that the user's 50% national hike is out of the regime where its absorbed knowledge was generated. The answer is fluent and ungrounded.
Paradigm 3: LLM as causal-knowledge generator
A pre-existing large causal model — extracted from text across many domains — contains nodes for "minimum wage," "employment," "labor demand," and edges between them. The user's query is mapped onto this graph, an effect path is identified, and a result is computed. The result reflects the aggregated causal structure that the LLM expressed across all the texts about wages and employment. Whether that aggregated structure is right at 50%, in particular, is unanswerable from inside the system — the graph's edges are not annotated with the regime conditions under which they hold.
Paradigm 4: this work
The query enters the library as natural language. The translation primitives from the Translation page back-translate it into a structured query. The mediator's first attempt — re-translating "50% national hike" as "12% aggregated state-level" — is flagged as a frame shift. The user is shown the transformation and given the opportunity to confirm or reject. The wage-elasticity model in the library has scope conditions declared: state-level, 5–25% magnitude, US 2005–2018. The 50% national hike is outside the regime scope. The library refuses, and refuses informatively: this query would require a model with national-magnitude regime scope; no such model exists in this library; here is why one would be hard to build. The audit trail records every check that fired and every check that didn't. A caretaker reviewing the trail later can adjudicate whether the refusal was correct.
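The scope check and the informative refusal can be sketched as follows. This is a minimal illustration, not the library's implementation: `Query`, `Answer`, `DECLARED_SCOPE`, and `answer_or_refuse` are hypothetical names, and only two of the declared scope conditions (level and magnitude) are checked. Every check is recorded whether it passes or fails, so the trail shows both what fired and what did not.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    level: str
    magnitude_pct: float


@dataclass
class Answer:
    refused: bool
    message: str
    audit: list = field(default_factory=list)


# Scope of the wage-elasticity model as described above (partial, for illustration).
DECLARED_SCOPE = {"level": "state", "magnitude_pct": (5.0, 25.0)}


def answer_or_refuse(query, scope=DECLARED_SCOPE):
    """Run every scope check, log each outcome, and refuse if any check fails."""
    audit = []

    level_ok = query.level == scope["level"]
    audit.append(("level", query.level, scope["level"], level_ok))

    low, high = scope["magnitude_pct"]
    magnitude_ok = low <= query.magnitude_pct <= high
    audit.append(("magnitude_pct", query.magnitude_pct, (low, high), magnitude_ok))

    failed = [name for name, _, _, ok in audit if not ok]
    if failed:
        message = f"Refused: query is outside declared scope on {', '.join(failed)}."
        return Answer(refused=True, message=message, audit=audit)
    return Answer(refused=False, message="In scope: pass the structured query to the model.", audit=audit)


result = answer_or_refuse(Query(level="national", magnitude_pct=50.0))
print(result.message)
for check in result.audit:
    print(check)
```

A caretaker replaying the trail sees the requested value, the declared bound, and the outcome for each check, which is what makes the refusal adjudicable later.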
Each paradigm produces a different output for the same query. Paradigm 1 declines to extrapolate because the data doesn't extend to this regime. Paradigm 2 says something fluent that may or may not be right. Paradigm 3 says something derived from a textual aggregate that has no regime annotation. Paradigm 4 refuses, and the refusal is itself a piece of information.
Where This Work Sits
Reading the contrast above, three things distinguish the methods arc from the other three paradigms.
1. What is curated, and by whom
Paradigm 1 curates priors; the data does the rest. Paradigm 2 curates nothing; the LLM is the entire symbolic system. Paradigm 3 curates the extraction pipeline; the LLM produces the contents. The methods arc curates the contents themselves — every model, every scope declaration, every identification claim, every parameter range, with named human authorship and provenance through the library's audit trail. The bet is that this curation work is the load-bearing work, and that no amount of LLM coherence substitutes for it.
2. What the LLM is asked to do
The LLM in this work performs four operations and only four: recognize a situation, locate a relevant model, instantiate variables, and translate between natural language and structured queries. It does not reason about causation. It does not generate causal structure. The reasoning is in the library; the generation is by humans. This is a deliberately narrow assignment and the Translation page is largely about not letting the LLM do more than it has been assigned.
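One way to hold the LLM to that narrow assignment is to give it an interface with exactly those four operations and nothing else. The sketch below is hypothetical throughout: `Mediator`, `KeywordMediator`, `mediate`, and the `solve` callable are illustrative names, and the keyword matcher is a toy stand-in for a language model. The causal answer comes from `solve`, which represents the library side, never from the mediator.

```python
from typing import Callable, Optional, Protocol


class Mediator(Protocol):
    """The four operations assigned to the LLM; no method answers a causal question."""
    def recognize(self, text: str) -> str: ...
    def locate(self, situation: str) -> Optional[str]: ...
    def instantiate(self, model_id: str, text: str) -> dict: ...
    def translate(self, result: dict) -> str: ...


class KeywordMediator:
    """Toy stand-in for an LLM: keyword matching only."""
    CATALOG = {"minimum_wage": "wage_elasticity"}  # situation -> model id (hypothetical)

    def recognize(self, text):
        lowered = text.lower()
        if "minimum-wage" in lowered or "minimum wage" in lowered:
            return "minimum_wage"
        return "unknown"

    def locate(self, situation):
        return self.CATALOG.get(situation)

    def instantiate(self, model_id, text):
        return {"model": model_id, "query": text}

    def translate(self, result):
        return f"Model {result['model']} answered: {result['answer']}"


def mediate(mediator: Mediator, text: str, solve: Callable[[dict], str]) -> str:
    """Recognize, locate, instantiate, then hand off to the library and translate back."""
    situation = mediator.recognize(text)
    model_id = mediator.locate(situation)
    if model_id is None:
        return "Refused: no model in the library covers this situation."
    structured = mediator.instantiate(model_id, text)
    result = {**structured, "answer": solve(structured)}
    return mediator.translate(result)


print(mediate(KeywordMediator(), "What about a minimum-wage hike?", lambda q: "stub answer"))
print(mediate(KeywordMediator(), "Will it rain tomorrow?", lambda q: "stub answer"))
```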
3. What validation rests on
Paradigm 1 grounds validation in statistical tests. Paradigm 2 grounds validation in benchmark accuracy. Paradigm 3 grounds validation in coherence across the assembled graph. The methods arc grounds validation in three things together: the library's declared scope and identification metadata, the audit trail of how each query was answered, and the human caretakers who interpret the trail and stand behind models with their own standing. None of the three would be sufficient alone; together they constitute a different kind of validation than the other paradigms claim.
How this fits Marcus and Pearl
The methods arc is consistent with Marcus's neurosymbolic vision (Marcus, 2020) in its commitment to a hybrid architecture with explicit symbolic structure. It is narrower than Marcus's program: he advocates for symbolic reasoning and prior knowledge generally; this work commits specifically to causal models with declared scope and identification, which is one cut through his broader frame.
The methods arc takes Pearl's skepticism of LLMs more seriously than paradigms 2 and 3 do. Pearl has argued that generative AI has hindered the causal community by shifting attention away from the apparatus that causal answers require. The methods arc's response is to insist that the apparatus stays — every model in the library is a Pearl-style structural causal model with declared assumptions — and that the LLM is confined to mediation, not reasoning. Whether Pearl would accept this confinement as adequate is another question. The work commits to the bet that confined mediation is workable; Pearl might judge that any LLM mediation imports the failure modes he warns against. This page does not resolve that disagreement; it names it.
Paradigm 4 is the slowest. Three real models in one domain take months of human work. There is no scaling shortcut. What that cost buys: refusals are informative, errors are recoverable, scope is declared, and an answer signed by a caretaker is defensible to a regulator six months later. The methods arc says that trade is worth making.
This page is the positioning argument. Two adjacent pages make the architectural case directly:
Why Not Use an LLM? — the negative case for the architecture, with the LLM failure modes named in detail.
LLM-Mediated SCM Libraries — the frontier framing. Five-layer architecture, current research, and what an engagement actually builds today.
What This Doesn't Claim
This page is explicit about what it claims. It is worth being equally explicit about four things it does not claim.
It doesn't claim that paradigms 1–3 are wrong
Each of the other paradigms is doing valuable work and answering questions paradigm 4 doesn't address. Paradigm 1 is the natural fit for domains where data is rich and the question is "what's the structure underneath what we measured." Paradigm 2 is the natural fit for cases where benchmark accuracy is the relevant criterion. Paradigm 3 is the natural fit for building wide, shallow causal landscapes where coverage matters more than depth. The methods arc is making a different bet, not a refutation.
It doesn't claim novelty for the architecture
An LLM-mediated retrieval over a curated knowledge base is a familiar architectural pattern. What's specific here is the contents of the knowledge base — structural causal models with declared scope, identification metadata, and provenance — and the operations defined on those contents in the methods arc. The architecture is conventional; the contents and the operations are the work.
It doesn't claim the bet is settled
Whether human-curated SCM libraries with LLM mediation can scale to enough domains to be useful is an empirical question the methods arc does not yet answer. Three or four real models in finance, marketing, or pricing would constitute a stronger answer than the present pages can give. The Caretakers page's ask is partly an ask for collaborators in those domains.
It doesn't claim immunity to its own failure modes
The library can be wrong. Caretakers can be wrong. Scope conditions can be misdeclared. Provenance can be incomplete. The methods arc develops machinery for surfacing these failures rather than preventing them — refusal-as-information, audit replay, the Caretakers' adjudication operation. The bet is not that paradigm 4 fails less than the others; the bet is that paradigm 4's failures are recoverable in a way that fluent confident wrong answers are not.
If the bet this page describes is the bet you'd want to make, the methods arc develops the operations: library design, composition, scope, translation, audit, and the people who do the work — caretakers.
info@rung3.ai
Cohrs, K.-H., Varando, G., Diaz, E., Sitokonstantinou, V., & Camps-Valls, G. (2024). Large language models for constrained-based causal discovery. arXiv:2406.07378.
Gendron, G., Witbrock, M., Rožanec, J. M., & Dobbie, G. (2024). Counterfactual causal inference in natural language with large language models. arXiv:2410.06392.
Kıcıman, E., Ness, R., Sharma, A., & Tan, C. (2023/2024). Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research. arXiv:2305.00050.
Mahadevan, S. (2025). Large causal models from large language models. arXiv:2512.07796. Adobe Research and University of Massachusetts, Amherst.
Marcus, G. (2020). The next decade in AI: Four steps towards robust artificial intelligence. arXiv:2002.06177.
Takayama, M., Okuda, T., Pham, T., Ikenoue, T., Fukuma, S., & Shimizu, S. (2024). Integrating large language models in causal discovery: A statistical causal approach. arXiv:2402.01454.