Why generic AI tools are unreliable for regulatory compliance research

The problem is not that AI is wrong. It is that AI does not know when it is wrong.

Compliance professionals are under pressure to work faster with fewer resources. Generic AI tools, including ChatGPT, Gemini, Perplexity, and a growing list of AI-powered legal research products, are an obvious candidate for offloading the research burden. They are fast, fluent, and confident. They summarise regulation in plain English, answer questions about compliance obligations, and produce well-structured responses that look authoritative.

That appearance of authority is the problem.

A generic language model answering a question about MiCA authorisation requirements, DORA ICT risk management obligations, or SFDR disclosure categories is drawing on its training data, a frozen snapshot of text from the public internet assembled at a point in the past. It does not know whether that information is current. It cannot tell you when its knowledge of a regulation was last updated. It does not distinguish between the final published text of a regulation and a working draft from eighteen months before final publication. And it has no mechanism for flagging the difference between a confident claim it can verify and a confident claim it is producing from pattern-matching across ambiguous sources.

For most applications, this is an acceptable trade-off. For compliance research, it is not. The consequences of acting on incorrect regulatory information range from wasted preparation effort to material non-compliance with enforceable obligations.

This article is for informational purposes only and does not constitute legal advice. Consult a qualified legal professional for advice specific to your situation.

What hallucination looks like in a regulatory context

Hallucination is the term used when a language model generates factually incorrect content with no signal to the reader that the output may be wrong. In the context of general knowledge questions, hallucination is an inconvenience. In a regulatory context, it has a specific and serious character.

EU financial regulation is precise in ways that matter enormously. Whether a requirement is mandatory or subject to discretionary application by the national competent authority (NCA) determines whether a firm needs to rebuild a system or update a policy. Whether an exemption applies to a firm depends on details of its authorisation type, size threshold, and business model, not on a general characterisation of what “most firms” are required to do. Whether a deadline has passed, is approaching, or applies to a specific firm depends on the exact provisions of the regulation and any transitional arrangements in force.

Generic AI tools routinely get these details wrong in ways that are difficult to detect without independent verification. Common failure modes include:

Presenting an obligation as settled when the relevant regulatory technical standard has not yet been finalised. A language model trained before a final RTS was published may describe the requirements based on the consultation draft, which may differ from the adopted text.
Describing an exemption that was modified or removed between the initial proposal and the final regulation. Trilogue negotiations frequently alter scope provisions. A model trained on early-stage proposal documents will not reliably reflect the final legislative text.
Conflating requirements across related but distinct regulations. DORA and NIS2 share subject matter but impose different obligations on firms in scope for both. A model that has absorbed large volumes of text about both regulations may blend their requirements in ways that are not accurate for either.
Citing a provision as applying to a firm type to which it does not apply, or failing to surface an obligation that does apply, because the categorisation logic in the training data was imprecise.

None of these errors comes with a warning. The output reads the same whether the underlying information is accurate or not.

Why the confidence signal cannot be trusted

Language models are optimised to produce fluent, confident-sounding text. This is a feature for most applications and a liability for compliance research. A model that expressed uncertainty proportional to its actual reliability would produce hedged, equivocal output that most users would find frustrating. The models that have achieved wide adoption are the ones that are easy to read and sound authoritative.

The result is that a generic AI tool asked about CASP authorisation requirements under MiCA will produce a clear, well-structured answer regardless of whether its information is current, complete, or correctly scoped to the jurisdiction the user is asking about. The user has no way to assess from the response alone whether the answer reflects the final published regulation (CELEX: 32023R1114), a prior draft, a general characterisation of the regime, or a blend of all three.

This is structurally different from the uncertainty a lawyer or compliance professional would express. A qualified practitioner asked the same question would tell you what they know, what they are uncertain about, and what you would need to verify independently. They would cite the specific text they are relying on. They would flag where the regulatory position is contested or where NCA interpretations diverge. A language model does none of these things because it has no mechanism for knowing what it does not know.

The knowledge cutoff problem is worse than it appears

Every large language model has a training cutoff: a date beyond which it has no information. This is widely understood. What is less widely appreciated is that the knowledge cutoff understates the problem for regulatory research in two ways.

First, the training data for any model is not uniformly distributed across time. Text about a regulation published in the months immediately before the training cutoff is typically underrepresented relative to text published years earlier, because the internet has had less time to process, discuss, and analyse recent developments. A model with a training cutoff of late 2024 may have thin, unreliable coverage of regulatory developments from mid-2024 onwards even though those developments technically fall within its training window.

Second, EU financial regulation moves faster than training cycles. MiCA’s final-stage implementing regulations and supervisory guidance have continued to develop through 2025 and into 2026. The AI Act’s financial services provisions became subject to active supervisory interpretation in 2025. DORA’s first supervisory assessment cycle has generated NCA guidance that postdates any plausible training cutoff for current-generation models. A tool trained even in late 2024 is already materially out of date on active regulatory files.

A compliance professional relying on a generic AI tool is not just accepting the risk of occasional errors. They are systematically accepting a view of the regulatory landscape that may be one to two years behind the current position.

Why source anchoring is the baseline requirement

The only defensible architecture for AI-assisted regulatory research is one in which every claim in the output traces back to a specific, retrievable source document. Not a general characterisation of the regulatory landscape. Not a synthesis of training data. A specific document, with a specific identifier, published by a specific authority on a specific date.

In EU financial regulation, this means anchoring to EUR-Lex and the publications of the European Supervisory Authorities. EUR-Lex assigns a unique CELEX identifier to every EU legal instrument. A regulatory intelligence system built on EUR-Lex can tell you not just what the regulation requires, but which version of the regulation it is drawing on, when it was published, and where in the document the relevant provision appears.
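As a rough illustration of what a CELEX identifier gives a system to work with, the sketch below splits an identifier such as 32023R1114 into sector, year, document type, and number, and builds a EUR-Lex link from it. The names parse_celex and CelexId, the simplified pattern, and the partial type mapping are assumptions made for this example. It covers only the common shape used for legal acts and does not handle corrigenda, consolidated versions, or other sectors.

```python
import re
from dataclasses import dataclass

# Simplified pattern (assumption): one sector digit, a 4-digit year,
# a 1-2 letter document type, and a 4-digit number. Real CELEX numbers
# have more variants than this sketch accepts.
CELEX_PATTERN = re.compile(
    r"^(?P<sector>\d)(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})(?P<number>\d{4})$"
)

# Partial, illustrative mapping of document type codes for legal acts.
DOC_TYPES = {"R": "Regulation", "L": "Directive", "D": "Decision"}


@dataclass(frozen=True)
class CelexId:
    sector: str
    year: int
    doc_type: str
    number: int
    raw: str

    def eurlex_url(self, language: str = "EN") -> str:
        # URL shape as commonly seen on EUR-Lex; treat it as an assumption.
        return f"https://eur-lex.europa.eu/legal-content/{language}/TXT/?uri=CELEX:{self.raw}"


def parse_celex(raw: str) -> CelexId:
    """Split a CELEX identifier into its parts, or raise ValueError."""
    cleaned = raw.strip()
    match = CELEX_PATTERN.match(cleaned)
    if match is None:
        raise ValueError(f"Not a recognised CELEX identifier: {raw!r}")
    return CelexId(
        sector=match["sector"],
        year=int(match["year"]),
        doc_type=match["doc_type"],
        number=int(match["number"]),
        raw=cleaned,
    )


mica = parse_celex("32023R1114")
print(mica.year, DOC_TYPES.get(mica.doc_type), mica.number)
print(mica.eurlex_url())
```

Rejecting anything that does not parse, rather than guessing, is the design choice that matters here: an identifier that cannot be resolved should never be attached to a claim.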

This is not a feature. It is the baseline requirement for any system whose output will be used to inform compliance decisions.

The contrast with generic AI tools is structural, not a matter of degree. A generic language model cannot provide source anchoring because its outputs are not produced by retrieval from a defined source set. They are produced by inference from the weights learned during training. The model does not know where a given claim came from because the claim did not come from any single place.

A retrieval-augmented system built on verified official sources works differently. The retrieval layer fetches specific documents from a defined source corpus. The inference layer is constrained to draw only on those documents. Every claim in the output corresponds to a chunk of source text with a retrievable origin. The system can fail to retrieve the right documents, or the source corpus can have gaps, but it is designed not to produce confident claims about documents it has not retrieved. For the implications of retrieval failures and how to engineer around them, see “why deterministic RAG beats generative AI for research”.
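The constraint described above can be sketched in a few lines. This is a toy example, not a production design: the corpus, the keyword-overlap retriever, and the names SourceChunk, retrieve, and answer_from_sources are all hypothetical, and the chunk texts are short placeholder descriptions rather than quotations from the regulations. What it illustrates is that the answer step only ever sees retrieved chunks, and that it returns an explicit "no source found" result when retrieval comes back empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceChunk:
    celex: str     # identifier of the source document
    locator: str   # where in the document the text sits
    text: str      # placeholder description, NOT the legal text


# Hypothetical mini-corpus for illustration only.
CORPUS = [
    SourceChunk("32023R1114", "Article 59", "authorisation of crypto-asset service providers"),
    SourceChunk("32022R2554", "Chapter II", "ICT risk management obligations for financial entities"),
]


def tokenize(text: str) -> set[str]:
    """Lowercase words with trailing punctuation removed."""
    return {word.strip(".,?").lower() for word in text.split()}


def retrieve(query: str, corpus: list[SourceChunk], min_overlap: int = 2) -> list[SourceChunk]:
    """Toy retriever: keep chunks sharing at least min_overlap words with the query."""
    query_words = tokenize(query)
    return [c for c in corpus if len(query_words & tokenize(c.text)) >= min_overlap]


def answer_from_sources(query: str, corpus: list[SourceChunk]) -> dict:
    """Return only retrieved text with its origin, or an explicit refusal."""
    chunks = retrieve(query, corpus)
    if not chunks:
        return {"status": "no_source_found", "claims": []}
    claims = [{"text": c.text, "source": f"CELEX:{c.celex}, {c.locator}"} for c in chunks]
    return {"status": "ok", "claims": claims}


print(answer_from_sources("What are the ICT risk management obligations?", CORPUS))
print(answer_from_sources("What is the capital of France?", CORPUS))
```

A real system would replace the keyword overlap with a proper retriever and add a synthesis step, but the contract stays the same: no retrieved chunk, no claim.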

The accountability problem

Compliance is a domain in which decisions are made by accountable professionals and later reviewed by supervisors, auditors, and sometimes courts. The question that matters is not “did the AI give a plausible answer?” but “can you demonstrate the basis on which this compliance decision was made?”

If a compliance officer relied on a generic AI tool to assess whether their firm needed to comply with a particular DORA obligation and the tool provided an incorrect answer, the liability for that non-compliance sits with the compliance officer, not the tool. The tool does not have a professional duty of care. It does not appear before regulators. It is not accountable in any sense that the regulatory framework recognises.

This creates a specific obligation for the professionals who use these tools: they are responsible for verifying what the tool tells them against authoritative sources. If that verification is required anyway, the value of a generic AI tool in the research step lies in quickly producing a first draft to verify, not in providing a reliable answer in its own right. For high-stakes compliance decisions, that is a narrow use case.

The standard that regulatory intelligence needs to meet is the same standard that applies to the practitioner who acts on it: can you point to the source, and is the source authoritative and current? Generic AI tools cannot meet that standard. Source-anchored systems built on verified official documents can. For a broader treatment of why traceability and auditability matter in any research pipeline, see “AI belongs after the data is clean, not before”.

What good regulatory AI looks like

The failures described above are not arguments against AI in regulatory intelligence. They are arguments against AI applied without source anchoring, without retrieval constraints, and without transparency about the basis for each claim.

A well-engineered regulatory intelligence system uses AI at the right stage of the pipeline. Ingestion and retrieval are deterministic: documents are fetched from official sources, indexed with their CELEX identifiers and publication metadata, and retrieved against specific queries with hard constraints on what the inference layer can draw on. The AI layer does what it is genuinely good at: synthesising across retrieved documents, identifying the provisions most relevant to a specific firm profile, and surfacing the compliance implications in plain language. The output includes citations to the specific source documents and a retrievable audit trail.
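A minimal sketch of that split, under stated assumptions: ingest is the deterministic step that records the CELEX identifier, publication date, and a hash of the exact text ingested, and run_query wraps whatever synthesis step is plugged in so that the answer is always stored alongside the sources it was given. All names here are hypothetical, the document text is a placeholder, and stub_synthesise stands in for the AI layer without calling any model.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class IndexedDocument:
    celex: str
    published: str   # ISO date taken from official publication metadata
    sha256: str      # fingerprint of the exact text that was ingested


def ingest(celex: str, published: date, text: str) -> IndexedDocument:
    """Deterministic step: the same input always yields the same index record."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return IndexedDocument(celex, published.isoformat(), digest)


def run_query(
    query: str,
    docs: list[IndexedDocument],
    synthesise: Callable[[str, list[IndexedDocument]], str],
) -> dict:
    """Call the synthesis step on indexed documents and record what it was given."""
    return {
        "query": query,
        "answer": synthesise(query, docs),
        "sources": [asdict(d) for d in docs],
    }


def stub_synthesise(query: str, docs: list[IndexedDocument]) -> str:
    # Stand-in for the AI layer (hypothetical): it only lists its inputs.
    return "Based on: " + ", ".join(d.celex for d in docs)


doc = ingest("32023R1114", date(2023, 6, 9), "placeholder text, not the regulation")
record = run_query("CASP authorisation", [doc], stub_synthesise)
print(json.dumps(record, indent=2))
```

Storing the hash alongside the identifier means a later reviewer can confirm not just which document was cited but which exact version of its text the system had at the time.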

This is not a description of how generic AI tools work. It is a description of how a purpose-built regulatory intelligence system needs to work to be fit for compliance use.

The architectural principle is the same one that applies to research pipelines more generally: data integrity before analysis. Deterministic collection and filtering first, AI for synthesis and interpretation on clean, source-linked data. For a detailed treatment of the engineering behind this architecture in the context of research more broadly, see “why deterministic RAG beats generative AI for research”.

The practical test

Before relying on any AI tool for regulatory research, apply three tests.

The first is source transparency: can the tool tell you exactly which document a given claim came from, with a retrievable identifier? Not “based on MiCA” but “based on Article 59 of Regulation (EU) 2023/1114 (CELEX: 32023R1114), published in the Official Journal on 9 June 2023.” If the tool cannot provide that level of specificity, the claim is unverifiable.

The second is currency: does the tool draw on a corpus that is continuously updated from official sources, and does it tell you the publication date of each source it is citing? A tool that cannot confirm when its source documents were published, or that relies on a static training snapshot with no update mechanism, is not fit for compliance use.

The third is scope discipline: does the tool distinguish between what the regulation requires and what NCAs in specific jurisdictions have said about implementation? EU regulations are directly applicable, but supervisory interpretation varies by member state. A tool that blends the regulation text with general commentary about implementation without flagging the distinction is collapsing two layers of information that compliance professionals need to keep separate.

Generic AI tools fail all three tests. Purpose-built regulatory intelligence systems, built on verified official source corpora with full retrieval transparency, can pass them.
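For teams that want to apply the three tests mechanically to a tool's output, they can be expressed as a simple check over each cited claim. The sketch below is one possible encoding, not a standard: the CitedClaim fields, the layer labels "regulation" and "nca_guidance", and the 30-day freshness threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CitedClaim:
    text: str
    celex: Optional[str]        # e.g. "32023R1114"
    locator: Optional[str]      # e.g. "Article 59"
    published: Optional[date]   # publication date of the cited source
    layer: Optional[str]        # "regulation" or "nca_guidance" (assumed labels)


def practical_test(
    claim: CitedClaim,
    corpus_last_updated: Optional[date],
    today: date,
    max_age_days: int = 30,     # arbitrary example threshold
) -> dict:
    """Apply the three tests to one claim and report each result separately."""
    # Test 1: is there a retrievable identifier and a location within it?
    source_transparency = bool(claim.celex and claim.locator)
    # Test 2: is the source dated, and has the corpus been refreshed recently?
    currency = (
        claim.published is not None
        and corpus_last_updated is not None
        and (today - corpus_last_updated).days <= max_age_days
    )
    # Test 3: is the claim explicitly labelled as regulation text or NCA guidance?
    scope_discipline = claim.layer in {"regulation", "nca_guidance"}
    return {
        "source_transparency": source_transparency,
        "currency": currency,
        "scope_discipline": scope_discipline,
    }


anchored = CitedClaim("CASP authorisation", "32023R1114", "Article 59", date(2023, 6, 9), "regulation")
unanchored = CitedClaim("based on MiCA", None, None, None, None)

print(practical_test(anchored, date(2026, 1, 10), today=date(2026, 1, 15)))
print(practical_test(unanchored, None, today=date(2026, 1, 15)))
```

Reporting the three results separately, rather than a single pass or fail, keeps it visible which property a given claim is missing.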

Forseti monitors EU financial regulation continuously, delivering personalised impact analysis anchored to verified EUR-Lex sources with full CELEX traceability: the source-anchored architecture this article describes.