Why ChatGPT and Perplexity fail compliance teams: the source anchoring problem

This article is for informational purposes only and does not constitute legal advice. Consult a qualified legal professional for advice specific to your situation.

The problem with a confident wrong answer

A compliance officer at a payment institution is preparing for a DORA review. They ask ChatGPT: what are the incident reporting timelines under DORA for major ICT incidents? They receive a clear, well-structured answer with specific timeframes. The answer reads like it was written by someone who knows the regulation. It cites the relevant framework. It is formatted for easy reference.

It is also based on a consultation draft of the implementing RTS that was superseded by the final published standard. The timeframes in the answer differ from the timeframes in the regulation currently in force. The compliance officer, who has no way to know this from the response itself, builds their incident response procedure around the wrong figures.

This is not a hypothetical edge case. It is the structural consequence of using a tool that was not built for compliance-grade regulatory research. The failure is not a bug that will be fixed in the next model update. It is the inevitable result of how generic AI tools work and what they were designed to do.

This article extends the argument in “Why generic AI tools are unreliable for regulatory compliance research” into direct product comparison territory. It applies the three-test framework to ChatGPT, Perplexity, and similar general-purpose AI tools, and explains what source-anchored architecture prevents.

How ChatGPT and Perplexity approach regulatory questions

ChatGPT generates responses from learned weights: a statistical representation of patterns in its training data. When asked about DORA incident reporting timelines, it does not retrieve the current text of Commission Delegated Regulation (EU) 2025/301, the RTS on the content and time limits for incident reports, from EUR-Lex. It generates a response that reflects the patterns in whatever text about DORA incident reporting was in its training corpus. That corpus is a frozen snapshot assembled at a point in the past. It does not update when new implementing acts are published.

Perplexity is architecturally different. It retrieves from the live web and cites sources alongside its answers, which gives it a currency advantage over pure language models. The problem is that web retrieval is not the same as official source retrieval. The web contains discussion of EU financial regulation, commentary on it, summaries of it, and frequently outdated or inaccurate characterisations of it. Perplexity’s sources may include EUR-Lex pages, but they may equally include law firm client alerts from eighteen months ago, industry association summaries that characterise proposals as final text, or commentary that reflects a jurisdiction-specific interpretation not applicable to the user’s firm type.
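The gap between web retrieval and official source retrieval can be illustrated with a minimal Python sketch. It assumes a retriever that returns plain URLs; the allowlist, the function names, and the example URLs are illustrative assumptions, not a description of how Perplexity or any other product works.

```python
from urllib.parse import urlparse

# Assumption: an illustrative allowlist of official hosts, not an exhaustive one.
OFFICIAL_HOSTS = {"eur-lex.europa.eu"}


def is_official_source(url: str) -> bool:
    """Return True only if the URL's host exactly matches an allowlisted host."""
    host = (urlparse(url).hostname or "").lower()
    return host in OFFICIAL_HOSTS


def split_sources(urls):
    """Separate retrieved URLs into official sources and everything else."""
    official = [u for u in urls if is_official_source(u)]
    other = [u for u in urls if not is_official_source(u)]
    return official, other


# Hypothetical retrieval results for one query.
retrieved = [
    "https://eur-lex.europa.eu/eli/reg/2022/2554/oj",
    "https://lawfirm.example/client-alerts/dora-summary",
    "https://eur-lex.europa.eu.mirror.example/dora",  # look-alike host, rejected
]
official, other = split_sources(retrieved)
```

Exact host matching is a deliberate choice in this sketch: a substring check would accept the look-alike host in the third URL.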

Neither tool has a mechanism for distinguishing between a retrieved or generated claim that reflects the current official text of an EU regulation and one that reflects a prior draft, a commentary, or an inference. The output looks the same in both cases.

Compare with a source-anchored dataset: The MiCA crypto registry is built on deterministic ingestion from ESMA’s weekly updates, so every CASP entry traces back to a specific source. No hallucination, no inference, no “based on pattern”.

The three tests applied

The framework from “Why generic AI tools are unreliable for regulatory compliance research” provides three tests that any regulatory intelligence tool should pass. Here they are applied to ChatGPT and Perplexity.

Test 1: source transparency

Source transparency means the tool can tell you exactly which document a given claim came from, with a retrievable identifier. Not “based on DORA” but “based on Article 17(3) of Regulation (EU) 2022/2554 (CELEX: 32022R2554), published in the Official Journal on 27 December 2022.”
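As a rough illustration, a citation of this kind can be modelled as a small validated record. The Python sketch below is an assumption-laden example: the class name, the field names, and the restriction to three act types in CELEX sector 3 (regulations, directives, decisions) are choices made for illustration, and the rendered act number assumes the year/number style used for acts adopted since 2015.

```python
import re
from dataclasses import dataclass
from datetime import date

# CELEX pattern for a subset of sector 3 (legal acts):
# sector digit, four-digit year, type letter, four-digit number.
_CELEX_LEGAL_ACT = re.compile(r"^3(\d{4})([RLD])(\d{4})$")
_TYPE_NAMES = {"R": "Regulation", "L": "Directive", "D": "Decision"}


@dataclass(frozen=True)
class Citation:
    """Hypothetical record tying a claim to one provision of one act."""
    celex: str
    article: str
    oj_date: date

    def __post_init__(self):
        # Reject anything that is not a well-formed identifier, e.g. "DORA".
        if not _CELEX_LEGAL_ACT.match(self.celex):
            raise ValueError(f"not a recognised CELEX identifier: {self.celex!r}")

    def render(self) -> str:
        year, kind, number = _CELEX_LEGAL_ACT.match(self.celex).groups()
        return (
            f"Article {self.article} of {_TYPE_NAMES[kind]} (EU) "
            f"{year}/{int(number)} (CELEX: {self.celex}), published in the "
            f"Official Journal on {self.oj_date.isoformat()}"
        )


dora = Citation(celex="32022R2554", article="17(3)", oj_date=date(2022, 12, 27))
print(dora.render())
```

A citation that names only the framework cannot be constructed at all, which is the point of the test: the identifier is either retrievable or the claim is not cited.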

ChatGPT fails this test structurally. Its outputs are not produced by retrieval from a defined source set. They are produced by inference from training weights. The model cannot tell you which document a claim came from because the claim did not come from any single document. It came from a statistical aggregation of patterns across many documents, some of which may be accurate, some outdated, and some in direct conflict with each other.

Perplexity passes a weaker version of this test in that it cites sources. It fails the compliance-grade version because its sources are web pages rather than official EUR-Lex publications identified by CELEX number. A citation to a law firm client alert or industry association summary is not the same as a citation to the regulation itself.

Test 2: currency

Currency means the tool draws on a corpus that is continuously updated from official sources, and it tells you the publication date of each source it is citing.

ChatGPT fails this test by design. Its training data has a cutoff date. Regulatory developments after that cutoff are absent. The model does not know what it does not know: it will answer questions about post-cutoff regulatory developments using pre-cutoff information, with no signal to the user that the information may be outdated.

The knowledge cutoff is also not a clean boundary. Training data is not uniformly distributed across time. Text about regulatory developments published in the months immediately before the training cutoff is typically underrepresented relative to older text, because the internet has had less time to process and discuss those developments. A model with a late 2024 training cutoff may have thin, unreliable coverage of MiCA implementing regulations published in mid-2024 even though those regulations technically fall within its training window.

Perplexity performs better on currency because its web retrieval is live. The problem is that web content is not the same as official source content. A Perplexity search for DORA incident reporting timelines may surface a 2023 blog post summarising the consultation draft alongside a 2025 official publication of the final RTS, with no clear signal to the user which of those sources the answer is drawing on.
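One way to make currency visible is to attach a publication date and an official/unofficial flag to every source and to report how recently the corpus was synchronised. The Python sketch below assumes such metadata is available; the names, the seven-day threshold, and the dates are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatedSource:
    label: str
    published: date
    is_official: bool


def currency_report(sources, corpus_synced_on, today, max_lag_days=7):
    """Return lines stating how current an answer's sources are, newest first."""
    ordered = sorted(sources, key=lambda s: s.published, reverse=True)
    lines = [
        f"{s.label}: published {s.published.isoformat()}"
        + ("" if s.is_official else " (unofficial)")
        for s in ordered
    ]
    lag = (today - corpus_synced_on).days
    if lag > max_lag_days:
        lines.append(f"WARNING: corpus last synced {lag} days ago")
    return lines


# Placeholder dates for the two kinds of source described above.
sources = [
    DatedSource("blog post on consultation draft", date(2023, 6, 1), False),
    DatedSource("final RTS", date(2025, 2, 20), True),
]
report = currency_report(
    sources, corpus_synced_on=date(2025, 3, 1), today=date(2025, 3, 20)
)
```

With dates attached, the 2023 blog post and the 2025 official publication can no longer be presented to the user as interchangeable.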

Test 3: scope discipline

Scope discipline means the tool distinguishes between what the regulation requires and what commentators, analysts, or the model itself has inferred about implementation. It also means the tool distinguishes between the requirements of the EU regulation itself and the supervisory interpretations of individual national competent authorities, which may differ across member states.

Both ChatGPT and Perplexity fail this test. Neither has a mechanism for flagging whether a given claim reflects the regulation text, an NCA interpretation, a commentary characterisation, or a model inference. The output compresses these distinctions in ways that are invisible to the user.

For compliance professionals, the distinction matters significantly. Whether an obligation is directly imposed by the regulation or introduced by a specific NCA’s supervisory guidance determines whether it applies to firms in all EU member states or only to firms under that NCA’s supervision. A tool that blends these levels without flagging the difference is producing a systematically unreliable picture of the compliance landscape.
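Scope discipline can be sketched as a provenance label carried by every claim. In the Python example below, the four labels mirror the distinctions drawn in this section; the class names, the wording of the returned strings, and the NCA identifiers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provenance(Enum):
    REGULATION_TEXT = "regulation text"
    NCA_GUIDANCE = "NCA guidance"
    COMMENTARY = "commentary"
    MODEL_INFERENCE = "model inference"


@dataclass(frozen=True)
class Claim:
    text: str
    provenance: Provenance
    nca: Optional[str] = None  # issuing authority, set only for NCA guidance


def applies_to(claim: Claim, supervising_nca: str) -> str:
    """Say how far a claim reaches for a firm supervised by `supervising_nca`."""
    if claim.provenance is Provenance.REGULATION_TEXT:
        return "binding: all EU member states"
    if claim.provenance is Provenance.NCA_GUIDANCE:
        if claim.nca == supervising_nca:
            return f"supervisory expectation: firms under {claim.nca}"
        return f"not applicable here: guidance issued by {claim.nca}"
    # Commentary and model inference never create obligations by themselves.
    return "not a source of obligation: verify against the regulation text"


reg = Claim("placeholder obligation", Provenance.REGULATION_TEXT)
guidance = Claim("placeholder expectation", Provenance.NCA_GUIDANCE, nca="NCA-A")
inference = Claim("placeholder inference", Provenance.MODEL_INFERENCE)
```

The same guidance claim yields a different result for a firm supervised by NCA-A than for one supervised by NCA-B, which is exactly the distinction that unlabelled output hides.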

Specific failure scenarios

Beyond the three-test framework, a number of specific failure patterns recur when generic AI tools are used for EU financial regulation research.

Describing proposal text as adopted law. EU financial regulation is developed through a multi-year process involving Commission proposals, European Parliament and Council negotiations, and eventual adoption. Generic AI tools trained on a corpus that includes both proposal-stage and adopted-stage documents may blend the two. A tool asked about SFDR disclosure requirements may describe provisions from the November 2025 Commission proposal to revise the framework as if they were current obligations, when the current obligations are still those of the original regulation and its Level 2 RTS.

Citing superseded technical standards. The implementing and delegated acts under EU financial regulations are frequently updated. The DORA incident reporting RTS went through a consultation phase before finalisation. SFDR’s Level 2 RTS replaced earlier guidance. MiCA’s implementing regulations have been published progressively across 2024 and 2025. A model trained at any point during these development cycles may describe superseded versions of these standards.

Mischaracterising exemptions. EU financial regulations contain detailed scope and exemption provisions that determine which firms are subject to which obligations. These provisions often turn on details of firm size, authorisation type, or business activity. Generic AI tools regularly mischaracterise exemption thresholds, applying them too broadly or too narrowly, because the training data does not provide enough firm-type-specific context for the model to scope its answer correctly.

Blending requirements across related regulations. DORA and NIS2 share subject matter. SFDR and the Taxonomy Regulation interact. MiFID II and MiCA overlap for firms offering both traditional and crypto-asset services. Generic AI tools regularly blend the requirements of related regulations in ways that are not accurate for any of them individually.
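The first of these patterns, describing proposal text as adopted law, is one that document identifiers can help catch. The first character of a CELEX number encodes the sector of the document: 3 for legal acts, 5 for preparatory documents such as Commission proposals. The Python sketch below covers only a subset of sectors, and its function names are illustrative assumptions.

```python
# Subset of CELEX sectors, keyed by the first character of the identifier.
SECTOR_STAGE = {
    "0": "consolidated text",
    "3": "adopted legal act",
    "5": "preparatory document (e.g. a Commission proposal)",
    "6": "case law",
}


def stage_of(celex: str) -> str:
    """Classify a document by the sector character of its CELEX identifier."""
    if not celex:
        raise ValueError("empty CELEX identifier")
    return SECTOR_STAGE.get(celex[0], "other (not classified in this sketch)")


def only_adopted(celex_ids):
    """Keep only identifiers in sector 3, i.e. adopted legal acts."""
    return [c for c in celex_ids if c.startswith("3")]


# 32022R2554 is DORA as adopted; 52020PC0595 is the Commission proposal for it.
candidates = ["32022R2554", "52020PC0595"]
adopted = only_adopted(candidates)
```

A sector check of this kind does not show whether an adopted act is still in force or has been amended; it only separates adopted instruments from preparatory ones.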

What source-anchored architecture prevents

A source-anchored regulatory intelligence system is built differently at the foundation level. The retrieval layer fetches specific documents from a defined official source corpus, identified by CELEX number. The inference layer is constrained to draw only on those documents. Every claim in the output corresponds to a retrievable chunk of source text with a specific CELEX identifier and article number attached.
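The shape of such a system can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of any production implementation: retrieval is naive keyword matching, the chunk texts are placeholders rather than quotations from the regulations, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    celex: str    # identifier of the source document
    article: str  # provision the text was taken from
    text: str     # source text (placeholder strings in this sketch)


def retrieve(corpus, query, allowed_celex):
    """Return chunks from allowed documents that contain every query term."""
    terms = query.lower().split()
    return [
        c for c in corpus
        if c.celex in allowed_celex and all(t in c.text.lower() for t in terms)
    ]


def answer(corpus, query, allowed_celex):
    """Emit only retrieved text, each line tagged with CELEX and article."""
    return [
        f"[{c.celex} Art. {c.article}] {c.text}"
        for c in retrieve(corpus, query, allowed_celex)
    ]


corpus = [
    Chunk("32022R2554", "19", "PLACEHOLDER text on incident reporting"),  # DORA
    Chunk("32022L2555", "23", "PLACEHOLDER text on incident reporting"),  # NIS2
]
dora_only = answer(corpus, "incident reporting", allowed_celex={"32022R2554"})
```

Because the output is assembled only from retrieved chunks, every line carries its identifier, and text from a related instrument appears only if that instrument was explicitly allowed.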

This architecture prevents the specific failure modes described above. It cannot describe proposal text as adopted law if the retrieval corpus isolates proposals from adopted instruments. It cannot cite superseded technical standards if the corpus is updated continuously from EUR-Lex as new instruments are published. It cannot mischaracterise exemptions beyond what the source text says because the output is constrained to what was retrieved. It cannot blend requirements across regulations in ways the source text does not support because the retrieval step makes the source boundaries explicit.

Source-anchored architecture can still fail. The retrieval layer can fetch the wrong document, or the source corpus can have gaps that have not yet been filled. But these failures are categorically different from the failures of generic AI tools. A source-anchored system that fails does so visibly: it either retrieves nothing and surfaces a gap notice, or it retrieves and cites a specific source that the user can verify. A generic AI tool that fails does so invisibly: it produces a confident, well-formatted answer that gives the user no signal that verification is required.
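The difference between visible and invisible failure can be expressed as a return type. In the Python sketch below, a lookup returns either a cited answer or an explicit gap notice and never a bare string; the exact-match index stands in for a real retrieval layer, and every name and value is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CitedAnswer:
    text: str
    celex: str
    article: str


@dataclass(frozen=True)
class GapNotice:
    query: str
    reason: str


def respond(index, query) -> Union[CitedAnswer, GapNotice]:
    """Look up a query; `index` maps a normalised query to (text, celex, article)."""
    hit = index.get(query.strip().lower())
    if hit is None:
        # Fail visibly: state the gap instead of generating an answer.
        return GapNotice(query, "no matching source in corpus; no answer generated")
    text, celex, article = hit
    return CitedAnswer(text, celex, article)


# Toy index with placeholder text, not a quotation from the regulation.
index = {"incident reporting": ("PLACEHOLDER source text", "32022R2554", "19")}

found = respond(index, "Incident reporting")
missing = respond(index, "exemption thresholds")
```

Either outcome tells the user what to do next: verify the cited provision, or treat the question as unanswered.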

For a detailed treatment of the engineering principles behind source-anchored retrieval, see “Why deterministic RAG beats generative AI for research”.

The practical distinction for compliance teams

The practical question for a compliance professional is not whether generic AI tools can be useful. They can, for tasks where the consequences of occasional errors are low and verification is easy. Drafting internal communications, summarising meeting notes, structuring a policy document framework: these are tasks where fluency matters more than precision, and where an error is easily caught.

Regulatory research is not that kind of task. A confident wrong answer about an exemption threshold, a reporting timeline, or an authorisation requirement leads directly to non-compliance with an enforceable obligation. The consequence of acting on incorrect regulatory information is not a minor correction. It can be a supervisory finding, a fine, or a remediation programme.

The standard that applies to the compliance professional who acts on the research applies equally to the tool they used to produce it. Can you point to the source? Is the source authoritative and current? Can you demonstrate that the answer reflects the current text of the specific provision, not a prior draft, a commentary, or a model inference?

Generic AI tools cannot meet that standard. Source-anchored systems built on verified official documents and continuous EUR-Lex ingestion can.

For background on how a source-anchored approach to regulatory monitoring works in practice, see “What is regulatory horizon scanning and why compliance teams need it”.

Forseti monitors EU financial regulation continuously across five streams: adopted legislation, proposals, supervisory guidance, consultations and draft standards, and case law. Every answer cites the specific CELEX identifier and article number it is drawn from.

This article is part of a series on the EU regulatory intelligence platform landscape. See also: “Forseti vs Wolters Kluwer: EU regulatory intelligence for firms that are not a major bank” and “Why generic AI tools are unreliable for regulatory compliance research”.