
How we decided what not to build
Every technique for reducing LLM hallucination sounds compelling in a blog post. The harder question is which ones are actually worth building for your specific system, your specific failure modes, and your specific cost constraints. Here is how we worked through that for Forseti.
The list that keeps growing
If you spend time reading about LLM quality, you accumulate a long list of techniques that promise to reduce hallucination, improve retrieval, or make your system more reliable. HyDE. Re-ranking. Model chaining. Fine-tuning. Mixture of experts. Confidence scoring. Constitutional AI. Knowledge graphs. Cross-encoders. Ensemble voting.
Each one has a paper behind it, a benchmark result, and usually a thoughtful blog post explaining why it matters. Each one sounds like something you should probably add to your system.
The problem is that adding things to a system is not the same as improving it. Every new component introduces complexity, maintenance overhead, and new failure modes. Some techniques improve one metric while quietly degrading another. Some are genuinely useful at scale but add no value at the corpus size you are currently at. Some solve a problem you do not actually have.
When we built Forseti, our EU regulatory intelligence product, we ended up with a fairly short list of techniques we actually implemented and a longer list we deliberately chose not to build. The discipline of the second list turned out to matter as much as the first.
The baseline test as decision infrastructure
The most important thing we did before evaluating any technique was establish a fixed baseline test. Five queries, fixed user profile, fixed corpus, run after every change. Each query scored as good, partial, or poor. The total score was the only number we tracked.
This sounds obvious. In practice, most teams skip it, or they run ad hoc tests after changes and convince themselves things have improved because the queries they tested improved. Without a fixed set that includes queries you are not actively trying to fix, you cannot see regressions.
The baseline made decisions mechanical in a useful way. A technique that moved the score from 4/5 to 5/5 was worth keeping. A technique that left the score at 4/5 but removed a structural fragility from the system was worth keeping. A technique that left the score at 4/5 and added a new LLM call to every query was not.
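The decision rule above can be sketched in a few lines. This is an illustration, not Forseti's actual harness: the names `Grade`, `BaselineRun`, `regressed_queries` and `decide` are hypothetical, and counting only "good" answers towards the n/5 total is an assumption about how the score is tallied.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    POOR = 0
    PARTIAL = 1
    GOOD = 2


@dataclass(frozen=True)
class BaselineRun:
    name: str
    grades: tuple                    # one Grade per fixed query, always in the same order
    llm_calls_per_query: int         # hypothetical cost proxy
    removes_fragility: bool = False  # a human judgement, recorded alongside the run

    @property
    def score(self) -> int:
        # Assumption: only "good" answers count towards the n/5 total.
        return sum(g is Grade.GOOD for g in self.grades)


def regressed_queries(before: BaselineRun, after: BaselineRun) -> list:
    """Indices of queries whose grade got worse, including ones nobody was trying to fix."""
    return [i for i, (b, a) in enumerate(zip(before.grades, after.grades)) if a < b]


def decide(before: BaselineRun, after: BaselineRun) -> str:
    """Return 'keep', 'reject' or 'review' for a change, given runs before and after it."""
    if after.score > before.score:
        return "keep"
    if after.score < before.score:
        # A drop may be a regression or the system becoming more honest,
        # so someone has to read the actual output.
        return "review"
    if after.removes_fragility:
        return "keep"
    if after.llm_calls_per_query > before.llm_calls_per_query:
        return "reject"
    return "review"
```

The "review" outcome is deliberate: the rule is mechanical only where the score and the cost agree, and hands everything else back to a person who understands the domain.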
We ran 13 baselines over the course of building Forseti. The score progression looked like this:
| Baseline | Key change | Score |
|---|---|---|
| B1 | Fixed-word chunking | 4/5 |
| B2 | Article-level chunking and diversity retrieval | 4/5 |
| B3 | HyDE replacing hardcoded query expansion | 4/5 |
| B4 | Relevance cutoff and profile mismatch detection | 3/5 |
| B5 | Post-retrieval quality check | 1/5 |
| B6 | CRR3 corpus addition | 2/5 |
| B7 | Quality check decoupled from answer path | 4/5 |
| B8 | Chunk metadata enrichment | 4/5 |
| B9 | Relevance cutoff raised | 4/5 |
| B10 | Model upgrade | 4/5 |
| B11 | Gap extraction fix | 4/5 |
| B12 | Retrieval pool widened, quality check pre-cutoff | 5/5 |
| B13 | Query classification and routing | 5/5 |
B4 and B5 are the most instructive rows. Both show score drops. Neither represents a mistake in the technique itself. B4 surfaced a genuine corpus gap that had previously been silent because the system was not detecting gaps correctly. The score dropped because the system became more honest about what it did not know. B5 introduced a quality check that correctly identified retrieval failures, but we had wired it into the answer path rather than into gap detection only, so it started suppressing answers unnecessarily. B7 fixed the wiring. The score recovered immediately.
The lesson from those two baselines was that a score drop is not always a regression. Sometimes it is the system learning to surface a real problem. You cannot tell which it is without understanding the domain well enough to read the actual output.
What we built and why
Article-level chunking was the first meaningful change from the starting point. Fixed-word chunking splits documents after a fixed number of words, so the boundaries fall arbitrarily, often mid-sentence and almost always mid-article. For EU regulatory text this is particularly damaging because the legally material content, the obligation articles stating what firms must actually do, is dense and tightly structured. Splitting an article across two chunks means neither chunk contains a coherent obligation. We switched to splitting on article boundaries, with a word-based fallback for documents that do not have article structure. For DORA, the chunk count went from roughly 30 to 137. More chunks, each more coherent.
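A minimal sketch of article-boundary splitting with a word-based fallback follows. It assumes article headings sit on their own line as "Article N", which is how EUR-Lex text is commonly laid out; the regex, the function names and the 300-word default are illustrative choices, not Forseti's actual values.

```python
import re

# Assumption: a heading is a line containing only "Article <number>" (e.g. "Article 28").
ARTICLE_HEADING = re.compile(r"^Article[ \t]+\d+[a-z]?[ \t]*$", re.MULTILINE)


def chunk_by_words(text: str, max_words: int = 300) -> list:
    """Fallback: fixed-size word windows for text without article structure."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def chunk_document(text: str, max_words: int = 300) -> list:
    """One chunk per article; anything before the first article is chunked by words."""
    starts = [m.start() for m in ARTICLE_HEADING.finditer(text)]
    if not starts:
        return chunk_by_words(text, max_words)

    chunks = []
    preamble = text[:starts[0]].strip()
    if preamble:
        chunks.extend(chunk_by_words(preamble, max_words))

    # Each article runs from its heading to the next heading (or the end of the text).
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks
```

Very long articles would still need a secondary split in practice; the sketch leaves them whole to keep the boundary logic visible.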
HyDE solved a structural mismatch problem we have written about in detail elsewhere. The short version: users ask questions in plain language, and EU regulatory text is written in compressed legal syntax. The two embed differently, so a vector search for a user’s question consistently surfaces explanatory preamble text rather than the obligation articles that actually answer the question. HyDE generates a short hypothetical regulatory answer first, embeds that instead of the raw question, and the mismatch largely disappears. No regulation-specific code, no hardcoded keyword lists. The same approach works automatically for every document in the corpus.
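The HyDE step described above reduces to "generate, then embed the generation". The sketch below takes the LLM call and the embedding call as injected functions, because the article does not say which providers are used; `HYDE_PROMPT`, `generate` and `embed` are hypothetical names, and the prompt wording is an assumption.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

# Assumption: illustrative prompt wording, not the production prompt.
HYDE_PROMPT = (
    "Write a short passage, in the style of an EU regulation article, "
    "that would answer the following question.\n\nQuestion: {question}"
)


def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> text
    embed: Callable[[str], Vector],   # hypothetical embedding call: text -> vector
) -> Vector:
    """Embed a hypothetical regulatory answer instead of the raw question."""
    hypothetical = generate(HYDE_PROMPT.format(question=question)).strip()
    # If generation returns nothing, fall back to embedding the question itself.
    return embed(hypothetical or question)
```

The hypothetical passage is never shown to the user and does not need to be correct. Its only job is to sit closer in embedding space to obligation articles than a plain-language question does.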
Gap detection with profile mismatch distinction turned out to be one of the more subtle pieces of work. A retrieval failure can happen for two structurally different reasons: the document the user needs is not in the corpus at all, or the document is in the corpus but is tagged to different sectors than the user’s profile. These look identical from the outside. Both produce low retrieval scores and thin results. But they require completely different responses. A genuine gap should trigger an admin notification, a corpus gap log entry, and an attempt to serve a provisional answer from the live EUR-Lex source. A profile mismatch should surface an amber nudge to the user suggesting they update their profile, with no noise to admins. We built separate detection logic for each condition, using an unfiltered secondary search to distinguish between them. Getting this wrong in either direction is expensive: false gap alerts create admin overhead, and missed profile mismatches send users a generic error when the answer exists and is retrievable.
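The distinction between the two failure types can be sketched as a two-pass search. The `search` callable, the 0.5 cutoff and the action names in `ACTIONS` are hypothetical placeholders; only the shape of the logic, a profile-filtered search followed by an unfiltered one, comes from the description above.

```python
from enum import Enum
from typing import Callable, FrozenSet, Optional


class Outcome(Enum):
    ANSWER = "answer"
    PROFILE_MISMATCH = "profile_mismatch"
    CORPUS_GAP = "corpus_gap"


# Hypothetical search signature: (query, sector filter or None) -> best relevance score.
SearchFn = Callable[[str, Optional[FrozenSet[str]]], float]

# Follow-up actions per outcome, named after the responses described in the text.
ACTIONS = {
    Outcome.ANSWER: (),
    Outcome.PROFILE_MISMATCH: ("show_amber_nudge",),
    Outcome.CORPUS_GAP: ("notify_admin", "log_corpus_gap", "try_live_eurlex_fallback"),
}


def diagnose(
    query: str,
    profile_sectors: FrozenSet[str],
    search: SearchFn,
    cutoff: float = 0.5,  # assumption: illustrative relevance cutoff
) -> Outcome:
    """Separate 'not in the corpus' from 'in the corpus but outside the user's sectors'."""
    if search(query, profile_sectors) >= cutoff:
        return Outcome.ANSWER
    # Secondary search without the sector filter: if this succeeds, the document
    # exists and the user's profile is what is hiding it.
    if search(query, None) >= cutoff:
        return Outcome.PROFILE_MISMATCH
    return Outcome.CORPUS_GAP
```

The unfiltered search only runs when the filtered one fails, so the common path pays for a single retrieval.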
The long context fallback is the one technique we use that sits outside the core RAG pipeline. When gap detection confirms a document exists in EUR-Lex but is not yet in the corpus, we fetch it live and pass it directly to the model as a long context prompt rather than making the user wait for formal ingestion. The answer is marked provisional in the UI. Formal ingestion still runs regardless. This is not a replacement for the retrieval pipeline and we do not treat it as one. It is a specifically scoped response to a specific failure mode: the timing gap between when a regulation is published and when it is ingested. We have written about this in more detail in our piece on RAG and long context.
Query classification came late, at B13, and produced the only baseline improvement from 4/5 to 5/5 that a single change delivered. The insight was simple: broad landscape queries benefit from a larger retrieval pool because they need coverage across multiple documents, while specific queries about a particular obligation benefit from a smaller, more precise pool. Routing the two types differently improved the fifth test query, which had been consistently returning partial results. The classification itself runs in parallel with HyDE, so it adds no latency.
What we chose not to build
This is the more useful half of the story.
Model chaining would add a second LLM call after generation to verify every claim in the answer against the retrieved source set. The appeal is obvious: an independent check on the output. The cost is approximately double the LLM spend per query plus a meaningful latency increase. We did not build it because the guardrail we already have, which checks that every CELEX ID cited in the answer actually appears in the retrieved source set, already catches the most consequential failure mode: citing a document that was never retrieved. A model could cite DORA accurately while mischaracterising what Article 28 requires, and the guardrail would not catch that. But that failure requires the model to have retrieved the correct document and then specifically misread it. In practice, when retrieval is grounded and the model is constrained to the context window, this is rare enough that doubling the cost to catch it is not justified. We would revisit this if user feedback surfaced a consistent pattern of claim-level errors on correctly retrieved documents.
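The citation guardrail mentioned above is cheap because it is a set comparison, not a model call. A minimal sketch, assuming citations appear in the answer as plain CELEX numbers and using a simplified pattern that only covers sector 3 (legislation) identifiers such as 32022R2554 for DORA; the function names are hypothetical.

```python
import re

# Assumption: simplified CELEX pattern for sector 3 documents only:
# "3" + four-digit year + one or two type letters + four-digit number.
CELEX_ID = re.compile(r"\b3\d{4}[A-Z]{1,2}\d{4}\b")


def unsupported_citations(answer: str, retrieved_ids: set) -> set:
    """CELEX IDs cited in the answer that are absent from the retrieved source set."""
    return set(CELEX_ID.findall(answer)) - set(retrieved_ids)


def passes_guardrail(answer: str, retrieved_ids: set) -> bool:
    """True when every cited CELEX ID was actually retrieved for this query."""
    return not unsupported_citations(answer, retrieved_ids)
```

As the text notes, this catches citations of documents that were never retrieved. It says nothing about whether the answer characterises a correctly cited article accurately.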
Fine-tuning comes up constantly in discussions about reducing hallucination. The argument is that a model trained on EU legislative text will be less likely to confabulate regulatory facts. This is probably true. It is also beside the point for Forseti. Fine-tuning improves a model’s statistical tendencies. Retrieval grounding constrains the model to verified text, which is a much stronger guarantee. For a compliance product where an incorrect answer has real consequences, enforcement beats tendencies. Fine-tuning also means giving up the ability to use hosted providers directly, requires GPU infrastructure, and needs ongoing retraining as the regulatory landscape changes. The infrastructure and maintenance cost is not justified when the retrieval architecture is already doing the heavy lifting.
Re-ranking with a cross-encoder would add a second scoring pass over the initial retrieval results, using a model that scores query-chunk pairs jointly rather than independently. Cross-encoders are more accurate than bi-encoder retrieval. They are also slower and more expensive. At our current corpus size and with our baseline at 5/5, retrieval quality is not the bottleneck. The trigger for building a re-ranking layer would be a baseline regression caused by retrieval failures rather than corpus gaps, on a corpus large enough that the diversity controls and TOP_K tuning we already have cannot compensate. That trigger has not fired.
Knowledge graph traversal would let the system reason across legal relationships explicitly: DORA references EBA RTS, CRR3 amends CRD IV, SFDR delegates to RTS 2022/1288. The current RAG pipeline has no awareness that two documents are legally linked. For multi-document reasoning queries, this matters. We are not building it now because B13 is 5/5 and the corpus is still small enough that the existing diversity retrieval handles multi-document queries adequately. What we have done is take the preparation steps that cost nothing while still in development: harvesting EUR-Lex relationship triples during ingestion, adding stable document identifiers, enforcing a controlled vocabulary for sector tags. When the corpus grows and multi-document queries start failing baseline tests, the graph traversal layer can be bolted on without rearchitecting anything. Critically, it stays entirely in Postgres via recursive CTEs, so we are not adding a graph database to the infrastructure.
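To show what traversal via a recursive CTE involves, here is a sketch that runs against SQLite from Python's standard library so it is self-contained. The `WITH RECURSIVE` form is also what Postgres uses, though placeholder style and typing details may need adapting there. The table name `doc_relation`, its columns and the helper names are hypothetical, and the three triples are the ones named in the text.

```python
import sqlite3

SCHEMA = """
CREATE TABLE doc_relation (
    source_id TEXT NOT NULL,
    relation  TEXT NOT NULL,
    target_id TEXT NOT NULL
);
"""

# Walk outgoing relations from a starting document, up to max_depth hops.
# UNION (not UNION ALL) plus the depth limit keeps cyclic references from looping forever.
LINKED_DOCS = """
WITH RECURSIVE linked(doc_id, depth) AS (
    SELECT ?, 0
    UNION
    SELECT r.target_id, l.depth + 1
    FROM doc_relation AS r
    JOIN linked AS l ON r.source_id = l.doc_id
    WHERE l.depth < ?
)
SELECT doc_id, MIN(depth) AS depth
FROM linked
GROUP BY doc_id
ORDER BY depth, doc_id;
"""


def make_db(triples) -> sqlite3.Connection:
    """In-memory database holding (source, relation, target) triples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO doc_relation VALUES (?, ?, ?)", triples)
    return conn


def linked_documents(conn: sqlite3.Connection, start_id: str, max_depth: int = 3) -> list:
    """Documents reachable from start_id, each with its shortest hop count."""
    return conn.execute(LINKED_DOCS, (start_id, max_depth)).fetchall()


if __name__ == "__main__":
    conn = make_db([
        ("DORA", "references", "EBA RTS"),
        ("CRR3", "amends", "CRD IV"),
        ("SFDR", "delegates_to", "RTS 2022/1288"),
    ])
    print(linked_documents(conn, "DORA"))
```

Harvesting the triples at ingestion time is what makes this cheap to defer: the data is already in a relational table, and the traversal is a query rather than a new piece of infrastructure.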
Ensemble voting across multiple models would run each query through several LLMs and combine their outputs. This is sometimes used in high-stakes domains where the cost of a wrong answer is extreme. It is also roughly three to five times the LLM cost per query. The structural constraint that RAG already provides, limiting the model to verified retrieved sources, means that the variance between models on the same grounded context is small. The additional cost is hard to justify when the variance it would catch is already low.
The decision framework that emerged
After 13 baselines, the implicit framework we had been applying became explicit.
The first question is whether the baseline is actually failing. If it is not, adding complexity to a working system is a liability, not an improvement. The baseline score exists precisely to give this question a concrete answer.
The second question is whether the failure is a retrieval problem or a generation problem. These require different interventions. Retrieval failures, where the right document is in the corpus but the wrong chunks are surfaced, respond to chunking strategy, HyDE, retrieval diversity, and relevance cutoff tuning. Generation failures, where the right chunks are retrieved but the model misinterprets or misrepresents them, respond to prompt engineering, guardrails, and in serious cases model chaining. Confusing the two leads to applying the wrong fix.
The third question is what an additional LLM call per query actually costs at the query volumes you are projecting, and whether the failure it targets is consequential enough to justify that cost. In a high-stakes domain like compliance, even a small reduction in a serious failure rate can justify significant additional spend. But cost is not the only consideration: sequential LLM calls also add latency, and in a streaming product where users are watching a response build in real time, the time to first token matters. The question is whether the specific failure mode being addressed actually occurs at meaningful frequency. An expensive, slow fix for a failure that rarely happens is still a poor trade.
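The arithmetic behind the third question is simple enough to write down. The sketch below is a back-of-envelope helper with hypothetical names; every input is something you would have to estimate for your own system, and no figures here come from Forseti.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtraCallEstimate:
    extra_spend_per_month: float
    failures_caught_per_month: float
    cost_per_failure_caught: float


def estimate_extra_call(
    queries_per_month: int,
    extra_cost_per_query: float,  # cost of the additional LLM call, in your currency
    failure_rate: float,          # fraction of queries exhibiting the targeted failure
    catch_rate: float,            # fraction of those failures the extra call would catch
) -> ExtraCallEstimate:
    """Rough cost of catching one failure by adding an LLM call to every query."""
    extra_spend = queries_per_month * extra_cost_per_query
    caught = queries_per_month * failure_rate * catch_rate
    per_failure = extra_spend / caught if caught > 0 else float("inf")
    return ExtraCallEstimate(extra_spend, caught, per_failure)
```

Note that the cost per failure caught works out to `extra_cost_per_query / (failure_rate * catch_rate)`, which does not depend on volume. A rare failure makes the denominator small, so the per-failure cost climbs quickly, and whether that is acceptable depends on how consequential the failure is. Latency is not captured here and has to be weighed separately.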
The fourth question is whether the problem exists at your current scale. Re-ranking is valuable at large corpus sizes. Knowledge graph traversal is valuable when documents are densely cross-referenced and queries routinely span several of them. Fine-tuning is valuable when the model’s domain knowledge is the bottleneck rather than the retrieval architecture. None of those conditions applied to us at B13. Building for a scale you do not yet have is the most reliable way to add complexity without improving quality.
What this means for the 85/15 principle
The core principle behind Forseti’s architecture is that the system finds the law deterministically and the LLM explains it. Eighty-five percent of the work is retrieval, filtering, validation, and gap detection. Fifteen percent is the LLM translating verified retrieved text into a personalised plain-English explanation.
Every technique we evaluated was implicitly a question about whether it belonged in the 85% or the 15%, and whether it was actually necessary in either place.
The techniques we built mostly belong in the 85%: chunking strategy, retrieval diversity, relevance cutoffs, gap detection, profile mismatch logic, query classification. These are deterministic or near-deterministic. They change what the LLM sees rather than how the LLM behaves.
The techniques we skipped mostly tried to improve the 15% by adding more LLM calls. Model chaining, ensemble voting, and full self-critique loops all work by asking more of the LLM rather than constraining what the LLM is asked to do. For a product where the value proposition is grounded accuracy, improving the deterministic 85% almost always delivers more per unit of effort than adding to the generative 15%.
The exception is the long context fallback, which is a second LLM call but one that is strictly scoped to a specific failure mode and fires infrequently. That is the pattern worth following: when an additional LLM call is genuinely warranted, it should be scoped to a precise condition, not applied to every query.
The list we are watching
There are techniques on the deferred list that will become relevant at a different stage.
Re-ranking becomes relevant when the corpus grows large enough that retrieval diversity controls cannot prevent good documents from being crowded out by a high volume of adjacent but less relevant material.
Knowledge graph traversal becomes relevant when the corpus contains enough cross-referenced documents that multi-document reasoning queries start failing. The preparation work is done. The build is deferred until the trigger fires.
Forseti monitors EU financial regulation continuously, delivering personalised impact analysis anchored to verified EUR-Lex sources. If you want to be kept informed ahead of launch, get in touch.