RAG, long context, and the case for using both

The debate that misses the point

The RAG versus long context debate has a familiar structure. RAG is powerful but complex. Long context is simple but expensive. Pick the right tool for the job. The future is probably hybrid.

That framing is not wrong. It is just not very useful if you are actually building something.

When we built Forseti, our EU regulatory intelligence product, we did not sit down and choose between RAG and long context as a strategic architectural decision. We built RAG because the use case demanded it, ran into a specific class of failure that RAG cannot solve on its own, and added long context to handle exactly that failure. The result is a system that uses both, not because hybrid is fashionable, but because they solve genuinely different problems and the boundary between those problems is precise enough to define in code.

This is what we learned.

The case for RAG

RAG is the right starting point for a corpus that is large, frequently queried, and where accuracy is non-negotiable.

The core advantage is that retrieval separates what the model is allowed to say from what it might want to say. When you build a RAG pipeline properly, the model can only explain text that has already been retrieved from a verified source. It cannot reach outside the context window. It cannot fill gaps with training data. Every claim traces back to a specific chunk with a specific origin. For a regulatory intelligence product where a wrong answer has real compliance consequences, that structural constraint is the entire value proposition. We cover this principle in more depth in "Why deterministic RAG beats generative AI for research".

RAG also scales economically in a way that long context does not. A corpus of EUR-Lex regulations, chunked at article level and indexed once, can be queried many times at very low retrieval cost. The indexing work is paid once. Long context pays the full document processing cost on every single query. For documents that are queried repeatedly, that difference compounds quickly.
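The amortisation argument can be made concrete with a small cost model. This is a minimal sketch, not a description of Forseti's actual costs: the function names are hypothetical, the costs are in arbitrary units, and the per-query RAG cost is assumed to cover both retrieval and processing of the retrieved chunks.

```python
import math


def rag_total_cost(index_cost, per_query_cost, n_queries):
    # One-off indexing cost plus a small cost on every query.
    return index_cost + per_query_cost * n_queries


def long_context_total_cost(per_query_document_cost, n_queries):
    # No indexing, but the full document is processed on every query.
    return per_query_document_cost * n_queries


def break_even_queries(index_cost, per_query_cost, per_query_document_cost):
    """Smallest query count at which RAG is no more expensive than long context.

    Returns None if long context is never more expensive per query.
    """
    saving_per_query = per_query_document_cost - per_query_cost
    if saving_per_query <= 0:
        return None
    return math.ceil(index_cost / saving_per_query)


# Illustrative numbers only (assumptions, not measurements).
n = break_even_queries(index_cost=500, per_query_cost=1, per_query_document_cost=26)
```

With these illustrative inputs the two approaches cost the same at 20 queries, and every query after that widens the gap in RAG's favour.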

The third advantage is precision. A well-tuned retrieval layer returns the specific chunks that answer the question rather than the entire document. For regulatory text specifically, this matters more than it might seem. EU regulations contain two kinds of content: recitals that explain legislative intent in flowing prose, and obligation articles that state what firms must actually do. Naive retrieval almost always surfaces the recitals. The obligation articles, written in compressed legal syntax, embed differently and rank lower. Getting retrieval right on formal documents is a non-trivial engineering problem, and RAG gives you the tools to solve it systematically. For a detailed account of why this happens and how to address it, see "Why your RAG retrieves the wrong chunks".

The costs of RAG are real, though. A production pipeline requires chunking strategy decisions, embedding model selection, index maintenance, relevance tuning, and ongoing evaluation. Changing one parameter often improves some queries and regresses others. The system requires someone who understands both the retrieval engineering and the domain well enough to know when an answer is wrong. It is not a set-it-and-forget-it system. We wrote about this in "Why generative RAG feels more like art than science".

The case for long context

Long context removes the retrieval layer entirely. The document goes in. The model reads all of it. There is no search step that can fail.

This has genuine advantages worth taking seriously. The most significant is that long context eliminates what some call the retrieval lottery. RAG can only return what its search logic finds. If a relevant passage scores below the relevance cutoff, or if the chunking strategy has fragmented it in an unhelpful way, the model never sees it and cannot answer the question, even though the information exists. Long context does not have this failure mode. The model sees the full document and can reason across it globally.
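The cutoff failure is easy to see in a few lines. The sketch below assumes a hypothetical `ScoredChunk` type, made-up chunk identifiers and scores, and an arbitrary cutoff of 0.60. It illustrates the failure mode and is not the retrieval logic of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredChunk:
    chunk_id: str
    score: float  # similarity score from the retriever, higher is better


def apply_cutoff(chunks, cutoff):
    # Keep only chunks at or above the cutoff, best first.
    kept = [c for c in chunks if c.score >= cutoff]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Hypothetical scores: the passage that actually answers the question
# ranks just below the cutoff, so the model never sees it.
candidates = [
    ScoredChunk("recital-12", 0.82),
    ScoredChunk("article-5-obligations", 0.58),
    ScoredChunk("recital-3", 0.64),
]
kept = apply_cutoff(candidates, cutoff=0.60)
kept_ids = [c.chunk_id for c in kept]
```

Here `article-5-obligations` exists in the index but is filtered out before generation, which is exactly the situation a full-document prompt avoids.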

Long context also handles cross-document reasoning more naturally. If you need to compare two regulatory texts, find what one requires that the other exempts, or trace how an implementing act relates to its parent regulation, RAG retrieves fragments. Long context can hold both documents simultaneously and reason across them directly.

For bounded, one-off analytical tasks, long context is often the simpler and more reliable choice. A legal team reviewing a specific contract does not need a vector index. A researcher doing a deep analysis of a single regulation does not need chunking strategy decisions. The complexity of RAG is only justified when the corpus is large enough that you cannot fit everything in a prompt, or when you need the audit trail that retrieval provides.

The costs are also real. Processing a large document on every query is expensive relative to retrieval. More importantly, long context does not give you the structural guarantees that RAG does. You cannot apply a relevance cutoff. You cannot easily detect when the document you need is absent, because there is no retrieval step to return zero results. The model will produce an answer regardless, drawing on whatever is in the prompt and, when that is insufficient, on its training data. For a product where accuracy must be auditable and traceable, that is a meaningful problem.

Why we ended up with both

The failure mode that pushed us toward a hybrid approach was specific and detectable: the timing gap.

A timing gap occurs when a regulation exists on EUR-Lex but has not yet been ingested into the corpus because the daily ingestion process has not yet run. A user asks about it. The RAG pipeline correctly identifies that the document is absent, but the user gets nothing useful in the meantime.

This is a problem that retrieval tuning cannot solve. The document is not in the index. What the system can do instead is fetch the document live from the official source and pass it directly to the model as a long context prompt, serving a provisional answer while formal ingestion proceeds in the background.

The system distinguishes between two conditions that both look like failures from the outside. The first is a profile mismatch, where the document exists in the corpus but is tagged to different sectors than the user’s profile. The second is a genuine gap, where the document is absent entirely. Only the second condition triggers the long context path, and only after a live fetch from EUR-Lex confirms the document actually exists there. If the fetch fails or returns nothing, no provisional answer is served. The boundary is precise, not probabilistic.
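The routing rule described above can be sketched as a small function. This is an illustration under stated assumptions rather than Forseti's implementation: `Route`, `Decision`, `route_query`, the shape of `corpus_sectors`, and the injected `fetch_live` callable are all hypothetical names, and the live fetch is passed in as a parameter so the example runs without network access.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    RAG = "rag"
    PROFILE_MISMATCH = "profile_mismatch"
    PROVISIONAL_LONG_CONTEXT = "provisional_long_context"
    NO_ANSWER = "no_answer"


@dataclass(frozen=True)
class Decision:
    route: Route
    document_text: Optional[str] = None


def route_query(doc_id, user_sectors, corpus_sectors, fetch_live):
    """Decide how to serve a query about one document.

    corpus_sectors maps each ingested document id to its set of sector tags.
    fetch_live(doc_id) is assumed to return the document text from the
    official source, or None/empty if it does not exist there.
    """
    if doc_id in corpus_sectors:
        # The document is indexed. Either it matches the user's profile or
        # it is a profile mismatch; neither case triggers long context.
        if corpus_sectors[doc_id] & user_sectors:
            return Decision(Route.RAG)
        return Decision(Route.PROFILE_MISMATCH)

    # Genuine gap: confirm the document exists at the source before
    # serving anything provisional.
    try:
        text = fetch_live(doc_id)
    except Exception:
        return Decision(Route.NO_ANSWER)
    if not text:
        return Decision(Route.NO_ANSWER)
    return Decision(Route.PROVISIONAL_LONG_CONTEXT, document_text=text)
```

The long context branch is reachable only when the document is absent from the index and the fetch returns text, so every fallback can be traced to one explicit condition.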

One further advantage of this design: because multiple users may ask about the same unindexed regulation before formal ingestion completes, the long context path benefits from response caching at the model layer. The first request pays the full cost of processing the document. Subsequent requests for the same document are significantly cheaper. The provisional path gets more economical with usage, which partially offsets the cost disadvantage of long context relative to retrieval.
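A rough way to see the effect is to average the cost over all requests for the same unindexed document. The function name and the cost figures below are assumptions for illustration. The real discount depends on the model provider's caching behaviour, and cache expiry is ignored here.

```python
def average_provisional_cost(full_cost, cached_cost, n_requests):
    """Average cost per request when only the first request pays full price."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    total = full_cost + cached_cost * (n_requests - 1)
    return total / n_requests


# Illustrative units only: full processing costs 100, a cached request 10.
single = average_provisional_cost(100, 10, 1)
five = average_provisional_cost(100, 10, 5)
ten = average_provisional_cost(100, 10, 10)
```

Under these assumed figures the average falls from 100 for a single request to 28 across five and 19 across ten.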

The broader principle here, that sequencing and pipeline design matter as much as model capability, is something we have written about separately in "AI belongs after the data is clean, not before".

What the hybrid path gives up

It is worth being honest about what the provisional long context path trades away compared to the normal RAG path.

The RAG pipeline has been tuned specifically for the structure of EU legislative text. The chunking strategy, retrieval diversity controls, query expansion approach, relevance scoring, and post-retrieval quality checks all exist because testing showed that naive approaches systematically fail on formal regulatory documents. The long context path bypasses all of it. For most regulations this is adequate. For very large documents, or for regulations where the legally material content sits deep in an annex rather than in the main articles, the long context path may miss specific content that a properly tuned retrieval pipeline would surface.

This is why the provisional answer carries explicit caveats in the UI and why formal ingestion remains mandatory regardless of whether the provisional path succeeds. The provisional answer is a better-than-nothing response for the user in the moment. It is not a substitute for the ingestion pipeline, and we treat it that way architecturally.

The decision framework, stated plainly

Based on what we built, the choice reduces to a few clear questions.

RAG earns its place when the corpus is large enough that injecting everything on every query is not viable, when you need audit trails linking every claim to a specific source, when the same documents will be queried repeatedly and indexing cost can be amortised, and when retrieval precision matters enough to justify the engineering investment. For any product where accuracy is the core value proposition, that last condition is almost always met.

Long context earns its place when the document set is bounded and the task requires global reasoning across the full text rather than targeted retrieval of specific passages, or when the document is not yet indexed and serving something is meaningfully better than serving nothing.

The hybrid pattern makes sense when you have a large, stable corpus that benefits from proper indexing, alongside a predictable class of failures where the corpus is temporarily incomplete. The word predictable is doing real work in that sentence. If you cannot define the boundary condition precisely, you are not building a deliberate hybrid. You are building a system that falls back unpredictably, which undermines the auditability that makes either approach trustworthy in the first place.
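As a rough summary, the framework can be written as a function. This is a deliberately reductive sketch: the `Workload` fields and the `recommend` function are hypothetical, and real decisions involve judgment that booleans do not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    corpus_fits_in_prompt: bool      # bounded document set
    needs_audit_trail: bool          # every claim must link to a source
    repeated_queries: bool           # indexing cost can be amortised
    has_defined_gap_condition: bool  # a precise, detectable fallback boundary


def recommend(w):
    wants_rag = (
        not w.corpus_fits_in_prompt
        or w.needs_audit_trail
        or w.repeated_queries
    )
    if not wants_rag:
        return "long_context"
    # Hybrid only when the fallback boundary can be stated precisely.
    if w.has_defined_gap_condition:
        return "hybrid"
    return "rag"
```

A workload with a large, repeatedly queried corpus and a precisely defined gap condition yields `"hybrid"`, while a one-off review of a single bounded document yields `"long_context"`.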