Why generative RAG feels more like art than science

The feeling nobody warns you about

You read the tutorials. You follow the best practices. You chunk your documents, choose an embedding model, set up a vector store, and write a careful system prompt instructing the model to use only the provided context. The first few test queries work beautifully. The answers are accurate, cited, and fast.

Then you change one thing.

Not a major architectural change. Just the chunk size, from 512 to 384 tokens. Suddenly queries that worked perfectly are retrieving the wrong chunks. Answers that were confident become hedged or wrong. A hallucination you have never seen before appears.

You change it back. The problem that prompted the change returns, and while testing the revert you discover that a different query was already broken. You just hadn’t looked.

This is not a failure of engineering discipline. It is a structural property of generative RAG systems. They are unstable in ways that deterministic software is not. Fixing one failure mode often introduces another. Improving one metric regresses a different one.

The behaviour feels less like engineering and more like a game of whack-a-mole. And maintaining a system like this successfully requires not just technical knowledge but genuine subject matter expertise.

What whack-a-mole looks like in practice

The pattern is consistent across teams and use cases. Each change you make improves something and degrades something else.

Change you make	Improves	But often regresses
Increase chunk size	More context for complex queries	More irrelevant information, more hallucinations
Decrease chunk size	More precise retrieval	Missing context, incomplete answers
Add more documents	Broader coverage	More noise, slower retrieval, conflicting information
Tune the embedding model	Better similarity matching	Bias toward certain phrasing, missed synonyms
Add a reranking step	Better top-k relevance	May filter out correct but slightly off-phrased chunks
Improve the system prompt	Better instruction following	May become too rigid and refuse valid queries
Add query rewriting	Better retrieval for vague questions	May distort the user’s original intent
Increase top-k	Capture more relevant chunks	More chance to retrieve irrelevant content

None of these trade-offs is inherently wrong. The problem is that they interact. A chunk size that works well with one embedding model may perform poorly with another. A system prompt that works at temperature 0.0 may produce repetitive outputs at 0.2. A retrieval strategy that works for short, factual questions may fail for exploratory, open-ended ones.
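The chunk-size rows in the table can be made concrete with a toy example. The sketch below is a minimal fixed-window chunker. It assumes token ids are plain integers and that one fact occupies tokens 370 to 400. The names `chunk_tokens` and `holds_whole_span` are illustrative, not part of any library.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into fixed-size windows with optional overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
        start += step
    return chunks


def holds_whole_span(chunks, first, last):
    """True if some single chunk contains both ends of the span."""
    return any(first in chunk and last in chunk for chunk in chunks)


# Stand-in document: 1200 token ids. The "fact" lives in tokens 370..400 (assumed).
tokens = list(range(1200))
fact_intact_at_512 = holds_whole_span(chunk_tokens(tokens, 512), 370, 400)
fact_intact_at_384 = holds_whole_span(chunk_tokens(tokens, 384), 370, 400)
fact_intact_with_overlap = holds_whole_span(chunk_tokens(tokens, 384, overlap=64), 370, 400)
print(fact_intact_at_512, fact_intact_at_384, fact_intact_with_overlap)
```

In this toy setup the fact sits whole inside one chunk at 512 tokens, is cut in two at 384, and is whole again once a 64-token overlap is added. Nothing about the document changed. Only the boundaries moved, and every other fact in the corpus moved relative to them too.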

Because the interactions are non-linear and unpredictable, you cannot optimise the system once and be done. You can only tune it continuously, responding to failures as they emerge.

To be fair, some of these trade-offs can be sidestepped by fixing things upstream. Better document structuring, richer metadata, and hybrid search combining dense and sparse retrieval reduce the surface area of the problem. But they do not eliminate it. They shift the whack-a-mole game to a different table.
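As one illustration of hybrid search, the sketch below merges a dense ranking and a sparse ranking with reciprocal rank fusion. This is one common fusion method, chosen here as an assumption rather than something this article prescribes. The constant `k=60` is a conventional default, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the order stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]   # hypothetical vector-search result
sparse_ranking = ["c", "a", "d"]  # hypothetical keyword-search result
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, which is why fusion reduces some failures. It also adds one more knob, `k`, to the table you are already playing at.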

Why the architecture resists deterministic control

A generative RAG pipeline is not one system. It is several loosely coupled subsystems, each with its own failure modes.

flowchart TD
    D[Documents] --> C[Chunking]
    C --> E[Embedding]
    Q[User query] --> E
    E --> S[Vector search]
    S --> R[Reranking]
    Q --> P
    R --> P[Prompt construction]
    P --> M[Generation LLM]
    M --> A[Answer]

Each subsystem has knobs. Chunk size, chunk overlap, embedding model choice, similarity metric, top-k, reranker model, system prompt wording, temperature, top_p, frequency penalty. Changing any one of them changes the behaviour of the others in ways that are difficult to predict.

This is the root of the whack-a-mole problem. You are not tuning a single model. You are balancing a system of interacting components. A change that improves performance on your evaluation set may silently break a category of queries you forgot to test.
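A silent regression of this kind is easier to spot when results are broken down by query category instead of averaged. The sketch below is a minimal version of that check. The query ids, categories, and pass/fail values are invented purely for illustration.

```python
from collections import defaultdict


def accuracy_by_category(results, categories):
    """results: query id -> bool (answer correct); categories: query id -> label."""
    totals = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for query_id, is_correct in results.items():
        bucket = totals[categories[query_id]]
        bucket[0] += int(is_correct)
        bucket[1] += 1
    return {label: correct / total for label, (correct, total) in totals.items()}


def regressed_categories(before, after, categories):
    """Labels whose accuracy dropped between two evaluation runs."""
    old = accuracy_by_category(before, categories)
    new = accuracy_by_category(after, categories)
    return sorted(label for label in old if new.get(label, 0.0) < old[label])


# Made-up evaluation results for two configurations.
categories = {"q1": "factual", "q2": "factual", "q3": "factual",
              "q4": "exploratory", "q5": "exploratory"}
before = {"q1": False, "q2": False, "q3": True, "q4": True, "q5": True}
after = {"q1": True, "q2": True, "q3": True, "q4": True, "q5": False}

overall_before = sum(before.values()) / len(before)
overall_after = sum(after.values()) / len(after)
print(overall_before, overall_after, regressed_categories(before, after, categories))
```

Here the overall score rises from 0.6 to 0.8 while the exploratory category gets worse. A single headline number would have reported the change as a pure win.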

The deterministic lie: temperature 0.0 is not enough

Many teams assume that setting temperature to zero makes the model deterministic. It does not. Not fully.

Temperature zero eliminates sampling variance. The model always picks the token with the highest probability. That part is deterministic. But the probabilities themselves are computed using floating-point arithmetic on GPUs, and floating-point addition is not associative: (a + b) + c can differ from a + (b + c) in the last few bits.

When you run the same inference twice on a GPU, the order of parallel operations can vary slightly due to hardware scheduling, driver versions, and memory layout. Most of the time, the differences are too small to change the top token. But when two tokens have nearly identical probabilities, a tiny floating-point difference can flip which one is considered highest.

Once a single token differs, everything generated after it can diverge, because each later token is conditioned on the ones before it. Two runs with the same prompt, same temperature zero, same model can produce different outputs. The probability is low on simple queries, but it is never zero. In practice, teams running repeated tests at temperature zero see different outputs on a small percentage of calls.
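The mechanism is easy to reproduce on a CPU in plain Python. The sketch below is a toy, not a real model: it assumes two candidate tokens whose scores are nearly tied, and it computes one score by summing the same three numbers in two different orders. `greedy_pick` is an illustrative name.

```python
def greedy_pick(logits):
    """Return the token with the highest score; ties go to the first token listed."""
    return max(logits, key=logits.get)


parts = [0.1, 0.2, 0.3]
sum_left_to_right = (parts[0] + parts[1]) + parts[2]
sum_right_to_left = parts[0] + (parts[1] + parts[2])

# Same numbers, different grouping, different result in the last bits.
print(sum_left_to_right == sum_right_to_left)

# Greedy decoding over two near-tied candidates (toy scores, assumed).
pick_a = greedy_pick({"yes": 0.6, "no": sum_left_to_right})
pick_b = greedy_pick({"yes": 0.6, "no": sum_right_to_left})
print(pick_a, pick_b)
```

The two sums are not equal, and the greedy choice flips from `no` to `yes` with no sampling involved. On a GPU the grouping is decided by how parallel work happens to be scheduled, not by you.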

This means that even with perfect retrieval, perfect context, and a perfectly written prompt, the generation step still has a non-deterministic element. You cannot engineer your way to complete stability.
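Since stability cannot be assumed, it can at least be measured. The sketch below calls a generation function repeatedly with the same prompt and reports how often the outputs agree. `generate` is a hypothetical callable standing in for whatever client you use, and the flaky stub only simulates an occasional divergent answer.

```python
import itertools
from collections import Counter


def output_stability(generate, prompt, runs=20):
    """Call generate(prompt) `runs` times and summarise how often outputs agree."""
    counts = Counter(generate(prompt) for _ in range(runs))
    modal_output, hits = counts.most_common(1)[0]
    return {
        "distinct_outputs": len(counts),
        "agreement": hits / runs,  # share of runs that gave the most common output
        "modal_output": modal_output,
    }


# Stand-in for a real model call: three identical answers, then one divergent one.
_fake_outputs = itertools.cycle(["Article 12 applies.", "Article 12 applies.",
                                 "Article 12 applies.", "Article 14 applies."])


def flaky_generate(prompt):
    return next(_fake_outputs)


report = output_stability(flaky_generate, "Which article applies?", runs=8)
print(report)
```

Tracking `agreement` per query over time turns "it sometimes gives a different answer" into a number you can watch when a model version or a provider changes.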

The parallel universes problem

The consequence of all this instability is that two teams working on similar problems can have genuinely different experiences of the same tools.

One team reports that GPT-4 is amazing at reasoning. Another reports that it is sloppy and lazy. Both are telling the truth about their specific setup, their specific prompts, their specific temperature settings, and the random seeds their API calls happened to use.

One team says RAG fixed all their hallucinations. Another says RAG made theirs worse. Both are correct. The difference is chunk size, embedding model, document quality, and the specific queries they are testing.

These disagreements are not resolvable through anecdote. You cannot settle an argument about LLM behaviour by saying “try it yourself” because the other person’s attempt will differ from yours, sometimes meaningfully.

This is what it means to work with probabilistic systems. Your experience is real for your setup. Their contradictory experience is real for theirs. The only way to compare rigorously is controlled evaluation on shared datasets with fixed parameters.
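A small prerequisite for that kind of comparison is knowing that two runs really used the same parameters. The sketch below, offered as one possible approach, derives a short fingerprint from a configuration dictionary so results can be labelled with the setup that produced them. The field names and values are assumptions chosen for illustration.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a short, order-independent hash of a JSON-serialisable config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical pipeline settings; the keys are illustrative, not a standard schema.
team_a = {"chunk_size": 512, "top_k": 5, "temperature": 0.0, "embedding_model": "model-x"}
team_b = {"temperature": 0.0, "embedding_model": "model-x", "top_k": 5, "chunk_size": 512}
team_c = dict(team_a, chunk_size=384)

same_setup = config_fingerprint(team_a) == config_fingerprint(team_b)
different_setup = config_fingerprint(team_a) != config_fingerprint(team_c)
print(same_setup, different_setup)
```

If two teams report conflicting results under different fingerprints, the disagreement is about setups, not about the model.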

Why subject matter expertise is not optional

Here is where many teams make a mistake. They assume that RAG is an engineering problem. Hire good engineers, choose the right vector database, tune the retrieval, and the system works.

This misses something critical. When a RAG system fails, the failure often looks like a plausible but wrong answer. The model retrieves the wrong chunk but the chunk is still on topic. The model ignores the retrieved context and falls back on its training. The model misinterprets a correctly retrieved regulation because the language is ambiguous.

A pure engineer looking at this failure sees a black box. They can tune parameters, but they cannot tell whether the answer is right because they do not know the domain. A subject matter expert looking at the same failure sees exactly what went wrong. They know which regulation article was supposed to be retrieved. They know that the model misinterpreted a conditional clause. They know that the cited source does not actually support the claim the model made.

This is why the person maintaining a RAG system needs both up-to-date knowledge of RAG techniques and deep subject matter expertise. The technical knowledge tells you which knobs to turn. The domain expertise tells you whether you turned them the right way.

Without both, you are tuning blind. You can optimise retrieval metrics without improving answer quality. You can reduce measured hallucination rates on your test set while introducing new failure modes you never thought to test.

The gap is concrete, not abstract. An engineer sees retrieval precision of 0.87 and considers it a success. A domain expert reads the top retrieved chunk and immediately recognises it is the wrong version of the regulation: superseded eighteen months ago, still in the index, still topically similar enough to rank highly. The metric looked fine. The answer was wrong. Only one of those two people knew it.
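Once the expert has spotted a problem like this, the knowledge can sometimes be encoded so the pipeline checks it automatically. The sketch below assumes each chunk carries a `superseded_on` date in its metadata. That field, the `Chunk` class, and the sample documents are hypothetical, and deciding which versions are superseded is still the expert's job.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    superseded_on: Optional[date] = None  # hypothetical metadata field


def drop_superseded(chunks, as_of):
    """Keep only chunks that were still in force on the `as_of` date."""
    return [c for c in chunks
            if c.superseded_on is None or c.superseded_on > as_of]


# Invented retrieval result: the old version ranks first on similarity.
retrieved = [
    Chunk("regulation-v1", "Old wording of the clause.", superseded_on=date(2023, 1, 1)),
    Chunk("regulation-v2", "Current wording of the clause."),
]
in_force = drop_superseded(retrieved, as_of=date(2024, 7, 1))
print([c.doc_id for c in in_force])
```

The filter is trivial. Knowing that the index needed it is the part that required the domain expert.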

The evaluation problem no one solves perfectly

If you cannot rely on intuition and you cannot rely on deterministic guarantees, you need evaluation. But evaluation for RAG is harder than it looks.

A good evaluation set requires hundreds of question-answer pairs with verified source citations. Creating it requires subject matter expertise. Maintaining it requires updating the pairs as the underlying documents change. Running it after every change requires automation.

Even with a good evaluation set, you face a choice. Do you measure retrieval accuracy (did the right chunk appear in the top-k)? Or do you measure answer accuracy (was the final answer correct)? The first is easier to automate but does not guarantee the second. The second requires human judgment or a stronger LLM acting as a judge, which introduces its own biases.
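The difference between the two measurements can be sketched in a few lines. The example below computes a top-k hit rate for retrieval and a plain accuracy over answer judgements. The query ids, chunk ids, and judgements are invented, and the judgements are assumed to come from a human reviewer or an LLM judge.

```python
def hit_rate_at_k(retrieved, expected, k):
    """Share of queries whose expected chunk id appears in the top-k results."""
    hits = sum(1 for query_id, gold in expected.items()
               if gold in retrieved.get(query_id, [])[:k])
    return hits / len(expected)


def answer_accuracy(judgements):
    """Share of answers judged correct (judgements: query id -> bool)."""
    return sum(judgements.values()) / len(judgements)


# Made-up evaluation data for three queries.
expected = {"q1": "c7", "q2": "c2", "q3": "c9"}
retrieved = {"q1": ["c7", "c1"], "q2": ["c4", "c2"], "q3": ["c9", "c3"]}
judgements = {"q1": True, "q2": False, "q3": True}

retrieval_score = hit_rate_at_k(retrieved, expected, k=2)
answer_score = answer_accuracy(judgements)
print(retrieval_score, round(answer_score, 2))
```

In this invented case every expected chunk is retrieved, yet one answer in three is still judged wrong. Retrieval metrics bound what the generator can get right; they do not tell you what it did get right.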

Most teams compromise on a combination: automated metrics for retrieval, sampled human review for answer quality. The samples will never cover everything, but chosen well, with an eye for edge cases, query types, and domain-specific traps, they cover enough. It is not perfect rigour. It is calibrated pragmatism, and for most systems it is the right call.
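One way to choose review samples "with an eye for edge cases" is to sample per query type instead of uniformly, so rare categories are not drowned out. The sketch below is a minimal stratified sampler. The record layout and the `query_type` labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Pick up to `per_group` records from each group defined by records[i][key]."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Invented log of answered queries: many routine ones, few edge cases.
records = [{"id": i, "query_type": "factual"} for i in range(10)]
records += [{"id": 100 + i, "query_type": "edge_case"} for i in range(2)]

review_set = stratified_sample(records, key="query_type", per_group=3)
review_counts = Counter(r["query_type"] for r in review_set)
print(dict(review_counts))
```

A uniform sample of five from this log would usually contain few or no edge cases. The stratified one always includes both of them.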

The tooling has matured. Frameworks like RAGAS and platforms like LangSmith have made systematic evaluation significantly more accessible than it was even two years ago. They give you a workable path. But workable is not the same as solved. The frameworks still need a human to decide what a correct answer looks like, and that human still needs to know the domain.

What this means for the people building RAG systems

The honest picture is this. Generative RAG is powerful. It can ground answers in your documents, reduce hallucinations compared to raw LLMs, and provide citations that let users verify claims. But it is not a set-it-and-forget-it system. It requires ongoing maintenance. It surprises you. It does things you did not expect.

The teams that succeed with it have two things. First, they have someone who stays current with RAG research, who knows when to use HyDE versus query expansion, who understands the trade-offs between embedding models, and who can diagnose retrieval failures. Second, they have someone who knows the domain well enough to tell when an answer is wrong, what the correct answer should be, and which source documents actually contain it.

The best case is one person filling both roles: a subject matter expert who has learned RAG engineering, or an engineer who has developed deep domain expertise. Either path works, but you cannot skip one side.

Pure engineers who do not know the domain will tune metrics without understanding quality. Pure subject matter experts who do not understand RAG will be frustrated by the black box and unable to improve it. The hybrid practitioner, the one who speaks both languages, is the one who can maintain a RAG system over time.

The bottom line

Generative RAG feels like art because it is not fully controllable. The components interact in unpredictable ways. Temperature zero does not guarantee determinism. Improvements in one area cause regressions in another.

That does not make it any less worth building. It means you need the right expectations and the right people. Technical currency keeps the system running. Domain expertise keeps it accurate. Neither is sufficient alone.

The teams that understand this will build RAG systems that work. The teams that assume it is pure engineering will keep playing whack-a-mole and wondering why they cannot win.