Tech Nerve Journal

The four ways RAG breaks in production.

Retrieval-augmented generation looks perfect on the whiteboard and breaks on Monday morning. These are the four failure modes we see most often in client systems — and what actually fixes them.

Tech Nerve AI / ML team · 7 min

RAG is the default architecture for LLM features that need to be grounded in a client's own knowledge. It also has a remarkably low survival rate once it meets real users. We have rebuilt enough of these systems for clients to have a consistent list of the things that go wrong. None of them are exotic. All of them are fixable.

1. The chunking is wrong

Most pipelines chunk documents by a fixed token count. That optimises for the embedding model's context window — not for the unit of meaning your users actually ask about. A 512-token chunk cut mid-paragraph loses its subject. A medical note split mid-sentence loses its diagnosis. A legal clause separated from its definition becomes gibberish.

Ask your users what a complete answer looks like and chunk toward that shape. For policy documents that is usually the clause. For product docs it is the section. For long-form content it is the paragraph plus its heading. Then overlap by one full unit of meaning, not by a handful of tokens.
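A minimal sketch of unit-of-meaning chunking, assuming markdown-style `## ` section headings mark the units; the splitter and the `overlap_units` knob are illustrative, not a prescription:

```python
def chunk_by_section(text: str, overlap_units: int = 1) -> list[str]:
    """Split on headings, keep each section whole, overlap by whole sections."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("## ") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Overlap by one full unit of meaning, not a handful of tokens: each
    # chunk carries its predecessor section(s) so context survives the cut.
    chunks = []
    for i, sec in enumerate(sections):
        context = sections[max(0, i - overlap_units):i]
        chunks.append("\n\n".join(context + [sec]))
    return chunks
```

For clause-shaped or paragraph-shaped corpora, only the splitter condition changes; the overlap logic stays the same.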

2. The embeddings are lying

Semantic similarity and answer-relevance are not the same problem. The query "does the employee handbook allow working from home?" embeds close to "is working from home forbidden?" — fine. It also embeds close to a paragraph about equipment reimbursement that happens to share surface vocabulary. Dense retrieval will happily hand you the wrong one.

Pair embedding retrieval with a reranker. Cohere's rerank-english, Jina's rerank or a fine-tuned cross-encoder will lift top-1 accuracy by double digits on most corpora. Evaluate retrieval with labelled query-passage pairs. Not cosine numbers — actual precision at k.
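A sketch of the retrieve-then-rerank shape. `score_fn` stands in for whichever cross-encoder or hosted rerank call you plug in (the word-overlap scorer in the test below is a toy, not a real model), and the precision-at-k harness is what the labelled-pair evaluation above boils down to:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float]) -> list[str]:
    """Reorder dense-retrieval candidates by a (query, passage) relevance
    score — e.g. a fine-tuned cross-encoder or a hosted rerank endpoint."""
    return sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Evaluate against labelled query-passage pairs, not cosine numbers:
    what fraction of the top-k results are actually relevant?"""
    top_k = retrieved[:k]
    return sum(1 for p in top_k if p in relevant) / max(len(top_k), 1)
```

The point of injecting `score_fn` is that you can swap vendors (or drop to a local cross-encoder) without touching the pipeline.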

3. The index is stale

RAG systems that index once and never re-index produce confident, cited, wrong answers. The cite says "Policy v3"; the policy is on v7. We have seen production systems where fifteen percent of answers referenced a document that had been superseded for eighteen months.

  • Run a continuous indexing job triggered by source-of-truth webhooks — never a nightly cron on everything.
  • Version every chunk. Retrieval includes the version; the answer cites the version.
  • Track which chunks answered which queries. When a chunk goes cold or contradicts a newer chunk, flag it for review.
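The three bullets above can be sketched as a small versioned index. `VersionedIndex` and its method names are hypothetical illustrations, not a real library:

```python
import time
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    version: int
    text: str
    last_hit: float = 0.0  # last time this chunk answered a query

class VersionedIndex:
    def __init__(self) -> None:
        self.chunks: dict[str, Chunk] = {}  # doc_id -> latest chunk

    def upsert(self, doc_id: str, version: int, text: str) -> None:
        """Called from a source-of-truth webhook, not a nightly cron.
        An out-of-order replay of an older version is ignored."""
        current = self.chunks.get(doc_id)
        if current is None or version > current.version:
            self.chunks[doc_id] = Chunk(doc_id, version, text)

    def retrieve(self, doc_id: str) -> tuple[str, str]:
        """Return (text, citation) — the answer always cites the version."""
        chunk = self.chunks[doc_id]
        chunk.last_hit = time.time()
        return chunk.text, f"{doc_id} v{chunk.version}"

    def cold_chunks(self, older_than: float) -> list[str]:
        """Flag chunks that have not answered a query recently for review."""
        cutoff = time.time() - older_than
        return [c.doc_id for c in self.chunks.values() if c.last_hit < cutoff]
```

In a real system the version comes from the source of truth (a CMS revision ID, a git SHA), not a counter you maintain yourself.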

4. The generator is over-trusted

The final LLM is not a verifier. It will synthesise an answer from whatever retrieval hands it — including three near-duplicate chunks and one obsolete policy. Wrap generation in a structured critique: does the answer actually cite a source in context, and does the source actually support the claim?

A RAG system without a critique pass is a system that cannot tell when it is wrong. That is not a minor gap. It is the whole difference between a demo and a product.

The critique pass is cheap to run and catches most public-facing embarrassment before it happens. We typically run it as a second pass with a smaller model — the same one doing the generation, a 4-bit local model, or the fastest tier of whichever vendor you are already paying. Budget: a few cents per answer. Return: the difference between a feature you can ship and a feature you cannot.
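A sketch of the structured critique. It assumes citations look like `[doc-id vN]` and that `judge` is your second-pass model call — both are assumptions for illustration, not a standard format:

```python
import re
from typing import Callable

CITATION = re.compile(r"\[([\w.-]+ v\d+)\]")  # assumed format: [doc-id vN]

def critique(answer: str, retrieved: dict[str, str],
             judge: Callable[[str, str], bool]) -> tuple[bool, list[str]]:
    """Two checks, matching the questions in the text: (1) does the answer
    cite a source that was actually retrieved, and (2) does the judge model
    agree the source supports the claim? `judge` is a hypothetical
    (answer, source_text) -> bool call into a small second-pass model."""
    problems: list[str] = []
    cited = CITATION.findall(answer)
    if not cited:
        problems.append("no citation in answer")
    for ref in cited:
        source = retrieved.get(ref)
        if source is None:
            problems.append(f"cites {ref}, which was not retrieved")
        elif not judge(answer, source):
            problems.append(f"{ref} does not support the claim")
    return (not problems, problems)
```

Failing answers get regenerated or routed to a fallback ("I can't confirm that from the handbook") rather than shipped.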

What actually ships

Production RAG is not one system. It is a pipeline: hybrid retrieval (BM25 plus dense), rerank, structured generation with citations, critique pass, feedback loop into your eval set. Skip any of these and you will be back here, rebuilding.
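The hybrid step is usually glued together with reciprocal rank fusion — a minimal sketch, taking the BM25 and dense result lists as already computed:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple rankings (e.g. BM25 and dense) into one. Each document
    scores 1/(k + rank + 1) per list it appears in; k=60 is the commonly
    used default from the original RRF formulation."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalisation across retrievers, which is exactly why it is the low-effort choice for fusing BM25 with dense results.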

Tagged
  • RAG
  • Retrieval
  • LLM
  • Architecture