Designing Scalable RAG Systems: Patterns and Pitfalls

Retrieval-augmented generation has moved from a clever trick to the default architecture for any LLM application that needs to ground its answers in a specific corpus. Most RAG systems that ship today look broadly similar on a whiteboard and perform wildly differently in practice. The gap almost always lives in the details.

Chunking is a modeling decision

The single most consequential choice in a RAG system is how the source documents are split. Too small and you lose the context that makes a passage useful. Too large and the embedding becomes a blurry average that matches everything and nothing. Semantic chunking, which respects document structure like sections and paragraphs, tends to outperform fixed-size chunking when the two are compared on the same query set.

Overlap between chunks matters more than most teams expect. A little overlap preserves context across boundaries that would otherwise split a critical sentence or separate it from the sentences that explain it. Too much overlap wastes storage and inflates retrieval cost without improving quality.
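As a minimal sketch of the idea, the function below packs whole paragraphs into chunks and repeats the tail of each chunk at the start of the next. The function name, the blank-line paragraph delimiter, and the default sizes are assumptions chosen for illustration, not recommendations; real documents usually need a parser that understands their own structure.

```python
def chunk_paragraphs(text: str, max_chars: int = 800,
                     overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    The last `overlap_paragraphs` paragraphs of each chunk are repeated at
    the start of the next one. A chunk can exceed max_chars when a single
    paragraph (plus the carried-over overlap) is longer than the limit.
    """
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry its tail forward as overlap.
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Overlapping by whole paragraphs rather than by characters keeps sentences intact; whether one paragraph of overlap is enough is something to measure against your own queries.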

Embeddings are not interchangeable

The leap in embedding quality over the past year has been significant, but the right model for a domain is often not the current leader on a public benchmark. Domain-specific corpora benefit disproportionately from embeddings that were either fine-tuned on similar text or selected by honest evaluation against your own query set. The cost of running this evaluation is modest. The cost of shipping with the wrong embedding model is not.
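One inexpensive way to run that evaluation is recall@k over a small set of labeled queries. The sketch below assumes each query has exactly one known relevant document and that `embed` is whatever callable wraps the model under test; both the callable and the data shapes are hypothetical, and the brute-force cosine search is only practical for a small evaluation corpus.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(embed: Callable[[str], Sequence[float]],
                docs: dict[str, str],
                labeled_queries: list[tuple[str, str]],
                k: int = 5) -> float:
    """Fraction of queries whose relevant document appears in the top k.

    `labeled_queries` holds (query_text, relevant_doc_id) pairs.
    """
    if not labeled_queries:
        raise ValueError("need at least one labeled query")
    doc_vectors = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in labeled_queries:
        query_vector = embed(query)
        # Brute-force ranking of every document by similarity to the query.
        ranked = sorted(doc_vectors,
                        key=lambda d: cosine(query_vector, doc_vectors[d]),
                        reverse=True)
        if relevant_id in ranked[:k]:
            hits += 1
    return hits / len(labeled_queries)
```

Running the same function with each candidate model's `embed` callable gives a like-for-like comparison on your own data rather than on a public benchmark.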

Hybrid retrieval outperforms pure vector search

Dense vector retrieval is excellent at semantic similarity and unreliable at exact matches, proper names, and rare terms. Keyword retrieval, such as BM25, has the opposite profile: strong on exact terms and weak on paraphrase. A hybrid system that fuses the two, using reciprocal rank fusion or a learned combiner, consistently does better than either alone on real workloads. Anyone building a serious RAG system should consider pure vector search a starting point, not a final architecture.
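Reciprocal rank fusion is simple enough to sketch in a few lines. Each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. The function name and input shape below are illustrative assumptions; k = 60 is a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each list is ordered best-first. Only ranks are used, so the raw scores
    of the dense and keyword retrievers never need to be calibrated against
    each other.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical document IDs from a dense retriever and a keyword retriever.
dense_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
# fused == ["a", "c", "b", "d"]: documents found by both retrievers rise to the top.
```

A learned combiner can do better when labeled data is available, but rank fusion needs no training and makes a reasonable baseline to beat.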

Re-ranking changes the economics

A cross-encoder re-ranker applied to the top fifty or one hundred candidates from the first-stage retrieval produces dramatically better top-ten results than the retriever alone. It is the single highest-impact addition most RAG systems can make. The compute cost is modest because the re-ranker scores only that small candidate set, not the whole corpus.

The catch is latency. Re-rankers add tens to hundreds of milliseconds, and teams with tight budgets sometimes need to trade depth of re-ranking for speed. That tradeoff should be measured, not guessed.
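A rough way to measure that tradeoff is to time the re-ranking stage at several depths on representative queries. In the sketch below, `score_pair` is a hypothetical stand-in for a call to whatever cross-encoder is in use, and the function names are assumptions. A real re-ranker would normally score candidates in batches, so treat the timings as relative rather than absolute.

```python
import time
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float],
           depth: int, top_n: int = 10) -> list[str]:
    """Re-order the first `depth` candidates by score_pair(query, doc).

    Candidates beyond `depth` keep their first-stage order and are appended
    after the re-ranked head.
    """
    head, tail = candidates[:depth], candidates[depth:]
    reranked = sorted(head, key=lambda doc: score_pair(query, doc), reverse=True)
    return (reranked + tail)[:top_n]


def measure_depths(query: str, candidates: list[str],
                   score_pair: Callable[[str, str], float],
                   depths: list[int]) -> dict[int, float]:
    """Return wall-clock milliseconds spent re-ranking at each depth."""
    timings: dict[int, float] = {}
    for depth in depths:
        start = time.perf_counter()
        rerank(query, candidates, score_pair, depth)
        timings[depth] = (time.perf_counter() - start) * 1000.0
    return timings
```

Pairing each depth's latency with its quality on a labeled query set shows where the curve flattens, which is usually the depth worth shipping.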

Evaluation is the hardest part

Most RAG teams spend too little time on evaluation, which is precisely why so many deployed systems quietly underperform. No single number captures quality. Retrieval quality, grounding quality, faithfulness, and answer usefulness are distinct, and a system can regress on one while improving on another. Synthetic evaluation with an LLM judge works well for rapid iteration but needs human spot-checks to stay honest.
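Because the metrics move independently, it helps to compare them one by one between a baseline and a candidate instead of averaging them. The sketch below assumes each metric is a float where higher is better; the function name and the tolerance default are illustrative assumptions.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.01) -> dict[str, float]:
    """Return {metric: delta} for every metric that dropped by more than tolerance.

    Assumes higher is better for every metric. A metric missing from
    `candidate` raises KeyError, so a forgotten measurement cannot pass
    silently.
    """
    regressions: dict[str, float] = {}
    for name, base_value in baseline.items():
        delta = candidate[name] - base_value
        if delta < -tolerance:
            regressions[name] = delta
    return regressions
```

A change that lifts retrieval recall while lowering faithfulness would show up here as a regression even though an averaged score might look flat or improved.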

Where RAG systems break

The common failure modes are predictable. Stale indexes that do not reflect updated source documents. Prompt templates that do not actually use the retrieved context. Hallucinations that survive because grounding metrics are not measured. Query distributions in production that look nothing like the ones used for evaluation. Each of these is an operational problem as much as a modeling problem, and each yields to disciplined measurement.
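Stale indexes are the easiest of these to catch mechanically. One approach, sketched below under the assumption that the index stores a content hash alongside each document ID, is to diff the hashes of the current sources against the stored ones. The function names and the dictionary shapes are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_sync(source_docs: dict[str, str],
                    indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current sources with what the index believes it holds.

    `source_docs` maps doc ID to current text; `indexed_hashes` maps doc ID
    to the hash recorded when that document was last embedded.
    """
    # New or changed documents need to be re-chunked and re-embedded.
    upsert = sorted(doc_id for doc_id, text in source_docs.items()
                    if indexed_hashes.get(doc_id) != content_hash(text))
    # Documents that no longer exist at the source should leave the index.
    delete = sorted(doc_id for doc_id in indexed_hashes
                    if doc_id not in source_docs)
    return {"upsert": upsert, "delete": delete}
```

Tracking the size of the resulting plan over time also gives a simple freshness metric to alert on.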

Building a RAG prototype is easy. Building a RAG system that remains accurate, fresh, and affordable as it scales is an ongoing engineering effort. The systems that succeed treat retrieval, generation, and evaluation as equal citizens.