What Is Retrieval-Augmented Generation (RAG)?

Retrieval-augmented generation (RAG) pairs a search index with a language model so answers are grounded in specific documents, not training-data recall.

StackTower AI editorial team

Retrieval-augmented generation (RAG) is an architecture that augments a large language model (LLM) with an external search index at inference time. Rather than drawing solely on what it absorbed from its training corpus, the system first fetches a set of relevant text passages, and the LLM then synthesizes a response grounded in those passages. Lewis et al. introduced the pattern at NeurIPS 2020 [1]; it addresses two persistent weaknesses of pretrained models: knowledge cutoff dates and unreliable recall of specific facts from a large training corpus.

The Two-Stage Architecture

RAG pipelines consist of a retriever and a generator operating in sequence.

Retriever
Accepts an incoming query, encodes it as a dense embedding vector, and returns the k nearest passage vectors from a pre-indexed corpus. Similarity is computed as cosine similarity or dot product in the embedding space. The corpus is indexed offline by chunking source text, encoding each chunk, and writing the vectors to a vector store.
Generator
A pretrained LLM that receives the original query concatenated with the retrieved passages as context, then produces a grounded response. The context is inserted into the prompt template so the LLM cites the retrieved material rather than falling back on compressed parametric memory.
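
The retriever's nearest-neighbor step can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the in-memory dict `toy_index`, the chunk IDs, and the three-dimensional vectors are all hypothetical stand-ins for a real vector store and real embeddings, and the search is brute-force rather than approximate.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query.

    `index` is a hypothetical stand-in for a vector store: a dict that
    maps chunk IDs to precomputed embedding vectors.
    """
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors, invented for illustration only.
toy_index = {
    "reset-admin-password": [0.9, 0.1, 0.0],
    "billing-overview": [0.0, 0.2, 0.9],
    "workspace-roles": [0.7, 0.6, 0.1],
}
top = retrieve([1.0, 0.0, 0.0], toy_index, k=2)
```

Real deployments usually replace the exhaustive loop with an approximate nearest-neighbor index, trading a little recall for much lower latency on large corpora.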

Lewis et al. formalized two retrieval variants: RAG-Sequence (the same retrieved passage conditions the entire generated sequence) and RAG-Token (each output token may draw on a different retrieved passage) [1]. Most deployed pipelines resemble the RAG-Sequence approach, retrieving once per query, because it simplifies evaluation and debugging.

Step-by-Step: A Support Documentation Example

Suppose a software vendor has indexed 8,000 documentation pages in a vector store. A support technician submits: “What are the steps to reset a workspace administrator password in version 4.2?”

  1. The retriever encodes the query into a vector.
  2. Nearest-neighbor search returns the three closest documentation chunks.
  3. Those chunks are inserted into the LLM’s context window alongside the original question.
  4. The LLM outputs a step-by-step answer that cites the retrieved passages verbatim.

Without the retrieval step, the LLM would reconstruct the answer from training memory, which may predate the v4.2 release, conflate procedures across product lines, or omit company-specific configuration details. With the retrieval step, the output is traceable to authoritative source text.
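
Step 3 of the walkthrough, inserting the retrieved chunks into the context window, might look like the sketch below. The function name `build_prompt`, the template wording, the source IDs, and the chunk text are all assumptions made for illustration; none of them come from a specific library or from the vendor's documentation.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    `chunks` is a list of (source_id, text) pairs. The template wording
    is illustrative, not a recommended or standard prompt.
    """
    context_lines = [
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered passages below. "
        "Cite passage numbers, and say so if the passages do not "
        "contain the answer.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunk text, invented for this example.
retrieved = [
    ("admin-guide-4.2#reset",
     "Open Settings, choose Administrators, then select Reset password."),
    ("admin-guide-4.2#roles",
     "Only workspace owners can reset administrator credentials."),
]
prompt = build_prompt(
    "What are the steps to reset a workspace administrator password "
    "in version 4.2?",
    retrieved,
)
```

Numbering the passages and carrying a source ID with each one is what lets the generated answer be traced back to specific documentation pages.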

The same pattern applies to legal case research, biomedical literature review, and financial filings analysis: any context where traceability to a primary source matters more than fluent paraphrase.

Three Implementation Trade-offs

  1. Chunking granularity. Fixed-size chunks (e.g., 512 tokens with 50-token overlap) are the default starting point. Semantic chunking at paragraph or section boundaries improves retrieval precision for structured documentation but requires more complex preprocessing. Chunk size directly controls context-window consumption per request.

  2. Embedding selection. The embedding model determines how well cosine similarity in the vector space correlates with semantic relevance for the target corpus. Domain-adapted fine-tuning of the embedding model can improve recall on specialized text such as legal statutes or biomedical abstracts.

  3. Reranking. A cross-encoder reranker scores each retrieved passage against the query individually, reordering the top-k list before it reaches the LLM. This two-stage pattern (fast approximate vector search followed by precise cross-encoder reranking) is documented in the LangChain retrieval reference [2].
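
Two of the trade-offs above, fixed-size chunking with overlap and reranking, are simple enough to sketch. The code below is a rough illustration under stated assumptions: tokens are represented as a plain list (integers stand in for token IDs), and `word_overlap_score` is a toy placeholder for a cross-encoder, which in practice would be a trained model scoring each query-passage pair.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and below chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks


def rerank(query, passages, score_fn, top_n=3):
    """Reorder passages by score_fn(query, passage), highest first."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:top_n]


def word_overlap_score(query, passage):
    """Toy scorer (an assumption): counts shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


# Integers stand in for token IDs produced by a real tokenizer.
chunks = chunk_tokens(list(range(1000)), chunk_size=512, overlap=50)

candidates = [
    "billing cycle dates",
    "reset the administrator password",
    "password policy",
]
best = rerank("reset administrator password", candidates,
              word_overlap_score, top_n=2)
```

With 1,000 tokens, a 512-token chunk size, and a 50-token overlap, the stride is 462 tokens, so the final chunk is shorter than the others. Passing the scorer in as a parameter keeps the reranking step independent of whichever model is eventually used.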

RAG Versus Fine-Tuning

Fine-tuning
Adjusts the LLM’s weights by continued training on a target corpus. This modifies the model’s behavior and injects implicit knowledge into its parameters, but does not give the model access to text added after fine-tuning completed. Each knowledge update requires a new training run.
RAG
Leaves model weights unchanged. The indexed corpus is updated by re-ingesting new documents, typically in minutes rather than hours of GPU training. Suited to frequently changing reference material: policy documents, product catalogs, or regulatory filings.

The practical decision rule: fine-tune to change how the model writes or reasons; use RAG to change what the model can cite. The OpenAI Cookbook illustrates the question-answering-with-embeddings pattern that operationalizes this distinction [3].

Evaluation Axes

RAG quality is measured at two independent points in the pipeline:

  1. Retrieval quality. Recall@k measures what fraction of relevant passages appear in the top-k results; mean reciprocal rank (MRR) captures how highly the first relevant passage is ranked. A high-quality LLM cannot compensate for a retriever that fails to surface the relevant passage.

  2. Generation faithfulness. Given the retrieved context, does the LLM’s output contradict the source passages or assert claims they do not support? The RAGAS framework (github.com/explodinggradients/ragas) operationalizes faithfulness, answer relevance, and context precision into a reproducible evaluation harness. A ground-truth test set of roughly 50 question-answer pairs is a reasonable minimum for comparing RAG configurations across iterations.
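
The two retrieval metrics named above are short to compute by hand. The sketch below assumes each query has a ranked list of retrieved chunk IDs and a set of IDs judged relevant; the IDs and the three-query test set are invented for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant hit across queries.

    `results` is a list of (retrieved_ids, relevant_ids) pairs, one per
    query. A query with no relevant hit contributes 0.
    """
    if not results:
        return 0.0
    total = 0.0
    for retrieved_ids, relevant_ids in results:
        relevant = set(relevant_ids)
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)


# Hypothetical three-query test set.
runs = [
    (["a", "b", "c"], ["b"]),        # first relevant hit at rank 2
    (["d", "e", "f"], ["d", "x"]),   # first relevant hit at rank 1
    (["g", "h", "i"], ["z"]),        # no relevant hit
]
mrr = mean_reciprocal_rank(runs)            # (1/2 + 1 + 0) / 3 = 0.5
recall = recall_at_k(["d", "e", "f"], ["d", "x"], k=3)  # 1 of 2 = 0.5
```

Generation faithfulness is harder to score this way because it requires judging claims against passages, which is the part a harness such as RAGAS is meant to handle.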

Frequently Asked Questions

How does RAG differ from keyword search?

Keyword search ranks results by term-overlap statistics (BM25 or TF-IDF); it does not synthesize an answer. RAG typically uses dense vector similarity for retrieval (capturing semantic paraphrase rather than lexical match), then passes the retrieved passages to an LLM to synthesize a direct response.

Does RAG remove hallucination?

RAG reduces fabrication on questions within the indexed scope by grounding LLM output in retrieved passages. It does not eliminate fabrication: the LLM may still misread or misattribute retrieved text, and out-of-scope queries fall back on parametric memory. Faithfulness scoring via an evaluation harness is the practical mechanism for quantifying residual hallucination in a specific deployment.

When is RAG the wrong architecture?

RAG adds inference latency, indexing overhead, and operational complexity. For narrow, stable question sets with a finite known answer space, a lookup table or a fine-tuned classifier is more efficient. RAG becomes the appropriate choice when the answer space is too large or changes too rapidly to be captured in a system prompt or a model’s weights.


StackTower AI editorial team: AI learning paths and practical tooling explainers. This article was written with AI assistance and reviewed by the StackTower AI editorial board.

Published 2026-05-11.

Footnotes

  1. Lewis, P. et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020. arXiv:2005.11401. https://arxiv.org/abs/2005.11401

  2. LangChain documentation, “RAG” concept guide. https://python.langchain.com/docs/concepts/rag/

  3. OpenAI Cookbook, “Question answering using embeddings-based search.” https://cookbook.openai.com/examples/question_answering_using_embeddings
