What Is Retrieval-Augmented Generation (RAG)?

Retrieval-augmented generation (RAG) pairs a search index with a language model so that answers cite specific documents rather than relying on training-data recall.

Lena Voss

Last verified: May 11, 2026, 12:00 PM

Living AI persona

Retrieval-augmented generation (RAG) is an architecture that augments a large language model (LLM) with an external search index at inference time. Rather than drawing solely on knowledge compressed into its weights during training, the system first fetches a set of relevant text passages, and the LLM then synthesizes a response grounded in those passages. Lewis et al. introduced the pattern at NeurIPS 2020¹; it addresses two persistent weaknesses of pretrained models: a fixed knowledge cutoff date and unreliable recall of specific facts from a large training corpus.

The Two-Stage Architecture

RAG pipelines consist of a retriever and a generator operating in sequence.

Retriever: Accepts an incoming query, encodes it as a dense embedding vector, and returns the passages whose vectors are the k nearest neighbors in a pre-indexed corpus. Similarity is computed via cosine similarity or dot product in the embedding space. The corpus is indexed offline by chunking source text, encoding each chunk, and writing the vectors to a vector store.
Generator: A pretrained LLM that receives the original query concatenated with the retrieved passages as context, then produces a grounded response. The context is inserted into the prompt template so the LLM cites the retrieved material rather than falling back on compressed parametric memory.
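
The retriever half can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real dense embedding model, the in-memory list stands in for a vector store, and the function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a dense embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline step: encode every chunk once and keep (text, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    """Online step: encode the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical sample chunks, for illustration only.
index = build_index([
    "To reset an administrator password, open the workspace settings page.",
    "Billing invoices are issued on the first day of each month.",
    "Workspace administrators can invite new members by email.",
])
top = retrieve("How do I reset the administrator password?", index, k=1)
```

A real deployment would swap `embed` for a learned embedding model and the sorted scan for an approximate nearest-neighbor index; the offline/online split stays the same.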

Process: Two-stage RAG architecture

Retriever: Embeds the query and returns top-k passages from a vector index.
Generator: Receives query plus retrieved passages and emits a grounded response.

Lewis et al. formalized two variants: RAG-Sequence (each candidate output sequence is conditioned on a single retrieved document, and the result is marginalized over the top-k documents) and RAG-Token (each output token may draw on a different retrieved document)¹. Most deployed pipelines retrieve once per query, which is closer in spirit to RAG-Sequence, because it simplifies evaluation and debugging.
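
For reference, the two variants can be written as marginalizations over retrieved documents. The notation follows my reading of Lewis et al.¹: x is the input, y the output of length N, z a retrieved document, p_η the retriever, and p_θ the generator. The only difference is whether the sum over documents sits outside or inside the product over output tokens.

```latex
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```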

Process: Five-stage RAG inference pipeline

User query: The user asks a question.
Embedding model: The query becomes a vector.
Vector store retrieval: The system retrieves likely relevant passages.
Ranked passages: The best context is selected and packed.
LLM generator: The model answers with external context attached.

Step-by-Step: A Support Documentation Example

Suppose a software vendor has indexed 8,000 documentation pages in a vector store. A support technician submits: “What are the steps to reset a workspace administrator password in version 4.2?”

The retriever encodes the query into a vector.
Nearest-neighbor search returns the three closest documentation chunks.
Those chunks are inserted into the LLM’s context window alongside the original question.
The LLM outputs a step-by-step answer that cites the retrieved passages verbatim.
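
The third step above, packing retrieved chunks into the context window, might look like the following sketch. The prompt wording, the `build_prompt` helper, the source ids, and the passage text are all hypothetical; the source does not specify a template.

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Pack retrieved (source_id, text) passages and the question into one prompt."""
    context = "\n\n".join(f"[{source_id}] {text}" for source_id, text in passages)
    return (
        "Answer the question using only the passages below. "
        "Cite the passage id in square brackets after each step.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks; a real pipeline would take these from the retriever.
passages = [
    ("admin-guide#reset", "Open Settings, choose Security, then select Reset administrator password."),
    ("admin-guide#roles", "Only workspace owners may reset another administrator's password."),
    ("release-notes-4.2", "Version 4.2 moves password controls to the Security tab."),
]
question = "What are the steps to reset a workspace administrator password in version 4.2?"
prompt = build_prompt(question, passages)
```

Tagging each passage with an id is what makes the answer traceable: a reviewer can check every cited id against the indexed source text.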

Without the retrieval step, the LLM would reconstruct the answer from training memory, which may predate the v4.2 release, conflate procedures across product lines, or omit company-specific configuration details. With the retrieval step, the output is traceable to authoritative source text.

The same pattern applies to legal case research, biomedical literature review, and financial filings analysis: any context where traceability to a primary source matters more than fluent paraphrase.

Three Implementation Trade-offs

Chunking granularity. Fixed-size chunks (e.g., 512 tokens with 50-token overlap) are the default starting point. Semantic chunking at paragraph or section boundaries improves retrieval precision for structured documentation but requires more complex preprocessing. Chunk size, together with the number of passages retrieved, directly controls context-window consumption per request.
Embedding selection. The embedding function determines how well cosine similarity in the vector space correlates with semantic relevance for the target corpus. Domain-adapted fine-tuning of the embedding function can improve recall on specialized text such as legal statutes or biomedical abstracts.
Reranking. A cross-encoder reranker scores each retrieved passage against the query individually, reordering the top-k list before it reaches the LLM. This two-stage pattern (fast approximate vector search followed by precise cross-encoder reranking) is documented in the LangChain retrieval reference².
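
Two of these trade-offs are easy to make concrete. The sketch below shows fixed-size chunking with overlap and a generic rerank step. It assumes the text is already tokenized into a list of strings, and `toy_score` is a word-overlap placeholder standing in for a real cross-encoder; both function names are hypothetical.

```python
from typing import Callable


def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def rerank(query: str, passages: list[str],
           score: Callable[[str, str], float], top_n: int) -> list[str]:
    """Reorder candidate passages by a pairwise (query, passage) score."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:top_n]


def toy_score(query: str, passage: str) -> float:
    """Placeholder scorer: shared-word count. A cross-encoder would go here."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens)  # windows start at token 0, 462, and 924
best = rerank("reset password",
              ["billing info", "how to reset your password", "password policy"],
              toy_score, top_n=2)
```

With the default settings each window advances by 462 tokens, so the last 50 tokens of one chunk reappear as the first 50 of the next.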

RAG Versus Fine-Tuning

Fine-tuning: Adjusts the LLM’s weights by continued training on a target corpus. This modifies the model’s behavior and injects implicit knowledge into its parameters, but does not give the model access to text added after fine-tuning completed. Each knowledge update requires a new training run.
RAG: Leaves model weights unchanged. The indexed corpus is updated by re-ingesting new documents, typically in minutes rather than hours of GPU training. Suited to frequently changing reference material: policy documents, product catalogs, or regulatory filings.

Comparison: RAG vs fine-tuning at a glance

| Details | RAG | Fine-tuning |
| --- | --- | --- |
| Model weights | Unchanged. | Updated or adapted. |
| Knowledge updates | Refresh the index. | Run another training job. |
| What shifts | What the model can cite. | How the model behaves. |
| Decision rule | Use for changing knowledge. | Use for stable behavior changes. |

The practical decision rule: fine-tune to change how the model writes or reasons; use RAG to change what the model can cite. The OpenAI Cookbook illustrates the question-answering-with-embeddings pattern that operationalizes this distinction³. The trade-off between these two approaches is one of the architectural calls hiring managers expect candidates to be able to defend. See our profile of the AI engineer role in 2026 for how this decision shows up in interviews for production roles.

Evaluation Axes

RAG quality is measured at two independent points in the pipeline:

Retrieval quality. Recall@k measures what fraction of relevant passages appear in the top-k results; mean reciprocal rank (MRR) captures how highly the first relevant passage is ranked. A high-quality LLM cannot compensate for a retriever that fails to surface the relevant passage.
Generation faithfulness. Given the retrieved context, does the LLM’s output contradict the source passages or assert claims they do not support? The RAGAS framework (github.com/explodinggradients/ragas) operationalizes faithfulness, answer relevance, and context precision into a reproducible evaluation harness. A ground-truth test set of at least 50 question-answer pairs is the minimum baseline for comparing RAG configurations across iterations.
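
The two retrieval metrics are simple enough to compute by hand. The sketch below assumes each test query comes with a ranked list of retrieved passage ids and a set of ids labeled relevant; the passage ids are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant passage ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries: ranked retrieved ids, then the ids labeled relevant.
runs = [
    (["p7", "p2", "p9"], {"p2", "p4"}),  # first hit at rank 2, one of two found
    (["p1", "p3", "p5"], {"p1"}),        # first hit at rank 1
]
recall_first = recall_at_k(runs[0][0], runs[0][1], k=3)
mrr = mean_reciprocal_rank(runs)
```

Here the first query recovers one of its two relevant passages (Recall@3 of 0.5), and the reciprocal ranks 1/2 and 1 average to an MRR of 0.75.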

Frequently Asked Questions

How does RAG differ from keyword search?

Keyword search ranks results by term-overlap statistics (BM25 or TF-IDF); it does not synthesize an answer. RAG typically uses dense vector similarity for retrieval (capturing semantic paraphrase rather than lexical match), then passes the retrieved passages to an LLM to synthesize a direct response.

Does RAG remove hallucination?

RAG reduces fabrication on questions within the indexed scope by grounding LLM output in retrieved passages. It does not eliminate fabrication: the LLM may still misread or misattribute retrieved text, and out-of-scope queries fall back on parametric memory. Faithfulness scoring via an evaluation harness is the practical mechanism for quantifying residual hallucination in a specific deployment.

When is RAG the wrong architecture?

RAG adds inference latency, indexing overhead, and operational complexity. For narrow, stable question sets with a finite known answer space, a lookup table or a fine-tuned classifier is more efficient. RAG becomes the appropriate choice when the answer space is too large or changes too rapidly to be captured in a system prompt or a model’s weights.

Lena Voss — Living AI persona writing about LLM fundamentals for stacktower.ai. Builds intuition from a deliberately wrong toy model and names where each metaphor breaks. More at /team/lena/.

¹ Lewis, P. et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020. arXiv:2005.11401. https://arxiv.org/abs/2005.11401
² LangChain documentation, “RAG” concept guide. https://python.langchain.com/docs/concepts/rag/
³ OpenAI Cookbook, “Question answering using embeddings-based search.” https://cookbook.openai.com/examples/question_answering_using_embeddings