Tech World by Yashrajsinh

RAG Pipelines Deep Dive

Yashrajsinh
13 min read · Intermediate

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding large language model responses in factual, up-to-date information without retraining the model. Instead of relying solely on the knowledge baked into model weights during pretraining, a RAG pipeline retrieves relevant documents from an external knowledge base at query time and injects them into the prompt context. The model then generates an answer conditioned on both the query and the retrieved evidence.

RAG solves three critical problems that plague standalone LLM deployments: hallucination (the model invents facts), staleness (the model's knowledge has a training cutoff), and lack of domain specificity (the model cannot access proprietary data). By decoupling knowledge storage from the reasoning engine, RAG lets you update your knowledge base continuously without touching the model, control exactly which information the model can reference, and audit the sources behind every generated answer.

This guide walks through the complete RAG pipeline architecture from document ingestion to answer generation, covering chunking strategies, embedding models, vector databases, retrieval algorithms, reranking, prompt construction, and evaluation. If you are new to language models, start with LLM Engineering Fundamentals before diving into RAG. For building RAG systems in Java, the LangChain for Java Guide provides framework-level implementation details. For understanding how RAG fits into broader agent architectures, see the AI Agents Architecture Guide.

What You Will Learn

By the end of this guide you will understand:

  • The end-to-end architecture of a RAG pipeline and how each stage contributes to answer quality
  • Document ingestion strategies including parsing, cleaning, and metadata extraction
  • Chunking approaches and how chunk size affects retrieval precision and recall
  • How embedding models convert text into dense vectors for semantic similarity search
  • Vector database selection criteria and indexing strategies for production workloads
  • Retrieval algorithms including dense retrieval, sparse retrieval, and hybrid approaches
  • Reranking techniques that improve precision after initial retrieval
  • Prompt construction patterns that maximize the model's use of retrieved context
  • Evaluation frameworks for measuring RAG pipeline quality end to end

Prerequisites

  • Understanding of LLM fundamentals including tokens, context windows, and prompting techniques
  • Basic familiarity with Python and REST APIs
  • Understanding of how vector similarity works at a conceptual level (cosine similarity, dot product)
  • Experience with at least one database system (relational or document-oriented)

Concept Overview

A RAG pipeline has two main phases: the offline indexing phase and the online query phase. During indexing, you ingest documents, split them into chunks, generate embedding vectors for each chunk, and store those vectors in a vector database alongside the original text and metadata. During the query phase, you embed the user's question, search the vector database for the most similar chunks, optionally rerank the results, construct a prompt that includes the retrieved context, and send it to the language model for answer generation.

The quality of a RAG system depends on every stage working well together. Poor chunking produces fragments that lack sufficient context for the model to reason about. Weak embeddings fail to capture semantic relationships between the query and relevant documents. A vector database with the wrong index type introduces latency or reduces recall. Skipping reranking means the model receives marginally relevant documents that dilute the signal. Each stage has engineering tradeoffs that this guide explores in depth.

The fundamental insight behind RAG is that retrieval and generation are complementary capabilities. Retrieval excels at finding specific facts in large corpora but cannot synthesize or reason. Generation excels at synthesis and reasoning but cannot reliably recall specific facts. RAG combines both strengths by letting retrieval handle the factual grounding while generation handles the reasoning and natural language output.

Step-by-Step Explanation

This section walks through the core implementation steps in sequence. Each step builds on the previous one, giving you a practical path from initial setup to a working solution that you can adapt for your own projects.

Step 1: Document Ingestion and Parsing

The first stage of any RAG pipeline is getting your documents into a format suitable for chunking and embedding. Raw documents come in many formats: PDF, HTML, Markdown, Word documents, Confluence pages, Slack messages, code repositories, and database records. Each format requires a different parser to extract clean text while preserving meaningful structure.

Document parsing is not trivial. PDFs may contain tables, images with embedded text, multi-column layouts, headers and footers that repeat on every page, and formatting artifacts. HTML pages contain navigation elements, advertisements, and boilerplate that should be stripped. The goal is to extract the substantive content while preserving structural cues like headings, lists, and paragraph boundaries that help the chunker make intelligent split decisions.

A robust ingestion pipeline handles format detection, text extraction, metadata extraction (title, author, date, source URL), cleaning (removing boilerplate, normalizing whitespace), and deduplication. Metadata is critical because it enables filtered retrieval later. For example, you might want to retrieve only documents published after a certain date, or only documents from a specific department.

from dataclasses import dataclass
from typing import Optional
import hashlib
 
@dataclass
class Document:
    content: str
    metadata: dict
    source_id: str
    content_hash: str
 
    @classmethod
    def from_raw(cls, text: str, metadata: dict, source: str) -> "Document":
        cleaned = clean_text(text)
        content_hash = hashlib.sha256(cleaned.encode()).hexdigest()
        return cls(
            content=cleaned,
            metadata=metadata,
            source_id=source,
            content_hash=content_hash
        )
 
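def clean_text(text: str) -> str:
    """Normalize whitespace and strip control characters, keeping paragraph breaks."""
    import re
    # Strip control characters (tab, newline, and carriage return are kept)
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    text = re.sub(r'[ \t]+', ' ', text)      # collapse runs of spaces and tabs
    text = re.sub(r' ?\n ?', '\n', text)     # trim spaces around line breaks
    text = re.sub(r'\n{3,}', '\n\n', text)   # keep at most one blank line
    # Newlines are preserved so the chunker can still split on "\n\n" and "\n"
    return text.strip()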
 
def parse_pdf(file_path: str) -> list[Document]:
    """Extract text and metadata from a PDF file."""
    import fitz  # PyMuPDF
    doc = fitz.open(file_path)
    documents = []
    # Metadata may be missing or hold None values for some files
    meta = doc.metadata or {}
    metadata = {
        "title": meta.get("title") or "",
        "author": meta.get("author") or "",
        "source_type": "pdf",
        "file_path": file_path,
        "page_count": len(doc)
    }
    for page_num, page in enumerate(doc):
        text = page.get_text("text")
        if text.strip():
            page_meta = {**metadata, "page_number": page_num + 1}
            documents.append(
                Document.from_raw(text, page_meta, f"{file_path}:page_{page_num+1}")
            )
    return documents

The content hash enables deduplication across ingestion runs. When you re-ingest a document that has not changed, you can skip the embedding step entirely, saving both time and cost. This is especially important for large knowledge bases that are re-indexed on a schedule.
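
The skip logic described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `seen_hashes` set that you persist between ingestion runs; the helper names `content_hash_of` and `select_for_embedding` are illustrative and not part of any library.

```python
import hashlib


def content_hash_of(text: str) -> str:
    """SHA-256 of the cleaned text, computed the same way as Document.content_hash."""
    return hashlib.sha256(text.encode()).hexdigest()


def select_for_embedding(texts: list[str], seen_hashes: set[str]) -> list[str]:
    """Return only texts whose hash has not been seen, recording new hashes as we go."""
    fresh = []
    for text in texts:
        digest = content_hash_of(text)
        if digest in seen_hashes:
            continue  # unchanged or duplicate content: skip re-embedding
        seen_hashes.add(digest)
        fresh.append(text)
    return fresh
```

In a real pipeline the set would live in a database or key-value store rather than in memory, so that it survives between scheduled re-index runs.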

Step 2: Chunking Strategies

Chunking is the process of splitting documents into smaller pieces that fit within the embedding model's context window and contain enough information to be useful when retrieved. The chunk size is one of the most impactful parameters in a RAG pipeline. Too small and chunks lack context, making it hard for the model to understand them in isolation. Too large and chunks contain too much irrelevant information, diluting the signal and wasting context window space.

There are several chunking strategies, each with different tradeoffs. Fixed-size chunking splits text into chunks of a predetermined token count with optional overlap. Recursive character splitting tries to split on natural boundaries (paragraphs, sentences, words) while respecting a maximum size. Semantic chunking uses embedding similarity to detect topic boundaries and splits where the topic changes. Document-structure-aware chunking uses headings, sections, and other structural elements to create chunks that align with the document's logical organization.
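
As a point of comparison, fixed-size chunking with overlap can be sketched in a few lines. The sketch assumes the text has already been tokenized into a list; whitespace-separated words stand in for real tokenizer output here, and the function name `fixed_size_chunks` is illustrative.

```python
def fixed_size_chunks(
    tokens: list[str],
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


# Assumption: whitespace splitting is a stand-in for a model tokenizer.
words = "one two three four five six seven eight nine ten".split()
windows = fixed_size_chunks(words, chunk_size=4, overlap=1)
```

Each window shares its first `overlap` tokens with the end of the previous one, which is the property the other strategies try to achieve at more natural boundaries.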

For most production systems, recursive splitting with overlap provides the best balance of simplicity and quality. A chunk size of 512 to 1024 tokens with 50 to 100 tokens of overlap works well for general-purpose retrieval. The overlap ensures that information at chunk boundaries is not lost. Semantic chunking produces higher-quality chunks but adds complexity and latency to the indexing pipeline.

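from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    text: str
    metadata: dict
    chunk_index: int
    start_char: int
    end_char: int

def recursive_split(
    text: str,
    max_chunk_size: int = 800,
    overlap: int = 100,
    separators: Optional[list[str]] = None
) -> list[str]:
    """Split text recursively on natural boundaries.

    Sizes are measured in characters, not tokens; swap len() for a
    tokenizer-based count to enforce a token budget. Overlap is prepended
    after splitting, so a chunk can exceed max_chunk_size by up to `overlap`.
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " "]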
 
    if len(text) <= max_chunk_size:
        return [text]
 
    chunks = []
    separator = separators[0]
    remaining_separators = separators[1:]
 
    parts = text.split(separator)
    current_chunk = ""
 
    for part in parts:
        candidate = current_chunk + separator + part if current_chunk else part
        if len(candidate) <= max_chunk_size:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk)
            if len(part) > max_chunk_size and remaining_separators:
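                # Pass overlap=0 so overlap is added once, by the outermost call,
                # instead of being applied again at every recursion level
                sub_chunks = recursive_split(part, max_chunk_size, 0, remaining_separators)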
                chunks.extend(sub_chunks)
                current_chunk = ""
            else:
                current_chunk = part
 
    if current_chunk:
        chunks.append(current_chunk)
 
    # Add overlap between consecutive chunks
    if overlap > 0 and len(chunks) > 1:
        overlapped = [chunks[0]]
        for i in range(1, len(chunks)):
            prev_tail = chunks[i - 1][-overlap:]
            overlapped.append(prev_tail + chunks[i])
        chunks = overlapped
 
    return chunks
 
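def chunk_document(doc: Document, max_size: int = 800, overlap: int = 100) -> list[Chunk]:
    """Chunk a document and preserve metadata lineage."""
    # Split without overlap so every piece is an exact substring of doc.content,
    # then widen each span backwards to add the overlap. This keeps start_char
    # and end_char accurate, which prepending overlap text does not guarantee.
    raw_chunks = recursive_split(doc.content, max_size, overlap=0)
    chunks = []
    offset = 0
    for i, text in enumerate(raw_chunks):
        start = doc.content.find(text, offset)
        if start < 0:
            start = offset  # defensive fallback; pieces should always be found
        end = start + len(text)
        overlap_start = max(start - overlap, 0)
        chunks.append(Chunk(
            text=doc.content[overlap_start:end],
            metadata={**doc.metadata, "source_id": doc.source_id},
            chunk_index=i,
            start_char=overlap_start,
            end_char=end
        ))
        offset = end
    return chunks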

Step 3: Embedding Models and Vector Generation

Once documents are chunked, each chunk needs to be converted into a dense vector representation that captures its semantic meaning. Embedding models map text to fixed-dimensional vectors such that semantically similar texts produce vectors that are close together in the embedding space (measured by cosine similarity or dot product).
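
To make the two similarity measures concrete, here is a small pure-Python sketch. The function names are illustrative; production systems would normally rely on a vectorized library or on the vector database itself.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the two vectors after length normalization."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot(a, b) / (norm_a * norm_b)
```

When embeddings are normalized to unit length, the two measures give the same ranking, which is why many vector stores let you pick either.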

The choice of embedding model significantly impacts retrieval quality. Key factors include the model's training data (does it cover your domain?), its dimensionality (higher dimensions capture more nuance but increase storage and search costs), its maximum input length (chunks longer than this are truncated), and its performance on retrieval benchmarks like MTEB.

For production systems, you typically choose between hosted embedding APIs (OpenAI, Cohere, Voyage AI) and self-hosted models (sentence-transformers, E5, BGE). Hosted APIs are simpler to operate but introduce latency, cost per token, and a dependency on an external service. Self-hosted models give you full control over latency and cost but require GPU infrastructure for reasonable throughput.

A critical but often overlooked detail is that many embedding models are trained asymmetrically: they expect queries and documents to be embedded differently. Some models use instruction prefixes like "query:" and "passage:" to signal the embedding mode. Using the wrong prefix or no prefix at all can significantly degrade retrieval quality.

from typing import Protocol
import numpy as np
 
class EmbeddingModel(Protocol):
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...
 
class OpenAIEmbeddings:
    def __init__(self, model: str = "text-embedding-3-small", dimensions: int = 1536):
        from openai import OpenAI
        self.client = OpenAI()
        self.model = model
        self.dimensions = dimensions
 
    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        response = self.client.embeddings.create(
            input=texts,
            model=self.model,
            dimensions=self.dimensions
        )
        return [item.embedding for item in response.data]
 
    def embed_query(self, text: str) -> list[float]:
        response = self.client.embeddings.create(
            input=[text],
            model=self.model,
            dimensions=self.dimensions
        )
        return response.data[0].embedding
 
def batch_embed(
    chunks: list[Chunk],
    model: EmbeddingModel,
    batch_size: int = 100
) -> list[tuple[Chunk, list[float]]]:
    """Embed chunks in batches to stay within per-request input limits."""
    results = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk.text for chunk in batch]
        embeddings = model.embed_documents(texts)
        results.extend(zip(batch, embeddings))
    return results
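
The asymmetric prefixes mentioned earlier in this step can be handled with a thin wrapper around any object that exposes `embed_documents` and `embed_query`. This is a sketch under assumptions: the default prefix strings follow the `query: ` / `passage: ` convention documented for the E5 family, other models use different instructions or none, and `EchoModel` is a hypothetical stand-in used only to show what text reaches the underlying model.

```python
class PrefixedEmbeddings:
    """Add role-specific prefixes before delegating to a base embedding model."""

    def __init__(self, base, query_prefix: str = "query: ", passage_prefix: str = "passage: "):
        self.base = base
        self.query_prefix = query_prefix
        self.passage_prefix = passage_prefix

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return self.base.embed_documents([self.passage_prefix + t for t in texts])

    def embed_query(self, text: str) -> list[float]:
        return self.base.embed_query(self.query_prefix + text)


class EchoModel:
    """Hypothetical stand-in: records inputs and returns the text length as a 1-d vector."""

    def __init__(self):
        self.seen: list[str] = []

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        self.seen.extend(texts)
        return [[float(len(t))] for t in texts]

    def embed_query(self, text: str) -> list[float]:
        self.seen.append(text)
        return [float(len(text))]
```

Keeping the prefixes in one wrapper means indexing and query code cannot drift apart. Check the model card of whichever embedding model you use for the exact strings it expects.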

Step 4: Vector Database Selection and Indexing

The vector database stores your chunk embeddings and supports fast approximate nearest neighbor (ANN) search at query time. The choice of vector database depends on your scale requirements, latency budget, filtering needs, and operational preferences.

Popular options include Pinecone (fully managed, serverless scaling), Weaviate (open source, hybrid search built in), Qdrant (open source, rich filtering), Chroma (lightweight, good for prototyping), pgvector (PostgreSQL extension, familiar operational model), and Milvus (open source, designed for billion-scale). Each has different strengths around indexing speed, query latency, metadata filtering, multi-tenancy, and cost.

For most production RAG systems, you need metadata filtering alongside vector search. For example, retrieving only chunks from documents published in the last year, or only chunks from a specific department's knowledge base. Not all vector databases handle filtered search equally well. Some apply filters before the ANN search (pre-filtering), which can slow down graph-based indexes or reduce their recall when the filter is very selective. Others apply filters after retrieval (post-filtering), which wastes computation on irrelevant results and can return fewer than k matches when most of the nearest neighbors fail the filter. The best systems support both modes and let you choose based on your filter selectivity.
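
The difference between the two filtering modes is easiest to see with exact (brute-force) search over a tiny in-memory collection. This is a sketch only: real vector databases use ANN indexes, and the item layout (`id`, `vector`, `metadata`) and the `overfetch` parameter are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, items, k):
    """Exact top-k by dot product; a real system would use an ANN index here."""
    ranked = sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)
    return ranked[:k]


def pre_filter_search(query, items, predicate, k):
    """Apply the metadata filter first, then rank only the survivors."""
    candidates = [it for it in items if predicate(it["metadata"])]
    return top_k(query, candidates, k)


def post_filter_search(query, items, predicate, k, overfetch=1):
    """Rank everything, then drop results that fail the filter."""
    hits = top_k(query, items, k * overfetch)
    return [it for it in hits if predicate(it["metadata"])][:k]


items = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"year": 2020}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"year": 2021}},
    {"id": "c", "vector": [0.1, 0.9], "metadata": {"year": 2024}},
    {"id": "d", "vector": [0.0, 1.0], "metadata": {"year": 2024}},
]
recent = lambda meta: meta["year"] >= 2024
```

With the query `[1.0, 0.0]` and `k=2`, pre-filtering returns the two recent items, while post-filtering without over-fetching returns nothing because both nearest neighbors fail the filter. Raising `overfetch` recovers them at the cost of ranking more candidates.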

Index type matters for the latency-recall tradeoff. HNSW (Hierarchical Navigable Small World) graphs provide excellent recall with low latency but consume significant memory. IVF (Inverted File Index) variants use less memory but require careful tuning of the number of clusters and probes. For datasets under a few million vectors, HNSW is almost always the right choice. For billion-scale datasets, you may need quantization (PQ, SQ) combined with IVF to fit within memory budgets.
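
A back-of-the-envelope memory estimate helps when weighing these options. The sketch below assumes float32 vectors, uses a commonly quoted rule of thumb for flat HNSW indexes (raw vector plus roughly `M * 2 * 4` bytes of graph links per vector), and treats a product-quantized vector as a fixed number of code bytes. It ignores codebooks, metadata, and engine-specific overhead, so treat the numbers as rough orders of magnitude rather than vendor figures.

```python
def flat_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_value


def hnsw_bytes_estimate(num_vectors: int, dims: int, m: int = 32) -> int:
    """Rule-of-thumb estimate: float32 vector plus about m * 2 * 4 bytes of links."""
    return num_vectors * (dims * 4 + m * 2 * 4)


def pq_bytes_estimate(num_vectors: int, code_bytes: int = 64) -> int:
    """Product-quantized codes only; codebooks and IVF lists are ignored."""
    return num_vectors * code_bytes


def to_gib(num_bytes: int) -> float:
    return num_bytes / (1024 ** 3)


n, d = 1_000_000, 1536
estimates = {
    "flat": flat_vector_bytes(n, d),
    "hnsw": hnsw_bytes_estimate(n, d),
    "pq64": pq_bytes_estimate(n),
}
```

For one million 1536-dimensional vectors this gives about 6.1 GB for the raw vectors, about 6.4 GB with HNSW links under these assumptions, and 64 MB for 64-byte PQ codes, which illustrates why quantization becomes attractive at billion scale.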

Step 5: Retrieval Algorithms

The retrieval stage takes the user's query, embeds it, and searches the vector database for the most relevant chunks. The simplest approach is pure dense retrieval: embed the query and find the k nearest neighbors by cosine similarity. This works well when the query and relevant documents share semantic meaning even if they use different words.

However, dense retrieval has weaknesses. It can miss documents that match on specific keywords or identifiers (like error codes, product names, or version numbers) because embedding models tend to map such near-identical strings to nearby points in vector space, so the exact match does not stand out from its neighbors. Sparse retrieval methods like BM25 excel at exact keyword matching but miss semantic paraphrases.

Hybrid retrieval combines both approaches. You run a dense vector search and a sparse keyword search in parallel, then merge the results using reciprocal rank fusion (RRF) or a learned score combination. Hybrid retrieval consistently outperforms either approach alone across diverse query types.

from dataclasses import dataclass
 
@dataclass
class RetrievalResult:
    chunk: Chunk
    score: float
    retrieval_method: str
 
def hybrid_retrieve(
    query: str,
    embedding_model: EmbeddingModel,
    vector_store,
    sparse_index,
    k: int = 20,
    dense_weight: float = 0.7,
    sparse_weight: float = 0.3
) -> list[RetrievalResult]:
    """Combine dense and sparse retrieval with reciprocal rank fusion."""
    # Dense retrieval
    query_embedding = embedding_model.embed_query(query)
    dense_results = vector_store.search(query_embedding, top_k=k)
 
    # Sparse retrieval (BM25)
    sparse_results = sparse_index.search(query, top_k=k)
 
    # Reciprocal Rank Fusion
    rrf_scores: dict[str, float] = {}
    chunk_map: dict[str, Chunk] = {}
    rrf_k = 60  # Standard RRF constant
 
    def chunk_key(chunk: Chunk) -> str:
        # The delimiter avoids collisions such as "doc1" + "12" vs "doc11" + "2"
        return f'{chunk.metadata["source_id"]}#{chunk.chunk_index}'

    for rank, result in enumerate(dense_results):
        doc_id = chunk_key(result.chunk)
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + dense_weight / (rrf_k + rank + 1)
        chunk_map[doc_id] = result.chunk

    for rank, result in enumerate(sparse_results):
        doc_id = chunk_key(result.chunk)
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + sparse_weight / (rrf_k + rank + 1)
        chunk_map[doc_id] = result.chunk
 
    # Sort by combined score
    sorted_ids = sorted(rrf_scores.keys(), key=lambda x: rrf_scores[x], reverse=True)
    return [
        RetrievalResult(
            chunk=chunk_map[doc_id],
            score=rrf_scores[doc_id],
            retrieval_method="hybrid"
        )
        for doc_id in sorted_ids[:k]
    ]
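
The `sparse_index` passed to `hybrid_retrieve` is not defined in this guide. As a rough idea of what sits behind it, here is a minimal in-memory BM25 sketch. It is an illustration under assumptions: tokenization is plain lowercased whitespace splitting, `k1` and `b` use common default values, and `search` returns `(doc_index, score)` tuples, so a small adapter would be needed to produce the `.chunk` results that `hybrid_retrieve` expects.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class BM25Index:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / max(len(docs), 1)
        self.df = Counter()  # number of documents containing each term
        for tokens in self.doc_tokens:
            self.df.update(set(tokens))

    def _idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        return math.log(1 + (len(self.docs) - n + 0.5) / (n + 0.5))

    def score(self, query: str, doc_index: int) -> float:
        freqs = self.doc_freqs[doc_index]
        doc_len = len(self.doc_tokens[doc_index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Length normalization: longer-than-average documents are penalized
            norm = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += self._idf(term) * tf * (self.k1 + 1) / norm
        return total

    def search(self, query: str, top_k: int = 5) -> list[tuple[int, float]]:
        scored = [(i, self.score(query, i)) for i in range(len(self.docs))]
        scored = [(i, s) for i, s in scored if s > 0]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

An exact identifier such as an error code matches only the documents that contain it, which is the behavior dense retrieval tends to lack.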

Step 6: Reranking for Precision

Initial retrieval casts a wide net to maximize recall, but the top results may not all be equally relevant. Reranking applies a more expensive but more accurate relevance model to the retrieved candidates to reorder them by true relevance to the query. This is a classic two-stage retrieval pattern borrowed from information retrieval research.

Cross-encoder rerankers take the query and a candidate document as a pair and output a relevance score. Unlike bi-encoders (embedding models) that encode query and document independently, cross-encoders can attend to fine-grained interactions between query terms and document terms. This makes them significantly more accurate but also much slower, which is why they are applied only to the top-k candidates from the first stage rather than the entire corpus.

Popular reranking models include Cohere Rerank, cross-encoder models from sentence-transformers (like ms-marco-MiniLM), and ColBERT-style late interaction models that offer a middle ground between speed and accuracy. For production systems, reranking typically improves precision at the top positions by ten to thirty percent compared to embedding-only retrieval.
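
The two-stage pattern can be sketched with a pluggable scoring function. In the sketch below, `token_overlap_score` is a toy stand-in for a cross-encoder and is an assumption made so the example runs without model downloads; in practice `score_fn` would wrap a real reranker such as a sentence-transformers cross-encoder or a hosted rerank API.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_n highest-scoring ones."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep first-stage order
    return scored[:top_n]


def token_overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0
```

Because the scorer sees the query and the document together, it can be swapped for a stronger model without touching the first-stage retrieval code.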

After reranking, you select the top-n results (typically three to five) to include in the prompt context. Including too many results wastes context window space and can confuse the model. Including too few risks missing relevant information. The optimal number depends on your chunk size, context window budget, and the complexity of the question.
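
Once the top-n chunks are selected, they have to be assembled into a prompt. The following is one possible shape, not a prescribed template: the instruction wording, the numbered citation format, and the character-based `max_context_chars` budget are all assumptions for illustration, and a production system would usually budget in tokens.

```python
def build_prompt(question: str, chunks: list[dict], max_context_chars: int = 4000) -> str:
    """Assemble a grounded prompt with numbered, citable context blocks."""
    blocks = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        block = f"[{i}] (source: {chunk['source_id']})\n{chunk['text']}"
        if used + len(block) > max_context_chars:
            break  # stop before exceeding the context budget
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the blocks gives the model something concrete to cite, which makes it easier to audit which chunk supported which claim.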

Real-World Use Cases

RAG pipelines power a wide range of production applications across industries. Customer support systems use RAG to ground chatbot responses in product documentation, knowledge base articles, and past ticket resolutions. The retrieval step finds relevant support articles while the generation step synthesizes a natural language answer tailored to the specific customer question.

Enterprise search platforms use RAG to provide conversational answers over internal documents, Confluence wikis, Slack archives, and code repositories. Unlike traditional keyword search that returns a list of links, RAG search returns a synthesized answer with citations pointing back to the source documents.

Legal research tools use RAG to help lawyers find relevant case law, statutes, and regulatory guidance. The domain-specific nature of legal language makes embedding model selection critical. Models fine-tuned on legal corpora significantly outperform general-purpose embeddings for this use case.

Healthcare applications use RAG to help clinicians find relevant clinical guidelines, drug interactions, and research papers. The stakes are high in this domain, making citation accuracy and hallucination detection especially important. Production healthcare RAG systems typically include a verification step that checks whether the generated answer is fully supported by the retrieved evidence.

Code generation assistants use RAG to retrieve relevant code examples, API documentation, and internal library usage patterns. The retrieval step finds code snippets and documentation relevant to the developer's current task, while the generation step produces code that follows the patterns and conventions found in the retrieved context.

Best Practices

Design your chunking strategy around how users will query the system, not around document structure alone. If users ask specific factual questions, smaller chunks with high precision work better. If users ask broad analytical questions, larger chunks that preserve more context work better. Many production systems use multiple chunk sizes and retrieve from both.

Always include metadata with your chunks and use metadata filtering to narrow the search space before vector similarity. Filtering by document type, date range, department, or access level reduces noise and improves relevance. It also enables multi-tenant RAG systems where different users see different subsets of the knowledge base.

Monitor retrieval quality continuously in production. Track metrics like retrieval precision (what fraction of retrieved chunks are relevant), recall (what fraction of relevant chunks are retrieved), and answer faithfulness (does the generated answer stay grounded in the retrieved context). These metrics degrade over time as the knowledge base grows and query patterns shift.
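
Retrieval precision and recall at k are simple to compute once you have labeled relevant chunk IDs for a set of test queries. A minimal sketch, with an illustrative function name; answer faithfulness needs a separate judgment step and is not covered here.

```python
def precision_recall_at_k(
    retrieved: list[str],
    relevant: set[str],
    k: int,
) -> tuple[float, float]:
    """Precision and recall over the top-k retrieved chunk IDs."""
    top = retrieved[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these values over a fixed query set and tracking them across releases makes it visible when quality drifts.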

Version your embeddings and treat every model change as a migration. Vectors produced by different embedding models are not comparable, so when you upgrade the model, existing vectors cannot be searched with new query vectors. Plan for full re-indexing when changing models, and consider running old and new indexes in parallel during migration, with A/B testing to validate the improvement.

Use caching aggressively at every stage. Cache embedding computations for repeated queries, cache retrieval results for popular questions, and cache generated answers for identical query-context pairs. Caching reduces latency, cost, and load on your vector database.
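
As one example of the first kind of cache, a query-embedding cache can wrap whatever embedding function you already have. This is a sketch with assumptions: the cache is an in-memory dict with simple first-in-first-out eviction, keys are whitespace-normalized query strings, and `embed_fn` is any callable that maps a string to a vector.

```python
from typing import Callable


class CachedQueryEmbedder:
    """Cache query embeddings in memory, keyed by whitespace-normalized text."""

    def __init__(self, embed_fn: Callable[[str], list[float]], max_entries: int = 10_000):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self._max_entries = max_entries
        self.hits = 0
        self.misses = 0

    def embed_query(self, text: str) -> list[float]:
        key = " ".join(text.split())  # collapse whitespace so trivial variants share a key
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        vector = self._embed_fn(key)
        if len(self._cache) >= self._max_entries:
            # Evict the oldest inserted entry (dicts preserve insertion order)
            self._cache.pop(next(iter(self._cache)))
        self._cache[key] = vector
        return vector
```

The hit and miss counters make it easy to check whether the cache is earning its keep. A shared store such as Redis would be the usual choice once more than one process serves queries.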

Implement graceful degradation for when retrieval returns no relevant results. The system should recognize low-confidence retrievals (all similarity scores below a threshold) and either ask the user to rephrase, acknowledge uncertainty, or fall back to the model's parametric knowledge with an appropriate disclaimer.
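
A threshold check of this kind can be sketched as follows. The `min_score` value of 0.35 is a placeholder, not a recommendation: similarity score ranges differ between embedding models, so the threshold has to be calibrated on your own data. The response shape and the fallback message are also illustrative.

```python
def split_by_confidence(
    results: list[tuple[str, float]],
    min_score: float = 0.35,
) -> tuple[list[tuple[str, float]], bool]:
    """Keep results at or above min_score and flag when none survive."""
    kept = [(chunk_id, score) for chunk_id, score in results if score >= min_score]
    return kept, len(kept) == 0


def answer_or_fallback(results: list[tuple[str, float]], min_score: float = 0.35) -> dict:
    """Decide whether to answer from context or take a fallback path."""
    kept, low_confidence = split_by_confidence(results, min_score)
    if low_confidence:
        return {
            "mode": "fallback",
            "message": "I could not find relevant sources. Could you rephrase the question?",
            "context": [],
        }
    return {"mode": "grounded", "message": None, "context": kept}
```

Making the decision explicit in code, rather than leaving it to the prompt, keeps the fallback behavior testable.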

Common Mistakes

The most common mistake is choosing chunk sizes without experimentation. Engineers often pick a default like 512 tokens and never revisit it. Different document types and query patterns benefit from different chunk sizes. Always run retrieval quality evaluations across a range of chunk sizes before committing to one.

Another frequent mistake is ignoring the asymmetric nature of embedding models. Many models expect different prefixes or instructions for queries versus documents. Embedding both with the same prefix (or no prefix) can reduce retrieval quality by twenty percent or more without any obvious error signal.

Skipping reranking is a common cost-optimization mistake that hurts quality more than expected. The marginal cost of reranking twenty candidates is small compared to the cost of generating a poor answer from irrelevant context. Always benchmark with and without reranking before deciding to skip it.

Over-stuffing the context window with retrieved chunks is another anti-pattern. Including ten or fifteen chunks when three would suffice forces the model to sift through irrelevant information, increases latency, increases cost, and can actually reduce answer quality. The model's attention mechanism works best when the context is focused and relevant.

Failing to handle document updates is a subtle but critical mistake. When a source document is updated, the old chunks and embeddings become stale. Without a mechanism to detect changes, re-chunk, re-embed, and replace old vectors, your RAG system gradually drifts from reality. Implement content hashing and incremental re-indexing from day one.
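
Incremental re-indexing boils down to comparing what is indexed with what currently exists. The sketch below assumes you can produce two maps from `source_id` to content hash, one for the index and one for the current sources; the function name and return shape are illustrative.

```python
def plan_reindex(indexed: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare source_id -> content_hash maps and list the work to do."""
    to_add = sorted(s for s in current if s not in indexed)
    to_update = sorted(s for s in current if s in indexed and indexed[s] != current[s])
    to_delete = sorted(s for s in indexed if s not in current)
    return {"add": to_add, "update": to_update, "delete": to_delete}
```

Sources in `update` need their old vectors removed before the re-chunked, re-embedded versions are written, and sources in `delete` need their vectors removed outright; otherwise stale chunks keep surfacing in retrieval.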

Not evaluating end-to-end is perhaps the most damaging mistake. Engineers optimize individual components (better embeddings, faster vector search, smarter chunking) without measuring whether those improvements translate to better final answers. Always evaluate the complete pipeline from query to generated answer, not just intermediate retrieval metrics.

Summary

RAG pipelines are the standard architecture for building LLM applications that need access to external knowledge. The pipeline consists of an offline indexing phase (document ingestion, chunking, embedding, vector storage) and an online query phase (query embedding, retrieval, reranking, prompt construction, generation). Each stage has engineering tradeoffs that affect the final answer quality, latency, and cost.

The key design decisions are chunk size and overlap (balancing context preservation against retrieval precision), embedding model selection (balancing quality against cost and latency), vector database choice (balancing scale against operational complexity), retrieval strategy (dense, sparse, or hybrid), and reranking (balancing precision improvement against added latency). Production systems need monitoring, caching, graceful degradation, and incremental re-indexing to maintain quality over time.

RAG is not a solved problem. Active research areas include multi-hop retrieval for complex questions that require synthesizing information from multiple documents, adaptive retrieval that decides whether retrieval is even needed for a given query, and self-reflective RAG where the model evaluates its own answer and triggers additional retrieval if the evidence is insufficient. As these techniques mature, they will become standard components of production RAG pipelines.
