Agentic + LLM Systems

RAG Pipeline Production Architecture 2026: Chunking, Retrieval, Re-ranking, and Evaluation

Most RAG tutorials get you from zero to a working demo in 30 minutes. Production RAG takes 6–12 months to get right, and the problems that sink it are not the ones covered in the tutorial. This is the production engineering guide: chunking strategy, hybrid retrieval, re-ranking, evaluation frameworks, and the operational patterns that keep RAG systems working after launch.

Share this article
Comments
Share:
Most RAG tutorials get you from zero to a working demo in 30 minutes. Production RAG takes 6–12 months to get right, and the problems that sink it are not the ones covered in the tutorial. This is the production engineering guide: chunking strategy, hybrid retrieval, re-ranking, evaluation frameworks, and the operational patterns that keep RAG systems working after launch.
Table of Contents

Every RAG tutorial ends at the same place: you’ve embedded some documents, built a retriever, wired it to an LLM, and the demo works. “Context-aware answers from your own data in 30 minutes” is a real thing you can build.

What happens next is the production engineering problem. The demo works on 50 clean documents with clearly-formulated questions. Production has 50,000 documents, messy formatting, questions that span multiple documents, users who phrase queries in unexpected ways, a corpus that changes daily, and stakeholders who expect the system to be right — not just approximately right — when it’s connected to a compliance or financial workflow.

The gaps between “demo works” and “production works” are specific and well-documented at this point. This is the production engineering guide for each of them.


The Four Failure Points in Production RAG

Before getting into implementation, the diagnosis: most production RAG failures fall into one of four categories.

1. Chunking failure. The relevant information is split across chunk boundaries. The retriever finds chunks A and C but not B, and the context is incomplete. Or the chunks are too large and include irrelevant content that confuses the generator. Or the chunk boundaries cut across a table, a numbered list, or a definition in a way that makes each chunk semantically incoherent.

2. Retrieval failure. The relevant chunks exist in the corpus but don’t get retrieved. Pure semantic search misses exact terminology. Retrieval returns the right documents but ranks them wrong. Metadata filtering is too aggressive and excludes relevant content.

3. Evaluation gap. There is no systematic measurement of retrieval quality or answer quality. Developers test a handful of queries manually, the system goes to production, and quality drift isn’t detected until users complain.

4. Corpus management failure. Documents are updated or deleted, but the vector index isn’t. Queries return stale or removed content. The corpus grows without a strategy for managing index size, and retrieval quality degrades as irrelevant old documents compete with current ones.

Each has a solution. None of the solutions are complicated. All of them require engineering discipline that the 30-minute tutorial skips.


Chunking Strategy

Chunking is the most underrated decision in RAG architecture. The quality of your retrieval is bounded by the quality of your chunks — a perfect retriever can’t compensate for chunks that don’t contain coherent, self-contained information units.

Naive Chunking (and Why It Fails)

Fixed-size chunking — split every document into 512-token chunks with 50-token overlap — is the default in most tutorials. It fails in production for documents with meaningful structure: legal documents, financial reports, technical specifications, regulatory guidance.

A fixed-size chunk that starts 200 tokens into a section heading and ends 200 tokens into the next section is semantically incoherent. The embedding represents a chimera of two different topics. Retrieval returns it for queries that match either topic, with degraded precision.

Semantic Chunking

Semantic chunking splits documents at semantic boundaries rather than fixed token counts. The approach:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_anthropic import AnthropicEmbeddings

embeddings = AnthropicEmbeddings()

# Semantic chunker: split where embedding similarity between
# consecutive sentences drops below a threshold
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90  # Split at points of high semantic discontinuity
)

chunks = chunker.split_text(document_text)

Semantic chunking produces chunks with coherent semantic content at the cost of variable chunk sizes. For documents with clear topical sections (regulatory guidance, policy documents, product documentation), the quality improvement is substantial.

Structure-Aware Chunking

For structured documents — tables, numbered lists, code blocks, hierarchical sections — semantic chunking still misses important structure. Structure-aware chunking uses document structure (HTML tags, markdown headers, PDF section boundaries) to guide splitting:

from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# Stage 1: Split at markdown headers (preserves section context)
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "section"),
        ("##", "subsection"),
        ("###", "subsubsection"),
    ]
)
header_splits = markdown_splitter.split_text(document_text)

# Stage 2: Further split large sections by sentence boundary
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "]  # Prefer sentence boundaries
)
final_chunks = char_splitter.split_documents(header_splits)

# Each chunk now carries section metadata from header splits
# chunk.metadata = {"section": "Model Validation", "subsection": "Back-testing Procedures"}

The metadata preservation is as important as the splitting logic. Chunk metadata enables downstream metadata filtering: “find chunks from the Model Validation section” is a much more precise query than “find chunks about validation.”

Parent Document Retriever

For long documents where relevant context spans multiple paragraphs, the parent document retriever pattern is the most robust production approach:

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma

# Child chunks: small (200 tokens), stored in vector DB for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Parent chunks: large (2000 tokens), stored in docstore, returned to LLM
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

docstore = InMemoryStore()  # Replace with Redis or PostgreSQL for production
vectorstore = Chroma(embedding_function=embeddings)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

Why this works: Small child chunks have higher retrieval precision (they represent a narrow semantic unit). Large parent chunks provide sufficient context for the generator (the full surrounding section). The retriever finds the precise small chunk, then returns the full parent document section to the LLM. This resolves the fundamental tension between chunk size for retrieval and chunk size for generation.


Hybrid Retrieval

Pure semantic (embedding) search is insufficient for production RAG on domain-specific corpora. The solution is hybrid retrieval — combining embedding similarity with BM25 keyword matching.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Semantic retriever (vector search)
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# BM25 retriever (keyword search over the same corpus)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Ensemble: weighted combination (RRF or linear score fusion)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6]  # Tune per corpus — more BM25 weight for exact-term-heavy corpora
)

results = ensemble_retriever.invoke("What are the SR 26-2 model validation requirements?")

The weight between BM25 and semantic search is a tuning parameter. For regulatory and legal corpora with defined terminology ("SR 26-2", "model validation", "covered institution"), BM25 weight should be higher (0.4–0.6). For narrative corpora where the user’s phrasing varies widely from the source text, semantic weight should be higher (0.7–0.8).

Reciprocal Rank Fusion

RRF is a more principled score combination approach than linear weighting. It combines ranked lists from multiple retrievers by the inverse of each result’s rank position, making it more robust to score scale differences between retriever types:

def reciprocal_rank_fusion(result_lists: list[list], k: int = 60) -> list:
    """
    Combine multiple ranked result lists using RRF.
    k controls the impact of rank position (higher k = less steep)
    """
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            doc_id = doc.metadata.get("chunk_id", doc.page_content[:50])
            if doc_id not in scores:
                scores[doc_id] = {"doc": doc, "score": 0}
            scores[doc_id]["score"] += 1 / (rank + k)
    
    return sorted(scores.values(), key=lambda x: x["score"], reverse=True)

RRF is the default score fusion approach in most production hybrid retrieval implementations because it doesn’t require calibrating score scales between retrievers.


Re-ranking

The retriever returns the top-K most relevant chunks. A cross-encoder re-ranker then scores each retrieved chunk against the query with higher accuracy than the bi-encoder embedding model, and reorders the results.

The distinction: embedding models use bi-encoders (encode query and document separately, compare with dot product). Cross-encoders process the query and document together and produce a single relevance score. Cross-encoders are 10–100x slower but significantly more accurate.

The production pattern: retrieve top-20 with the fast bi-encoder, re-rank with the cross-encoder, pass top-5 to the LLM.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, retrieved_docs: list, top_n: int = 5) -> list:
    pairs = [(query, doc.page_content) for doc in retrieved_docs]
    scores = reranker.predict(pairs)
    
    ranked = sorted(
        zip(scores, retrieved_docs),
        key=lambda x: x[0],
        reverse=True
    )
    return [doc for score, doc in ranked[:top_n]]

When re-ranking is worth the latency cost: For retrieval-critical applications (compliance Q&A, medical information, legal research) where wrong answers have real consequences, re-ranking materially improves precision and is worth the 100–300ms latency addition. For conversational applications where speed matters more than precision, skip the cross-encoder.

Cohere Rerank, Jina Rerank, and BGE Re-ranker are production-grade managed and open-source cross-encoders respectively. For enterprise deployments, a self-hosted BGE re-ranker (via Hugging Face inference server) keeps the re-ranking computation within your infrastructure boundary.


Evaluation: The Missing Layer

Most production RAG systems have no systematic evaluation. This is the gap that causes quality drift to go undetected for weeks. A production RAG system needs three evaluation layers:

1. Retrieval Evaluation (Offline)

Measure whether your retriever finds the relevant chunks for a representative query set.

from ragas import evaluate
from ragas.metrics import context_precision, context_recall
from datasets import Dataset

# Build evaluation dataset: questions + ground truth contexts + ground truth answers
eval_data = Dataset.from_list([
    {
        "question": "What does SR 26-2 require for model validation?",
        "contexts": retriever.invoke("SR 26-2 model validation requirements"),
        "ground_truth": "SR 26-2 requires independent validation including conceptual soundness review..."
    },
    # ... more examples
])

# RAGAS context_precision: what fraction of retrieved context is relevant?
# RAGAS context_recall: what fraction of relevant information was retrieved?
results = evaluate(eval_data, metrics=[context_precision, context_recall])
print(f"Context Precision: {results['context_precision']:.3f}")
print(f"Context Recall: {results['context_recall']:.3f}")

Run this evaluation every time you change chunking strategy, embedding model, retrieval parameters, or corpus. Track metrics over time. A context_recall drop indicates that relevant information is being missed; a context_precision drop indicates irrelevant chunks are being included.

2. Answer Quality Evaluation (Offline + Online)

from ragas.metrics import answer_relevancy, faithfulness

# faithfulness: are all claims in the answer grounded in the retrieved context?
# answer_relevancy: how relevant is the answer to the question?
answer_results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])

Faithfulness is the most critical metric for regulated applications. A faithfulness score below 0.85 means the system is generating claims not grounded in retrieved context — hallucination at a rate that is unacceptable for compliance or financial applications.

3. Production Monitoring (Online)

For every production query, log:

  • Query text (or hash, for PII-sensitive environments)
  • Number of chunks retrieved
  • Re-ranker top-1 score (proxy for retrieval quality)
  • Response latency breakdown (retrieval / re-rank / generation)
  • Whether the user accepted or rejected the response (if UI supports feedback)

Alert on: re-ranker top-1 score dropping below threshold (retrieval quality degrading), latency P99 exceeding SLA, user rejection rate increasing.


Corpus Management in Production

Incremental Updates

Every production RAG corpus changes. New documents are added; existing documents are updated; some are deleted. The corpus management system must handle all three without requiring full re-index.

class RAGCorpusManager:
    def __init__(self, vectorstore, docstore, embedder):
        self.vectorstore = vectorstore
        self.docstore = docstore
        self.embedder = embedder
    
    async def add_document(self, document: Document) -> str:
        chunks = self.chunker.split(document)
        embeddings = await self.embedder.embed_batch([c.text for c in chunks])
        chunk_ids = self.vectorstore.upsert(chunks, embeddings)
        self.docstore.set(document.id, {"chunk_ids": chunk_ids, "updated_at": now()})
        return document.id
    
    async def update_document(self, document_id: str, new_document: Document):
        # Delete old chunks first
        old_meta = self.docstore.get(document_id)
        if old_meta:
            self.vectorstore.delete(ids=old_meta["chunk_ids"])
        # Add updated chunks
        await self.add_document(new_document)
    
    async def delete_document(self, document_id: str):
        meta = self.docstore.get(document_id)
        if meta:
            self.vectorstore.delete(ids=meta["chunk_ids"])
            self.docstore.delete(document_id)

The document ID → chunk ID mapping in the docstore is the critical bookkeeping that enables updates and deletes without full re-index. Every chunk must carry a document ID in its metadata, and the docstore must maintain the reverse mapping.

Embedding Model Version Management

When you upgrade the embedding model (e.g., from text-embedding-ada-002 to text-embedding-3-large), your entire vector index is incompatible — the embedding space changed. This is a full re-index operation.

Production pattern: maintain a model_version field in each chunk’s metadata. When upgrading the embedding model, run a background re-indexing job that processes documents in batches, writes new embeddings with the new model version, and atomically cuts over retrieval to the new index once re-indexing is complete. Never do an embedding model upgrade as a hard cutover.


The Production RAG Stack

Putting it together, a production RAG architecture that handles the failure points above:

Document Ingestion

    ├── Structure-aware chunking (markdown/HTML/PDF aware)
    ├── Parent document store (large context) + child vector index (precision)
    └── Metadata extraction and enrichment
    
Retrieval Pipeline

    ├── Hybrid search (semantic + BM25, RRF fusion)
    ├── Metadata pre-filtering (access control, document type, date range)
    └── Cross-encoder re-ranking (top-20 → top-5)
    
Generation

    ├── Citation-grounded prompt (every claim must map to a retrieved chunk)
    ├── Faithfulness check (post-generation verification)
    └── Structured response with source citations
    
Evaluation

    ├── Offline: RAGAS context_precision, context_recall, faithfulness
    ├── Online: retrieval quality proxy metrics, latency, user feedback
    └── Corpus health: staleness monitoring, embedding model version tracking

This is not the 30-minute demo. It is what production RAG looks like after the demo survives contact with real documents, real users, and real governance requirements. The components are not exotic — they are all available in open-source tooling today. What distinguishes production RAG from demo RAG is not the technology. It is the engineering discipline to build all of it, instrument it, and keep it working after launch.


Enterprise AI Architecture

Want more enterprise AI architecture breakdowns?

Subscribe to SuperML.

Comments

Sign in to leave a comment

Back to Blog

Related Posts

View All Posts »

OODA Loop Architecture for Production AI Agents

John Boyd designed the OODA loop for fighter pilots making life-or-death decisions in milliseconds with incomplete information. It turns out this is a better mental model for production AI agents than the ReAct loop — especially in high-stakes, time-pressured environments where agents need to fail fast, course-correct, and maintain situational awareness across a multi-step decision horizon.

LangChain vs LangGraph 2026: Which to Use for Enterprise Agents

LangChain and LangGraph solve different problems and the choice between them is not about preference — it's about the shape of your workflow. This is the architecture decision guide: when chains are enough, when you need stateful graphs, and when to use neither.