Agentic + LLM Systems

Components of a Production RAG System: Architecture, Retrieval, Generation, and Evaluation

A production RAG system is more than a vector database connected to an LLM. This guide breaks down the core components: ingestion, chunking, embeddings, retrieval, reranking, prompt assembly, generation, evaluation, observability, and governance.

Share this article
Comments
Share:
Table of Contents

Retrieval-Augmented Generation, or RAG, is an architecture pattern that combines information retrieval with language generation. Instead of relying only on the model’s internal training data, a RAG system retrieves relevant context from external sources and uses that context to generate a grounded answer.

This makes RAG useful for enterprise systems where answers must reflect current documents, policies, product data, customer records, knowledge bases, or domain-specific procedures.

However, a production RAG system is not just a vector database attached to a chatbot. It requires a full pipeline for ingestion, indexing, retrieval, ranking, prompt construction, generation, evaluation, monitoring, and governance.

What Problem Does RAG Solve?

Large language models are strong at reasoning over language, but they have three practical limitations in enterprise use cases:

  1. Knowledge freshness: the model may not know recent or private information.
  2. Source grounding: the model may generate plausible answers without evidence.
  3. Domain specificity: the model may not understand internal terminology, policies, or business rules.

RAG addresses these limitations by retrieving relevant information at query time and injecting it into the model context.

A simplified flow looks like this:

User Query

Query Understanding

Retriever

Relevant Documents / Chunks

Reranker and Context Builder

Prompt Assembly

Language Model

Grounded Response + Citations

Core Components of a RAG System

A production RAG system usually contains the following components:

ComponentPurpose
Data ingestionLoads source documents from systems such as websites, PDFs, databases, SharePoint, Confluence, S3, or CRM platforms.
Document parsingExtracts text, tables, metadata, and structure from source files.
ChunkingSplits documents into retrievable units.
Embedding modelConverts chunks and queries into vector representations.
Vector store / search indexStores embeddings and supports similarity search.
RetrieverFinds candidate chunks for a query.
RerankerReorders retrieved chunks by relevance.
Context builderSelects and formats evidence for the LLM prompt.
Prompt managerApplies instruction templates, policies, and response constraints.
Language modelGenerates the final response.
EvaluatorMeasures retrieval quality, answer quality, faithfulness, and latency.
Observability layerTracks traces, latency, cost, errors, and user feedback.
Governance layerApplies access control, privacy, auditability, and compliance rules.

Each component affects the quality of the final answer. A weak ingestion pipeline or poor chunking strategy can cause bad answers even when the LLM itself is strong.

1. Data Ingestion

The ingestion layer brings enterprise knowledge into the RAG pipeline. Common sources include:

  • internal documentation
  • product manuals
  • customer support articles
  • PDFs and policy documents
  • CRM records
  • database tables
  • emails or tickets
  • knowledge graphs
  • regulatory documents

The ingestion process should capture both content and metadata. Metadata is important because it allows filtering, access control, citation, freshness checks, and lineage tracking.

Useful metadata includes:

  • document ID
  • title
  • source system
  • author or owner
  • created date
  • updated date
  • access permissions
  • department or domain
  • document version
  • URL or source path

A RAG system that ignores metadata will struggle in production because it cannot reliably answer questions such as:

  • Is this document still current?
  • Is the user allowed to see this content?
  • Which source was used for the answer?
  • Which version of the policy was retrieved?

2. Document Parsing

Document parsing converts source material into clean machine-readable text. This sounds simple, but it is often one of the hardest parts of production RAG.

PDFs, slides, spreadsheets, web pages, and scanned documents may contain:

  • headers and footers
  • tables
  • images
  • charts
  • code blocks
  • nested sections
  • page numbers
  • repeated navigation text
  • OCR errors

Poor parsing creates noisy chunks. Noisy chunks reduce retrieval quality and increase hallucination risk.

A strong parsing pipeline should preserve structure where possible:

{
  "document_id": "policy-1024",
  "title": "Credit Card Eligibility Policy",
  "section": "Income Verification",
  "page": 7,
  "text": "Applicants must provide income verification when...",
  "source_url": "s3://policies/credit-card-eligibility.pdf"
}

For regulated or high-risk use cases, document structure matters. The difference between a policy heading, a footnote, and an exception clause can change the answer.

3. Chunking

Chunking splits documents into smaller retrievable units. The objective is to create chunks that are large enough to preserve meaning but small enough to retrieve precisely.

Common chunking strategies include:

StrategyDescriptionBest fit
Fixed-size chunkingSplits text by token or character count.Simple documentation and quick prototypes.
Recursive chunkingSplits by section, paragraph, sentence, then token limit.General-purpose RAG systems.
Semantic chunkingSplits based on topic boundaries or embedding similarity.Long-form content with topic shifts.
Structure-aware chunkingUses headings, tables, page numbers, and document layout.Policies, manuals, contracts, and technical docs.
Entity-aware chunkingKeeps entities and relationships together.Knowledge graph and domain-heavy use cases.

Chunk size should be tested, not guessed. Smaller chunks may improve precision but lose context. Larger chunks may preserve context but introduce irrelevant text.

Typical starting points:

  • 300 to 800 tokens for support articles
  • 500 to 1,200 tokens for technical documentation
  • section-based chunks for policy or legal documents
  • table-preserving chunks for financial or operational data

4. Embedding Model

An embedding model converts text into numerical vectors. These vectors capture semantic similarity, allowing the system to retrieve documents that are related in meaning, not just keyword overlap.

For example, these two queries may be semantically related even though they use different words:

How do I reset my password?
I cannot access my account.

Embedding model selection matters. A weak embedding model can miss relevant documents even when the knowledge base contains the correct answer.

When selecting an embedding model, evaluate:

  • retrieval accuracy
  • language support
  • domain vocabulary coverage
  • latency
  • vector dimension
  • cost
  • deployment model: API, self-hosted, or local
  • support for long inputs
  • performance on your actual documents

Do not choose embeddings only based on benchmark scores. Test retrieval quality using representative user questions and real documents.

5. Vector Store and Search Index

A vector store indexes embeddings and supports similarity search. Common options include Pinecone, Weaviate, Milvus, Qdrant, Chroma, Elasticsearch, OpenSearch, Redis, and PostgreSQL with pgvector.

The vector store is responsible for retrieving candidate chunks quickly. In production, it should support:

  • metadata filtering
  • hybrid search
  • namespace or tenant isolation
  • access control integration
  • index versioning
  • backups
  • monitoring
  • scalable ingestion
  • low-latency query performance

Vector search alone is often not enough. Enterprise RAG systems usually benefit from hybrid retrieval.

6. Retrieval

Retrieval is the process of finding candidate chunks for a user query.

Common retrieval methods include:

Retrieval methodDescription
Dense retrievalUses embedding similarity.
Sparse retrievalUses keyword-based search such as BM25.
Hybrid retrievalCombines vector similarity and keyword search.
Metadata-filtered retrievalRestricts results by user, product, region, date, department, or document type.
Query expansionRewrites or expands the query to improve recall.
Multi-query retrievalGenerates multiple related queries and merges results.

For enterprise use cases, hybrid retrieval is often stronger than pure vector search. Keyword search helps with exact terms, product codes, policy IDs, error messages, names, and abbreviations. Dense retrieval helps with semantic meaning.

A reliable retriever should answer four questions:

  1. Did it find the right documents?
  2. Did it rank them near the top?
  3. Did it filter out restricted or irrelevant content?
  4. Did it return enough evidence for the model to answer accurately?

7. Reranking

The initial retriever often returns a broad set of candidates. A reranker scores those candidates again using a more precise model.

The typical flow is:

Retrieve top 50 chunks

Rerank top 50 using cross-encoder or LLM-based relevance scoring

Select top 5 to 10 chunks for the prompt

Reranking improves answer quality because the final prompt has limited context space. The system should not pass every retrieved chunk to the model. It should pass the best evidence.

Reranking is especially useful when:

  • documents are long
  • many chunks are semantically similar
  • the knowledge base contains duplicate or near-duplicate content
  • exact policy language matters
  • the model context window is expensive or limited

8. Context Builder

The context builder decides which retrieved evidence goes into the final prompt and how it is formatted.

This layer is important because the LLM does not see the entire knowledge base. It only sees the selected context.

A good context builder should:

  • remove duplicate chunks
  • preserve source citations
  • keep section headings
  • order evidence logically
  • avoid mixing conflicting versions
  • fit within token limits
  • include metadata when useful
  • separate source content from instructions

Example context format:

SOURCE 1
Title: Credit Card Eligibility Policy
Section: Income Verification
Updated: 2025-03-12
Content: Applicants must provide income verification when...

SOURCE 2
Title: Exception Handling Guide
Section: Manual Review
Updated: 2025-04-02
Content: If income verification fails, route the application to...

This format makes source attribution easier and reduces the chance that the model blends unrelated evidence.

9. Prompt Management

Prompt management defines how the model should use retrieved context. In production systems, prompts should not be scattered across application code.

A prompt layer should manage:

  • system instructions
  • domain-specific rules
  • tone and formatting
  • citation requirements
  • refusal behavior
  • safety constraints
  • JSON or schema output requirements
  • fallback instructions when evidence is missing

A simple RAG prompt may look like this:

You are an enterprise knowledge assistant.
Answer the user question using only the provided sources.
If the sources do not contain enough information, say that the answer is not available in the provided context.
Cite the source title and section for each important claim.

User question:
{question}

Retrieved sources:
{context}

The most important rule is this: the model should not be allowed to invent an answer when retrieval fails.

10. Language Model

The language model generates the final answer using the prompt and retrieved context.

Model choice depends on the use case:

RequirementModel consideration
Low latencySmaller model or optimized inference.
Complex reasoningStronger general-purpose model.
Sensitive dataPrivate deployment or strict data controls.
Structured outputModel with strong JSON/schema-following ability.
Cost controlSmaller model with better retrieval and reranking.
Multilingual supportModel tested on target languages.

A stronger model does not fix a weak retrieval pipeline. In many RAG systems, retrieval quality has a larger impact than model size.

11. Evaluation

RAG evaluation must measure both retrieval and generation. Evaluating only the final answer hides the root cause of failure.

Key retrieval metrics include:

MetricMeaning
Recall@kWhether the correct source appears in the top k retrieved results.
Precision@kHow many retrieved chunks are actually relevant.
MRRHow high the first relevant result appears in the ranking.
nDCGWhether highly relevant documents rank above weakly relevant ones.

Key generation metrics include:

MetricMeaning
FaithfulnessWhether the answer is supported by retrieved context.
Answer relevanceWhether the answer directly addresses the question.
Context relevanceWhether selected context was useful.
Citation accuracyWhether citations support the claims made.
Refusal accuracyWhether the system correctly refuses when evidence is insufficient.

Operational metrics include:

  • end-to-end latency
  • retrieval latency
  • model latency
  • token usage
  • cost per query
  • error rate
  • timeout rate
  • cache hit rate
  • user feedback

For production RAG, build a test set of real user questions with expected sources and expected answer characteristics. This allows regression testing when you change chunking, embeddings, retrieval, reranking, prompts, or models.

12. Observability

Observability helps teams understand why a RAG system produced a specific answer.

A useful trace should capture:

  • user query
  • rewritten query, if any
  • retrieval filters
  • retrieved chunks
  • reranked chunks
  • final context sent to the model
  • prompt template version
  • model name and version
  • generated answer
  • latency by stage
  • token usage
  • errors
  • feedback

Without tracing, debugging becomes guesswork. If an answer is wrong, the team needs to know whether the problem came from ingestion, chunking, retrieval, reranking, prompt design, or model behavior.

13. Governance, Security, and Compliance

Enterprise RAG systems must enforce governance controls. This is especially important in banking, healthcare, insurance, legal, HR, and customer support environments.

Important controls include:

  • role-based access control
  • document-level permissions
  • row-level and field-level filtering
  • PII redaction
  • audit logs
  • prompt and response logging policies
  • source attribution
  • data retention rules
  • model risk management
  • human review for high-risk decisions

A RAG system should not retrieve content the user is not allowed to access. Access control must happen before the model sees the context, not after the response is generated.

Common Failure Modes

RAG systems fail in predictable ways.

Failure modeCauseMitigation
Correct document not retrievedPoor chunking, weak embeddings, missing metadata, or bad query handling.Improve chunking, hybrid retrieval, query rewriting, and evaluation.
Wrong chunk retrievedSimilar documents, duplicated content, or weak ranking.Add reranking and metadata filters.
Answer not groundedModel ignores context or context is insufficient.Strengthen prompt rules and refusal behavior.
Citation mismatchContext builder loses source mapping.Preserve chunk IDs and source metadata throughout the pipeline.
Stale answerOld document retrieved over current version.Add versioning, freshness filters, and document lifecycle controls.
Sensitive data leakageRetrieval ignores permissions.Enforce access control before retrieval and context assembly.
High latencyToo many retrieval steps, large context, or slow model.Add caching, reranking limits, optimized indexes, and model routing.
High costExcessive context and expensive model calls.Improve retrieval precision, reduce prompt size, and use tiered models.

Reference Python Example: Minimal RAG Flow

The following example is intentionally minimal. It demonstrates the basic flow but should not be treated as production-ready.

from dataclasses import dataclass
from typing import List
import math


@dataclass
class DocumentChunk:
    chunk_id: str
    text: str
    source: str
    embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text: str) -> List[float]:
    # Demonstration only. Use a real embedding model in production.
    tokens = text.lower().split()
    return [
        len(tokens),
        sum(len(token) for token in tokens),
        1.0 if "rag" in tokens else 0.0,
        1.0 if "retrieval" in tokens else 0.0,
        1.0 if "generation" in tokens else 0.0,
    ]


class SimpleRetriever:
    def __init__(self, chunks: List[DocumentChunk]):
        self.chunks = chunks

    def retrieve(self, query: str, top_k: int = 3) -> List[DocumentChunk]:
        query_embedding = toy_embed(query)
        ranked = sorted(
            self.chunks,
            key=lambda chunk: cosine_similarity(query_embedding, chunk.embedding),
            reverse=True,
        )
        return ranked[:top_k]


def build_context(chunks: List[DocumentChunk]) -> str:
    context_blocks = []
    for chunk in chunks:
        context_blocks.append(
            f"Source: {chunk.source}\nChunk ID: {chunk.chunk_id}\nContent: {chunk.text}"
        )
    return "\n\n".join(context_blocks)


def generate_answer(question: str, context: str) -> str:
    # Replace this with a real LLM call.
    return (
        "Use the retrieved context to answer the question.\n\n"
        f"Question: {question}\n\n"
        f"Retrieved context:\n{context}"
    )


raw_chunks = [
    ("1", "RAG combines retrieval with language generation.", "rag-architecture.md"),
    ("2", "Embedding models convert text into vectors for semantic search.", "embedding-guide.md"),
    ("3", "Reranking improves the relevance of retrieved chunks before generation.", "reranking-guide.md"),
]

chunks = [
    DocumentChunk(
        chunk_id=chunk_id,
        text=text,
        source=source,
        embedding=toy_embed(text),
    )
    for chunk_id, text, source in raw_chunks
]

retriever = SimpleRetriever(chunks)
question = "What are the main components of a RAG system?"
retrieved_chunks = retriever.retrieve(question)
context = build_context(retrieved_chunks)
answer = generate_answer(question, context)

print(answer)

In a real implementation, replace the toy embedding function with a production embedding model, use a scalable search index, apply metadata filtering, add reranking, and log the full trace for observability.

Production Readiness Checklist

Before deploying a RAG system, validate the following:

  • Source documents are parsed correctly.
  • Chunks preserve enough context to answer questions accurately.
  • Metadata includes source, version, ownership, and permissions.
  • Retrieval quality is measured with a test set.
  • Reranking is evaluated against retrieval-only performance.
  • The prompt includes clear grounding and refusal rules.
  • The model does not answer when evidence is missing.
  • Citations map back to exact source chunks.
  • Access control is enforced before context reaches the model.
  • Latency and cost are monitored by pipeline stage.
  • Evaluation runs automatically after major pipeline changes.
  • User feedback is captured and reviewed.

Summary

A RAG system is a multi-stage architecture, not a single model call. The quality of the final answer depends on the full pipeline: ingestion, parsing, chunking, embeddings, retrieval, reranking, context assembly, prompt management, generation, evaluation, observability, and governance.

For prototypes, a simple vector store and LLM prompt may be enough. For enterprise systems, that is not sufficient. Production RAG requires measurable retrieval quality, controlled generation, source attribution, access control, operational monitoring, and continuous evaluation.

The most reliable RAG systems are built like search and knowledge platforms first, and chatbot experiences second.

Enterprise AI Architecture

Want more enterprise AI architecture breakdowns?

Subscribe to SuperML.

Comments

Sign in to leave a comment

Back to Blog

Related Posts

View All Posts »

LangChain vs LangGraph 2026: Which to Use for Enterprise Agents

LangChain and LangGraph solve different problems and the choice between them is not about preference — it's about the shape of your workflow. This is the architecture decision guide: when chains are enough, when you need stateful graphs, and when to use neither.