Your RAG Retrieval Quality Is a Chunking Problem, Not a Model Problem
Most production RAG failures trace back to chunking — the upstream decision that gets the least architectural thought. Plan chunk size, overlap, and strategy before you embed 50GB the wrong way.
When RAG quality degrades, the first reaction is almost always to swap the embedding model or upgrade the LLM. The second is to raise top-k and retrieve more chunks per query. Neither usually fixes the actual problem.
Most production RAG failures trace back to one upstream decision: chunking. And it’s the decision that gets the least architectural thought.
A 200-token chunk of dense code, embedded with a model expecting prose, lands in your index as semantic noise. A 1,024-token chunk of legal text gets cut mid-clause and embedded as half a sentence. Neither failure shows up at query time as an error; it shows up as the LLM confidently answering a question from the wrong document.
The RAG Chunking Calculator is built to make the chunking decision visible before you index 50GB of content the wrong way.
What the calculator actually models
Inputs:
- Document type — unstructured prose, source code, tables, PDFs
- Chunking strategy — fixed-size, semantic, sliding window
- Chunk size and overlap percentage
- Corpus size
- Embedding model — to compute storage and cost
Outputs:
- Total chunk count
- Overlap waste (the percentage of your embedding budget you’re spending on duplicate content)
- Vector storage size
- Embedding cost
- Retrieval quality risk for your chosen strategy on this document type
- Recommended chunking strategy
The number that usually moves the conversation: overlap waste. Fixed-size chunking with 20% overlap means roughly 20% of the tokens you embed are duplicated content. At 50M chunks, that's roughly 10M embeddings you paid for that recall nothing new.
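For intuition, here is a minimal sketch of the arithmetic involved; the chunk size, overlap, vector dimensions, and per-token price below are illustrative assumptions, not the tool's internals.

```python
# Back-of-envelope chunking math. All constants are illustrative
# assumptions, not the calculator's actual internals.

def chunking_estimate(corpus_tokens: int,
                      chunk_size: int = 512,
                      overlap: float = 0.20,              # fraction of each chunk repeated
                      dims: int = 1536,                   # e.g. a 1536-dim embedding model
                      usd_per_million_tokens: float = 0.02) -> dict:
    stride = chunk_size * (1 - overlap)                   # net-new tokens per chunk
    chunks = int(corpus_tokens / stride)
    tokens_embedded = chunks * chunk_size                 # includes duplicated overlap
    overlap_waste = 1 - corpus_tokens / tokens_embedded   # share of spend on duplicates
    storage_gb = chunks * dims * 4 / 1e9                  # float32 vectors
    cost_usd = tokens_embedded / 1e6 * usd_per_million_tokens
    return {
        "chunks": chunks,
        "overlap_waste_pct": round(overlap_waste * 100, 1),
        "vector_storage_gb": round(storage_gb, 1),
        "embedding_cost_usd": round(cost_usd, 2),
    }

# ~20B corpus tokens at 512-token chunks and 20% overlap lands
# near the 50M-chunk scale discussed above.
print(chunking_estimate(corpus_tokens=20_000_000_000))
```

At those numbers the sketch reports about 48.8M chunks, 20% overlap waste, roughly 300GB of float32 vectors, and a $500 embedding bill: the waste percentage equals the overlap fraction by construction.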
The architecture decision it forces
1. Fixed-size or semantic chunking? Fixed-size is cheap and dumb. Semantic chunking respects natural boundaries (paragraph, function, clause) and produces higher-quality retrieval, but it adds preprocessing cost: you usually need an extra LLM or embedding pass per document to find breakpoints. The calculator quantifies the trade-off: at what corpus size does semantic chunking's quality improvement outweigh its preprocessing cost?
2. How much overlap is enough? Zero overlap risks losing concepts that span the boundary. 50% overlap is wasteful. The sweet spot is usually 10–20% — and the calculator shows the diminishing-returns curve so you can pick deliberately instead of by default.
3. One chunking strategy or many? This is the answer most teams resist: heterogeneous corpora need heterogeneous chunking. Code chunks at 512–1,024 tokens. Prose chunks at 256–512. Tables embedded as JSON, not as flattened text. A single uniform strategy across all document types is the cheapest decision and the worst one for quality; the dispatch sketch after this list shows what per-type chunking can look like.
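A hypothetical per-type dispatch, assuming three document types and the token ranges above; the policy table and strategy names are made up for illustration, and the "semantic" branch uses paragraph boundaries as a cheap stand-in for real breakpoint detection.

```python
# Hypothetical per-document-type chunking policy. Type names,
# strategies, and token budgets are assumptions for illustration.
import json

CHUNK_POLICY = {
    "prose": {"strategy": "semantic", "chunk_tokens": 384, "overlap": 0.15},
    "code":  {"strategy": "fixed",    "chunk_tokens": 768, "overlap": 0.0},
    "table": {"strategy": "rows",     "chunk_tokens": 512, "overlap": 0.0},
}

def chunk_document(doc_type: str, text: str) -> list[str]:
    policy = CHUNK_POLICY[doc_type]
    if doc_type == "table":
        # Embed structured rows as JSON, not flattened text.
        rows = [line.split("\t") for line in text.splitlines() if line.strip()]
        return [json.dumps(row) for row in rows]
    if policy["strategy"] == "semantic":
        # Cheap stand-in for semantic chunking: split on paragraph
        # boundaries, then merge paragraphs until the budget is hit.
        # Word count approximates tokens here.
        paras = [p for p in text.split("\n\n") if p.strip()]
        chunks, buf = [], ""
        for p in paras:
            if buf and len((buf + p).split()) > policy["chunk_tokens"]:
                chunks.append(buf)
                buf = p
            else:
                buf = f"{buf}\n\n{p}" if buf else p
        if buf:
            chunks.append(buf)
        return chunks
    # Fixed-size fallback; real code chunking should split on
    # function boundaries instead (see the next section).
    words = text.split()
    step = max(int(policy["chunk_tokens"] * (1 - policy["overlap"])), 1)
    return [" ".join(words[i:i + policy["chunk_tokens"]])
            for i in range(0, len(words), step)]
```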
Three things the calculator surfaces that teams miss
Code chunks are not prose chunks. Embedding models trained primarily on natural language treat code as low-information text. Code chunks need to be larger (full function or class) and often need a separate embedding model. Cutting a function in half is worse than not indexing it at all.
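One way to keep functions intact, sketched for Python source with the standard-library ast module; treating every top-level def or class as one chunk is an assumption that fits typical codebases, not a universal rule.

```python
# Sketch: chunk Python source at top-level function/class boundaries
# so no definition is ever cut in half. Stdlib only (Python 3.8+ for
# end_lineno). Top-level statements between definitions are skipped.
import ast

def chunk_python_source(source: str) -> list[str]:
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

sample = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
for chunk in chunk_python_source(sample):
    print("---\n" + chunk)
```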
Overlap above 20% pays for nothing. The calculator's heuristic: chunk count, and with it storage and embedding spend, scales as 1/(1 - overlap), so each 5% of overlap beyond 20% buys a steepening cost curve against flattening quality gains. Most teams default to 30–50% because tutorials use those numbers; almost nobody benchmarks down.
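The cost half of that curve is pure arithmetic, made concrete below; the quality half has to come from benchmarking your own corpus.

```python
# Storage/cost multiplier vs. zero overlap: chunk count scales as 1/(1 - o).
for pct in range(0, 55, 5):
    o = pct / 100
    print(f"{pct:>2}% overlap -> {1 / (1 - o):.2f}x chunks, storage, and spend")
```

The jump from 0% to 20% costs a 1.25x multiplier; pushing on to 50% costs 2.00x.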
Total chunk count is your operational complexity multiplier. Doubling chunks doubles your embedding cost, doubles your vector DB footprint, and pushes up retrieval latency. A 10x chunking efficiency gain is often a 10x infrastructure cost reduction.
When to actually pull this calculator out
- Before your first index. Chunking decisions are sticky; re-chunking 100GB of content means days of pipeline work and a full re-embedding bill.
- Before adding a new document type to an existing index. PDFs joining a prose corpus need their own strategy.
- Before upgrading the embedding model. Different models prefer different chunk sizes; re-embedding is a chance to also re-chunk.
- When retrieval quality drops on a specific document class. Diagnose chunking before swapping models.
The one-line takeaway
RAG quality is a chunking problem disguised as a model problem. The calculator forces the chunking decision into the open before you’ve embedded the entire corpus the wrong way.
Run the RAG Chunking Calculator →
Related planning tools in this series
- RAG Vector DB Cost Calculator — the downstream cost of your chunking choices
- Context Window Calculator — how many chunks you can actually inject
- AI Architecture Pattern Selector — when RAG is the right answer at all
Part of the Plan Before You Build series on superml.dev — calculators for AI/ML architects who would rather do the math once than debug at 2am.
Tags: #AI #RAG #Chunking #Embeddings #VectorDB #Architecture #MachineLearning #LLM
Want more enterprise AI architecture breakdowns?
Subscribe to SuperML.