Opinionated AI Briefs

Your RAG Bill Isn't the LLM. It's the Embeddings: The Math Most Teams Skip

At meaningful query volume, embedding and vector DB costs routinely exceed LLM inference. Model it before you commit to a vendor, or watch re-embedding quietly dominate your bill.


Ask any AI architect "what does RAG cost?" and the first answer is usually the inference bill: the LLM calls that generate the final answer.

That answer is wrong for most high-volume systems.

At meaningful query volume, embedding cost and vector database hosting routinely exceed the LLM inference bill, and the gap widens every time you re-embed your corpus.

Three numbers determine the actual cost of a RAG system (a back-of-envelope sketch follows the list):

  1. How many chunks you produce from your corpus
  2. What you pay per embedded vector (once for storage, again every time you re-embed)
  3. What your vector database charges for hosting those vectors at your sustained QPS
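
Here is that sketch in Python. Every constant is an illustrative assumption (corpus size, bytes-per-token ratio, prices), not a vendor quote:

```python
# Back-of-envelope RAG cost math. All constants are illustrative assumptions.
corpus_tokens  = 25_000_000_000   # ~100 GB of text at roughly 4 bytes/token
chunk_tokens   = 512
overlap_tokens = 64

# 1. Chunk count: overlapping windows advance by (chunk - overlap) tokens.
chunks = corpus_tokens // (chunk_tokens - overlap_tokens)    # ~55.8M chunks

# 2. Embedding cost: you pay for every token embedded, overlap included.
embedded_tokens = chunks * chunk_tokens                      # ~28.6B tokens
embed_cost = embedded_tokens / 1e6 * 0.13   # assumed $0.13/1M tokens -> ~$3,700

# 3. Hosting: vector storage grows with chunk count and embedding dimension.
dims = 1024                                 # assumed embedding dimension
storage_gb = chunks * dims * 4 / 1e9        # float32 vectors -> ~229 GB
```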

The RAG Vector DB Cost Calculator is built to do this math before you commit to a vendor.


What the calculator actually models

Inputs:

  • Corpus size (MB or document count)
  • Chunk size and overlap, which together determine the total vector count
  • Embedding model: OpenAI text-embedding-3-large, Voyage, Cohere, or an open-source model on your own GPU
  • Vector database: Pinecone, Weaviate Cloud, self-hosted Milvus, pgvector
  • Query volume: monthly QPS pattern

Outputs (sketched in code after the list):

  • Total chunks generated
  • Embedding storage (GB)
  • One-time embedding cost (and recurring, if you re-embed quarterly)
  • Monthly vector DB cost
  • Cost per query
  • Total monthly RAG infrastructure cost
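
One plausible shape for that computation, as a sketch. The field names and the pricing structure (flat $/GB-month storage plus $/1k queries) are my assumptions, not the calculator's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RagCostInputs:                  # hypothetical field names, for illustration
    corpus_tokens: int
    chunk_tokens: int
    overlap_tokens: int
    dims: int                         # embedding dimension
    embed_price_per_mtok: float       # $ per 1M tokens embedded
    db_storage_per_gb_month: float    # assumed managed-DB storage rate
    db_price_per_1k_queries: float    # assumed serverless read rate
    queries_per_month: int
    reembeds_per_year: int            # 0 if the corpus is frozen

def estimate(i: RagCostInputs) -> dict:
    chunks = i.corpus_tokens // (i.chunk_tokens - i.overlap_tokens)
    embedded_tokens = chunks * i.chunk_tokens
    storage_gb = chunks * i.dims * 4 / 1e9                   # float32 vectors
    one_time_embed = embedded_tokens / 1e6 * i.embed_price_per_mtok
    db_monthly = (storage_gb * i.db_storage_per_gb_month
                  + i.queries_per_month / 1000 * i.db_price_per_1k_queries)
    monthly_total = db_monthly + one_time_embed * i.reembeds_per_year / 12
    return {
        "total_chunks": chunks,
        "storage_gb": round(storage_gb, 1),
        "one_time_embed_usd": round(one_time_embed),
        "vector_db_monthly_usd": round(db_monthly),
        "cost_per_query_usd": round(monthly_total / max(i.queries_per_month, 1), 6),
        "monthly_total_usd": round(monthly_total),
    }
```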

The number that changes minds: cost per query at year-1 volume vs. year-2 volume. Managed vector DBs scale beautifully until they don't. Self-hosted scales flat after the infrastructure investment, but only if you can operate it.


The architecture decision it forces

1. Managed vs. self-hosted vector DB. Pinecone is operationally trivial; Milvus on your own k8s cluster is operationally non-trivial. The break-even is usually around 50M–200M vectors or sustained 100+ QPS. Below that, managed wins on engineering time. Above that, self-hosted wins on $/vector. The calculator gives you the crossover point for your corpus, not a generic benchmark.
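
The crossover arithmetic is simple enough to sanity-check by hand. A sketch, where the per-query rate and the cluster bill are placeholder assumptions (real managed bills also add storage and write units):

```python
SECONDS_PER_MONTH = 2_592_000   # 30 days

def breakeven_qps(managed_price_per_1k_queries: float,
                  selfhosted_cluster_monthly: float) -> float:
    """Sustained QPS above which a flat-rate cluster beats per-query pricing."""
    cost_per_query = managed_price_per_1k_queries / 1000
    return selfhosted_cluster_monthly / (cost_per_query * SECONDS_PER_MONTH)

# e.g. a $6,000/month cluster vs an assumed $0.02 per 1k managed queries:
print(breakeven_qps(0.02, 6_000))   # ~116 sustained QPS, on query cost alone
```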

2. Embedding model choice. OpenAI's text-embedding-3-large is excellent and expensive. A self-hosted bge-large-en-v1.5 on a $200/month GPU may give you 90% of the retrieval quality at 5% of the embedding cost, if you have the engineering bandwidth to operate it. The calculator lets you compare.
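
Rough arithmetic behind that comparison, with assumed throughput and pricing (verify current numbers before relying on them):

```python
tokens_to_embed = 28.6e9          # from the 100 GB chunking math above

# Hosted frontier model at an assumed $0.13 per 1M tokens:
hosted_cost = tokens_to_embed / 1e6 * 0.13          # ~$3,700

# Self-hosted bge-large-en-v1.5, assuming ~20k tokens/sec on a $200/month GPU:
gpu_hours     = tokens_to_embed / 20_000 / 3600     # ~400 GPU-hours
selfhost_cost = gpu_hours * (200 / 730)             # ~$110 of GPU time
```

The gap is real; so is the unpriced labor of running the serving stack, batching, and quality monitoring.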

3. Re-embedding cadence. If your corpus drifts (new documents weekly), you are paying the embedding cost over and over. A 100GB corpus re-embedded monthly with a frontier embedding model can cost more than your LLM inference. Many teams have never written down their re-embedding policy; they should.
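
The cadence math fits on three lines (same assumed pricing as the sketch above):

```python
per_full_pass = 3_700        # one full embed of the 100 GB corpus, assumed
print(per_full_pass * 12)    # ~$44,400/year if you re-embed monthly
print(per_full_pass * 4)     # ~$14,800/year if quarterly
```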


Three things the calculator surfaces that teams miss

Embedding cost is mostly invisible until you re-embed. Initial embedding is a one-time cost that gets amortized into "infra setup" and forgotten. Then you change embedding models six months later and discover the re-embed costs $40,000.

Hybrid search reduces vector count by 40–60%. Pure dense retrieval needs every chunk indexed. Hybrid (BM25 + dense) can rely on sparse retrieval for many query classes and reduce the dense index size dramatically. This is a cost-side argument for hybrid that rarely gets made.
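
The storage arithmetic behind that claim, using the assumed corpus from earlier:

```python
vectors, dims = 55_800_000, 1024        # chunk count and dimension from above
dense_gb  = vectors * dims * 4 / 1e9    # ~229 GB if every chunk is densely indexed
hybrid_gb = dense_gb * 0.5              # midpoint of the claimed 40-60% reduction
```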

Self-hosted vector DBs have flat query cost. Once you've paid for the cluster, query volume doesn't change the bill. Managed providers bill per-query (Pinecone serverless) or per-replica (Pinecone pods). At high QPS, the unit economics diverge fast.
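
A sketch of that divergence, with placeholder rates consistent with the break-even example above:

```python
def managed_monthly(qps: float, price_per_1k: float = 0.02) -> float:
    return qps * 2_592_000 / 1000 * price_per_1k    # per-query billing

def selfhosted_monthly(qps: float, cluster: float = 6_000) -> float:
    return cluster            # flat, until you have to scale the cluster out

for qps in (1, 10, 100, 500):
    print(qps, managed_monthly(qps), selfhosted_monthly(qps))
# 1 QPS: $52 vs $6,000 -- managed wins. 500 QPS: $25,920 vs $6,000 -- it doesn't.
```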


When to actually pull this calculator out

  • Before picking a vector database vendor. Vendor lock-in for RAG infrastructure is expensive to undo at scale.
  • Before changing embedding models. The re-embedding cost is the hidden line item.
  • Before agreeing to a corpus expansion. A 10x corpus is rarely just a 10x cost (storage and embedding scale linearly, but operational complexity does not).
  • Quarterly. Embedding model prices drop and new vector DB pricing models appear. What was uneconomical six months ago may now be cheap.

The one-line takeaway

RAG cost is not "the LLM call at the end." It's the embedding pipeline you set up once, paid for, and never measured again. Model it before you commit, or it will quietly dominate your bill.

Run the RAG Vector DB Cost Calculator →



Part of the Plan Before You Build series on superml.dev: calculators for AI/ML architects who would rather do the math once than debug at 2am.

Tags: #AI #RAG #VectorDB #Embeddings #Pinecone #Milvus #Architecture #FinOps #MachineLearning


Want more enterprise AI architecture breakdowns?

Subscribe to SuperML.
