RAG Query Cost Calculator

Estimate the marginal cost of serving a RAG response by combining retrieved context size, model pricing, cache hit rate, and any reranker or vector database charges. Adjust the optional fields to mirror your actual architecture.

Chunks retrieved per query

Number of document chunks fetched for the prompt.

Average tokens per chunk

Tokens injected into the prompt from each chunk after truncation.

Generated tokens per answer

Expected output length in tokens.

LLM cost (USD per 1K tokens)

Per-token price of the generation model, e.g., $0.0020 per 1K tokens. The worked examples apply this single rate to prompt and completion tokens alike.

Embedding cost (USD per 1K tokens)

Optional. Default $0.0001 per 1K tokens for query embedding.

Cache hit rate (%)

Optional. Default 0% — portion of queries served from cache, skipping model + reranker.

Reranker/API cost per served query (USD)

Optional. Default $0.0000 — include reranker or external API fees per live request.

Vector store read cost (USD per 1K tokens)

Optional. Default $0.0000 — cost to read embeddings from your vector database.

Pricing differs between providers and changes over time. Update the cost inputs regularly and validate them against live billing dashboards for production systems.

Examples

8 chunks × 1,500 tokens, 600-token answer, $0.0020 LLM, $0.0001 embedding, 0% cache ⇒ Cost per query: $0.0264 (prompt 12,000 tokens + completion 600 tokens) • Cache savings worth $0.0000 per query • Model spend share 95.45% • 1K queries/week ≈ $26.40.
6 chunks × 900 tokens, 400-token answer, $0.0015 LLM, default $0.0001 embedding, cache hit 40%, reranker $0.0010, vector $0.0002 ⇒ Cost per query: $0.0074 (prompt 5,400 tokens + completion 400 tokens) • Cache savings worth $0.0039 per query • Model spend share 70.16% • 1K queries/week ≈ $7.44.
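
The arithmetic behind the two examples above can be sketched in Python. This is a reconstruction inferred from the published figures, not the calculator's actual source: the function and field names are hypothetical, and it assumes the embedding and vector store rates are both applied to the prompt tokens (chunks × tokens per chunk), since that is what reproduces the stated totals.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    """Per-query cost breakdown in USD (hypothetical structure)."""
    prompt_tokens: int
    completion_tokens: int
    model: float          # LLM spend, already scaled by the live (non-cached) share
    reranker: float       # reranker/API spend, scaled the same way
    embedding: float      # paid on every query
    vector: float         # paid on every query
    cache_savings: float  # model + reranker spend avoided by cache hits

    @property
    def total(self) -> float:
        return self.model + self.reranker + self.embedding + self.vector

    @property
    def model_share(self) -> float:
        return self.model / self.total if self.total else 0.0


def rag_query_cost(chunks, tokens_per_chunk, answer_tokens, llm_per_1k,
                   embedding_per_1k=0.0001, cache_hit_rate=0.0,
                   reranker_per_query=0.0, vector_per_1k=0.0) -> QueryCost:
    """Blended marginal cost per query; cache_hit_rate is a fraction in [0, 1]."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    prompt_tokens = chunks * tokens_per_chunk
    live_share = 1.0 - cache_hit_rate
    # One LLM rate covers prompt and completion tokens (as in the examples).
    full_model = (prompt_tokens + answer_tokens) / 1000 * llm_per_1k
    return QueryCost(
        prompt_tokens=prompt_tokens,
        completion_tokens=answer_tokens,
        model=full_model * live_share,
        reranker=reranker_per_query * live_share,
        # Assumption: both rates are charged on prompt tokens.
        embedding=prompt_tokens / 1000 * embedding_per_1k,
        vector=prompt_tokens / 1000 * vector_per_1k,
        cache_savings=cache_hit_rate * (full_model + reranker_per_query),
    )


ex1 = rag_query_cost(8, 1500, 600, 0.0020)
ex2 = rag_query_cost(6, 900, 400, 0.0015, cache_hit_rate=0.40,
                     reranker_per_query=0.0010, vector_per_1k=0.0002)
for ex in (ex1, ex2):
    print(f"${ex.total:.4f} per query, savings ${ex.cache_savings:.4f}, "
          f"model share {ex.model_share:.2%}, 1K queries ${ex.total * 1000:.2f}")
```

Under these assumptions the sketch matches both published results: $0.0264 and 95.45% for the first example, $0.0074, $0.0039 savings, and 70.16% for the second.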

FAQ

How do I include fixed infrastructure cost?

Use the per-query cost here for marginal spend and add a separate fixed monthly line item (e.g., hosting) when building your full P&L.
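
As a minimal sketch of that split, the snippet below adds a fixed monthly line item to the variable spend. The function name, the $200 hosting figure, and the 50,000 queries per month are hypothetical illustration values; only the $0.0264 per-query cost comes from the first example.

```python
def monthly_total(per_query_cost: float, queries_per_month: int,
                  fixed_monthly: float) -> float:
    """Fixed infrastructure plus marginal (per-query) spend, in USD."""
    return fixed_monthly + per_query_cost * queries_per_month


# Hypothetical inputs: $200/month hosting, 50,000 queries/month.
total = monthly_total(0.0264, 50_000, 200.0)
print(f"Monthly total: ${total:,.2f}")
print(f"Fully loaded cost per query: ${total / 50_000:.4f}")
```

Keeping the fixed term separate makes it easy to see how the fully loaded cost per query falls toward the marginal cost as volume grows.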

What if I use two models (rerank + generator)?

Enter the reranker or API fee in the dedicated field and keep the generator price in the LLM cost input. The calculator treats them separately so you can see the share of spend.

Can I model semantic cache warm-up?

Yes—change the cache hit rate to reflect expected hit ratios at different traffic levels (e.g., 10% during launch vs 60% steady state).
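
One way to compare warm-up stages is to sweep the hit rate while holding everything else constant. The sketch below assumes the cost splits into a part paid only on live requests (model and reranker) and a part paid on every query (embedding and vector reads), as the notes under Additional Information describe. The function name is hypothetical, and the two dollar inputs are derived from the second worked example.

```python
def blended_cost(live_cost: float, always_cost: float,
                 cache_hit_rate: float) -> float:
    """Average per-query cost when a fraction of queries is served from cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return (1.0 - cache_hit_rate) * live_cost + always_cost


# From the second example: model $0.0087 + reranker $0.0010 per live request,
# embedding $0.00054 + vector reads $0.00108 on every query.
LIVE_COST = 0.0087 + 0.0010
ALWAYS_COST = 0.00054 + 0.00108

for stage, rate in (("launch", 0.10), ("example", 0.40), ("steady state", 0.60)):
    cost = blended_cost(LIVE_COST, ALWAYS_COST, rate)
    print(f"{stage:>12}: {rate:.0%} hit rate, ${cost:.5f} per query")
```

At a 40% hit rate this gives the $0.00744 per query shown in the second example, so the launch and steady-state rows can be compared directly against it.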

Additional Information

Embedding cost covers fresh query embeddings; cached responses still require embedding unless you skip retrieval entirely.
A cache hit removes both model and reranker spend for that response while retaining vector and embedding costs.
Adjust the tokens per chunk to reflect truncation after re-ranking—overestimating inflates the marginal cost.
