RAG Query Cost Calculator

Track the marginal cost of serving a RAG response by combining retrieved context size, model pricing, cache hit rate, and any reranker or vector database charges. Adjust the optional fields to mirror your actual architecture.

  • Chunks: number of document chunks fetched for the prompt.
  • Tokens per chunk: tokens injected into the prompt from each chunk after truncation.
  • Answer length: expected output length in tokens.
  • LLM cost: marginal cost for the generation model, e.g., $0.0020 per 1K tokens.
  • Embedding cost: optional. Default $0.0001 per 1K tokens for query embedding.
  • Cache hit rate: optional. Default 0%. Portion of queries served from cache, skipping model and reranker.
  • Reranker fee: optional. Default $0.0000. Reranker or external API fees per live request.
  • Vector cost: optional. Default $0.0000. Cost to read embeddings from your vector database.

Pricing varies between providers and changes over time. Update the cost inputs whenever rates change, and validate against live billing dashboards for production systems.

Examples

  • 8 chunks × 1,500 tokens, 600-token answer, $0.0020 LLM, $0.0001 embedding, 0% cache ⇒ Cost per query: $0.0264 (prompt 12,000 tokens + completion 600 tokens); cache savings worth $0.0000 per query; model spend share 95.45%; 1K queries/week ≈ $26.40.
  • 6 chunks × 900 tokens, 400-token answer, $0.0015 LLM, cache hit 40%, reranker $0.0010, vector $0.0002 ⇒ Cost per query: $0.0074 (prompt 5,400 tokens + completion 400 tokens); cache savings worth $0.0039 per query; model spend share 70.16%; 1K queries/week ≈ $7.44.
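The arithmetic behind these two examples can be sketched in Python. The calculator's actual formula is not shown on this page, so the one below is inferred from the example figures and should be read as an assumption: the embedding and vector prices are both applied per 1K prompt tokens, the reranker fee is flat per live request, and a cache hit skips only the model and reranker. The names rag_query_cost and QueryCost are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    total: float          # expected cost per query, USD
    cache_savings: float  # spend avoided by cache hits, USD
    model_share: float    # fraction of total spent on the generation model


def rag_query_cost(chunks, tokens_per_chunk, answer_tokens, llm_per_1k,
                   embed_per_1k=0.0001, cache_hit=0.0,
                   reranker_fee=0.0, vector_per_1k=0.0):
    prompt_tokens = chunks * tokens_per_chunk
    # Generation model is billed on prompt plus completion tokens.
    model = (prompt_tokens + answer_tokens) / 1000 * llm_per_1k
    # Assumption (inferred from the examples): embedding and vector
    # prices both scale with prompt tokens, per 1K.
    embedding = prompt_tokens / 1000 * embed_per_1k
    vector = prompt_tokens / 1000 * vector_per_1k
    live = 1.0 - cache_hit  # share of queries that reach the model
    total = live * (model + reranker_fee) + embedding + vector
    savings = cache_hit * (model + reranker_fee)
    share = live * model / total if total else 0.0
    return QueryCost(total, savings, share)


first = rag_query_cost(8, 1500, 600, 0.0020)
second = rag_query_cost(6, 900, 400, 0.0015, cache_hit=0.40,
                        reranker_fee=0.0010, vector_per_1k=0.0002)
print(f"${first.total:.4f}  share {first.model_share:.2%}")
print(f"${second.total:.4f}  share {second.model_share:.2%}  "
      f"saved ${second.cache_savings:.4f}")
```

Under these assumptions the sketch reproduces the figures quoted in both examples; multiplying total by 1,000 gives the weekly estimate.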

FAQ

How do I include fixed infrastructure cost?

Use the per-query cost here for marginal spend and add a separate fixed monthly line item (e.g., hosting) when building your full P&L.
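As a rough illustration of that split, the sketch below adds a fixed monthly line item to the variable spend. The query volume and hosting figure are hypothetical placeholders, and monthly_spend is not part of the calculator.

```python
def monthly_spend(cost_per_query, queries_per_month, fixed_monthly):
    """Return (variable, total) monthly spend in USD."""
    variable = cost_per_query * queries_per_month
    return variable, variable + fixed_monthly


# Hypothetical inputs: $0.0264 per query, 50,000 queries, $300 hosting.
variable, total = monthly_spend(0.0264, 50_000, 300.0)
print(f"variable ${variable:,.2f} + fixed $300.00 = ${total:,.2f}")
```

Keeping the two terms separate makes it easy to see which one dominates as traffic grows.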

What if I use two models (rerank + generator)?

Enter the reranker or API fee in the dedicated field and keep the generator price in the LLM cost input. The calculator treats them separately so you can see the share of spend.

Can I model semantic cache warm-up?

Yes—change the cache hit rate to reflect expected hit ratios at different traffic levels (e.g., 10% during launch vs 60% steady state).
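A minimal sketch of that comparison, assuming a cache hit skips the model and reranker but not the embedding or vector lookups. The component costs are taken from the second example: $0.0087 generation and $0.0010 reranker, plus an assumed $0.00162 of embedding and vector cost per query. The function name cost_at_hit_rate is hypothetical.

```python
def cost_at_hit_rate(cacheable, always_on, hit_rate):
    """Expected per-query cost when hit_rate of queries skip the cacheable part."""
    return (1.0 - hit_rate) * cacheable + always_on


cacheable = 0.0087 + 0.0010  # generation model + reranker, USD per live query
always_on = 0.00162          # assumed embedding + vector cost, paid on every query

costs = {}
for label, rate in [("launch", 0.10), ("steady state", 0.60)]:
    costs[label] = cost_at_hit_rate(cacheable, always_on, rate)
    print(f"{label}: {rate:.0%} hits -> ${costs[label]:.5f} per query")
```

Running both scenarios side by side shows how much of the launch-phase cost disappears once the cache is warm.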

Additional Information

  • Embedding cost covers fresh query embeddings; cached responses still require embedding unless you skip retrieval entirely.
  • A cache hit removes both model and reranker spend for that response while retaining vector and embedding costs.
  • Adjust the tokens per chunk to reflect truncation after re-ranking—overestimating inflates the marginal cost.