RAG Latency Budget Calculator

Determine how much latency headroom remains after retrieval, generation, and post-processing when working toward a retrieval-augmented generation (RAG) service-level target.

Inputs

  • Latency SLA (ms): maximum end-to-end latency allowed for the experience.
  • Retrieval latency (ms): observed latency for vector search, rerankers, and data fetch.
  • Generation tokens: total tokens expected during generation (prompt plus completion).
  • Decoding throughput (tokens/s): effective decoding speed for the deployed model.
  • Post-processing (ms): time spent on guardrails, formatting, and delivery. Defaults to 80 ms.

This is an operational budgeting tool: validate the numbers against production observability dashboards before committing to SLAs.
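
The underlying arithmetic is simple enough to sketch in a few lines of Python; the function and argument names below are illustrative, not part of the calculator itself:

    def rag_latency_budget(sla_ms, retrieval_ms, tokens, tokens_per_s, post_ms=80.0):
        """Return generation time, total consumption, headroom, and percent of SLA consumed."""
        generation_ms = tokens / tokens_per_s * 1000.0    # decode time at the effective throughput
        consumed_ms = retrieval_ms + generation_ms + post_ms
        headroom_ms = sla_ms - consumed_ms                # negative means the SLA cannot be met
        return generation_ms, consumed_ms, headroom_ms, consumed_ms / sla_ms * 100.0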

Examples

  • Latency SLA 2,500 ms, retrieval 600 ms, 120 tokens, 80 tokens/s, post-processing 150 ms ⇒ Generation 1,500.00 ms, consumption 2,250.00 ms, headroom 250.00 ms.
  • Latency SLA 3,000 ms, retrieval 850 ms, 90 tokens, 120 tokens/s, post-processing left blank (80 ms default) ⇒ Generation 750.00 ms, consumption 1,680.00 ms, headroom 1,320.00 ms.
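
As a sanity check, both examples can be reproduced with plain arithmetic, independent of the calculator:

    # First example: 2,500 ms SLA, 600 ms retrieval, 120 tokens at 80 tokens/s, 150 ms post-processing.
    gen1 = 120 / 80 * 1000             # 1,500.00 ms generation
    used1 = 600 + gen1 + 150           # 2,250.00 ms consumed
    print(2500 - used1)                # 250.00 ms headroom

    # Second example: 3,000 ms SLA, 850 ms retrieval, 90 tokens at 120 tokens/s, default post-processing.
    gen2 = 90 / 120 * 1000             # 750.00 ms generation
    used2 = 850 + gen2 + 80            # 1,680.00 ms consumed (80 ms default post-processing)
    print(3000 - used2)                # 1,320.00 ms headroom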

FAQ

How do I incorporate streaming tokens?

Use the average streaming throughput instead of offline batch throughput, and consider modelling partial delivery latency separately for first-token metrics.
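
One way to split the budget for streaming, assuming a hypothetical time-to-first-token figure (ttft_ms) alongside the measured streaming throughput:

    def streaming_generation_ms(tokens, streaming_tokens_per_s, ttft_ms):
        """Time to first token plus decode time for the remaining tokens (illustrative split)."""
        remaining = max(tokens - 1, 0)
        return ttft_ms + remaining / streaming_tokens_per_s * 1000.0

    # 120 tokens at 80 tokens/s streaming with a 300 ms first-token latency (made-up numbers).
    print(streaming_generation_ms(120, 80, 300))  # 1787.5 ms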

Can I budget for multiple retrievers?

Yes. Sum the latency of each retriever stage—vector search, rerankers, graph lookups—before entering the total retrieval latency.
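
For example (stage names and latencies below are placeholders):

    retrieval_stages_ms = {"vector_search": 220, "reranker": 310, "graph_lookup": 120}
    total_retrieval_ms = sum(retrieval_stages_ms.values())  # 650 ms goes into the retrieval field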

What about variability?

Track P95 or P99 latencies for each component and re-run the calculator with those values to ensure your SLA covers tail behaviour.
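
A sketch of that tail-latency re-run using Python's statistics module; the sample latencies are invented and would normally come from your tracing backend:

    import statistics

    # Per-request retrieval latencies in ms (placeholder sample).
    samples_ms = [540, 610, 580, 905, 560, 720, 1150, 600, 630, 890, 575, 610]

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95_retrieval_ms = statistics.quantiles(samples_ms, n=100)[94]

    # Feed the P95 figure (and P95 throughput / post-processing) back into the budget.
    print(f"P95 retrieval: {p95_retrieval_ms:.1f} ms")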

Additional Information

  • Result units: milliseconds for the latency components, plus the share of the SLA consumed expressed as a percentage.
  • The token count should include both the prompt and the estimated completion to avoid under-budgeting generation time.
  • Post-processing covers guardrails, formatting, and API egress; adjust to match your middleware stack.