RAG Latency Budget Calculator

Determine how much latency headroom remains after retrieval, generation, and post-processing when working toward a retrieval-augmented generation (RAG) service-level target.

Inputs

  • Latency SLA (ms): maximum end-to-end latency allowed for the experience.
  • Retrieval latency (ms): observed latency for vector search, rerankers, and data fetch.
  • Generation tokens: total tokens expected during generation (prompt plus completion).
  • Decoding throughput (tokens/s): effective decoding speed for the deployed model.
  • Post-processing (ms): time spent on guardrails, formatting, and delivery. Defaults to 80 ms.

This is an operational budgeting tool: validate the numbers against production observability dashboards before committing to SLAs.
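
The underlying arithmetic is simple enough to sketch in a few lines of Python; the function and argument names below are illustrative, not part of the calculator itself:

    def rag_latency_budget(sla_ms, retrieval_ms, tokens, tokens_per_s, post_ms=80.0):
        """Return generation time, total consumption, headroom, and percent of SLA consumed."""
        generation_ms = tokens / tokens_per_s * 1000.0    # decode time at the effective throughput
        consumed_ms = retrieval_ms + generation_ms + post_ms
        headroom_ms = sla_ms - consumed_ms                # negative means the SLA cannot be met
        return generation_ms, consumed_ms, headroom_ms, consumed_ms / sla_ms * 100.0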

Examples

  • Latency SLA 2,500 ms, retrieval 600 ms, 120 tokens, 80 tokens/s, post-processing 150 ms ⇒ Generation 1,500.00 ms, consumption 2,250.00 ms, headroom 250.00 ms.
  • Latency SLA 3,000 ms, retrieval 850 ms, 90 tokens, 120 tokens/s, post-processing left blank (80 ms default) ⇒ Generation 750.00 ms, consumption 1,680.00 ms, headroom 1,320.00 ms.
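
As a sanity check, both examples can be reproduced with plain arithmetic, independent of the calculator:

    # First example: 2,500 ms SLA, 600 ms retrieval, 120 tokens at 80 tokens/s, 150 ms post-processing.
    gen1 = 120 / 80 * 1000             # 1,500.00 ms generation
    used1 = 600 + gen1 + 150           # 2,250.00 ms consumed
    print(2500 - used1)                # 250.00 ms headroom

    # Second example: 3,000 ms SLA, 850 ms retrieval, 90 tokens at 120 tokens/s, default post-processing.
    gen2 = 90 / 120 * 1000             # 750.00 ms generation
    used2 = 850 + gen2 + 80            # 1,680.00 ms consumed (80 ms default post-processing)
    print(3000 - used2)                # 1,320.00 ms headroom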

FAQ

How do I incorporate streaming tokens?

Use the average streaming throughput instead of offline batch throughput, and consider modelling partial delivery latency separately for first-token metrics.
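
One way to split the budget for streaming, assuming a hypothetical time-to-first-token figure (ttft_ms) alongside the measured streaming throughput:

    def streaming_generation_ms(tokens, streaming_tokens_per_s, ttft_ms):
        """Time to first token plus decode time for the remaining tokens (illustrative split)."""
        remaining = max(tokens - 1, 0)
        return ttft_ms + remaining / streaming_tokens_per_s * 1000.0

    # 120 tokens at 80 tokens/s streaming with a 300 ms first-token latency (made-up numbers).
    print(streaming_generation_ms(120, 80, 300))  # 1787.5 ms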

Can I budget for multiple retrievers?

Yes. Sum the latency of each retriever stage—vector search, rerankers, graph lookups—before entering the total retrieval latency.
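
For example (stage names and latencies below are placeholders):

    retrieval_stages_ms = {"vector_search": 220, "reranker": 310, "graph_lookup": 120}
    total_retrieval_ms = sum(retrieval_stages_ms.values())  # 650 ms goes into the retrieval field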

What about variability?

Track P95 or P99 latencies for each component and re-run the calculator with those values to ensure your SLA covers tail behaviour.
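
A sketch of that tail-latency re-run using Python's statistics module; the sample latencies are invented and would normally come from your tracing backend:

    import statistics

    # Per-request retrieval latencies in ms (placeholder sample).
    samples_ms = [540, 610, 580, 905, 560, 720, 1150, 600, 630, 890, 575, 610]

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95_retrieval_ms = statistics.quantiles(samples_ms, n=100)[94]

    # Feed the P95 figure (and P95 throughput / post-processing) back into the budget.
    print(f"P95 retrieval: {p95_retrieval_ms:.1f} ms")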

Additional Information

  • Result units: milliseconds for the latency components, plus the share of the SLA consumed expressed as a percentage.
  • The token count should include both the prompt and the estimated completion to avoid under-budgeting generation time.
  • Post-processing covers guardrails, formatting, and API egress; adjust to match your middleware stack.