GPU Inference Cost per 1K Requests Calculator
Quantify inference serving costs by combining GPU pricing, workload demand, and per-GPU throughput ceiling. Enter hourly cost, request volume, tokens per request, and sustained tokens per second to see the cost per 1,000 requests, cost per million tokens, GPU hours consumed each hour, utilization headroom or shortfall, and projected monthly spend across your billing window.
Results approximate infrastructure costs. Validate throughput measurements and cloud billing details before budgeting.
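The arithmetic behind these outputs can be sketched in a few lines of Python. The function and field names below are illustrative, not the calculator's actual implementation; the sketch follows the rules described under Additional Information (at least one full GPU-hour billed per hour, throughput converted to an hourly ceiling, GPU hours scaled proportionally when demand exceeds capacity).

```python
def inference_costs(hourly_rate, requests_per_hour, tokens_per_request,
                    tokens_per_second, billing_hours=720):
    """Approximate the calculator's outputs from the four workload inputs."""
    demand_tokens_hr = requests_per_hour * tokens_per_request    # hourly token demand
    capacity_tokens_hr = tokens_per_second * 3600                # per-GPU hourly ceiling
    gpu_hours = max(1.0, demand_tokens_hr / capacity_tokens_hr)  # scale out when demand exceeds one GPU
    hourly_cost = hourly_rate * gpu_hours
    return {
        "cost_per_1k_requests": hourly_cost / requests_per_hour * 1_000,
        "cost_per_million_tokens": hourly_cost / demand_tokens_hr * 1_000_000,
        "gpu_hours_per_hour": gpu_hours,
        "utilization_pct": demand_tokens_hr / (gpu_hours * capacity_tokens_hr) * 100,
        "monthly_spend": hourly_cost * billing_hours,
        "tokens_per_dollar": demand_tokens_hr / hourly_cost,
    }
```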
Examples
- $3.60/hr GPU, 900 requests/hr, 2,000 tokens each, 160,000 tokens/sec ⇒ Cost per 1,000 requests: $4.00 • Cost per million tokens: $2.00 • GPU hours consumed each hour: 1.00 • Approximate GPUs needed: 1.00 • Capacity utilization: 0.31% • Monthly GPU spend (@720 hours): $2,592.00 • Tokens served per dollar: 500,000.00 • Idle headroom remaining: 99.69% (worked through in the snippet after this list)
- $7.25/hr GPU, 1,800 requests/hr, 4,000 tokens each, 1,250 tokens/sec, 500 billing hours ⇒ Cost per 1,000 requests: $6.44 • Cost per million tokens: $1.61 • GPU hours consumed each hour: 1.60 • Approximate GPUs needed: 1.60 • Throughput shortfall: requires 1.60 GPU-hours each hour • Monthly GPU spend (@500 hours): $5,800.00 • Tokens served per dollar: 620,689.66 • Consider sharding across more GPUs or trimming token counts to avoid latency penalties.
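Feeding the first example's inputs into the `inference_costs` sketch above reproduces its headline figures:

```python
result = inference_costs(hourly_rate=3.60, requests_per_hour=900,
                         tokens_per_request=2_000, tokens_per_second=160_000)
print(round(result["cost_per_1k_requests"], 2))     # 4.0   -> $4.00 per 1,000 requests
print(round(result["cost_per_million_tokens"], 2))  # 2.0   -> $2.00 per million tokens
print(round(result["utilization_pct"], 2))          # 0.31  -> percent of one GPU's hourly ceiling
print(round(result["monthly_spend"], 2))            # 2592.0 at the 720-hour default
```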
FAQ
How do I compare different GPU SKUs?
Run the calculator for each GPU option with its own hourly rate and throughput, then compare cost per 1,000 requests, tokens served per dollar, and monthly spend to choose the best fit for your latency target.
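As a sketch, this reuses the `inference_costs` function from the introduction with two hypothetical SKUs; the names, rates, and throughputs below are made up for illustration, not vendor pricing.

```python
# Hypothetical SKUs: (name, $/hr, sustained tokens/sec)
skus = [("gpu_small", 3.60, 160_000), ("gpu_large", 7.25, 300_000)]
for name, rate, tps in skus:
    r = inference_costs(rate, requests_per_hour=900, tokens_per_request=2_000, tokens_per_second=tps)
    print(f'{name}: ${r["cost_per_1k_requests"]:.2f} per 1,000 requests, '
          f'{r["tokens_per_dollar"]:,.0f} tokens per dollar, ${r["monthly_spend"]:,.2f}/month')
```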
Can I incorporate networking or storage charges?
Add those costs to the GPU hourly rate before running the calculation, or prorate them separately and stack them on top of the monthly spend figure for an all-in infrastructure view.
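For example, assuming a hypothetical $0.40/hr networking charge and a $150/month storage bill on top of the first example's GPU:

```python
all_in_hourly_rate = 3.60 + 0.40                   # fold per-hour overheads into the hourly cost input
all_in_monthly     = (3.60 + 0.40) * 720 + 150.00  # or stack prorated charges onto the monthly spend figure
```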
What if I autoscale down when idle?
Reduce the billing hours per month input to reflect actual runtime. For example, if you shut down half the day, enter 360 hours instead of the 720-hour default, or model weekday/weekend regimes separately.
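Using the first example's $3.60/hr GPU at 1.00 GPU-hour per hour:

```python
always_on = 3.60 * 1.00 * 720   # $2,592.00 with the default 720 billing hours
half_day  = 3.60 * 1.00 * 360   # $1,296.00 when the fleet is shut down half of each day
```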
How should I treat burstable workloads with traffic spikes?
Model peak-hour demand to size capacity, then rerun the calculator with average demand to estimate utilization. The gap between the two runs shows idle headroom you could reclaim with queuing or request-level throttling, as in the sketch below.
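A sketch of that workflow, again reusing `inference_costs`, with a hypothetical traffic profile of 3,000 requests/hr at peak versus 1,200 requests/hr on average on the second example's GPU:

```python
peak = inference_costs(7.25, requests_per_hour=3_000, tokens_per_request=4_000, tokens_per_second=1_250)
avg  = inference_costs(7.25, requests_per_hour=1_200, tokens_per_request=4_000, tokens_per_second=1_250)
# Provision for the peak, then see how much of that capacity average traffic actually uses.
headroom_pct = 100 * (1 - avg["gpu_hours_per_hour"] / peak["gpu_hours_per_hour"])  # ~60% idle at average load
```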
Additional Information
- Cost per 1,000 requests assumes you provision enough GPU capacity to serve the full workload for the entire billing hour.
- Tokens per second throughput converts to an hourly ceiling so you can see whether latency constraints force additional GPUs.
- When throughput demand exceeds capacity, the calculator increases GPU hours proportionally to model scaling out the cluster.
- Monthly spend defaults to 720 billing hours (30 days × 24 hours) unless you supply a custom runtime or autoscaling schedule.
- Tokens served per dollar indicates how efficiently you convert GPU spend into usable inference output for finance partners.