GPU Spot vs On-Demand Breakeven

Decide whether GPU spot capacity still beats on-demand pricing after accounting for interruptions. Enter the hourly rates, the planned runtime, and the downtime hours you expect to cover with on-demand fallback. Optionally add a reserved rate to see when a commitment outperforms the mixed spot strategy. The results also show how many downtime hours would erase the spot discount.

Costs exclude data transfer, premium storage, and orchestration overhead. Refresh pricing regularly, because spot discounts fluctuate with regional demand.

Examples

$4.10 on-demand, $1.45 spot, 250 planned hours, 12 downtime hours, reserved blank ⇒ Spot plus fallback spending totals $411.70, or $1.65/hr across 250 planned hours. Going on-demand for all work would cost $1,025.00, so spot saves $613.30 (59.83%). Your effective uptime is 95.4%. You can absorb up to 161.6 downtime hours before spot loses parity. Add a reserved rate to benchmark against committed-use discounts.
$2.30 on-demand, $0.95 spot, 720 planned hours, 48 downtime hours, $1.05 reserved ⇒ Spot plus fallback spending totals $794.40, or $1.10/hr across 720 planned hours. Going on-demand would cost $1,656.00, so spot saves $861.60 (52.03%). Your effective uptime is 93.8%. You can absorb up to 422.6 downtime hours before spot loses parity. A reserved commitment at $1.05/hr would cost $756.00, undercutting the spot strategy by $38.40.
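
The arithmetic behind both examples can be reproduced with a short Python sketch. It assumes spot is billed for every planned hour, each downtime hour is an additional hour bought at the on-demand rate, and uptime is planned hours divided by planned plus downtime hours. These rules are inferred from the example figures rather than taken from the calculator's source, and the names `spot_breakeven` and `Breakeven` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Breakeven:
    blended_total: float             # spot for planned hours + on-demand fallback
    blended_per_hour: float          # blended_total spread over planned hours
    on_demand_total: float           # every planned hour at the on-demand rate
    savings: float                   # on_demand_total - blended_total
    savings_pct: float               # savings as a % of on_demand_total
    uptime_pct: float                # planned / (planned + downtime)
    breakeven_downtime: float        # downtime hours that erase the savings
    reserved_total: Optional[float]  # planned hours at the reserved rate, if given


def spot_breakeven(on_demand: float, spot: float, planned_hours: float,
                   downtime_hours: float,
                   reserved: Optional[float] = None) -> Breakeven:
    # Assumed model (matches the worked examples): spot bills every planned
    # hour, and each downtime hour is an extra hour bought at on-demand.
    blended = spot * planned_hours + on_demand * downtime_hours
    baseline = on_demand * planned_hours
    savings = baseline - blended
    # Solve spot*H + on_demand*D = on_demand*H for D; clamp at zero when
    # spot is at or above on-demand, since no downtime can be absorbed then.
    breakeven = max(0.0, (on_demand - spot) * planned_hours / on_demand)
    return Breakeven(
        blended_total=blended,
        blended_per_hour=blended / planned_hours,
        on_demand_total=baseline,
        savings=savings,
        savings_pct=100 * savings / baseline,
        uptime_pct=100 * planned_hours / (planned_hours + downtime_hours),
        breakeven_downtime=breakeven,
        reserved_total=None if reserved is None else reserved * planned_hours,
    )


r = spot_breakeven(4.10, 1.45, 250, 12)
print(f"${r.blended_total:,.2f} total, saves ${r.savings:,.2f} "
      f"({r.savings_pct:.2f}%), uptime {r.uptime_pct:.1f}%, "
      f"breakeven {r.breakeven_downtime:.1f} h")
# prints: $411.70 total, saves $613.30 (59.83%), uptime 95.4%, breakeven 161.6 h
```

Calling `spot_breakeven(2.30, 0.95, 720, 48, reserved=1.05)` reproduces the second example, including the $756.00 reserved total that undercuts the $794.40 spot strategy by $38.40.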

FAQ

How should I value interruption penalties for training jobs?

Include the time needed to reload checkpoints, replay epochs, or rebalance data sharding. Stateful training usually has higher downtime overhead than stateless inference jobs.
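
One way to turn interruption counts into downtime hours is sketched below. The hypothetical helper `estimate_downtime_hours` assumes every reclaim costs a fixed restore time (reloading checkpoints, rebalancing shards) plus, on average, half a checkpoint interval of lost progress to replay. That average holds only if interruptions land at random points between checkpoints.

```python
def estimate_downtime_hours(interruptions: int, restore_minutes: float,
                            checkpoint_interval_minutes: float) -> float:
    """Hypothetical helper: expected on-demand fallback hours from spot reclaims.

    Assumes each reclaim costs a fixed restore time plus, on average, half a
    checkpoint interval of lost progress that must be replayed.
    """
    if interruptions < 0 or restore_minutes < 0 or checkpoint_interval_minutes < 0:
        raise ValueError("inputs must be non-negative")
    # Average replay: interruptions are assumed to land uniformly between checkpoints.
    per_reclaim_minutes = restore_minutes + checkpoint_interval_minutes / 2
    return interruptions * per_reclaim_minutes / 60


# Stateful training: 8 reclaims, 15-minute restore, hourly checkpoints.
print(estimate_downtime_hours(8, 15, 60))  # 6.0
# Stateless inference: 8 reclaims, 3-minute restart, nothing to replay.
print(estimate_downtime_hours(8, 3, 0))    # 0.4
```

The gap between the two results illustrates why stateful training usually carries more downtime overhead. Substitute your own measured restore and replay times before entering the result as downtime hours.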

Can I blend spot and on-demand within a Kubernetes cluster?

Yes. Enter the spot hours you expect as planned hours, and count on-demand fallback usage for high-priority pods as downtime hours. The breakeven downtime then shows how much disruption you can tolerate before shifting more nodes to on-demand or reserved commitments.

Do preemptible GPU limits change the math?

Yes, indirectly. Provider quotas cap how many spot (preemptible) GPUs you can run at once. If a quota forces part of the footprint onto on-demand, reduce your planned spot hours accordingly.
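
If a quota caps concurrent spot GPUs, a sketch like the following splits planned GPU-hours into a spot portion and an on-demand remainder. The helper `split_hours_by_quota` is hypothetical and assumes the job holds a constant number of GPUs for its whole run.

```python
def split_hours_by_quota(planned_gpu_hours: float, concurrent_gpus: int,
                         spot_gpu_quota: int) -> tuple[float, float]:
    """Hypothetical helper: return (spot_hours, on_demand_hours).

    Assumes the job runs `concurrent_gpus` GPUs for its entire duration and
    at most `spot_gpu_quota` of them can be spot at the same time.
    """
    if concurrent_gpus <= 0:
        raise ValueError("concurrent_gpus must be positive")
    # GPUs beyond the quota have to run on-demand for the whole job.
    spot_gpus = max(0, min(concurrent_gpus, spot_gpu_quota))
    spot_hours = planned_gpu_hours * spot_gpus / concurrent_gpus
    return spot_hours, planned_gpu_hours - spot_hours


print(split_hours_by_quota(800, 8, 6))  # (600.0, 200.0)
```

Here the 600 spot hours become the calculator's planned hours. The 200 on-demand hours add the same amount to both the spot and the all-on-demand totals, so they do not change the savings.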

How can I model diversification across regions or clouds?

Run the calculator for each market's pricing and interruption profile, then weight the results by how you plan to split training jobs. This surfaces where incremental spot capacity still beats reserved commitments.
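
The weighting step can be sketched as follows. `Market`, `diversify`, and the `downtime_ratio` field (fallback hours per planned spot hour) are hypothetical names, and each market is priced with the same spot-plus-fallback model the worked examples use. Each market's blended rate is then compared with its reserved rate to flag where spot still wins. All prices below are illustrative, not real quotes.

```python
from typing import NamedTuple


class Market(NamedTuple):
    """One region or cloud; every field is an illustrative input."""
    name: str
    on_demand: float       # $/hr
    spot: float            # $/hr
    reserved: float        # $/hr committed-use rate
    downtime_ratio: float  # assumed fallback hours per planned spot hour
    share: float           # fraction of planned hours routed to this market


def diversify(markets: list[Market], planned_hours: float):
    """Weight per-market spot-plus-fallback costs by each market's job share."""
    if abs(sum(m.share for m in markets) - 1.0) > 1e-9:
        raise ValueError("market shares must sum to 1")
    total = 0.0
    report = []
    for m in markets:
        hours = planned_hours * m.share
        if hours == 0:
            continue  # nothing routed here, so nothing to compare
        # Spot for the planned hours, on-demand for the fallback hours.
        cost = m.spot * hours + m.on_demand * hours * m.downtime_ratio
        total += cost
        rate = cost / hours
        report.append((m.name, rate, rate < m.reserved))
    return total, report


total, report = diversify([
    Market("region-a", on_demand=4.00, spot=1.50, reserved=1.60,
           downtime_ratio=0.05, share=0.5),
    Market("region-b", on_demand=3.00, spot=1.00, reserved=1.40,
           downtime_ratio=0.10, share=0.5),
], planned_hours=100)
for name, rate, beats_reserved in report:
    print(f"{name}: ${rate:.2f}/hr, beats reserved: {beats_reserved}")
# region-a: $1.70/hr, beats reserved: False
# region-b: $1.30/hr, beats reserved: True
```

In this made-up split, extra spot capacity pays off only in region-b. In region-a, a reserved commitment would be cheaper.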

Additional Information

Downtime hours model replays, checkpoint restores, or manual recovery when spot VMs are reclaimed.
Breakeven downtime solves for how many hours of interruptions erase the savings versus running 100% on-demand.
Reserved rate comparisons help decide between sustained-use commitments and opportunistic spot fleets.
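If spot rates reach or exceed on-demand, the breakeven downtime drops to zero hours. At equal rates, spot only ties on-demand when there are no interruptions, and above on-demand it loses even then. The calculator flags this case as the signal to cap bids or shift workloads.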
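
The relationships above can be written compactly. The symbols are introduced here for illustration (H for planned hours, D for downtime hours, and p_o, p_s, p_r for the on-demand, spot, and reserved hourly rates). The formulas are inferred from the worked examples rather than quoted from the calculator.

```latex
\begin{aligned}
C_{\text{mix}} &= p_s H + p_o D, \qquad C_{\text{od}} = p_o H, \qquad C_{\text{res}} = p_r H \\
\text{savings} &= \frac{C_{\text{od}} - C_{\text{mix}}}{C_{\text{od}}}, \qquad \text{uptime} = \frac{H}{H + D} \\
p_s H + p_o D^{*} = p_o H \;&\Longrightarrow\; D^{*} = \max\!\left(0,\ \frac{(p_o - p_s)\,H}{p_o}\right)
\end{aligned}
```

For the first example, D* = (4.10 - 1.45) × 250 / 4.10 ≈ 161.6 hours, matching the reported breakeven. When p_s ≥ p_o, the fraction is zero or negative, so the clamp returns zero.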