LLM Training Compute Cost Calculator

Project the total cost of fine-tuning large language models by combining GPU runtime, electricity, and contingency overheads. Enter your model size, dataset volume, measured throughput, and pricing assumptions to obtain a defensible training budget in USD.

Inputs

  • Trainable parameters: number of parameters being optimised, expressed in billions.
  • Training tokens: total tokens across all epochs, expressed in billions.
  • FLOPs per token: typical transformer fine-tuning needs ~6 × parameters FLOPs per token (forward plus backward pass).
  • Throughput per GPU: sustained per-accelerator throughput in teraFLOPs per second, before the utilisation factor is applied.
  • GPU count: accelerators working in parallel during the run.
  • GPU-hour rate: contracted price per GPU-hour, including infrastructure markup.
  • Utilisation: average percentage of the stated throughput actually achieved; also scales per-GPU power draw.
  • TDP per GPU: thermal design power per GPU, in kilowatts.
  • PUE: facility overhead multiplier covering cooling and power distribution.
  • Electricity price: blended electricity price per kilowatt-hour.
  • Overhead: contingency applied to GPU spend for orchestration, storage, or labour premiums.
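
A minimal sketch of these inputs as a typed record, using the first example's values as defaults. The CostInputs name and every field name are illustrative, not the tool's actual identifiers.

    from dataclasses import dataclass

    @dataclass
    class CostInputs:
        params_billion: float = 7.0         # trainable parameters (billions)
        tokens_billion: float = 30.0        # training tokens, all epochs (billions)
        flops_per_param_token: float = 6.0  # ~6 FLOPs per parameter per token
        tflops_per_gpu: float = 90.0        # per-GPU throughput (TFLOPs/s)
        gpu_count: int = 32                 # accelerators in parallel
        usd_per_gpu_hour: float = 1.80      # contracted GPU-hour price
        utilisation: float = 0.85           # fraction of stated throughput achieved
        tdp_kw: float = 0.3                 # thermal design power per GPU (kW)
        pue: float = 1.2                    # facility power usage effectiveness
        usd_per_kwh: float = 0.09           # blended electricity price
        overhead_pct: float = 10.0          # contingency on GPU spend (%)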

Educational information, not financial advice.

Examples

  • 7B parameters, 30B tokens, 90 TFLOPs/s, 32 GPUs at $1.80/hr, 85% utilisation, 0.3 kW TDP, PUE 1.2, $0.09/kWh, 10% overhead ⇒ $9,184.82 (unpacked step by step below)
  • 65B parameters, 200B tokens, 180 TFLOPs/s, 128 GPUs at $3.50/hr, 88% utilisation, 0.5 kW TDP, PUE 1.35, $0.12/kWh, 20% overhead ⇒ $584,244.95
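
The first example unpacks as follows. A minimal sketch under the model described in Additional Information, assuming utilisation scales both throughput and power draw and that the overhead percentage applies to GPU spend only:

    # 7B params, 30B tokens, 90 TFLOPs/s, 32 GPUs at $1.80/hr, 85% utilisation,
    # 0.3 kW TDP, PUE 1.2, $0.09/kWh, 10% overhead.
    total_flops = 6 * 7e9 * 30e9                # 1.26e21 training FLOPs
    cluster_flops = 90e12 * 0.85 * 32           # 2.448e15 sustained FLOP/s
    hours = total_flops / cluster_flops / 3600  # ~142.97 h wall clock
    gpu_cost = hours * 32 * 1.80                # ~4,575 GPU-hours -> $8,235.29
    kw_at_meter = 32 * 0.3 * 0.85 * 1.2         # 9.792 kW with PUE applied
    electricity = kw_at_meter * hours * 0.09    # 1,400 kWh -> $126.00
    overhead = 0.10 * gpu_cost                  # $823.53 contingency
    print(round(gpu_cost + electricity + overhead, 2))  # 9184.82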

FAQ

How should I estimate effective throughput per GPU?

Use profiler measurements from trial runs or vendor benchmarks that reflect your framework, precision, and parallelism strategy rather than theoretical peak TFLOPs.
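
If your profiler reports tokens per second rather than TFLOPs, the conversion below yields this field. A minimal sketch assuming the same ~6 FLOPs-per-parameter-per-token rule; the measured rate is hypothetical:

    # Convert a profiled training rate into achieved TFLOPs/s per GPU.
    params = 7e9             # trainable parameters
    tokens_per_sec = 58_000  # cluster-wide rate from a trial run (hypothetical)
    n_gpus = 32
    tflops_per_gpu = 6 * params * tokens_per_sec / n_gpus / 1e12
    print(f"{tflops_per_gpu:.1f} TFLOPs/s per GPU")  # ~76.1, consistent with
    # roughly 90 TFLOPs/s at ~85% utilisation, as in the first example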

Can I model parameter-efficient fine-tuning techniques?

Yes. Reduce the Trainable Parameters field to the subset updated by adapters or LoRA weights so the FLOPs requirement reflects the smaller optimisation footprint.
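
As a rough way to size that field for LoRA: a rank-r adapter on a d × k weight matrix adds r × (d + k) trainable parameters. The sketch below assumes adapters on the four attention projections of a hypothetical 32-layer, 4096-wide model; all dimensions are illustrative.

    def lora_trainable_params(n_layers: int, d_model: int, rank: int,
                              adapted_matrices_per_layer: int = 4) -> int:
        # Each rank-r adapter on a square d_model x d_model projection
        # adds rank * (d_model + d_model) trainable parameters.
        return n_layers * adapted_matrices_per_layer * rank * (d_model + d_model)

    print(lora_trainable_params(32, 4096, rank=16) / 1e9)  # ~0.017 billion

Entering ~0.017 rather than 7 in the Trainable parameters field scales the FLOPs estimate, and therefore the GPU-hour projection, down by the same factor.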

What overheads belong in the contingency percentage?

Include orchestration software, checkpoint storage, experiment tracking, quality assurance, and on-call labour—items that scale with training but are not billed as GPU hours.

Additional Information

  • The model converts parameter and token counts into total training FLOPs, then divides by sustained throughput to estimate GPU-hours before applying your hourly rate, as sketched in the code after this list.
  • Energy consumption accounts for utilisation, per-GPU thermal design power, total runtime, and facility PUE so electricity spend scales with operational reality.
  • Adjust the overhead percentage to capture orchestration tooling, checkpoint storage, evaluation clusters, or staffing costs not represented in direct GPU pricing.
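
The three bullets above combine into a single computation. A minimal sketch (the function name and argument order are illustrative, not the tool's code) that reproduces both example totals:

    def training_cost_usd(params_b, tokens_b, tflops, gpus, usd_per_gpu_hour,
                          util, tdp_kw, pue, usd_per_kwh, overhead_pct):
        """GPU spend + electricity + contingency, per the bullets above."""
        flops = 6.0 * params_b * 1e9 * tokens_b * 1e9         # total training FLOPs
        hours = flops / (tflops * 1e12 * util * gpus) / 3600  # wall-clock runtime
        gpu_cost = hours * gpus * usd_per_gpu_hour            # direct GPU spend
        electricity = gpus * tdp_kw * util * pue * hours * usd_per_kwh
        return gpu_cost + electricity + gpu_cost * overhead_pct / 100

    # Matches the worked examples above:
    print(round(training_cost_usd(7, 30, 90, 32, 1.80, 0.85, 0.3, 1.2, 0.09, 10), 2))
    # 9184.82
    print(round(training_cost_usd(65, 200, 180, 128, 3.50, 0.88, 0.5, 1.35, 0.12, 20), 2))
    # 584244.95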