LLM Training Compute Cost Calculator

Estimate the cash required to pre-fund an LLM training or fine-tuning run. Feed in model size, data volume, sustained GPU throughput, utilisation, energy pricing, and contingency buffers to forecast an all-in dollar budget before you reserve hardware.

Enter the number of billions of weights updated during training (e.g., 7 for a 7B model).
Billions of tokens processed across all epochs or fine-tuning passes.
Multiplier linking parameters and tokens to total floating-point operations (often 6 for dense transformers).
Sustained per-accelerator throughput in TFLOPs/s, taken from profiling rather than peak spec-sheet numbers.
How many accelerators train simultaneously (round up to the nearest whole GPU).
Hourly rental or amortised cost per GPU, in your billing currency.
Average utilisation rate of each GPU; include pipeline bubbles, I/O waits, and comms overhead.
Average electrical load per GPU in kilowatts while training at the utilisation listed.
Power usage effectiveness for the facility (1.2–1.5 for modern hyperscalers, higher on-prem).
Utility or contract power cost per kilowatt-hour.
Percent uplift covering orchestration software, staff, storage, QA, and surprise reruns.

Educational information, not professional advice.

Examples

  • Adapter fine-tune: 7B parameters × 30B tokens, 32 H100s at $1.80/hr, 90 TFLOPs/s sustained, 85% utilisation, 0.30 kW, 1.2 PUE, $0.09/kWh, 10% overhead ⇒ $9,184.82 projected cost
  • Foundation training: 65B parameters × 200B tokens, 128 GPUs at $3.50/hr, 180 TFLOPs/s sustained, 88% utilisation, 0.50 kW, 1.35 PUE, $0.12/kWh, 20% overhead ⇒ $584,244.95 budget
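
The first example above can be reproduced step by step. A minimal sketch follows, assuming the contingency uplift applies to the GPU rental line and the energy line is added afterwards (the combination that lands on the quoted $9,184.82); the live calculator's exact rounding may differ.

    # Walk through the adapter fine-tune example (7B params, 30B tokens, 32 GPUs).
    params = 7e9               # trainable parameters
    tokens = 30e9              # training tokens
    flops_mult = 6             # FLOPs multiplier for dense transformers
    gpus = 32
    tflops = 90                # sustained TFLOPs/s per GPU
    util = 0.85
    gpu_hour_rate = 1.80       # $/GPU-hour
    gpu_kw = 0.30              # average draw per GPU, kW
    pue = 1.2
    kwh_price = 0.09           # $/kWh
    overhead = 0.10            # contingency uplift

    total_flops = params * tokens * flops_mult                           # 1.26e21 FLOPs
    runtime_hours = total_flops / (gpus * tflops * 1e12 * util) / 3600   # ~143 h
    gpu_hours = runtime_hours * gpus                                     # ~4,575 GPU-hours

    gpu_cost = gpu_hours * gpu_hour_rate                                 # ~$8,235
    energy_kwh = gpu_kw * util * gpu_hours * pue                         # ~1,400 kWh
    energy_cost = energy_kwh * kwh_price                                 # ~$126

    total = gpu_cost * (1 + overhead) + energy_cost
    print(f"${total:,.2f}")                                              # ~$9,184.82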

FAQ

How should I estimate effective throughput per GPU?

Profile a smaller rehearsal run using the same model parallelism, precision, and kernels. Peak specs overstate real throughput by 10–30% because of networking, memory, and scheduler overhead.
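
If you only have an end-to-end measurement, you can back into the throughput figure from tokens per second. A minimal sketch with hypothetical rehearsal numbers (9,000 tokens/s across 8 GPUs on a 7B dense model):

    # Convert a profiled rehearsal run into the sustained-throughput input.
    # Measurement values below are hypothetical; substitute your own profile.
    params = 7e9                       # dense model size
    flops_per_token = 6 * params       # same multiplier as the cost formula
    measured_tokens_per_s = 9_000      # end-to-end tokens/s across the job
    n_gpus = 8

    achieved = measured_tokens_per_s * flops_per_token / n_gpus / 1e12
    print(f"{achieved:.1f} TFLOPs/s per GPU")   # ~47.3

    # An end-to-end figure already folds in bubbles and comms overhead, so pair
    # it with a utilisation near 100% in the calculator to avoid double-counting.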

Can I model parameter-efficient fine-tuning?

Yes. Lower the trainable-parameter input to reflect only the adapters or LoRA ranks you update so the FLOPs calculation matches the smaller optimisation footprint.
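
As a rough illustration, the trainable-parameter input for a LoRA run can be estimated from the adapter shapes instead of the base model size. The figures below (rank 16, hidden size 4096, 32 layers, four attention projections) are hypothetical:

    # Estimate the trainable-parameter input for a LoRA run (hypothetical config).
    hidden = 4096        # model hidden size
    layers = 32          # transformer blocks
    rank = 16            # LoRA rank
    adapted = 4          # q, k, v, o projections, each hidden x hidden

    # Each adapted matrix gains two low-rank factors: A (rank x in), B (out x rank).
    trainable = rank * (hidden + hidden) * adapted * layers
    print(f"{trainable / 1e9:.3f}B trainable parameters")   # ~0.017B instead of 7B

Activations still pass through the frozen base weights, so treat the resulting figure as an optimistic lower bound on the true FLOPs.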

Which overheads belong in the contingency percentage?

Include orchestration software, checkpoint storage, experiment tracking, human evaluation time, compliance reviews, SRE coverage, and other costs that scale with training but are not billed as GPU hours.

How do I compare cloud versus on-prem costs?

Run the calculator twice—once with cloud rental rates and again with amortised hardware plus your facility PUE and electricity tariff—to surface the cheaper deployment strategy.
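
For the on-prem pass, substitute an amortised hourly rate for the rental price. A minimal sketch with placeholder purchase and occupancy figures:

    # Back into an hourly rate for the on-prem scenario (placeholder figures).
    capex_per_gpu = 30_000.0      # purchase price per accelerator, $
    amortisation_years = 4
    occupancy = 0.70              # share of wall-clock hours doing useful work

    useful_hours = amortisation_years * 365 * 24 * occupancy
    amortised_rate = capex_per_gpu / useful_hours
    print(f"${amortised_rate:.2f}/GPU-hour before power")   # ~$1.22

Feed that rate into the calculator together with your facility PUE and electricity tariff, then rerun with the cloud rental price and compare the two totals.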

Additional Information

  • Total training compute ≈ Parameters × Tokens × FLOPs multiplier; 6 is common for dense transformers (roughly 2 FLOPs per parameter per token forward and 4 backward), but adjust for sparsity or mixed-precision tricks (a sketch combining these steps follows this list).
  • Runtime divides total compute by effective cluster throughput (sustained TFLOPs/s × utilisation × GPU count), so communication stalls or gradient-accumulation inefficiencies lengthen the projected run automatically.
  • Energy cost factors in GPU draw, utilisation, unit count, and facility PUE to capture cooling and power-delivery overhead beyond the racks themselves.
  • Add or tweak the contingency percentage to cover orchestration tooling, data pipeline labour, human evaluation, checkpoint storage, or unexpected reruns.
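
Taken together, those bullets reduce to one small function. The sketch below assumes the contingency uplift applies to the GPU rental line with energy added on top, which reproduces the foundation-training example within rounding:

    def estimate_training_cost(params, tokens, flops_mult, gpus, tflops, util,
                               gpu_hour_rate, gpu_kw, pue, kwh_price, overhead):
        """Rough all-in budget; mirrors the bullets above."""
        total_flops = params * tokens * flops_mult
        runtime_h = total_flops / (gpus * tflops * 1e12 * util) / 3600
        gpu_hours = runtime_h * gpus
        gpu_cost = gpu_hours * gpu_hour_rate
        energy_cost = gpu_kw * util * gpu_hours * pue * kwh_price
        return gpu_cost * (1 + overhead) + energy_cost

    # Foundation-training example: 65B params x 200B tokens on 128 GPUs.
    budget = estimate_training_cost(65e9, 200e9, 6, 128, 180, 0.88,
                                    3.50, 0.50, 1.35, 0.12, 0.20)
    print(f"${budget:,.2f}")   # ~$584,245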