LLM Training Compute Cost Calculator

Estimate the cash required to pre-fund an LLM training or fine-tuning run. Feed in model size, data volume, sustained GPU throughput, utilisation, energy pricing, and contingency buffers to forecast an all-in dollar budget before you reserve hardware.

Enter the number of billions of weights updated during training (e.g., 7 for a 7B model).
Billions of tokens processed across all epochs or fine-tuning passes.
Multiplier linking parameters and tokens to total floating-point operations (often 6 for dense transformers).
Sustained per-accelerator throughput in TFLOPs/s, taken from profiling rather than peak spec-sheet numbers.
How many accelerators train simultaneously (round up to the nearest whole GPU).
Hourly rental or amortised cost per GPU, in your billing currency.
Average utilisation rate of each GPU; include pipeline bubbles, I/O waits, and comms overhead.
Average electrical load per GPU in kilowatts while training at the utilisation listed.
Power usage effectiveness for the facility (1.2–1.5 for modern hyperscalers, higher on-prem).
Utility or contract power cost per kilowatt-hour.
Percent uplift covering orchestration software, staff, storage, QA, and surprise reruns.

Educational information, not professional advice.

Examples

  • Adapter fine-tune: 7B parameters × 30B tokens, 32 H100s at $1.80/hr, 90 TFLOPs/s sustained, 85% utilisation, 0.30 kW, 1.2 PUE, $0.09/kWh, 10% overhead ⇒ $9,184.82 projected cost
  • Foundation training: 65B parameters × 200B tokens, 128 GPUs at $3.50/hr, 180 TFLOPs/s sustained, 88% utilisation, 0.50 kW, 1.35 PUE, $0.12/kWh, 20% overhead ⇒ $584,244.95 budget
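
The first example above can be reproduced step by step. A minimal sketch follows, assuming the contingency uplift applies to the GPU rental line and the energy line is added afterwards (the combination that lands on the quoted $9,184.82); the live calculator's exact rounding may differ.

    # Walk through the adapter fine-tune example (7B params, 30B tokens, 32 GPUs).
    params = 7e9               # trainable parameters
    tokens = 30e9              # training tokens
    flops_mult = 6             # FLOPs multiplier for dense transformers
    gpus = 32
    tflops = 90                # sustained TFLOPs/s per GPU
    util = 0.85
    gpu_hour_rate = 1.80       # $/GPU-hour
    gpu_kw = 0.30              # average draw per GPU, kW
    pue = 1.2
    kwh_price = 0.09           # $/kWh
    overhead = 0.10            # contingency uplift

    total_flops = params * tokens * flops_mult                           # 1.26e21 FLOPs
    runtime_hours = total_flops / (gpus * tflops * 1e12 * util) / 3600   # ~143 h
    gpu_hours = runtime_hours * gpus                                     # ~4,575 GPU-hours

    gpu_cost = gpu_hours * gpu_hour_rate                                 # ~$8,235
    energy_kwh = gpu_kw * util * gpu_hours * pue                         # ~1,400 kWh
    energy_cost = energy_kwh * kwh_price                                 # ~$126

    total = gpu_cost * (1 + overhead) + energy_cost
    print(f"${total:,.2f}")                                              # ~$9,184.82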

FAQ

How should I estimate effective throughput per GPU?

Profile a smaller rehearsal run using the same model parallelism, precision, and kernels. Peak specs overstate real throughput by 10–30% because of networking, memory, and scheduler overhead.
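
If you only have an end-to-end measurement, you can back into the throughput figure from tokens per second. A minimal sketch with hypothetical rehearsal numbers (9,000 tokens/s across 8 GPUs on a 7B dense model):

    # Convert a profiled rehearsal run into the sustained-throughput input.
    # Measurement values below are hypothetical; substitute your own profile.
    params = 7e9                       # dense model size
    flops_per_token = 6 * params       # same multiplier as the cost formula
    measured_tokens_per_s = 9_000      # end-to-end tokens/s across the job
    n_gpus = 8

    achieved = measured_tokens_per_s * flops_per_token / n_gpus / 1e12
    print(f"{achieved:.1f} TFLOPs/s per GPU")   # ~47.3

    # An end-to-end figure already folds in bubbles and comms overhead, so pair
    # it with a utilisation near 100% in the calculator to avoid double-counting.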

Can I model parameter-efficient fine-tuning?

Yes. Lower the trainable-parameter input to reflect only the adapters or LoRA ranks you update so the FLOPs calculation matches the smaller optimisation footprint.
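
As a rough illustration, the trainable-parameter input for a LoRA run can be estimated from the adapter shapes instead of the base model size. The figures below (rank 16, hidden size 4096, 32 layers, four attention projections) are hypothetical:

    # Estimate the trainable-parameter input for a LoRA run (hypothetical config).
    hidden = 4096        # model hidden size
    layers = 32          # transformer blocks
    rank = 16            # LoRA rank
    adapted = 4          # q, k, v, o projections, each hidden x hidden

    # Each adapted matrix gains two low-rank factors: A (rank x in), B (out x rank).
    trainable = rank * (hidden + hidden) * adapted * layers
    print(f"{trainable / 1e9:.3f}B trainable parameters")   # ~0.017B instead of 7B

Activations still pass through the frozen base weights, so treat the resulting figure as an optimistic lower bound on the true FLOPs.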

Which overheads belong in the contingency percentage?

Include orchestration software, checkpoint storage, experiment tracking, human evaluation time, compliance reviews, SRE coverage, and other costs that scale with training but are not billed as GPU hours.

How do I compare cloud versus on-prem costs?

Run the calculator twice—once with cloud rental rates and again with amortised hardware plus your facility PUE and electricity tariff—to surface the cheaper deployment strategy.
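
For the on-prem pass, substitute an amortised hourly rate for the rental price. A minimal sketch with placeholder purchase and occupancy figures:

    # Back into an hourly rate for the on-prem scenario (placeholder figures).
    capex_per_gpu = 30_000.0      # purchase price per accelerator, $
    amortisation_years = 4
    occupancy = 0.70              # share of wall-clock hours doing useful work

    useful_hours = amortisation_years * 365 * 24 * occupancy
    amortised_rate = capex_per_gpu / useful_hours
    print(f"${amortised_rate:.2f}/GPU-hour before power")   # ~$1.22

Feed that rate into the calculator together with your facility PUE and electricity tariff, then rerun with the cloud rental price and compare the two totals.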

Additional Information

  • Total training compute ≈ Parameters × Tokens × FLOPs multiplier; 6 is common for dense transformers (roughly 2 FLOPs per parameter per token forward and 4 backward), but adjust for sparsity or mixed-precision tricks (a sketch combining these steps follows this list).
  • Runtime divides total compute by effective cluster throughput (sustained TFLOPs/s × utilisation × GPU count), so communication stalls or gradient-accumulation inefficiencies lengthen the projected run automatically.
  • Energy cost factors in GPU draw, utilisation, unit count, and facility PUE to capture cooling and power-delivery overhead beyond the racks themselves.
  • Add or tweak the contingency percentage to cover orchestration tooling, data pipeline labour, human evaluation, checkpoint storage, or unexpected reruns.
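
Taken together, those bullets reduce to one small function. The sketch below assumes the contingency uplift applies to the GPU rental line with energy added on top, which reproduces the foundation-training example within rounding:

    def estimate_training_cost(params, tokens, flops_mult, gpus, tflops, util,
                               gpu_hour_rate, gpu_kw, pue, kwh_price, overhead):
        """Rough all-in budget; mirrors the bullets above."""
        total_flops = params * tokens * flops_mult
        runtime_h = total_flops / (gpus * tflops * 1e12 * util) / 3600
        gpu_hours = runtime_h * gpus
        gpu_cost = gpu_hours * gpu_hour_rate
        energy_cost = gpu_kw * util * gpu_hours * pue * kwh_price
        return gpu_cost * (1 + overhead) + energy_cost

    # Foundation-training example: 65B params x 200B tokens on 128 GPUs.
    budget = estimate_training_cost(65e9, 200e9, 6, 128, 180, 0.88,
                                    3.50, 0.50, 1.35, 0.12, 0.20)
    print(f"${budget:,.2f}")   # ~$584,245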