AI Inference GPU Memory Headroom Calculator

Check whether your GPU has enough memory to serve a transformer model by summing weight, activation, and KV cache requirements and reporting remaining headroom.

Inputs

  • Usable GPU memory (GB) — available device memory after reserving space for runtime overheads and drivers.
  • Parameter count — total parameters loaded per GPU replica; use the post-quantisation size if weights are compressed.
  • Precision bytes per parameter — FP16/BF16 use 2 bytes; INT8 and FP8 use 1 byte.
  • Activation overhead (%) — defaults to 15%; captures key-value cache duplication, temporary activations, and runtime buffers.
  • Concurrent KV cache tokens — defaults to 0; multiply concurrent sequences by maximum generated tokens to approximate KV cache length.
  • KV cache bytes per token — defaults to 0; for transformer decoders, approximate as 2 × num_layers × hidden_size × precision bytes.
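The inputs above combine into a single footprint-and-headroom estimate. A minimal sketch of that arithmetic (the function name, signature, and binary-gigabyte conversion are assumptions, not the calculator's actual internals):

```python
def gpu_memory_headroom(
    usable_gb: float,
    params_billions: float,
    precision_bytes: float,
    overhead_pct: float = 15.0,
    kv_tokens: int = 0,
    kv_bytes_per_token: int = 0,
) -> dict:
    """Estimate serving footprint and remaining headroom in binary GB."""
    GIB = 2 ** 30
    # Weights: parameter count times bytes per parameter.
    weights = params_billions * 1e9 * precision_bytes / GIB
    # Activation/runtime buffers, sized as a fraction of weight memory.
    buffers = weights * overhead_pct / 100.0
    # Persistent KV cache: concurrent tokens times bytes stored per token.
    kv_cache = kv_tokens * kv_bytes_per_token / GIB
    return {
        "weights_gb": weights,
        "buffers_gb": buffers,
        "kv_cache_gb": kv_cache,
        "headroom_gb": usable_gb - (weights + buffers + kv_cache),
    }
```

A negative `headroom_gb` means the configuration will not fit on the selected GPU.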

Validate memory estimates with profiling tools such as NVIDIA Nsight Systems or PyTorch's CUDA memory statistics (e.g. torch.cuda.memory_summary) before deploying latency-critical services.

Examples

  • 80 GB GPU, 13B params at 2 bytes, 15% overhead, 2,048 tokens at 1,536 bytes ⇒ 24.21 GB weights, 3.63 GB buffers, 0.003 GB (≈3 MB) KV, 52.15 GB headroom
  • 24 GB GPU, 7B params at 1 byte, 10% overhead, no KV cache ⇒ 6.52 GB weights, 0.65 GB buffers, 7.17 GB total footprint, 16.83 GB headroom
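The first example can be reproduced by hand. In this sketch every figure is recomputed from the stated inputs (assuming binary-gigabyte conversion), rather than copied from above:

```python
GIB = 2 ** 30

# Example inputs: 80 GB usable, 13B params, 2 bytes/param,
# 15% activation overhead, 2,048 cached tokens at 1,536 bytes each.
weights = 13e9 * 2 / GIB          # ≈ 24.21 GB
buffers = weights * 0.15          # ≈ 3.63 GB
kv_cache = 2_048 * 1_536 / GIB    # ≈ 0.003 GB — negligible at this scale
headroom = 80 - (weights + buffers + kv_cache)  # ≈ 52.15 GB
```

Note how small the KV term is for 2,048 tokens: at typical serving concurrency (hundreds of sequences times thousands of tokens) it grows to a dominant share of the footprint.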

FAQ

How do I derive KV cache bytes per token?

Multiply 2 × num_layers × hidden_size × precision_bytes for a standard multi-head-attention transformer decoder; the factor of 2 covers the key and value tensors cached per layer. Include rotary embedding or quantisation scaling metadata if your runtime stores them per token.
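A minimal sketch of that formula, assuming a standard multi-head-attention decoder where every layer caches one key vector and one value vector of width hidden_size per token (the 13B-class shape below is illustrative, not taken from a specific model card):

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, precision_bytes: int) -> int:
    # Two tensors (K and V) of width hidden_size are cached per layer.
    return 2 * num_layers * hidden_size * precision_bytes

# Hypothetical 13B-class decoder: 40 layers, hidden size 5120, FP16 cache.
per_token = kv_bytes_per_token(40, 5120, 2)  # 819,200 bytes ≈ 0.78 MB per token
```

Grouped-query attention (GQA) shrinks this by the ratio of KV heads to query heads, so check your model's attention layout before applying the formula.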

Why use activation overhead as a percentage of weight memory?

Serving stacks often size activations proportionally to model width. Expressing buffers as a percentage lets you calibrate the factor using profiler traces without exposing every internal tensor shape.
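Calibration can run in the other direction too: given a profiled peak, back out the overhead percentage to feed into the calculator. A small sketch (function name and the example figures are illustrative assumptions):

```python
def calibrate_overhead_pct(profiled_peak_gb: float, weights_gb: float) -> float:
    """Back out the activation-overhead percentage from a profiler trace.

    profiled_peak_gb: peak memory observed while serving, weights included.
    weights_gb: static weight memory for the same deployment.
    """
    return (profiled_peak_gb - weights_gb) / weights_gb * 100.0

# E.g. a trace showing a 28.0 GB peak over 24.2 GB of weights
# implies roughly a 15.7% overhead factor.
pct = calibrate_overhead_pct(28.0, 24.2)
```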

What happens if headroom is negative?

A negative headroom indicates the model will hit an out-of-memory (OOM) error on the selected GPU. Reduce batch size, adopt tensor or pipeline parallelism, or quantise more aggressively to shrink the footprint.
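To see why parallelism helps, note that tensor parallelism shards the weights roughly evenly across GPUs. A minimal sketch (the function and figures are illustrative assumptions; real sharding adds some per-GPU replication overhead):

```python
GIB = 2 ** 30

def per_gpu_weights_gb(params: float, precision_bytes: float, tp_degree: int) -> float:
    # Tensor parallelism splits weight matrices across tp_degree GPUs,
    # so each replica holds roughly 1/tp_degree of the weight memory.
    return params * precision_bytes / tp_degree / GIB

# A 13B FP16 model (~24.2 GB of weights) overflows a single 24 GB card,
# but sharded across two GPUs each holds ~12.1 GB before activations.
shard = per_gpu_weights_gb(13e9, 2, 2)
```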

Does this include framework runtime overheads?

The calculator assumes the usable-memory input already excludes static runtime reservations such as the CUDA context and framework allocator pools. Profile a representative workload to confirm actual headroom before production deployment.

Additional Information

  • Weight memory equals parameter count multiplied by precision bytes, converted from bytes to gigabytes (dividing by 2³⁰).
  • Activation overhead scales weight memory to approximate attention buffers, residual activations, and temporary tensors.
  • KV cache memory depends on concurrent tokens and the bytes stored per token; leaving inputs blank assumes no persistent cache load.