AI Inference GPU Memory Headroom Calculator

Check whether your GPU has enough memory to serve a transformer model by summing weight, activation, and KV cache requirements and reporting remaining headroom.

Inputs

  • Usable GPU memory (GB) — available device memory after reserving space for runtime overheads and drivers.
  • Parameter count — total parameters loaded per GPU replica; use the post-quantisation size if weights are compressed.
  • Precision bytes per parameter — FP16/BF16 use 2 bytes; INT8 and FP8 use 1 byte.
  • Activation overhead (%) — defaults to 15%; captures key-value cache duplication, temporary activations, and runtime buffers.
  • Concurrent KV cache tokens — defaults to 0; multiply concurrent sequences by maximum generated tokens to approximate KV cache length.
  • KV cache bytes per token — defaults to 0; for transformer decoders, approximate as 2 × num_layers × hidden_size × precision bytes.
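The inputs above combine into a single footprint-and-headroom estimate. A minimal sketch of that arithmetic (the function name, signature, and binary-gigabyte conversion are assumptions, not the calculator's actual internals):

```python
def gpu_memory_headroom(
    usable_gb: float,
    params_billions: float,
    precision_bytes: float,
    overhead_pct: float = 15.0,
    kv_tokens: int = 0,
    kv_bytes_per_token: int = 0,
) -> dict:
    """Estimate serving footprint and remaining headroom in binary GB."""
    GIB = 2 ** 30
    # Weights: parameter count times bytes per parameter.
    weights = params_billions * 1e9 * precision_bytes / GIB
    # Activation/runtime buffers, sized as a fraction of weight memory.
    buffers = weights * overhead_pct / 100.0
    # Persistent KV cache: concurrent tokens times bytes stored per token.
    kv_cache = kv_tokens * kv_bytes_per_token / GIB
    return {
        "weights_gb": weights,
        "buffers_gb": buffers,
        "kv_cache_gb": kv_cache,
        "headroom_gb": usable_gb - (weights + buffers + kv_cache),
    }
```

A negative `headroom_gb` means the configuration will not fit on the selected GPU.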

Validate memory estimates with profiling tools such as NVIDIA Nsight Systems or PyTorch's CUDA memory statistics (e.g. torch.cuda.memory_summary) before deploying latency-critical services.

Examples

  • 80 GB GPU, 13B params at 2 bytes, 15% overhead, 2,048 tokens at 1,536 bytes ⇒ 24.21 GB weights, 3.63 GB buffers, 0.003 GB (≈3 MB) KV, 52.15 GB headroom
  • 24 GB GPU, 7B params at 1 byte, 10% overhead, no KV cache ⇒ 6.52 GB weights, 0.65 GB buffers, 7.17 GB total footprint, 16.83 GB headroom
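The first example can be reproduced by hand. In this sketch every figure is recomputed from the stated inputs (assuming binary-gigabyte conversion), rather than copied from above:

```python
GIB = 2 ** 30

# Example inputs: 80 GB usable, 13B params, 2 bytes/param,
# 15% activation overhead, 2,048 cached tokens at 1,536 bytes each.
weights = 13e9 * 2 / GIB          # ≈ 24.21 GB
buffers = weights * 0.15          # ≈ 3.63 GB
kv_cache = 2_048 * 1_536 / GIB    # ≈ 0.003 GB — negligible at this scale
headroom = 80 - (weights + buffers + kv_cache)  # ≈ 52.15 GB
```

Note how small the KV term is for 2,048 tokens: at typical serving concurrency (hundreds of sequences times thousands of tokens) it grows to a dominant share of the footprint.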

FAQ

How do I derive KV cache bytes per token?

Multiply 2 × num_layers × hidden_size × precision_bytes for a standard multi-head-attention transformer decoder; the factor of 2 covers the key and value tensors cached per layer. Include rotary embedding or quantisation scaling metadata if your runtime stores them per token.
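A minimal sketch of that formula, assuming a standard multi-head-attention decoder where every layer caches one key vector and one value vector of width hidden_size per token (the 13B-class shape below is illustrative, not taken from a specific model card):

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, precision_bytes: int) -> int:
    # Two tensors (K and V) of width hidden_size are cached per layer.
    return 2 * num_layers * hidden_size * precision_bytes

# Hypothetical 13B-class decoder: 40 layers, hidden size 5120, FP16 cache.
per_token = kv_bytes_per_token(40, 5120, 2)  # 819,200 bytes ≈ 0.78 MB per token
```

Grouped-query attention (GQA) shrinks this by the ratio of KV heads to query heads, so check your model's attention layout before applying the formula.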

Why use activation overhead as a percentage of weight memory?

Serving stacks often size activations proportionally to model width. Expressing buffers as a percentage lets you calibrate the factor using profiler traces without exposing every internal tensor shape.
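Calibration can run in the other direction too: given a profiled peak, back out the overhead percentage to feed into the calculator. A small sketch (function name and the example figures are illustrative assumptions):

```python
def calibrate_overhead_pct(profiled_peak_gb: float, weights_gb: float) -> float:
    """Back out the activation-overhead percentage from a profiler trace.

    profiled_peak_gb: peak memory observed while serving, weights included.
    weights_gb: static weight memory for the same deployment.
    """
    return (profiled_peak_gb - weights_gb) / weights_gb * 100.0

# E.g. a trace showing a 28.0 GB peak over 24.2 GB of weights
# implies roughly a 15.7% overhead factor.
pct = calibrate_overhead_pct(28.0, 24.2)
```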

What happens if headroom is negative?

A negative headroom indicates the model will hit an out-of-memory (OOM) error on the selected GPU. Reduce batch size, adopt tensor or pipeline parallelism, or quantise more aggressively to shrink the footprint.
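To see why parallelism helps, note that tensor parallelism shards the weights roughly evenly across GPUs. A minimal sketch (the function and figures are illustrative assumptions; real sharding adds some per-GPU replication overhead):

```python
GIB = 2 ** 30

def per_gpu_weights_gb(params: float, precision_bytes: float, tp_degree: int) -> float:
    # Tensor parallelism splits weight matrices across tp_degree GPUs,
    # so each replica holds roughly 1/tp_degree of the weight memory.
    return params * precision_bytes / tp_degree / GIB

# A 13B FP16 model (~24.2 GB of weights) overflows a single 24 GB card,
# but sharded across two GPUs each holds ~12.1 GB before activations.
shard = per_gpu_weights_gb(13e9, 2, 2)
```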

Does this include framework runtime overheads?

The calculator assumes the usable-memory input already excludes static runtime reservations such as the CUDA context and framework allocator pools. Profile a representative workload to confirm actual headroom before production deployment.

Additional Information

  • Weight memory equals parameter count multiplied by precision bytes, converted from bytes to gigabytes (dividing by 2³⁰).
  • Activation overhead scales weight memory to approximate attention buffers, residual activations, and temporary tensors.
  • KV cache memory depends on concurrent tokens and the bytes stored per token; leaving inputs blank assumes no persistent cache load.