
Inference VRAM Calculator

Estimate GPU memory required to serve a model for inference. Select a preset or enter custom model parameters.

Inference Settings

[Interactive inputs: model preset or custom parameters; context length 512 to 128K tokens (8,192 shown); batch size 1 to 64 (1 shown); runtime overhead 5% to 25% (10% shown).]
Memory Breakdown

Total VRAM Required: 18.8 GB

Model Weights: 16.0 GB
KV Cache: 1.1 GB
Runtime Overhead (10%): 1.7 GB

Overhead is applied to the sum of weights and KV cache: (16.0 + 1.1) × 10% ≈ 1.7 GB, giving 18.8 GB in total.
GPU Compatibility

A100 40 GB: 1 GPU
A100 80 GB: 1 GPU
H100 80 GB: 1 GPU
H200 141 GB: 1 GPU
B200 192 GB: 1 GPU
B300 288 GB: 1 GPU
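A minimal sketch of the per-card count, assuming it is simply the total divided by each card's capacity, rounded up (TypeScript; the function name is illustrative, not the page's actual code):

function gpusNeeded(totalGB: number, perGpuGB: number): number {
  // e.g. 18.8 GB on a 40 GB A100 -> ceil(0.47) = 1 GPU
  return Math.ceil(totalGB / perGpuGB);
}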
How It Works

Model Weights = parameters × bytes per parameter. FP16 uses 2 bytes, INT4 uses 0.5 bytes per parameter.
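A sketch of this step in TypeScript (the names and dtype table are illustrative assumptions, not the calculator's actual code):

const BYTES_PER_PARAM = { fp32: 4, fp16: 2, int8: 1, int4: 0.5 } as const;

function weightsGB(numParams: number, dtype: keyof typeof BYTES_PER_PARAM): number {
  // e.g. 8e9 parameters in FP16 -> 8e9 * 2 bytes = 16.0 GB (decimal GB)
  return (numParams * BYTES_PER_PARAM[dtype]) / 1e9;
}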

KV Cache = 2 (keys and values) × layers × KV heads × head dim × context length × batch size × 2 bytes (FP16). This scales linearly with both context length and batch size.
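A sketch of the same formula, evaluated with dimensions for a hypothetical 8B-class model (32 layers, 8 KV heads, head dim 128; these values are assumptions, chosen because they reproduce the 1.1 GB figure above):

function kvCacheGB(
  layers: number,
  kvHeads: number,
  headDim: number,
  contextLen: number,
  batchSize: number,
): number {
  const bytesPerElem = 2; // FP16
  // leading 2 = one tensor each for keys and values, cached per layer per token
  return (2 * layers * kvHeads * headDim * contextLen * batchSize * bytesPerElem) / 1e9;
}

kvCacheGB(32, 8, 128, 8192, 1); // ≈ 1.07 GB, shown as 1.1 GB above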

Overhead covers CUDA context, framework buffers (vLLM, TGI, etc.), and activation memory during generation.
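Putting it together, a sketch of the total, assuming overhead is a percentage applied on top of weights plus KV cache, which matches the breakdown above:

function totalVRAMGB(weights: number, kvCache: number, overheadFrac: number): number {
  // overhead covers CUDA context, framework buffers, and activations
  return (weights + kvCache) * (1 + overheadFrac);
}

totalVRAMGB(16.0, 1.1, 0.10); // (16.0 + 1.1) * 1.10 ≈ 18.8 GB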