Full architecture-level memory and latency estimation for LLM inference deployment. Accounts for tensor parallelism, KV cache scaling, and InfiniBand overhead across multi-node setups.
Architecture Details
Parameter Distribution
Total VRAM (single GPU): 18.37 GB. Fits on 1x H100 SXM, using 18.37 GB of 80 GB per GPU (23% utilization).
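As a rough sketch of how this headline number is assembled: the total is the sum of the component terms estimated by the cards below, compared against per-GPU capacity. The function names and the 80 GB capacity constant are illustrative assumptions, not the tool's actual internals.

```python
# Hedged sketch: sum per-component memory and report per-GPU utilization.
GPU_CAPACITY_GB = 80.0  # H100 SXM (assumed reference GPU)

def total_vram_gb(weights_gb: float, kv_cache_gb: float, activations_gb: float) -> float:
    """Total VRAM estimate: weights + KV cache + peak activations."""
    return weights_gb + kv_cache_gb + activations_gb

def utilization(total_gb: float, capacity_gb: float = GPU_CAPACITY_GB) -> float:
    """Fraction of a single GPU's memory consumed."""
    return total_gb / capacity_gb

# The 18.37 GB figure above: 18.37 / 80 ~= 23% of one H100.
print(f"{utilization(18.37):.0%}")  # 23%
```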
Model Weights
params × bytes_per_param
FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes per parameter.
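A minimal sketch of the weight term, using the per-parameter byte widths listed above; the function name and the 7B example are illustrative assumptions.

```python
# Byte widths per parameter, as listed above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Model weights: params x bytes_per_param, in GB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

# Example (assumed): a 7B-parameter model in FP16 needs ~14 GB for weights alone.
print(weight_memory_gb(7e9, "fp16"))  # 14.0
```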
KV Cache
2 × L × n_kv × d_h × seq × batch × kv_bytes
Scales linearly with sequence length, batch size, and number of layers; the leading 2 covers both K and V (L = layers, n_kv = KV heads, d_h = head dimension). KV precision can be independent of weight precision.
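A sketch of the KV-cache term with symbols matching the formula above; the example config (32 layers, 32 KV heads, head dim 128, 4K context) is an assumption chosen to resemble a 7B-class model.

```python
def kv_cache_gb(layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, kv_bytes: float = 2.0) -> float:
    """KV cache: 2 (K and V) x L x n_kv x d_h x seq x batch x kv_bytes, in GB."""
    return 2 * layers * n_kv_heads * head_dim * seq_len * batch * kv_bytes / 1e9

# Example (assumed config): 32 layers, 32 KV heads, head dim 128,
# 4096-token context, batch 1, FP16 cache -> ~2.1 GB.
print(kv_cache_gb(32, 32, 128, 4096, 1))
```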
Activations
batch × seq × hidden × ~24 bytes
Peak during prefill phase. Negligible during decode (single token at a time).
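A sketch of the activation heuristic; the ~24 bytes per element is the rule of thumb from the card above, and the hidden size in the example is an assumption.

```python
def activation_memory_gb(batch: int, seq_len: int, hidden: int,
                         bytes_per_element: float = 24.0) -> float:
    """Peak (prefill) activation estimate: batch x seq x hidden x ~24 bytes."""
    return batch * seq_len * hidden * bytes_per_element / 1e9

# Example (assumed): batch 1, 4096-token prompt, hidden size 4096 -> ~0.4 GB at prefill peak.
print(activation_memory_gb(1, 4096, 4096))
```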
Prefill Latency (compute-bound)
2 × params × seq × batch / (TFLOPS × 0.6 × TP)
Assumes ~60% compute utilization (the 0.6 factor). Latency drops roughly linearly with TP degree, since the prefill FLOPs are split across GPUs.
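A sketch of the compute-bound prefill estimate, treating TFLOPS as the per-GPU peak and folding in the 0.6 utilization factor; the 989 TFLOPS in the example is the dense FP16 peak of an H100 SXM and is only illustrative.

```python
def prefill_latency_s(n_params: float, seq_len: int, batch: int,
                      peak_tflops: float, tp: int = 1, utilization: float = 0.6) -> float:
    """Compute-bound prefill: 2 x params x seq x batch FLOPs over effective FLOP/s."""
    flops = 2 * n_params * seq_len * batch
    effective_flops = peak_tflops * 1e12 * utilization * tp
    return flops / effective_flops

# Example (assumed): 7B params, 4096-token prompt, batch 1, one H100 at 989 TFLOPS FP16
# -> roughly 0.1 s of prefill.
print(prefill_latency_s(7e9, 4096, 1, peak_tflops=989))
```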
Decode Latency (memory-bound)
weight_size / (mem_bandwidth × 0.85) + kv_read_time
Each decode step reads all weights + KV cache from HBM. Bandwidth is the bottleneck.
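A per-token sketch of the memory-bound decode step, streaming both the weights and the KV cache through the same effective bandwidth; the 3350 GB/s figure is H100 SXM HBM bandwidth, and the sizes carry over from the earlier examples, all as assumptions.

```python
def decode_latency_s(weight_bytes: float, kv_cache_bytes: float,
                     mem_bandwidth_gb_s: float, efficiency: float = 0.85) -> float:
    """Memory-bound decode: time to read all weights + KV cache from HBM per token."""
    effective_bw = mem_bandwidth_gb_s * 1e9 * efficiency
    return (weight_bytes + kv_cache_bytes) / effective_bw

# Example (assumed): 14 GB FP16 weights + 2 GB KV cache on 3350 GB/s HBM
# -> ~5.6 ms per token, i.e. roughly 180 tokens/s for a single sequence.
print(decode_latency_s(14e9, 2e9, 3350))
```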
IB All-Reduce Overhead
2 × (tp-1)/tp × msg / bw + 2 × log2(tp) × latency
Two all-reduces per transformer layer (post-attention, post-FFN). Only applies when TP spans multiple nodes; within a node the traffic stays on NVLink.
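A sketch of the cost model above: a ring-style bandwidth term plus a log-depth latency term, applied twice per layer. The link bandwidth (~400 Gb/s IB, i.e. 50 GB/s) and 5 µs latency in the example are assumed values, not measurements.

```python
import math

def allreduce_time_s(msg_bytes: float, tp: int,
                     link_bw_gb_s: float, link_latency_s: float) -> float:
    """Per all-reduce: 2(tp-1)/tp x msg / bw + 2 x log2(tp) x latency."""
    if tp <= 1:
        return 0.0
    bw_term = 2 * (tp - 1) / tp * msg_bytes / (link_bw_gb_s * 1e9)
    latency_term = 2 * math.log2(tp) * link_latency_s
    return bw_term + latency_term

def per_layer_overhead_s(msg_bytes: float, tp: int,
                         link_bw_gb_s: float = 50.0,     # ~400 Gb/s IB (assumed)
                         link_latency_s: float = 5e-6) -> float:
    """Two all-reduces per layer: post-attention and post-FFN."""
    return 2 * allreduce_time_s(msg_bytes, tp, link_bw_gb_s, link_latency_s)

# Example (assumed): one decode token, hidden 4096 in FP16 -> 8 KB message, TP=8
# spanning two nodes -> ~60 us of IB overhead per layer.
print(per_layer_overhead_s(4096 * 2, tp=8))
```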