FABRICIO POLICARPO

Advanced Inference Estimator

Full architecture-level memory and latency estimation for LLM inference deployment. Accounts for tensor parallelism, KV cache scaling, and InfiniBand overhead across multi-node setups.

Model Architecture

Architecture Details

Layers: 32
Hidden Dim: 4,096
FFN Dim: 14,336
Attn Heads: 32
KV Heads: 8
Head Dim: 128
Vocab Size: 128,256
Max Context: 131,072

Parameter Distribution

Attention (per layer): 41.9M
FFN (per layer): 176.2M
Per-layer total: 218.1M
Embedding: 525.3M
LM Head: 525.3M
Computed Total: 8.03B
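The distribution above can be reproduced directly from the architecture table. A minimal sketch, assuming a GQA transformer with no projection biases, a SwiGLU FFN (three projections), and an untied embedding / LM head:

```python
# Reproduce the parameter counts above for a GQA transformer
# (assumptions: no biases, 3-matrix SwiGLU FFN, untied embedding/LM head).
L, d, d_ffn = 32, 4096, 14336
n_heads, n_kv, d_h = 32, 8, 128
vocab = 128256

attn = d * (n_heads * d_h) * 2 + d * (n_kv * d_h) * 2   # Q,O + K,V projections
ffn = 3 * d * d_ffn                                      # gate, up, down
embed = vocab * d                                        # token embedding
total = L * (attn + ffn) + 2 * embed                     # + separate LM head

print(f"attention/layer: {attn / 1e6:.1f}M")   # 41.9M
print(f"ffn/layer:       {ffn / 1e6:.1f}M")    # 176.2M
print(f"total:           {total / 1e9:.2f}B")  # 8.03B
```

Every figure in the table falls out of these four products, which is a quick sanity check when plugging in a different architecture.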
Precision: FP16
Sequence Configuration
Prompt length: 4,096 tokens (range 128–131,072)
Output tokens: 512 (range 1–8,192)
Batch size: 1 (range 1–128)
Hardware
GPU count: 1 (range 1–16)
Memory Breakdown

Total VRAM (single GPU): 18.37 GB

Memory composition:
Model Weights: 16.06 GB
KV Cache (4,608 tokens total): 618.5 MB
Activations (prefill): 412.3 MB
CUDA Context: 512.0 MB
Framework Overhead: 822.3 MB
Total: 18.37 GB
Deployment

Fits on 1x H100 SXM
18.37 GB / 80 GB per GPU (23% utilization)

Min GPUs required: 1
Current TP degree: 1x
Nodes needed: 1
Latency Estimate

Time to first token: 110.86 ms
Decode per token: 5.84 ms
Decode throughput: 171.2 tok/s
Total (512 tokens): 3.10 s
How This Is Calculated

Model Weights

params × bytes_per_param

FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes per parameter.
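A minimal sketch of this rule, using the 8.03B computed total from above:

```python
# Weight memory = params × bytes_per_param, for the usual precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}
params = 8.03e9  # computed total from the parameter distribution

for prec, b in BYTES_PER_PARAM.items():
    gb = params * b / 1e9
    print(f"{prec}: {gb:.2f} GB")
# fp16: 16.06 GB — matches the Model Weights line in the breakdown
```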

KV Cache

2 × L × n_kv × d_h × seq × batch × kv_bytes

Scales linearly with sequence length, batch size, and number of layers. KV precision can be independent of weight precision.
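The formula as a function, evaluated at this page's configuration (4,096 prompt + 512 output = 4,608 tokens, FP16 KV). The exact raw-tensor figure comes out slightly below the 618.5 MB shown above, which presumably includes some allocator margin:

```python
# KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim
#                  × tokens × batch × bytes_per_element.
def kv_cache_bytes(layers, n_kv, d_h, tokens, batch=1, kv_bytes=2):
    return 2 * layers * n_kv * d_h * tokens * batch * kv_bytes

mb = kv_cache_bytes(32, 8, 128, 4608) / 1e6
print(f"{mb:.1f} MB")  # 604.0 MB of raw K/V tensors
```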

Activations

batch × seq × hidden × ~24 bytes

Peak during prefill phase. Negligible during decode (single token at a time).
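Applying the ~24 bytes/element rule of thumb to this configuration lands close to (slightly under) the 412.3 MB reported above; the constant is an approximation, not an exact accounting:

```python
# Peak prefill activation memory, using the ~24 bytes/element heuristic.
def activation_bytes(batch, seq, hidden, bytes_per_elem=24):
    return batch * seq * hidden * bytes_per_elem

mb = activation_bytes(1, 4096, 4096) / 1e6
print(f"{mb:.1f} MB")  # 402.7 MB
```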

Prefill Latency (compute-bound)

2 × params × seq × batch / (TFLOPS × utilization × TP)

Assumes ~60% compute utilization. Scales with TP degree.
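A sketch of the prefill estimate. The peak-TFLOPS figure is an assumption (H100 SXM dense FP16 ≈ 989 TFLOPS), combined with the ~60% utilization stated above:

```python
# Compute-bound prefill: ~2 FLOPs per parameter per prompt token.
def prefill_ms(params, seq, batch=1, tflops=989e12, util=0.6, tp=1):
    flops = 2 * params * seq * batch
    return flops / (tflops * util * tp) * 1e3

print(f"{prefill_ms(8.03e9, 4096):.2f} ms")  # ~110.86 ms, matching the
                                             # time-to-first-token above
```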

Decode Latency (memory-bound)

weight_size / (mem_bandwidth × 0.85) + kv_read_time

Each decode step reads all weights + KV cache from HBM. Bandwidth is the bottleneck.
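A sketch of the decode estimate. The bandwidth figure is an assumption (H100 SXM HBM3 ≈ 3.35 TB/s) with the 85% efficiency from the formula:

```python
# Memory-bound decode: each step streams weights + KV cache from HBM.
def decode_ms(weight_bytes, kv_cache_bytes, bw=3.35e12, eff=0.85):
    return (weight_bytes + kv_cache_bytes) / (bw * eff) * 1e3

w = 16.06e9    # FP16 weights from the breakdown above
kv = 618.5e6   # full KV cache, read every step
ms = decode_ms(w, kv)
print(f"{ms:.2f} ms/token, {1e3 / ms:.0f} tok/s")  # ~5.86 ms → ~171 tok/s
```

The weight read dominates (~5.6 ms of the total), which is why quantizing weights speeds up decode almost proportionally.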

IB All-Reduce Overhead

2 × (tp-1)/tp × msg/bw + 2 × log2(tp) × latency

Two all-reduces per layer (post-attention, post-FFN). Only applies when TP spans multiple nodes.
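A sketch of the per-operation cost under this ring-all-reduce model. The message size, link bandwidth, and latency here are illustrative assumptions (one layer's FP16 activations; 400 Gb/s IB ≈ 50 GB/s; ~5 µs per hop):

```python
import math

# Ring all-reduce cost: bandwidth term plus a log2(tp) latency term.
def allreduce_s(tp, msg_bytes, bw=50e9, lat=5e-6):
    if tp == 1:
        return 0.0  # no communication without tensor parallelism
    transfer = 2 * (tp - 1) / tp * msg_bytes / bw
    return transfer + 2 * math.log2(tp) * lat

msg = 1 * 4096 * 4096 * 2  # batch × seq × hidden × 2 bytes (FP16)
print(f"{allreduce_s(8, msg) * 1e6:.1f} us per all-reduce")
```

Multiply by two all-reduces per layer and by the layer count to see why cross-node TP can erase the compute speedup it buys.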