FABRICIO POLICARPO

Advanced Inference Estimator

Full architecture-level memory and latency estimation for LLM inference deployment. Accounts for tensor parallelism, KV cache scaling, and InfiniBand overhead across multi-node setups.

Model Architecture

Architecture Details

Layers: 32
Hidden Dim: 4,096
FFN Dim: 14,336
Attn Heads: 32
KV Heads: 8
Head Dim: 128
Vocab Size: 128,256
Max Context: 131,072

Parameter Distribution

Attention (per layer): 41.9M
FFN (per layer): 176.2M
Per-layer total: 218.1M
Embedding: 525.3M
LM Head: 525.3M
Computed Total: 8.03B
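The distribution above can be reproduced directly from the architecture table. A minimal sketch, assuming a GQA transformer with no projection biases, a SwiGLU FFN (three projections), and an untied embedding / LM head:

```python
# Reproduce the parameter counts above for a GQA transformer
# (assumptions: no biases, 3-matrix SwiGLU FFN, untied embedding/LM head).
L, d, d_ffn = 32, 4096, 14336
n_heads, n_kv, d_h = 32, 8, 128
vocab = 128256

attn = d * (n_heads * d_h) * 2 + d * (n_kv * d_h) * 2   # Q,O + K,V projections
ffn = 3 * d * d_ffn                                      # gate, up, down
embed = vocab * d                                        # token embedding
total = L * (attn + ffn) + 2 * embed                     # + separate LM head

print(f"attention/layer: {attn / 1e6:.1f}M")   # 41.9M
print(f"ffn/layer:       {ffn / 1e6:.1f}M")    # 176.2M
print(f"total:           {total / 1e9:.2f}B")  # 8.03B
```

Every figure in the table falls out of these four products, which is a quick sanity check when plugging in a different architecture.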
Precision: FP16
Sequence Configuration
Prompt length: 4,096 tokens (range 128–131,072)
Output tokens: 512 (range 1–8,192)
Batch size: 1 (range 1–128)
Hardware
GPU count: 1 (range 1–16)
Memory Breakdown

Total VRAM (single GPU): 18.37 GB

Memory composition:
Model Weights: 16.06 GB
KV Cache (4,608 tokens total): 618.5 MB
Activations (prefill): 412.3 MB
CUDA Context: 512.0 MB
Framework Overhead: 822.3 MB
Total: 18.37 GB
Deployment

Fits on 1x H100 SXM
18.37 GB / 80 GB per GPU (23% utilization)

Min GPUs required: 1
Current TP degree: 1x
Nodes needed: 1
Latency Estimate

Time to first token: 110.86 ms
Decode per token: 5.84 ms
Decode throughput: 171.2 tok/s
Total (512 tokens): 3.10 s
How This Is Calculated

Model Weights

params × bytes_per_param

FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes per parameter.
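A minimal sketch of this rule, using the 8.03B computed total from above:

```python
# Weight memory = params × bytes_per_param, for the usual precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}
params = 8.03e9  # computed total from the parameter distribution

for prec, b in BYTES_PER_PARAM.items():
    gb = params * b / 1e9
    print(f"{prec}: {gb:.2f} GB")
# fp16: 16.06 GB — matches the Model Weights line in the breakdown
```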

KV Cache

2 × L × n_kv × d_h × seq × batch × kv_bytes

Scales linearly with sequence length, batch size, and number of layers. KV precision can be independent of weight precision.
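The formula as a function, evaluated at this page's configuration (4,096 prompt + 512 output = 4,608 tokens, FP16 KV). The exact raw-tensor figure comes out slightly below the 618.5 MB shown above, which presumably includes some allocator margin:

```python
# KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim
#                  × tokens × batch × bytes_per_element.
def kv_cache_bytes(layers, n_kv, d_h, tokens, batch=1, kv_bytes=2):
    return 2 * layers * n_kv * d_h * tokens * batch * kv_bytes

mb = kv_cache_bytes(32, 8, 128, 4608) / 1e6
print(f"{mb:.1f} MB")  # 604.0 MB of raw K/V tensors
```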

Activations

batch × seq × hidden × ~24 bytes

Peak during prefill phase. Negligible during decode (single token at a time).
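Applying the ~24 bytes/element rule of thumb to this configuration lands close to (slightly under) the 412.3 MB reported above; the constant is an approximation, not an exact accounting:

```python
# Peak prefill activation memory, using the ~24 bytes/element heuristic.
def activation_bytes(batch, seq, hidden, bytes_per_elem=24):
    return batch * seq * hidden * bytes_per_elem

mb = activation_bytes(1, 4096, 4096) / 1e6
print(f"{mb:.1f} MB")  # 402.7 MB
```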

Prefill Latency (compute-bound)

2 × params × seq × batch / (TFLOPS × utilization × TP)

Assumes ~60% compute utilization. Scales with TP degree.
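A sketch of the prefill estimate. The peak-TFLOPS figure is an assumption (H100 SXM dense FP16 ≈ 989 TFLOPS), combined with the ~60% utilization stated above:

```python
# Compute-bound prefill: ~2 FLOPs per parameter per prompt token.
def prefill_ms(params, seq, batch=1, tflops=989e12, util=0.6, tp=1):
    flops = 2 * params * seq * batch
    return flops / (tflops * util * tp) * 1e3

print(f"{prefill_ms(8.03e9, 4096):.2f} ms")  # ~110.86 ms, matching the
                                             # time-to-first-token above
```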

Decode Latency (memory-bound)

weight_size / (mem_bandwidth × 0.85) + kv_read_time

Each decode step reads all weights + KV cache from HBM. Bandwidth is the bottleneck.
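A sketch of the decode estimate. The bandwidth figure is an assumption (H100 SXM HBM3 ≈ 3.35 TB/s) with the 85% efficiency from the formula:

```python
# Memory-bound decode: each step streams weights + KV cache from HBM.
def decode_ms(weight_bytes, kv_cache_bytes, bw=3.35e12, eff=0.85):
    return (weight_bytes + kv_cache_bytes) / (bw * eff) * 1e3

w = 16.06e9    # FP16 weights from the breakdown above
kv = 618.5e6   # full KV cache, read every step
ms = decode_ms(w, kv)
print(f"{ms:.2f} ms/token, {1e3 / ms:.0f} tok/s")  # ~5.86 ms → ~171 tok/s
```

The weight read dominates (~5.6 ms of the total), which is why quantizing weights speeds up decode almost proportionally.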

IB All-Reduce Overhead

2 × (tp-1)/tp × msg/bw + 2 × log2(tp) × latency

Two all-reduces per layer (post-attention, post-FFN). Only applies when TP spans multiple nodes.
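A sketch of the per-operation cost under this ring-all-reduce model. The message size, link bandwidth, and latency here are illustrative assumptions (one layer's FP16 activations; 400 Gb/s IB ≈ 50 GB/s; ~5 µs per hop):

```python
import math

# Ring all-reduce cost: bandwidth term plus a log2(tp) latency term.
def allreduce_s(tp, msg_bytes, bw=50e9, lat=5e-6):
    if tp == 1:
        return 0.0  # no communication without tensor parallelism
    transfer = 2 * (tp - 1) / tp * msg_bytes / bw
    return transfer + 2 * math.log2(tp) * lat

msg = 1 * 4096 * 4096 * 2  # batch × seq × hidden × 2 bytes (FP16)
print(f"{allreduce_s(8, msg) * 1e6:.1f} us per all-reduce")
```

Multiply by two all-reduces per layer and by the layer count to see why cross-node TP can erase the compute speedup it buys.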