Reiner Pope – The math behind how LLMs are trained and served
Reiner Pope provides a rigorous breakdown of the hardware constraints that dictate the economics of large language model inference. The analysis establishes that the batch size needed to run compute-bound is set by the hardware's ratio of compute to memory bandwidth: batches of roughly 300 tokens, scaled up by the model's sparsity ratio for Mixture of Experts models, are required to amortize the cost of fetching weights from memory. This insight is critical for engineering teams designing serving infrastructure, because it grounds the discussion in physical memory-bandwidth limits rather than abstract scaling laws.

The piece also highlights that Mixture of Experts models are currently bottlenecked by all-to-all communication, which must fit within a single GPU rack's scale-up domain. This constraint is driving the industry toward larger scale-up domains, such as those offered by Nvidia's Blackwell architecture. Furthermore, the analysis clarifies that while pipeline parallelism reduces weight memory per GPU, it fails to reduce KV cache memory per GPU, making expert parallelism the superior strategy for serving. Infrastructure planners should use these metrics to evaluate the viability of current hardware deployments against next-generation interconnects.
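The ~300 figure can be reproduced with back-of-envelope arithmetic. The sketch below uses H100-like round numbers (the specific FLOP and bandwidth values are my assumption for illustration, not figures quoted from the article): during decode, every parameter is fetched from HBM once per step, while the compute done on it scales with the batch size, so the crossover batch is roughly the hardware's FLOPs-to-bandwidth ratio.

```python
# Back-of-envelope critical batch size for decode, assuming
# H100-like specs (illustrative round numbers, not from the article).
PEAK_FLOPS = 989e12   # dense bf16 FLOP/s
HBM_BW = 3.35e12      # bytes/s of HBM bandwidth
BYTES_PER_PARAM = 2   # bf16 weights

def critical_batch(sparsity_ratio=1.0):
    """Smallest decode batch where compute time matches weight-fetch time.

    Per decode step, each of P params costs BYTES_PER_PARAM bytes of HBM
    traffic and 2 FLOPs per token in the batch. Setting the two times
    equal (B * 2P / PEAK_FLOPS = BYTES_PER_PARAM * P / HBM_BW) gives
    B = (PEAK_FLOPS / HBM_BW) * BYTES_PER_PARAM / 2. An MoE that touches
    only active params scales B up by its sparsity ratio
    (total params / active params).
    """
    dense = PEAK_FLOPS / HBM_BW * BYTES_PER_PARAM / 2
    return dense * sparsity_ratio

print(round(critical_batch()))      # dense model: close to 300
print(round(critical_batch(8.0)))   # hypothetical 8x-sparse MoE
```

The dense result lands near 295 tokens, matching the "around 300" rule of thumb, and a sparser model multiplies that requirement accordingly.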
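The pipeline-parallelism point can be made concrete with a rough memory model (my own illustration under simplifying assumptions, not the article's code): weights split cleanly across pipeline stages, but keeping every stage busy requires roughly one microbatch in flight per stage, so the tokens resident on each GPU grow in proportion and KV cache per GPU stays flat.

```python
# Sketch of per-GPU memory under pipeline parallelism (illustrative
# model, not from the article): weights shrink, KV cache does not.
def per_gpu_memory(stages, params, layers, batch, seq_len, kv_bytes):
    """Rough per-GPU bytes under `stages`-way pipeline parallelism.

    `kv_bytes` is KV cache bytes per token per layer. Each stage holds
    1/stages of the weights and layers, but must keep ~`stages`
    microbatches in flight to avoid pipeline bubbles, so resident
    tokens scale up by the same factor and KV cache per GPU is flat.
    """
    weight_bytes = 2 * params / stages          # bf16 weights, split evenly
    layers_per_gpu = layers / stages
    tokens_in_flight = batch * stages           # microbatches to fill the pipe
    kv_cache_bytes = layers_per_gpu * tokens_in_flight * seq_len * kv_bytes
    return weight_bytes, kv_cache_bytes

# Hypothetical 70B-param, 80-layer model: weights drop 8x, KV cache doesn't.
w1, kv1 = per_gpu_memory(1, 70e9, 80, batch=64, seq_len=4096, kv_bytes=2 * 1024)
w8, kv8 = per_gpu_memory(8, 70e9, 80, batch=64, seq_len=4096, kv_bytes=2 * 1024)
print(w1 / w8)    # weight memory shrinks with stage count
print(kv1 / kv8)  # KV cache per GPU is unchanged
```

This is why the article favors expert parallelism for serving: it can shrink both weight and KV footprints per GPU, at the cost of the all-to-all traffic discussed above.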