Batch Size, KV Cache, and the Hidden Costs of AI Inference

Batch size is the biggest lever for AI inference cost, according to Reiner Pope, CEO of MatX. Batching many users together can cut the cost per token by up to a thousand times, but it comes at the expense of latency. The two components scale differently: compute time grows linearly with batch size, while the time to fetch model weights from memory is a fixed overhead, paid once per step no matter how many users share it.

For autoregressive models, the central factor is the KV cache. Each new token must attend to all previous ones, so their cached keys and values are read back at every step. That process is dominated by memory fetches, not matrix math, which makes memory bandwidth more important than raw compute power.
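To give a sense of scale, here is a minimal sketch of how KV cache size grows with context length and batch size. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example, not a figure from the source, and the function name `kv_cache_bytes` is likewise illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, batch_size,
                   bytes_per_value=2):
    """Estimate total KV cache size in bytes for a batch of sequences."""
    # Factor of 2: one key vector and one value vector are stored
    # per token, per layer, per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      context_len=8192, batch_size=16)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Because every decode step has to read this cache back, the bytes it occupies translate directly into memory traffic, which is why bandwidth rather than arithmetic throughput tends to set the pace.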

Overall latency is determined by the slower of two curves: compute time versus memory fetch time. There is a hard lower limit: the time needed to read every parameter from memory. Context length shifts the balance between compute-limited and memory-limited scenarios.
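The "slower of two curves" idea can be sketched as a simple roofline-style model. This is a rough approximation under stated assumptions: about 2 FLOPs per parameter per sequence, attention FLOPs ignored, and weights read once per step. The hardware and model numbers in `CONFIG` are hypothetical, and `step_latency` is an illustrative name.

```python
def step_latency(batch_size, context_len, *, n_params, kv_bytes_per_token,
                 peak_flops, mem_bandwidth, bytes_per_param=2):
    """Return (latency, compute_time, memory_time) in seconds for one decode step."""
    # Compute: roughly one multiply and one add per parameter, per sequence.
    compute_time = 2 * n_params * batch_size / peak_flops
    # Memory: weights are read once per step regardless of batch size,
    # while each sequence adds its own KV cache traffic.
    weight_bytes = n_params * bytes_per_param
    kv_bytes = kv_bytes_per_token * context_len * batch_size
    memory_time = (weight_bytes + kv_bytes) / mem_bandwidth
    return max(compute_time, memory_time), compute_time, memory_time


# Hypothetical model and accelerator figures, chosen only for illustration.
CONFIG = dict(n_params=70e9, kv_bytes_per_token=327_680,
              peak_flops=1e15, mem_bandwidth=3e12)

# Hard floor: the time to read every parameter once (2 bytes each).
floor = CONFIG["n_params"] * 2 / CONFIG["mem_bandwidth"]
print(f"floor = {floor * 1e3:.2f} ms")

for batch, ctx in [(1, 1024), (1024, 128), (1024, 8192)]:
    latency, compute_t, memory_t = step_latency(batch, ctx, **CONFIG)
    bound = "compute" if compute_t > memory_t else "memory"
    print(f"batch={batch:5d} ctx={ctx:5d} "
          f"latency={latency * 1e3:8.2f} ms ({bound}-bound)")
```

With these made-up numbers, the same batch of 1024 is compute-bound at a short context and memory-bound at a long one, which is the shift the paragraph above describes.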

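On GPUs, the cost per token plummets as batch size grows, but only up to a point: once the batch is large enough that compute time overtakes memory fetch time, adding more users no longer lowers the cost per token. Understanding this trade-off is essential for optimizing resource use and reducing expenses in large-scale AI deployments.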
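A toy cost model makes the "only up to a point" behavior concrete. It is a sketch that assumes a fixed weight-fetch time per step, a constant compute time per sequence, and a flat hourly accelerator price. All three numbers below are hypothetical, and KV cache traffic is left out for simplicity.

```python
def cost_per_token(batch_size, *, weight_fetch_s, compute_per_seq_s,
                   dollars_per_hour):
    """Dollar cost of one generated token at a given batch size."""
    # One decode step takes as long as the slower of the two:
    # the fixed weight fetch, or compute that scales with the batch.
    step_s = max(weight_fetch_s, batch_size * compute_per_seq_s)
    # Each step yields one token per sequence in the batch.
    return dollars_per_hour / 3600 * step_s / batch_size


# Hypothetical figures, chosen only for illustration.
PARAMS = dict(weight_fetch_s=0.04, compute_per_seq_s=0.0001,
              dollars_per_hour=2.0)

for batch in [1, 8, 64, 400, 800, 1600]:
    print(f"batch={batch:5d}  cost/token=${cost_per_token(batch, **PARAMS):.2e}")
```

In this model the cost falls roughly as 1/batch until the crossover batch size (`weight_fetch_s / compute_per_seq_s`, here about 400) and then stays flat, while step latency keeps rising.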