Batch size is the biggest lever for AI inference cost, according to Reiner Pope, CEO of MatX. Batching many users' requests together can cut the cost per token by up to a thousand times, but it directly affects latency. The relationship is roughly linear: compute time grows in proportion to batch size, while the time to fetch the model weights from memory is a fixed overhead that does not depend on how many users share the step.
For autoregressive models, the key is the KV cache. Each new token must attend to all previous ones, so every decoding step reads the cached keys and values for the whole context, a process dominated by memory fetches rather than matrix math. For this step, that makes memory bandwidth more important than raw compute power.
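To give a sense of why the KV cache drives memory traffic, here is a minimal sketch of how many bytes one decoding step reads from the cache. The function name and the model shape (32 layers, 8 key/value heads, head size 128, 16-bit values) are illustrative assumptions, not figures from the talk.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes of cached keys and values read in one decoding step.

    Hypothetical helper. Assumes standard attention, where every layer
    stores one key and one value vector per KV head for every earlier token.
    """
    # The leading 2 counts the key and the value.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


# Illustrative (assumed) model shape, not taken from the source.
one_user = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          context_len=8192)
print(one_user / 2**30, "GiB read per step for one 8192-token sequence")

# Unlike the weights, this traffic is not shared across users:
# it grows with both batch size and context length.
batch_of_64 = kv_cache_bytes(32, 8, 128, 8192, batch_size=64)
print(batch_of_64 / 2**30, "GiB read per step for a batch of 64")
```

With this assumed shape, a single 8192-token sequence already requires 1 GiB of cache reads per generated token, and the figure scales linearly with every additional user in the batch.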
Overall latency is set by whichever of two curves is slower at a given batch size: compute time or memory fetch time. There is a hard lower limit: the time needed to read every parameter from memory. Context length shifts the balance between compute-limited and memory-limited scenarios.
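The two-curve picture can be sketched as a simple roofline-style model. This is a rough approximation under stated assumptions, not the speaker's own formula: the function name, the hardware figures, and the model figures below are all hypothetical placeholders chosen only to make the shape of the curves visible.

```python
def decode_step_times(batch_size, context_len, *,
                      param_bytes=140e9,           # assumed: 70B parameters at 2 bytes each
                      flops_per_token=140e9,       # assumed: about 2 FLOPs per parameter per token
                      kv_bytes_per_token=327_680,  # assumed KV-cache bytes per cached token
                      peak_flops=1e15,             # assumed hardware peak, FLOP/s
                      mem_bandwidth=3e12):         # assumed memory bandwidth, bytes/s
    """Return (compute_time, memory_time, latency) in seconds for one decoding step."""
    # Compute grows linearly with the number of sequences in the batch.
    compute_time = batch_size * flops_per_token / peak_flops
    # Weights are read once per step, however many users share it.
    weight_time = param_bytes / mem_bandwidth
    # KV-cache reads grow with both batch size and context length.
    kv_time = batch_size * context_len * kv_bytes_per_token / mem_bandwidth
    memory_time = weight_time + kv_time
    # The step takes as long as the slower of the two curves.
    return compute_time, memory_time, max(compute_time, memory_time)


for context_len in (512, 8192):
    for batch_size in (1, 64, 1024):
        compute, memory, latency = decode_step_times(batch_size, context_len)
        bound = "compute" if compute > memory else "memory"
        print(f"ctx={context_len:5d} batch={batch_size:5d} "
              f"latency={latency * 1e3:8.2f} ms ({bound}-limited)")
```

Under these assumed numbers, latency never drops below `param_bytes / mem_bandwidth`, the time to read the weights once. A large batch with short contexts ends up compute-limited, while the same batch with long contexts stays memory-limited because the KV-cache reads outgrow the compute.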
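On GPUs, the cost per token plummets as batch size grows, but only up to a point: once compute time overtakes memory fetch time, adding more users to a batch no longer spreads the fixed memory cost any further. Understanding this trade-off is essential for optimizing resource use and reducing expenses in large-scale AI deployments.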
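A minimal sketch of the cost curve follows, assuming the hardware is billed per second and that every user in a batch receives one token per step. It deliberately ignores KV-cache traffic to keep the shape clear, and all names and numbers are hypothetical.

```python
def cost_per_token(batch_size, *,
                   weight_fetch_time=140e9 / 3e12,       # assumed: seconds to read all weights once
                   compute_time_per_token=140e9 / 1e15,  # assumed: seconds of compute per token
                   price_per_second=1.0):                # hypothetical price, arbitrary units
    """Cost of one generated token when batch_size users share a decoding step."""
    # The step lasts as long as the slower of compute and weight fetching.
    step_time = max(batch_size * compute_time_per_token, weight_fetch_time)
    # The cost of the step is split evenly across the tokens it produces.
    return step_time * price_per_second / batch_size


baseline = cost_per_token(1)
for batch_size in (1, 8, 64, 512, 4096):
    relative = cost_per_token(batch_size) / baseline
    print(f"batch={batch_size:5d} cost per token = {relative:.4f} x the unbatched cost")
```

While the step is memory-limited, the cost per token falls in proportion to 1 / batch size. Once the batch is large enough for compute to dominate, the cost flattens at the compute time per token. Where that crossover sits, and therefore how large the total saving is, depends entirely on the real hardware and model rather than on these placeholder values.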