Google's Gemma 4 AI Models Get 3x Speed Boost by Predicting Future Tokens

Google has launched experimental Multi-Token Prediction (MTP) drafters for its Gemma 4 open AI models, promising up to a 3x speed boost in local generation. MTP uses a lightweight drafter model to predict several future tokens at once, so the main model no longer has to produce every token one at a time.

Standard large language models generate text autoregressively, one token at a time, and each new token requires a full pass through the model. That demands significant computing power and memory bandwidth, which is a bottleneck for local AI running on consumer GPUs. MTP drafters, available for the 26B (E2B) and 46B (E4B) models, share the main model's key-value cache and use sparse decoding to quickly generate speculative tokens. The main model then verifies them in a single pass, drastically reducing the number of sequential passes it has to run.
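The draft-then-verify loop described above can be illustrated with a toy example. This is a minimal sketch of greedy speculative decoding, not Google's implementation: the names `speculative_generate`, `greedy_generate`, `target_next`, and `draft_next` are hypothetical, the "models" are simple arithmetic functions over integer tokens, and the shared key-value cache and sparse decoding are not modeled at all.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_generate(target_next: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: the main model produces one token per pass."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    n: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Draft several tokens cheaply, then let the main model verify them."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n:
        # 1. The drafter proposes draft_len tokens, one after another.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The main model checks every drafted position. A real system
        #    scores all positions in one batched forward pass; this toy
        #    calls target_next per position but counts it as one pass.
        target_passes += 1
        accepted: List[Token] = []
        for i, guess in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if guess != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n], target_passes


# Toy stand-ins for the two models (assumptions for illustration only).
def target_next(ctx: List[Token]) -> Token:
    return (sum(ctx) + len(ctx)) % 10


def draft_next(ctx: List[Token]) -> Token:
    guess = target_next(ctx)
    # The drafter is wrong whenever the context length is a multiple of 4.
    return guess if len(ctx) % 4 else (guess + 1) % 10


if __name__ == "__main__":
    out, passes = speculative_generate(target_next, draft_next, [1, 2, 3], 12)
    print(out == greedy_generate(target_next, [1, 2, 3], 12), passes)
```

Because every accepted token is one the main model would have chosen anyway, the output matches plain greedy generation while the main model runs fewer passes. How much faster this is in practice depends on how often the drafter guesses correctly.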

According to Google, MTP can cut inference time in half compared to standard generation, with identical output quality. A demonstration of Gemma 4 26B running with MTP on an NVIDIA RTX PRO 6000 shows this performance gain.
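The speed figures in this article are stated in two different ways: "up to 3x" is a speedup multiplier, while "in half" is a reduction in time. The short sketch below converts between the two; the helper name `time_fraction` is hypothetical and the numbers are plain arithmetic, not benchmark results.

```python
def time_fraction(speedup: float) -> float:
    """Return the share of the baseline time that remains after a speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    for factor in (2.0, 3.0):
        print(f"{factor}x speedup leaves {time_fraction(factor):.0%} of the original time")
```

Read this way, halving inference time corresponds to a 2x speedup, and a 3x speedup would cut inference time to about a third.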

Gemma models are designed for local, on-device AI, giving users control over their data. The move to the Apache 2.0 license gives Gemma 4 more permissive terms and makes it more accessible to developers.