Experimental Model

DiffusionGemma

A parallel-generation text model from Google DeepMind, designed for exceptionally fast, high-throughput AI experiences.

By abandoning traditional token-by-token generation in favor of discrete text diffusion, DiffusionGemma dramatically accelerates inference. This opens a new frontier for real-time AI agents and high-throughput local deployment without compromising quality.

256 Tokens in Parallel

1,000 Tokens / Sec (Max)

26B Total Parameters

3.8B Active Parameters

Parallel Text Generation

Traditional LLMs use autoregressive decoding, predicting text strictly one token after another. DiffusionGemma uses diffusion-based denoising to refine an entire block of text simultaneously. By processing up to 256 tokens at once, it breaks the bottleneck of sequential generation, delivering massive throughput improvements.

Standard LLM (Autoregressive): 1 token / step
DiffusionGemma (Block-Denoising): 256 tokens / step
Key Takeaway: Parallel block generation allows DiffusionGemma to reach speeds up to 1,000 tokens/sec on dedicated hardware, transforming what is possible for high-throughput applications.
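
To make the step-count difference concrete, here is a minimal Python sketch that compares how many forward passes each approach would need for the same output length. The 256-token block size comes from the figures above. The `steps_per_block` value (how many denoising passes each block receives) is a hypothetical number chosen purely for illustration, not a published DiffusionGemma setting, and the function names are illustrative as well.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_denoising_passes(
    num_tokens: int,
    block_size: int = 256,      # tokens refined together (from the page)
    steps_per_block: int = 16,  # ASSUMPTION: illustrative value only
) -> int:
    """Block denoising: each block is refined for a fixed number of passes."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


n = 1024
ar = autoregressive_passes(n)
bd = block_denoising_passes(n)
print(ar, bd, ar / bd)  # 1024 64 16.0
```

Under these assumptions, the reduction in forward passes depends directly on how many denoising steps each block needs: the fewer steps per block, the larger the gain. Actual tokens/sec also depends on the cost of each pass and on the hardware, which this sketch does not model.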

🧠 MoE Efficiency

Built on a Mixture-of-Experts (MoE) foundation, DiffusionGemma scales capability without sacrificing inference efficiency. While the total model footprint is 25.2 billion parameters (shown as 26B in the headline figures), the design is sparse: during each step, only 3.8 billion parameters, about 15% of the total, are used for computation.

15% Active (3.8B of 25.2B parameters)
Key Takeaway: Activating only a fraction of its total parameters at each step helps DiffusionGemma keep memory-bandwidth requirements low, making it well suited to local and workstation environments.
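
The 15% figure follows directly from the two parameter counts quoted above. The short Python sketch below reproduces that arithmetic and, as a rough illustration, estimates how many gigabytes of weights correspond to the active and total parameter sets. The 2 bytes per parameter (16-bit weights) is an assumption made for this example; the page does not state the precision the model ships in, and the helper names are hypothetical.

```python
TOTAL_PARAMS = 25_200_000_000   # total parameters (from the page)
ACTIVE_PARAMS = 3_800_000_000   # parameters used per step (from the page)
BYTES_PER_PARAM = 2             # ASSUMPTION: 16-bit weights, for illustration


def active_fraction(active: int, total: int) -> float:
    """Share of the model's parameters that take part in one step."""
    return active / total


def weight_gigabytes(num_params: int, bytes_per_param: int) -> float:
    """Size of the given parameters in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active share: {frac:.1%}")  # Active share: 15.1%
print(f"Active weights: {weight_gigabytes(ACTIVE_PARAMS, BYTES_PER_PARAM):.1f} GB")
print(f"Total weights:  {weight_gigabytes(TOTAL_PARAMS, BYTES_PER_PARAM):.1f} GB")
```

Note that the full set of weights still has to be stored somewhere accessible; the sparse design reduces how much of it is used per step, not the total footprint.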