Gemma Scope 2
A comprehensive, open suite of interpretability tools designed for the Gemma 3 model collection, allowing researchers to analyze complex language model behaviors.
Gemma Scope 2 serves as a microscope for large language models. It provides Sparse Autoencoders (SAEs) and Transcoders trained on every layer of the models in the Gemma 3 family. Because these tools decompose high-dimensional activations into human-inspectable features, researchers can trace a model's internal computations instead of relying solely on input-output analysis.
Microscope for AI
Trace complex behaviors by inspecting internal features across all Gemma 3 layers.
AI Safety
Investigate jailbreaks, hallucinations, and chatbot safety mechanisms at the feature level.
SAEs & Transcoders
Decompose dense activations into sparse, interpretable sets of active concepts.
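Transcoders differ from SAEs mainly in what they are trained to reproduce: an SAE reconstructs the same activation it reads, while a transcoder reads the input to an MLP block and predicts that block's output through a sparse set of features. The pure-Python sketch below illustrates that difference under stated assumptions: toy sizes, hand-picked weights, a plain ReLU and an L1 sparsity penalty as simple stand-ins, and hypothetical function names (`transcode`, `transcoder_loss`). It is not the Gemma Scope 2 training code, and the released transcoders may use a different activation function and sparsity penalty.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # stored as a list of rows
Params = Tuple[Matrix, Vector, Matrix, Vector]  # (w_enc, b_enc, w_dec, b_dec)


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def transcode(mlp_in: Vector, w_enc: Matrix, b_enc: Vector,
              w_dec: Matrix, b_dec: Vector) -> Tuple[Vector, Vector]:
    """Encode the MLP *input* into sparse features, then predict the MLP *output*."""
    pre = [p + b for p, b in zip(matvec(w_enc, mlp_in), b_enc)]
    feats = [max(0.0, p) for p in pre]  # plain ReLU, a simplifying assumption
    pred_out = [o + b for o, b in zip(matvec(w_dec, feats), b_dec)]
    return feats, pred_out


def transcoder_loss(mlp_in: Vector, mlp_out: Vector, params: Params,
                    l1_coeff: float = 0.01) -> float:
    """Squared error against the real MLP output plus an L1 sparsity penalty.

    An SAE would instead compare its prediction with its own input.
    """
    feats, pred_out = transcode(mlp_in, *params)
    sq_err = sum((p - t) ** 2 for p, t in zip(pred_out, mlp_out))
    return sq_err + l1_coeff * sum(abs(f) for f in feats)


# Hypothetical toy weights: 2-dim MLP input/output, 2 transcoder features.
params: Params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, -0.5],
                  [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
feats, pred = transcode([1.0, 0.25], *params)
print(feats, pred)  # [1.0, 0.0] [2.0, 0.0]
print(transcoder_loss([1.0, 0.25], [2.0, 0.0], params))  # 0.01
```

Because the sparse features sit between an MLP's input and output, a transcoder offers a feature-level stand-in for the MLP computation itself rather than a description of a single activation vector.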
🔍 How It Works
Capture Activations
Gemma Scope 2 is built from large volumes of activation data recorded at every layer of Gemma 3 models as they process a wide range of prompts.
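Conceptually, capturing activations means running the model once and keeping a copy of each layer's output. The sketch below shows this on a hypothetical two-layer toy network written in pure Python; the residual-stream layout, the `run_with_cache` helper, and the toy layers are illustrative assumptions, not part of Gemma Scope 2. With a real PyTorch model, the same idea is usually implemented by attaching forward hooks via `torch.nn.Module.register_forward_hook`.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]  # maps the residual stream to an additive update


def run_with_cache(layers: List[Layer], x: Vector) -> Dict[int, Vector]:
    """Run a toy residual network and record the residual stream after every layer."""
    cache: Dict[int, Vector] = {}
    resid = list(x)
    for i, layer in enumerate(layers):
        update = layer(resid)
        resid = [r + u for r, u in zip(resid, update)]  # residual connection
        cache[i] = list(resid)  # store a copy keyed by layer index
    return cache


# Hypothetical two-layer toy model.
toy_layers: List[Layer] = [
    lambda v: [0.5 * a for a in v],  # layer 0 adds half of its input
    lambda v: [1.0 for _ in v],      # layer 1 adds a constant
]
cache = run_with_cache(toy_layers, [1.0, 2.0])
print(cache)  # {0: [1.5, 3.0], 1: [2.5, 4.0]}
```

Storing one vector per layer is what makes per-layer tools possible: each cached vector is the input that the SAE or transcoder for that layer would read.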
Decompose Features
Sparse Autoencoders (SAEs) break each dense activation vector down into a small number of active, distinct features, each intended to represent a specific concept (e.g., 'coding logic' or 'politeness').
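An SAE encodes an activation into a wider feature vector in which most entries are zero, then decodes those features back into an approximation of the original activation. The sketch below assumes the JumpReLU form used in the original Gemma Scope release (a feature stays active only if its pre-activation exceeds a learned per-feature threshold); whether every Gemma Scope 2 SAE uses exactly this form is an assumption here. Sizes, weights, and helper names (`sae_encode`, `sae_decode`, `active_features`) are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # stored as a list of rows


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def jumprelu(pre: Vector, theta: Vector) -> Vector:
    """Keep a pre-activation only if it exceeds its feature's threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, theta)]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector, theta: Vector) -> Vector:
    """Map a d_model activation to d_sae sparse feature activations."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    return jumprelu(pre, theta)


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the d_model activation from the feature activations."""
    return [o + b for o, b in zip(matvec(w_dec, f), b_dec)]


def active_features(f: Vector) -> List[int]:
    """Indices of the features that fired (non-zero activations)."""
    return [i for i, v in enumerate(f) if v != 0.0]


# Hypothetical toy SAE: d_model = 2, d_sae = 3.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d_sae rows x d_model cols
B_ENC = [0.0, 0.0, -1.0]
THETA = [0.5, 0.5, 0.5]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]    # d_model rows x d_sae cols
B_DEC = [0.0, 0.0]

feats = sae_encode([1.0, 0.2], W_ENC, B_ENC, THETA)
print(feats)                             # [1.0, 0.0, 0.0]
print(active_features(feats))            # [0]
print(sae_decode(feats, W_DEC, B_DEC))   # [1.0, 0.0]
```

In this toy run the small 0.2 component falls below its threshold and is dropped, which shows up as reconstruction error. Training an SAE trades that error off against how many features are active per input.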
Analyze & Debug
Researchers trace which features fire during unwanted behaviors (like hallucinations) to understand the model's internal reasoning steps.
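One simple way to start this kind of analysis is to compare how strongly each feature fires on examples that show the unwanted behavior versus examples that do not, then inspect the features with the largest gap. The sketch below implements that differential-activation heuristic in pure Python; it is an illustrative assumption, not a method prescribed by Gemma Scope 2, and the helper names and activation values are hypothetical.

```python
from typing import List, Tuple

Batch = List[List[float]]  # one row of SAE feature activations per example


def mean_activation(batch: Batch) -> List[float]:
    """Average each feature's activation over the rows of a batch."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]


def rank_features(flagged: Batch, baseline: Batch,
                  top_k: int = 3) -> List[Tuple[int, float]]:
    """Rank features by how much more they fire on flagged examples than on baseline ones."""
    diff = [f - b for f, b in zip(mean_activation(flagged), mean_activation(baseline))]
    order = sorted(range(len(diff)), key=lambda i: diff[i], reverse=True)
    return [(i, diff[i]) for i in order[:top_k]]


# Hypothetical feature activations for examples with and without the unwanted behavior.
flagged = [[0.0, 2.0, 0.1], [0.0, 3.0, 0.0]]
baseline = [[0.5, 0.0, 0.1], [0.3, 0.0, 0.0]]
print(rank_features(flagged, baseline, top_k=1))  # [(1, 2.5)]
```

A ranking like this only points to candidate features; confirming what a feature represents still requires inspecting the inputs that activate it and, ideally, testing its causal effect on the model's output.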
🌐 Full Family Support
Gemma Scope 2 covers the entire Gemma 3 family. This broad coverage is critical because complex, emergent behaviors often appear only at larger model scales.
Advancing Open AI Safety
By providing open access to the weights, code, and documentation for the Gemma 3 interpretability suite, Google DeepMind empowers the AI safety community to build safer, more transparent agents for the future.