Gemma 2, the latest open model release from Google DeepMind, introduces several architectural and training changes. This post walks through the updates that improve the model’s quality and efficiency. Key features include sliding window attention, which interleaves sliding window and full quadratic attention layers, and logit soft-capping, which keeps logits within a fixed range to improve training stability.
In addition, knowledge distillation uses a larger teacher model to train the smaller 9B model, and model merging combines two or more large language models (LLMs) into a single model. The sections below cover each change and how Gemma 2 compares with other open models.
What is Gemma 2?
Gemma 2 is Google’s latest iteration of open Large Language Models (LLMs), available in two sizes, 9 billion and 27 billion parameters, each with base (pre-trained) and instruction-tuned versions. Gemma 2 is based on Google DeepMind’s Gemini and has a context length of 8K tokens. The four models are:
- gemma-2-9b: Base 9B model.
- gemma-2-9b-it: Instruction fine-tuned version of the base 9B model.
- gemma-2-27b: Base 27B model.
- gemma-2-27b-it: Instruction fine-tuned version of the base 27B model.
Gemma 2 models were trained on approximately twice as much data as the first iteration: 13 trillion tokens for the 27B version and 8 trillion tokens for the 9B version, covering web data (primarily English), code, and math. The exact training mix has not been disclosed, but larger and more careful data curation was likely a significant factor in the improved performance. Gemma 2’s permissive license allows redistribution, fine-tuning, commercial use, and derivative works.
Technical Advances in Gemma 2
Gemma 2 introduces several advancements while maintaining similarities with the original iteration, such as a context length of 8192 tokens and the use of Rotary Position Embedding (RoPE). Here are the four main advances:
Sliding Window Attention
Sliding window attention interleaves sliding window and full quadratic attention to improve generation quality. In Gemma 2, every other layer uses a sliding window (local attention over 4096 tokens), while the layers in between use full quadratic global attention (8192 tokens). Because half of the layers still attend to all tokens, this likely preserves quality in long-context situations while keeping some of the efficiency of sliding window attention.
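The alternation between local and global layers can be illustrated with attention masks. The following is a minimal sketch in plain Python, not the actual Gemma 2 implementation: the function names are hypothetical, the choice of which layer parity is local is an assumption, and so is the convention that the window includes the current token.

```python
def causal_mask(seq_len):
    """Global causal mask: position i may attend to every position j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local causal mask: position i may attend only to the most recent
    `window` positions (assumed here to include position i itself)."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def layer_mask(layer_idx, seq_len, window):
    """Alternate local and global layers.
    Assumption: even-indexed layers are local, odd-indexed layers are global."""
    if layer_idx % 2 == 0:
        return sliding_window_mask(seq_len, window)
    return causal_mask(seq_len)


# Toy sizes for readability; the text above gives 4096 (local) and 8192 (global).
local = layer_mask(0, seq_len=6, window=3)
global_ = layer_mask(1, seq_len=6, window=3)

print(sum(local[5]))    # 3: the last token sees only the 3 most recent positions
print(sum(global_[5]))  # 6: the last token sees every position
```

In a local layer the number of visible positions per token is bounded by the window size, so the attention cost grows linearly with sequence length instead of quadratically.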
Logit Soft-Capping
Logit soft-capping prevents logits from growing excessively by scaling them to a fixed range, stabilizing training. Soft capping is applied to the final layer and every attention layer, ensuring that values remain within a controlled interval without significant information loss.
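A common way to express soft-capping is `cap * tanh(logits / cap)`, which is close to the identity for small values and saturates smoothly at `±cap`. The sketch below uses plain Python on a list of floats. The cap values (50.0 for attention logits, 30.0 for final logits) are the ones given in the Gemma 2 technical report, but they are not stated in this post, so treat them as assumptions here; the function name is hypothetical.

```python
import math


def soft_cap(logits, cap):
    """Smoothly squash each logit into the open interval (-cap, cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


# Assumed cap values, taken from the technical report rather than this post.
ATTN_CAP = 50.0   # applied to attention logits
FINAL_CAP = 30.0  # applied to final-layer logits

raw = [-200.0, -5.0, 0.0, 5.0, 200.0]
capped = soft_cap(raw, FINAL_CAP)
print(capped)
```

Small logits pass through almost unchanged while extreme ones are pulled toward the cap, and because `tanh` is monotonic the ordering of the logits is preserved.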
Knowledge Distillation
Knowledge distillation uses a larger teacher model to train a smaller student model, enriching the next-token prediction task with the teacher’s full distribution of token probabilities instead of a single target token. In Gemma 2, this technique was used to pre-train the 9B model, while the 27B model was pre-trained from scratch. Post-training involved generating completions from a teacher model and training the student models on this synthetic data.
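The difference between ordinary next-token training and distillation can be sketched for a single token position. This is an illustrative example in plain Python, assuming the distillation objective is the cross-entropy between the teacher's and the student's token distributions; the function names and toy logits are hypothetical, and the actual training setup may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_loss(student_logits, target_idx):
    """Standard next-token loss: only the observed token provides signal."""
    return -math.log(softmax(student_logits)[target_idx])


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's:
    every token in the vocabulary provides signal, weighted by the teacher."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Toy vocabulary of three tokens.
teacher = [2.0, 1.0, 0.1]
student = [0.5, 0.5, 0.5]  # an untrained, uniform student

print(distillation_loss(student, teacher))  # loss for the uniform student
print(distillation_loss(teacher, teacher))  # lower bound: student matches teacher
```

The loss is minimized when the student reproduces the teacher's distribution, at which point it equals the entropy of the teacher's distribution.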
Model Merging
Model merging combines two or more LLMs into a single new model. According to the Technical Report, Gemma 2 used WARP, a new merging technique that merges models in three distinct stages:
- Exponential Moving Average (EMA): Applied during reinforcement learning fine-tuning.
- Spherical Linear intERPolation (SLERP): Applied after RL fine-tuning.
- Linear Interpolation Towards Initialization (LITI): Applied after the SLERP stage.
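The three stages can be sketched on flat weight vectors. This is a simplified illustration in plain Python under stated assumptions, not the WARP implementation: the function names, the `mu`, `t`, and `eta` parameters and their default values are hypothetical, and SLERP is applied directly to toy weight vectors here for brevity.

```python
import math


def ema_update(ema_weights, policy_weights, mu=0.01):
    """EMA: move the running average a small step toward the current policy."""
    return [(1 - mu) * e + mu * p for e, p in zip(ema_weights, policy_weights)]


def slerp(a, b, t=0.5):
    """Spherical linear interpolation between two non-zero weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    s = math.sin(omega)
    if s < 1e-8:
        # Vectors are (anti)parallel: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def liti(init_weights, merged_weights, eta=0.5):
    """LITI: interpolate from the initialization toward the merged weights."""
    return [(1 - eta) * i + eta * m for i, m in zip(init_weights, merged_weights)]


# Toy two-parameter "models".
init = [0.0, 0.0]
policy_a = [1.0, 0.0]  # result of one RL fine-tuning run
policy_b = [0.0, 1.0]  # result of another RL fine-tuning run

merged = slerp(policy_a, policy_b, t=0.5)
final = liti(init, merged, eta=0.5)
print(merged, final)
```

Unlike plain averaging, SLERP preserves the norm when the two inputs have equal norm, and LITI then trades off the merged policy against the original initialization.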
Together, these changes target generation quality, training stability, and efficiency across a range of applications.
Gemma 2 Evaluation
How good are the Gemma models? Below are performance comparisons to other open models based on the Technical Report and the latest version of the Open LLM Leaderboard.
Technical Report Results
The Technical Report of Gemma 2 provides a detailed comparison of the performance of various open LLMs on the benchmarks used in the previous Open LLM Leaderboard.
Performance Comparison of Large Models
| Benchmark | Llama 3 (70B) | Qwen 1.5 (32B) | Gemma 2 (27B) |
|---|---|---|---|
| MMLU | 79.2 | 74.3 | 75.2 |
| GSM8K | 76.9 | 61.1 | 75.1 |
| ARC-c | 68.8 | 63.6 | 71.4 |
| HellaSwag | 88.0 | 85.0 | 86.4 |
| Winogrande | 85.3 | 81.5 | 83.7 |
Performance Comparison of Small Models
| Benchmark | Mistral (7B) | Llama 3 (8B) | Gemma (8B) | Gemma 2 (9B) |
|---|---|---|---|---|
| MMLU | 62.5 | 66.6 | 64.4 | 71.3 |
| GSM8K | 34.5 | 45.7 | 50.9 | 62.3 |
| ARC-c | 60.5 | 59.2 | 61.1 | 68.4 |
| HellaSwag | 83.0 | 82.0 | 82.3 | 81.9 |
| Winogrande | 78.5 | 78.5 | 79.0 | 80.6 |
These comparisons show that Gemma 2 is competitive for its size. The 27B model outscores Qwen 1.5 (32B) on all five benchmarks and beats the much larger Llama 3 (70B) on ARC-c. The 9B model leads the small-model group on MMLU, GSM8K, ARC-c, and Winogrande, with the largest margin on GSM8K, although it trails slightly on HellaSwag. The changes in training data and techniques such as knowledge distillation likely contribute to these results.