PaliGemma 2: Powerful Vision-Language Models

A Deeper Look with VLMs

PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLMs) inspired by PaLI-3 and built on open components: the SigLIP vision model and the Gemma language model. Taking both images and text as input, PaliGemma can answer questions in detail, caption images accurately, detect objects precisely, and read text embedded in images.

VLM

Vision-Language

3 Sizes

3B, 10B, 28B

Open

Weights

Model Categories

Choose the right PaliGemma model for your use case, from raw pre-trained checkpoints to ready-to-use mix models.

PaliGemma PT

General-purpose pre-trained models. These require fine-tuning before use.

PaliGemma FT

Research-oriented models fine-tuned on specific datasets.

PaliGemma mix

Out-of-the-box models tuned on a mixture of common tasks.
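Mix models are typically steered by a short task prefix in the text prompt. The sketch below shows one way to build such prompts. The prefix strings (`caption en`, `ocr`, `detect ...`, `answer en ...`) follow the commonly documented PaliGemma prompt syntax but are not stated on this page, so treat them as assumptions and check the model card for the checkpoint you use. `TASK_TEMPLATES` and `build_prompt` are hypothetical names introduced for illustration.

```python
# Assumed task prefixes for mix checkpoints; verify against the model card.
TASK_TEMPLATES = {
    "caption": "caption {lang}",
    "ocr": "ocr",
    "detect": "detect {target}",
    "vqa": "answer {lang} {question}",
}


def build_prompt(task: str, **fields: str) -> str:
    """Return the text prompt for a task; raises KeyError for an unknown task."""
    template = TASK_TEMPLATES[task]
    fields.setdefault("lang", "en")  # default to English unless told otherwise
    return template.format(**fields)  # unused fields are simply ignored


print(build_prompt("caption"))
print(build_prompt("detect", target="cat"))
print(build_prompt("vqa", question="What color is the car?"))
```

The resulting string would be passed to the model together with the image; PT checkpoints would first need fine-tuning on the task in question.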

Architecture

Combining powerful vision encoders with capable language decoders.

Images + Text

↓

Vision (SigLIP model) ➔ Language (Gemma model)

↓

Insights, Captions, Detection
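The flow above can be sketched at the level of tensor shapes. This is a toy illustration, not the actual implementation: it assumes the vision encoder emits one embedding per image patch, that a linear projection maps those embeddings into the language model's embedding space, and that the projected image tokens are placed in front of the text tokens. All function names and dimensions are hypothetical, and random numbers stand in for real model outputs.

```python
import random


def toy_vision_encoder(num_patches: int, vision_dim: int) -> list[list[float]]:
    """Stand-in for the vision model: one random embedding per image patch."""
    return [[random.random() for _ in range(vision_dim)] for _ in range(num_patches)]


def project(embeddings: list[list[float]], weight: list[list[float]]) -> list[list[float]]:
    """Linear projection with a (vision_dim x model_dim) weight matrix."""
    columns = list(zip(*weight))  # model_dim columns, each of length vision_dim
    return [[sum(v * w for v, w in zip(vec, col)) for col in columns] for vec in embeddings]


def build_decoder_input(image_tokens, text_tokens):
    """Concatenate image tokens and text tokens into one decoder input sequence."""
    return image_tokens + text_tokens


random.seed(0)
NUM_PATCHES, VISION_DIM, MODEL_DIM, NUM_TEXT_TOKENS = 4, 3, 5, 2  # toy sizes

patches = toy_vision_encoder(NUM_PATCHES, VISION_DIM)
weight = [[random.random() for _ in range(MODEL_DIM)] for _ in range(VISION_DIM)]
image_tokens = project(patches, weight)
text_tokens = [[random.random() for _ in range(MODEL_DIM)] for _ in range(NUM_TEXT_TOKENS)]

decoder_input = build_decoder_input(image_tokens, text_tokens)
print(len(decoder_input), len(decoder_input[0]))  # 6 5
```

The point of the sketch is that the language model sees a single sequence: image-derived tokens followed by prompt tokens, all in the same embedding dimension.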

PaliGemma 2 Variants

Available in three sizes (3B, 10B, and 28B), built on the Gemma 2 2B, 9B, and 27B language models respectively.
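As a small illustration of how the variant sizes line up with their language backbones, the lookup below pairs each PaliGemma 2 size with a Gemma 2 model. The pairing by ascending size is an assumption drawn from the sizes listed on this page, and `GEMMA2_BACKBONE` and `gemma2_backbone` are hypothetical names.

```python
# Assumed pairing: PaliGemma 2 size -> Gemma 2 language model it builds on.
GEMMA2_BACKBONE = {"3B": "2B", "10B": "9B", "28B": "27B"}


def gemma2_backbone(paligemma2_size: str) -> str:
    """Return the Gemma 2 size behind a PaliGemma 2 variant such as '10B'."""
    key = paligemma2_size.strip().upper()
    if key not in GEMMA2_BACKBONE:
        raise ValueError(f"unknown PaliGemma 2 size: {paligemma2_size!r}")
    return GEMMA2_BACKBONE[key]


for size in GEMMA2_BACKBONE:
    print(f"PaliGemma 2 {size} -> Gemma 2 {gemma2_backbone(size)}")
```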

Supported Resolutions

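PaliGemma checkpoints are offered at three square input resolutions: 224×224, 448×448, and 896×896 pixels. Higher resolutions suit tasks that depend on fine detail, such as reading small text, at the cost of more compute.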
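To see why resolution affects cost, it helps to count how many image tokens each resolution produces. The sketch below assumes the vision encoder splits a square image into non-overlapping 14×14 pixel patches and emits one token per patch. The patch size of 14 is an assumption not stated on this page; the resolutions used are the 224, 448, and 896 pixel square inputs published for PaliGemma checkpoints. `image_token_count` is a hypothetical helper.

```python
PATCH_SIZE = 14  # assumed patch size of the vision encoder
RESOLUTIONS = (224, 448, 896)  # square input resolutions in pixels


def image_token_count(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for a square image of the given side length."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


for resolution in RESOLUTIONS:
    print(resolution, image_token_count(resolution))
# 224 256
# 448 1024
# 896 4096
```

Doubling the side length quadruples the number of image tokens the language model has to process, which is the main reason the highest resolution is best reserved for tasks that need it.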