A Deeper Look with VLM
PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLMs) inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. Taking both images and text as input, PaliGemma can answer questions about an image in detail, generate captions, detect objects, and read text embedded in images.
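To make the object detection capability concrete, here is a minimal sketch of how a detection answer might be decoded. It assumes the model returns each box as four `<locNNNN>` tokens in y_min, x_min, y_max, x_max order, normalized to a 1024-bin grid and followed by a label, with detections separated by ` ; `. The function name `parse_detections` and the sample string are hypothetical and only illustrate the idea.

```python
import re

# Assumed output format: four <locNNNN> tokens (y_min, x_min, y_max, x_max),
# each a bin index in [0, 1023], followed by the object label.
_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)")


def parse_detections(text, width, height):
    """Convert location tokens into pixel-space (x_min, y_min, x_max, y_max) boxes."""
    detections = []
    for y0, x0, y1, x1, label in _BOX.findall(text):
        detections.append({
            "label": label.strip(),
            "box": (
                int(x0) / 1024 * width,
                int(y0) / 1024 * height,
                int(x1) / 1024 * width,
                int(y1) / 1024 * height,
            ),
        })
    return detections


# Hypothetical model output for a 1024x1024 image.
sample = "<loc0256><loc0128><loc0512><loc0768> cat ; <loc0000><loc0000><loc1023><loc1023> dog"
for det in parse_detections(sample, width=1024, height=1024):
    print(det["label"], det["box"])
```

The boxes are scaled to the original image size rather than the model's input resolution, so they can be drawn directly on the source image.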
Model Categories
Choose the right PaliGemma model for your use case, from raw pre-trained checkpoints to ready-to-use mix models.
PaliGemma PT
General-purpose pre-trained models that require fine-tuning before use.
PaliGemma FT
Research-oriented models fine-tuned on specific datasets.
PaliGemma mix
Out-of-the-box models fine-tuned on a mixture of common tasks.
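With a mix model, the task is typically selected by a short text prefix passed alongside the image. The sketch below builds such prompts, assuming the commonly documented prefixes `caption {lang}`, `ocr`, `answer {lang} {question}`, and `detect {object} ; {object}`. The helper `build_prompt` is a hypothetical name, not part of any library.

```python
def build_prompt(task, text="", lang="en"):
    """Build a task prompt for a PaliGemma mix model (assumed prefix syntax)."""
    if task == "caption":
        return f"caption {lang}"
    if task == "ocr":
        return "ocr"
    if task == "answer":
        if not text:
            raise ValueError("the 'answer' task needs a question")
        return f"answer {lang} {text}"
    if task == "detect":
        # Accept "cat; dog" or "cat ; dog" and normalize the separator.
        objects = [obj.strip() for obj in text.split(";") if obj.strip()]
        if not objects:
            raise ValueError("the 'detect' task needs at least one object name")
        return "detect " + " ; ".join(objects)
    raise ValueError(f"unsupported task: {task!r}")


print(build_prompt("caption"))
print(build_prompt("answer", "How many cars are in the picture?"))
print(build_prompt("detect", "cat; dog"))
```

Pre-trained (PT) checkpoints are not expected to follow these prompts reliably until they have been fine-tuned.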
Architecture
PaliGemma combines the SigLIP vision encoder with a Gemma language decoder.
PaliGemma 2 Variants
Available in three sizes, based on the Gemma 2 2B, 9B, and 27B models.
Supported Resolutions
PaliGemma accepts image inputs at 224x224, 448x448, and 896x896 pixels, so higher resolutions can be used for tasks that depend on fine detail.
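Resolution matters because it determines how many image tokens the language model has to process. The sketch below works through that arithmetic and composes a checkpoint name from a size, category, and resolution. It rests on several assumptions: a 14-pixel SigLIP patch size, the square resolutions 224, 448, and 896, PaliGemma 2 sizes of 3B, 10B, and 28B for the Gemma 2 2B, 9B, and 27B bases, and a Hugging Face style naming pattern `google/paligemma2-{size}-{category}-{resolution}`. Check the model hub for the checkpoints that actually exist.

```python
PATCH_SIZE = 14                   # assumed SigLIP patch size in pixels
RESOLUTIONS = (224, 448, 896)     # assumed supported square input sizes
# Assumed mapping from Gemma 2 base size to PaliGemma 2 model size.
SIZES = {"2b": "3b", "9b": "10b", "27b": "28b"}
CATEGORIES = ("pt", "mix")        # FT checkpoints also carry a dataset name


def image_token_count(resolution):
    """Number of image tokens: one per patch on a square grid."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    patches_per_side = resolution // PATCH_SIZE
    return patches_per_side * patches_per_side


def checkpoint_name(gemma_size, category, resolution):
    """Compose a checkpoint ID following the assumed naming pattern."""
    if gemma_size not in SIZES:
        raise ValueError(f"unknown Gemma 2 size: {gemma_size}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    return f"google/paligemma2-{SIZES[gemma_size]}-{category}-{resolution}"


for res in RESOLUTIONS:
    print(res, image_token_count(res))   # 224 -> 256, 448 -> 1024, 896 -> 4096
print(checkpoint_name("2b", "mix", 448))
```

Doubling the resolution quadruples the number of image tokens, so the 896-pixel variants cost considerably more compute and memory than the 224-pixel ones.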