Google Launches PaliGemma 2 Vision-Language Models

Google has revealed the successor to its vision-language model PaliGemma, which was introduced in May 2024. The new version, PaliGemma 2, comes in three sizes, with parameter counts of 3, 10, and 28 billion, and resolution options of 224, 448, and 896 pixels.

Advanced Performance Features

According to the company, this model showcases "top-tier performance in recognizing chemical formulas, musical scores, spatial reasoning, and generating reports from chest X-rays."

Enhanced Captioning Abilities

PaliGemma 2 also supports long captioning, producing "thorough, contextually relevant captions for images that go beyond basic object recognition to include descriptions of actions, emotions, and the overall story of the scene."

Accessible and Flexible Options

The new models are designed as a "drop-in replacement" for the original PaliGemma across the various sizes, without the need for "significant code changes." Pre-trained versions are freely available on Hugging Face and Kaggle for anyone interested in testing them, and the models are supported by several frameworks, including Hugging Face Transformers, Keras, PyTorch, JAX, and gemma.cpp.
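As a rough illustration of the "drop-in" usage with Hugging Face Transformers, the sketch below loads a pre-trained checkpoint and generates a caption. The checkpoint name `google/paligemma2-3b-pt-224` and the `"caption en"` task prompt follow the conventions of the original PaliGemma; treat both as assumptions to check against the model cards on Hugging Face.

```python
def caption_image(image_path, prompt="caption en",
                  model_id="google/paligemma2-3b-pt-224"):
    """Generate a caption for an image with a PaliGemma 2 checkpoint.

    Heavy imports live inside the function so this sketch can be read
    (and imported) without transformers/torch installed. Downloading the
    checkpoint may require accepting the license on Hugging Face.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    # PaliGemma prompts are short task prefixes such as "caption en".
    inputs = processor(text=prompt, images=image,
                       return_tensors="pt").to(model.device)

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=64)

    # Decode only the newly generated tokens, skipping the prompt echo.
    prompt_len = inputs["input_ids"].shape[-1]
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Swapping in a larger checkpoint or a higher-resolution variant should only require changing `model_id`, which is what the "drop-in replacement" claim amounts to in practice.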

Google emphasizes that PaliGemma 2's "adaptability makes it easy to fine-tune for particular tasks and datasets, allowing you to customize its functions to meet your specific requirements."
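To make the fine-tuning claim concrete, here is a minimal single-step training sketch. It assumes the PaliGemma processor's `suffix` argument, which packs the target text as labels so the model returns a loss directly; the function and argument names here are illustrative, not an official recipe.

```python
def finetune_step(model, processor, image, prompt, target, optimizer):
    """One hedged gradient step on a (prompt, image) -> target pair.

    Assumes `processor` accepts a `suffix` keyword that turns the target
    text into labels for the language-modeling loss, as the PaliGemma
    processor in Transformers does.
    """
    inputs = processor(text=prompt, images=image, suffix=target,
                       return_tensors="pt").to(model.device)
    outputs = model(**inputs)   # loss is computed against the suffix labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

In practice one would loop this over a task-specific dataset, typically freezing the vision encoder and training only the language-model layers to keep memory use manageable.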
