American tech giant Google is expanding its generative AI catalog with PaliGemma, a brand-new AI model. Announced at the recently concluded Google I/O, PaliGemma is a vision-language model (VLM) that understands image and text prompts together.
“Today, we’re excited to further expand the Gemma family with the introduction of PaliGemma, a powerful open vision-language model (VLM),” the company stated during the event. The model is inspired by PaLI-3, a smaller-scale VLM developed by Google Research. It integrates open components from both the SigLIP (Sigmoid Language-Image Pre-training) vision model and the Gemma language model.
According to Google, the model is designed for “class-leading fine-tune performance” on several tasks, including writing captions for images, answering visual questions, and understanding text in images. Google added, “We’re providing both pre-trained and fine-tuned checkpoints at multiple resolutions, as well as checkpoints specifically tuned to a mixture of tasks for immediate exploration.”
Unlike many of Google’s other AI models, PaliGemma is an open model. It is available to developers and researchers on platforms such as GitHub, Hugging Face, Kaggle, Vertex AI Model Garden, and ai.nvidia.com. Interested developers can also try the model in a dedicated Hugging Face Space. The launch of PaliGemma coincides with Google’s other AI announcements, such as Gemma 2 and Gemini 1.5 Flash.
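For readers who want to experiment, the Hugging Face transformers library includes a PaliGemma integration. The snippet below is a minimal sketch of captioning an image with one of the mixture-tuned checkpoints; the checkpoint id matches Google's public Hugging Face listing, while the image URL is a placeholder, not something from the announcement.

```python
# Minimal sketch: image captioning with PaliGemma via Hugging Face transformers.
# Assumes transformers >= 4.41; the image URL is a placeholder.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # checkpoint tuned on a mixture of tasks
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "caption en"  # task prefix; other prefixes cover VQA and OCR-style prompts

inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```

Swapping in a different prompt prefix, such as an English question, turns the same call into visual question answering, which is the "mixture of tasks" behavior Google describes.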