Paul Krill
Editor at Large

Google introduces PaliGemma 2 vision-language AI models

news
Dec 5, 2024 • 2 mins

A family of tunable vision-language models based on Gemma 2 generates long captions for images, describing the actions, emotions, and narrative of the scene.


Google has introduced a new family of PaliGemma vision-language models, offering scalable performance, long captioning, and support for specialized tasks.

PaliGemma 2 was announced December 5, nearly seven months after the initial version launched as the first vision-language model in the Gemma family. Building on Gemma 2, PaliGemma 2 models can see, understand, and interact with visual input, according to Google.

PaliGemma 2 makes it easier for developers to add sophisticated vision-language features to apps, Google said, including captioning abilities such as identifying emotions and actions in images. Scalable performance means PaliGemma 2 can be optimized for any task via multiple model sizes (3B, 10B, and 28B parameters) and input resolutions (224px, 448px, and 896px). Long captioning in PaliGemma 2 generates detailed, contextually relevant captions for images, going beyond simple object identification to describe actions, emotions, and the overall narrative of the scene, Google said.
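The size/resolution grid above can be sketched as a small helper that selects a pretrained checkpoint. Note that the `google/paligemma2-{size}-pt-{resolution}` repo-id pattern below is an assumption based on common Hugging Face Hub naming, not something stated in the announcement; verify the exact ids against the published model cards.

```python
# Sketch: pick a PaliGemma 2 variant from the documented grid of
# model sizes (3B, 10B, 28B parameters) and resolutions (224/448/896 px).
# The "google/paligemma2-{size}-pt-{resolution}" id pattern is an
# assumed Hub naming convention; check the actual model cards.

SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448, 896)

def paligemma2_checkpoint(size: str, resolution: int) -> str:
    """Return the assumed Hub id for a pretrained PaliGemma 2 variant."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}, got {resolution}")
    return f"google/paligemma2-{size}-pt-{resolution}"

# Smallest, fastest variant for prototyping:
print(paligemma2_checkpoint("3b", 224))  # google/paligemma2-3b-pt-224
```

A checkpoint id like this could then be passed to a loader such as Hugging Face Transformers' `PaliGemmaForConditionalGeneration.from_pretrained(...)`, trading accuracy against latency by moving along the size and resolution axes.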

PaliGemma 2 can tackle specialized tasks with state-of-the-art performance, Google said, including accurate optical character recognition and understanding the structure and content of tables in documents. Google research has shown leading performance on chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation, the company added.

PaliGemma 2 is designed as a drop-in replacement for the existing PaliGemma model, offering a range of model sizes with performance gains on most tasks without major code modifications, Google said. The models can also be fine-tuned on specific tasks and data sets.


Paul Krill is editor at large at InfoWorld. Paul has been covering computer technology as a news and feature reporter for more than 35 years, including 30 years at InfoWorld. He has specialized in coverage of software development tools and technologies since the 1990s, and he continues to lead InfoWorld’s news coverage of software development platforms including Java and .NET and programming languages including JavaScript, TypeScript, PHP, Python, Ruby, Rust, and Go. Long trusted as a reporter who prioritizes accuracy, integrity, and the best interests of readers, Paul is sought out by technology companies and industry organizations who want to reach InfoWorld’s audience of software developers and other information technology professionals. Paul has won a “Best Technology News Coverage” award from IDG.
