Multimodal Module

The Multimodal module enables AI agents to process and generate content across multiple modalities: images, audio, and video.

Overview

from openstackai.multimodal import Image, Audio, Video, MultimodalContent

Key Components

Component	Description
ImageContent	Image processing and analysis
AudioContent	Audio file handling
VideoContent	Video processing
MultimodalContent	Mixed content container

Quick Start

Image Analysis

from openstackai import ask
from openstackai.multimodal import Image

# Analyze an image
image = Image.from_file("photo.jpg")
response = ask("What's in this image?", images=[image])
print(response)

Multiple Images

images = [
    Image.from_file("before.jpg"),
    Image.from_file("after.jpg")
]

response = ask(
    "Compare these two images and describe the differences",
    images=images
)

From URL

image = Image.from_url("https://example.com/image.jpg")
response = ask("Describe this image", images=[image])

Base64 Encoded

import base64

with open("image.png", "rb") as f:
    data = base64.b64encode(f.read()).decode()

image = Image.from_base64(data, media_type="image/png")
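
When the file type is not known ahead of time, the media type can be inferred from the file name rather than hard-coded. The following is a minimal sketch using only the Python standard library; `encode_image` is a hypothetical helper, not part of openstackai.

```python
import base64
import mimetypes
from pathlib import Path


def encode_image(path):
    """Return (base64_text, media_type) for the image file at `path`.

    Hypothetical helper: the media type is guessed from the file
    extension, so a mislabeled file will yield the wrong type.
    """
    media_type, _ = mimetypes.guess_type(str(path))
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image media type for {str(path)!r}")
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return data, media_type
```

The returned pair lines up with the arguments shown above, so it could presumably be passed on as `Image.from_base64(data, media_type=media_type)`.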

MultimodalContent

Combine multiple types of content:

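from openstackai import Agent
from openstackai.multimodal import MultimodalContent, Image, Audio

# Build one message that mixes text, an image, and an audio clip
content = MultimodalContent()
content.add_text("Please analyze this meeting recording and slides:")
content.add_image(Image.from_file("slides.png"))
content.add_audio(Audio.from_file("meeting.mp3"))

# The model must accept every modality in the content (see Provider Support)
agent = Agent(name="MeetingAnalyzer", instructions="You analyze meeting recordings and slides.", model="gpt-4o")
response = agent.run(content)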

With Agents

from openstackai import Agent
from openstackai.multimodal import Image

agent = Agent(
    name="ImageAnalyzer",
    instructions="You are an expert at analyzing images.",
    model="gpt-4o"  # Vision-capable model
)

image = Image.from_file("diagram.png")
result = agent.run("Explain this diagram", images=[image])

Supported Formats

Images

PNG, JPEG, GIF, WebP
Maximum file size varies by model (typically 20 MB)
Auto-resizing available
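
The limits above can be checked locally before a file is sent to a model. This is a rough sketch using only the standard library. `check_image_file` is a hypothetical helper, and the 20 MB figure is only the "typical" limit quoted above (read here as 20 × 1024 × 1024 bytes, which is an assumption), so pass the real limit for your model.

```python
from pathlib import Path

# Extensions for the formats listed above (PNG, JPEG, GIF, WebP).
SUPPORTED_IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
# Assumed interpretation of "typically 20MB"; actual limits vary by model.
TYPICAL_MAX_BYTES = 20 * 1024 * 1024


def check_image_file(path, max_bytes=TYPICAL_MAX_BYTES):
    """Return a list of problems found; an empty list means the file looks fine."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_IMAGE_SUFFIXES:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    if p.is_file():
        size = p.stat().st_size
        if size > max_bytes:
            problems.append(f"file is {size} bytes, over the {max_bytes}-byte limit")
    else:
        problems.append("file does not exist")
    return problems
```

The check looks only at the extension and the size on disk. It does not decode the image, so a corrupt or mislabeled file would still pass.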

Audio

MP3, WAV, M4A, FLAC, OGG
Transcription integration

Video

MP4, MOV, WebM
Frame extraction for analysis
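
Frame extraction usually means sending a handful of representative frames rather than every frame. How openstackai selects frames is not documented here; the sketch below shows one common approach, evenly spaced sampling, as a plain function. `sample_frame_indices` is a hypothetical name used for illustration only.

```python
def sample_frame_indices(total_frames, count):
    """Pick up to `count` evenly spaced frame indices from a video.

    Illustrative only: the video is split into equal segments and the
    frame at the middle of each segment is chosen.
    """
    if total_frames <= 0 or count <= 0:
        raise ValueError("total_frames and count must be positive")
    count = min(count, total_frames)  # cannot sample more frames than exist
    segment = total_frames / count
    return [int((i + 0.5) * segment) for i in range(count)]
```

Taking the middle of each segment avoids always picking the first frame, which is often a black frame or a title card.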

Image Processing

from openstackai.multimodal import Image

image = Image.from_file("large_photo.jpg")

# Resize for API limits
image = image.resize(max_width=1024, max_height=1024)

# Convert format
image = image.convert(format="jpeg", quality=85)

# Get dimensions
print(f"Size: {image.width}x{image.height}")
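
To estimate the dimensions a bounded resize will produce, the arithmetic can be worked out separately. The sketch below assumes that `resize` preserves the aspect ratio and never upscales; the source does not confirm either behavior, and `fit_within` is a hypothetical helper.

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit inside the box, keeping aspect ratio.

    Assumption: images smaller than the box are left at their original size.
    """
    if min(width, height, max_width, max_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # Use the tighter of the two constraints, capped at 1.0 so we never upscale.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions, a 4000x3000 photo limited to 1024x1024 comes out at 1024x768, and an 800x600 image is left unchanged.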

Provider Support

Provider	Images	Audio	Video
OpenAI GPT-4o	✅	✅	✅
Anthropic Claude 3	✅	❌	❌
Google Gemini	✅	✅	✅
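
The table can also be encoded as a lookup, so that a mismatch between the content and the provider is caught before a request is made. This is a sketch only: the dictionary copies the table above, and `PROVIDER_SUPPORT` and `unsupported_modalities` are hypothetical names, not part of openstackai.

```python
# Capabilities copied from the Provider Support table above.
PROVIDER_SUPPORT = {
    "OpenAI GPT-4o": {"images", "audio", "video"},
    "Anthropic Claude 3": {"images"},
    "Google Gemini": {"images", "audio", "video"},
}


def unsupported_modalities(provider, needed):
    """Return the modalities in `needed` that `provider` is not listed as supporting."""
    try:
        supported = PROVIDER_SUPPORT[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return sorted(set(needed) - supported)
```

For example, `unsupported_modalities("Anthropic Claude 3", ["images", "audio"])` returns `["audio"]`, which signals that the audio would need to be transcribed first or sent to a different provider.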
