Vision models accept images alongside text so the model can describe, classify, and answer questions about what it sees.

Quick start

Pass image file paths directly in the CLI:
ollama run gemma3 ./image.png "What's in this image?"
Or use image URLs:
ollama run gemma3 https://example.com/image.jpg "Describe this"

Supported vision models

Many models in the Ollama library support vision. Browse all vision models on Ollama.

Multiple images

Some models support analyzing multiple images in a single request:
from ollama import chat

response = chat(
  model='gemma3',
  messages=[
    {
      'role': 'user',
      'content': 'What are the differences between these images?',
      'images': ['./image1.jpg', './image2.jpg', './image3.jpg'],
    }
  ],
)

print(response.message.content)
Some models, such as llama3.2-vision, accept only one image per request. Check the model’s documentation for its limitations.
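For a model that accepts only one image per request, one possible workaround is to send each image in its own request. The sketch below only builds the message lists. The helper name `split_into_single_image_messages` is hypothetical and not part of the Ollama SDK, and the commented-out call assumes the `chat` function shown above.

```python
def split_into_single_image_messages(prompt, image_paths):
    """Return one single-message conversation per image (hypothetical helper)."""
    return [
        [{'role': 'user', 'content': prompt, 'images': [path]}]
        for path in image_paths
    ]

conversations = split_into_single_image_messages(
    'Describe this image.',
    ['./image1.jpg', './image2.jpg'],
)

# Each conversation could then be sent on its own, for example:
# for messages in conversations:
#     response = chat(model='llama3.2-vision', messages=messages)
#     print(response.message.content)
```

With this approach the model never sees two images together, so it suits per-image tasks such as captioning rather than comparison.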

Image formats

Ollama supports multiple ways to provide images:

File paths (SDKs only)

images=['/path/to/image.jpg']

URLs (SDKs only)

images=['https://example.com/image.jpg']

Base64-encoded strings (All clients)

import base64
from pathlib import Path

img_data = base64.b64encode(Path('image.jpg').read_bytes()).decode()
images=[img_data]

Raw bytes (Python SDK)

from pathlib import Path

img_data = Path('image.jpg').read_bytes()
images=[img_data]
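Because base64 strings are the one format listed as working with all clients, it can help to normalize every input to base64 before building a request. The following is a minimal sketch using only the standard library. The helpers `to_base64` and `build_chat_payload` are hypothetical names, and the request-body shape (`model`, `messages`, `stream`) is an assumption about the chat endpoint that should be checked against the API reference.

```python
import base64
from pathlib import Path

def to_base64(image):
    """Return a base64 string for raw bytes or a file path (hypothetical helper)."""
    if isinstance(image, (bytes, bytearray)):
        data = bytes(image)
    else:
        data = Path(image).read_bytes()
    return base64.b64encode(data).decode('ascii')

def build_chat_payload(model, prompt, images):
    """Build a JSON-serializable request body with base64-encoded images."""
    return {
        'model': model,
        'messages': [{
            'role': 'user',
            'content': prompt,
            'images': [to_base64(image) for image in images],
        }],
        'stream': False,  # assumption: ask for a single, non-streamed response
    }

# Placeholder bytes stand in for real image data in this sketch.
payload = build_chat_payload('gemma3', 'Describe this', [b'\x89PNG placeholder bytes'])
```

The resulting dictionary can be serialized with `json.dumps` and sent by any HTTP client.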

Vision with chat history

Maintain conversation context across multiple turns:
from ollama import chat

messages = [
  {
    'role': 'user',
    'content': 'What is in this image?',
    'images': ['./chart.png']
  }
]

response = chat(model='gemma3', messages=messages)
messages.append(response.message)

# Follow-up question (no image needed)
messages.append({
  'role': 'user',
  'content': 'What is the trend in the data?'
})

response = chat(model='gemma3', messages=messages)
print(response.message.content)
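The pattern above (append the user turn, send the full history, append the reply) can be wrapped in a small helper. This is a sketch under stated assumptions: `ask` and `fake_send` are hypothetical names, and `fake_send` stands in for a real model call so the example runs without a server.

```python
def ask(messages, question, send, images=None):
    """Append a user turn, send the whole history, and record the reply."""
    turn = {'role': 'user', 'content': question}
    if images:
        turn['images'] = list(images)
    messages.append(turn)
    reply = send(messages)
    messages.append(reply)
    return reply

def fake_send(messages):
    # Stand-in for a real model call; reports how many messages it received.
    return {'role': 'assistant', 'content': f'received {len(messages)} message(s)'}

history = []
ask(history, 'What is in this image?', fake_send, images=['./chart.png'])
ask(history, 'What is the trend in the data?', fake_send)
```

In real use, `send` could be something like `lambda msgs: chat(model='gemma3', messages=msgs).message`. The image is attached only to the first turn, and later turns rely on the history being resent in full.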

Vision with structured outputs

Combine vision with JSON schema for structured image analysis:
from ollama import chat
from pydantic import BaseModel
from typing import Literal, Optional

class Object(BaseModel):
  name: str
  confidence: float
  attributes: str

class ImageDescription(BaseModel):
  summary: str
  objects: list[Object]
  scene: str
  colors: list[str]
  time_of_day: Literal['Morning', 'Afternoon', 'Evening', 'Night']
  setting: Literal['Indoor', 'Outdoor', 'Unknown']
  text_content: Optional[str] = None

response = chat(
  model='gemma3',
  messages=[{
    'role': 'user',
    'content': 'Describe this photo and list the objects you detect.',
    'images': ['path/to/image.jpg'],
  }],
  format=ImageDescription.model_json_schema(),
  options={'temperature': 0},
)

image_description = ImageDescription.model_validate_json(response.message.content)
print(image_description)
See Structured Outputs for more details.
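Once the response is structured, downstream code can filter it like any other JSON. The sketch below uses only the standard library and a hand-written sample string in the shape of `ImageDescription`. The sample data, the helper name `confident_objects`, and the 0.5 threshold are all illustrative assumptions, and the `confidence` values are whatever the model writes rather than calibrated probabilities.

```python
import json

# Hypothetical response text in the shape of ImageDescription.
sample = json.dumps({
    'summary': 'A dog on a beach',
    'objects': [
        {'name': 'dog', 'confidence': 0.94, 'attributes': 'brown, running'},
        {'name': 'ball', 'confidence': 0.41, 'attributes': 'red'},
    ],
    'scene': 'beach',
    'colors': ['blue', 'tan'],
    'time_of_day': 'Afternoon',
    'setting': 'Outdoor',
    'text_content': None,
})

def confident_objects(response_text, threshold=0.5):
    """Return the names of objects whose reported confidence meets the threshold."""
    data = json.loads(response_text)
    return [obj['name'] for obj in data['objects'] if obj['confidence'] >= threshold]

names = confident_objects(sample)
```

In practice, `response.message.content` would take the place of `sample`.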

Common use cases

Image captioning

from ollama import chat

response = chat(
  model='gemma3',
  messages=[{
    'role': 'user',
    'content': 'Write a brief caption for this image.',
    'images': ['./photo.jpg']
  }]
)

print(response.message.content)

OCR (Text extraction)

from ollama import chat

response = chat(
  model='gemma3',
  messages=[{
    'role': 'user',
    'content': 'Extract all text from this image.',
    'images': ['./document.png']
  }]
)

print(response.message.content)

Visual question answering

from ollama import chat

response = chat(
  model='gemma3',
  messages=[{
    'role': 'user',
    'content': 'How many people are in this image? What are they doing?',
    'images': ['./group_photo.jpg']
  }]
)

print(response.message.content)

Image comparison

from ollama import chat

response = chat(
  model='gemma3',
  messages=[{
    'role': 'user',
    'content': 'What changed between these two images?',
    'images': ['./before.jpg', './after.jpg']
  }]
)

print(response.message.content)
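The use cases above differ only in the prompt and the images, so they can share one message builder. This is a minimal sketch. The `PROMPTS` table and the `build_messages` helper are hypothetical and simply reuse the prompt strings from the examples above.

```python
# Prompt strings copied from the use-case examples; the task keys are made up.
PROMPTS = {
    'caption': 'Write a brief caption for this image.',
    'ocr': 'Extract all text from this image.',
    'compare': 'What changed between these two images?',
}

def build_messages(task, images):
    """Return a messages list for a named task (hypothetical helper)."""
    if task not in PROMPTS:
        raise ValueError(f'unknown task: {task!r}')
    return [{'role': 'user', 'content': PROMPTS[task], 'images': list(images)}]

messages = build_messages('ocr', ['./document.png'])

# The result could then be passed to the SDK, for example:
# response = chat(model='gemma3', messages=messages)
```

Keeping the prompts in one place makes it easier to refine their wording without touching the calling code.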

Tips

  • Use clear, specific prompts for better results
  • Provide high-quality images when possible, although models can handle a range of resolutions
  • For OCR tasks, ensure text is clearly visible in the image
  • Use structured outputs for consistent, parseable results
  • Set temperature: 0 in options for more deterministic responses
  • Consider using multiple images when comparing or analyzing relationships