Skip to main content
Ollama provides compatibility with the OpenAI API to help you integrate Ollama with existing applications that use OpenAI’s API.

Quick Start

Simply point your OpenAI client to Ollama’s base URL and use any local model:
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{
        'role': 'user',
        'content': 'Why is the sky blue?'
    }]
)
print(response.choices[0].message.content)

Supported Endpoints

/v1/chat/completions

Generate chat completions with conversational context.
Supported Features
  • ✅ Multi-turn conversations
  • ✅ Streaming responses
  • ✅ Vision (multimodal)
  • ✅ Tool/function calling
  • ✅ JSON mode & structured outputs
  • ✅ Reproducible outputs (seed)

Request Parameters

ParameterTypeDescriptionSupport
modelstringModel name (e.g., “llama3.2”)
messagesarrayConversation messages
temperaturenumberSampling temperature (0-2)
top_pnumberNucleus sampling
max_tokensintegerMaximum tokens to generate
streambooleanEnable streaming
stream_optionsobjectStreaming options
stopstring/arrayStop sequences
seedintegerRandom seed for reproducibility
frequency_penaltynumberPenalize frequent tokens
presence_penaltynumberPenalize existing tokens
response_formatobjectJSON mode or structured output
toolsarrayAvailable tools/functions
logprobsbooleanReturn log probabilities
top_logprobsintegerNumber of top logprobs
tool_choiceobjectForce specific tool
nintegerNumber of completions
userstringUser identifier
1
Basic Chat Completion
2
response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain quantum computing'}
    ]
)
3
Streaming Responses
4
stream = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Count to 10'}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
5
Vision (Multimodal)
6
response = client.chat.completions.create(
    model='llava',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': "What's in this image?"},
            {'type': 'image_url', 'image_url': 'data:image/png;base64,...'}
        ]
    }]
)
7
Image URLs not supported - Only base64-encoded images are currently supported.
8
Tool/Function Calling
9
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Get current weather for a location',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {'type': 'string', 'description': 'City name'}
            },
            'required': ['location']
        }
    }
}]

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is the weather in Tokyo?'}],
    tools=tools
)

# Check for tool calls
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        print(f"Tool: {tool_call.function.name}")
        print(f"Args: {tool_call.function.arguments}")
10
JSON Mode
11
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{
        'role': 'user',
        'content': 'List 3 colors in JSON format'
    }],
    response_format={'type': 'json_object'}
)
12
Structured Outputs
13
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{
        'role': 'user',
        'content': 'Extract person info: John Doe is 30 years old'
    }],
    response_format={
        'type': 'json_schema',
        'json_schema': {
            'schema': {
                'type': 'object',
                'properties': {
                    'name': {'type': 'string'},
                    'age': {'type': 'integer'}
                },
                'required': ['name', 'age']
            }
        }
    }
)

/v1/completions

Generate text completions without conversational context.
Use /v1/chat/completions for conversational AI. Use /v1/completions for text generation, code completion, and fill-in-the-middle tasks.

Supported Parameters

ParameterSupport
model
prompt✅ (string only)
suffix
temperature
top_p
max_tokens
stream
stop
seed
frequency_penalty
presence_penalty
logprobs
Example:
response = client.completions.create(
    model='codellama:code',
    prompt='def calculate_fibonacci(',
    suffix='    return result',
    max_tokens=150
)
print(response.choices[0].text)

/v1/embeddings

Generate vector embeddings for text.

Supported Parameters

ParameterSupport
model
input✅ (string or array)
encoding_format✅ (float or base64)
dimensions
Example:
response = client.embeddings.create(
    model='nomic-embed-text',
    input='Why is the sky blue?'
)
print(response.data[0].embedding)
Batch Embeddings:
response = client.embeddings.create(
    model='nomic-embed-text',
    input=['First text', 'Second text', 'Third text']
)

/v1/models

List all available models. Example:
models = client.models.list()
for model in models.data:
    print(model.id)

/v1/models/{model}

Retrieve information about a specific model. Example:
model = client.models.retrieve('llama3.2')
print(model.id, model.created)

/v1/images/generations (Experimental)

Experimental - This endpoint is experimental and may change in future versions.
Generate images using image generation models.

Supported Parameters

ParameterSupport
model
prompt
size
response_format✅ (b64_json only)
Example:
response = client.images.generate(
    model='stable-diffusion',
    prompt='A serene mountain landscape',
    size='1024x1024'
)
print(response.data[0].b64_json[:50])

/v1/responses

Added in Ollama v0.13.3
Support for the OpenAI Responses API (non-stateful).

Supported Features

  • ✅ Streaming
  • ✅ Tool calling
  • ✅ Reasoning summaries (thinking models)
  • ❌ Stateful requests (previous_response_id, conversation)
Example:
response = client.responses.create(
    model='qwen3:8b',
    input='Write a poem about the ocean'
)
print(response.output_text)

Migration Guide

Switching from OpenAI to Ollama

1
Install Ollama
2
Download and install Ollama from ollama.com.
3
Pull a Model
4
ollama pull llama3.2
5
Update Your Code
6
Change only the base_url and api_key:
7
# Before (OpenAI)
client = OpenAI(
    api_key=os.environ['OPENAI_API_KEY']
)

# After (Ollama)
client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama'  # required but ignored
)
8
Use Local Models
9
Replace OpenAI model names with Ollama models:
10
# Before
model='gpt-4'

# After
model='llama3.2'

Model Name Aliases

For applications that expect default OpenAI model names:
# Create an alias
ollama cp llama3.2 gpt-4

# Now you can use 'gpt-4' as the model name

Configuration

Custom Context Size

Create a Modelfile to adjust context length:
FROM llama3.2
PARAMETER num_ctx 8192
Create the model:
ollama create mymodel -f Modelfile
Use the custom model:
response = client.chat.completions.create(
    model='mymodel',
    messages=[...]
)

Differences from OpenAI API

Behavior Differences

  • API Key: Accepted but not validated (use any string)
  • Model Names: Use Ollama model names (e.g., llama3.2, not gpt-4)
  • Token Counts: Based on the underlying model’s tokenizer
  • Timestamps: created field reflects model’s last modified time
  • Ownership: owned_by defaults to the Ollama username or “library”

Not Supported

Multiple Completions

n parameter for generating multiple choices

User Tracking

user parameter for tracking users

Tool Choice

tool_choice to force specific tool usage

Logit Bias

logit_bias for token-level bias

Best Practices

Select models based on your use case:
  • Chat: llama3.2, mistral
  • Code: codellama, deepseek-coder
  • Vision: llava, bakllava
  • Embeddings: nomic-embed-text, all-minilm
  • Use streaming for better UX with long responses
  • Set appropriate max_tokens to control response length
  • Adjust temperature for creativity vs. determinism
  • Use seed for reproducible outputs in testing
Handle errors gracefully:
try:
    response = client.chat.completions.create(...)
except Exception as e:
    print(f"Error: {e}")

Examples

Complete Chat Application

import openai

client = openai.OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama'
)

conversation = [
    {'role': 'system', 'content': 'You are a helpful coding assistant.'}
]

while True:
    user_input = input('You: ')
    if user_input.lower() == 'quit':
        break
    
    conversation.append({'role': 'user', 'content': user_input})
    
    response = client.chat.completions.create(
        model='llama3.2',
        messages=conversation,
        stream=True
    )
    
    assistant_message = ''
    print('Assistant: ', end='')
    for chunk in response:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end='', flush=True)
            assistant_message += content
    print()
    
    conversation.append({'role': 'assistant', 'content': assistant_message})

Troubleshooting

Ensure Ollama is running:
ollama serve
Pull the model first:
ollama pull llama3.2
  • Check available VRAM
  • Use smaller models for faster inference
  • Reduce num_ctx in model configuration

Resources

Ollama Models

Browse available models

OpenAI SDK

OpenAI Python library

API Reference

Full API documentation

Community

Join the Ollama community