Quick Start
Simply point your OpenAI client to Ollama’s base URL and use any local model:Supported Endpoints
/v1/chat/completions
Generate chat completions with conversational context.
Supported Features
- ✅ Multi-turn conversations
- ✅ Streaming responses
- ✅ Vision (multimodal)
- ✅ Tool/function calling
- ✅ JSON mode & structured outputs
- ✅ Reproducible outputs (seed)
Request Parameters
| Parameter | Type | Description | Support |
|---|---|---|---|
model | string | Model name (e.g., “llama3.2”) | ✅ |
messages | array | Conversation messages | ✅ |
temperature | number | Sampling temperature (0-2) | ✅ |
top_p | number | Nucleus sampling | ✅ |
max_tokens | integer | Maximum tokens to generate | ✅ |
stream | boolean | Enable streaming | ✅ |
stream_options | object | Streaming options | ✅ |
stop | string/array | Stop sequences | ✅ |
seed | integer | Random seed for reproducibility | ✅ |
frequency_penalty | number | Penalize frequent tokens | ✅ |
presence_penalty | number | Penalize existing tokens | ✅ |
response_format | object | JSON mode or structured output | ✅ |
tools | array | Available tools/functions | ✅ |
logprobs | boolean | Return log probabilities | ✅ |
top_logprobs | integer | Number of top logprobs | ✅ |
tool_choice | object | Force specific tool | ❌ |
n | integer | Number of completions | ❌ |
user | string | User identifier | ❌ |
response = client.chat.completions.create(
model='llama3.2',
messages=[
{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Explain quantum computing'}
]
)
stream = client.chat.completions.create(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Count to 10'}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end='')
response = client.chat.completions.create(
model='llava',
messages=[{
'role': 'user',
'content': [
{'type': 'text', 'text': "What's in this image?"},
{'type': 'image_url', 'image_url': 'data:image/png;base64,...'}
]
}]
)
tools = [{
'type': 'function',
'function': {
'name': 'get_weather',
'description': 'Get current weather for a location',
'parameters': {
'type': 'object',
'properties': {
'location': {'type': 'string', 'description': 'City name'}
},
'required': ['location']
}
}
}]
response = client.chat.completions.create(
model='llama3.2',
messages=[{'role': 'user', 'content': 'What is the weather in Tokyo?'}],
tools=tools
)
# Check for tool calls
if response.choices[0].message.tool_calls:
for tool_call in response.choices[0].message.tool_calls:
print(f"Tool: {tool_call.function.name}")
print(f"Args: {tool_call.function.arguments}")
response = client.chat.completions.create(
model='llama3.2',
messages=[{
'role': 'user',
'content': 'List 3 colors in JSON format'
}],
response_format={'type': 'json_object'}
)
response = client.chat.completions.create(
model='llama3.2',
messages=[{
'role': 'user',
'content': 'Extract person info: John Doe is 30 years old'
}],
response_format={
'type': 'json_schema',
'json_schema': {
'schema': {
'type': 'object',
'properties': {
'name': {'type': 'string'},
'age': {'type': 'integer'}
},
'required': ['name', 'age']
}
}
}
)
/v1/completions
Generate text completions without conversational context.
Use
/v1/chat/completions for conversational AI. Use /v1/completions for text generation, code completion, and fill-in-the-middle tasks.Supported Parameters
| Parameter | Support |
|---|---|
model | ✅ |
prompt | ✅ (string only) |
suffix | ✅ |
temperature | ✅ |
top_p | ✅ |
max_tokens | ✅ |
stream | ✅ |
stop | ✅ |
seed | ✅ |
frequency_penalty | ✅ |
presence_penalty | ✅ |
logprobs | ✅ |
/v1/embeddings
Generate vector embeddings for text.
Supported Parameters
| Parameter | Support |
|---|---|
model | ✅ |
input | ✅ (string or array) |
encoding_format | ✅ (float or base64) |
dimensions | ✅ |
/v1/models
List all available models.
Example:
/v1/models/{model}
Retrieve information about a specific model.
Example:
/v1/images/generations (Experimental)
Generate images using image generation models.
Supported Parameters
| Parameter | Support |
|---|---|
model | ✅ |
prompt | ✅ |
size | ✅ |
response_format | ✅ (b64_json only) |
/v1/responses
Added in Ollama v0.13.3
Supported Features
- ✅ Streaming
- ✅ Tool calling
- ✅ Reasoning summaries (thinking models)
- ❌ Stateful requests (
previous_response_id,conversation)
Migration Guide
Switching from OpenAI to Ollama
Download and install Ollama from ollama.com.
# Before (OpenAI)
client = OpenAI(
api_key=os.environ['OPENAI_API_KEY']
)
# After (Ollama)
client = OpenAI(
base_url='http://localhost:11434/v1/',
api_key='ollama' # required but ignored
)
Model Name Aliases
For applications that expect default OpenAI model names:Configuration
Custom Context Size
Create a Modelfile to adjust context length:Differences from OpenAI API
Behavior Differences
- API Key: Accepted but not validated (use any string)
- Model Names: Use Ollama model names (e.g.,
llama3.2, notgpt-4) - Token Counts: Based on the underlying model’s tokenizer
- Timestamps:
createdfield reflects model’s last modified time - Ownership:
owned_bydefaults to the Ollama username or “library”
Not Supported
Multiple Completions
n parameter for generating multiple choicesUser Tracking
user parameter for tracking usersTool Choice
tool_choice to force specific tool usageLogit Bias
logit_bias for token-level biasBest Practices
Choosing the Right Model
Choosing the Right Model
Select models based on your use case:
- Chat:
llama3.2,mistral - Code:
codellama,deepseek-coder - Vision:
llava,bakllava - Embeddings:
nomic-embed-text,all-minilm
Optimizing Performance
Optimizing Performance
- Use streaming for better UX with long responses
- Set appropriate
max_tokensto control response length - Adjust
temperaturefor creativity vs. determinism - Use
seedfor reproducible outputs in testing
Error Handling
Error Handling
Handle errors gracefully:
Examples
Complete Chat Application
Troubleshooting
Connection Refused
Connection Refused
Ensure Ollama is running:
Model Not Found
Model Not Found
Pull the model first:
Slow Responses
Slow Responses
- Check available VRAM
- Use smaller models for faster inference
- Reduce
num_ctxin model configuration
Resources
Ollama Models
Browse available models
OpenAI SDK
OpenAI Python library
API Reference
Full API documentation
Community
Join the Ollama community