Quick Start
Configure your Anthropic client to point to Ollama:Supported Endpoints
/v1/messages
Send messages to models and receive responses in Anthropic format.
Supported Features
- ✅ Multi-turn conversations
- ✅ System prompts
- ✅ Streaming responses
- ✅ Vision (images)
- ✅ Tool/function calling
- ✅ Thinking/extended thinking
- ✅ Tool results
Request Parameters
| Parameter | Type | Description | Support |
|---|---|---|---|
model | string | Model name | ✅ |
max_tokens | integer | Maximum tokens to generate | ✅ |
messages | array | Conversation messages | ✅ |
system | string/array | System prompt | ✅ |
temperature | number | Sampling temperature (0-1) | ✅ |
top_p | number | Nucleus sampling | ✅ |
top_k | integer | Top-k sampling | ✅ |
stream | boolean | Enable streaming | ✅ |
stop_sequences | array | Stop sequences | ✅ |
tools | array | Available tools | ✅ |
thinking | object | Extended thinking config | ✅ |
tool_choice | object | Force specific tool | ❌ |
metadata | object | Request metadata | ❌ |
Response Format
message = client.messages.create(
model='llama3.2',
max_tokens=1024,
messages=[{
'role': 'user',
'content': 'Write a haiku about programming'
}]
)
print(message.content[0].text)
message = client.messages.create(
model='llama3.2',
max_tokens=1024,
system='You are an expert Python developer.',
messages=[{
'role': 'user',
'content': 'How do I read a file in Python?'
}]
)
message = client.messages.create(
model='llama3.2',
max_tokens=1024,
messages=[
{'role': 'user', 'content': 'What is 2+2?'},
{'role': 'assistant', 'content': '2+2 equals 4.'},
{'role': 'user', 'content': 'What about 2+3?'}
]
)
with client.messages.stream(
model='llama3.2',
max_tokens=1024,
messages=[{'role': 'user', 'content': 'Count from 1 to 10'}]
) as stream:
for text in stream.text_stream:
print(text, end='', flush=True)
import base64
with open('image.jpg', 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')
message = client.messages.create(
model='llava',
max_tokens=1024,
messages=[{
'role': 'user',
'content': [
{'type': 'text', 'text': "What's in this image?"},
{
'type': 'image',
'source': {
'type': 'base64',
'media_type': 'image/jpeg',
'data': image_data
}
}
]
}]
)
tools = [{
'name': 'get_weather',
'description': 'Get the current weather for a location',
'input_schema': {
'type': 'object',
'properties': {
'location': {
'type': 'string',
'description': 'City name'
},
'unit': {
'type': 'string',
'enum': ['celsius', 'fahrenheit'],
'description': 'Temperature unit'
}
},
'required': ['location']
}
}]
message = client.messages.create(
model='llama3.2',
max_tokens=1024,
tools=tools,
messages=[{
'role': 'user',
'content': 'What is the weather in San Francisco?'
}]
)
# Check for tool calls
for block in message.content:
if block.type == 'tool_use':
print(f"Tool: {block.name}")
print(f"Input: {block.input}")
message = client.messages.create(
model='llama3.2',
max_tokens=1024,
tools=tools,
messages=[
{'role': 'user', 'content': 'What is the weather in Tokyo?'},
{
'role': 'assistant',
'content': [{
'type': 'tool_use',
'id': 'call_123',
'name': 'get_weather',
'input': {'location': 'Tokyo', 'unit': 'celsius'}
}]
},
{
'role': 'user',
'content': [{
'type': 'tool_result',
'tool_use_id': 'call_123',
'content': '22°C, partly cloudy'
}]
}
]
)
message = client.messages.create(
model='deepseek-reasoner',
max_tokens=2048,
thinking={'type': 'enabled'},
messages=[{
'role': 'user',
'content': 'Solve this logic puzzle: ...'
}]
)
# Access thinking content
for block in message.content:
if block.type == 'thinking':
print(f"Reasoning: {block.thinking}")
elif block.type == 'text':
print(f"Answer: {block.text}")
Using with Claude Code
Claude Code can use Ollama as its backend for local, private code assistance.Quick Setup
- Prompt you to select a model
- Configure Claude Code automatically
- Launch Claude Code
Manual Setup
Set environment variables and run Claude Code:Recommended Models for Coding
Qwen3 Coder
GLM-4.7 Cloud
MiniMax M2.1
DeepSeek Coder
Streaming Events
When streaming is enabled, Ollama sends Server-Sent Events (SSE) in Anthropic format:Event Types
| Event | Description |
|---|---|
message_start | Start of response with initial metadata |
content_block_start | Start of a content block (text, thinking, tool_use) |
content_block_delta | Incremental content update |
content_block_stop | End of a content block |
message_delta | Message-level updates (stop_reason, usage) |
message_stop | End of message |
ping | Keep-alive ping |
error | Error event |
Delta Types
| Delta Type | Field | Description |
|---|---|---|
text_delta | text | Text content chunk |
thinking_delta | thinking | Thinking/reasoning chunk |
input_json_delta | partial_json | Tool input JSON chunk |
Streaming Example
Migration Guide
Switching from Claude to Ollama
Download and install from ollama.com.
# Before (Anthropic Claude)
client = anthropic.Anthropic(
api_key=os.environ['ANTHROPIC_API_KEY']
)
# After (Ollama)
client = anthropic.Anthropic(
base_url='http://localhost:11434',
api_key='ollama' # required but ignored
)
Model Name Aliases
For applications expecting Claude model names:Differences from Anthropic API
Behavior Differences
- API Key: Accepted but not validated (use any string)
- Version Header:
anthropic-versionheader is accepted but not enforced - Token Counts: Approximations based on the model’s tokenizer
- Request IDs: Generated locally, not from Anthropic servers
Not Supported
Token Counting
/v1/messages/count_tokens endpointTool Choice
tool_choice parameter (auto/any/tool/none)Prompt Caching
cache_control blocks for cachingBatches API
/v1/messages/batches for async processingPDF Support
document content blocks with PDFsCitations
citations content blocksPartial Support
| Feature | Status |
|---|---|
| Image content | ✅ Base64 / ❌ URLs |
| Extended thinking | ✅ Basic / ⚠️ budget_tokens not enforced |
| Web search | ✅ Built-in tool (Ollama extension) |
Advanced Features
Web Search (Ollama Extension)
Ollama extends Anthropic’s API with built-in web search:Web search requires internet connectivity and is subject to Ollama’s cloud service availability.
Content Block Types
Ollama supports all standard Anthropic content block types:text- Text contentimage- Base64-encoded imagestool_use- Tool/function callstool_result- Tool execution resultsthinking- Reasoning/thinking contentserver_tool_use- Server-side tool calls (web_search)web_search_tool_result- Web search results
Best Practices
Choosing Models
Choosing Models
Select models based on your use case:
- Chat:
llama3.2,mistral,gpt-oss:20b - Coding:
qwen3-coder,deepseek-coder,codellama - Vision:
llava,bakllava - Reasoning:
deepseek-reasoner
Optimizing Context
Optimizing Context
- Use appropriate
max_tokensto control response length - Clear conversation history periodically for long sessions
- Use
systemprompts to set consistent behavior
Error Handling
Error Handling
Handle errors gracefully:
Examples
Complete Chatbot
Tool-based Weather Agent
Troubleshooting
Connection Errors
Connection Errors
Ensure Ollama is running:Check the service is accessible:
Model Not Found
Model Not Found
Pull the model first:List available models:
Slow Responses
Slow Responses
- Check GPU memory usage
- Use smaller models for faster inference
- Reduce
max_tokensfor shorter responses - Close other applications using GPU
Token Limit Errors
Token Limit Errors
The
max_tokens parameter is required in Anthropic API:Resources
Ollama Models
Browse available models
Anthropic SDK
Anthropic Python library
Claude Code
Claude’s coding assistant
Community
Join the Ollama community