Chat enables interactive, multi-turn conversations with language models. The chat endpoint receives the conversation history through the messages array sent with each request, which lets the model use context from earlier exchanges.

Quick start

Start an interactive chat session:
ollama run llama3.2
The CLI automatically maintains conversation history until you exit.

Multi-turn conversations

Maintain conversation history by appending each message to the messages array:
from ollama import chat

messages = [
  {'role': 'user', 'content': 'What is the capital of France?'}
]

response = chat(model='llama3.2', messages=messages)
messages.append(response.message)

# Follow-up question
messages.append({
  'role': 'user',
  'content': 'What is its population?'
})

response = chat(model='llama3.2', messages=messages)
print(response.message.content)
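The append-and-resend pattern above can be wrapped in a small helper. This is a minimal sketch rather than part of the library: `ask`, `chat_fn`, and `echo_chat` are hypothetical names introduced for illustration, and the helper assumes the response can be read as `response['message']['content']`. The stand-in function lets the example run without a server.

```python
def ask(messages, user_text, chat_fn, model='llama3.2'):
    """Append a user turn, call the model, and record the reply in place.

    chat_fn is any callable that accepts the same keyword arguments as
    ollama.chat (hypothetical parameter, injected so this runs offline).
    """
    messages.append({'role': 'user', 'content': user_text})
    response = chat_fn(model=model, messages=messages)
    reply = response['message']['content']
    # Store the reply so the next call sees the full conversation.
    messages.append({'role': 'assistant', 'content': reply})
    return reply


def echo_chat(model, messages):
    """Offline stand-in for a chat call: reports how many messages it saw."""
    return {'message': {'role': 'assistant',
                        'content': f'{len(messages)} message(s) received'}}


history = []
first = ask(history, 'What is the capital of France?', echo_chat)
second = ask(history, 'What is its population?', echo_chat)
print(first)
print(second)
print(len(history))
```

With a running server, the stand-in would be replaced by the real client function, for example `ask(history, 'What is its population?', chat)` after `from ollama import chat`.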

Message roles

The chat API supports four message roles:
  • user: Messages from the user/human
  • assistant: Messages from the AI model
  • system: Instructions that guide the model’s behavior
  • tool: Results from tool/function calls (see Tool Calling)
from ollama import chat

messages = [
  {
    'role': 'system',
    'content': 'You are a helpful assistant that speaks like a pirate.'
  },
  {
    'role': 'user',
    'content': 'Tell me about Python programming.'
  }
]

response = chat(model='llama3.2', messages=messages)
print(response.message.content)
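When messages are assembled from several places, it can help to check roles before sending. The sketch below is one possible approach; `VALID_ROLES`, `make_message`, and `with_system_prompt` are hypothetical helpers, not part of the client library.

```python
# Roles listed in this section; anything else is treated as a mistake.
VALID_ROLES = {'system', 'user', 'assistant', 'tool'}


def make_message(role, content):
    """Return a message dict, rejecting roles this section does not list."""
    if role not in VALID_ROLES:
        raise ValueError(f'unknown role: {role!r}')
    return {'role': role, 'content': content}


def with_system_prompt(system_text, messages):
    """Return a new list with exactly one system message at the front."""
    rest = [m for m in messages if m['role'] != 'system']
    return [make_message('system', system_text)] + rest


conversation = [make_message('user', 'Tell me about Python programming.')]
conversation = with_system_prompt(
    'You are a helpful assistant that speaks like a pirate.', conversation)
print([m['role'] for m in conversation])
```

The resulting list can be passed as the `messages` argument in the same way as the hand-written list above.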

API parameters

  • model (string, required): The model name (e.g., llama3.2, qwen3)
  • messages (array, required): Array of message objects with role and content fields
  • stream (boolean, default: true): Enable streaming responses (see Streaming)
  • format (string | object): Response format: "json" for JSON mode or a JSON schema object (see Structured Outputs)
  • options (object): Model options like temperature, top_p, num_ctx, etc.
  • keep_alive (duration, default: "5m"): How long to keep the model loaded in memory
  • tools (array): List of tools available for the model to call (see Tool Calling)
  • think (boolean | string): Enable thinking/reasoning mode (see Thinking)
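To show how these parameters fit together, here is a minimal sketch that assembles a request body as a plain dictionary. `build_chat_request` is a hypothetical helper, and the example values (`stream=False`, `temperature: 0`, `keep_alive='10m'`) are illustrative. The block only builds and prints the body; it does not send it, since the endpoint URL is not given in this section.

```python
import json


def build_chat_request(model, messages, stream=True, **extras):
    """Assemble a JSON-serializable body from the parameters listed above."""
    allowed = {'format', 'options', 'keep_alive', 'tools', 'think'}
    unknown = set(extras) - allowed
    if unknown:
        raise TypeError(f'unsupported parameters: {sorted(unknown)}')
    # model and messages are required; stream defaults to true.
    body = {'model': model, 'messages': messages, 'stream': stream}
    body.update(extras)
    return body


body = build_chat_request(
    'llama3.2',
    [{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=False,
    options={'temperature': 0},
    keep_alive='10m',
)
print(json.dumps(body, indent=2))
```

Rejecting unknown keys early is a design choice of this sketch: a misspelled parameter name fails locally instead of being sent along unnoticed.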

Response structure

When streaming is enabled (default), the response is a series of JSON objects:
{
  "model": "llama3.2",
  "created_at": "2024-12-09T21:07:55.186497Z",
  "message": {
    "role": "assistant",
    "content": "The "
  },
  "done": false
}
The final object has "done": true and includes metrics (duration fields are reported in nanoseconds):
{
  "model": "llama3.2",
  "created_at": "2024-12-09T21:07:55.186497Z",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done": true,
  "total_duration": 4648158584,
  "load_duration": 4071084,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 107345000,
  "eval_count": 298,
  "eval_duration": 4289432000
}
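A client that reads the stream usually joins the content fragments and keeps the final object for its metrics. The sketch below assumes each streamed object arrives as one JSON text line and that `eval_duration` is in nanoseconds; `collect_stream` and `tokens_per_second` are hypothetical helper names, and the sample lines are abbreviated versions of the objects shown above.

```python
import json


def collect_stream(lines):
    """Join streamed content fragments and return (text, final_chunk)."""
    parts = []
    final = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get('message', {}).get('content', ''))
        if chunk.get('done'):
            final = chunk
            break
    return ''.join(parts), final


def tokens_per_second(final_chunk):
    """Generation speed, assuming eval_duration is in nanoseconds."""
    return final_chunk['eval_count'] / (final_chunk['eval_duration'] / 1e9)


sample = [
    '{"message": {"role": "assistant", "content": "The "}, "done": false}',
    '{"message": {"role": "assistant", "content": "sky"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true, '
    '"eval_count": 298, "eval_duration": 4289432000}',
]
text, final = collect_stream(sample)
print(text)
print(round(tokens_per_second(final), 2))
```

The same division applied to `prompt_eval_count` and `prompt_eval_duration` gives the prompt processing speed.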

Tips

  • Store the entire messages array to maintain full conversation context
  • Include a system message at the start to set the assistant’s behavior
  • Use keep_alive to keep models loaded for faster subsequent requests
  • Set temperature: 0 in options for more deterministic responses
  • See Streaming for real-time response rendering