Understanding how Ollama manages conversation context and memory is crucial for building effective AI applications. This guide explains the underlying mechanisms and how to work with them.

What is Context?

Context refers to the information a language model has access to when generating a response. This includes:
  • Previous messages in the conversation
  • System instructions
  • The current user prompt
  • Any additional data (images, tool results, etc.)
The model uses this context to understand the conversation history and generate relevant, coherent responses.

Context Window

The context window (or context length) is the maximum number of tokens the model can process at once. It acts as the model’s “working memory.”
Tokens are chunks of text - typically words or word fragments. On average, 1 token ≈ 0.75 words in English.
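As a rough illustration of the rule of thumb above, the sketch below estimates a token count from a word count. The function name `estimate_tokens` is hypothetical, and the 0.75 words-per-token ratio is only the English average quoted above; real counts depend on the model's tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # English average quoted above; varies by tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate from the whitespace-separated word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_tokens("Why is the sky blue?"))  # 5 words -> 7 tokens (estimate)
```

An estimate like this is good enough for budgeting, but it should not be used to decide whether a request is exactly at the limit.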

Default Context Sizes

Ollama automatically sets context length based on available VRAM:

  • < 24 GB VRAM: 4K tokens (~3,000 words)
  • 24-48 GB VRAM: 32K tokens (~24,000 words)
  • ≥ 48 GB VRAM: 256K tokens (~192,000 words)
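The tiers above can be written as a small lookup. This is only a sketch of the table, not Ollama's own selection code; the function name is hypothetical, "K" is assumed to mean 1024 tokens, and exactly 48 GB is assigned to the top tier.

```python
def default_context_length(vram_gb: float) -> int:
    """Map available VRAM (in GB) to the default context length listed above."""
    if vram_gb < 24:
        return 4 * 1024      # 4K tokens
    if vram_gb < 48:
        return 32 * 1024     # 32K tokens
    return 256 * 1024        # 256K tokens


for vram in (8, 24, 80):
    print(vram, "GB ->", default_context_length(vram), "tokens")
```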

Context Window Formula

Context Window = System Prompt + Message History + Current Prompt + Response Space
When the total exceeds the context window, older messages must be removed; otherwise the request fails with an error.
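The formula can be rearranged to see how much room is left for the response. The sketch below is illustrative: the function name is hypothetical and the token counts are assumed to come from a tokenizer or an estimate.

```python
def response_space(context_window: int, system_tokens: int,
                   history_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for the response once the input is accounted for.

    A negative result means the input alone overflows the window.
    """
    return context_window - (system_tokens + history_tokens + prompt_tokens)


print(response_space(8192, system_tokens=200, history_tokens=6000, prompt_tokens=300))
# -> 1692
```

If the result is small or negative, either the history needs trimming or the context window needs to grow.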

Setting Context Length

App Configuration

Adjust the context length slider in the Ollama app settings.

Environment Variable

Set the default context length globally:
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

CLI Runtime Parameter

Change context during an interactive session:
/set parameter num_ctx 8192

API Request

Set context per request:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum mechanics",
  "options": {
    "num_ctx": 8192
  }
}'

Modelfile

Set default context for a custom model:
FROM llama3.2

PARAMETER num_ctx 8192

How Context Management Works

KV Cache

Ollama uses a KV (Key-Value) cache to efficiently manage context:
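  1. Keys - Per-token vectors that later tokens' attention queries are matched against
  2. Values - Per-token vectors that are combined, weighted by those matches, to produce the attention output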
This cache allows the model to:
  • Reuse previous computations
  • Avoid reprocessing the entire context
  • Generate responses faster
The KV cache is stored in GPU memory (VRAM) or system RAM, depending on model offloading.

Context Types

Ollama implements different caching strategies based on the model:
Standard cache for most models (Llama, Qwen, Mistral):
  • Stores all previous tokens in order
  • Linear growth with conversation length
  • Efficient for most use cases
Cache = kvcache.NewCausalCache(shift)

Context in Different APIs

Generate API

The /api/generate endpoint returns a context array that can be reused:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
Response includes:
{
  "model": "llama3.2",
  "response": "The sky appears blue due to Rayleigh scattering...",
  "done": true,
  "context": [1234, 5678, 9012, ...]
}
Reuse the context in the next request:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What other colors can it be?",
  "context": [1234, 5678, 9012, ...]
}'
The context array is a token representation of the conversation history. It’s model-specific and not human-readable.
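The same two-step exchange can be sketched in Python using only the standard library. The helper names are hypothetical; the endpoint, the `context` field, and the `stream` flag are the ones used by /api/generate, and a local server on the default port is assumed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address


def build_generate_payload(model, prompt, context=None):
    """Build a non-streaming /api/generate body, optionally carrying prior context."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if context is not None:
        payload["context"] = context
    return payload


def generate(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def ask_with_followup(model, first_prompt, second_prompt):
    """Send two prompts, passing the first reply's context into the second request."""
    first = generate(build_generate_payload(model, first_prompt))
    second = generate(
        build_generate_payload(model, second_prompt, context=first.get("context"))
    )
    return first["response"], second["response"]
```

Calling `ask_with_followup("llama3.2", "Why is the sky blue?", "What other colors can it be?")` requires a running server; the payload builder can be used on its own to inspect what would be sent.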

Chat API

The /api/chat endpoint automatically manages context through the messages array:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Hello!" },
    { "role": "assistant", "content": "Hi! How can I help you?" },
    { "role": "user", "content": "Tell me a joke." }
  ]
}'
Each message adds to the context automatically. The server:
  1. Converts messages to tokens using the model’s template
  2. Checks if total tokens fit in the context window
  3. Handles overflow with truncation or shifting strategies
  4. Generates a response
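On the client side, this means the application keeps the message list and resends it with every turn. A minimal sketch of such a holder follows; the `Conversation` class is hypothetical and only builds the request body, leaving the HTTP call to whichever client is in use.

```python
class Conversation:
    """Keeps the message list that is sent to /api/chat on every turn."""

    def __init__(self, model, system=None):
        self.model = model
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def add(self, role, content):
        """Append one message ("user" or "assistant") to the history."""
        self.messages.append({"role": role, "content": content})

    def payload(self, num_ctx=None):
        """Build the request body, optionally overriding the context length."""
        body = {"model": self.model, "messages": list(self.messages), "stream": False}
        if num_ctx is not None:
            body["options"] = {"num_ctx": num_ctx}
        return body


chat = Conversation("llama3.2", system="Answer briefly.")
chat.add("user", "Hello!")
chat.add("assistant", "Hi! How can I help you?")
chat.add("user", "Tell me a joke.")
print(len(chat.payload(num_ctx=8192)["messages"]))  # 4
```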

Context Overflow Strategies

When conversation history exceeds the context window, Ollama provides strategies to handle it:

Truncation

Removes older messages to fit within the context window:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [...],
  "truncate": true
}'
Behavior:
  • System message is always preserved
  • Oldest user/assistant messages are removed first
  • Most recent messages are kept
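The behavior listed above can also be approximated in application code before a request is sent. The following is a client-side sketch, not the server's implementation; `truncate_messages` and `rough_count` are hypothetical names, and the word-based counter stands in for a real tokenizer.

```python
def rough_count(text):
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_messages(messages, max_tokens, count_tokens=rough_count):
    """Drop the oldest non-system messages until the estimated total fits."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Always keep the most recent message, even if it alone exceeds the budget.
    while len(rest) > 1 and total(system) + total(rest) > max_tokens:
        rest.pop(0)
    return system + rest
```

System messages are moved to the front of the result, which matches the usual layout where the system message comes first.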

Shifting

Moves the context window forward, keeping newer content:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [...],
  "shift": true
}'
Behavior:
  • Maintains a sliding window of recent context
  • Uses num_keep parameter to preserve important tokens
  • More efficient than reprocessing entire history
Without truncation or shifting enabled, the API will error when context is exceeded.

Memory Management Best Practices

For Long Conversations

Set a larger context window for multi-turn conversations:
OLLAMA_CONTEXT_LENGTH=16384 ollama serve
Implement conversation summarization to compress older context:
# Periodically summarize old messages
if len(messages) > 10:
    summary = summarize_messages(messages[:5])
    messages = [{'role': 'system', 'content': summary}] + messages[5:]
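A slightly fuller sketch of the same idea is shown below. It differs from the snippet above in two assumed design choices: existing system messages are kept rather than summarized, and the most recent turns (rather than everything after the first five) are preserved verbatim. `summarize_messages` is a hypothetical placeholder; a real version would ask a model to write the summary.

```python
def summarize_messages(messages):
    """Hypothetical placeholder: a real version would ask a model for a summary."""
    joined = " ".join(m["content"] for m in messages)
    return "Summary of earlier conversation: " + joined[:200]


def compress_history(messages, max_messages=10, keep_recent=5):
    """Replace older turns with one summary message, preserving system messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= max_messages:
        return messages
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = {"role": "system", "content": summarize_messages(old)}
    return system + [summary] + recent
```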

For Code and Documents

Use extended context for large codebases or documents:
response = ollama.chat(
    model='llama3.2',
    messages=[...],
    options={'num_ctx': 32768}  # ~24K words
)
Very large contexts (> 128K) significantly increase memory usage and processing time.

For RAG Applications

Balance context between retrieved documents and conversation history:
context_budget = 8192       # Must match num_ctx
document_tokens = 4096      # Reserve for retrieved docs
conversation_tokens = 3072  # Reserve for system prompt and chat history
response_tokens = 1024      # Reserve for the model's reply
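One way to enforce the document share of the budget is to add retrieved documents in ranked order until the reservation is spent. This is a sketch under assumptions: `pack_documents` is a hypothetical helper, the documents are assumed to arrive best-first, and `count_tokens` is whatever tokenizer or estimate the application uses.

```python
def pack_documents(documents, document_budget, count_tokens):
    """Greedily keep retrieved documents, in ranked order, until the budget is spent."""
    selected, used = [], 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > document_budget:
            break  # stop at the first document that does not fit
        selected.append(doc)
        used += cost
    return selected, used


docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
selected, used = pack_documents(docs, 5, lambda text: len(text.split()))
print(len(selected), used)  # 2 5
```

Stopping at the first document that does not fit keeps the ranking intact; skipping it and trying smaller documents is an alternative when recall matters more than order.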

For Agents and Tools

Allocate sufficient context for tool results and multi-step reasoning:
response = ollama.chat(
    model='llama3.2',
    messages=[...],
    options={'num_ctx': 64000}  # Large context for complex tasks
)

Monitoring Context Usage

Check Current Context

View loaded models and their context allocation:
ollama ps
Output:
NAME             ID           SIZE     PROCESSOR   CONTEXT   UNTIL
gemma3:latest    a2af6cc3eb   6.6 GB   100% GPU    65536     2 minutes from now
The CONTEXT column shows the allocated context length.

API Metrics

The API returns token counts in the response:
{
  "model": "llama3.2",
  "done": true,
  "prompt_eval_count": 50,
  "eval_count": 100,
  "total_duration": 1234567890
}
Monitor these to understand context usage:
  • prompt_eval_count - Input tokens (including history)
  • eval_count - Output tokens
  • prompt_eval_count + eval_count ≈ tokens currently occupying the context window
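These two counters can be combined into a simple usage figure. The sketch below assumes the response has already been parsed into a dict and that the caller knows the `num_ctx` in effect; `context_usage` is a hypothetical helper name.

```python
def context_usage(response, num_ctx):
    """Return (tokens_used, fraction_of_window) from a parsed API response."""
    used = response.get("prompt_eval_count", 0) + response.get("eval_count", 0)
    return used, used / num_ctx


used, fraction = context_usage({"prompt_eval_count": 50, "eval_count": 100}, 4096)
print(used, f"{fraction:.1%}")  # 150 3.7%
```

Tracking this value per turn gives an early warning before the conversation reaches the limit.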

Context and Model Performance

Memory Usage

Context length directly impacts memory consumption:

  • 4K context: ~4-6 GB VRAM for a 7B parameter model
  • 32K context: ~12-16 GB VRAM for a 7B parameter model
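Formula (approximate):
VRAM ≈ Model Size + KV Cache
KV Cache ≈ 2 (keys and values) × Context Length × Layers × KV Heads × Head Dimension × Bytes per Element
For models without grouped-query attention, KV Heads × Head Dimension equals the hidden dimension; with a 16-bit cache, Bytes per Element is 2.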
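The approximate formula above can be turned into a small calculator. The sketch below uses the common per-token accounting of the KV cache (one key and one value vector per layer, each of size KV heads × head dimension). The model dimensions are hypothetical values for a 7B-class model without grouped-query attention, and a 16-bit cache is assumed; models with fewer KV heads or a quantized cache need proportionally less.

```python
def kv_cache_bytes(context_length, layers, kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    return 2 * context_length * layers * kv_heads * head_dim * bytes_per_element


GIB = 1024 ** 3
# Hypothetical 7B-class dimensions: 32 layers, 32 KV heads, head dimension 128
for ctx in (4096, 32768):
    size = kv_cache_bytes(ctx, layers=32, kv_heads=32, head_dim=128)
    print(f"{ctx} tokens: {size / GIB:.1f} GiB")
# 4096 tokens: 2.0 GiB
# 32768 tokens: 16.0 GiB
```

The cache grows linearly with context length, so an 8x larger window costs 8x the cache memory on top of the fixed model size.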

Speed Considerations

Larger context windows slow down:
  • Initial prompt processing (prompt eval)
  • Attention computation during generation
  • Memory bandwidth utilization
Illustrative example (7B parameter model):
Context   Prompt Speed   Generation Speed
4K        500 tok/s      50 tok/s
16K       300 tok/s      40 tok/s
32K       150 tok/s      30 tok/s
Actual speeds vary by hardware, model, and implementation.

Advanced Topics

Multi-Slot Context Caching

Ollama’s runner supports multiple concurrent contexts (slots):
// Server configuration
type Config struct {
    NumSlots int  // Number of independent context slots
}
This enables:
  • Parallel request processing
  • Per-user context isolation
  • Efficient resource sharing

Context Shifting

When enabled, shifting preserves recent context efficiently:
response = ollama.chat(
    model='llama3.2',
    messages=[...],
    shift=True,
    options={
        'num_keep': 4,  # Keep first 4 tokens (usually system prompt)
    }
)
How it works:
  1. Calculate tokens needed for new input
  2. If exceeds capacity, remove oldest tokens (after num_keep)
  3. Shift KV cache forward
  4. Continue generation
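The four steps can be simulated on a plain list of token IDs. This is an illustrative sketch only, not Ollama's cache code: `shift_context` is a hypothetical name, it discards just enough tokens to fit, and a real implementation may discard a larger batch and must also update the cached entries' positions.

```python
def shift_context(tokens, new_tokens, num_ctx, num_keep):
    """Make room for new_tokens by discarding the oldest tokens after the first num_keep."""
    overflow = len(tokens) + len(new_tokens) - num_ctx
    if overflow > 0:
        # Never discard the protected prefix of num_keep tokens.
        discard = max(0, min(overflow, len(tokens) - num_keep))
        tokens = tokens[:num_keep] + tokens[num_keep + discard:]
    return tokens + new_tokens


print(shift_context(list(range(8)), [100, 101], num_ctx=8, num_keep=2))
# [0, 1, 4, 5, 6, 7, 100, 101]
```

Tokens 2 and 3 are dropped because they are the oldest ones outside the protected prefix.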

Checkpointing (Advanced Models)

Some models (Qwen 3-Next) support checkpointing:
  • Save intermediate cache states
  • Restore from checkpoints
  • Branch conversations
  • Implement speculative decoding
if cc, ok := cache.(kvcache.CheckpointCache); ok {
    checkpoint := cc.Checkpoint()
    // Later restore
    cc.RestoreCheckpoint(checkpoint)
}

Troubleshooting

Symptoms: Model fails to load or crashes during generation
Solutions:
  • Reduce num_ctx: PARAMETER num_ctx 2048
  • Use a smaller model or more quantized version
  • Enable context shifting: "shift": true
  • Check how much memory loaded models are using: ollama ps
Symptoms: API returns error about context length
Solutions:
  • Enable truncation: "truncate": true
  • Enable shifting: "shift": true
  • Increase context window: "options": {"num_ctx": 8192}
  • Reduce message history in your application
Symptoms: Responses take a long time to generate
Solutions:
  • Reduce context length if not needed
  • Ensure model is fully on GPU: ollama ps
  • Use smaller model for faster inference
  • Check if CPU offloading is occurring
Symptoms: Model doesn’t remember earlier conversation
Solutions:
  • Verify messages array includes full history
  • Check context isn’t being truncated too aggressively
  • Increase num_ctx to accommodate longer histories
  • Use Chat API which handles context automatically

Next Steps

  • Models - Learn about different model architectures and capabilities
  • Modelfile - Configure models with custom context settings
  • Context Length - Detailed guide on context length configuration
  • API Reference - Complete API documentation for context management