Context and Memory

Understanding how Ollama manages conversation context and memory is crucial for building effective AI applications. This guide explains the underlying mechanisms and how to work with them.

What is Context?

Context refers to the information a language model has access to when generating a response. This includes:

Previous messages in the conversation
System instructions
The current user prompt
Any additional data (images, tool results, etc.)

The model uses this context to understand the conversation history and generate relevant, coherent responses.

Context Window

The context window (or context length) is the maximum number of tokens the model can process at once. It acts as the model’s “working memory.”

Tokens are chunks of text - typically words or word fragments. On average, 1 token ≈ 0.75 words in English.
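As a rough sizing aid, the 0.75 words-per-token rule of thumb can be turned into a quick estimate. This is a minimal sketch: `estimate_tokens` and `estimate_words` are hypothetical helpers, not part of Ollama, and real counts depend on the model's tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text; an approximation only


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of English text from its word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def estimate_words(tokens: int) -> int:
    """Roughly estimate how many English words fit in a given token budget."""
    return int(tokens * WORDS_PER_TOKEN)


print(estimate_tokens("Why is the sky blue?"))  # 5 words -> 7 tokens (estimate)
print(estimate_words(4096))                     # 3072 words
```

An estimate like this is only useful for planning; for exact numbers, rely on the token counts the API reports.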

Default Context Sizes

Ollama automatically sets context length based on available VRAM:

VRAM	Default Context	Approximate Words
< 24 GB	4K tokens	~3,000 words
24-48 GB	32K tokens	~24,000 words
≥ 48 GB	256K tokens	~192,000 words

Context Window Formula

Context Window = System Prompt + Message History + Current Prompt + Response Space

When the total exceeds the context window, older messages must be removed or the model will error.
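The formula can be checked in application code before a request is sent. The sketch below is illustrative only: `ContextBudget` is a hypothetical helper, and the token counts are made-up numbers rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token counts for each term of the context window formula."""
    system_prompt: int
    message_history: int
    current_prompt: int
    response_space: int

    def total(self) -> int:
        return (self.system_prompt + self.message_history
                + self.current_prompt + self.response_space)

    def overflow(self, num_ctx: int) -> int:
        """Return how many tokens must be trimmed to fit in num_ctx (0 if it fits)."""
        return max(0, self.total() - num_ctx)


budget = ContextBudget(system_prompt=200, message_history=3500,
                       current_prompt=150, response_space=512)
print(budget.total())         # 4362
print(budget.overflow(4096))  # 266 tokens of history would need to be removed
print(budget.overflow(8192))  # 0
```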

Setting Context Length

App Configuration

Adjust the context length slider in the Ollama app settings.

Environment Variable

Set the default context length globally:

OLLAMA_CONTEXT_LENGTH=64000 ollama serve

CLI Runtime Parameter

Change context during an interactive session:

/set parameter num_ctx 8192

API Request

Set context per request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum mechanics",
  "options": {
    "num_ctx": 8192
  }
}'

Modelfile

Set default context for a custom model:

FROM llama3.2

PARAMETER num_ctx 8192

How Context Management Works

KV Cache

Ollama uses a KV (Key-Value) cache to efficiently manage context:

Keys - The attention key vectors computed for each token position
Values - The attention value vectors computed for each token position

This cache allows the model to:

Reuse previous computations
Avoid reprocessing the entire context
Generate responses faster

The KV cache is stored in GPU memory (VRAM) or system RAM, depending on model offloading.
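To see why reuse matters, it helps to count how many token positions the model has to process with and without a cache. The sketch below is a simplified counting model, not Ollama's implementation: it assumes that without a cache every generation step reprocesses the whole sequence, and that with a cache the prompt is processed once and each later step handles a single new token.

```python
def positions_processed_without_cache(prompt_tokens: int, new_tokens: int) -> int:
    """Every generation step re-runs the model over the whole sequence so far."""
    return sum(prompt_tokens + step for step in range(new_tokens))


def positions_processed_with_cache(prompt_tokens: int, new_tokens: int) -> int:
    """The prompt is processed once; each later step processes one new token."""
    if new_tokens == 0:
        return 0
    return prompt_tokens + (new_tokens - 1)


# A 1,000-token prompt followed by 100 generated tokens.
print(positions_processed_without_cache(1000, 100))  # 104950
print(positions_processed_with_cache(1000, 100))     # 1099
```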

Context Types

Ollama implements different caching strategies based on the model:

Causal Cache

Standard cache for most models (Llama, Qwen, Mistral):

Stores all previous tokens in order
Linear growth with conversation length
Efficient for most use cases

Cache = kvcache.NewCausalCache(shift)

Sliding Window Cache

Used by models with sliding window attention (Mistral 3, Gemma 3):

Maintains a fixed-size window
Older tokens are automatically discarded
Bounded memory usage
Trade-off: loses very old context

Cache = kvcache.NewSWACache(windowSize, shift)

Hybrid Cache

Advanced caching for models like Qwen 3-Next:

Combines multiple caching strategies
Optimizes for specific model architectures
Supports advanced features like checkpointing

Cache = NewHybridCache(...)
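The bounded-memory behavior of a sliding window cache can be illustrated with a toy model. This is a sketch only: `SlidingWindowCache` is a hypothetical Python class that stores token IDs, whereas the real cache stores key and value tensors per layer.

```python
from collections import deque


class SlidingWindowCache:
    """Toy cache that keeps only the most recent `window_size` entries."""

    def __init__(self, window_size: int):
        # A deque with maxlen drops the oldest item automatically when full.
        self.entries = deque(maxlen=window_size)

    def append(self, token_id: int) -> None:
        self.entries.append(token_id)

    def visible_tokens(self) -> list[int]:
        return list(self.entries)


cache = SlidingWindowCache(window_size=4)
for token_id in range(1, 7):   # tokens 1..6
    cache.append(token_id)
print(cache.visible_tokens())  # [3, 4, 5, 6]
```

A causal cache, by contrast, would still hold all six tokens, which is why its memory use grows with the conversation.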

Context in Different APIs

Generate API

The /api/generate endpoint returns a context array that can be reused:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'

Response includes:

{
  "model": "llama3.2",
  "response": "The sky appears blue due to Rayleigh scattering...",
  "done": true,
  "context": [1234, 5678, 9012, ...]
}

Reuse the context in the next request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What other colors can it be?",
  "context": [1234, 5678, 9012, ...]
}'

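The context array is the tokenized conversation history. It is specific to the model that produced it and is not human-readable. The Ollama API reference marks the context parameter as deprecated, so prefer the Chat API for new multi-turn applications.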
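In application code, chaining requests means copying the context array from one response into the next request body. The sketch below only builds the JSON body and does not contact a server; `build_generate_request` is a hypothetical helper, and the token values are the placeholders from the example, not real tokens.

```python
import json


def build_generate_request(model: str, prompt: str, context=None) -> dict:
    """Build a JSON-serializable body for POST /api/generate."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if context is not None:
        body["context"] = context  # token array returned by the previous call
    return body


# A trimmed-down response shaped like the example (placeholder token values).
first_response = {"model": "llama3.2", "done": True, "context": [1234, 5678, 9012]}

follow_up = build_generate_request(
    "llama3.2", "What other colors can it be?",
    context=first_response.get("context"),
)
print(json.dumps(follow_up, indent=2))
```

Setting `"stream": false` asks for a single response object, which makes the context array easy to read from the final JSON.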

Chat API

The /api/chat endpoint automatically manages context through the messages array:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Hello!" },
    { "role": "assistant", "content": "Hi! How can I help you?" },
    { "role": "user", "content": "Tell me a joke." }
  ]
}'

Each message adds to the context automatically. The server:

Converts messages to tokens using the model’s template
Checks if total tokens fit in the context window
Handles overflow with truncation or shifting strategies
Generates a response
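Because every request carries the whole messages array, the client is responsible for appending each turn before the next call. A minimal sketch of that bookkeeping, assuming a hypothetical `ChatHistory` helper that is not part of any Ollama library:

```python
class ChatHistory:
    """Keeps the messages array that is sent with every /api/chat request."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def request_body(self, model: str) -> dict:
        # Copy the list so later additions do not mutate an already-built request.
        return {"model": model, "messages": list(self.messages)}


history = ChatHistory()
history.add("user", "Hello!")
history.add("assistant", "Hi! How can I help you?")
history.add("user", "Tell me a joke.")
body = history.request_body("llama3.2")
print(len(body["messages"]))  # 3
```

After each reply, the assistant message would be added with `history.add("assistant", ...)` so that the next request includes it.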

Context Overflow Strategies

When conversation history exceeds the context window, Ollama provides strategies to handle it:

Truncation

Removes older messages to fit within the context window:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [...],
  "truncate": true
}'

Behavior:

System message is always preserved
Oldest user/assistant messages are removed first
Most recent messages are kept
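The same behavior can be approximated on the client side, for example to trim history before sending it. This is a sketch under stated assumptions, not the server's algorithm: `count_tokens` is a crude word-count stand-in for a real tokenizer, and `truncate_messages` is a hypothetical helper.

```python
def count_tokens(message: dict) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(message["content"].split())


def truncate_messages(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the total fits in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(count_tokens(m) for m in system + rest)
    # Always keep at least the most recent message.
    while total > max_tokens and len(rest) > 1:
        total -= count_tokens(rest.pop(0))
    return system + rest


messages = [
    {"role": "system", "content": "You are a helpful assistant."},       # 5 words
    {"role": "user", "content": "Tell me about the history of Rome."},  # 7 words
    {"role": "assistant", "content": "Rome was founded in 753 BC."},    # 6 words
    {"role": "user", "content": "Who was the first emperor?"},          # 5 words
]
trimmed = truncate_messages(messages, max_tokens=17)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```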

Shifting

Moves the context window forward, keeping newer content:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [...],
  "shift": true
}'

Behavior:

Maintains a sliding window of recent context
Uses num_keep parameter to preserve important tokens
More efficient than reprocessing entire history

Without truncation or shifting enabled, the API will error when context is exceeded.

Memory Management Best Practices

For Long Conversations

Set a larger context window for multi-turn conversations:

OLLAMA_CONTEXT_LENGTH=16384 ollama serve

Implement conversation summarization to compress older context:

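# Periodically summarize old messages.
# summarize_messages is an application-defined helper, not part of Ollama.
if len(messages) > 10:
    summary = summarize_messages(messages[:5])
    # Replace the five oldest messages with a single summary message
    messages = [{'role': 'system', 'content': summary}] + messages[5:]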

For Code and Documents

Use extended context for large codebases or documents:

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[...],
    options={'num_ctx': 32768}  # ~24K words
)

Very large contexts (> 128K) significantly increase memory usage and processing time.

For RAG Applications

Balance context between retrieved documents and conversation history:

context_budget = 8192
document_tokens = 4096  # Reserve for retrieved docs
conversation_tokens = 4096  # Reserve for chat history
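A budget split like this should also leave room for the system prompt and the model's reply, in line with the context window formula. The sketch below shows one way to derive the split and to pack retrieved chunks into the document share; `split_budget` and `select_chunks` are hypothetical helpers, the 50/50 share is an arbitrary default, and the word-count token estimate is a simplification.

```python
def split_budget(num_ctx: int, response_tokens: int, system_tokens: int,
                 document_share: float = 0.5) -> dict:
    """Split what is left after the reply and system prompt between docs and chat."""
    available = num_ctx - response_tokens - system_tokens
    document_tokens = int(available * document_share)
    return {"documents": document_tokens,
            "conversation": available - document_tokens}


def select_chunks(chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep retrieved chunks, in ranked order, while they fit the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())  # crude estimate: one token per word
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


print(split_budget(8192, response_tokens=1024, system_tokens=256))
# {'documents': 3456, 'conversation': 3456}
print(select_chunks(["a b c", "d e", "f g h i"], budget_tokens=5))
# ['a b c', 'd e']
```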

For Agents and Tools

Allocate sufficient context for tool results and multi-step reasoning:

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[...],
    options={'num_ctx': 64000}  # Large context for complex tasks
)

Monitoring Context Usage

Check Current Context

View loaded models and their context allocation:

ollama ps

Output:

NAME             ID           SIZE     PROCESSOR   CONTEXT   UNTIL
gemma3:latest    a2af6cc3eb   6.6 GB   100% GPU    65536     2 minutes from now

The CONTEXT column shows the allocated context length.

API Metrics

The API returns token counts in the response:

{
  "model": "llama3.2",
  "done": true,
  "prompt_eval_count": 50,
  "eval_count": 100,
  "total_duration": 1234567890
}

Monitor these to understand context usage:

prompt_eval_count - Input tokens (including history)
eval_count - Output tokens
prompt_eval_count + eval_count ≈ current context usage
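These two counters are enough to track how full the context window is. A minimal sketch, assuming a hypothetical `context_usage` helper and a response dictionary shaped like the JSON in this section:

```python
def context_usage(response: dict, num_ctx: int) -> dict:
    """Summarize context usage from the token counts in an API response."""
    used = response.get("prompt_eval_count", 0) + response.get("eval_count", 0)
    return {
        "used": used,
        "remaining": max(0, num_ctx - used),
        "percent": round(100 * used / num_ctx, 1),
    }


response = {"model": "llama3.2", "done": True,
            "prompt_eval_count": 50, "eval_count": 100}
print(context_usage(response, num_ctx=4096))
# {'used': 150, 'remaining': 3946, 'percent': 3.7}
```

An application could use the percentage as a trigger, for example to start summarizing or trimming history once usage passes a chosen threshold.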

Context and Model Performance

Memory Usage

Context length directly impacts memory consumption:

Context	Approximate VRAM (7B parameter model)
4K	~4-6 GB
32K	~12-16 GB

Formula (approximate):

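VRAM ≈ Model Size + KV Cache Size
KV Cache Size ≈ 2 (keys and values) × Context Length × Layers × Hidden Dimensions × 2 bytes

The 2 bytes per element assume a 16-bit cache. Models that use grouped-query attention store fewer key/value heads than attention heads, so their cache is smaller than this estimate.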
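The cache term can be worked through numerically. The sketch below counts both keys and values for every layer and position, and uses `kv_heads × head_dim` as the per-layer width; the model shape is a hypothetical example chosen for round numbers, not the configuration of any particular model.

```python
def kv_cache_bytes(context_length: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Approximate KV cache size: keys and values for every layer and position."""
    return 2 * context_length * layers * kv_heads * head_dim * bytes_per_element


GIB = 1024 ** 3
# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128)
for ctx in (4096, 32768):
    print(ctx, kv_cache_bytes(ctx, **shape) / GIB)
# 4096 0.5
# 32768 4.0
```

The cache grows linearly with context length, so an 8x larger window needs 8x the cache memory on top of the fixed model size.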

Speed Considerations

Larger context windows slow down:

Initial prompt processing (prompt eval)
Attention computation during generation
Memory bandwidth utilization

Illustrative figures for a 7B parameter model:

Context	Prompt Speed	Generation Speed
4K	500 tok/s	50 tok/s
16K	300 tok/s	40 tok/s
32K	150 tok/s	30 tok/s

Actual speeds vary by hardware, model, and implementation.

Advanced Topics

Multi-Slot Context Caching

Ollama’s runner supports multiple concurrent contexts (slots):

// Server configuration
type Config struct {
    NumSlots int  // Number of independent context slots
}

This enables:

Parallel request processing
Per-user context isolation
Efficient resource sharing

Context Shifting

When enabled, shifting preserves recent context efficiently:

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[...],
    shift=True,
    options={
        'num_keep': 4,  # Keep the first 4 tokens when the context shifts
    }
)

How it works:

Calculate tokens needed for new input
If exceeds capacity, remove oldest tokens (after num_keep)
Shift KV cache forward
Continue generation
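The steps above can be sketched on a plain list of token IDs. This is a simplified model, not the runner's code: `shift_context` is a hypothetical function that drops only as many tokens as needed, whereas a real runner may discard a larger block at once and operates on the KV cache rather than a list.

```python
def shift_context(tokens: list[int], new_tokens: list[int],
                  num_ctx: int, num_keep: int) -> list[int]:
    """Make room for new_tokens by dropping the oldest tokens after num_keep."""
    needed = len(tokens) + len(new_tokens) - num_ctx
    if needed <= 0:
        return tokens + new_tokens       # everything fits, nothing to shift
    kept_prefix = tokens[:num_keep]      # protected tokens, never discarded
    shiftable = tokens[num_keep:]
    if needed > len(shiftable):
        raise ValueError("new input does not fit even after shifting")
    return kept_prefix + shiftable[needed:] + new_tokens


window = list(range(1, 9))               # context is full: tokens 1..8
print(shift_context(window, [9, 10], num_ctx=8, num_keep=2))
# [1, 2, 5, 6, 7, 8, 9, 10]
```

Tokens 3 and 4 are discarded, while the first `num_keep` tokens survive every shift.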

Checkpointing (Advanced Models)

Some models (Qwen 3-Next) support checkpointing:

Save intermediate cache states
Restore from checkpoints
Branch conversations
Implement speculative decoding

if cc, ok := cache.(kvcache.CheckpointCache); ok {
    checkpoint := cc.Checkpoint()
    // Later restore
    cc.RestoreCheckpoint(checkpoint)
}

Troubleshooting

Out of memory errors

Symptoms: Model fails to load or crashes during generation

Solutions:

Reduce num_ctx: PARAMETER num_ctx 2048
Use a smaller model or more quantized version
Enable context shifting: "shift": true
Check available memory: ollama ps

Context length exceeded

Symptoms: API returns an error about context length

Solutions:

Enable truncation: "truncate": true
Enable shifting: "shift": true
Increase context window: "options": {"num_ctx": 8192}
Reduce message history in your application

Slow generation

Symptoms: Responses take a long time to generate

Solutions:

Reduce context length if not needed
Ensure model is fully on GPU: ollama ps
Use smaller model for faster inference
Check if CPU offloading is occurring

Lost conversation context

Symptoms: Model doesn’t remember earlier parts of the conversation

Solutions:

Verify messages array includes full history
Check context isn’t being truncated too aggressively
Increase num_ctx to accommodate longer histories
Use the Chat API, which handles context automatically

Next Steps

Models

Learn about different model architectures and capabilities

Modelfile

Configure models with custom context settings

Context Length

Detailed guide on context length configuration

API Reference

Complete API documentation for context management