What is Context?
Context refers to the information a language model has access to when generating a response. This includes:- Previous messages in the conversation
- System instructions
- The current user prompt
- Any additional data (images, tool results, etc.)
Context Window
The context window (or context length) is the maximum number of tokens the model can process at once. It acts as the model’s “working memory.”Tokens are chunks of text - typically words or word fragments. On average, 1 token ≈ 0.75 words in English.
Default Context Sizes
Ollama automatically sets context length based on available VRAM:< 24 GB VRAM
4K tokens(~3,000 words)
24-48 GB VRAM
32K tokens(~24,000 words)
≥ 48 GB VRAM
256K tokens(~192,000 words)
Context Window Formula
Setting Context Length
App Configuration
Adjust the slider in Ollama app settings:
Environment Variable
Set the default context length globally:CLI Runtime Parameter
Change context during an interactive session:API Request
Set context per request:Modelfile
Set default context for a custom model:How Context Management Works
KV Cache
Ollama uses a KV (Key-Value) cache to efficiently manage context:- Keys - Represent the context positions
- Values - Store the computed representations
- Reuse previous computations
- Avoid reprocessing the entire context
- Generate responses faster
The KV cache is stored in GPU memory (VRAM) or system RAM, depending on model offloading.
Context Types
Ollama implements different caching strategies based on the model:- Causal Cache
- Sliding Window Cache
- Hybrid Cache
Standard cache for most models (Llama, Qwen, Mistral):
- Stores all previous tokens in order
- Linear growth with conversation length
- Efficient for most use cases
Context in Different APIs
Generate API
The/api/generate endpoint returns a context array that can be reused:
The
context array is a token representation of the conversation history. It’s model-specific and not human-readable.Chat API
The/api/chat endpoint automatically manages context through the messages array:
- Converts messages to tokens using the model’s template
- Checks if total tokens fit in the context window
- Handles overflow with truncation or shifting strategies
- Generates a response
Context Overflow Strategies
When conversation history exceeds the context window, Ollama provides strategies to handle it:Truncation
Removes older messages to fit within the context window:- System message is always preserved
- Oldest user/assistant messages are removed first
- Most recent messages are kept
Shifting
Moves the context window forward, keeping newer content:- Maintains a sliding window of recent context
- Uses
num_keepparameter to preserve important tokens - More efficient than reprocessing entire history
Memory Management Best Practices
For Long Conversations
Set a larger context window for multi-turn conversations:
For Code and Documents
Use extended context for large codebases or documents:
For RAG Applications
For Agents and Tools
Allocate sufficient context for tool results and multi-step reasoning:
Monitoring Context Usage
Check Current Context
View loaded models and their context allocation:CONTEXT column shows the allocated context length.
API Metrics
The API returns token counts in the response:prompt_eval_count- Input tokens (including history)eval_count- Output tokens- Total ≈ current context usage
Context and Model Performance
Memory Usage
Context length directly impacts memory consumption:4K Context
~4-6 GB VRAMFor a 7B parameter model
32K Context
~12-16 GB VRAMFor a 7B parameter model
Speed Considerations
Larger context windows slow down:
- Initial prompt processing (prompt eval)
- Attention computation during generation
- Memory bandwidth utilization
| Context | Prompt Speed | Generation Speed |
|---|---|---|
| 4K | 500 tok/s | 50 tok/s |
| 16K | 300 tok/s | 40 tok/s |
| 32K | 150 tok/s | 30 tok/s |
Actual speeds vary by hardware, model, and implementation.
Advanced Topics
Multi-Slot Context Caching
Ollama’s runner supports multiple concurrent contexts (slots):- Parallel request processing
- Per-user context isolation
- Efficient resource sharing
Context Shifting
When enabled, shifting preserves recent context efficiently:- Calculate tokens needed for new input
- If exceeds capacity, remove oldest tokens (after
num_keep) - Shift KV cache forward
- Continue generation
Checkpointing (Advanced Models)
Some models (Qwen 3-Next) support checkpointing:- Save intermediate cache states
- Restore from checkpoints
- Branch conversations
- Implement speculative decoding
Troubleshooting
Out of memory errors
Out of memory errors
Symptoms: Model fails to load or crashes during generationSolutions:
- Reduce
num_ctx:PARAMETER num_ctx 2048 - Use a smaller model or more quantized version
- Enable context shifting:
"shift": true - Check available memory:
ollama ps
Context length exceeded
Context length exceeded
Symptoms: API returns error about context lengthSolutions:
- Enable truncation:
"truncate": true - Enable shifting:
"shift": true - Increase context window:
"options": {"num_ctx": 8192} - Reduce message history in your application
Slow generation
Slow generation
Symptoms: Responses take a long time to generateSolutions:
- Reduce context length if not needed
- Ensure model is fully on GPU:
ollama ps - Use smaller model for faster inference
- Check if CPU offloading is occurring
Lost conversation context
Lost conversation context
Symptoms: Model doesn’t remember earlier conversationSolutions:
- Verify messages array includes full history
- Check context isn’t being truncated too aggressively
- Increase
num_ctxto accommodate longer histories - Use Chat API which handles context automatically
Next Steps
Models
Learn about different model architectures and capabilities
Modelfile
Configure models with custom context settings
Context Length
Detailed guide on context length configuration
API Reference
Complete API documentation for context management