Synopsis
ollama serve
Description
The serve command starts the Ollama server, which handles model loading, inference requests, and API endpoints. The server must be running for other Ollama commands to work.
The server provides:
- RESTful API for model inference
- Model management and caching
- Automatic GPU/CPU resource allocation
- WebSocket support for streaming responses
Arguments
None. The serve command takes no positional arguments.
Options
The serve command is configured entirely through environment variables.
Environment Variables
Server Configuration
The host and port the server listens on
Comma-separated list of allowed CORS origins for API requests
Enable debug logging
Model Management
Directory where models are stored
How long to keep models loaded in memory after use
- 0 - Unload immediately
- -1 - Keep loaded indefinitely
- Duration string: 5m, 1h, 30s
Maximum number of models to keep loaded simultaneously
Maximum number of requests to queue when all model slots are full
Disable automatic pruning of unused model layers
Model Context
Default context window size for models (in tokens)
Maximum number of parallel requests per model
Performance Tuning
Spread model layers across multiple GPUs when possible
Enable Flash Attention optimization (if supported by hardware)
KV cache type for optimization. Options: f16, q8_0, q4_0
Reserve GPU memory (in MB) for overhead when calculating model capacity
Maximum time to wait for a model to load
Advanced Options
Override LLM runtime library (advanced use only)
Disable cloud model features and connectivity
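Because the server reads these settings from its environment, they can be set for a single invocation by prefixing the command. The sketch below is a minimal illustration, not a recommended configuration: the variable names (OLLAMA_HOST, OLLAMA_MAX_LOADED_MODELS, OLLAMA_DEBUG, OLLAMA_NUM_PARALLEL, OLLAMA_KEEP_ALIVE) are assumptions, since the names are not given in the listing above, and the function names and values are hypothetical.

```shell
# Assumed variable names; the listing above gives only descriptions.
# Each function sets variables for one `ollama serve` invocation only.

# Listen on all interfaces so remote clients can connect.
serve_all_interfaces() { OLLAMA_HOST=0.0.0.0:11434 ollama serve; }

# Listen on a non-default port (the default is 11434).
serve_custom_port() { OLLAMA_HOST=127.0.0.1:8080 ollama serve; }

# Keep up to 3 models loaded at the same time.
serve_multi_model() { OLLAMA_MAX_LOADED_MODELS=3 ollama serve; }

# Enable debug logging.
serve_debug() { OLLAMA_DEBUG=1 ollama serve; }

# Illustrative combined setup; the values are placeholders, not tuned advice.
serve_production() {
  OLLAMA_HOST=0.0.0.0:11434 \
  OLLAMA_MAX_LOADED_MODELS=3 \
  OLLAMA_NUM_PARALLEL=4 \
  OLLAMA_KEEP_ALIVE=30m \
  ollama serve
}
```

Prefixing the assignments keeps them scoped to the server process instead of exporting them into the whole shell session.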
Examples
Start Server (Default)
Start the server with default settings:
Listen on All Interfaces
Allow remote connections:
Custom Port
Run on a different port:
Multiple Models
Keep up to 3 models loaded:
Debug Mode
Enable verbose logging:
Production Configuration
Example production setup:
Server Lifecycle
Starting the Server
The server performs these steps on startup:
- Initialize keypair at ~/.ollama/id_ed25519 (if it does not exist)
- Bind to configured host and port
- Load model registry and cache
- Begin accepting requests
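Scripts that start the server in the background usually need to know when the last step has been reached. One way to sketch this is to poll the base URL until it answers. The helper name wait_for_server is hypothetical, and the sketch assumes that curl is installed and that the server answers a plain GET on its base URL.

```shell
# wait_for_server [BASE_URL] [ATTEMPTS]
# Polls once per second until the server answers or the attempts run out.
# Hypothetical helper; assumes the base URL responds to a plain GET.
wait_for_server() {
  base_url=${1:-http://localhost:11434}
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent; only the exit status is used
    if curl -fs --max-time 2 -o /dev/null "$base_url/"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Possible usage:
#   ollama serve &
#   wait_for_server && echo "server is accepting requests"
```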
Graceful Shutdown
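To stop the server gracefully, interrupt the serve process (for example, press Ctrl+C in its terminal). The server then performs these steps:
- Stop accepting new requests
- Wait for in-progress requests to complete
- Unload all models
- Exit cleanly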
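When the server runs in the background rather than in a foreground terminal, the same shutdown can be requested with a signal. The sketch below assumes the server treats SIGTERM like an interrupt and needs some time to finish in-progress requests. The helper name stop_gracefully and the 30-second default are hypothetical.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, then waits for the process to exit on its own.
# Returns 0 once the process is gone, 1 if it is still running after the timeout.
stop_gracefully() {
  pid=$1
  timeout=${2:-30}
  # If the signal cannot be delivered, the process has already exited.
  kill -TERM "$pid" 2>/dev/null || return 0
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # still running; the caller decides whether to escalate
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Possible usage:
#   ollama serve & server_pid=$!
#   ...
#   stop_gracefully "$server_pid" 60
```

The helper deliberately never sends SIGKILL, since a forced kill would skip the wait for in-progress requests.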
Health Check
Check if the server is running:
Logs
Server logs are written to:
- Linux: stdout/stderr (capture with journalctl if using systemd)
- macOS: ~/.ollama/logs/server.log
- Windows: stdout/stderr
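The locations above translate into different viewing commands per platform. The following sketch prints a suitable command for a given OS name as reported by uname -s. The function name log_command is hypothetical, and the systemd unit name ollama is an assumption that depends on how the service was installed.

```shell
# log_command OS_NAME
# Prints a command for following the server logs on that platform.
log_command() {
  case "$1" in
    # Assumes the server runs under a systemd unit named "ollama".
    Linux)  echo "journalctl -u ollama -f" ;;
    # Log file location as listed above.
    Darwin) echo "tail -f $HOME/.ollama/logs/server.log" ;;
    # Elsewhere, logs appear on the serve process's own stdout/stderr.
    *)      echo "read stdout/stderr of the serve process" ;;
  esac
}

# Print the suggestion for the current machine.
log_command "$(uname -s)"
```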
Troubleshooting
Port Already in Use
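If the server cannot bind because the port is taken, a first step is to find out which process is listening on it. The sketch below uses lsof or ss, whichever is installed. The helper name port_owner is hypothetical, and 11434 is the default port from the base URL in this document.

```shell
# port_owner [PORT]
# Shows the process listening on PORT, using lsof or ss if available.
port_owner() {
  port=${1:-11434}
  if command -v lsof >/dev/null 2>&1; then
    # -nP: skip name lookups; restrict output to listening TCP sockets
    lsof -nP -iTCP:"$port" -sTCP:LISTEN
  elif command -v ss >/dev/null 2>&1; then
    # -l listening, -t TCP, -n numeric, -p show owning process
    ss -ltnp "sport = :$port"
  else
    echo "neither lsof nor ss found" >&2
    return 2
  fi
}

# Possible usage:
#   port_owner 11434
```

If the listener turns out to be another Ollama instance, either stop it or start the new server on a different port.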
Permission Denied
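One plausible cause, which the text above does not confirm, is that the user running the server cannot write to the model directory. The sketch below checks that for any directory. The helper name check_dir_writable is hypothetical, and the path ~/.ollama/models in the usage comment is an assumed default location.

```shell
# check_dir_writable DIR
# Reports whether the current user can write to DIR; returns 0 only if writable.
check_dir_writable() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ -w "$dir" ]; then
    echo "writable: $dir"
    return 0
  fi
  # Show the owner so it is clear which account should run the server.
  echo "not writable: $dir (owner: $(ls -ld "$dir" | awk '{print $3}'))"
  return 1
}

# Possible usage (assumed default path):
#   check_dir_writable "$HOME/.ollama/models"
```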
API Access
Once the server is running, you can access the REST API:
- Base URL: http://localhost:11434
- API Documentation: See API Reference
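As a quick way to try the API from a shell, the sketch below wraps curl in a small helper that prefixes the base URL. The helper name api_get is hypothetical. The endpoint paths /api/version and /api/tags are not taken from this page and should be checked against the API Reference.

```shell
# api_get PATH [BASE_URL]
# Performs a GET request against the server and prints the response body.
api_get() {
  path=$1
  base_url=${2:-http://localhost:11434}
  # -f: non-zero exit on HTTP errors, -sS: quiet but still report failures
  curl -fsS --max-time 10 "$base_url$path"
}

# Possible usage (endpoint paths are assumptions, see the API Reference):
#   api_get /api/version
#   api_get /api/tags
```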
Related Commands
- ollama run - Run a model (requires server)
- ollama ps - List running models
- ollama stop - Stop a model to free memory