
Synopsis

ollama serve

Description

The serve command starts the Ollama server, which loads models, handles inference requests, and serves the HTTP API. Other Ollama commands such as run and pull talk to this server, so it must be running for them to work. The server provides:
  • RESTful API for model inference
  • Model management and caching
  • Automatic GPU/CPU resource allocation
  • Streaming responses over HTTP (newline-delimited JSON)

Arguments

None. The serve command takes no positional arguments.

Options

The serve command is configured entirely through environment variables.

Environment Variables

Server Configuration

OLLAMA_HOST
string
default:"127.0.0.1:11434"
The host and port the server listens on
OLLAMA_HOST=0.0.0.0:11434 ollama serve  # Listen on all interfaces
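Scripts that call the API usually need the same address the server was given. The sketch below derives a base URL from OLLAMA_HOST, falling back to the default listed above. The function name ollama_base_url is hypothetical, and the parsing is deliberately simple: it only distinguishes values with and without a scheme, which may not cover every form the server itself accepts.

```shell
# Build a base URL from OLLAMA_HOST.
# Assumption: the value is either host:port or a full URL with a scheme.
ollama_base_url() {
  host=${OLLAMA_HOST:-127.0.0.1:11434}
  case $host in
    *://*) printf '%s\n' "$host" ;;          # scheme already present: use as-is
    *)     printf 'http://%s\n' "$host" ;;   # bare host:port: assume plain HTTP
  esac
}

# Example use (requires a running server):
#   curl "$(ollama_base_url)/api/version"
```

Note that 0.0.0.0 is a bind address for the server; a client script would normally point at a concrete hostname or IP instead.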
OLLAMA_ORIGINS
string
default:"*"
Comma-separated list of allowed CORS origins for API requests
OLLAMA_ORIGINS="http://localhost:3000,https://example.com" ollama serve
OLLAMA_DEBUG
boolean
default:"false"
Enable debug logging
OLLAMA_DEBUG=1 ollama serve

Model Management

OLLAMA_MODELS
string
default:"~/.ollama/models"
Directory where models are stored
OLLAMA_MODELS=/mnt/models ollama serve
OLLAMA_KEEP_ALIVE
duration
default:"5m"
How long to keep models loaded in memory after use
  • 0 - Unload immediately
  • -1 - Keep loaded indefinitely
  • Duration string: 5m, 1h, 30s
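The same three value forms can also be sent per request: the /api/generate and /api/chat endpoints accept a keep_alive field that takes precedence over OLLAMA_KEEP_ALIVE for that model. As a minimal sketch, the hypothetical helper keep_alive_payload below builds such a request body, leaving plain numbers bare and quoting duration strings. The model name llama3.2 is only a placeholder.

```shell
# Print a JSON body that sets keep_alive for one model.
# Values made only of digits and minus signs (0, -1, 300) stay bare numbers;
# anything else (5m, 1h, 30s) is emitted as a JSON string.
keep_alive_payload() {
  model=$1
  keep_alive=$2
  case $keep_alive in
    ''|*[!0-9-]*) value="\"$keep_alive\"" ;;
    *)            value=$keep_alive ;;
  esac
  printf '{"model":"%s","keep_alive":%s}\n' "$model" "$value"
}

# Example use (requires a running server and a pulled model):
#   curl http://localhost:11434/api/generate -d "$(keep_alive_payload llama3.2 -1)"
```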
OLLAMA_MAX_LOADED_MODELS
integer
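default:"0"
Maximum number of models to keep loaded simultaneously. The default of 0 lets the server choose a limit automatically (per the Ollama FAQ, 3 per GPU, or 3 for CPU inference)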
OLLAMA_MAX_QUEUE
integer
default:"512"
Maximum number of requests to queue when all model slots are full
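When the queue is full, the server rejects further requests with an HTTP 503 response, so clients may want to retry with a delay. The following is a sketch of one way to do that, not a recommended policy: retry_on_503 and generate_status are hypothetical names, RETRY_DELAY is an assumed variable for the initial delay in seconds, and llama3.2 is a placeholder model.

```shell
# retry_on_503 MAX_ATTEMPTS COMMAND...
# COMMAND must print an HTTP status code. Retries with a doubling delay
# while the status is 503, then prints the final status.
retry_on_503() {
  max_attempts=$1
  shift
  attempt=1
  delay=${RETRY_DELAY:-1}
  while :; do
    status=$("$@")
    if [ "$status" != "503" ]; then
      printf '%s\n' "$status"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf '%s\n' "$status"
      return 1                      # still overloaded after the last attempt
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Hypothetical request that prints only the HTTP status code.
generate_status() {
  curl --silent --output /dev/null --write-out '%{http_code}' \
    http://localhost:11434/api/generate \
    -d '{"model":"llama3.2","prompt":"Hello","stream":false}'
}

# Example use (requires a running server):
#   retry_on_503 5 generate_status
```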
OLLAMA_NOPRUNE
boolean
default:"false"
Disable automatic pruning of unused model layers

Model Context

OLLAMA_CONTEXT_LENGTH
integer
default:"2048"
Default context window size for models (in tokens)
OLLAMA_CONTEXT_LENGTH=4096 ollama serve
OLLAMA_NUM_PARALLEL
integer
default:"1"
Maximum number of parallel requests per model
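The two settings in this section interact: according to the Ollama FAQ, parallel request slots multiply the context that must be allocated for a model, so memory use grows with both values. As a rough estimate only, the hypothetical helper below multiplies them; actual memory use also depends on the model and the KV cache type.

```shell
# Estimate the total context (in tokens) allocated for one loaded model:
# parallel slots multiplied by the per-request context length.
total_context_tokens() {
  echo $(( $1 * $2 ))
}

total_context_tokens 4 2048   # prints 8192

# Using the current environment, with this page's defaults as fallbacks:
total_context_tokens "${OLLAMA_NUM_PARALLEL:-1}" "${OLLAMA_CONTEXT_LENGTH:-2048}"
```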

Performance Tuning

OLLAMA_SCHED_SPREAD
boolean
default:"false"
Spread model layers across multiple GPUs when possible
OLLAMA_FLASH_ATTENTION
boolean
default:"false"
Enable Flash Attention optimization (if supported by hardware)
OLLAMA_KV_CACHE_TYPE
string
Quantization type for the K/V cache. Options: f16 (the default), q8_0, q4_0. The quantized types take effect only when Flash Attention is enabled
OLLAMA_GPU_OVERHEAD
integer
default:"0"
Amount of GPU memory (in bytes) to reserve per GPU as overhead when calculating model capacity
OLLAMA_LOAD_TIMEOUT
duration
default:"5m"
Maximum time to wait for a model to load

Advanced Options

OLLAMA_LLM_LIBRARY
string
Override LLM runtime library (advanced use only)
OLLAMA_NO_CLOUD
boolean
default:"false"
Disable cloud model features and connectivity

Examples

Start Server (Default)

Start the server with default settings:
ollama serve

Listen on All Interfaces

Allow remote connections:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
Ollama has no built-in authentication, so anyone who can reach the port can use the API. Expose it only on networks where you trust every user.

Custom Port

Run on a different port:
OLLAMA_HOST=127.0.0.1:8080 ollama serve

Multiple Models

Keep up to 3 models loaded:
OLLAMA_MAX_LOADED_MODELS=3 ollama serve

Debug Mode

Enable verbose logging:
OLLAMA_DEBUG=1 ollama serve

Production Configuration

Example production setup:
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_MODELS=/var/lib/ollama/models
export OLLAMA_KEEP_ALIVE=10m
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_FLASH_ATTENTION=1

ollama serve
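Exported variables only affect a server started from the same shell. On Linux installs where Ollama runs as a systemd service, the variables are usually set in a service override instead. As a sketch, the hypothetical helper systemd_override prints such an override from KEY=VALUE arguments; the service name ollama.service mentioned afterwards is an assumption based on the standard Linux installer.

```shell
# Print a systemd drop-in that sets each KEY=VALUE argument
# in the service's environment.
systemd_override() {
  printf '[Service]\n'
  for pair in "$@"; do
    printf 'Environment="%s"\n' "$pair"
  done
}

systemd_override \
  OLLAMA_HOST=0.0.0.0:11434 \
  OLLAMA_MODELS=/var/lib/ollama/models \
  OLLAMA_KEEP_ALIVE=10m
```

The output can be pasted into the editor opened by sudo systemctl edit ollama.service, followed by sudo systemctl daemon-reload and sudo systemctl restart ollama.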

Server Lifecycle

Starting the Server

The server performs these steps on startup:
  1. Generate a keypair at ~/.ollama/id_ed25519 if one does not already exist
  2. Bind to configured host and port
  3. Load model registry and cache
  4. Begin accepting requests

Graceful Shutdown

To stop the server gracefully:
# Press Ctrl+C in the server's terminal (SIGINT), or send SIGTERM from another shell:
killall ollama
The server will:
  1. Stop accepting new requests
  2. Wait for in-progress requests to complete
  3. Unload all models
  4. Exit cleanly
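In scripts it can be useful to wait until the process has actually exited before continuing. The sketch below sends SIGTERM to a known PID and polls until the process is gone or a timeout passes. The name stop_gracefully and the 30-second default are assumptions, and how long in-progress requests take to drain depends on the workload.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, then polls once per second until the process has exited.
# Returns 0 once the process is gone, 1 if it is still running at the timeout.
stop_gracefully() {
  target_pid=$1
  timeout=${2:-30}
  kill -TERM "$target_pid" 2>/dev/null || return 0   # already gone
  waited=0
  while kill -0 "$target_pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example use:
#   ollama serve &
#   server_pid=$!
#   ...
#   stop_gracefully "$server_pid" 30
```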

Health Check

Check if the server is running:
curl http://localhost:11434/api/version
Expected response (the version number depends on your installation):
{"version":"0.5.0"}
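The same endpoint can serve as a readiness probe when another script starts the server and must wait for it. A minimal sketch, assuming curl is installed; the function name wait_for_ollama and the default of 30 attempts are illustrative choices.

```shell
# wait_for_ollama [BASE_URL] [ATTEMPTS]
# Polls /api/version once per second until it answers successfully.
# Returns 0 when the server responds, 1 if every attempt fails.
wait_for_ollama() {
  base_url=${1:-http://localhost:11434}
  attempts=${2:-30}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl --silent --fail --max-time 2 "$base_url/api/version" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  return 1
}

# Example use:
#   ollama serve &
#   wait_for_ollama || echo "server did not become ready" >&2
```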

Logs

Server logs are written to:
  • Linux: stdout/stderr (capture with journalctl if using systemd)
  • macOS: ~/.ollama/logs/server.log
  • Windows: stdout/stderr

Troubleshooting

Port Already in Use

Error: listen tcp 127.0.0.1:11434: bind: address already in use
Solution: Change the port or stop the existing Ollama instance
OLLAMA_HOST=127.0.0.1:11435 ollama serve

Permission Denied

Error: failed to create directory: permission denied
Solution: Ensure write permissions for the models directory
sudo mkdir -p /var/lib/ollama/models
sudo chown -R "$USER" /var/lib/ollama
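Before starting the server, a quick check can confirm that the models directory exists and is writable by the current user. This is a sketch with a hypothetical function name, check_models_dir; it falls back to OLLAMA_MODELS and then to the default path listed on this page.

```shell
# check_models_dir [DIR]
# Reports whether DIR exists and is writable by the current user.
check_models_dir() {
  dir=${1:-${OLLAMA_MODELS:-$HOME/.ollama/models}}
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "not writable: $dir"
    return 1
  fi
  echo "ok: $dir"
}

# Example use:
#   check_models_dir /var/lib/ollama/models && ollama serve
```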

API Access

Once the server is running, you can access the REST API:
  • Base URL: http://localhost:11434
  • API Documentation: See API Reference