ollama serve - Ollama

Synopsis

ollama serve

Description

The serve command starts the Ollama server, which handles model loading, inference requests, and API endpoints. The server must be running for other Ollama commands to work. The server provides:

RESTful API for model inference
Model management and caching
Automatic GPU/CPU resource allocation
Streaming responses over HTTP (newline-delimited JSON)

Arguments

None. The serve command takes no positional arguments.

Options

The serve command is configured entirely through environment variables.

Environment Variables

Server Configuration

OLLAMA_HOST

string

default:"127.0.0.1:11434"

The host and port the server listens on

OLLAMA_HOST=0.0.0.0:11434 ollama serve  # Listen on all interfaces
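Scripts that talk to the server need the same address the server was told to listen on. The sketch below derives a base URL from OLLAMA_HOST and falls back to the documented default when the variable is unset. The helper name ollama_base_url is hypothetical, and passing through a value that already carries a scheme is an assumption, not documented behavior.

```shell
# Print the base URL implied by OLLAMA_HOST (hypothetical helper).
ollama_base_url() {
  # Fall back to the documented default when the variable is unset or empty.
  obu_addr="${OLLAMA_HOST:-127.0.0.1:11434}"
  case "$obu_addr" in
    http://*|https://*) printf '%s\n' "$obu_addr" ;;        # assumption: scheme already present
    *)                  printf 'http://%s\n' "$obu_addr" ;; # plain host:port
  esac
}
```

A script could then call, for example, curl "$(ollama_base_url)/api/version".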

OLLAMA_ORIGINS

string

Comma-separated list of additional origins allowed to make cross-origin (CORS) API requests. Requests from 127.0.0.1 and 0.0.0.0 are allowed by default.

OLLAMA_ORIGINS="http://localhost:3000,https://example.com" ollama serve

OLLAMA_DEBUG

boolean

default:"false"

Enable debug logging

OLLAMA_DEBUG=1 ollama serve

Model Management

OLLAMA_MODELS

string

default:"~/.ollama/models"

Directory where models are stored

OLLAMA_MODELS=/mnt/models ollama serve

OLLAMA_KEEP_ALIVE

duration

default:"5m"

How long to keep models loaded in memory after use

0 - Unload immediately
-1 - Keep loaded indefinitely
Duration string: 5m, 1h, 30s
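To make the three forms above concrete, here is a small sketch that reports how a given OLLAMA_KEEP_ALIVE value would be read. The helper name describe_keep_alive is hypothetical, and the check is deliberately loose: it covers only the forms listed above, so a value it does not recognize is not necessarily one the server rejects.

```shell
# Describe an OLLAMA_KEEP_ALIVE value using the three forms listed above
# (hypothetical helper; the server's own parsing may accept more).
describe_keep_alive() {
  case "$1" in
    0)           echo "unload immediately" ;;
    -1)          echo "keep loaded indefinitely" ;;
    *[0-9][smh]) echo "unload after $1 idle" ;;   # ends in digit + s/m/h, e.g. 5m
    *)           echo "not one of the listed forms" ;;
  esac
}
```

For example, describe_keep_alive "$OLLAMA_KEEP_ALIVE" can be run in a startup script before launching ollama serve.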

OLLAMA_MAX_LOADED_MODELS

integer

default:"3 per GPU (3 for CPU inference)"

Maximum number of models to keep loaded simultaneously, provided they fit in available memory

OLLAMA_MAX_QUEUE

integer

default:"512"

Maximum number of requests to queue when all model slots are full

OLLAMA_NOPRUNE

boolean

default:"false"

Disable automatic pruning of unused model blobs at startup

Model Context

OLLAMA_CONTEXT_LENGTH

integer

default:"2048"

Default context window size for models (in tokens)

OLLAMA_CONTEXT_LENGTH=4096 ollama serve

OLLAMA_NUM_PARALLEL

integer

default:"1"

Maximum number of parallel requests per model

Performance Tuning

OLLAMA_SCHED_SPREAD

boolean

default:"false"

Always spread a model across all available GPUs, even when it would fit on fewer

OLLAMA_FLASH_ATTENTION

boolean

default:"false"

Enable Flash Attention optimization (if supported by hardware)

OLLAMA_KV_CACHE_TYPE

string

Quantization type for the K/V cache. Options: f16 (default), q8_0, q4_0. The quantized types (q8_0, q4_0) take effect only when Flash Attention is enabled.

OLLAMA_GPU_OVERHEAD

integer

default:"0"

GPU memory to reserve on each GPU, in bytes, as overhead when calculating model capacity

OLLAMA_LOAD_TIMEOUT

duration

default:"5m"

Maximum time to wait for a model to load

Advanced Options

OLLAMA_LLM_LIBRARY

string

Override LLM runtime library (advanced use only)

OLLAMA_NO_CLOUD

boolean

default:"false"

Disable cloud model features and connectivity

Examples

Start Server (Default)

Start the server with default settings:

ollama serve

Listen on All Interfaces

Allow remote connections:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Only expose Ollama to your network if you trust all users. There is no built-in authentication.

Custom Port

Run on a different port:

OLLAMA_HOST=127.0.0.1:8080 ollama serve

Multiple Models

Keep up to 3 models loaded:

OLLAMA_MAX_LOADED_MODELS=3 ollama serve

Debug Mode

Enable verbose logging:

OLLAMA_DEBUG=1 ollama serve

Production Configuration

Example production setup:

export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_MODELS=/var/lib/ollama/models
export OLLAMA_KEEP_ALIVE=10m
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_FLASH_ATTENTION=1

ollama serve
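Variables exported in an interactive shell are not seen by a server that runs as a system service. The sketch below renders settings as a systemd drop-in instead. It assumes a unit named ollama.service and the usual drop-in location /etc/systemd/system/ollama.service.d/override.conf; the helper name render_dropin is hypothetical.

```shell
# Render VAR=value pairs as a systemd drop-in (hypothetical helper).
# Assumption: the service unit is called ollama.service.
render_dropin() {
  echo "[Service]"
  for rd_pair in "$@"; do
    # One Environment= line per setting, quoted so values with spaces survive.
    printf 'Environment="%s"\n' "$rd_pair"
  done
}

# Print a drop-in equivalent to the exports above.
render_dropin \
  OLLAMA_HOST=0.0.0.0:11434 \
  OLLAMA_MODELS=/var/lib/ollama/models \
  OLLAMA_KEEP_ALIVE=10m \
  OLLAMA_MAX_LOADED_MODELS=2 \
  OLLAMA_NUM_PARALLEL=4 \
  OLLAMA_FLASH_ATTENTION=1
```

After saving the output as the drop-in file, systemctl daemon-reload followed by systemctl restart ollama would apply it, again assuming that unit name.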

Server Lifecycle

Starting the Server

The server performs these steps on startup:

Initialize a keypair at ~/.ollama/id_ed25519 (if one does not already exist)
Bind to configured host and port
Load model registry and cache
Begin accepting requests

Graceful Shutdown

To stop the server gracefully:

# Press Ctrl+C in the server's terminal (SIGINT), or send SIGTERM from another shell:
killall ollama

The server will:

Stop accepting new requests
Wait for in-progress requests to complete
Unload all models
Exit cleanly
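killall signals every process named ollama and returns immediately. When a script needs to stop one specific server process and wait for the shutdown steps above to finish, something like the following sketch can help. The helper name stop_gracefully and the 30-second default are assumptions for illustration.

```shell
# Send SIGTERM to a PID and wait for it to exit (hypothetical helper).
# $1: process ID, $2: seconds to wait before giving up (default 30).
# Returns 0 once the process is gone, 1 if it is still alive after the wait.
stop_gracefully() {
  sg_pid="$1"
  sg_timeout="${2:-30}"
  sg_waited=0
  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$sg_pid" 2>/dev/null || return 0
  # kill -0 delivers no signal; it only tests whether the PID still exists.
  while kill -0 "$sg_pid" 2>/dev/null; do
    [ "$sg_waited" -ge "$sg_timeout" ] && return 1
    sleep 1
    sg_waited=$((sg_waited + 1))
  done
  return 0
}
```

For a server started in the background with ollama serve &, the PID is available as $! right after launch.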

Health Check

Check if the server is running:

curl http://localhost:11434/api/version

Expected response (the version string reflects your installed release):

{"version":"0.5.0"}
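In scripts it is often useful to wait until the health check succeeds before sending requests. The sketch below polls the endpoint shown above and extracts the version string from its response. The helper names, the one-second interval, and the default of 30 attempts are assumptions; curl is assumed to be installed.

```shell
# Pull the version string out of the /api/version JSON body read from stdin.
# Uses sed rather than a JSON parser to avoid extra dependencies.
extract_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll the health endpoint until it answers or the attempts run out.
# $1: base URL (default http://localhost:11434), $2: attempts (default 30).
wait_for_server() {
  wf_url="${1:-http://localhost:11434}"
  wf_left="${2:-30}"
  while [ "$wf_left" -gt 0 ]; do
    # -f makes curl fail on HTTP errors; -sS hides progress but keeps errors.
    if curl -fsS "$wf_url/api/version" >/dev/null 2>&1; then
      return 0
    fi
    wf_left=$((wf_left - 1))
    sleep 1
  done
  return 1
}
```

A typical use is wait_for_server && curl -fsS http://localhost:11434/api/version | extract_version.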

Logs

Server logs are written to:

Linux: stdout/stderr (capture with journalctl if using systemd)
macOS: ~/.ollama/logs/server.log
Windows: stdout/stderr
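As a convenience, the locations above can be wrapped in a helper that prints a suitable command for the current platform. This is a sketch: the helper name log_command is hypothetical, and the Linux branch assumes the server runs as a systemd unit named ollama.

```shell
# Print a command for viewing server logs on the given OS (hypothetical helper).
# $1: OS name as reported by `uname -s`; defaults to the current system.
log_command() {
  case "${1:-$(uname -s)}" in
    Linux)  echo 'journalctl -u ollama --no-pager' ;;  # assumption: systemd unit "ollama"
    Darwin) echo 'cat ~/.ollama/logs/server.log' ;;
    *)      echo 'logs go to stdout/stderr of the ollama serve process' ;;
  esac
}
```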

Troubleshooting

Port Already in Use

Error: listen tcp 127.0.0.1:11434: bind: address already in use

Solution: Change the port or stop the existing Ollama instance

OLLAMA_HOST=127.0.0.1:11435 ollama serve
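Rather than guessing at a free port, a script can probe for one. The following sketch relies on bash's /dev/tcp redirection, so it requires bash and will not work in a plain POSIX sh. The helper names are hypothetical, and the probe only detects listeners reachable at the host you pass in.

```shell
# Return success if something accepts TCP connections on host:port.
# Requires bash: /dev/tcp is a bash redirection feature, not a real device.
port_in_use() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Print the first port at or above $2 on host $1 that nothing is listening on.
first_free_port() {
  ffp_port="$2"
  while port_in_use "$1" "$ffp_port"; do
    ffp_port=$((ffp_port + 1))
  done
  echo "$ffp_port"
}
```

For example: OLLAMA_HOST="127.0.0.1:$(first_free_port 127.0.0.1 11434)" ollama serve.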

Permission Denied

Error: failed to create directory: permission denied

Solution: Ensure write permissions for the models directory

sudo mkdir -p /var/lib/ollama/models
sudo chown -R "$USER" /var/lib/ollama
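To catch this error before the server starts, a startup script can test the directory first. A minimal sketch, assuming a hypothetical helper name models_dir_writable; it checks OLLAMA_MODELS if set and otherwise the documented default location.

```shell
# Succeed only if the models directory exists and is writable by the current user.
# $1: directory to check; defaults to OLLAMA_MODELS, then to ~/.ollama/models.
models_dir_writable() {
  mdw_dir="${1:-${OLLAMA_MODELS:-$HOME/.ollama/models}}"
  [ -d "$mdw_dir" ] && [ -w "$mdw_dir" ]
}
```

For example: models_dir_writable || echo "models directory is missing or not writable" >&2. Note that the check always passes for root on an existing directory.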

API Access

Once the server is running, you can access the REST API:

Base URL: http://localhost:11434
API Documentation: See API Reference

Related Commands

ollama run - Run a model (requires server)
ollama ps - List running models
ollama stop - Stop a model to free memory
