Model Quantization

Quantization lowers the numeric precision of a model's weights to cut memory usage and speed up inference, at some cost in accuracy. This makes it possible to run larger models on consumer hardware.

What is Quantization?

Quantization converts high-precision model weights (FP32 or FP16) to lower-precision formats (4-bit, 8-bit). This process:

Reduces Memory

Weights take roughly 2-3.5x less VRAM and storage than FP16 (about 4-7x less than FP32)

Increases Speed

Faster inference on consumer GPUs

Enables Larger Models

Run models that wouldn’t otherwise fit in VRAM

Trades Off Quality

Slight accuracy loss for a major size reduction

Quantizing Models with Ollama

Quantize FP16 or FP32 models during import using the --quantize flag.

Prepare an FP16/FP32 model

Create a Modelfile pointing to your high-precision model:

FROM /path/to/my/model/fp16

Create with quantization

Use ollama create with the --quantize or -q flag:

ollama create --quantize q4_K_M mymodel

The output shows quantization progress:

transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1ab...
creating new layer sha256:0853f0ad24e5...
writing manifest
success

Test the quantized model

ollama run mymodel
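The steps above can be collected into a small script. The sketch below assumes a POSIX shell; the helper name write_modelfile and the scratch directory are hypothetical. It only writes the Modelfile and prints the ollama create command, so you can review the command before running it.

```shell
#!/bin/sh
# Hypothetical helper: write a one-line Modelfile whose FROM points at
# a directory (or file) of FP16/FP32 weights.
write_modelfile() {
  src="$1"
  out="${2:-Modelfile}"
  printf 'FROM %s\n' "$src" > "$out"
}

# Example: generate the Modelfile in a scratch directory and show the
# command that would quantize it. -f names the Modelfile explicitly.
workdir="$(mktemp -d)"
write_modelfile /path/to/my/model/fp16 "$workdir/Modelfile"
cat "$workdir/Modelfile"
echo "ollama create --quantize q4_K_M -f $workdir/Modelfile mymodel"
```

Passing -f is optional when the Modelfile sits in the current directory, as in the steps above.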

Supported Quantization Levels

Ollama supports several quantization formats with different size/quality trade-offs.

4-bit Quantization (K-quants)

Q4_K_M - Medium (Recommended)

Best balance of size and quality

Size reduction: ~3.3x smaller than FP16
Quality: Good - minimal perceptible accuracy loss
Speed: Fast inference
Use case: Default choice for most models

ollama create --quantize q4_K_M mymodel

Q4_K_S - Small

Maximum compression

Size reduction: ~3.5x smaller than FP16
Quality: Acceptable - some quality loss
Speed: Very fast inference
Use case: When storage/memory is critical

ollama create --quantize q4_K_S mymodel

8-bit Quantization

Q8_0 - High Quality

Highest quality quantization

Size reduction: ~1.9x smaller than FP16
Quality: Excellent - virtually no quality loss
Speed: Moderate speedup
Use case: When quality is paramount

ollama create --quantize q8_0 mymodel
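A wrapper script can reject an unrecognized level name before a long import starts. This sketch assumes only the three levels described in this section are accepted; check_level is a hypothetical helper, not part of Ollama.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for the levels described in this section.
check_level() {
  case "$1" in
    q4_K_M|q4_K_S|q8_0) return 0 ;;
    *) echo "unsupported quantization level: $1" >&2; return 1 ;;
  esac
}

# Print the create command only when the level is recognized.
level="q4_K_M"
if check_level "$level"; then
  echo "ollama create --quantize $level mymodel"
fi
```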

Comparison Table

Example: A 7B parameter model in FP16 format (~14GB) after quantization:

Quantization	Approx. Size	Quality	Speed	Best For
FP16 (Original)	~14 GB	100%	Baseline	Maximum accuracy
Q8_0	~7.4 GB	99%	1.2-1.5x	High quality, moderate savings
Q4_K_M	~4.2 GB	95-97%	1.5-2x	Recommended default
Q4_K_S	~4 GB	93-95%	2-2.5x	Maximum compression

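Actual sizes, quality, and speed depend on model architecture and hardware; treat the quality and speed columns as rough guides. Quantized files are slightly larger than a pure 8-bit or 4-bit estimate because each block of weights also stores scale factors. Perplexity increases (quality decreases) by roughly 1-5% for Q4_K_M compared to FP16.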
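The sizes in a comparison like this can be approximated as parameters times bits per weight, divided by 8. The sketch below is a rough estimate only: the function name estimate_gb is hypothetical, and the 4.85 and 4.6 bits-per-weight figures are approximate averages assumed here for the two Q4_K variants.

```shell
#!/bin/sh
# Estimate weight size in GB: parameters (in billions) x bits per weight / 8.
# Ignores KV cache, activations, and file metadata.
estimate_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

# Bits per weight: 16 and 8.5 follow from the formats; 4.85 and 4.6 are
# approximate averages assumed here for the mixed-precision Q4_K variants.
echo "FP16   $(estimate_gb 7 16) GB"
echo "Q8_0   $(estimate_gb 7 8.5) GB"
echo "Q4_K_M $(estimate_gb 7 4.85) GB"
echo "Q4_K_S $(estimate_gb 7 4.6) GB"
```

The same function works for other model sizes, for example estimate_gb 70 4.85 for a 70B model.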

Choosing a Quantization Level

Select based on your priorities:

Use Q8_0 when:

Accuracy is critical
You have sufficient VRAM
Minimal quality loss is required

ollama create --quantize q8_0 mymodel

Use Q4_K_M when:

You want the best balance
Storage/VRAM is limited
Slight quality loss is acceptable

ollama create --quantize q4_K_M mymodel

This is the recommended default for most use cases.

Use Q4_K_S when:

VRAM/storage is severely limited
You are running on low-memory consumer hardware
You need to fit the largest possible model

ollama create --quantize q4_K_S mymodel
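The choice can also be framed as "the highest-precision level whose weights fit in the memory I have". The sketch below is one way to express that; pick_level is a hypothetical helper, the bits-per-weight values are approximations assumed for illustration, and the budget should leave headroom for the KV cache and runtime overhead.

```shell
#!/bin/sh
# Hypothetical helper: print the highest-precision level whose weights fit.
# Args: parameters in billions, memory budget in GB for weights only.
# The bits-per-weight figures are approximations assumed for this sketch.
pick_level() {
  awk -v params="$1" -v budget="$2" 'BEGIN {
    n = split("q8_0 q4_K_M q4_K_S", level, " ")
    split("8.5 4.85 4.6", bits, " ")
    for (i = 1; i <= n; i++) {
      if (params * bits[i] / 8 <= budget) { print level[i]; exit 0 }
    }
    print "none"; exit 1
  }'
}

pick_level 7 8    # 7B model, 8 GB available for weights
pick_level 7 6    # 7B model, 6 GB available for weights
```

When nothing fits, the helper prints none and returns a non-zero status, which a calling script can use to fall back to a smaller model.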

Technical Details

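K-quants

The “K” in Q4_K_M and Q4_K_S refers to the k-quant family of formats from llama.cpp. K-quants group weights into blocks, store a scale and minimum for each block, and pack those per-block values compactly inside larger super-blocks. This gives better quality than naive round-to-nearest quantization at a similar bit width. Q8_0 is a simpler 8-bit block format and is not a k-quant.

Q4_K_M: Medium variant; keeps some of the more sensitive tensors at higher precision than 4-bit
Q4_K_S: Small variant; keeps nearly all tensors at 4-bit for more compression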
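Block formats store scale data next to each group of weights, so their effective bits per weight sit slightly above the nominal 4 or 8. The sketch below works this out from block sizes; the byte counts are the layouts I understand llama.cpp's GGML formats to use and should be treated as assumptions to check against its source.

```shell
#!/bin/sh
# Bits per weight = bytes per block * 8 / weights per block.
bits_per_weight() {
  awk -v bytes="$1" -v weights="$2" 'BEGIN { printf "%.2f\n", bytes * 8 / weights }'
}

bits_per_weight 34 32     # Q8_0: 32 int8 values + one 16-bit scale
bits_per_weight 144 256   # Q4_K: 4-bit values + packed scales and minimums
bits_per_weight 210 256   # Q6_K: used for some tensors in mixed variants
```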

Supported Input Formats

Ollama can quantize models from:

FP16 Models

Half-precision floating-point (most common)

FP32 Models

Full-precision floating-point

Ollama cannot quantize an already-quantized model. You must start with FP16 or FP32 weights.

Examples

Quantize a Hugging Face Model

# Download FP16 model from Hugging Face
huggingface-cli download username/model-name --local-dir ./model

# Create Modelfile
echo "FROM ./model" > Modelfile

# Quantize to Q4_K_M
ollama create --quantize q4_K_M my-quantized-model

# Test
ollama run my-quantized-model

Quantize a Safetensors Model

# Create Modelfile pointing to Safetensors directory
echo "FROM ./safetensors-model" > Modelfile

# Quantize with Q8_0 for high quality
ollama create --quantize q8_0 high-quality-model

Batch Quantize Multiple Levels

# Create different quantization levels for comparison
ollama create --quantize q8_0 mymodel-q8
ollama create --quantize q4_K_M mymodel-q4m
ollama create --quantize q4_K_S mymodel-q4s

# Test and compare
ollama run mymodel-q8 "Explain quantum computing"
ollama run mymodel-q4m "Explain quantum computing"
ollama run mymodel-q4s "Explain quantum computing"
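For more than a few levels, a loop avoids repeating the command. This sketch prints the commands rather than running them; batch_commands is a hypothetical helper, and the lowercased name suffixes are just a naming convention chosen here.

```shell
#!/bin/sh
# Print one create command per level; pipe the output to sh to run them.
# Names are lowercased here purely as a naming convention for this sketch.
batch_commands() {
  base="$1"
  for level in q8_0 q4_K_M q4_K_S; do
    suffix="$(printf '%s' "$level" | tr '[:upper:]' '[:lower:]')"
    echo "ollama create --quantize $level $base-$suffix"
  done
}

batch_commands mymodel
```

To run the commands, use batch_commands mymodel | sh from the directory that contains the Modelfile.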

Using Pre-Quantized Models

Many models on ollama.com are available in multiple quantization levels:

# Tag names vary by model; check the model's tag list on ollama.com

# Full precision (if available)
ollama pull llama3.2:3b-instruct-fp16

# 4-bit quantized (default)
ollama pull llama3.2:3b

# Different quantization levels
ollama pull llama3.2:3b-instruct-q8_0
ollama pull llama3.2:3b-instruct-q4_K_M

When no quantization is specified, most Ollama models default to Q4_K_M.

KV Cache Quantization

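The key-value (KV) cache can also be quantized to save memory during inference. Set these variables in the environment of the Ollama server; quantized cache types require Flash Attention to be enabled:

# Quantized KV cache types require Flash Attention
export OLLAMA_FLASH_ATTENTION=1

# Use 8-bit quantization for KV cache
export OLLAMA_KV_CACHE_TYPE=q8_0

# Or use 4-bit quantization (more aggressive)
export OLLAMA_KV_CACHE_TYPE=q4_0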

See Environment Variables for more details.
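To gauge how much memory a quantized KV cache saves, the cache size can be estimated as 2 (keys and values) x layers x KV heads x head dimension x context length x bits per element / 8. The sketch below assumes a hypothetical model shape and assumes 16, 8.5, and 4.5 bits per element for f16, q8_0, and q4_0; kv_cache_mib is an illustrative helper, not an Ollama command.

```shell
#!/bin/sh
# KV cache size in MiB:
#   2 (K and V) x layers x KV heads x head dim x context tokens x bits / 8
kv_cache_mib() {
  awk -v layers="$1" -v heads="$2" -v dim="$3" -v ctx="$4" -v bits="$5" 'BEGIN {
    printf "%.0f\n", 2 * layers * heads * dim * ctx * bits / 8 / 1048576
  }'
}

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 8192-token context.
echo "f16:  $(kv_cache_mib 32 8 128 8192 16) MiB"
echo "q8_0: $(kv_cache_mib 32 8 128 8192 8.5) MiB"
echo "q4_0: $(kv_cache_mib 32 8 128 8192 4.5) MiB"
```

Under these assumptions, q8_0 roughly halves the cache and q4_0 brings it to a bit over a quarter of the f16 size.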

Performance Tips

VRAM is limited

Use Q4_K_S or Q4_K_M to fit larger models in limited VRAM:

ollama create --quantize q4_K_S mymodel

Quality is critical

Use Q8_0 for minimal quality loss:

ollama create --quantize q8_0 mymodel

Running on CPU

Quantization still helps on CPU because smaller weights mean less data is read from memory per token:

ollama create --quantize q4_K_M mymodel

Multiple GPUs

Smaller quantized models may fit on a single GPU instead of splitting across multiple GPUs, improving performance:

ollama create --quantize q4_K_M mymodel

Measuring Quality Loss

Spot-check quantization quality by running the same prompt against each level and comparing the answers:

# Compare outputs from different quantization levels
for model in mymodel-fp16 mymodel-q8 mymodel-q4m; do
  echo "Testing $model:"
  ollama run "$model" "Solve: 2x + 5 = 15"
  echo "---"
done

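For most conversational and coding tasks, Q4_K_M output is usually hard to tell apart from FP16. A single prompt is only a spot check; use a benchmark or a perplexity comparison on your own data for a firmer measurement.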
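If you have perplexity figures for the original and quantized models from a separate evaluation tool (an assumption; this page does not describe one), the relative quality loss can be computed as below. The helper name ppl_increase and the input values are placeholders, not measurements.

```shell
#!/bin/sh
# Relative perplexity increase in percent: (quantized - baseline) / baseline * 100.
ppl_increase() {
  awk -v base="$1" -v quant="$2" 'BEGIN { printf "%.1f\n", (quant - base) / base * 100 }'
}

# Placeholder perplexities, not measurements of any real model.
ppl_increase 5.00 5.15
```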

Troubleshooting

Error: Model is already quantized

You cannot quantize an already-quantized model. Download or convert the model to FP16/FP32 first.

Quantization takes a long time

Quantization is CPU-intensive and scales with model size. Large 70B+ models may take 10-30 minutes.

Quality loss is too high

Try a higher-precision quantization level:

Switch from Q4_K_S to Q4_K_M
Switch from Q4_K_M to Q8_0
Use the original FP16 model if VRAM allows
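The escalation path above can be written as a simple lookup for use in scripts. This is a sketch only; next_level is a hypothetical helper, and fp16 here stands for using the original weights without quantization.

```shell
#!/bin/sh
# Hypothetical helper: given the current level, print the next step up.
next_level() {
  case "$1" in
    q4_K_S) echo "q4_K_M" ;;
    q4_K_M) echo "q8_0" ;;
    q8_0)   echo "fp16" ;;   # no quantization: use the original weights
    *)      echo "unknown level: $1" >&2; return 1 ;;
  esac
}

next_level q4_K_S
```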

Related

Importing Models

Import and quantize models from various formats

GPU Configuration

Optimize GPU memory usage

Environment Variables

Configure KV cache quantization and more