Quantization reduces model precision to decrease memory usage and increase inference speed, with a trade-off in accuracy. This allows running larger models on consumer hardware.

What is Quantization?

Quantization converts high-precision model weights (FP32 or FP16) to lower-precision formats (4-bit, 8-bit). This process:

  • Reduces memory: models need roughly 2-4x less VRAM and storage than FP16 (4-8x less than FP32)
  • Increases speed: faster inference on consumer GPUs
  • Enables larger models: run models that would not otherwise fit in VRAM
  • Trades off quality: slight accuracy loss for a major size reduction
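
To make the precision reduction concrete, here is a minimal sketch of absmax rounding to signed 8-bit integers. It is a simplified illustration, not the block layout Ollama uses, and the function name quantize_absmax is hypothetical.

```shell
#!/bin/sh
# quantize_absmax W1 W2 ...: map floats to signed 8-bit integers that share
# one scale. Illustration only; real formats store a scale per small block.
quantize_absmax() {
  printf '%s\n' "$@" | awk '
    { w[NR] = $1; a = ($1 < 0) ? -$1 : $1; if (a > max) max = a }
    END {
      if (max == 0) max = 1          # avoid dividing by zero for all-zero input
      for (i = 1; i <= NR; i++) {
        x = w[i] * 127 / max         # largest magnitude maps to +/-127
        q = (x < 0) ? -int(-x + 0.5) : int(x + 0.5)   # round half away from zero
        printf "%s%d", (i > 1 ? " " : ""), q
      }
      printf "\n"
    }'
}

quantize_absmax 0.1 -1.0 0.33   # prints: 13 -127 42
```

Each weight now fits in one byte instead of two (FP16) or four (FP32), at the cost of the rounding error visible in the integers.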

Quantizing Models with Ollama

Quantize FP16 or FP32 models during import using the --quantize flag.
1. Prepare an FP16/FP32 model

Create a Modelfile pointing to your high-precision model:
FROM /path/to/my/model/fp16
2. Create with quantization

Use ollama create with the --quantize or -q flag:
ollama create --quantize q4_K_M mymodel
The output shows quantization progress:
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1ab...
creating new layer sha256:0853f0ad24e5...
writing manifest
success
3. Test the quantized model

ollama run mymodel
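
The three steps can be wrapped in one helper. This is a sketch under a few assumptions: the function name import_quantized is hypothetical, the Modelfile is written to a temporary directory and passed with -f, and an Ollama server is reachable when the function is called.

```shell
#!/bin/sh
# import_quantized SRC LEVEL NAME
# Writes a one-line Modelfile for SRC and imports it as NAME at LEVEL.
import_quantized() {
  src=$1 level=$2 name=$3
  # The Modelfile lives in a temp dir, so make a relative source path absolute.
  case $src in
    /*) ;;
    *) src=$(pwd)/$src ;;
  esac
  workdir=$(mktemp -d) || return 1
  printf 'FROM %s\n' "$src" > "$workdir/Modelfile"
  ollama create --quantize "$level" "$name" -f "$workdir/Modelfile"
  status=$?
  rm -rf "$workdir"                  # clean up whether or not the import worked
  return "$status"
}

# Example (not run here):
#   import_quantized /path/to/my/model/fp16 q4_K_M mymodel && ollama run mymodel
```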

Supported Quantization Levels

Ollama supports several quantization formats with different size/quality trade-offs.

4-bit Quantization (K-quants)

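Q4_K_M - Medium (Recommended)

Recommended default
  • Size reduction: ~4x smaller than FP16
  • Quality: Good - roughly 95-97% of FP16 quality
  • Speed: Fast inference
  • Use case: Balanced size and quality
ollama create --quantize q4_K_M mymodel

Q4_K_S - Small

Maximum compression
  • Size reduction: ~4.5x smaller than FP16
  • Quality: Acceptable - some quality loss
  • Speed: Very fast inference
  • Use case: When storage/memory is critical
ollama create --quantize q4_K_S mymodel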

8-bit Quantization

Highest quality quantization
  • Size reduction: ~2x smaller than FP16
  • Quality: Excellent - virtually no quality loss
  • Speed: Moderate speedup
  • Use case: When quality is paramount
ollama create --quantize q8_0 mymodel

Comparison Table

Example: A 7B parameter model in FP16 format (~14GB) after quantization:
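| Quantization    | Approx. Size | Quality | Speed    | Best For                       |
|-----------------|--------------|---------|----------|--------------------------------|
| FP16 (Original) | ~14 GB       | 100%    | Baseline | Maximum accuracy               |
| Q8_0            | ~7 GB        | 99%     | 1.2-1.5x | High quality, moderate savings |
| Q4_K_M          | ~3.5 GB      | 95-97%  | 1.5-2x   | Recommended default            |
| Q4_K_S          | ~3 GB        | 93-95%  | 2-2.5x   | Maximum compression            |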
Actual sizes and quality depend on model architecture. Perplexity increases (quality decreases) by 1-5% for Q4_K_M compared to FP16.
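
The sizes in the example follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below uses nominal bit widths (16, 8, 4); real files are somewhat larger because the formats also store per-block scales. The function name estimate_gb is hypothetical.

```shell
#!/bin/sh
# estimate_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Nominal weight size in GB (10^9 bytes), ignoring metadata and block scales.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 16   # FP16:  prints 14.0
estimate_gb 7 8    # 8-bit: prints 7.0
estimate_gb 7 4    # 4-bit: prints 3.5
```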

Choosing a Quantization Level

Select based on your priorities:
Use Q8_0 when:
  • Accuracy is critical
  • You have sufficient VRAM
  • Minimal quality loss is required
ollama create --quantize q8_0 mymodel
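
As a rough starting point, the choice can be reduced to "the highest precision whose weights fit in VRAM". The sketch below assumes nominal sizes (2, 1, and 0.5 bytes per parameter) and ignores the KV cache and runtime overhead, so treat its answer as an upper bound. The function name pick_level is hypothetical.

```shell
#!/bin/sh
# pick_level PARAMS_BILLIONS VRAM_GB
# Prints the highest-precision option whose nominal weight size fits in VRAM.
pick_level() {
  awk -v p="$1" -v v="$2" 'BEGIN {
    if      (p * 2   <= v) print "fp16"     # 16 bits per parameter
    else if (p       <= v) print "q8_0"     # about 8 bits per parameter
    else if (p * 0.5 <= v) print "q4_K_M"   # about 4 bits per parameter
    else                   print "none"     # weights alone exceed VRAM
  }'
}

pick_level 7 8    # prints q8_0
pick_level 7 4    # prints q4_K_M
```

Leaving a few GB of headroom below the real VRAM figure makes the estimate more realistic.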

Technical Details

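K-quants

The “K” in Q4_K_M and Q4_K_S marks the k-quant formats inherited from llama.cpp. These are block-wise schemes rather than k-means clustering: weights are grouped into super-blocks, and each smaller block inside a super-block carries its own quantized scale and minimum. This provides better quality than naive quantization with a single scale.
  • Q4_K_M: Keeps selected attention and feed-forward tensors at higher precision (Q6_K) and uses Q4_K for the rest
  • Q4_K_S: Uses Q4_K for all tensors, giving more compression at slightly lower quality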

Supported Input Formats

Ollama can quantize models from:

FP16 Models

Half-precision floating-point (most common)

FP32 Models

Full-precision floating-point
Ollama cannot quantize already-quantized models. You must start with FP16 or FP32 weights.
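
Before importing, you can check what a source directory claims to contain. The sketch below is a heuristic, not an Ollama feature: it assumes a Hugging Face style config.json with a torch_dtype field and treats a quantization_config entry as a sign that the weights are already quantized. Both function names are hypothetical.

```shell
#!/bin/sh
# source_dtype DIR: print the torch_dtype declared in DIR/config.json, if any.
source_dtype() {
  sed -n 's/.*"torch_dtype"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    "$1/config.json" 2>/dev/null | head -n 1
}

# is_quantizable DIR: succeed only for unquantized float16 or float32 sources.
is_quantizable() {
  cfg=$1/config.json
  [ -f "$cfg" ] || return 1
  # A quantization_config entry usually marks already-quantized weights.
  if grep -q '"quantization_config"' "$cfg"; then return 1; fi
  case $(source_dtype "$1") in
    float16|float32) return 0 ;;
    *) return 1 ;;               # other or missing dtypes: not covered here
  esac
}

# Example (not run here):
#   is_quantizable ./model && ollama create --quantize q4_K_M mymodel
```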

Examples

Quantize a Hugging Face Model

# Download FP16 model from Hugging Face
huggingface-cli download username/model-name --local-dir ./model

# Create Modelfile
echo "FROM ./model" > Modelfile

# Quantize to Q4_K_M
ollama create --quantize q4_K_M my-quantized-model

# Test
ollama run my-quantized-model

Quantize a Safetensors Model

# Create Modelfile pointing to Safetensors directory
echo "FROM ./safetensors-model" > Modelfile

# Quantize with Q8_0 for high quality
ollama create --quantize q8_0 high-quality-model

Batch Quantize Multiple Levels

# Create different quantization levels for comparison
ollama create --quantize q8_0 mymodel-q8
ollama create --quantize q4_K_M mymodel-q4m
ollama create --quantize q4_K_S mymodel-q4s

# Test and compare
ollama run mymodel-q8 "Explain quantum computing"
ollama run mymodel-q4m "Explain quantum computing"
ollama run mymodel-q4s "Explain quantum computing"
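
The same batch can be written as a loop so that adding a level is a one-word change. This is a sketch: the function name batch_quantize is hypothetical, and it names each model after the lowercased level (for example mymodel-q4_k_m) rather than the shorter names used above.

```shell
#!/bin/sh
# batch_quantize BASE LEVEL...
# Creates one model per level from the Modelfile in the current directory.
batch_quantize() {
  base=$1; shift
  for level in "$@"; do
    # Lowercase the level for the model name, e.g. q4_K_M -> mymodel-q4_k_m.
    name="$base-$(printf '%s' "$level" | tr '[:upper:]' '[:lower:]')"
    echo "creating $name"
    ollama create --quantize "$level" "$name" || return 1   # stop on first failure
  done
}

# Example (not run here):
#   batch_quantize mymodel q8_0 q4_K_M q4_K_S
```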

Using Pre-Quantized Models

Many models on ollama.com are available in multiple quantization levels:
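# FP16 weights (if available)
ollama pull llama3.2:3b-instruct-fp16

# 4-bit quantized (default)
ollama pull llama3.2:3b

# Different quantization levels
ollama pull llama3.2:3b-instruct-q8_0
ollama pull llama3.2:3b-instruct-q4_K_M
Tag names vary by model, so check the model's page on ollama.com for the exact tags.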
When no quantization is specified, most Ollama models default to Q4_K_M.
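
To confirm which level a pulled model actually uses, you can read it from ollama show. The sketch below assumes the output contains a line of the form "quantization  Q4_K_M"; the layout may differ between Ollama versions, and the function name model_quantization is hypothetical.

```shell
#!/bin/sh
# model_quantization MODEL: print the quantization reported by `ollama show`.
model_quantization() {
  # Take the second field of the first line whose first field is "quantization".
  ollama show "$1" | awk '$1 == "quantization" { print $2; exit }'
}

# Example (not run here):
#   model_quantization llama3.2
```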

KV Cache Quantization

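The key-value cache can also be quantized to save memory during inference. Set the variable in the environment of the Ollama server; Flash Attention must also be enabled (OLLAMA_FLASH_ATTENTION=1):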
# Use 8-bit quantization for KV cache
export OLLAMA_KV_CACHE_TYPE=q8_0

# Use 4-bit quantization (more aggressive)
export OLLAMA_KV_CACHE_TYPE=q4_0
See Environment Variables for more details.
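
To see how much memory is at stake, the cache size can be estimated as 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The sketch below uses illustrative dimensions (32 layers, 8 KV heads, head dimension 128) that are assumptions, not values read from any particular model; the function name kv_cache_gib is hypothetical.

```shell
#!/bin/sh
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM CONTEXT BYTES_PER_ELEMENT
# Approximate KV cache size in GiB for one sequence.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" 'BEGIN {
    # Factor 2: one tensor for keys and one for values in every layer.
    printf "%.2f\n", 2 * l * h * d * c * b / (1024 * 1024 * 1024)
  }'
}

kv_cache_gib 32 8 128 8192 2     # f16 cache:            prints 1.00
kv_cache_gib 32 8 128 8192 1     # about 8 bits/element: prints 0.50
kv_cache_gib 32 8 128 8192 0.5   # about 4 bits/element: prints 0.25
```

Quantized cache types also store per-block scales, so the real figures are slightly higher than these nominal ones.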

Performance Tips

Use Q4_K_S or Q4_K_M to fit larger models in limited VRAM:
ollama create --quantize q4_K_S mymodel
Use Q8_0 for minimal quality loss:
ollama create --quantize q8_0 mymodel
Quantization still helps on CPU because smaller weights need less memory bandwidth:
ollama create --quantize q4_K_M mymodel
Smaller quantized models may fit on a single GPU instead of splitting across multiple GPUs, improving performance:
ollama create --quantize q4_K_M mymodel

Measuring Quality Loss

Spot-check quantization quality by running the same prompt at each level:
# Compare outputs from different quantization levels
for model in mymodel-fp16 mymodel-q8 mymodel-q4m; do
  echo "Testing $model:"
  ollama run "$model" "Solve: 2x + 5 = 15"
  echo "---"
done
For most conversational and coding tasks, Q4_K_M gives results that are hard to distinguish from FP16 in practice.
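
Reading outputs by eye does not scale, so a spot check can be turned into a pass/fail test with a known answer. This is a minimal sketch, not a benchmark: the function name check_answer is hypothetical, and a single prompt says little about overall quality.

```shell
#!/bin/sh
# check_answer MODEL PROMPT EXPECTED_REGEX
# Prints "MODEL: pass" if the model's reply matches the extended regex.
check_answer() {
  model=$1 prompt=$2 expected=$3
  if ollama run "$model" "$prompt" | grep -Eq "$expected"; then
    echo "$model: pass"
  else
    echo "$model: fail"
  fi
}

# Example (not run here): 2x + 5 = 15 has the solution x = 5.
#   for m in mymodel-fp16 mymodel-q8 mymodel-q4m; do
#     check_answer "$m" "Solve: 2x + 5 = 15" 'x ?= ?5'
#   done
```

Running a larger set of prompts with known answers at each level gives a more useful picture than any single comparison.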

Troubleshooting

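If quantization fails because the source is already quantized: you cannot quantize an already-quantized model. Download or convert the model to FP16/FP32 first.
If quantization is slow: it is CPU-intensive and scales with model size. Large 70B+ models may take 10-30 minutes.
If output quality is too low after quantization, try a higher-precision level: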
  • Switch from Q4_K_S to Q4_K_M
  • Switch from Q4_K_M to Q8_0
  • Use the original FP16 model if VRAM allows

Importing Models

Import and quantize models from various formats

GPU Configuration

Optimize GPU memory usage

Environment Variables

Configure KV cache quantization and more