What is Quantization?
Quantization converts high-precision model weights (FP32 or FP16) to lower-precision formats (4-bit, 8-bit). This process:
Reduces Memory
Models use roughly 2-4x less VRAM and storage than FP16 (4-8x less than FP32)
Increases Speed
Faster inference on consumer GPUs
Enables Larger Models
Run models that wouldn't otherwise fit in VRAM
Trades Off Quality
Slight accuracy loss for major size reduction
Quantizing Models with Ollama
Quantize FP16 or FP32 models during import using the --quantize flag.
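As a minimal sketch of that import step, the Python snippet below only assembles the command line; it does not run it. The model name `mymodel-q4` and the `Modelfile` path are placeholders chosen for illustration, and the sketch assumes the Modelfile's FROM line already points at an FP16 or FP32 model.

```python
import shlex

def build_create_command(model_name, level="q4_K_M", modelfile="Modelfile"):
    """Return the argv list for `ollama create` with import-time quantization."""
    # --quantize (short form: -q) selects the target level,
    # -f points at the Modelfile that names the unquantized source model.
    return ["ollama", "create", model_name, "--quantize", level, "-f", modelfile]

# "mymodel-q4" is a placeholder name, not a published model.
cmd = build_create_command("mymodel-q4")
print(shlex.join(cmd))
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)` on a machine where Ollama is installed.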
Create with quantization
Use ollama create with the --quantize or -q flag. The command output shows quantization progress.
Supported Quantization Levels
Ollama supports several quantization formats with different size/quality trade-offs.
4-bit Quantization (K-means)
Q4_K_M - Medium (Recommended)
Best balance of size and quality
- Size reduction: ~4x smaller than FP16
- Quality: Good - minimal perceptible accuracy loss
- Speed: Fast inference
- Use case: Default choice for most models
Q4_K_S - Small
Maximum compression
- Size reduction: ~4.5x smaller than FP16
- Quality: Acceptable - some quality loss
- Speed: Very fast inference
- Use case: When storage/memory is critical
8-bit Quantization
Q8_0 - High Quality
Highest quality quantization
- Size reduction: ~2x smaller than FP16
- Quality: Excellent - virtually no quality loss
- Speed: Moderate speedup
- Use case: When quality is paramount
Comparison Table
Example: A 7B parameter model in FP16 format (~14GB) after quantization:
| Quantization | Approx. Size | Quality | Speed | Best For |
|---|---|---|---|---|
| FP16 (Original) | ~14 GB | 100% | Baseline | Maximum accuracy |
| Q8_0 | ~7 GB | 99% | 1.2-1.5x | High quality, moderate savings |
| Q4_K_M | ~3.5 GB | 95-97% | 1.5-2x | Recommended default |
| Q4_K_S | ~3 GB | 93-95% | 2-2.5x | Maximum compression |
Actual sizes and quality depend on model architecture. Perplexity increases (quality decreases) by 1-5% for Q4_K_M compared to FP16.
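The sizes in the table follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces that calculation using nominal bit widths (16, 8, 4), which is an assumption for illustration. Real quantized files are usually somewhat larger because each block stores its own scale and some tensors may stay at higher precision.

```python
def approx_size_gb(num_params, bits_per_weight):
    """Nominal weight size in GB (1 GB = 1e9 bytes), ignoring metadata and block scales."""
    return num_params * bits_per_weight / 8 / 1e9

# 7B parameters at nominal bit widths; the labels are illustrative.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{approx_size_gb(7e9, bits):.1f} GB")
```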
Choosing a Quantization Level
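Select based on your priorities:
Quality Priority
Use Q8_0 when:
- Accuracy is critical
- You have sufficient VRAM
- Minimal quality loss is required
Balanced
Use Q4_K_M as the default choice for most models; it offers the best balance of size and quality.
Size Priority
Use Q4_K_S when storage or memory is critical and some quality loss is acceptable.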
Technical Details
K-means Quantization
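The "K" in Q4_K_M and Q4_K_S refers to K-means quantization, which groups weights into small blocks and quantizes each block together with its own scale. This provides better quality than naive quantization that applies a single scale to a whole tensor. Q8_0 is a plain 8-bit format and is not part of the K family.
- Q4_K_M: Medium variant; keeps some of the most sensitive tensors at higher precision, so it is slightly larger and more accurate
- Q4_K_S: Small variant; uses 4-bit for all tensors for more compression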
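To illustrate why per-block scales help, here is a deliberately simplified sketch. It is not the actual Q4_K algorithm: it uses plain symmetric rounding with one scale per block, and the weights are made-up values containing a single outlier. With one scale for the whole tensor, the outlier forces a coarse step size on every weight; with blocks of 32, only the outlier's own block pays that cost.

```python
def quantize_block(block, bits=4):
    """Quantize one block of floats to signed integer codes sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit codes
    peak = max(abs(w) for w in block)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Map integer codes back to approximate float weights."""
    return [scale * c for c in codes]

def max_error(weights, block_size, bits=4):
    """Largest absolute round-trip error when quantizing in blocks of block_size."""
    worst = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale, codes = quantize_block(block, bits)
        restored = dequantize_block(scale, codes)
        worst = max(worst, max(abs(w - r) for w, r in zip(block, restored)))
    return worst

# 32 small weights, then one large outlier followed by zeros (hypothetical data).
weights = [0.01 * i for i in range(-16, 16)] + [5.0] + [0.0] * 31
print("blocks of 32:", max_error(weights, block_size=32))
print("one block   :", max_error(weights, block_size=64))
```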
Supported Input Formats
Ollama can quantize models from:
FP16 Models
Half-precision floating-point (most common)
FP32 Models
Full-precision floating-point
Examples
Quantize a Hugging Face Model
Quantize a Safetensors Model
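A minimal sketch of the Modelfile for this case, assuming the Safetensors weights sit in a local directory. The path `/path/to/safetensors/directory` is a placeholder; the helper only builds the file contents so you can inspect them before writing.

```python
def modelfile_text(weights_dir):
    """Return a one-line Modelfile whose FROM directive points at the weights directory."""
    return f"FROM {weights_dir}\n"

# Placeholder path: replace with the directory that holds the .safetensors files.
text = modelfile_text("/path/to/safetensors/directory")
print(text, end="")
```

Save the text as `Modelfile`, then run `ollama create` with `--quantize` and the level you want against that file.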
Batch Quantize Multiple Levels
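One way to produce several levels from the same source is to loop over them and create one model per level. The sketch below only generates the commands; the base name `mymodel` and the naming scheme (level appended after a hyphen) are assumptions made for illustration.

```python
import shlex

LEVELS = ["q4_K_S", "q4_K_M", "q8_0"]

def batch_commands(base_name, modelfile="Modelfile", levels=LEVELS):
    """Yield (model_name, argv) pairs, one `ollama create` call per level."""
    for level in levels:
        name = f"{base_name}-{level.lower()}"   # e.g. mymodel-q4_k_m
        yield name, ["ollama", "create", name, "--quantize", level, "-f", modelfile]

# "mymodel" is a placeholder base name.
for name, argv in batch_commands("mymodel"):
    print(shlex.join(argv))
```

Each command reads the same unquantized source, so the levels can be created in any order.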
Using Pre-Quantized Models
Many models on ollama.com are available in multiple quantization levels.
KV Cache Quantization
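The cache stores one key and one value vector per layer for every token in the context, so its size grows linearly with context length. The sketch below estimates that size. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, and the bytes-per-element figures are nominal values that ignore per-block scale overhead.

```python
# Nominal bytes per cached element; real quantized formats add a small per-block overhead.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def kv_cache_bytes(layers, kv_heads, head_dim, context_len, cache_type="f16"):
    """Approximate KV cache size in bytes for one sequence."""
    # Factor 2: both keys and values are kept for every layer and token.
    elements = 2 * layers * kv_heads * head_dim * context_len
    return elements * BYTES_PER_ELEMENT[cache_type]

# Hypothetical model shape with an 8192-token context.
for cache_type in ("f16", "q8_0", "q4_0"):
    size_gib = kv_cache_bytes(32, 8, 128, 8192, cache_type) / 2**30
    print(f"{cache_type}: ~{size_gib:.2f} GiB")
```

In Ollama the cache type is selected with the OLLAMA_KV_CACHE_TYPE environment variable (for example `q8_0`), and flash attention generally needs to be enabled for it to take effect. Check the Environment Variables page for the values your version supports.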
The key-value cache can also be quantized to save memory during inference.
Performance Tips
VRAM is limited
Use Q4_K_S or Q4_K_M to fit larger models in limited VRAM.
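A rough way to decide is to walk the levels from highest quality to smallest size and take the first one that fits. The sketch below uses the approximate size reductions listed for each level earlier in this guide; the 1.5 GB headroom for the KV cache and runtime buffers is an assumed figure that you should adjust for your context length.

```python
# Approximate size reduction relative to FP16, ordered from highest quality to smallest size.
REDUCTION_VS_FP16 = [("FP16", 1.0), ("Q8_0", 2.0), ("Q4_K_M", 4.0), ("Q4_K_S", 4.5)]

def pick_level(fp16_size_gb, vram_gb, headroom_gb=1.5):
    """Return the highest-quality level whose estimated size plus headroom fits in VRAM."""
    for level, factor in REDUCTION_VS_FP16:
        if fp16_size_gb / factor + headroom_gb <= vram_gb:
            return level
    return None  # nothing fits; consider a smaller model or CPU offload

# A 7B model (~14 GB in FP16) on cards of different sizes.
for vram in (24, 10, 8, 3):
    print(f"{vram} GB VRAM -> {pick_level(14, vram)}")
```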
Quality is critical
Use Q8_0 for minimal quality loss.
Running on CPU
Quantization still helps on CPU because smaller weights reduce the memory bandwidth needed per generated token.
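A common rule of thumb makes this concrete: generating one token reads roughly all of the weights from memory once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below applies that bound. The 50 GB/s bandwidth is a hypothetical figure, and the estimate ignores compute time, caches, and KV cache reads, so real throughput will be lower.

```python
def max_tokens_per_second(model_size_gb, bandwidth_gb_s):
    """Upper bound on decode speed if each token streams the full weights once."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical system with 50 GB/s of memory bandwidth.
for label, size_gb in [("FP16", 14.0), ("Q8_0", 7.0), ("Q4_K_M", 3.5)]:
    print(f"{label}: at most ~{max_tokens_per_second(size_gb, 50.0):.1f} tokens/s")
```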
Multiple GPUs
Smaller quantized models may fit on a single GPU instead of being split across multiple GPUs, which avoids inter-GPU transfers and can improve performance.
Measuring Quality Loss
Evaluate quantization quality with benchmarks. For most conversational and coding tasks, Q4_K_M provides excellent results that are typically hard to distinguish from FP16.
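Perplexity is one common benchmark for this comparison. As a sketch, assuming your evaluation harness gives you per-token natural-log probabilities for the same text under both models, you can compute perplexity and the relative increase like this. The log-probabilities below are made-up sample values.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_increase(ppl_quantized, ppl_baseline):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (ppl_quantized - ppl_baseline) / ppl_baseline

# Made-up log-probabilities for the same four tokens under each model.
baseline_logprobs = [-1.20, -0.80, -2.10, -1.50]
quantized_logprobs = [-1.25, -0.82, -2.15, -1.55]

ppl_base = perplexity(baseline_logprobs)
ppl_quant = perplexity(quantized_logprobs)
print(f"baseline {ppl_base:.3f}, quantized {ppl_quant:.3f}, "
      f"increase {relative_increase(ppl_quant, ppl_base):.1%}")
```

Lower perplexity is better, so a small positive increase corresponds to a small quality loss.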
Troubleshooting
Error: Model is already quantized
You cannot quantize an already-quantized model. Download or convert the model to FP16/FP32 first.
Quantization takes a long time
Quantization is CPU-intensive and scales with model size. Large 70B+ models may take 10-30 minutes.
Quality loss is too high
Try a higher-precision quantization level:
- Switch from Q4_K_S to Q4_K_M
- Switch from Q4_K_M to Q8_0
- Use the original FP16 model if VRAM allows
Related
Importing Models
Import and quantize models from various formats
GPU Configuration
Optimize GPU memory usage
Environment Variables
Configure KV cache quantization and more