Overview
Ollama’s API supports streaming responses for generation endpoints (/api/generate and /api/chat). Streaming allows you to receive model output progressively as it’s generated, rather than waiting for the complete response.
By default, streaming is enabled for all generation endpoints. You can disable it by setting "stream": false in your request.
How Streaming Works
Ollama uses newline-delimited JSON (NDJSON) to stream responses. Each line in the response is a complete JSON object representing a chunk of the model’s output.Content Type
When streaming is enabled, responses use the content type:Response Format
Each streaming chunk is a JSON object sent as a separate line. The client reads these chunks sequentially until thedone field is true.
Streaming with Generate
Request
Streaming Response
Each chunk contains partial output:Final Chunk
The last chunk has"done": true and includes usage metrics:
Indicates if this is the final chunk in the stream
Reason generation stopped. See Done Reasons below
The generated text chunk. Empty in the final response when streaming
Total time in nanoseconds for the entire request
Time in nanoseconds spent loading the model
Number of tokens in the prompt
Time in nanoseconds evaluating the prompt
Number of tokens generated
Time in nanoseconds generating the response
Streaming with Chat
Request
Streaming Response
Final Chunk
Done Reasons
Thedone_reason field indicates why generation stopped:
The
done_reason field is only present in the final chunk when done is true.| Reason | Description |
|---|---|
stop | Model finished naturally (hit EOS token or stop sequence) |
length | Reached maximum token limit (num_predict or context length) |
load | Model was only loaded, not run (empty prompt) |
unload | Model was unloaded from memory (keep_alive: 0) |
Examples
Disabling Streaming
To receive the entire response in a single JSON object, set"stream": false:
Request
Response
Single JSON object with complete response:Implementation Details
From the source code (api/client.go:170-263):
- Client sends request with JSON body to streaming endpoint
- Server opens HTTP connection and sets headers:
Content-Type: application/x-ndjsonAccept: application/x-ndjson
- Server streams chunks as newline-delimited JSON objects
- Client uses bufio.Scanner to read line-by-line with 8MB buffer
- Each line is unmarshaled into the response struct
- Callback function invoked for each chunk
- Stream ends when the chunk with
"done": trueis received
Buffer Size
The client uses a maximum buffer size of 8 MB (maxBufferSize = 8 * 1048576) to handle large responses.
Client Implementation Examples
Performance Tips
Calculating Tokens per Second
Use the metrics in the final chunk:When to Use Streaming
- ✅ Use streaming for: Interactive chat interfaces, real-time UIs, long responses
- ❌ Disable streaming for: Batch processing, automated testing, structured output validation
Error Handling
Errors during streaming are returned as JSON with anerror field: