Overview

Ollama’s API supports streaming responses for generation endpoints (/api/generate and /api/chat). Streaming allows you to receive model output progressively as it’s generated, rather than waiting for the complete response. By default, streaming is enabled for all generation endpoints. You can disable it by setting "stream": false in your request.

How Streaming Works

Ollama uses newline-delimited JSON (NDJSON) to stream responses. Each line in the response is a complete JSON object representing a chunk of the model’s output.

Content Type

When streaming is enabled, responses use the content type:
application/x-ndjson
When streaming is disabled, responses use:
application/json; charset=utf-8

Response Format

Each streaming chunk is a JSON object sent as a separate line. The client reads these chunks sequentially until the done field is true.
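
The read loop described above can be sketched in Python. This is a minimal illustration that assumes the response body is already available as an iterable of text lines; iter_chunks is a hypothetical helper name, not part of any Ollama client, and the sample lines are abbreviated.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line, stopping after done is true."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        chunk = json.loads(line)
        yield chunk
        if chunk.get("done"):
            break  # the final chunk has been delivered


# Abbreviated sample stream (real chunks carry more fields).
sample = [
    '{"model":"llama3.2","response":"The","done":false}',
    '{"model":"llama3.2","response":" sky","done":false}',
    '{"model":"llama3.2","response":"","done":true,"done_reason":"stop"}',
]
text = "".join(chunk["response"] for chunk in iter_chunks(sample))
print(text)  # The sky
```

Writing the loop as a generator lets the caller display each chunk as soon as it is parsed instead of buffering the whole stream.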

Streaming with Generate

Request

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'

Streaming Response

Each chunk contains partial output:
{"model":"llama3.2","created_at":"2023-08-04T08:52:19.385406455-07:00","response":"The","done":false}
{"model":"llama3.2","created_at":"2023-08-04T08:52:19.427063241-07:00","response":" sky","done":false}
{"model":"llama3.2","created_at":"2023-08-04T08:52:19.469304761-07:00","response":" appears","done":false}

Final Chunk

The last chunk has "done": true and includes usage metrics:
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "response": "",
  "done": true,
  "done_reason": "stop",
  "context": [1, 2, 3],
  "total_duration": 10706818083,
  "load_duration": 6338219291,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 130079000,
  "eval_count": 259,
  "eval_duration": 4232710000
}
Response fields:
  • done (boolean, required): Indicates if this is the final chunk in the stream
  • done_reason (string): Reason generation stopped. See Done Reasons below
  • response (string): The generated text chunk. Empty in the final response when streaming
  • total_duration (integer): Total time in nanoseconds for the entire request
  • load_duration (integer): Time in nanoseconds spent loading the model
  • prompt_eval_count (integer): Number of tokens in the prompt
  • prompt_eval_duration (integer): Time in nanoseconds evaluating the prompt
  • eval_count (integer): Number of tokens generated
  • eval_duration (integer): Time in nanoseconds generating the response

Streaming with Chat

Request

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'

Streaming Response

{"model":"llama3.2","created_at":"2023-08-04T08:52:19.385406455-07:00","message":{"role":"assistant","content":"The"},"done":false}
{"model":"llama3.2","created_at":"2023-08-04T08:52:19.427063241-07:00","message":{"role":"assistant","content":" sky"},"done":false}

Final Chunk

{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done": true,
  "done_reason": "stop",
  "total_duration": 4883583458,
  "load_duration": 1334875,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 342546000,
  "eval_count": 282,
  "eval_duration": 4535599000
}
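
Because each chat chunk carries only a fragment of the assistant message, a client usually has to merge the fragments itself. The following is a minimal sketch, assuming the stream is available as a list of JSON lines; collect_chat_message is a hypothetical helper and the sample chunks are abbreviated.

```python
import json


def collect_chat_message(lines):
    """Merge streamed chat chunks into one message dict and return it with the final chunk."""
    role = "assistant"
    parts = []
    final = None
    for line in lines:
        chunk = json.loads(line)
        message = chunk.get("message", {})
        role = message.get("role", role)
        parts.append(message.get("content", ""))
        if chunk.get("done"):
            final = chunk  # keeps done_reason and the timing metrics
            break
    return {"role": role, "content": "".join(parts)}, final


stream = [
    '{"message":{"role":"assistant","content":"The"},"done":false}',
    '{"message":{"role":"assistant","content":" sky"},"done":false}',
    '{"message":{"role":"assistant","content":""},"done":true,"done_reason":"stop"}',
]
message, final = collect_chat_message(stream)
print(message)  # {'role': 'assistant', 'content': 'The sky'}
```

The merged message has the same shape as the entries in the request's messages array, so it could be appended there to continue the conversation.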

Done Reasons

The done_reason field indicates why generation stopped. It is only present in the final chunk, when done is true.
  • stop: Model finished naturally (hit EOS token or stop sequence)
  • length: Reached maximum token limit (num_predict or context length)
  • load: Model was only loaded, not run (empty prompt)
  • unload: Model was unloaded from memory (keep_alive: 0)

Examples

{
  "done": true,
  "done_reason": "stop",
  "response": "That's the final answer."
}
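
A client can branch on done_reason to detect, for example, truncated output. The sketch below maps the four reasons listed above to coarse outcome labels; classify_final_chunk and the label strings are illustrative choices, not part of the API.

```python
def classify_final_chunk(chunk: dict) -> str:
    """Map a chunk to a coarse outcome label (labels are illustrative)."""
    if not chunk.get("done"):
        return "in_progress"  # done_reason is only present on the final chunk
    reason = chunk.get("done_reason")
    if reason == "stop":
        return "complete"
    if reason == "length":
        return "truncated"  # output hit a token limit and may be cut off
    if reason in ("load", "unload"):
        return "no_generation"  # the model was loaded or unloaded, not run
    return "unknown"


print(classify_final_chunk({"done": True, "done_reason": "length"}))  # truncated
```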

Disabling Streaming

To receive the entire response in a single JSON object, set "stream": false:

Request

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Response

The server returns a single JSON object containing the complete response:
{
  "model": "llama3.2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "response": "The sky appears blue because of a phenomenon called Rayleigh scattering...",
  "done": true,
  "done_reason": "stop",
  "context": [1, 2, 3],
  "total_duration": 5043500667,
  "load_duration": 5025959,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 325953000,
  "eval_count": 290,
  "eval_duration": 4709213000
}
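
Code that must work with either mode can branch on the response content type. This is a minimal sketch for /api/generate bodies, assuming the content type and the body text have already been read from the HTTP response; extract_text is a hypothetical helper.

```python
import json


def extract_text(content_type: str, body: str) -> str:
    """Return the full generated text from a streamed or non-streamed generate body."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/x-ndjson":
        # Streaming: concatenate the response field of every line.
        return "".join(
            json.loads(line).get("response", "")
            for line in body.splitlines()
            if line.strip()
        )
    if media_type == "application/json":
        # Non-streaming: one object holds the complete response.
        return json.loads(body).get("response", "")
    raise ValueError(f"unexpected content type: {content_type}")


streamed = '{"response":"The","done":false}\n{"response":" sky","done":true}\n'
single = '{"response":"The sky","done":true}'
print(extract_text("application/x-ndjson", streamed))  # The sky
print(extract_text("application/json; charset=utf-8", single))  # The sky
```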

Implementation Details

From the source code (api/client.go:170-263):
  1. Client sends request with JSON body to streaming endpoint and sets the request header:
    • Accept: application/x-ndjson
  2. Server keeps the HTTP connection open and sets the response header:
    • Content-Type: application/x-ndjson
  3. Server streams chunks as newline-delimited JSON objects
  4. Client uses bufio.Scanner to read line-by-line with 8MB buffer
  5. Each line is unmarshaled into the response struct
  6. Callback function invoked for each chunk
  7. Stream ends when the chunk with "done": true is received

Buffer Size

The client uses a maximum buffer size of 8 MB (maxBufferSize = 8 * 1048576) to handle large responses.
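
For readers not working in Go, the read loop and buffer limit described above can be approximated in Python. This is a rough analog rather than a port of the Go client: stream_lines and MAX_LINE_BYTES are hypothetical names, and the exact boundary behavior of bufio.Scanner is not reproduced.

```python
import io
import json

MAX_LINE_BYTES = 8 * 1048576  # mirrors the 8 MB limit described above


def stream_lines(body, callback, max_line_bytes=MAX_LINE_BYTES):
    """Read newline-delimited JSON from a binary stream, invoking callback per chunk."""
    while True:
        # readline(n) returns at most n bytes; ask for one extra byte so an
        # over-long line can be told apart from one that exactly fits.
        line = body.readline(max_line_bytes + 1)
        if not line:
            break  # end of stream
        if len(line) > max_line_bytes:
            raise ValueError("line exceeds buffer limit")
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        callback(chunk)
        if chunk.get("done"):
            break


body = io.BytesIO(
    b'{"response":"a","done":false}\n'
    b'{"response":"b","done":true}\n'
)
seen = []
stream_lines(body, seen.append)
print([chunk["response"] for chunk in seen])  # ['a', 'b']
```

Capping the line length keeps a malformed or unexpectedly large chunk from exhausting memory, which is the same reason the Go client bounds its scanner buffer.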

Client Implementation Examples

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ollama/ollama/api"
)

func main() {
	// Build a client from the environment configuration.
	client, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatal(err)
	}

	req := &api.GenerateRequest{
		Model:  "llama3.2",
		Prompt: "Why is the sky blue?",
	}

	// The callback is invoked once for each streamed chunk.
	err = client.Generate(context.Background(), req, func(resp api.GenerateResponse) error {
		fmt.Print(resp.Response)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}

Performance Tips

Streaming provides a better user experience for interactive applications by showing progress immediately.

Calculating Tokens per Second

Use the metrics in the final chunk:
tokensPerSecond = (eval_count / eval_duration) * 1e9
Example:
// eval_count = 259, eval_duration = 4232710000
tokensPerSecond = (259 / 4232710000) * 1000000000
// Result: ~61.2 tokens/second
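
The same calculation as a small Python helper, with a guard against a zero duration. rate_per_second is a hypothetical name; the input values are taken from the final chunk shown earlier.

```python
def rate_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    if duration_ns <= 0:
        return 0.0  # avoid dividing by zero when no time was recorded
    return count / duration_ns * 1e9


final_chunk = {"eval_count": 259, "eval_duration": 4232710000}
tps = rate_per_second(final_chunk["eval_count"], final_chunk["eval_duration"])
print(f"{tps:.1f} tokens/s")  # 61.2 tokens/s
```

The same helper works for prompt processing speed when given prompt_eval_count and prompt_eval_duration.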

When to Use Streaming

  • Use streaming for: Interactive chat interfaces, real-time UIs, long responses
  • Disable streaming for: Batch processing, automated testing, structured output validation

Error Handling

Errors during streaming are returned as JSON with an error field:
{
  "error": "model not found"
}
For HTTP errors, check the response status code before parsing:
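import requests  # assumes the third-party requests library

response = requests.post("http://localhost:11434/api/generate", json={"model": "llama3.2", "prompt": "Why is the sky blue?"}, stream=True)
if response.status_code != 200:  # check the status before reading any chunks
    error = response.json()
    print(f"Error: {error.get('error')}")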
See Error Handling for the complete error reference.
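
Since an error can arrive partway through a stream, it helps to check each parsed line for an error field before treating it as output. This is a minimal sketch; StreamError and parse_stream_line are hypothetical names.

```python
import json


class StreamError(Exception):
    """Raised when a streamed line carries an error field."""


def parse_stream_line(line: str) -> dict:
    """Parse one NDJSON line, raising StreamError if it reports an error."""
    chunk = json.loads(line)
    if chunk.get("error"):
        raise StreamError(chunk["error"])
    return chunk


try:
    parse_stream_line('{"error": "model not found"}')
except StreamError as exc:
    print(f"Error: {exc}")  # Error: model not found
```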