demos/continuous_batching/README.md (1 addition, 1 deletion)

@@ -99,7 +99,7 @@ Meta-Llama-3-8B-Instruct
 
 The default configuration of the `LLMExecutor` should work in most cases, but the parameters can be tuned inside the `node_options` section in the `graph.pbtxt` file.
 Note that the `models_path` parameter in the graph file can be an absolute path or relative to the `base_path` from `config.json`.
-Check the [LLM calculator documentation](./llm_calculator.md) to learn about configuration options.
+Check the [LLM calculator documentation](../../docs/llm/reference.md) to learn about configuration options.
 
 > **Note:** The parameter `cache_size` in the graph represents the KV cache size in GB. Reduce the value if you don't have enough RAM on the host.
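To make the tuning step concrete, here is a minimal sketch of the `node_options` block inside `graph.pbtxt`. It assumes the `mediapipe.LLMCalculatorOptions` type used by the demo graphs; the values are placeholders, not recommendations:

```
node_options: {
  [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
    # Relative paths are resolved against base_path from config.json.
    models_path: "./"
    # KV cache size in GB; reduce it if the host is short on RAM.
    cache_size: 8
  }
}
```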
"content": "OpenVINO is a software development kit (SDK) for machine learning (ML) and deep learning (DL) applications. It is developed",
123
+
"content": "OpenVINO is a software toolkit developed by Intel that enables developers to accelerate the training and deployment of deep learning models on Intel hardware.",
123
124
"role": "assistant"
124
125
}
125
126
}
126
127
],
127
-
"created": 1718401064,
128
+
"created": 1718607923,
128
129
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
129
130
"object": "chat.completion"
130
131
}
131
-
132
132
```
133
133
**Note:** If you want to get the response chunks streamed back as they are generated change `stream` parameter in the request to `true`.
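For example, a request body with streaming enabled could look as follows (the prompt is a placeholder; the other fields mirror the example above). With `stream` set to `true`, the server returns a sequence of chunked responses instead of a single JSON object:

```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "stream": true,
  "messages": [
    {
      "role": "user",
      "content": "What is OpenVINO?"
    }
  ]
}
```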
docs/llm/reference.md (9 additions, 13 deletions)

@@ -73,13 +73,7 @@ node: {
 }
 ```
 
-Above node configuration should be used as a template since user is not expected to change most of it's content. Fields that can be safely changed are:
-- `name`
-- `input_stream: "HTTP_REQUEST_PAYLOAD:input"` - in case you want to change input name
-- `output_stream: "HTTP_RESPONSE_PAYLOAD:output"` - in case you want to change input name
-- `node_options`
-
-From this options only `node_options` really requires user attention as they specify LLM engine parameters. The rest of them can remain unchanged.
+The above node configuration should be used as a template, since the user is not expected to change most of its content. Only `node_options` requires user attention, as it specifies the LLM engine parameters; the rest of the configuration can remain unchanged.
 
 The calculator supports the following `node_options` for tuning the pipeline configuration:
 - `required string models_path` - location of the model directory (can be relative);
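Since `models_path` can be relative, it helps to know what it is relative to: the servable's `base_path` from the server's `config.json` (see the README change above). Below is a sketch of such an entry, assuming the `mediapipe_config_list` layout used for graph-based servables; the servable name and path are hypothetical:

```json
{
  "model_config_list": [],
  "mediapipe_config_list": [
    {
      "name": "meta-llama/Meta-Llama-3-8B-Instruct",
      "base_path": "/workspace/Meta-Llama-3-8B-Instruct"
    }
  ]
}
```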
@@ -109,10 +103,12 @@ In node configuration we set `models_path` indicating location of the directory
 ├── template.jinja
 ```
 
-Main model as well as tokenizer and detokenizer are loaded from `.xml` and `.bin` files and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read information required for chat template processing. Chat template is used only on `/chat/completions` endpoint. Template is not applied for calls to `/completions`, so it doesn't have to exist, if you plan to work only with `/completions`.
+The main model as well as the tokenizer and detokenizer are loaded from `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read the information required for chat template processing.
 
 ### Chat template
 
+The chat template is used only on the `/chat/completions` endpoint. The template is not applied for calls to `/completions`, so it doesn't have to exist if you plan to work only with `/completions`.
+
 Loading the chat template proceeds as follows:
 1. If `template.jinja` is present, try to load the template from it.
 2. If there is no `template.jinja` and `tokenizer_config.json` exists, try to read the template from its `chat_template` field. If it's not present, use the default template.
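This fallback order can be expressed compactly in code. Here is a minimal Python sketch of the same logic, illustrative only: the server implements this natively inside the calculator, and `DEFAULT_TEMPLATE` stands in for the servable's built-in default:

```python
import json
from pathlib import Path

# Stand-in for the servable's built-in default template.
DEFAULT_TEMPLATE = "..."

def load_chat_template(models_path: str) -> str:
    model_dir = Path(models_path)
    template_file = model_dir / "template.jinja"
    if template_file.is_file():
        # Step 1: a dedicated template file takes precedence.
        return template_file.read_text()
    config_file = model_dir / "tokenizer_config.json"
    if config_file.is_file():
        # Step 2: fall back to the chat_template field, if present.
        config = json.loads(config_file.read_text())
        if "chat_template" in config:
            return config["chat_template"]
    # Otherwise use the default template.
    return DEFAULT_TEMPLATE
```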
@@ -134,12 +130,12 @@ When default template is loaded, servable accepts `/chat/completions` calls when
 
 As it's in preview, this feature has a set of limitations:
 
-- Limited support for [API parameters](./model_server_rest_api_chat.md#request),
+- Limited support for [API parameters](../model_server_rest_api_chat.md#request),
 - Only one node with the LLM calculator can be deployed at once,
 - Metrics related to text generation - they are planned to be added later,
 - Improvements in stability and recovery mechanisms are also expected