demos/continuous_batching/README.md (1 addition, 1 deletion)

@@ -99,7 +99,7 @@ Meta-Llama-3-8B-Instruct
 
 The default configuration of the `LLMExecutor` should work in most cases, but the parameters can be tuned inside the `node_options` section in the `graph.pbtxt` file.
 Note that the `models_path` parameter in the graph file can be an absolute path or relative to the `base_path` from `config.json`.
-Check the [LLM calculator documentation](./llm_calculator.md) to learn about configuration options.
+Check the [LLM calculator documentation](../../docs/llm/reference.md) to learn about configuration options.
 
 > **Note:** The parameter `cache_size` in the graph represents the KV cache size in GB. Reduce the value if you don't have enough RAM on the host.
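To make the tuning step concrete, here is a minimal sketch of the `node_options` block inside `graph.pbtxt`. It assumes the `mediapipe.LLMCalculatorOptions` type used by the demo graphs; the values are placeholders, not recommendations:

```
node_options: {
  [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
    # Relative paths are resolved against base_path from config.json.
    models_path: "./"
    # KV cache size in GB; reduce it if the host is short on RAM.
    cache_size: 8
  }
}
```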
"content": "OpenVINO is a software development kit (SDK) for machine learning (ML) and deep learning (DL) applications. It is developed",
123
+
"content": "OpenVINO is a software toolkit developed by Intel that enables developers to accelerate the training and deployment of deep learning models on Intel hardware.",
123
124
"role": "assistant"
124
125
}
125
126
}
126
127
],
127
-
"created": 1718401064,
128
+
"created": 1718607923,
128
129
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
129
130
"object": "chat.completion"
130
131
}
131
-
132
132
```
133
133
**Note:** If you want to get the response chunks streamed back as they are generated change `stream` parameter in the request to `true`.
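For example, a request body with streaming enabled could look as follows (the prompt is a placeholder; the other fields mirror the example above). With `stream` set to `true`, the server returns a sequence of chunked responses instead of a single JSON object:

```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "stream": true,
  "messages": [
    {
      "role": "user",
      "content": "What is OpenVINO?"
    }
  ]
}
```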
docs/llm/reference.md (9 additions, 13 deletions)

@@ -73,13 +73,7 @@ node: {
 }
 ```
 
-Above node configuration should be used as a template since user is not expected to change most of it's content. Fields that can be safely changed are:
-- `name`
-- `input_stream: "HTTP_REQUEST_PAYLOAD:input"` - in case you want to change input name
-- `output_stream: "HTTP_RESPONSE_PAYLOAD:output"` - in case you want to change input name
-- `node_options`
-
-From this options only `node_options` really requires user attention as they specify LLM engine parameters. The rest of them can remain unchanged.
+The above node configuration should be used as a template, since the user is not expected to change most of its content. Only `node_options` requires user attention, as it specifies the LLM engine parameters; the rest of the configuration can remain unchanged.
 
 The calculator supports the following `node_options` for tuning the pipeline configuration:
 - `required string models_path` - location of the model directory (can be relative);
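Since `models_path` can be relative, it helps to know what it is relative to: the servable's `base_path` from the server's `config.json` (see the README change above). Below is a sketch of such an entry, assuming the `mediapipe_config_list` layout used for graph-based servables; the servable name and path are hypothetical:

```json
{
  "model_config_list": [],
  "mediapipe_config_list": [
    {
      "name": "meta-llama/Meta-Llama-3-8B-Instruct",
      "base_path": "/workspace/Meta-Llama-3-8B-Instruct"
    }
  ]
}
```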
@@ -109,10 +103,12 @@ In node configuration we set `models_path` indicating location of the directory
 ├── template.jinja
 ```
 
-Main model as well as tokenizer and detokenizer are loaded from `.xml` and `.bin` files and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read information required for chat template processing. Chat template is used only on `/chat/completions` endpoint. Template is not applied for calls to `/completions`, so it doesn't have to exist, if you plan to work only with `/completions`.
+The main model as well as the tokenizer and detokenizer are loaded from `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read the information required for chat template processing.
 
 ### Chat template
 
+The chat template is used only on the `/chat/completions` endpoint. The template is not applied for calls to `/completions`, so it doesn't have to exist if you plan to work only with `/completions`.
+
 Loading the chat template proceeds as follows:
 1. If `template.jinja` is present, try to load the template from it.
 2. If there is no `template.jinja` and `tokenizer_config.json` exists, try to read the template from its `chat_template` field. If it's not present, use the default template.
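This fallback order can be expressed compactly in code. Here is a minimal Python sketch of the same logic, illustrative only: the server implements this natively inside the calculator, and `DEFAULT_TEMPLATE` stands in for the servable's built-in default:

```python
import json
from pathlib import Path

# Stand-in for the servable's built-in default template.
DEFAULT_TEMPLATE = "..."

def load_chat_template(models_path: str) -> str:
    model_dir = Path(models_path)
    template_file = model_dir / "template.jinja"
    if template_file.is_file():
        # Step 1: a dedicated template file takes precedence.
        return template_file.read_text()
    config_file = model_dir / "tokenizer_config.json"
    if config_file.is_file():
        # Step 2: fall back to the chat_template field, if present.
        config = json.loads(config_file.read_text())
        if "chat_template" in config:
            return config["chat_template"]
    # Otherwise use the default template.
    return DEFAULT_TEMPLATE
```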
@@ -134,12 +130,12 @@ When default template is loaded, servable accepts `/chat/completions` calls when
 
 As it's in preview, this feature has a set of limitations:
 
-- Limited support for [API parameters](./model_server_rest_api_chat.md#request),
+- Limited support for [API parameters](../model_server_rest_api_chat.md#request),
 - Only one node with the LLM calculator can be deployed at once,
 - Metrics related to text generation - they are planned to be added later,
 - Improvements in stability and recovery mechanisms are also expected