@@ -38,7 +38,7 @@ pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/hugging
 Run optimum-cli to download and quantize the model:
 ```bash
 cd demos/continuous_batching
-optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8 Meta-Llama-3-8B-Instruct
+optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 Meta-Llama-3-8B-Instruct
 convert_tokenizer -o Meta-Llama-3-8B-Instruct --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
 ```

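After the export, the resulting directory can be smoke-tested locally before it is served. A minimal sketch, assuming `optimum-intel` and `transformers` are installed as above; the prompt and generation settings are illustrative only and not part of the demo steps:

```python
# Illustrative check of the directory produced by `optimum-cli export openvino` above.
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_dir = "Meta-Llama-3-8B-Instruct"  # output directory created by optimum-cli
tokenizer = AutoTokenizer.from_pretrained(model_dir)   # HF tokenizer files are exported alongside the IR
model = OVModelForCausalLM.from_pretrained(model_dir)  # loads openvino_model.xml on CPU by default

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```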
@@ -287,36 +287,31 @@ It can be demonstrated using benchmarking app from vLLM repository:
 ```bash
 git clone https://github.com/vllm-project/vllm
 cd vllm
-pip3 install wheel packaging ninja "setuptools>=49.4.0" numpy
 pip3 install -r requirements-cpu.txt
-export VLLM_TARGET_DEVICE=cpu
-python setup.py install
 cd benchmarks
-sed -i -e 's|v1/chat/completions|v3/chat/completions|g' backend_request_func.py # allows calls to endpoint with v3 instead of v1 like in vLLM
 wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
-python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 1
+python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf

-Namespace(backend='openai-chat', version='N/A', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, request_rate=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False)
-Traffic request rate: 1.0
+Namespace(backend='openai-chat', version='N/A', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False)
+Traffic request rate: inf
 100%|██████████████████████████████████████████████████| 1000/1000 [17:17<00:00, 1.04s/it]
 ============ Serving Benchmark Result ============
 Successful requests: 1000
-Benchmark duration (s): 1037.78
-Total input tokens: 245995
-Total generated tokens: 195504
-Request throughput (req/s): 0.96
-Input token throughput (tok/s): 237.04
-Output token throughput (tok/s): 188.39
+Benchmark duration (s): 447.62
+Total input tokens: 215201
+Total generated tokens: 198588
+Request throughput (req/s): 2.23
+Input token throughput (tok/s): 480.76
+Output token throughput (tok/s): 443.65
 ---------------Time to First Token----------------
-Mean TTFT (ms): 693.63
-Median TTFT (ms): 570.60
-P99 TTFT (ms): 2187.77
+Mean TTFT (ms): 171999.94
+Median TTFT (ms): 170699.21
+P99 TTFT (ms): 360941.40
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 132.96
-Median TPOT (ms): 143.28
-P99 TPOT (ms): 234.14
+Mean TPOT (ms): 211.31
+Median TPOT (ms): 223.79
+P99 TPOT (ms): 246.48
 ==================================================
-
 ```

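With `--request-rate inf` all prompts are submitted at once, so the reported TTFT includes queueing time, which is why it is much higher than with a throttled request rate while overall throughput improves. The headline throughput figures follow directly from the reported totals; a minimal sketch, using the values copied from the report above:

```python
# Recompute the throughput figures from the totals printed by benchmark_serving.py above.
successful_requests = 1000
duration_s = 447.62              # Benchmark duration (s)
total_input_tokens = 215201
total_generated_tokens = 198588

print(f"Request throughput (req/s):      {successful_requests / duration_s:.2f}")     # ~2.23
print(f"Input token throughput (tok/s):  {total_input_tokens / duration_s:.2f}")      # ~480.77
print(f"Output token throughput (tok/s): {total_generated_tokens / duration_s:.2f}")  # ~443.65
```

Small differences (e.g. 480.77 vs. the reported 480.76) come from the benchmark duration being rounded to two decimals here.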
 ## RAG with Model Server