Open
Description
So, I ran a benchmark with vLLM (I modified some parts of the code so it would work with vLLM on TPU), and the logs show something like this:
Processed prompts: 16%|▏| 1456/9263 [37:28<11:01:04, 5.08s/it, est. speed input: 86.99 toks/s, outp
Processed prompts: 64%|████████| 5971/9263 [2:16:18<1:17:49, 1.42s/it, est. speed input: 72.62 toks/s, output: 2683.22 toks/s]
The input tok/s is much lower than the output tok/s. Do you think this could be a bottleneck?
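One way to sanity-check these numbers is to convert the two rates into average tokens per finished prompt. This is only a rough sketch: it assumes both "est. speed" figures are cumulative token counts divided by the same elapsed time shown in the progress bar, which I have not verified against the modified TPU code. The helper names `parse_elapsed` and `tokens_per_prompt` are made up for illustration, and the inputs are copied from the second log line above.

```python
def parse_elapsed(text: str) -> int:
    """Convert a progress-bar time such as 'H:MM:SS' or 'MM:SS' into seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def tokens_per_prompt(rate_toks_per_s: float, elapsed: str, prompts_done: int) -> float:
    """Average tokens per finished prompt, assuming rate = total tokens / elapsed."""
    return rate_toks_per_s * parse_elapsed(elapsed) / prompts_done


# Values taken from the second log line (5971 prompts done after 2:16:18).
elapsed = "2:16:18"
done = 5971
avg_in = tokens_per_prompt(72.62, elapsed, done)
avg_out = tokens_per_prompt(2683.22, elapsed, done)

print(f"avg input tokens per prompt:  {avg_in:.1f}")
print(f"avg output tokens per prompt: {avg_out:.1f}")
print(f"output/input ratio:           {avg_out / avg_in:.1f}")
```

If that assumption holds, the low input rate may just mean the prompts are short (on the order of a hundred tokens) while each generation runs to several thousand tokens, rather than prefill being slow.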