Hello
I quantized the Qwen3 235B MoE model with FP8 Dynamic and FP8 Static (weights, activations, and KV cache) using the llm-compressor library, and benchmarked the online serving performance with vLLM's benchmarking scripts on H200 GPUs (a rough sketch of the quantization step is included after the results below). I was hoping to see a significant performance improvement, but:
- For model serving with tensor_parallel_size 8, there is barely any throughput or latency improvement; in fact, latencies are slightly worse.
- For model serving with tensor_parallel_size 4, a throughput improvement only appears in very specific scenarios, with inputs in the 5k-7k token range and outputs around 1k tokens. There are no latency improvements.
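
For reference, the FP8-Dynamic quantization step roughly followed llm-compressor's standard `QuantizationModifier` recipe. This is only a minimal sketch: the model ID, output directory, and the MoE router `ignore` pattern are assumptions for illustration, not the exact script I ran.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "Qwen/Qwen3-235B-A22B"  # assumed model ID

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic: weights are quantized offline, activations are quantized
# per-token at runtime. The lm_head and MoE router/gate layers are kept in
# higher precision (the gate regex is an assumption based on MoE examples).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```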
Considering that the weights, many of the activations, and the KV cache are all stored at half the precision, shouldn't there be a significant performance improvement?
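
To make that expectation concrete, here is the back-of-envelope arithmetic behind it (the ~235B parameter count is only used for illustration):

```python
# Rough weight-footprint comparison, BF16 vs. FP8 (illustrative only).
params = 235e9                         # assumed total parameter count
bf16_weights_gb = params * 2 / 1e9     # 2 bytes/param -> ~470 GB
fp8_weights_gb = params * 1 / 1e9      # 1 byte/param  -> ~235 GB

print(f"BF16 weights: ~{bf16_weights_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_weights_gb:.0f} GB")
# Since decode is largely memory-bandwidth-bound, halving the bytes read per
# step (weights and KV cache) is what led me to expect a visible speedup.
```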
Could you help me understand where my reasoning might be off?
thanks!