
Commit 2bb4435

[Doc]: fix typos in various files (#28567)
Signed-off-by: Didier Durand <[email protected]>
1 parent 07cadab · commit 2bb4435

5 files changed: +6 −6 lines

docs/design/moe_kernel_features.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -68,7 +68,7 @@ Modular kernels are supported by the following `FusedMoEMethodBase` classes.
 
 ## Fused MoE Experts Kernels
 
-The are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts] function. Many have modular kernel adatpers so they can be used with compatible all2all backends. This table lists each experts kernel and its particular properties.
+The are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts] function. Many have modular kernel adapters so they can be used with compatible all2all backends. This table lists each experts kernel and its particular properties.
 
 Each kernel must be provided with one of the supported input activation formats. Some flavors of kernels support both standard and batched formats through different entry points, e.g. `TritonExperts` and `BatchedTritonExperts`. Batched format kernels are currently only needed for matching with certain all2all backends, e.g. `pplx`, `DeepEPLLPrepareAndFinalize`.
 
```
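
The paragraph fixed above names the base Triton entry point. As a rough illustration of that API, here is a minimal sketch of calling `fused_experts` in the standard (non-batched) activation format; the shapes, dtypes, and CUDA device are illustrative assumptions, not part of this commit.

```python
# A hedged sketch of calling the base Triton entry point named above.
# Shapes/dtypes are illustrative; signature assumed from the base Triton path.
import torch
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts

tokens, hidden, inter, experts, topk = 16, 512, 1024, 8, 2
x = torch.randn(tokens, hidden, dtype=torch.float16, device="cuda")
# w1 fuses the gate and up projections; w2 is the down projection.
w1 = torch.randn(experts, 2 * inter, hidden, dtype=torch.float16, device="cuda")
w2 = torch.randn(experts, hidden, inter, dtype=torch.float16, device="cuda")
topk_weights = torch.rand(tokens, topk, dtype=torch.float32, device="cuda")
topk_ids = torch.randint(0, experts, (tokens, topk), dtype=torch.int32, device="cuda")

out = fused_experts(x, w1, w2, topk_weights, topk_ids)  # standard-format entry point
print(out.shape)  # (tokens, hidden)
```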

docs/features/quantization/quark.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -298,7 +298,7 @@ There are two steps to generate and deploy a mixed precision model quantized wit
 
 Firstly, the layerwise mixed-precision configuration for a given LLM model is searched and then quantized using AMD Quark. We will provide a detailed tutorial with Quark APIs later.
 
-As examples, we provide some ready-to-use quantized mixed precision model to show the usage in vLLM and the accuracy benifits. They are:
+As examples, we provide some ready-to-use quantized mixed precision model to show the usage in vLLM and the accuracy benefits. They are:
 
 - amd/Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8
 - amd/Mixtral-8x7B-Instruct-v0.1-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8
```
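
The Quark doc hunk lists ready-to-use checkpoints; a minimal usage sketch for one of them follows. It assumes vLLM's Quark integration accepts `quantization="quark"` explicitly; in practice the method may also be auto-detected from the checkpoint config.

```python
# A hedged usage sketch, not part of this commit.
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8",
    quantization="quark",  # assumption: explicit selection of the Quark loader
)
outputs = llm.generate(["What is mixed precision?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```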

vllm/compilation/compiler_interface.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -299,7 +299,7 @@ def initialize_cache(
         self.base_cache_dir = cache_dir[: -len(prefix)] if prefix else cache_dir
         if disable_cache:
             return
-        # redirect the cache directory to a sub-directory
+        # redirect the cache directory to a subdirectory
         # set flags so that Inductor and Triton store their cache
         # in the cache_dir, then users only need to copy the cache_dir
         # to another machine to reuse the cache.
```
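
For context, the comment touched here is about funneling Inductor's and Triton's caches into one copyable directory. Below is a hedged sketch of that pattern using the documented `TORCHINDUCTOR_CACHE_DIR` and `TRITON_CACHE_DIR` environment variables; vLLM's actual implementation in `compiler_interface.py` differs.

```python
# A sketch of the redirection pattern the comment describes; vLLM's real
# mechanism differs. The env var names are the documented PyTorch/Triton knobs.
import os

def redirect_compile_caches(cache_dir: str) -> None:
    # Inductor writes its compilation/autotuning cache under this directory.
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = os.path.join(cache_dir, "inductor_cache")
    # Triton stores compiled kernel binaries under this directory.
    os.environ["TRITON_CACHE_DIR"] = os.path.join(cache_dir, "triton_cache")
    # Copying cache_dir to another machine now carries both caches along.
```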

vllm/compilation/decorators.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -159,7 +159,7 @@ def forward(self, x: torch.Tensor, y: Optional[torch.Tensor]): ...
 
     `mark_unbacked_dims` is a dictionary that maps argument names with a dynamic
     dim to be decorated with `mark_unbacked`. This is useful if we would like to
-    enforce that dynamo do not specialize on 0/1 values in the case of dummy input
+    enforce that dynamo does not specialize on 0/1 values in the case of dummy input
     such as for vision model compilation
     """
 
```
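
The docstring fixed above concerns `mark_unbacked`, which stops Dynamo from specializing a compiled graph on a dummy input whose dynamic dim happens to be 0 or 1. A standalone sketch using PyTorch's `torch._dynamo.mark_unbacked` (shapes illustrative; assumes a recent PyTorch):

```python
# A hedged sketch of the 0/1-specialization issue the docstring describes.
import torch
import torch._dynamo

def f(x: torch.Tensor) -> torch.Tensor:
    return x * 2  # no branching on sizes, so an unbacked dim is safe here

x = torch.randn(1, 8)              # dummy input whose leading dim happens to be 1
torch._dynamo.mark_unbacked(x, 0)  # do not specialize the graph on "dim 0 == 1"
compiled = torch.compile(f)
compiled(x)
compiled(torch.randn(4, 8))        # same graph is reused instead of recompiling
```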

```diff
@@ -483,7 +483,7 @@ def maybe_use_cudagraph_partition_wrapper(vllm_config: VllmConfig):
     Context manager to set/unset customized cudagraph partition wrappers.
 
     If we're using Inductor-based graph partitioning, we currently have the
-    whole `fx.Graph` before Inductor lowering and and the piecewise
+    whole `fx.Graph` before Inductor lowering and the piecewise
     splitting happens after all graph passes and fusions. Here, we add
     a custom hook for Inductor to wrap each partition with our static
     graph wrapper class to maintain more control over static graph
```
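
The set/unset behavior this docstring describes follows a common Python pattern: install a hook for the duration of a `with` block and restore the previous value on exit. A generic sketch of that pattern; the hook slot and wrapper below are hypothetical stand-ins, not vLLM's actual API.

```python
# A generic sketch of the set/unset pattern; names are hypothetical.
from contextlib import contextmanager
from typing import Callable, Optional

_partition_wrapper: Optional[Callable] = None  # hypothetical module-level hook slot

@contextmanager
def use_partition_wrapper(wrapper: Callable):
    global _partition_wrapper
    prev, _partition_wrapper = _partition_wrapper, wrapper
    try:
        yield
    finally:
        _partition_wrapper = prev  # restore even if compilation raises
```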

vllm/v1/worker/gpu_model_runner.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -2871,7 +2871,7 @@ def propose_draft_token_ids(
             "gpu_model_runner: set_async_sampled_token_ids"
         ):
             # Save ref of sampled_token_ids CPU tensor if the batch contains
-            # any requests with sampling params that that require output ids.
+            # any requests with sampling params that require output ids.
             self.input_batch.set_async_sampled_token_ids(
                 async_output.sampled_token_ids_cpu,
                 async_output.async_copy_ready_event,
```
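
The comment fixed here is about keeping a reference to a CPU tensor filled by an asynchronous device-to-host copy, paired with a copy-ready event. A generic CUDA sketch of that pattern (names and shapes are illustrative, not vLLM internals):

```python
# A generic sketch of the async D2H copy + ready-event pattern; requires CUDA.
import torch

sampled_ids = torch.randint(0, 32000, (8, 1), device="cuda")
cpu_buf = torch.empty(sampled_ids.shape, dtype=sampled_ids.dtype, pin_memory=True)

copy_stream = torch.cuda.Stream()
copy_stream.wait_stream(torch.cuda.current_stream())  # order after the producer
with torch.cuda.stream(copy_stream):
    cpu_buf.copy_(sampled_ids, non_blocking=True)  # async copy into pinned memory
    copy_ready = torch.cuda.Event()
    copy_ready.record()  # records on copy_stream (the current stream here)

# ... keep a reference to cpu_buf; later, before reading it on the host:
copy_ready.synchronize()
print(cpu_buf.flatten().tolist())
```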
