Commit e058ea0

Merge branch 'main' into main
2 parents 0d1d35e + 3cd0300 commit e058ea0

25 files changed: +578 additions, −264 deletions

README.md

Lines changed: 3 additions & 0 deletions
@@ -37,6 +37,7 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou

 Some of the exciting new features include:

+* **AutoRound Quantization Support**: Added [`AutoRoundModifier`](examples/autoround/llama3_example.py) for quantization using [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced post-training algorithm that optimizes rounding and clipping ranges through sign-gradient descent. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance.
 * **Qwen3 Next and Qwen3 VL MoE Quantization Support**: Quantize the Qwen3 Next and Qwen3 VL MoE models and seamlessly run them in vLLM. Examples of [NVFP4](examples/quantization_w4a4_fp4/qwen3_next_example.py) and [FP8](examples/quantization_w8a8_fp8/qwen3_next_example.py) quantization have been added for Qwen3-Next-80B-A3B-Instruct. For Qwen3 VL MoE, support has been added for the data-free pathway, specifically [FP8 quantization](examples/quantization_w8a8_fp8/qwen3_vl_moe_fp8_example.py) (e.g., channel-wise and block-wise quantization). NOTE: these models are not supported in transformers<=4.56.2; you may need to install transformers from source.
 * **Quantization with Multiple Modifiers**: Multiple quantization modifiers can now be applied to the same model for mixed-precision quantization, for example applying AWQ W4A16 to a model's `self_attn` layers and GPTQ W8A8 to its `mlp` layers. This is an advanced usage of `llm-compressor` and an active area of research. See the [non-uniform quantization support](examples/quantization_non_uniform) section for more detail and [example usage](examples/quantization_non_uniform/quantization_multiple_modifiers.py); a minimal sketch also follows this file's diff below.
 * **QuIP and SpinQuant-style Transforms**: The newly added [`QuIPModifier`](examples/transform/quip_example.py) and [`SpinQuantModifier`](examples/transform/spinquant_example.py) allow users to quantize their models after injecting Hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low-bit weight and activation quantization.

@@ -55,6 +56,7 @@ Some of the exciting new features include:
 * AWQ
 * SmoothQuant
 * SparseGPT
+* AutoRound

 ### When to Use Which Optimization

@@ -78,6 +80,7 @@ Applying quantization with `llmcompressor`:
 * [Weight only quantization to `fp4`](examples/quantization_w4a16_fp4/llama3_example.py)
 * [Weight only quantization to `int4` using GPTQ](examples/quantization_w4a16/README.md)
 * [Weight only quantization to `int4` using AWQ](examples/awq/README.md)
+* [Weight only quantization to `int4` using AutoRound](examples/autoround/README.md)
 * [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
 * [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
 * [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)
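
A minimal sketch of the mixed-precision, multiple-modifier recipe mentioned above, assuming AWQ on the attention projections and GPTQ on the MLP projections. The regex targets, dataset name, and sample counts are illustrative; see `examples/quantization_non_uniform/quantization_multiple_modifiers.py` for the maintained example.

```python
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

# Illustrative model; any decoder-only causal LM with Llama-style module names works.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype="auto"
)

# Mixed precision: W4A16 (AWQ) on attention projections, W8A8 (GPTQ) on MLP projections.
recipe = [
    AWQModifier(
        targets=[r"re:.*self_attn\.(q|k|v|o)_proj$"],
        scheme="W4A16",
        ignore=["lm_head"],
    ),
    GPTQModifier(
        targets=[r"re:.*mlp\.(gate|up|down)_proj$"],
        scheme="W8A8",
        ignore=["lm_head"],
    ),
]

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
```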

docs/getting-started/compress.md

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ Compression schemes use quantization methods including the following:
 | **AWQ** | Uses channelwise scaling to better preserve important outliers in weights and activations | Better accuracy recovery with faster runtime than GPTQ |
 | **SmoothQuant** | Smooths outliers in activations by folding them into weights, ensuring better accuracy for weight and activation quantized models | Good accuracy recovery with minimal calibration time; composable with other methods |
 | **Round-To-Nearest (RTN)** | Simple quantization technique that rounds each value to the nearest representable level in the target precision. | Provides moderate accuracy recovery in most scenarios. Computationally cheap and fast to implement, making it suitable for real-time or resource-constrained environments. |
+| **AutoRound** | Optimizes rounding and clipping ranges via sign-gradient descent. | Delivers leading 4-bit and superior sub-4-bit accuracy compared to GPTQ/AWQ, with runtime faster than GPTQ and on par with AWQ. |

 For this guide, we'll use `GPTQ` composed with `SmoothQuant` to create an `INT W8A8` quantized model. This combination provides a good balance of performance, accuracy, and compatibility across a wide range of hardware.
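
A minimal sketch of the `GPTQ` composed with `SmoothQuant` flow referenced above, following the project's getting-started examples; the model ID, dataset, and calibration sizes are placeholders to adjust for your own run.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant folds activation outliers into the weights, then GPTQ quantizes
# weights and activations to INT8 (W8A8).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-W8A8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```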

examples/autoround/README.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@

# `AutoRound` Quantization

`llm-compressor` supports [AutoRound](https://aclanthology.org/2024.findings-emnlp.662.pdf), an advanced quantization technique that delivers **high-accuracy**, **low-bit quantization**. The quantized results are fully compatible with `compressed-tensors` and can be served directly with vLLM.

AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance.
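
Schematically, and simplifying the paper's notation (this restatement is an editorial sketch, not text from the example):

$$
\tilde{W} = s \cdot \mathrm{clip}\!\left(\left\lfloor \tfrac{W}{s} + V \right\rceil,\ N_{\min},\ N_{\max}\right),
\qquad
\min_{V,\,\alpha,\,\beta}\ \bigl\lVert \mathcal{F}(W, X) - \mathcal{F}(\tilde{W}, X) \bigr\rVert_F^{2}
$$

where $\mathcal{F}(\cdot, X)$ is the output of the current decoder block on calibration inputs $X$, $V$ perturbs the rounding decision, and $\alpha$, $\beta$ rescale the clipping range (and therefore the scale $s$); all three are updated with signed gradient descent.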

## Installation

To get started, install:

```bash
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

## Quickstart

The example includes an end-to-end script for applying the AutoRound quantization algorithm.

```bash
python3 llama3_example.py
```

The resulting model, `Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound`, is ready to be loaded into vLLM.

## Code Walkthrough

Now, we will step through the code in the example. There are four steps:

1) Load model
2) Prepare calibration data
3) Apply quantization
4) Evaluate accuracy in vLLM

### 1) Load Model

Load the model using `AutoModelForCausalLM`, which handles quantized saving and loading.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2) Prepare Calibration Data

When quantizing model weights with AutoRound, you'll need a small set of sample data to run the algorithm. By default, we use [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) as the calibration dataset.

Recommended starting points:

- 128 samples — typically sufficient for stable calibration (increase if accuracy degrades).
- 2048 sequence length — a good baseline for most LLMs.
- 200 tuning steps — usually enough to converge (increase if accuracy drops).

```python
# Select calibration dataset.
from auto_round.calib_dataset import get_dataset

NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 2048

# Get aligned calibration dataset.
ds = get_dataset(
    tokenizer=tokenizer,
    seqlen=MAX_SEQUENCE_LENGTH,
    nsamples=NUM_CALIBRATION_SAMPLES,
)
```

### 3) Apply Quantization

With the dataset ready, we will now apply AutoRound quantization to the model.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier

# Configure the quantization algorithm to run.
recipe = AutoRoundModifier(
    targets="Linear", scheme="W4A16", ignore=["lm_head"], iters=200
)

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    # disable shuffling to get a slightly better MMLU score
    shuffle_calibration_samples=False,
)

# Save the compressed model to disk.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

We have successfully created an `int4` model!

### 4) Evaluate Accuracy

With the model created, we can now load and run it in vLLM (after installing vLLM).

```python
from vllm import LLM

model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound")
```

We can evaluate accuracy with `lm_eval` (`pip install lm-eval==0.4.9.1`):

> Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.

Run the following to test accuracy on GSM-8K:

```bash
lm_eval --model vllm \
  --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128-AutoRound",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 1000 \
  --batch_size 'auto'
```

The resulting scores look good:

```bash
| Tasks | Version | Filter           | n-shot | Metric      |     | Value |     | Stderr |
| ----- | ------: | ---------------- | -----: | ----------- | --- | ----: | --- | -----: |
| gsm8k |       3 | flexible-extract |      5 | exact_match |     | 0.737 |  ±  | 0.0139 |
|       |         | strict-match     |      5 | exact_match |     | 0.736 |  ±  | 0.0139 |
```

> Note: quantized model accuracy may vary slightly due to nondeterminism.

### Known Issues

Currently, `llm-compressor` supports applying AutoRound only to `wNa16` quantization schemes. Support for additional schemes is planned; you can follow progress in the [RFC](https://github.com/vllm-project/llm-compressor/issues/1968).

### Questions or Feature Requests?

Please open an issue on [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor) or [intel/auto-round](https://github.com/intel/auto-round).

setup.py

Lines changed: 1 addition & 2 deletions
@@ -144,8 +144,7 @@ def localversion_func(version: ScmVersion) -> str:
             if BUILD_TYPE == "release"
             else "compressed-tensors>=0.12.3a2"
         ),
-        # TODO: replace it with the release version
-        ("auto_round @ git+https://github.com/intel/auto-round.git@llmc"),
+        ("auto-round==0.9.1"),
     ],
     extras_require={
         "dev": [

src/llmcompressor/entrypoints/utils.py

Lines changed: 1 addition & 1 deletion
@@ -29,12 +29,12 @@
 from llmcompressor.pytorch.model_load.helpers import parse_dtype
 from llmcompressor.transformers.compression.compressed_tensors_utils import (
     modify_save_pretrained,
-    untie_word_embeddings,
 )
 from llmcompressor.transformers.utils.helpers import (
     is_model_ct_quantized_from_path,
 )
 from llmcompressor.typing import Processor
+from llmcompressor.utils import untie_word_embeddings
 from llmcompressor.utils.fsdp.helpers import is_fsdp_model

src/llmcompressor/modifiers/autoround/base.py

Lines changed: 8 additions & 11 deletions
@@ -20,10 +20,8 @@
 from llmcompressor.modifiers import Modifier
 from llmcompressor.modifiers.quantization.calibration import apply_calibration_status
 from llmcompressor.modifiers.quantization.quantization import QuantizationMixin
-from llmcompressor.transformers.compression.compressed_tensors_utils import (
-    untie_if_target_shared_embedding,
-)
-from llmcompressor.utils.pytorch.module import get_no_split_params
+from llmcompressor.utils import targets_embeddings, untie_word_embeddings
+from llmcompressor.utils.pytorch import get_no_split_params

 __all__ = ["AutoRoundModifier"]

@@ -111,9 +109,9 @@ class AutoRoundModifier(Modifier, QuantizationMixin):
     # AutoRound modifier arguments
     iters: int = 200
     enable_torch_compile: bool = True
+    batch_size: int = 8

     # private variables
-    _module_names: Dict[torch.nn.Module, str] = PrivateAttr(default_factory=dict)
     _all_module_input: Dict[str, List[Tuple]] = PrivateAttr(default_factory=dict)
     _q_input: Optional[torch.Tensor] = PrivateAttr(default=None)

@@ -128,10 +126,6 @@ def on_initialize(self, state: State, **kwargs) -> bool:
         QuantizationMixin.initialize_quantization(self, state.model)

         # prepare module names
-        self._module_names = {
-            m: name
-            for name, m in match_named_modules(state.model, self.targets, self.ignore)
-        }
         self._add_temporary_names(state.model)
         # freeze all model parameters
         for _, param in state.model.named_parameters():

@@ -146,7 +140,9 @@ def start_calibration(self, model: torch.nn.Module):

         :param model: model to prepare for calibration
         """
-        untie_if_target_shared_embedding(model, self._module_names.values())
+        targets = match_named_modules(model, self.targets, self.ignore)
+        if targets_embeddings(model, targets):
+            untie_word_embeddings(model)

         for _, module in match_named_modules(model, self.targets, self.ignore):
             # Note: No need to register observers for auto-round

@@ -227,6 +223,7 @@ def apply_autoround(self, state, subgraph):
             scheme=ar_quant_scheme,
             iters=self.iters,
             enable_torch_compile=self.enable_torch_compile,
+            batch_size=self.batch_size,
         )
         # TODO: configure layer-wise config based on self.resolved_config
         ar.configure_layer_config(enable_gguf_official_mixed=False)

@@ -240,7 +237,7 @@ def apply_autoround(self, state, subgraph):
             block=decoding_layer,
             inputs=cur_inputs,
             q_input=self._q_input,
-            device=device,
+            device=str(device),
             # Leave offload for LLMC
             auto_offload=False,
         )
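
A small, hedged sketch of the new `batch_size` knob added to `AutoRoundModifier` above; the value and surrounding arguments are illustrative and mirror `examples/autoround/README.md`:

```python
from llmcompressor.modifiers.autoround import AutoRoundModifier

# batch_size (added in this diff) sets how many calibration samples auto-round
# processes per tuning step; larger values can speed up tuning at the cost of
# memory. The remaining arguments follow the AutoRound example README.
recipe = AutoRoundModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    iters=200,
    batch_size=8,
)
```

The recipe is then passed to `oneshot(...)` exactly as in the example walkthrough earlier in this commit.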

src/llmcompressor/modifiers/awq/mappings.py

Lines changed: 1 addition & 0 deletions
@@ -166,6 +166,7 @@ class AWQMapping:
     "Llama4ForConditionalGeneration": _default_mappings,
     "Mistral3ForConditionalGeneration": _default_mappings,
     "MistralForCausalLM": _default_mappings,
+    "Olmo3ForCausalLM": _exaone4_mappings,
     "Phi3ForCausalLM": _phi_mappings,
     "Phi3VForCausalLM": _phi_mappings,
     "Qwen2ForCausalLM": _default_mappings,

src/llmcompressor/modifiers/quantization/quantization/mixin.py

Lines changed: 4 additions & 8 deletions
@@ -34,9 +34,7 @@
     reset_quantization_status,
 )
 from llmcompressor.modifiers.utils.hooks import HooksMixin
-from llmcompressor.transformers.compression.compressed_tensors_utils import (
-    untie_if_target_shared_embedding,
-)
+from llmcompressor.utils import targets_embeddings, untie_word_embeddings

 __all__ = ["QuantizationMixin"]

@@ -184,11 +182,9 @@ def start_calibration(self, model: torch.nn.Module):

         :param model: model to prepare for calibration
         """
-
-        matched_module_generator = (
-            x[1] for x in match_named_modules(model, self.resolved_targets, self.ignore)
-        )
-        untie_if_target_shared_embedding(model, matched_module_generator)
+        targets = match_named_modules(model, self.resolved_targets, self.ignore)
+        if targets_embeddings(model, targets):
+            untie_word_embeddings(model)

         for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
             self._initialize_observers(module)

src/llmcompressor/modifiers/smoothquant/base.py

Lines changed: 26 additions & 27 deletions
@@ -2,7 +2,7 @@
 from typing import Callable, Dict, List, Optional, Tuple, Union

 import torch
-from compressed_tensors.utils import align_module_device
+from compressed_tensors.utils import align_module_device, match_named_modules
 from loguru import logger
 from pydantic import ConfigDict, Field
 from torch.nn import Module

@@ -14,11 +14,7 @@
     handle_mapping_resolution_errors,
 )
 from llmcompressor.utils.fsdp.helpers import get_fsdp_parent
-from llmcompressor.utils.pytorch.module import (
-    get_layers,
-    get_matching_layer,
-    match_targets,
-)
+from llmcompressor.utils.pytorch.module import get_layer_by_name

 MINIMUM_SMOOTHING_SCALE = 1e-5

@@ -196,31 +192,34 @@ def _resolve_mappings(self, model: Module) -> List[SmoothQuantMapping]:
         Transforms the list of activations to smooth and their corresponding weights
         into SmoothQuantMapping objects, resolving regular expressions.

-        For each activation in the mapping list, we find the corresponding weight to
-        balance by searching for the longest substring. For instance, if our balance
-        weight is ".*re:.*q_proj" and the activation is "re:.*self_attn_layer_norm" we
-        would match model.layer.0.p_proj to model.layer.0.self_attn_layer_norm and
-        repeat for model.layer.1 and so on
+        For each activation in the mapping list, we find ALL corresponding weights to
+        balance by matching within the parent scope. This ensures all matching layers
+        are included, which is critical for MoE models where multiple experts need to
+        be balanced.
         """
         resolved_mappings = []
         for to_balance, to_smooth in self.mappings:
-            to_smooth_layers = get_layers(to_smooth, model)
-            for layer_name, smooth_layer in to_smooth_layers.items():
-                if not match_targets(layer_name, self.ignore)[0]:
-                    balance_layers = []
-                    for balance_suffix in to_balance:
-                        # find the submodule that matches the activation layer
-                        _, balance_layer = get_matching_layer(
-                            balance_suffix, layer_name, model
-                        )
-                        if balance_layer:
-                            balance_layers.append(balance_layer)
-                    # each mapping can contain multiple layers to balance, but only
-                    # one layer to smooth
-                    mapping = SmoothQuantMapping(
-                        layer_name, smooth_layer, balance_layers
-                    )
-                    resolved_mappings.append(mapping)
+            to_smooth_list = [to_smooth] if isinstance(to_smooth, str) else to_smooth
+
+            for smooth_name, smooth_layer in match_named_modules(
+                model, to_smooth_list, self.ignore
+            ):
+                # Search for balance layers within the parent scope
+                smooth_parent_name = ".".join(smooth_name.split(".")[:-1])
+                smooth_parent = get_layer_by_name(smooth_parent_name, model)
+
+                balance_layers = [
+                    balance_layer
+                    for _, balance_layer in match_named_modules(
+                        smooth_parent, to_balance, self.ignore
+                    )
+                ]
+
+                if balance_layers:
+                    resolved_mappings.append(
+                        SmoothQuantMapping(smooth_name, smooth_layer, balance_layers)
+                    )
+
         return resolved_mappings

     def _setup_scale_hooks(self):
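
A hedged sketch of why the parent-scope matching described in the new docstring matters for MoE models: a balance pattern that matches every expert's projection now resolves to all experts under the smoothed norm's parent layer, rather than to a single longest-substring match. The module names below are illustrative of a Mixtral-style layout and are not taken from this commit.

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# For each decoder layer, the post-attention norm is smoothed and ALL experts'
# w1/w3 projections in that layer are balanced, since match_named_modules is
# applied within the shared parent scope.
recipe = SmoothQuantModifier(
    smoothing_strength=0.8,
    mappings=[
        (
            [
                r"re:.*block_sparse_moe\.experts\.\d+\.w1$",
                r"re:.*block_sparse_moe\.experts\.\d+\.w3$",
            ],
            r"re:.*post_attention_layernorm$",
        ),
    ],
)
```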
