llmcompressor.modifiers.awq.base
Classes:
- AWQModifier – Implements the AWQ (Activation-aware Weight Quantization) algorithm
AWQModifier
Bases: Modifier, QuantizationMixin
Implements the AWQ (Activation-aware Weight Quantization) algorithm, as described in https://arxiv.org/pdf/2306.00978. The algorithm significantly reduces quantization error by protecting only 1% of the most salient weight channels.
Instead of relying on raw weight values, AWQ identifies important channels by analyzing activation patterns, focusing on the channels in the weight tensor that are most responsive to the input. To reduce quantization error, it scales these channels in a way that preserves the model's original behavior, using scaling factors computed offline from activation statistics.
Because this modifier manipulates the weights of the model, it can only be used in one-shot and not during training. Activation ranges are determined by running a small set of calibration data through the model.
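The equivalence AWQ relies on can be pictured with a minimal PyTorch sketch (illustrative only, not the library's implementation): per-input-channel scales folded into a layer's weight, with their inverses folded into the incoming activations (in practice, into the preceding smooth layer), leave the float output unchanged while making salient channels easier to quantize.

    import torch

    torch.manual_seed(0)
    x = torch.randn(8, 64)          # calibration activations [tokens, in_features]
    w = torch.randn(128, 64)        # linear weight [out_features, in_features]
    s = torch.rand(64) + 0.5        # per-input-channel scales derived from activation statistics

    y_ref = x @ w.T                 # original output
    y_awq = (x / s) @ (w * s).T     # scales folded into the weight, inverses into the activations
    assert torch.allclose(y_ref, y_awq, atol=1e-4)  # behavior preserved before quantization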
Example recipe:

    AWQModifier:
      mappings:
        - smooth_layer: "re:.*self_attn_layer_norm"
          balance_layers: ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]
        - smooth_layer: "re:.*final_layer_norm"
          balance_layers: ["re:.*fc1"]
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 4
            type: int
            symmetric: false
            strategy: group
            group_size: 128
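The same configuration can be applied in Python with a one-shot run. The following is a hedged sketch: the W4A16_ASYM preset is assumed to match the recipe above (4-bit asymmetric weights, group quantization), the model id and dataset are placeholders, and oneshot argument names may vary between llm-compressor versions.

    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier

    recipe = AWQModifier(
        ignore=["lm_head"],
        targets=["Linear"],
        scheme="W4A16_ASYM",  # assumed preset: 4-bit asymmetric weights, group strategy
    )

    oneshot(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder: any Hugging Face causal LM id or local path
        dataset="open_platypus",             # small calibration dataset
        recipe=recipe,
        max_seq_length=512,
        num_calibration_samples=256,
    )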
Lifecycle:
- on_initialize
  - resolve mappings
  - capture kwargs needed for forward passes into modules
- on_start
  - set up activation cache hooks to capture input activations to balance layers
- on sequential epoch end
  - apply smoothing to each smoothing layer
    - consume cached activations across all batches
    - clear cached activations as they are used
    - find best smoothing scale for each smoothing layer (see the sketch below)
    - apply to model weights
    - raise error if any unused activations remain
- on_end
  - re-run logic of sequential epoch end (in case of basic pipeline)
  - set scales and zero points
  - remove activation hooks
- on_finalize
  - clear resolved mappings and captured activations
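The "find best smoothing scale" step is a grid search over how strongly activation magnitudes are weighted against weight magnitudes (duo scaling). Below is a minimal sketch of that search, assuming a symmetric group fake-quantizer and a single balance layer; it is not the library's implementation.

    import torch

    def fake_quantize(w, n_bits=4, group_size=128):
        # Symmetric per-group weight fake-quantization (illustrative; the recipe above uses asymmetric)
        out_f, in_f = w.shape                      # in_f must be divisible by group_size
        wg = w.reshape(out_f, in_f // group_size, group_size)
        scale = wg.abs().amax(dim=-1, keepdim=True) / (2 ** (n_bits - 1) - 1)
        scale = scale.clamp(min=1e-8)
        return (torch.round(wg / scale) * scale).reshape(out_f, in_f)

    def search_awq_scale(weight, calib_x, n_grid=20, duo_scaling=True):
        # weight: [out, in] of a balance layer; calib_x: [tokens, in] cached input activations
        x_mean = calib_x.abs().mean(dim=0)          # per-channel activation magnitude
        w_mean = weight.abs().mean(dim=0) + 1e-6    # per-channel weight magnitude
        y_ref = calib_x @ weight.T                  # unquantized reference output
        best_loss, best_scale = float("inf"), torch.ones_like(x_mean)
        for i in range(n_grid):
            ratio = i / n_grid
            if duo_scaling:
                s = (x_mean.pow(ratio) / w_mean.pow(1 - ratio)).clamp(min=1e-4)
            else:
                s = x_mean.pow(ratio).clamp(min=1e-4)
            s = s / (s.max() * s.min()).sqrt()      # normalize the scale's dynamic range
            y_q = (calib_x / s) @ fake_quantize(weight * s).T
            loss = (y_ref - y_q).pow(2).mean()      # reconstruction error for this candidate scale
            if loss < best_loss:
                best_loss, best_scale = loss, s
        return best_scale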
Parameters:
- sequential_targets – list of module names to compress in the same calibration pass
- mappings – list of activation layers to smooth, and which layers to scale the output of so that activations are smoothed. Each entry of the mappings list should be a list itself, in which the first entry is a list of layers that share the same input activation (the one to be smoothed) and the second entry is the layer whose output is scaled to achieve the smoothing. If regex is used, it matches layers with the largest overlap in module name. See the sketch after this list for a Python equivalent.
- ignore – list of layers to ignore, even if they match a regex in mappings. It should match the names of layers whose outputs are scaled to achieve smoothing (the second entry of the mappings list).
- offload_device – offload cached args to this device, which reduces memory requirements but requires more time to move data between the CPU and the execution device. Defaults to None, so cached args are not offloaded. Consider setting to torch.device("cpu") if you are encountering OOM errors.
- duo_scaling – whether to use duo scaling, which uses both input activations and weights to determine the scaling factor
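For reference, mappings can also be constructed in Python. The sketch below assumes an AWQMapping helper with smooth_layer and balance_layers fields mirroring the YAML recipe above; the import path and field order are assumptions and may differ between versions.

    from llmcompressor.modifiers.awq import AWQModifier
    from llmcompressor.modifiers.awq.mappings import AWQMapping  # assumed location

    mappings = [
        # smooth_layer pattern followed by its balance_layers, as in the YAML recipe above
        AWQMapping("re:.*input_layernorm", ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]),
        AWQMapping("re:.*post_attention_layernorm", ["re:.*gate_proj", "re:.*up_proj"]),
    ]

    modifier = AWQModifier(
        mappings=mappings,
        ignore=["lm_head"],
        targets=["Linear"],
        scheme="W4A16_ASYM",
    )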
Methods:
- on_end – Finish calibrating by setting scales and zero-points, removing observers and calibration hooks
- on_finalize – Clean up by clearing the activations and mapping data
- on_initialize – Initialize AWQ on the given state
- validate_model_after – Confirm only one configuration for group_size, symmetric, and num_bits, and that no activation quantization is used
on_end
Finish calibrating by setting scales and zero-points, removing observers and calibration hooks
on_finalize
Clean up by clearing the activations and mapping data
Parameters:
- state (State) – unused
Returns:
- bool – True
on_initialize
Initialize AWQ on the given state: initialize quantization, resolve mappings, and cache module kwargs
Parameters:
- state (State) – state to run AWQ on
Returns:
- bool – True on a successful run, False otherwise
validate_model_after
Confirm there is only one configuration for group_size, symmetric, and num_bits, as the AWQ algorithm depends on it. Confirm no activation quantization, as AWQ only works with WNA16.
get_lowest_common_parent
Given a list of names, returns the lowest-scope common parent.
NOTE: this function excludes parents of type ModuleList, which don't play nicely with hooks because their forward method is never directly called for MoE models. See Qwen3MoeSparseMoeBlock for an example, where experts are selected based on the router output and their forward method is called directly. https://github.com/huggingface/transformers/blob/v4.52.4/src/transformers/models/qwen3_moe/modeling_qwen3_moe.py#L233
Returns the name of the parent and a pointer to the parent module
The implementation is a small alteration of os.path.commonprefix: https://docs.python.org/3/library/os.path.html#os.path.commonprefix
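A rough sketch of that idea, assuming dotted module names and torch.nn.Module.get_submodule (not the library's exact code):

    import torch

    def lowest_common_parent(names, model):
        # component-wise common prefix of the dotted names, like os.path.commonprefix on paths
        parts = [n.split(".") for n in names]
        common = parts[0]
        for p in parts[1:]:
            i = 0
            while i < min(len(common), len(p)) and common[i] == p[i]:
                i += 1
            common = common[:i]
        if any(common == p for p in parts):
            common = common[:-1]  # a full name itself: step up to its parent
        name = ".".join(common)
        parent = model.get_submodule(name) if name else model
        # skip ModuleList parents, whose forward() is never called directly (e.g. MoE experts)
        while isinstance(parent, torch.nn.ModuleList) and name:
            name = name.rsplit(".", 1)[0] if "." in name else ""
            parent = model.get_submodule(name) if name else model
        return name, parent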