llmcompressor.modifiers.awq

Classes:

  • AWQMapping

    Dataclass storing config of activation mappings to smooth

  • AWQModifier

    Implements the AWQ (Activation-aware Weight Quantization) algorithm,

Functions:

  • get_layer_mappings_from_architecture

    Layer mappings for a given model architecture

AWQMapping dataclass

AWQMapping(smooth_layer: str, balance_layers: list[str])

Dataclass storing the config of activation mappings to smooth. The output activations of smooth_layer are the input activations into the balance_layers.

AWQMappings are resolved into ResolvedMappings, which retain pointers to the actual torch.nn.Modules and additional metadata at runtime
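
For illustration, a mapping like the one used in the recipe further below can be constructed directly; the regex targets are examples and should be adapted to the model's module names:

from llmcompressor.modifiers.awq import AWQMapping

# the attention layernorm's output feeds the q/k/v projections, so those
# projections are the balance layers for that activation
mapping = AWQMapping(
    smooth_layer="re:.*self_attn_layer_norm",
    balance_layers=["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"],
)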

AWQModifier

Bases: Modifier, QuantizationMixin

Implements the AWQ (Activation-aware Weight Quantization) algorithm, as described in https://arxiv.org/pdf/2306.00978. The algorithm significantly reduces quantization error by protecting only 1% of the most salient weight channels.

Instead of relying on raw weight values, AWQ identifies important channels by analyzing activation patterns, focusing on the channels in the weight tensor that are most responsive to the input. To reduce quantization error, it scales these channels in a way that preserves the model's original behavior, using scaling factors computed offline from activation statistics.

Because this modifier manipulates the weights of the model, it can only be used in one-shot and not during training. Activation ranges are determined by running a small set of calibration data through the model.

example recipe:

AWQModifier:
  mappings:
    - smooth_layer: "re:.*self_attn_layer_norm"
      balance_layers: ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]
    - smooth_layer: "re:.*final_layer_norm"
      balance_layers: ["re:.*fc1"]
  ignore: ["lm_head"]
  config_groups:
    group_0:
      targets:
        - "Linear"
      input_activations: null
      output_activations: null
      weights:
        num_bits: 4
        type: int
        symmetric: false
        strategy: group
        group_size: 128
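
A sketch of applying such a recipe with the oneshot entrypoint follows; the model path, dataset name, and calibration sizes are placeholders, so consult the llmcompressor examples for an invocation that matches your installed version:

from llmcompressor import oneshot

recipe = """<the AWQModifier recipe above>"""

oneshot(
    model="path/to/model",        # placeholder model path or Hugging Face id
    dataset="open_platypus",      # example calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)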

Lifecycle:

  • on_initialize
    - resolve mappings
    - capture kwargs needed for forward passes into modules
  • on_start
    - set up activation cache hooks to capture input activations to balance layers
  • on sequential epoch end
    - apply smoothing to each smoothing layer
      - consume cached activations across all batches
      - clear cached activations as they are used
      - find best smoothing scale for each smoothing layer
      - apply to model weights
      - raise error if any unused activations remain
  • on_end
    - re-run logic of sequential epoch end (in case of basic pipeline)
    - set scales and zero points
    - remove activation hooks
  • on_finalize
    - clear resolved mappings and captured activations
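
The smoothing applied at sequential epoch end can be pictured with the following sketch. This is conceptual only, not the library's internal implementation: the function names are illustrative, and the search over the scaling exponent alpha that AWQ performs to minimize quantization error is omitted.

import torch

def duo_scales(x_mean: torch.Tensor, w_mean: torch.Tensor, alpha: float) -> torch.Tensor:
    # balance per-input-channel activation magnitude against weight magnitude;
    # the real algorithm searches over alpha to minimize quantization error
    s = x_mean.pow(alpha) / (w_mean.pow(1 - alpha) + 1e-4)
    return (s / (s.max() * s.min()).sqrt()).clamp(min=1e-4)

@torch.no_grad()
def apply_smoothing(smooth_weight, balance_weights, scales):
    # dividing the smooth layer's output by `scales` while multiplying the
    # balance layers' input channels by `scales` leaves the computation
    # unchanged, but shifts outlier activation energy into weights that
    # quantize with less error
    smooth_weight.div_(scales)
    for w in balance_weights:  # each w has shape [out_features, in_features]
        w.mul_(scales.view(1, -1))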

Parameters:

  • sequential_targets

    list of module names to compress in the same calibration pass

  • mappings

    list of activation mappings to smooth: which layer's output activation to smooth, and which layers' weights to scale so that the activation is smoothed. Each entry of the mappings list specifies a smooth_layer, the layer whose output is scaled to achieve the smoothing, and balance_layers, a list of layers that share that output as their input activation (the one to be smoothed). If regex is used, it matches layers with the largest overlap in module name.

  • ignore

    list of layers to ignore, even if they match a regex in mappings. Entries should match the names of layers whose outputs are scaled to achieve smoothing (the smooth_layer entries of the mappings list).

  • offload_device

    offload cached args to this device, which reduces memory requirements but requires more time to move data between the CPU and the execution device. Defaults to None, so cached args are not offloaded. Consider setting this to torch.device("cpu") if you are encountering OOM errors.

  • duo_scaling

    whether to use duo scaling, which uses both input activations and weights to determine the scaling factor
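
These parameters can also be passed when constructing the modifier in Python instead of YAML. A minimal sketch, assuming Llama-style module names and the W4A16_ASYM scheme shorthand used in common examples:

import torch
from llmcompressor.modifiers.awq import AWQMapping, AWQModifier

modifier = AWQModifier(
    mappings=[
        AWQMapping("re:.*input_layernorm", ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]),
        AWQMapping("re:.*post_attention_layernorm", ["re:.*gate_proj", "re:.*up_proj"]),
    ],
    ignore=["lm_head"],
    scheme="W4A16_ASYM",                 # 4-bit asymmetric weights, 16-bit activations
    targets=["Linear"],
    duo_scaling=True,
    offload_device=torch.device("cpu"),  # offload cached activations if memory is tight
)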

Methods:

  • on_end

    Finish calibrating by setting scales and zero-points,

  • on_finalize

    Clean up by clearing the activations and mapping data

  • on_initialize

    Initialize AWQ on the given state

  • validate_model_after

    Confirm only one configuration for group_size, symmetric, and num_bits,

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibrating by setting scales and zero-points, removing observers and calibration hooks

Source code in llmcompressor/modifiers/awq/base.py
def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by setting scales and zero-points,
     removing observers and calibration hooks
    """
    self._assert_all_activations_consumed()

    self.ended_ = True

    modules = list(state.model.modules())
    for module in tqdm(modules, desc="Calibrating weights"):
        update_weight_zp_scale(module)

    QuantizationMixin.end_calibration(self, state.model)

    # remove activation hooks
    self.remove_hooks()

on_finalize

on_finalize(state: State, **kwargs) -> bool

Clean up by clearing the activations and mapping data

Parameters:

  • state

    (State) –

    unused

Returns:

  • bool

    True

Source code in llmcompressor/modifiers/awq/base.py
def on_finalize(self, state: State, **kwargs) -> bool:
    """
    Clean up by clearing the activations and mapping data

    :param state: unused
    :return: True
    """
    if not self.ended_:
        self.on_end(state, None)

    self._parent_args_cache.clear()
    self._smooth_activation_means.clear()
    self._resolved_mappings.clear()

    return True

on_initialize

on_initialize(state: State, **kwargs) -> bool

Initialize AWQ on the given state: initialize quantization, resolve mappings, and cache module kwargs.

Parameters:

  • state

    (State) –

    state to run AWQ on

Returns:

  • bool

    True on a successful run, False otherwise

Source code in llmcompressor/modifiers/awq/base.py
def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Initialize AWQ on the given state
    Initialize quantization, resolve mappings, cache module kwargs

    :param state: state to run AWQ on
    :return: True on a successful run, False otherwise
    """

    # apply config to model and prepare calibration hooks
    if QuantizationMixin.has_config(self):
        QuantizationMixin.initialize_quantization(self, state.model)

    if self.mappings is None:
        logger.info("No AWQModifier.mappings provided, inferring from model...")
        self.mappings = get_layer_mappings_from_architecture(
            architecture=state.model.__class__.__name__
        )

    self._set_resolved_mappings(state.model)

    return True

validate_model_after

validate_model_after(model: AWQModifier) -> AWQModifier

Confirm only one configuration for group_size, symmetric, and num_bits, as the AWQ algorithm depends on it. Confirm no activation quantization, as AWQ only works with WNA16 (N-bit weights, 16-bit activations).
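
As a sketch of what this validator rejects, a recipe whose config groups disagree on num_bits (the group names and targets here are illustrative) would fail the first assertion:

config_groups:
  group_0:
    targets: ["Linear"]
    weights: {num_bits: 4, type: int, symmetric: false, strategy: group, group_size: 128}
  group_1:
    targets: ["re:.*down_proj"]
    weights: {num_bits: 8, type: int, symmetric: false, strategy: group, group_size: 128}
# raises: "In AWQ, all config groups must use the same configuration for num_bits"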

Source code in llmcompressor/modifiers/awq/base.py
@model_validator(mode="after")
def validate_model_after(model: "AWQModifier") -> "AWQModifier":
    """
    Confirm only one configuration for group_size, symmetric, and num_bits,
    as AWQ algorithm depends on it
    Confirm no activation quantization, as AWQ only works with WNA16
    """
    config = model.resolve_quantization_config()

    num_bits_set = set(
        group.weights.num_bits
        for group in config.config_groups.values()
        if group.weights is not None
    )
    assert (
        len(num_bits_set) == 1
    ), "In AWQ, all config groups must use the same configuration for num_bits"

    model._num_bits = next(iter(num_bits_set))

    symmetric_set = set(
        group.weights.symmetric
        for group in config.config_groups.values()
        if group.weights is not None
    )
    assert (
        len(symmetric_set) == 1
    ), "In AWQ, all config groups must use the same configuration for symmetric"

    model._symmetric = next(iter(symmetric_set))

    group_size_set = set(
        group.weights.group_size
        for group in config.config_groups.values()
        if group.weights is not None
    )
    assert (
        len(group_size_set) == 1
    ), "In AWQ, all config groups must use the same configuration for group_size"

    model._group_size = next(iter(group_size_set))

    in_num_bits_set = set(
        group.input_activations.num_bits
        for group in config.config_groups.values()
        if group.input_activations is not None
    )
    assert len(in_num_bits_set) == 0 or in_num_bits_set == {16}, (
        "AWQ activations must be 16-bit precision, "
        f"input activations {in_num_bits_set} not allowed"
    )

    out_num_bits_set = set(
        group.output_activations.num_bits
        for group in config.config_groups.values()
        if group.output_activations is not None
    )
    assert len(out_num_bits_set) == 0 or out_num_bits_set == {16}, (
        "AWQ activations must be 16-bit precision, "
        f"output activations {out_num_bits_set} not allowed"
    )

    return model

get_layer_mappings_from_architecture

get_layer_mappings_from_architecture(
    architecture: str,
) -> List[AWQMapping]

Parameters:

  • architecture

    (str) –

    The architecture of the model

Returns:

  • List[AWQMapping]

    The layer mappings for the given architecture

Source code in llmcompressor/modifiers/awq/mappings.py
def get_layer_mappings_from_architecture(architecture: str) -> List[AWQMapping]:
    """
    :param architecture: str: The architecture of the model
    :return: list: The layer mappings for the given architecture
    """

    if architecture not in AWQ_MAPPING_REGISTRY:
        logger.info(
            f"Architecture {architecture} not found in mappings. "
            f"Using default mappings: {_default_mappings}"
        )

    return AWQ_MAPPING_REGISTRY.get(architecture, _default_mappings)
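
A brief usage sketch; whether a given architecture name (e.g. "LlamaForCausalLM") is registered depends on the installed version, and unknown names fall back to the default mappings as shown above:

from llmcompressor.modifiers.awq.mappings import get_layer_mappings_from_architecture

mappings = get_layer_mappings_from_architecture("LlamaForCausalLM")
for m in mappings:
    print(m.smooth_layer, "->", m.balance_layers)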