llmcompressor.modifiers.quantization

Classes:

GPTQModifier

Bases: Modifier, QuantizationMixin

Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323. This modifier uses activations to calibrate a hessian matrix, which is then used to determine optimal quantization values and orderings for the model weights.

Sample yaml:

    test_stage:
      obcq_modifiers:
        GPTQModifier:
          block_size: 128
          dampening_frac: 0.001
          offload_hessians: False
          actorder: static
          config_groups:
            group_0:
              targets:
                - "Linear"
              input_activations: null
              output_activations: null
              weights:
                num_bits: 8
                type: "int"
                symmetric: true
                strategy: group
                group_size: 128

Lifecycle:

  - on_initialize
    - apply config to model
  - on_start
    - add activation calibration hooks
    - add gptq weight calibration hooks
  - on_sequential_epoch_end
    - quantize_weight
  - on_finalize
    - remove_hooks()
    - model.apply(freeze_module_quantization)
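
For orientation, the snippet below shows how this modifier is typically wired into a one-shot compression run. It is a minimal sketch: the model path and dataset name are placeholders, the argument names follow the common llmcompressor examples, and it assumes the top-level oneshot entrypoint (exposed under llmcompressor.transformers in older releases).

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Minimal sketch; the model path and dataset name below are placeholders.
recipe = GPTQModifier(
    targets=["Linear"],   # quantize all Linear layers...
    ignore=["lm_head"],   # ...except the output head
    scheme="W4A16",       # preset scheme: 4-bit weights, 16-bit activations
    block_size=128,       # columns compressed per GPTQ pass
)

oneshot(
    model="path/to/model",       # placeholder model id or path
    dataset="open_platypus",     # placeholder calibration dataset name
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```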

Parameters:

  • sequential_targets

    list of layer names to compress during GPTQ, or 'ALL' to compress every layer in the model

  • block_size

    Used to determine number of columns to compress in one pass

  • dampening_frac

    Amount of dampening to apply to H, as a fraction of the diagonal norm

  • actorder

    order in which weight columns are quantized. For more information on actorder options, see https://github.com/vllm-project/vllm/pull/8135

  • offload_hessians

    Set to True for decreased memory usage but increased runtime.

  • config_groups

    dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.

  • targets

    list of layer names to quantize if a scheme is provided. Defaults to Linear layers

  • ignore

    optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list.

  • scheme

    a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit weights and activations.

  • kv_cache_scheme

    optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When applying kv cache quantization to a transformer AutoModelForCausalLM, the kv_cache_scheme gets converted into a QuantizationScheme that: - targets the k_proj and v_proj modules of the model, whose outputs are the keys and values that might be cached - quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names. If this is not the case and kv_cache_scheme != None, the quantization of the kv cache will fail.

Methods:

  • calibrate_module

    Calibration hook used to accumulate the hessian of the input to the module

  • compress_modules

    Quantize modules which have been calibrated

  • on_end

    Finish calibrating by removing observers and calibration hooks

  • on_finalize

    disable the quantization observers used by the OBCQ algorithm

  • on_initialize

    Initialize and run the GPTQ algorithm on the current state

calibrate_module

calibrate_module(
    module: Module,
    args: Tuple[Tensor, ...],
    _output: Tensor,
)

Calibration hook used to accumulate the hessian of the input to the module

Parameters:

  • module

    (Module) –

    module being calibrated

  • args

    (Tuple[Tensor, ...]) –

    inputs to the module, the first element of which is the canonical input

  • _output

    (Tensor) –

    uncompressed module output, unused

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def calibrate_module(
    self,
    module: torch.nn.Module,
    args: Tuple[torch.Tensor, ...],
    _output: torch.Tensor,
):
    """
    Calibration hook used to accumulate the hessian of the input to the module

    :param module: module being calibrated
    :param args: inputs to the module, the first element of which is the
        canonical input
    :param _output: uncompressed module output, unused
    """
    # Assume that first argument is the input
    inp = args[0]

    # Initialize hessian if not present
    if module not in self._num_samples:
        init_device = (
            "cpu" if self.offload_hessians else get_execution_device(module)
        )
        self._hessians[module] = make_empty_hessian(module, device=init_device)
        self._num_samples[module] = 0

    # Accumulate hessian with input with optional offloading
    with self._maybe_onload_hessian(module):
        self._hessians[module], self._num_samples[module] = accumulate_hessian(
            inp,
            module,
            self._hessians[module],
            self._num_samples[module],
        )
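
For intuition about what this hook accumulates: GPTQ uses a per-module Hessian estimate H ≈ 2·E[x xᵀ] over the calibration inputs. The sketch below shows one way such a running average can be maintained; it is illustrative only and assumes this is what make_empty_hessian / accumulate_hessian compute internally (following the original GPTQ reference implementation), not a copy of the library code.

```python
import torch

def accumulate_hessian_sketch(
    inp: torch.Tensor,   # [..., in_features] inputs seen by the module
    H: torch.Tensor,     # [in_features, in_features] running Hessian estimate
    num_samples: int,    # samples accumulated so far
) -> tuple[torch.Tensor, int]:
    # flatten leading dims so every row is one sample
    x = inp.reshape(-1, inp.shape[-1]).to(H.dtype)
    new_samples = x.shape[0]
    total = num_samples + new_samples

    # rescale the old estimate, then add the new contribution (2 / total) * X^T X
    H *= num_samples / total
    x = x * (2.0 / total) ** 0.5
    H += x.t() @ x
    return H, total
```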

compress_modules

compress_modules()

Quantize modules which have been calibrated

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def compress_modules(self):
    """
    Quantize modules which have been calibrated
    """
    for module in list(self._num_samples.keys()):
        name = self._module_names[module]
        num_samples = self._num_samples[module]
        quant_args = getattr_chain(module, "quantization_scheme.weights")

        logger.info(f"Quantizing {name} using {num_samples} samples")
        with torch.no_grad(), align_module_device(
            module
        ), self._maybe_onload_hessian(module), CompressionLogger(
            module
        ) as comp_logger:
            loss, quantized_weight, scale, zero_point, g_idx = quantize_weight(
                module=module,
                quant_args=quant_args,
                hessians_dict=self._hessians,
                blocksize=self.block_size,
                percdamp=self.dampening_frac,
            )
            comp_logger.set_loss(loss)

        update_offload_parameter(module, "weight", quantized_weight)
        update_offload_parameter(module, "weight_scale", scale)
        update_offload_parameter(module, "weight_zero_point", zero_point)
        if g_idx is not None:
            update_offload_parameter(module, "weight_g_idx", g_idx)

        # self._hessians[module] already deleted by quantize_weight
        del self._num_samples[module]

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibrating by removing observers and calibration hooks

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(self, state.model)
    self.remove_hooks()  # remove gptq hooks

on_finalize

on_finalize(state: State, **kwargs) -> bool

disable the quantization observers used by the OBCQ algorithm

Parameters:

  • state

    (State) –

    session state storing input model and calibration data

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def on_finalize(self, state: State, **kwargs) -> bool:
    """
    disable the quantization observers used by the OBCQ algorithm

    :param state: session state storing input model and calibration data
    """
    if not self.ended_:
        self.on_end(state, None)

    if len(self._num_samples) > 0:
        raise ValueError(f"Failed to compress {len(self._num_samples)} modules")

    self._hessians = dict()
    self._num_samples = dict()

    return True

on_initialize

on_initialize(state: State, **kwargs) -> bool

Initialize and run the GPTQ algorithm on the current state

Parameters:

  • state

    (State) –

    session state storing input model and calibration data

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Initialize and run the GPTQ algorithm on the current state

    :param state: session state storing input model and calibration data
    """
    # apply config to model and prepare calibration hooks
    if QuantizationMixin.has_config(self):
        QuantizationMixin.initialize_quantization(self, state.model)

    # prepare module names
    self._module_names = {m: name for name, m in state.model.named_modules()}

    return True

Observer

Observer(quantization_args: QuantizationArgs)

Bases: InternalModule, RegistryMixin

Base Observer class to be subclassed for specific implementation. Subclasses should override calculate_qparams to return a scale, zero_point pair

Methods:

  • calculate_gparam

    :param observed: observed tensor to calculate quantization parameters for

  • calculate_qparams

    :param observed: observed tensor to calculate quantization parameters for

  • forward

    maps directly to get_qparams

  • get_gparam

    Function to derive a global scale parameter

  • get_qparams

    Convenience function to wrap overwritten calculate_qparams

  • post_calculate_qparams

    Run any logic specific to its observers after running calculate_qparams

  • record_observed_tokens

    Counts the number of tokens observed during the forward passes

  • reset

    Reset the state of the observer

Source code in llmcompressor/observers/base.py
def __init__(
    self,
    quantization_args: QuantizationArgs,
):
    self.quantization_args: QuantizationArgs = quantization_args
    super().__init__()
    self._scale = None
    self._zero_point = None
    self._num_observed_tokens = None
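
As a concrete illustration of the subclassing contract, the toy observer below implements calculate_qparams with a symmetric min-max rule. It is a hypothetical sketch (the built-in observers in llmcompressor.observers additionally handle moving averages, fp4/fp8 zero-point dtypes, and registry lookup); the key points are the signature and the (scale, zero_point) return shaped (1,) along any reduced dimensions.

```python
from typing import Any, Optional, Tuple

import torch
from torch import FloatTensor, IntTensor, Tensor

from llmcompressor.observers.base import Observer

class ToyMinMaxObserver(Observer):
    """Hypothetical observer: symmetric int8 qparams from min/max statistics."""

    def calculate_qparams(
        self,
        observed: Tensor,
        reduce_dims: Optional[Tuple[int]] = None,
        tensor_id: Optional[Any] = None,
        global_scale: Optional[Tensor] = None,
    ) -> Tuple[FloatTensor, IntTensor]:
        if reduce_dims is None:
            min_val, max_val = observed.min(), observed.max()
        else:
            # keepdim=True so scale/zero_point are shaped (1,) along reduced dims
            min_val = torch.amin(observed, dim=reduce_dims, keepdim=True)
            max_val = torch.amax(observed, dim=reduce_dims, keepdim=True)

        q_max = 127  # symmetric int8 for illustration
        scale = torch.maximum(min_val.abs(), max_val.abs()) / q_max
        scale = torch.clamp(scale, min=torch.finfo(observed.dtype).eps)
        zero_point = torch.zeros_like(scale, dtype=torch.int8)
        return scale, zero_point
```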

calculate_gparam

calculate_gparam(observed: Tensor) -> torch.Tensor

Parameters:

  • observed

    (Tensor) –

    observed tensor to calculate quantization parameters for

Returns:

  • Tensor

    global scale derived from the observed tensor

Source code in llmcompressor/observers/base.py
def calculate_gparam(
    self,
    observed: Tensor,
) -> torch.Tensor:
    """
    :param observed: observed tensor to calculate quantization parameters for
    :return: global scale derived from the observed tensor
    """
    raise NotImplementedError(f"{self.__class__} must implement calculate_gparam")

calculate_qparams

calculate_qparams(
    observed: Tensor,
    reduce_dims: Optional[Tuple[int]] = None,
    tensor_id: Optional[Any] = None,
    global_scale: Optional[Tensor] = None,
) -> Tuple[FloatTensor, IntTensor]

Parameters:

  • observed

    (Tensor) –

    observed tensor to calculate quantization parameters for

  • reduce_dims

    (Optional[Tuple[int]], default: None ) –

    optional tuple of dimensions to reduce along, returned scale and zero point will be shaped (1,) along the reduced dimensions

  • tensor_id

    (Optional[Any], default: None ) –

    optional id for tracking separate statistics when different ranges of observed tensors are passed, useful for sharding tensors by group_size or block quantization

  • global_scale

    (Optional[Tensor], default: None ) –

    optional scale to further scale local quantization scales

Returns:

  • Tuple[FloatTensor, IntTensor]

    tuple of scale and zero point derived from the observed tensor

Source code in llmcompressor/observers/base.py
def calculate_qparams(
    self,
    observed: Tensor,
    reduce_dims: Optional[Tuple[int]] = None,
    tensor_id: Optional[Any] = None,
    global_scale: Optional[Tensor] = None,
) -> Tuple[FloatTensor, IntTensor]:
    """
    :param observed: observed tensor to calculate quantization parameters for
    :param reduce_dims: optional tuple of dimensions to reduce along,
        returned scale and zero point will be shaped (1,) along the
        reduced dimensions
    :param tensor_id: optional id for tracking separate statistics when different
        ranges of observed tensors are passed, useful for sharding tensors by
        group_size or block quantization
    :param global_scale: optional scale to further scale local quantization scales
    :return: tuple of scale and zero point derived from the observed tensor
    """
    raise NotImplementedError(f"{self.__class__} must implement calculate_qparams")

forward

forward(
    observed: Tensor,
    g_idx: Optional[Tensor] = None,
    global_scale: Optional[Tensor] = None,
    should_calculate_gparam: bool = False,
) -> Tuple[FloatTensor, IntTensor]

maps directly to get_qparams

Parameters:

  • observed

    (Tensor) –

    optional observed tensor from which to calculate quantization parameters

  • g_idx

    (Optional[Tensor], default: None ) –

    optional mapping from column index to group index

  • global_scale

    (Optional[Tensor], default: None ) –

    optional scale to further scale local quantization scales

Returns:

  • Tuple[FloatTensor, IntTensor]

    tuple of scale and zero point based on last observed value

Source code in llmcompressor/observers/base.py
@torch.no_grad()
def forward(
    self,
    observed: Tensor,
    g_idx: Optional[Tensor] = None,
    global_scale: Optional[Tensor] = None,
    should_calculate_gparam: bool = False,
) -> Tuple[FloatTensor, IntTensor]:
    """
    maps directly to get_qparams
    :param observed: optional observed tensor from which to calculate
        quantization parameters
    :param g_idx: optional mapping from column index to group index
    :param global_scale: optional scale to further scale local quantization scales
    :return: tuple of scale and zero point based on last observed value
    """
    self.record_observed_tokens(observed)
    if should_calculate_gparam:
        return self.get_gparam(observed=observed)
    return self.get_qparams(
        observed=observed,
        g_idx=g_idx,
        global_scale=global_scale,
    )

get_gparam

get_gparam(observed: Tensor)

Function to derive a global scale parameter

Parameters:

  • observed

    (Tensor) –

    observed tensor to calculate global parameters from

Returns:

  • derived global scale

Source code in llmcompressor/observers/base.py
def get_gparam(self, observed: Tensor):
    """
    Function to derive a global scale parameter
    :param observed: observed tensor to calculate global parameters
        from
    :return: derived global scale
    """
    if self.quantization_args.strategy == QuantizationStrategy.TENSOR_GROUP:
        return self.calculate_gparam(observed)
    raise NotImplementedError(
        "global parameter generation is only supported for TENSOR_GROUP"
    )

get_qparams

get_qparams(
    observed: Optional[Tensor] = None,
    g_idx: Optional[Tensor] = None,
    global_scale: Optional[Tensor] = None,
) -> Tuple[FloatTensor, IntTensor]

Convenience function that wraps the overridden calculate_qparams; adds support for making the observed tensor optional and for tracking the latest calculated scale and zero point

Parameters:

  • observed

    (Optional[Tensor], default: None ) –

    optional observed tensor to calculate quantization parameters from

  • g_idx

    (Optional[Tensor], default: None ) –

    optional mapping from column index to group index

  • global_scale

    (Optional[Tensor], default: None ) –

    optional scale to further scale local quantization scales

Returns:

  • Tuple[FloatTensor, IntTensor]

    tuple of scale and zero point based on last observed value

Source code in llmcompressor/observers/base.py
def get_qparams(
    self,
    observed: Optional[Tensor] = None,
    g_idx: Optional[Tensor] = None,
    global_scale: Optional[Tensor] = None,
) -> Tuple[FloatTensor, IntTensor]:
    """
    Convenience function to wrap overwritten calculate_qparams
    adds support to make observed tensor optional and support for tracking latest
    calculated scale and zero point

    :param observed: optional observed tensor to calculate quantization parameters
        from
    :param g_idx: optional mapping from column index to group index
    :param global_scale: optional scale to further scale local quantization scales
    :return: tuple of scale and zero point based on last observed value
    """
    if observed is not None:
        group_size = self.quantization_args.group_size

        if self.quantization_args.strategy == QuantizationStrategy.TENSOR:
            # re-calculate scale and zero point, update the stored value
            self._scale, self._zero_point = self.calculate_qparams(observed)

        elif self.quantization_args.strategy in (
            QuantizationStrategy.TENSOR_GROUP,
            QuantizationStrategy.GROUP,
        ):
            rows = observed.shape[0]
            columns = observed.shape[1]
            num_groups = int(ceil(columns / group_size))
            if num_groups * group_size != columns:
                logger.bind(log_once=True).warning(
                    "Attempting to quantize a module weight whose columns "
                    f"({columns}) are not divisible by group_size ({group_size}). "
                    "This scheme is not supported by vLLM, please consider "
                    "adjusting the group_size for modules with this number of "
                    "columns",
                )

            self._scale = torch.empty(
                (rows, num_groups), dtype=observed.dtype, device=observed.device
            )
            if is_fp4(quantization_args=self.quantization_args):
                zp_dtype = FP8_E4M3_DATA.dtype
            else:
                zp_dtype = self.quantization_args.pytorch_dtype()

            self._zero_point = torch.empty(
                (rows, num_groups), dtype=zp_dtype, device=observed.device
            )

            # support column-order (default) quantization as well as other orderings
            # such as activation ordering. Below checks if g_idx has initialized
            is_column_order = g_idx is None or -1 in g_idx
            if is_column_order:
                group_sizes = torch.full((num_groups,), group_size, dtype=torch.int)
            else:
                group_indices, group_sizes = torch.unique(g_idx, return_counts=True)
                group_sizes = group_sizes[torch.argsort(group_indices)]

                perm = torch.argsort(g_idx)
                observed = safe_permute(observed, perm, dim=1)

            # TODO: experiment with vectorizing for loop for performance
            end = 0
            for group_index, group_count in enumerate(group_sizes):
                start = end
                end = start + group_count
                scale, zero_point = self.get_qparams_along_dim(
                    observed[:, start:end],
                    0,
                    tensor_id=group_index,
                    global_scale=global_scale,
                )

                self._scale[:, group_index] = scale.squeeze(1)
                self._zero_point[:, group_index] = zero_point.squeeze(1)

        elif self.quantization_args.strategy == QuantizationStrategy.CHANNEL:
            # assume observed is transposed, because its the output, hence use dim 0
            self._scale, self._zero_point = self.get_qparams_along_dim(observed, 0)

        elif self.quantization_args.strategy == QuantizationStrategy.TOKEN:
            # use dim 1, assume the observed.shape = [batch, token, hidden]
            # should be batch, token
            self._scale, self._zero_point = self.get_qparams_along_dim(
                observed,
                dim={0, 1},
            )

        elif self.quantization_args.strategy == QuantizationStrategy.BLOCK:
            # Block-wise quantization: one scale/zero_point per block of shape
            # [block_rows, block_cols]
            rows, cols = observed.shape[:2]
            bs = self.quantization_args.block_structure
            if not (
                isinstance(bs, (list, tuple))
                and len(bs) == 2
                and all(isinstance(x, int) for x in bs)
            ):
                raise ValueError(
                    f"Invalid block_structure '{bs}'. "
                    f"Must be a list of two ints [rows, cols]."
                )
            block_rows, block_cols = bs
            num_br = int(ceil(rows / block_rows))
            num_bc = int(ceil(cols / block_cols))

            # allocate per-block scale and zero_point
            self._scale = torch.empty(
                (num_br, num_bc), dtype=observed.dtype, device=observed.device
            )

            # Use same dtype logic as GROUP strategy for zero_point
            if is_fp4(quantization_args=self.quantization_args):
                zp_dtype = FP8_E4M3_DATA.dtype
            else:
                zp_dtype = self.quantization_args.pytorch_dtype()

            self._zero_point = torch.empty(
                (num_br, num_bc), dtype=zp_dtype, device=observed.device
            )

            # compute qparams for each block
            for i in range(num_br):
                r0 = i * block_rows
                r1 = min((i + 1) * block_rows, rows)
                for j in range(num_bc):
                    c0 = j * block_cols
                    c1 = min((j + 1) * block_cols, cols)
                    # reduce across both dims to get one scale and zp per block
                    # Use unique tensor_id for each block to maintain separate stats
                    block_tensor_id = f"block_{i}_{j}"
                    scale_bp, zp_bp = self.calculate_qparams(
                        observed[r0:r1, c0:c1],
                        reduce_dims=(0, 1),
                        tensor_id=block_tensor_id,
                    )
                    self._scale[i, j] = scale_bp
                    self._zero_point[i, j] = zp_bp

    return self._scale, self._zero_point
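
To make the GROUP branch concrete: for a weight of shape (rows, columns) and group size g, the stored scale and zero_point have shape (rows, ceil(columns / g)), one column per group. A quick shape check, using an illustrative symmetric scale rule rather than the observer's actual statistics:

```python
from math import ceil

import torch

rows, columns, group_size = 4, 8, 4
observed = torch.randn(rows, columns)
num_groups = ceil(columns / group_size)          # 2 groups of 4 columns each

scale = torch.empty(rows, num_groups)
for g in range(num_groups):
    block = observed[:, g * group_size:(g + 1) * group_size]
    scale[:, g] = block.abs().amax(dim=1) / 127  # one scale per row per group

print(scale.shape)  # torch.Size([4, 2])
```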

post_calculate_qparams

post_calculate_qparams() -> None

Run any logic specific to its observers after running calculate_qparams

Source code in llmcompressor/observers/base.py
def post_calculate_qparams(self) -> None:
    """
    Run any logic specific to its observers after running calculate_qparams
    """

record_observed_tokens

record_observed_tokens(batch_tensor: Tensor)

Counts the number of tokens observed during the forward passes. The count is aggregated in the _num_observed_tokens attribute of the class.

Note: The batch_tensor is expected to have two dimensions (batch_size * sequence_length, num_features). This is the general shape expected by the forward pass of the expert layers in a MOE model. If the input tensor does not have two dimensions, the _num_observed_tokens attribute will be set to None.

Source code in llmcompressor/observers/base.py
def record_observed_tokens(self, batch_tensor: Tensor):
    """
    Counts the number of tokens observed during the
    forward passes. The count is aggregated in the
    _num_observed_tokens attribute of the class.

    Note: The batch_tensor is expected to have two dimensions
        (batch_size * sequence_length, num_features). This is the
        general shape expected by the forward pass of the expert
        layers in a MOE model. If the input tensor does not have
        two dimensions, the _num_observed_tokens attribute will be set
        to None.
    """
    if not isinstance(batch_tensor, Tensor):
        raise ValueError(f"Expected value to be a tensor, got {type(batch_tensor)}")

    if batch_tensor.ndim != 2:
        logger.debug(
            "The input tensor is expected to have two dimensions "
            "(batch_size * sequence_length, num_features). "
            f"The input tensor has {batch_tensor.ndim} dimensions."
        )
        return

    if self._num_observed_tokens is None:
        # initialize the count
        self._num_observed_tokens = 0

    # batch_tensor (batch_size * sequence_length, num_features)
    # observed_tokens (batch_size * sequence_length)
    observed_tokens, _ = batch_tensor.shape
    self._num_observed_tokens += observed_tokens

reset

reset()

Reset the state of the observer

Source code in llmcompressor/observers/base.py
def reset(self):
    """
    Reset the state of the observer
    """
    self._num_observed_tokens = None
    self._scale = None
    self._zero_point = None

QuantizationMixin

Bases: HooksMixin

Mixin which enables a Modifier to act as a quantization config, attaching observers, calibration hooks, and compression wrappers to modifiers

Lifecycle:

  - on_initialize: QuantizationMixin.initialize_quantization
    - Attach schemes to modules
    - Attach observers to modules
    - Disable quantization until calibration starts/finishes
  - on_start: QuantizationMixin.start_calibration
    - Attach calibration hooks
    - Apply calibration status
    - Enable quantization during calibration
  - on_end: QuantizationMixin.end_calibration
    - Remove calibration hooks
    - Apply freeze status
    - Keep quantization enabled for future steps

NOTE: QuantizationMixin does not update scales and zero-points on its own, as this is not desired for all Modifiers inheriting from it. The Modifier must explicitly call update_weight_zp_scale. See the QuantizationModifier.on_start method for an example.

Parameters:

  • config_groups

    dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.

  • targets

    list of layer names to quantize if a scheme is provided. Defaults to Linear layers

  • ignore

    optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list.

  • scheme

    a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit weights and activations.

  • kv_cache_scheme

    optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When applying kv cache quantization to a transformer AutoModelForCausalLM, the kv_cache_scheme gets converted into a QuantizationScheme that: - targets the k_proj and v_proj modules of the model, whose outputs are the keys and values that might be cached - quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names. If this is not the case and kv_cache_scheme != None, the quantization of the kv cache will fail.

Methods:

  • end_calibration

    Remove calibration hooks and set the model status to frozen. Keep quantization enabled for future operations

  • has_config

    Determine if the user has specified a quantization config on this modifier

  • initialize_quantization

    Attach quantization schemes and observers to modules in the model according to the quantization config specified on this modifier

  • resolve_quantization_config

    Returns the quantization config specified by this modifier

  • start_calibration

    Register activation calibration hooks (including kv_cache quantization) and enable quantization as we calibrate

end_calibration

end_calibration(model: Module)

Remove calibration hooks and set the model status to frozen. Keep quantization enabled for future operations

Parameters:

  • model

    (Module) –

    model to end calibration for

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def end_calibration(self, model: torch.nn.Module):
    """
    Remove calibration hooks and set the model status to frozen. Keep quantization
    enabled for future operations

    :param model: model to end calibration for
    """
    self.remove_hooks(self._calibration_hooks)
    model.apply(freeze_module_quantization)  # remove observers
    model.apply(enable_quantization)  # keep quantization enabled

has_config

has_config() -> bool

Determine if the user has specified a quantization config on this modifier

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def has_config(self) -> bool:
    """
    Determine if the user has specified a quantization config on this modifier
    """
    return not (
        self.config_groups is None
        and self.targets == ["Linear"]
        and self.ignore == []
        and self.scheme is None
        and self.kv_cache_scheme is None
    )

initialize_quantization

initialize_quantization(model: Module)

Attach quantization schemes and observers to modules in the model according to the quantization config specified on this modifier

Parameters:

  • model

    (Module) –

    model to attach schemes and observers to

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def initialize_quantization(self, model: torch.nn.Module):
    """
    Attach quantization schemes and observers to modules in the model according to
    the quantization config specified on this modifier

    :param model: model to attach schemes and observers to
    """
    reset_quantization_status(model)  # reset any previously applied qconfigs

    # apply scheme and status to model
    config = self.resolve_quantization_config()
    apply_quantization_config(model, config)

    # apply observers, disable quantization until calibration
    model.apply(self._initialize_observers)
    model.apply(disable_quantization)

resolve_quantization_config

resolve_quantization_config() -> QuantizationConfig

Returns the quantization config specified by this modifier

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def resolve_quantization_config(self) -> QuantizationConfig:
    """
    Returns the quantization config specified by this modifier
    """
    scheme = self.scheme
    targets = self.targets
    config_groups = self.config_groups
    kv_cache_scheme = self.kv_cache_scheme
    ignore = self.ignore

    if scheme is not None and config_groups is not None:
        raise ValueError("Please specify either `scheme` or `config_groups`")

    if scheme is not None:
        # takes precedence over config_groups

        if isinstance(scheme, str) and is_preset_scheme(scheme):
            # attach targets to scheme
            scheme = {scheme: targets}

        config_groups = {}
        for idx, key in enumerate(scheme.keys()):
            if is_preset_scheme(key):
                scheme = preset_name_to_scheme(key, scheme[key])
            else:
                scheme = QuantizationScheme.model_validate(
                    {"targets": scheme[key], **scheme}
                )

            group_name = f"group_{idx}"
            config_groups[group_name] = scheme

    if config_groups is None or len(config_groups) == 0:
        default_quant_scheme = QuantizationScheme(targets=targets)
        config_groups = {"group_0": default_quant_scheme}

    return QuantizationConfig(
        config_groups=config_groups,
        kv_cache_scheme=kv_cache_scheme,
        quantization_status=QuantizationStatus.INITIALIZED,
        ignore=ignore,
    )
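
A small usage sketch of the precedence rules above: a preset scheme string is expanded into a single config group, explicit config_groups are passed through, and specifying both raises a ValueError. The returned config is a compressed-tensors QuantizationConfig, so treat the exact group contents as version-dependent.

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

modifier = QuantizationModifier(targets=["Linear"], scheme="W8A8", ignore=["lm_head"])
config = modifier.resolve_quantization_config()

print(list(config.config_groups))  # e.g. ["group_0"], built from the preset scheme
print(config.ignore)               # ["lm_head"]
```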

start_calibration

start_calibration(model: Module)

Register activation calibration hooks (including kv_cache quantization) and enable quantization as we calibrate

Parameters:

  • model

    (Module) –

    model to prepare for calibration

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def start_calibration(self, model: torch.nn.Module):
    """
    Register activation calibration hooks (including kv_cache quantization) and
    enable quantization as we calibrate

    :param model: model to prepare for calibration
    """
    self._calibration_hooks = self._initialize_hooks(model)
    model.apply(apply_calibration_status)
    model.apply(enable_quantization)  # quantize at the same time as calibrate

QuantizationModifier

Bases: Modifier, QuantizationMixin

Enables post training quantization (PTQ) and quantization aware training (QAT) for a given module or its submodules. After calibration (PTQ) or the start epoch (QAT), the forward pass of the specified module(s) will emulate quantized execution, and the modifier remains enabled until training is completed.
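
As with GPTQModifier above, a typical invocation wires this modifier into a one-shot run. The sketch below assumes the top-level oneshot entrypoint and uses a preset scheme name from compressed-tensors; the model path is a placeholder.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Minimal sketch; "FP8_DYNAMIC" is a preset scheme name, and no calibration
# dataset is passed here since dynamic activation scales need no calibration.
recipe = QuantizationModifier(
    targets=["Linear"],
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model="path/to/model", recipe=recipe)
```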

Parameters:

  • config_groups

    dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.

  • targets

    list of layer names to quantize if a scheme is provided. Defaults to Linear layers

  • ignore

    optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list.

  • scheme

    a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit weights and activations.

  • kv_cache_scheme

    optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When applying kv cache quantization to a transformer AutoModelForCausalLM, the kv_cache_scheme gets converted into a QuantizationScheme that: - targets the k_proj and v_proj modules of the model, whose outputs are the keys and values that might be cached - quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names. If this is not the case and kv_cache_scheme != None, the quantization of the kv cache will fail.

Methods:

  • on_end

    Finish calibrating by removing observers and calibration hooks

  • on_initialize

    Prepare to calibrate activations and weights

  • on_start

    Begin calibrating activations and weights. Calibrate weights only once on start

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibrating by removing observers and calibration hooks

Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(
        self, state.model
    )  # keep quantization enabled

on_initialize

on_initialize(state: State, **kwargs) -> bool

Prepare to calibrate activations and weights

According to the quantization config, a quantization scheme is attached to each targeted module. The module's forward call is also overwritten to perform quantization to inputs, weights, and outputs.

Then, according to the module's quantization scheme, observers and calibration hooks are added. These hooks are disabled until the modifier starts.

Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Prepare to calibrate activations and weights

    According to the quantization config, a quantization scheme is attached to each
    targeted module. The module's forward call is also overwritten to perform
    quantization to inputs, weights, and outputs.

    Then, according to the module's quantization scheme, observers and calibration
    hooks are added. These hooks are disabled until the modifier starts.
    """
    if not QuantizationMixin.has_config(self):
        raise ValueError(
            "QuantizationModifier requires that quantization fields be specified"
        )
    QuantizationMixin.initialize_quantization(self, state.model)

    return True

on_start

on_start(state: State, event: Event, **kwargs)

Begin calibrating activations and weights. Calibrate weights only once on start

Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_start(self, state: State, event: Event, **kwargs):
    """
    Begin calibrating activations and weights. Calibrate weights only once on start
    """
    self.started_ = True
    QuantizationMixin.start_calibration(self, state.model)

    modules = list(state.model.modules())
    # TODO: this step can be combined with update_weight_zp_scale
    # once update_fused_layer_weight_global_scales is removed
    # and not required by vLLM
    for module in tqdm.tqdm(modules):
        update_weight_global_scale(module)

    for module in tqdm.tqdm(modules, desc="Calibrating weights"):
        update_fused_layer_weight_global_scales(module)
        update_weight_zp_scale(module)

QuantizedKVParameterCache

QuantizedKVParameterCache(
    quantization_args: QuantizationArgs,
)

Bases: DynamicCache

Quantized KV cache used in the forward call, based on HF's DynamicCache. The quantization strategy (tensor, group, channel) is set from the QuantizationArgs strategy. This class is a singleton, so the same cache gets reused in all forward calls of self_attn. Each time forward is called, .update() is called, and ._quantize() and ._dequantize() get called appropriately. The size of the tensor is [batch_size, num_heads, seq_len - residual_length, head_dim].

Triggered by adding kv_cache_scheme in the recipe.

Example:

```python3
recipe = '''
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      kv_cache_scheme:
        num_bits: 8
        type: float
        strategy: tensor
        dynamic: false
        symmetric: true
'''
```

Methods:

  • get_seq_length

    Returns the sequence length of the cached states.

  • reset

    Reset the instantiation, create new instance on init

  • reset_states

    reset the kv states (used in calibration)

  • update

    Get the k_scale and v_scale and output the fake-quantized key_states and value_states

Source code in llmcompressor/modifiers/quantization/cache.py
def __init__(self, quantization_args: QuantizationArgs):
    if not self._initialized:
        super().__init__()

        self.quantization_args = quantization_args

        self.k_observers: List[Observer] = []
        self.v_observers: List[Observer] = []

        # each index corresponds to layer_idx of the attention layer
        self.k_scales: List[Tensor] = []
        self.v_scales: List[Tensor] = []

        self.k_zps: List[Tensor] = []
        self.v_zps: List[Tensor] = []

        self._initialized = True

get_seq_length

get_seq_length(layer_idx: Optional[int] = 0) -> int

Returns the sequence length of the cached states. A layer index can be optionally passed.

Source code in llmcompressor/modifiers/quantization/cache.py
def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
    """
    Returns the sequence length of the cached states.
    A layer index can be optionally passed.
    """
    if len(self.key_cache) <= layer_idx:
        return 0
    # since we cannot get the seq_length of each layer directly and
    # rely on `_seen_tokens` which is updated every "layer_idx" == 0,
    # this is a hack to get the actual seq_length for the given layer_idx
    # this part of code otherwise fails when used to
    # verify attn_weight shape in some models
    return self._seen_tokens if layer_idx == 0 else self._seen_tokens - 1

reset

reset()

Reset the instantiation, create new instance on init

Source code in llmcompressor/modifiers/quantization/cache.py
def reset(self):
    """
    Reset the instantiation, create new instance on init
    """
    QuantizedKVParameterCache._instance = None
    QuantizedKVParameterCache._initialized = False

reset_states

reset_states()

reset the kv states (used in calibration)

Source code in llmcompressor/modifiers/quantization/cache.py
def reset_states(self):
    """reset the kv states (used in calibration)"""
    self.key_cache: List[Tensor] = []
    self.value_cache: List[Tensor] = []
    # Used in `generate` to keep tally of how many tokens the cache has seen
    self._seen_tokens = 0
    self._quantized_key_cache: List[Tensor] = []
    self._quantized_value_cache: List[Tensor] = []

update

update(
    key_states: Tensor,
    value_states: Tensor,
    layer_idx: int,
    cache_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[Tensor, Tensor]

Get the k_scale and v_scale and output the fake-quantized key_states and value_states

Source code in llmcompressor/modifiers/quantization/cache.py
def update(
    self,
    key_states: Tensor,
    value_states: Tensor,
    layer_idx: int,
    cache_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[Tensor, Tensor]:
    """
    Get the k_scale and v_scale and output the
     fakequant-ed key_states and value_states
    """

    if len(self.k_observers) <= layer_idx:
        k_observer_name = self.quantization_args.observer
        k_observer = Observer.load_from_registry(
            k_observer_name, quantization_args=self.quantization_args
        )
        v_observer_name = self.quantization_args.observer
        v_observer = Observer.load_from_registry(
            v_observer_name, quantization_args=self.quantization_args
        )

        # NOTE: User may ignore some layers in configuration,
        # meaning len(self.k_observers) <= layer_idx-1
        # Must account for that case by padding list so that
        # index of lists corresponds to layer_idx
        _pad_and_append_at_idx_(self.k_observers, layer_idx, k_observer)
        _pad_and_append_at_idx_(self.v_observers, layer_idx, v_observer)

    q_key_states = self._quantize(
        key_states.contiguous(), KVCacheScaleType.KEY, layer_idx
    )
    q_value_states = self._quantize(
        value_states.contiguous(), KVCacheScaleType.VALUE, layer_idx
    )

    qdq_key_states = self._dequantize(q_key_states, KVCacheScaleType.KEY, layer_idx)
    qdq_value_states = self._dequantize(
        q_value_states, KVCacheScaleType.VALUE, layer_idx
    )

    keys_to_return, values_to_return = qdq_key_states, qdq_value_states

    return keys_to_return, values_to_return
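
Conceptually, update() fake-quantizes the incoming key and value states: each tensor is quantized with the per-layer scale (and zero point) maintained by its observer and immediately dequantized, so downstream attention sees the rounding error that a real quantized kv cache would introduce. The helper below is a hypothetical per-tensor symmetric sketch of that round trip, not the cache's actual _quantize/_dequantize, which defer to compressed-tensors and support the configured strategy and dtype.

```python
import torch

def fake_quantize_per_tensor(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize then dequantize x with a symmetric per-tensor integer scheme."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = x.abs().amax().clamp(min=torch.finfo(x.dtype).eps) / q_max
    q = torch.clamp(torch.round(x / scale), -q_max - 1, q_max)
    return q * scale  # same shape/dtype as x, with quantization error applied

key_states = torch.randn(1, 8, 128, 64)     # [batch, heads, seq, head_dim]
qdq_keys = fake_quantize_per_tensor(key_states)
print((qdq_keys - key_states).abs().max())  # small, nonzero rounding error
```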