llmcompressor.modifiers.quantization
Modules:

- cache – Quantized key-value cache implementation for efficient inference.
- calibration
- gptq
- quantization
Classes:

- GPTQModifier – Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323.
- Observer – Base Observer class to be subclassed for specific implementation.
- QuantizationMixin – Mixin which enables a Modifier to act as a quantization config, attaching observers, calibration hooks, and compression wrappers.
- QuantizationModifier – Enables post-training quantization (PTQ) and quantization-aware training (QAT) for a given module or its submodules.
- QuantizedKVParameterCache – Quantized KV cache used in the forward call based on HF's dynamic cache.
GPTQModifier

Bases: Modifier, QuantizationMixin
Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323. This modifier uses activations to calibrate a Hessian matrix, which is then used to determine optimal quantization values and orderings for the model weights.
Sample yaml:

    test_stage:
      obcq_modifiers:
        GPTQModifier:
          block_size: 128
          dampening_frac: 0.001
          offload_hessians: False
          actorder: static
          config_groups:
            group_0:
              targets:
                - "Linear"
              input_activations: null
              output_activations: null
              weights:
                num_bits: 8
                type: "int"
                symmetric: true
                strategy: group
                group_size: 128
Lifecycle:

- on_initialize
  - apply config to model
- on_start
  - add activation calibration hooks
  - add gptq weight calibration hooks
- on_sequential_epoch_end
  - quantize_weight
- on_finalize
  - remove_hooks()
  - model.apply(freeze_module_quantization)
Parameters:

- sequential_targets – list of layer names to compress during GPTQ, or 'ALL' to compress every layer in the model
- block_size – number of columns to compress in one pass
- dampening_frac – amount of dampening to apply to H, as a fraction of the diagonal norm
- actorder – order in which weight columns are quantized. For more information on actorder options, see https://github.com/vllm-project/vllm/pull/8135
- offload_hessians – set to True for decreased memory usage but increased runtime
- config_groups – dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.
- targets – list of layer names to quantize if a scheme is provided. Defaults to Linear layers
- ignore – optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list.
- scheme – a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit weights and activations.
- kv_cache_scheme – optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When applying kv cache quantization to a transformers AutoModelForCausalLM, the kv_cache_scheme is converted into a QuantizationScheme that targets the k_proj and v_proj modules of the model (the outputs of those modules are the keys and values that might be cached) and quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names; if this is not the case and kv_cache_scheme != None, kv cache quantization will fail.
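For orientation, the sketch below shows one way these options might be passed to GPTQModifier through llmcompressor's oneshot entrypoint. The model id, dataset name, and the W4A16 preset scheme are illustrative assumptions rather than values taken from this page.

```python
# Hedged usage sketch: quantize Linear weights with GPTQ via oneshot.
# The model id, dataset name, and "W4A16" preset are illustrative assumptions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",        # quantize all Linear layers...
    ignore=["lm_head"],      # ...except the output projection
    scheme="W4A16",          # preset scheme name (assumed to be available)
    block_size=128,          # columns compressed per pass
    dampening_frac=0.01,     # dampening applied to the Hessian diagonal
    offload_hessians=False,  # set True to trade runtime for lower memory
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
    dataset="open_platypus",                   # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```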
Methods:

- calibrate_module – Calibration hook used to accumulate the Hessian of the input to the module
- compress_modules – Quantize modules which have been calibrated
- on_end – Finish calibrating by removing observers and calibration hooks
- on_finalize – Disable the quantization observers used by the OBCQ algorithm
- on_initialize – Initialize and run the GPTQ algorithm on the current state
calibrate_module

Calibration hook used to accumulate the Hessian of the input to the module

Parameters:

- module (Module) – module being calibrated
- args (Tuple[Tensor, ...]) – inputs to the module, the first element of which is the canonical input
- _output (Tensor) – uncompressed module output, unused
Source code in llmcompressor/modifiers/quantization/gptq/base.py
compress_modules
Quantize modules which have been calibrated
Source code in llmcompressor/modifiers/quantization/gptq/base.py
on_end
Finish calibrating by removing observers and calibration hooks
Source code in llmcompressor/modifiers/quantization/gptq/base.py
on_finalize

Disable the quantization observers used by the OBCQ algorithm

Parameters:

- state (State) – session state storing input model and calibration data
Source code in llmcompressor/modifiers/quantization/gptq/base.py
on_initialize

Initialize and run the GPTQ algorithm on the current state

Parameters:

- state (State) – session state storing input model and calibration data
Source code in llmcompressor/modifiers/quantization/gptq/base.py
Observer

Bases: InternalModule, RegistryMixin

Base Observer class to be subclassed for specific implementation. Subclasses should override calculate_qparams to return a scale, zero_point pair.
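As a rough illustration of that contract, the sketch below registers a hypothetical min-max observer. The import path (inferred from the source path llmcompressor/observers/base.py shown below), the registry decorator usage, and the class itself are assumptions and are not part of the library.

```python
# Hedged sketch of an Observer subclass. The "minmax_sketch" name, the import path,
# and the registration style are assumptions, not part of llmcompressor.
import torch
from llmcompressor.observers import Observer


@Observer.register("minmax_sketch")
class MinMaxSketchObserver(Observer):
    def calculate_qparams(self, observed, reduce_dims=None, tensor_id=None, global_scale=None):
        # Track min/max either over the whole tensor or along the non-reduced dims,
        # keeping reduced dims as size 1 as described in calculate_qparams below.
        if reduce_dims is None:
            min_val, max_val = observed.min(), observed.max()
        else:
            min_val = torch.amin(observed, dim=reduce_dims, keepdim=True)
            max_val = torch.amax(observed, dim=reduce_dims, keepdim=True)

        # Symmetric 8-bit parameters: scale from the largest magnitude, zero point of 0.
        scale = torch.maximum(max_val.abs(), min_val.abs()).clamp(min=1e-8) / 127.0
        zero_point = torch.zeros_like(scale, dtype=torch.int64)
        return scale, zero_point
```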
Methods:

- calculate_gparam – Calculate a global scale parameter from the observed tensor
- calculate_qparams – Calculate scale and zero point quantization parameters from the observed tensor
- forward – Maps directly to get_qparams
- get_gparam – Function to derive a global scale parameter
- get_qparams – Convenience function to wrap overwritten calculate_qparams
- post_calculate_qparams – Run any logic specific to its observers after running calculate_qparams
- record_observed_tokens – Counts the number of tokens observed during the forward passes
- reset – Reset the state of the observer
Source code in llmcompressor/observers/base.py
calculate_gparam

Parameters:

- observed (Tensor) – observed tensor to calculate quantization parameters for

Returns:

- Tensor – global scale derived from the observed tensor
Source code in llmcompressor/observers/base.py
calculate_qparams

    calculate_qparams(
        observed: Tensor,
        reduce_dims: Optional[Tuple[int]] = None,
        tensor_id: Optional[Any] = None,
        global_scale: Optional[Tensor] = None,
    ) -> Tuple[FloatTensor, IntTensor]

Parameters:

- observed (Tensor) – observed tensor to calculate quantization parameters for
- reduce_dims (Optional[Tuple[int]], default: None) – optional tuple of dimensions to reduce along; the returned scale and zero point will be shaped (1,) along the reduced dimensions
- tensor_id (Optional[Any], default: None) – optional id for tracking separate statistics when different ranges of observed tensors are passed, useful for sharding tensors by group_size or block quantization
- global_scale (Optional[Tensor], default: None) – optional scale to further scale local quantization scales

Returns:

- Tuple[FloatTensor, IntTensor] – tuple of scale and zero point derived from the observed tensor
Source code in llmcompressor/observers/base.py
forward

    forward(
        observed: Tensor,
        g_idx: Optional[Tensor] = None,
        global_scale: Optional[Tensor] = None,
        should_calculate_gparam: bool = False,
    ) -> Tuple[FloatTensor, IntTensor]

Maps directly to get_qparams

Parameters:

- observed (Tensor) – optional observed tensor from which to calculate quantization parameters
- g_idx (Optional[Tensor], default: None) – optional mapping from column index to group index
- global_scale (Optional[Tensor], default: None) – optional scale to further scale local quantization scales

Returns:

- Tuple[FloatTensor, IntTensor] – tuple of scale and zero point based on last observed value
Source code in llmcompressor/observers/base.py
get_gparam

Function to derive a global scale parameter

Parameters:

- observed (Tensor) – observed tensor to calculate global parameters from

Returns:

- derived global scale
Source code in llmcompressor/observers/base.py
get_qparams

    get_qparams(
        observed: Optional[Tensor] = None,
        g_idx: Optional[Tensor] = None,
        global_scale: Optional[Tensor] = None,
    ) -> Tuple[FloatTensor, IntTensor]

Convenience function to wrap overwritten calculate_qparams; adds support for making the observed tensor optional and for tracking the latest calculated scale and zero point

Parameters:

- observed (Optional[Tensor], default: None) – optional observed tensor to calculate quantization parameters from
- g_idx (Optional[Tensor], default: None) – optional mapping from column index to group index
- global_scale (Optional[Tensor], default: None) – optional scale to further scale local quantization scales

Returns:

- Tuple[FloatTensor, IntTensor] – tuple of scale and zero point based on last observed value
Source code in llmcompressor/observers/base.py
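To make the forward -> get_qparams -> calculate_qparams flow concrete, the hedged sketch below constructs an observer from QuantizationArgs via the registry and calls it on a weight tensor. The "minmax" observer name, the quantization_args constructor keyword, and the import paths are assumptions about the library's conventions, not values taken from this page.

```python
# Hedged usage sketch: build an observer from the registry and compute qparams
# for a weight tensor. The "minmax" name and constructor keyword are assumptions.
import torch
from compressed_tensors.quantization import QuantizationArgs
from llmcompressor.observers import Observer

weight = torch.randn(4096, 4096)
args = QuantizationArgs(num_bits=8, type="int", symmetric=True, strategy="tensor")

observer = Observer.load_from_registry("minmax", quantization_args=args)
scale, zero_point = observer(weight)  # forward() wraps get_qparams()/calculate_qparams()
print(scale.shape, zero_point.shape)
```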
post_calculate_qparams

Run any logic specific to its observers after running calculate_qparams
record_observed_tokens
Counts the number of tokens observed during the forward passes. The count is aggregated in the _num_observed_tokens attribute of the class.
Note: The batch_tensor is expected to have two dimensions (batch_size * sequence_length, num_features). This is the general shape expected by the forward pass of the expert layers in a MOE model. If the input tensor does not have two dimensions, the _num_observed_tokens attribute will be set to None.
Source code in llmcompressor/observers/base.py
QuantizationMixin

Bases: HooksMixin

Mixin which enables a Modifier to act as a quantization config, attaching observers, calibration hooks, and compression wrappers to modifiers
Lifecycle:

- on_initialize: QuantizationMixin.initialize_quantization
  - Attach schemes to modules
  - Attach observers to modules
  - Disable quantization until calibration starts/finishes
- on_start: QuantizationMixin.start_calibration
  - Attach calibration hooks
  - Apply calibration status
  - Enable quantization during calibration
- on_end: QuantizationMixin.end_calibration
  - Remove calibration hooks
  - Apply freeze status
  - Keep quantization enabled for future steps

NOTE: QuantizationMixin does not update scales and zero-points on its own, as this is not desired for all Modifiers inheriting from it. A Modifier must explicitly call update_weight_zp_scale. See the QuantizationModifier.on_start method for an example.
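The following is a minimal sketch of how a custom Modifier might drive this lifecycle, assuming the Modifier base class is importable from llmcompressor.modifiers and that the event hooks receive a State object exposing the model. It illustrates the call order above and is not the library's own implementation.

```python
# Hedged sketch of a Modifier driving the QuantizationMixin lifecycle described above.
# Import paths and the exact hook signatures (state, event) are assumptions.
from llmcompressor.modifiers import Modifier
from llmcompressor.modifiers.quantization import QuantizationMixin


class SketchPTQModifier(Modifier, QuantizationMixin):
    def on_initialize(self, state, **kwargs) -> bool:
        if QuantizationMixin.has_config(self):
            # attach schemes and observers; quantization stays disabled until calibration
            QuantizationMixin.initialize_quantization(self, state.model)
        return True

    def on_start(self, state, event, **kwargs):
        # register activation/kv-cache calibration hooks and enable quantization
        QuantizationMixin.start_calibration(self, state.model)
        # per the NOTE above, update_weight_zp_scale must be called explicitly
        # by the Modifier if weight scales/zero points should be calibrated here

    def on_end(self, state, event, **kwargs):
        # remove calibration hooks and freeze quantization parameters
        QuantizationMixin.end_calibration(self, state.model)
```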
Parameters:

- config_groups – dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.
- targets – list of layer names to quantize if a scheme is provided. Defaults to Linear layers
- ignore – optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list.
- scheme – a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit weights and activations.
- kv_cache_scheme – optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When applying kv cache quantization to a transformers AutoModelForCausalLM, the kv_cache_scheme is converted into a QuantizationScheme that targets the k_proj and v_proj modules of the model (the outputs of those modules are the keys and values that might be cached) and quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names; if this is not the case and kv_cache_scheme != None, kv cache quantization will fail.
Methods:

- end_calibration – Remove calibration hooks and set the model status to frozen, keeping quantization enabled
- has_config – Determine if the user has specified a quantization config on this modifier
- initialize_quantization – Attach quantization schemes and observers to modules in the model according to the quantization config
- resolve_quantization_config – Returns the quantization config specified by this modifier
- start_calibration – Register activation calibration hooks (including kv_cache quantization) and enable quantization
end_calibration

Remove calibration hooks and set the model status to frozen. Keep quantization enabled for future operations

Parameters:

- model (Module) – model to end calibration for
Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
has_config
Determine if the user has specified a quantization config on this modifier
Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
initialize_quantization

Attach quantization schemes and observers to modules in the model according to the quantization config specified on this modifier

Parameters:

- model (Module) – model to attach schemes and observers to
Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
resolve_quantization_config
Returns the quantization config specified by this modifier
Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
start_calibration

Register activation calibration hooks (including kv_cache quantization) and enable quantization as we calibrate

Parameters:

- model (Module) – model to prepare for calibration
Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
QuantizationModifier

Bases: Modifier, QuantizationMixin

Enables post-training quantization (PTQ) and quantization-aware training (QAT) for a given module or its submodules. After calibration (PTQ) or the start epoch (QAT), the specified module(s) forward pass will emulate quantized execution and the modifier will be enabled until training is completed.
Parameters:

- config_groups – dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.
- targets – list of layer names to quantize if a scheme is provided. Defaults to Linear layers
- ignore – optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list.
- scheme – a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit weights and activations.
- kv_cache_scheme – optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When applying kv cache quantization to a transformers AutoModelForCausalLM, the kv_cache_scheme is converted into a QuantizationScheme that targets the k_proj and v_proj modules of the model (the outputs of those modules are the keys and values that might be cached) and quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names; if this is not the case and kv_cache_scheme != None, kv cache quantization will fail.
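As a point of reference, the sketch below applies a preset scheme through llmcompressor's oneshot entrypoint. The model id and the FP8_DYNAMIC preset name are illustrative assumptions rather than values taken from this page.

```python
# Hedged usage sketch: simple PTQ with a preset scheme via oneshot.
# The model id and the "FP8_DYNAMIC" preset name are illustrative assumptions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all Linear layers...
    ignore=["lm_head"],    # ...except the output projection
    scheme="FP8_DYNAMIC",  # assumed preset with dynamic activation quantization
)

oneshot(model="meta-llama/Llama-3.1-8B-Instruct", recipe=recipe)
```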
Methods:

- on_end – Finish calibrating by removing observers and calibration hooks
- on_initialize – Prepare to calibrate activations and weights
- on_start – Begin calibrating activations and weights. Calibrate weights only once on start
on_end
Finish calibrating by removing observers and calibration hooks
Source code in llmcompressor/modifiers/quantization/quantization/base.py
on_initialize
Prepare to calibrate activations and weights
According to the quantization config, a quantization scheme is attached to each targeted module. The module's forward call is also overwritten to perform quantization to inputs, weights, and outputs.
Then, according to the module's quantization scheme, observers and calibration hooks are added. These hooks are disabled until the modifier starts.
Source code in llmcompressor/modifiers/quantization/quantization/base.py
on_start
Begin calibrating activations and weights. Calibrate weights only once on start
Source code in llmcompressor/modifiers/quantization/quantization/base.py
QuantizedKVParameterCache

Bases: DynamicCache

Quantized KV cache used in the forward call, based on HF's dynamic cache. The quantization strategy (tensor, group, channel) is set from the QuantizationArgs' strategy. The cache is a singleton, so the same cache gets reused in every forward call of self_attn. Each time forward is called, .update() is called, and ._quantize() and ._dequantize() get called appropriately. The size of the cached tensors is [batch_size, num_heads, seq_len - residual_length, head_dim].

Triggered by adding kv_cache_scheme in the recipe.
Example:

```python3
recipe = '''
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
'''
```
Methods:

- get_seq_length – Returns the sequence length of the cached states
- reset – Reset the instantiation, create new instance on init
- reset_states – Reset the kv states (used in calibration)
- update – Get the k_scale and v_scale and output the fake-quantized key_states and value_states
Source code in llmcompressor/modifiers/quantization/cache.py
get_seq_length
Returns the sequence length of the cached states. A layer index can be optionally passed.
Source code in llmcompressor/modifiers/quantization/cache.py
reset
Reset the instantiation, create new instance on init
reset_states
reset the kv states (used in calibration)
Source code in llmcompressor/modifiers/quantization/cache.py
update
update(
key_states: Tensor,
value_states: Tensor,
layer_idx: int,
cache_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[Tensor, Tensor]
Get the k_scale and v_scale and output the fake-quantized key_states and value_states