llmcompressor.modifiers.quantization.calibration
Functions:
- calibrate_activations – Calibrate input or output activations by calling the module's attached observer.
- calibrate_input_hook – Hook to calibrate input activations.
- calibrate_kv_cache_input_hook – Hook to update inputs to attention layers when running kv_cache quantization.
- calibrate_kv_cache_output_hook – Hook to update k_scale and v_scale parameters when running kv_cache quantization.
- calibrate_output_hook – Hook to calibrate output activations.
- call_observer – Call a module's attached input/weight/output observer and update the module's scale and zp.
- freeze_module_quantization – Deletes observers when calibration is complete.
- initialize_observer – Initialize an observer module and attach it as a submodule.
- initialize_quantized_kv_cache – Initialize a quantized kv_cache on a module (analogous to initializing an observer).
- update_weight_zp_scale – Marks a layer as ready for calibration, which activates observers.
calibrate_activations
Calibrate input or output activations by calling the module's attached observer.
Parameters:
- module (Module) – torch.nn.Module
- base_name (str) – substring used to fetch the observer, scales, and zp
- value (Tensor) – torch.Tensor to be passed to the observer
Source code in llmcompressor/modifiers/quantization/calibration.py
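The pattern behind calibrate_activations can be sketched without the library: an observer attached under a base_name-derived attribute sees the raw tensor, and the resulting scale and zero point are written back onto the module. Below is a minimal pure-Python sketch under stated assumptions; the MinMaxObserver class, the attribute-naming scheme, and the symmetric int8 math are illustrative, not llmcompressor's actual API.

```python
# Illustrative sketch only: a toy min-max observer updating scale/zp on a
# module, mimicking the calibrate_activations flow. All names hypothetical.

class MinMaxObserver:
    """Tracks the running min/max of observed values (toy stand-in)."""
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def __call__(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        # Symmetric int8: scale maps [-absmax, absmax] onto [-127, 127]
        absmax = max(abs(self.min_val), abs(self.max_val))
        scale = absmax / 127.0 if absmax else 1.0
        return scale, 0  # (scale, zero_point)

class ToyModule:
    pass

def calibrate_activations_sketch(module, base_name, value):
    # Fetch the observer attached under "<base_name>_observer" and update
    # "<base_name>_scale" / "<base_name>_zero_point" from its return values.
    observer = getattr(module, f"{base_name}_observer")
    scale, zp = observer(value)
    setattr(module, f"{base_name}_scale", scale)
    setattr(module, f"{base_name}_zero_point", zp)

module = ToyModule()
module.input_observer = MinMaxObserver()
calibrate_activations_sketch(module, "input", [-2.54, 0.1, 1.7])
print(module.input_scale)  # 2.54 / 127
```

Repeated calls widen the observer's running range, so the stored scale only grows as more calibration batches are seen.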
calibrate_input_hook
Hook to calibrate input activations. Will call the observers to update the scales/zp before applying input QDQ in the module's forward pass.
calibrate_kv_cache_input_hook
calibrate_kv_cache_input_hook(
module: Module, args: Any, kwargs: Dict[str, Any]
) -> Tuple[Tuple[Any, ...], Dict[str, Any]]
Hook to update inputs to attention layers when running kv_cache quantization. Will update the passed in kv_cache to singleton QuantizedKVParameterCache.
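The signature above matches PyTorch's forward-pre-hook contract when registered with with_kwargs=True: the hook receives (module, args, kwargs) and returns possibly modified (args, kwargs). A toy sketch of redirecting the cache argument to a shared singleton follows; the kwarg name past_key_value and the SharedCache class are illustrative assumptions standing in for QuantizedKVParameterCache.

```python
# Sketch of a kwargs-modifying forward-pre-hook that replaces the cache
# argument with a shared singleton, as the kv_cache calibration hook does.
# The kwarg name and cache class here are illustrative, not llmcompressor's.

class SharedCache:
    """Toy singleton standing in for QuantizedKVParameterCache."""
    _instance = None
    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

def kv_cache_input_hook(module, args, kwargs):
    # Redirect the layer's cache kwarg to the shared quantized cache.
    kwargs["past_key_value"] = SharedCache()
    return args, kwargs

# Every attention layer ends up seeing the same cache object:
_, kw1 = kv_cache_input_hook(object(), (), {"past_key_value": None})
_, kw2 = kv_cache_input_hook(object(), (), {"past_key_value": None})
print(kw1["past_key_value"] is kw2["past_key_value"])  # True
```

Because the cache is a singleton, k/v statistics from all attention layers accumulate in one place during calibration.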
calibrate_kv_cache_output_hook
Hook to update k_scale and v_scale parameters when running kv_cache quantization.
calibrate_output_hook
Hook to calibrate output activations. Will call the observers to update the scales/zp before applying output QDQ.
call_observer
call_observer(
module: Module,
base_name: str,
value: Optional[Tensor] = None,
should_calculate_gparam: bool = False,
should_calculate_qparams: bool = True,
)
Call a module's attached input/weight/output observer using a provided value. Update the module's scale and zp using the observer's return values.
Parameters:
- module (Module) – torch.nn.Module
- base_name (str) – substring used to fetch the observer, scales, and zp
- value (Optional[Tensor], default: None) – torch.Tensor to be passed to the observer for activations. If base_name is "weight", then the module's weight tensor will be used
freeze_module_quantization
Deletes observers when calibration is complete.
Apply to the full model with model.apply(freeze_module_quantization).
Parameters:
- module (Module) – module to freeze quantization for
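In PyTorch, model.apply(fn) calls fn on every submodule recursively, so freezing reduces to deleting each module's observers during that traversal. A toy sketch of the pattern, with an illustrative ToyModule tree and observer attribute name in place of real torch modules:

```python
# Sketch of the model.apply(...) pattern: walk every submodule and delete
# its calibration observers. Toy classes; attribute names are illustrative.

class ToyModule:
    def __init__(self, children=()):
        self.children_list = list(children)
        self.input_observer = object()  # stand-in for an attached observer

    def apply(self, fn):
        # Mimics torch.nn.Module.apply: visit submodules, then self.
        for child in self.children_list:
            child.apply(fn)
        fn(self)
        return self

def freeze_sketch(module):
    # Remove observers once calibration is done so they stop being updated
    # and holding memory; quantization parameters stay on the module.
    if hasattr(module, "input_observer"):
        del module.input_observer

model = ToyModule(children=[ToyModule(), ToyModule()])
model.apply(freeze_sketch)
print(any(hasattr(m, "input_observer")
          for m in [model] + model.children_list))  # False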
initialize_observer
Initialize an observer module and attach it as a submodule. The observer type is fetched from the quantization_args, loaded from the registry, and attached to the module under an attribute named using the provided base_name.
Parameters:
- module (Module) – torch.nn.Module that the observer is being attached to
- base_name (str) – str used to name the observer attribute
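The registry lookup described above can be sketched as a dict mapping observer names (carried on quantization_args) to classes, with the instance attached under an attribute derived from base_name. The registry contents, QuantizationArgs shape, and naming scheme below are illustrative assumptions, not llmcompressor's actual registry:

```python
# Sketch of observer initialization: look the observer class up in a
# registry by the name in quantization_args, then attach the instance to
# the module under "<base_name>_observer". All names are hypothetical.

class MinMaxObserver:
    pass

OBSERVER_REGISTRY = {"minmax": MinMaxObserver}  # hypothetical registry

class QuantizationArgs:
    def __init__(self, observer="minmax"):
        self.observer = observer

class ToyModule:
    pass

def initialize_observer_sketch(module, base_name, quantization_args):
    observer_cls = OBSERVER_REGISTRY[quantization_args.observer]
    setattr(module, f"{base_name}_observer", observer_cls())

module = ToyModule()
initialize_observer_sketch(module, "weight", QuantizationArgs())
print(type(module.weight_observer).__name__)  # MinMaxObserver
```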
initialize_quantized_kv_cache
Initialize a quantized kv_cache on a module (analogous to initializing an observer). When a config specifying kv_cache quantization is applied to a model, the kv_cache args are redefined as the output_activations targeting attention modules.
This function should be called on attention modules with output_activations.
update_weight_zp_scale
Marks a layer as ready for calibration, which activates observers to update scales and zero points on each forward pass.
Apply to the full model with model.apply(update_weight_zp_scale).
Parameters:
- module (Module) – module to set for calibration
- quantize_weights_upfront – whether to automatically run weight quantization at the start of calibration
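Quantizing weights upfront is possible because, unlike activations, the weight tensor is fixed: its scale and zero point can be computed once from min/max statistics rather than refreshed on every forward pass. A sketch of that one-shot computation using asymmetric uint8 min-max math (the formula and helper name are illustrative, not llmcompressor's observer):

```python
# Sketch: compute an asymmetric uint8 scale/zero-point once from a weight
# tensor, as upfront weight calibration would. Math is illustrative.

def weight_qparams(weights, qmin=0, qmax=255):
    lo, hi = min(weights), max(weights)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # range must include zero
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

scale, zp = weight_qparams([-1.0, 0.5, 3.0])
# Quantize/dequantize round-trip for one value:
q = round(3.0 / scale) + zp
deq = (q - zp) * scale
print(abs(deq - 3.0) < scale)  # reconstruction error within one step
```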