Observers Overview
An Observer
in llm-compressor
is a utility class responsible for analyzing tensors (e.g., weights, activations) and producing quantization parameters such as scale
and zero_point
. These observers are used by quantization modifiers to compute the statistics necessary for transforming tensors into lower precision formats.
Observers are designed to be flexible and support a variety of quantization strategies, including per-tensor, per-group, per-channel, and per-token quantization.
Base Class
Observer
Base class for all observers. Subclasses must implement the calculate_qparams
method to define how quantization parameters are computed.
The base class handles: - Group-wise scale/zero_point computation - Token-wise and channel-wise quantization logic - Optional support for g_idx
(group index mappings) - Recording observed tokens for logging and analysis - Resetting internal state during lifecycle transitions
This class is not used directly but provides the scaffolding for all custom observers.
Implemented Observers
MinMax
Computes scale
and zero_point
by tracking the minimum and maximum of the observed tensor. This is the simplest and most common observer. Works well for symmetric and asymmetric quantization.
Best used for: - Int8 or Int4 symmetric quantization - Channel-wise or group-wise strategies
MSE
Computes quantization parameters by minimizing the Mean Squared Error (MSE) between the original and quantized tensor. Optionally maintains a moving average of min/max values for smoother convergence.
Best used when: - Calibration accuracy is critical - Quantization error needs to be tightly controlled
Quantization Strategies
Observers support multiple quantization strategies via the QuantizationArgs.strategy
field:
TENSOR
: Global scale and zero_point across entire tensor.GROUP
,TENSOR_GROUP
: Slice tensor into equal-sized groups along columns.CHANNEL
: Per-channel quantization (e.g., across output dimensions).TOKEN
: Quantize activations along token or sequence dimensions.BLOCK
: (Not yet implemented) Placeholder for block-wise quantization.
Observer Configuration Parameters
Observers can be configured with optional keyword arguments that control their behavior. These are passed through the QuantizationArgs.observer_kwargs
dictionary and parsed internally when the observer is initialized.
Below are the supported configuration parameters and their meanings:
Argument | Default Value |
---|---|
maxshrink | 0.20 |
patience | 5 |
averaging_constant | 0.01 |
grid | 100.0 |
norm | 2.0 |
Example Usage
from llmcompressor.observers import Observer
from compressed_tensors.quantization.quant_args import QuantizationArgs
args = QuantizationArgs(num_bits=4, strategy="group", group_size=128)
observer = Observer.load_from_registry("minmax", quantization_args=args)
x = torch.randn(64, 512)
scale, zero_point = observer(x)