llmcompressor.modifiers.quantization.cache
Quantized key-value cache implementation for efficient inference.
Provides quantized KV cache classes extending HuggingFace's DynamicCache with quantization support. Enables memory-efficient attention mechanisms by quantizing cached key and value tensors during model inference with configurable quantization strategies.
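To make the memory claim concrete, the sketch below estimates the size of a KV cache stored at 16 and at 8 bits per element. The function name `kv_cache_bytes` and the model dimensions are hypothetical, chosen only for illustration, and the estimate ignores the small overhead of storing quantization scales.

```python
def kv_cache_bytes(
    num_layers: int,
    batch_size: int,
    num_heads: int,
    seq_len: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """Estimate KV cache size: one key and one value tensor per layer."""
    elements_per_tensor = batch_size * num_heads * seq_len * head_dim
    return 2 * num_layers * elements_per_tensor * bytes_per_element


# Hypothetical model dimensions, for illustration only.
dims = dict(num_layers=32, batch_size=1, num_heads=32, seq_len=4096, head_dim=128)

fp16_bytes = kv_cache_bytes(**dims, bytes_per_element=2)
int8_bytes = kv_cache_bytes(**dims, bytes_per_element=1)

print(f"16-bit: {fp16_bytes / 2**30:.2f} GiB")  # 16-bit: 2.00 GiB
print(f"8-bit:  {int8_bytes / 2**30:.2f} GiB")  # 8-bit:  1.00 GiB
```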
Classes:
- QuantizedKVParameterCache – Quantized KV cache used in the forward call based on HF's dynamic cache.
QuantizedKVParameterCache
Bases: DynamicCache
Quantized KV cache used in the forward call, based on HF's dynamic cache. The quantization strategy (tensor, group, channel) is set from the quantization args' strategy. The class is a singleton, so the same cache is reused in every forward call of self_attn. Each time forward is called, .update() is called, which in turn calls ._quantize() and ._dequantize() as appropriate. The size of the tensor is [batch_size, num_heads, seq_len - residual_length, head_dim].
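The singleton behaviour described above can be sketched as follows. This is a minimal illustration of the pattern, not the library's implementation: the class name `SingletonCacheSketch` and its attributes are hypothetical.

```python
class SingletonCacheSketch:
    """Hypothetical stand-in for a singleton cache; not the library class."""

    _instance = None
    _initialized = False

    def __new__(cls, *args, **kwargs):
        # Hand back the one shared instance, creating it on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, strategy: str = "tensor"):
        # __init__ runs on every construction call, so only set state once.
        if not type(self)._initialized:
            self.strategy = strategy
            self.entries = []
            type(self)._initialized = True

    def reset(self) -> None:
        # Forget the shared instance so the next construction builds a new one.
        type(self)._instance = None
        type(self)._initialized = False


first = SingletonCacheSketch("tensor")
second = SingletonCacheSketch("channel")  # arguments are ignored: same object
print(first is second, second.strategy)

first.reset()
third = SingletonCacheSketch("channel")  # a fresh instance after reset
print(third is first, third.strategy)
```

Because every attention layer constructs the cache through the same class, each one receives the same object and therefore shares one set of stored state.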
Triggered by adding kv_cache_scheme in the recipe.
Example:
```python3
recipe = '''
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
'''
```
Methods:
- get_seq_length – Returns the sequence length of the cached states.
- reset – Reset the instantiation so that a new instance is created on the next init.
- reset_states – Reset the KV states (used in calibration).
- update – Get the k_scale and v_scale and output the fakequant-ed key_states and value_states.
Source code in llmcompressor/modifiers/quantization/cache.py
get_seq_length
Returns the sequence length of the cached states. A layer index can be optionally passed.
reset
Reset the instantiation so that a new instance is created on the next init.
reset_states
Reset the KV states (used in calibration).
update
update(
key_states: Tensor,
value_states: Tensor,
layer_idx: int,
cache_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[Tensor, Tensor]
Gets the k_scale and v_scale and returns the fake-quantized key_states and value_states.
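A minimal sketch of what fake quantization means here: values are quantized to a low-precision grid and immediately dequantized, so the output keeps the input's shape and floating-point type but carries the rounding error. It assumes symmetric per-tensor integer quantization on plain Python lists rather than the FP8 tensors of the recipe above, and `compute_scale`, `fake_quantize`, and `update_sketch` are hypothetical names, not the library's API.

```python
from typing import List, Tuple


def compute_scale(values: List[float], num_bits: int = 8) -> float:
    """Per-tensor symmetric scale: the largest magnitude maps to q_max."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / q_max if max_abs > 0 else 1.0


def fake_quantize(values: List[float], scale: float, num_bits: int = 8) -> List[float]:
    """Quantize to the integer grid, clamp, then dequantize back to floats."""
    q_max = 2 ** (num_bits - 1) - 1
    out = []
    for v in values:
        q = round(v / scale)             # quantize
        q = max(-q_max, min(q_max, q))   # clamp to the symmetric range
        out.append(q * scale)            # dequantize
    return out


def update_sketch(
    key_states: List[float], value_states: List[float]
) -> Tuple[List[float], List[float]]:
    """Compute separate scales for keys and values, then fake-quantize both."""
    k_scale = compute_scale(key_states)
    v_scale = compute_scale(value_states)
    return fake_quantize(key_states, k_scale), fake_quantize(value_states, v_scale)


keys = [0.5, -1.27, 0.0, 1.0]
values = [2.0, -4.0]
k_out, v_out = update_sketch(keys, values)
print(k_out)
print(v_out)
```

Each output value differs from its input by at most half a quantization step (`scale / 2`), which is the error the model sees during calibration.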