llmcompressor.modifiers.transform.spinquant.base
Classes:

- SpinQuantModifier – Implements the transforms according to "SpinQuant: LLM quantization with learned rotations"

SpinQuantModifier
Bases: Modifier
Implements the transforms according to "SpinQuant: LLM quantization with learned rotations" (https://arxiv.org/abs/2405.16406)
Transforms (rotations) are extra layers added to a model which reduce the accuracy loss induced by quantization. This is achieved by "rotating" weights and activations into a space with a smaller dynamic range of values, thus decreasing the range of scales required for quantization.
The SpinQuant authors describe four different rotations which can be applied to a model. R1 and R2 are "offline" rotations, meaning that they can be fused into existing weights and therefore do not induce runtime cost. R3 and R4 are "online" rotations, meaning that they require additional computation at runtime.
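For instance, limiting the rotations argument to R1 and R2 yields a purely offline configuration, while adding R3 and R4 registers online hooks. The snippet below is a minimal sketch assuming the class is imported from the module documented here (a shorter re-export may also exist); rotation names and defaults should be checked against your installed version.

```python
# Minimal sketch: offline-only vs. offline + online rotations.
# Import path taken from this module's documentation; a shorter re-export may exist.
from llmcompressor.modifiers.transform.spinquant.base import SpinQuantModifier

# R1 and R2 are fused into existing weights ("offline"), so they add no runtime cost.
offline_only = SpinQuantModifier(rotations=["R1", "R2"], transform_type="hadamard")

# R3 and R4 are "online" rotations and require extra computation at runtime via hooks.
with_online = SpinQuantModifier(
    rotations=["R1", "R2", "R3", "R4"],
    transform_type="hadamard",
)
```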
Lifecycle:

- on_initialize
    - infer SpinQuantMappings & NormMappings
    - as needed, create transform schemes for R1, R2, R3, & R4
- on_start
    - normalize embeddings
    - fuse norm layers into subsequent Linear layers
    - apply TransformConfig
        - fuse transforms into weights for mergeable transforms
        - add hooks for online transforms
- on sequential epoch end
- on_end
- on_finalize
Parameters:

- rotations – A list containing the names of rotations to apply to the model. Possible rotations include R1, R2, R3, and R4.
- transform_type – The type of transform to apply to the model. "hadamard" has the least performance cost but only supports sizes which are powers of two. "random-hadamard" has more performance cost, but supports a much larger set of sizes. "random-matrix" has the greatest performance cost, but supports any size. (See the usage sketch after this list.)
- randomize – If True, create distinct transforms for each application.
- learnable – If True, attach gradients to transform weights for training.
- precision – Precision at which all transforms should be applied. This applies to both weight fusing and online rotations.
- mappings – Specifies layers within a model to target for transforms. A mapping will be inferred if None is provided.
- norm_mappings – Specifies layers within a model to target for norm fusing. A mapping will be inferred if None is provided.
- transform_config – Optional transform config for overriding provided arguments.
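As a usage sketch, the modifier is typically composed with a quantization modifier in a oneshot recipe so that quantization benefits from the reduced dynamic range. The oneshot entrypoint, QuantizationModifier import path, and model ID below are assumptions drawn from the wider llmcompressor API and examples rather than from this class's documentation; verify them against your installed version.

```python
# Hedged usage sketch; entrypoints, schemes, and model ID are illustrative assumptions.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot  # assumed top-level entrypoint
from llmcompressor.modifiers.quantization import QuantizationModifier  # assumed import path
from llmcompressor.modifiers.transform.spinquant.base import SpinQuantModifier

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

recipe = [
    # Offline rotations only: fused into weights, no runtime overhead.
    SpinQuantModifier(
        rotations=["R1", "R2"],
        transform_type="hadamard",  # cheapest option; sizes must be powers of two
        randomize=False,            # reuse the same transform for each application
        learnable=False,            # keep transform weights fixed (no gradients)
    ),
    # Quantize Linear layers after rotation.
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model=model, recipe=recipe)
```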