llmcompressor.modifiers.transform.spinquant
Modules:
- base
- mappings
- norm_mappings
Classes:
- Event – A class for defining an event that can be triggered during sparsification.
- EventType – An Enum for defining the different types of events that can be triggered during model compression lifecycles.
- Modifier – A base class for all modifiers to inherit from.
- NormMapping – SpinQuant needs to know where every norm layer exists in the model, as well as the subsequent Linear layers the norm passes into.
- SpinQuantMapping – SpinQuant needs to know the entire architecture of the model, so the R1, R2, R3, and R4 rotations can be applied to the correct layers.
- SpinQuantModifier – Implements the transforms according to "SpinQuant: LLM quantization with learned rotations".
- State – State class holds information about the current compression state.
Functions:
- center_embeddings – Shift each embedding to have a mean of zero.
- fuse_norm_linears – Fuse the scaling operation of a norm layer into subsequent linear layers.
Event dataclass
Event(
type_: Optional[EventType] = None,
steps_per_epoch: Optional[int] = None,
batches_per_step: Optional[int] = None,
invocations_per_step: int = 1,
global_step: int = 0,
global_batch: int = 0,
)
A class for defining an event that can be triggered during sparsification.
Parameters:
-
(type_Optional[EventType], default:None) –The type of event.
-
(steps_per_epochOptional[int], default:None) –The number of steps per epoch.
-
(batches_per_stepOptional[int], default:None) –The number of batches per step where step is an optimizer step invocation. For most pathways, these are the same. See the invocations_per_step parameter for more details when they are not.
-
(invocations_per_stepint, default:1) –The number of invocations of the step wrapper before optimizer.step was called. Generally can be left as 1 (default). For older amp pathways, this is the number of times the scaler wrapper was invoked before the wrapped optimizer step function was called to handle accumulation in fp16.
-
(global_stepint, default:0) –The current global step.
-
(global_batchint, default:0) –The current global batch.
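For illustration, a rough sketch of constructing an epoch-based event and reading its derived properties; the import path and exact property semantics are assumptions based on the descriptions below:

from llmcompressor.core.events import Event, EventType

# Hypothetical usage: with 100 steps per epoch and a global step of 250,
# the event would report epoch 2, epoch_step 50, and epoch_full 2.5.
event = Event(type_=EventType.BATCH_START, steps_per_epoch=100, global_step=250)
if event.epoch_based:
    print(event.epoch, event.epoch_step, event.epoch_full)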
Methods:
- new_instance – Creates a new instance of the event with the provided keyword arguments.
- should_update – Determines if the event should trigger an update.
Attributes:
- current_index (float) – Calculates the current index of the event.
- epoch (int) – Calculates the current epoch.
- epoch_based (bool) – Determines if the event is based on epochs.
- epoch_batch (int) – Calculates the current batch within the current epoch.
- epoch_full (float) – Calculates the current epoch with the fraction of the current step.
- epoch_step (int) – Calculates the current step within the current epoch.
current_index property writable
Calculates the current index of the event.
Returns:
- float – The current index of the event, which is either the global step or the epoch with the fraction of the current step.
Raises:
- ValueError – if the event is not epoch based or if the steps per epoch are too many.
epoch property
Calculates the current epoch.
Returns:
- int – The current epoch.
Raises:
- ValueError – if the event is not epoch based.
epoch_based property
Determines if the event is based on epochs.
Returns:
- bool – True if the event is based on epochs, False otherwise.
epoch_batch property
Calculates the current batch within the current epoch.
Returns:
- int – The current batch within the current epoch.
Raises:
- ValueError – if the event is not epoch based.
epoch_full property
Calculates the current epoch with the fraction of the current step.
Returns:
- float – The current epoch with the fraction of the current step.
Raises:
- ValueError – if the event is not epoch based.
epoch_step property
Calculates the current step within the current epoch.
Returns:
- int – The current step within the current epoch.
Raises:
- ValueError – if the event is not epoch based.
new_instance
Creates a new instance of the event with the provided keyword arguments.
Parameters:
- kwargs – Keyword arguments to set in the new instance.
Returns:
- Event – A new instance of the event with the provided kwargs.
Source code in llmcompressor/core/events/event.py
should_update
Determines if the event should trigger an update.
Parameters:
- start (Optional[float]) – The start index to check against, set to None to ignore start.
- end (Optional[float]) – The end index to check against, set to None to ignore end.
- update (Optional[float]) – The update interval, set to None or 0.0 to always update, otherwise must be greater than 0.0, defaults to None.
Returns:
- bool – True if the event should trigger an update, False otherwise.
Source code in llmcompressor/core/events/event.py
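A brief sketch of how start, end, and update might be combined; the import path and numeric behavior are assumptions, not values taken from the library's tests:

from llmcompressor.core.events import Event, EventType

# Sketch: a modifier scheduled for the window [1.0, 3.0] epochs with an update
# interval of 0.5; should_update compares against the event's current_index.
event = Event(type_=EventType.BATCH_END, steps_per_epoch=100, global_step=150)
fire = event.should_update(start=1.0, end=3.0, update=0.5)

# new_instance copies the event while overriding selected fields.
next_event = event.new_instance(global_step=event.global_step + 1)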
EventType
Bases: Enum
An Enum for defining the different types of events that can be triggered during model compression lifecycles. The purpose of each EventType is to trigger the corresponding modifier callback during training or post training pipelines.
Parameters:
- INITIALIZE – Event type for initialization.
- FINALIZE – Event type for finalization.
- BATCH_START – Event type for the start of a batch.
- LOSS_CALCULATED – Event type for when loss is calculated.
- BATCH_END – Event type for the end of a batch.
- CALIBRATION_EPOCH_START – Event type for the start of a calibration epoch.
- SEQUENTIAL_EPOCH_END – Event type for the end of a layer calibration epoch, specifically used by src/llmcompressor/pipelines/sequential/pipeline.py.
- CALIBRATION_EPOCH_END – Event type for the end of a calibration epoch.
- OPTIM_PRE_STEP – Event type for pre-optimization step.
- OPTIM_POST_STEP – Event type for post-optimization step.
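For orientation, a modifier's on_event hook typically branches on these types; the following is a schematic sketch with hypothetical helper methods, not code from the library:

from llmcompressor.core.events import EventType

def on_event(self, state, event, **kwargs):
    # Schematic dispatch inside a hypothetical modifier's on_event override.
    if event.type_ == EventType.CALIBRATION_EPOCH_START:
        self.register_calibration_hooks(state.model)   # hypothetical helper
    elif event.type_ == EventType.SEQUENTIAL_EPOCH_END:
        self.compress_current_layer(state)             # hypothetical helper
    elif event.type_ == EventType.CALIBRATION_EPOCH_END:
        self.remove_calibration_hooks()                # hypothetical helper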
Modifier
Bases: ModifierInterface, HooksMixin
A base class for all modifiers to inherit from. Modifiers are used to modify the training process for a model. Defines base attributes and methods available to all modifiers.
Lifecycle:
1. initialize
2. on_event ->
   - on_start if self.start <= event.current_index
   - on_end if self.end >= event.current_index
3. finalize
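A minimal subclass sketch following this lifecycle; the import paths are assumptions and the body is purely illustrative:

from llmcompressor.core import Event, State
from llmcompressor.modifiers import Modifier

class LoggingModifier(Modifier):
    """Illustrative modifier that only logs lifecycle transitions."""

    def on_initialize(self, state: State, **kwargs) -> bool:
        # Set up any resources needed for the modifier's lifetime.
        return True

    def on_start(self, state: State, event: Event, **kwargs):
        print(f"starting at index {event.current_index}")

    def on_update(self, state: State, event: Event, **kwargs):
        pass  # react to batch/optimizer events between start and end

    def on_end(self, state: State, event: Event, **kwargs):
        print("ending")

    def on_finalize(self, state: State, **kwargs) -> bool:
        # Release resources; return True on success.
        return True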
Parameters:
- index – The index of the modifier in the list of modifiers for the model
- group – The group name for the modifier
- start – The start step for the modifier
- end – The end step for the modifier
- update – The update step for the modifier
Methods:
- finalize – Finalize the modifier for the given model and state.
- initialize – Initialize the modifier for the given model and state.
- on_end – Called when the modifier ends; must be implemented by the inheriting modifier.
- on_event – Called whenever an event is triggered.
- on_finalize – Called on modifier finalization; must be implemented by the inheriting modifier.
- on_initialize – Called on modifier initialization; must be implemented by the inheriting modifier.
- on_start – Called when the modifier starts; must be implemented by the inheriting modifier.
- on_update – Called when the model in question must be updated based on a passed-in event.
- should_end – Check whether the modifier should end based on the given event.
- should_start – Check whether the modifier should start based on the given event.
- update_event – Update the modifier based on the given event. In turn calls on_start, on_update, and on_end.
Attributes:
- finalized (bool) – True if the modifier has been finalized.
- initialized (bool) – True if the modifier has been initialized.
finalize
Finalize the modifier for the given model and state.
Parameters:
- state (State) – The current state of the model
- kwargs – Additional arguments for finalizing the modifier
Raises:
- RuntimeError – if the modifier has not been initialized
Source code in llmcompressor/modifiers/modifier.py
initialize
Initialize the modifier for the given model and state.
Parameters:
- state (State) – The current state of the model
- kwargs – Additional arguments for initializing the modifier
Raises:
- RuntimeError – if the modifier has already been finalized
Source code in llmcompressor/modifiers/modifier.py
on_end
on_end is called when the modifier ends and must be implemented by the inheriting modifier.
Parameters:
- state (State) – The current state of the model
- event (Event) – The event that triggered the end
- kwargs – Additional arguments for ending the modifier
Source code in llmcompressor/modifiers/modifier.py
on_event
on_event is called whenever an event is triggered.
on_finalize
on_finalize is called on modifier finalization and must be implemented by the inheriting modifier.
Parameters:
- state (State) – The current state of the model
- kwargs – Additional arguments for finalizing the modifier
Returns:
- bool – True if the modifier was finalized successfully, False otherwise
Source code in llmcompressor/modifiers/modifier.py
on_initialize abstractmethod
on_initialize is called on modifier initialization and must be implemented by the inheriting modifier.
Parameters:
- state (State) – The current state of the model
- kwargs – Additional arguments for initializing the modifier
Returns:
- bool – True if the modifier was initialized successfully, False otherwise
Source code in llmcompressor/modifiers/modifier.py
on_start
on_start is called when the modifier starts and must be implemented by the inheriting modifier.
Parameters:
- state (State) – The current state of the model
- event (Event) – The event that triggered the start
- kwargs – Additional arguments for starting the modifier
Source code in llmcompressor/modifiers/modifier.py
on_update
on_update is called when the model in question must be updated based on passed in event. Must be implemented by the inheriting modifier.
Parameters:
- state (State) – The current state of the model
- event (Event) – The event that triggered the update
- kwargs – Additional arguments for updating the model
Source code in llmcompressor/modifiers/modifier.py
should_end
Parameters:
- event (Event) – The event to check if the modifier should end
Returns:
- bool – True if the modifier should end based on the given event
Source code in llmcompressor/modifiers/modifier.py
should_start
Parameters:
- event (Event) – The event to check if the modifier should start
Returns:
- bool – True if the modifier should start based on the given event
Source code in llmcompressor/modifiers/modifier.py
update_event
Update modifier based on the given event. In turn calls on_start, on_update, and on_end based on the event and modifier settings. Returns immediately if the modifier is not initialized
Parameters:
- state (State) – The current state of sparsification
- event (Event) – The event to update the modifier with
- kwargs – Additional arguments for updating the modifier
Raises:
- RuntimeError – if the modifier has been finalized
Source code in llmcompressor/modifiers/modifier.py
NormMapping
Bases: BaseModel
SpinQuant needs to know where every norm layer exists in the model, as well as all the subsequent Linear layers the norm passes into. This is because the norm layer weights need to be normalized before transforms can be fused into Linear layers.
Parameters:
- norm – name or regex that matches norm layer in model
- linears – list of names or regexes of Linear layers that receive input from norm.
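For example, a mapping for a Llama-style decoder layer might look like the sketch below; the module names, the "re:" regex prefix, and the import path are illustrative assumptions, not defaults taken from this page:

from llmcompressor.modifiers.transform.spinquant.norm_mappings import NormMapping

# Hypothetical mapping: the post-attention norm feeds the MLP input projections.
post_attn_mapping = NormMapping(
    norm="re:.*post_attention_layernorm$",
    linears=["re:.*gate_proj$", "re:.*up_proj$"],
)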
SpinQuantMapping
Bases: BaseModel
SpinQuant needs to know the entire architecture of the model, as R1, R2, R3, and R4 rotations need to be applied to specific layers (https://arxiv.org/pdf/2405.16406 Fig. 1).
Parameters:
- embedding – name or regex of embedding layer
- attn_q – name or regex of q_proj layer in attention block
- attn_k – name or regex of k_proj layer in attention block
- attn_v – name or regex of v_proj layer in attention block
- attn_o – name or regex of o_proj layer in attention block
- attn_head_dim – head_dim of the attention module, needed because R2 needs to be applied "head-wise" to v_proj and o_proj
- mlp_in – list of names or regexes for the layers that receive the input to the MLP block, usually up_proj and gate_proj
- mlp_out – list of names or regexes for the layers that constitute the output of the MLP block, usually down_proj
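A sketch of a complete mapping for a Llama-like architecture; module names, the "re:" prefix, and the import path are assumptions for illustration (a mapping is normally inferred automatically when none is provided):

from llmcompressor.modifiers.transform.spinquant.mappings import SpinQuantMapping

llama_like_mapping = SpinQuantMapping(
    embedding="re:.*embed_tokens$",
    attn_q="re:.*q_proj$",
    attn_k="re:.*k_proj$",
    attn_v="re:.*v_proj$",
    attn_o="re:.*o_proj$",
    mlp_in=["re:.*gate_proj$", "re:.*up_proj$"],
    mlp_out=["re:.*down_proj$"],
)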
SpinQuantModifier
Bases: Modifier
Implements the transforms according to "SpinQuant: LLM quantization with learned rotations" (https://arxiv.org/abs/2405.16406)
Transforms (rotations) are extra layers added to a model which reduce the accuracy loss induced by quantization. This is achieved through "rotating" weights and activations into a space with a smaller dynamic range of values, thus decreasing the range of scales required for quantization.
The SpinQuant authors describe four different rotations which can be applied to a model. R1 and R2 are "offline" rotations, meaning that they can be fused into existing weights and therefore do not induce runtime cost. R3 and R4 are "online" rotations, meaning that they require additional computation at runtime.
Lifecycle:
- on_initialize
  - infer SpinQuantMappings & NormMappings
  - as needed, create transform schemes for R1, R2, R3, & R4
- on_start
  - normalize embeddings
  - fuse norm layers into subsequent Linear layers
  - apply TransformConfig
    - fuse transforms into weights for mergeable transforms
    - add hooks for online transforms
- on sequential epoch end
- on_end
- on_finalize
Parameters:
- rotations – A list containing the names of rotations to apply to the model. Possible rotations include R1, R2, R3, and R4.
- transform_type – The type of transform to apply to the model. "hadamard" has the least performance cost but only supports sizes which are powers of two. "random-hadamard" has more performance cost, but supports a much larger set of sizes. "random-matrix" has the greatest performance cost, but supports any size.
- randomize – if True, create distinct transforms for each application.
- learnable – if True, attach gradients to transform weights for training.
- precision – Precision at which all transforms should be applied. This applies to both weight fusing and online rotations.
- transform_block_size – Block size to use for rotation matrices. The model's hidden_size and head_dim must be evenly divisible by transform_block_size. Layers will be transformed by a block-diagonal matrix where each block is a matrix of this size. If None is provided, the model's hidden_size will be used for R1, R3, and R4, and the model's head_dim will be used for R2.
- mappings – Specifies layers within a model to target for transforms. A mapping will be inferred if None is provided.
- norm_mappings – Specifies layers within a model to target for norm fusing. A mapping will be inferred if None is provided.
- transform_config – Optional transform config for overriding provided arguments.
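A minimal usage sketch pairing the modifier with a quantization step in a oneshot recipe; the model id, dataset, and exact entrypoint arguments are placeholders, so consult the llmcompressor examples for supported values:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier

# Offline rotations (R1, R2) fuse into weights before quantization; adding
# R3/R4 would instead register online hooks with extra runtime cost.
recipe = [
    SpinQuantModifier(rotations=["R1", "R2"], transform_type="hadamard"),
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model id
    dataset="ultrachat_200k",                  # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)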
State dataclass
State(
model: Any = None,
teacher_model: Any = None,
optimizer: Any = None,
optim_wrapped: bool = None,
loss: Any = None,
batch_data: Any = None,
data: Data = Data(),
hardware: Hardware = Hardware(),
loggers: Optional[LoggerManager] = None,
model_log_cadence: Optional[float] = None,
_last_log_step: Union[float, int, None] = None,
)
State class holds information about the current compression state.
Parameters:
- model (Any, default: None) – The model being used for compression
- teacher_model (Any, default: None) – The teacher model being used for compression
- optimizer (Any, default: None) – The optimizer being used for training
- optim_wrapped (bool, default: None) – Whether or not the optimizer has been wrapped
- loss (Any, default: None) – The loss function being used for training
- batch_data (Any, default: None) – The current batch of data being used for compression
- data (Data, default: Data()) – The data sets being used for training, validation, testing, and/or calibration, wrapped in a Data instance
- hardware (Hardware, default: Hardware()) – Hardware instance holding info about the target hardware being used
- loggers (Optional[LoggerManager], default: None) – LoggerManager instance holding all the loggers to log
- model_log_cadence (Optional[float], default: None) – The cadence to log model information w.r.t. epochs. If 1, logs every epoch. If 2, logs every other epoch, etc. Default is 1.
Methods:
- update – Update the state with the given parameters.
Attributes:
- compression_ready (bool) – Check if the model and optimizer are set for compression.
compression_ready property
Check if the model and optimizer are set for compression.
Returns:
- bool – True if model and optimizer are set, False otherwise
update
update(
model: Any = None,
teacher_model: Any = None,
optimizer: Any = None,
attach_optim_callbacks: bool = True,
train_data: Any = None,
val_data: Any = None,
test_data: Any = None,
calib_data: Any = None,
copy_data: bool = True,
start: float = None,
steps_per_epoch: int = None,
batches_per_step: int = None,
loggers: Union[
None, LoggerManager, List[BaseLogger]
] = None,
model_log_cadence: Optional[float] = None,
**kwargs,
) -> Dict
Update the state with the given parameters.
Parameters:
- model (Any, default: None) – The model to update the state with
- teacher_model (Any, default: None) – The teacher model to update the state with
- optimizer (Any, default: None) – The optimizer to update the state with
- attach_optim_callbacks (bool, default: True) – Whether or not to attach optimizer callbacks
- train_data (Any, default: None) – The training data to update the state with
- val_data (Any, default: None) – The validation data to update the state with
- test_data (Any, default: None) – The testing data to update the state with
- calib_data (Any, default: None) – The calibration data to update the state with
- copy_data (bool, default: True) – Whether or not to copy the data
- start (float, default: None) – The start index to update the state with
- steps_per_epoch (int, default: None) – The steps per epoch to update the state with
- batches_per_step (int, default: None) – The batches per step to update the state with
- loggers (Union[None, LoggerManager, List[BaseLogger]], default: None) – The metrics manager to set up logging of important info and milestones; also accepts a list of BaseLogger(s)
- model_log_cadence (Optional[float], default: None) – The cadence to log model information w.r.t. epochs. If 1, logs every epoch. If 2, logs every other epoch, etc. Default is 1.
- kwargs – Additional keyword arguments to update the state with
Returns:
- Dict – The updated state as a dictionary
Source code in llmcompressor/core/state.py
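A brief, self-contained sketch of populating the state before compression; the import path is an assumption, and the toy model stands in for the model being compressed:

import torch
from llmcompressor.core import State

# Toy model/optimizer to illustrate update() and compression_ready.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

state = State()
state.update(model=model, optimizer=optimizer, steps_per_epoch=100)
assert state.compression_ready  # True once both model and optimizer are set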
center_embeddings
Shift each embedding to have a mean of zero
Parameters:
- embedding (Module) – embedding module containing embeddings to center
Source code in llmcompressor/modeling/fuse.py
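Conceptually, centering subtracts each embedding vector's mean over the hidden dimension; a standalone sketch of the idea (not the library's exact implementation):

import torch

def center_embeddings_sketch(embedding: torch.nn.Embedding) -> None:
    # Shift each embedding vector so its mean across the hidden dim is zero.
    with torch.no_grad():
        embedding.weight -= embedding.weight.mean(dim=-1, keepdim=True)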
fuse_norm_linears
Fuse the scaling operation of a norm layer into subsequent linear layers. This is useful for ensuring transform invariance between norm and linear layers.
Note that unitary transforms (rotation) commute with normalization, but not scaling
Parameters:
- norm (Module) – norm layer whose weight will be fused into subsequent linears
- linears (Iterable[Linear]) – linear layers which directly follow the norm layer
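The fusion folds the norm's learned elementwise scale into the weights of the following linear layers and resets the norm to unit scale; a conceptual sketch assuming RMSNorm/LayerNorm-style modules with a weight vector (not the library's exact implementation):

import torch

def fuse_norm_linears_sketch(norm, linears) -> None:
    # y = W @ (g * x) == (W * diag(g)) @ x, so the norm weight g can be folded
    # into each downstream linear's input columns, then reset to ones.
    with torch.no_grad():
        for linear in linears:
            linear.weight *= norm.weight  # scales the input (last) dimension
        norm.weight.fill_(1.0)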