llmcompressor.entrypoints.oneshot
Oneshot compression entrypoint for post-training model optimization.
Provides the main oneshot compression entry point for applying quantization, pruning, and other compression techniques to pre-trained models without additional training. Supports calibration-based compression with various pipeline configurations for efficient model optimization.
Classes:
- Oneshot – Class responsible for carrying out one-shot calibration on a pretrained model.
Functions:
- oneshot – Performs oneshot calibration on a model.
Oneshot
Class responsible for carrying out one-shot calibration on a pretrained model.
This class handles the entire lifecycle of one-shot calibration, including preprocessing (model and tokenizer/processor initialization), model optimization (quantization or sparsification), and postprocessing (saving outputs). The instructions for model optimization are specified in a recipe.
Input Keyword Arguments:
kwargs are parsed into:
- model_args: Arguments for loading and configuring a pretrained model (e.g., AutoModelForCausalLM).
- dataset_args: Arguments for dataset-related configurations, such as calibration dataloaders.
- recipe_args: Arguments for defining and configuring recipes that specify optimization actions.
Parsers are defined in src/llmcompressor/args/.
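One way to route a flat set of keyword arguments into the three argument groups is to match each key against the fields that each group declares. The sketch below assumes plain dataclasses; ModelArgs, DatasetArgs, RecipeArgs, and split_kwargs are hypothetical stand-ins carrying only a few fields each, not the parsers in src/llmcompressor/args/.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Tuple


@dataclass
class ModelArgs:
    """Hypothetical subset of the library's ModelArguments."""

    model: str
    precision: str = "auto"


@dataclass
class DatasetArgs:
    """Hypothetical subset of the library's DatasetArguments."""

    dataset: Optional[str] = None
    num_calibration_samples: int = 512


@dataclass
class RecipeArgs:
    """Hypothetical subset of the library's RecipeArguments."""

    recipe: Optional[str] = None
    stage: Optional[str] = None


def split_kwargs(**kwargs: Any) -> Tuple[ModelArgs, DatasetArgs, RecipeArgs]:
    """Route each keyword argument to the dataclass that declares a field with that name."""
    groups = (ModelArgs, DatasetArgs, RecipeArgs)
    buckets: Dict[type, Dict[str, Any]] = {cls: {} for cls in groups}
    for key, value in kwargs.items():
        owner = next(
            (cls for cls in groups if key in {f.name for f in fields(cls)}), None
        )
        if owner is None:
            # Unknown keys are rejected rather than silently dropped.
            raise TypeError(f"unexpected keyword argument: {key!r}")
        buckets[owner][key] = value
    # Unspecified fields fall back to each dataclass's defaults.
    return tuple(cls(**buckets[cls]) for cls in groups)
```

A call such as split_kwargs(model="org/model", stage="quant") yields a ModelArgs with the model set, a DatasetArgs with its defaults, and a RecipeArgs with the stage set.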
Lifecycle Overview: The oneshot calibration lifecycle consists of three steps:
- Preprocessing:
  - Instantiates a pretrained model and tokenizer/processor.
  - Ensures input and output embedding layers are untied if they share tensors.
  - Patches the model to include additional functionality for saving with quantization configurations.
- Oneshot Calibration:
  - Optimizes the model using a global CompressionSession and applies recipe-defined modifiers (e.g., GPTQModifier, SparseGPTModifier).
- Postprocessing:
  - Saves the model, tokenizer/processor, and configuration to the specified output_dir.
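The ordering of the three phases can be sketched with a toy class. ToyOneshot is hypothetical: it only records which phase ran and does not load or modify a real model. It assumes, as the method descriptions for this class state, that preprocessing runs at construction time and that calibration and postprocessing run when the object is called.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyOneshot:
    """Hypothetical stand-in that records the three lifecycle phases in order."""

    output_dir: Optional[str] = None
    events: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Phase 1: preprocessing happens when the object is constructed.
        self.events.append("preprocess: load model and tokenizer/processor")
        self.events.append("preprocess: untie shared embeddings")
        self.events.append("preprocess: patch save for quantization config")

    def __call__(self) -> List[str]:
        # Phase 2: calibration applies the recipe's modifiers.
        self.events.append("calibrate: apply recipe modifiers")
        # Phase 3: postprocessing saves only when an output directory is given.
        if self.output_dir is not None:
            self.events.append(f"postprocess: save to {self.output_dir}")
        return self.events
```

Calling ToyOneshot(output_dir="out")() returns the preprocessing events first and ends with the save event; with output_dir left as None, no save event is recorded.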
Usage:
Methods:
- __init__(**kwargs): Initializes the Oneshot object by parsing input arguments, performing preprocessing, and setting instance attributes.
- __call__(**kwargs): Performs the one-shot calibration process by preparing a calibration dataloader, applying recipe modifiers to the model, and executing postprocessing steps.
- save(): Saves the calibrated model and tokenizer/processor to the specified `output_dir`. Supports saving in compressed formats based on model arguments.
- apply_recipe_modifiers(calibration_dataloader, recipe_stage=None): Applies lifecycle actions (e.g., `initialize`, `finalize`) using modifiers defined in the recipe. Each action is executed via the global `CompressionSession`.
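The order implied by apply_recipe_modifiers (initialize every modifier, run calibration data through the model, then finalize) can be sketched as below. RecordingModifier, ToySession, and apply_recipe_modifiers_sketch are hypothetical; the real CompressionSession and calibration pipelines may expose additional lifecycle events, so treat this only as an illustration of the ordering.

```python
from typing import Callable, Iterable, List


class RecordingModifier:
    """Hypothetical modifier that records the lifecycle actions it receives."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def initialize(self) -> None:
        self.log.append(f"{self.name}.initialize")

    def finalize(self) -> None:
        self.log.append(f"{self.name}.finalize")


class ToySession:
    """Hypothetical stand-in for a global compression session."""

    def __init__(self, modifiers: List[RecordingModifier]) -> None:
        self.modifiers = modifiers

    def initialize(self) -> None:
        for modifier in self.modifiers:
            modifier.initialize()

    def finalize(self) -> None:
        for modifier in self.modifiers:
            modifier.finalize()


def apply_recipe_modifiers_sketch(
    session: ToySession,
    calibration_batches: Iterable[int],
    forward: Callable[[int], None],
) -> None:
    # Every modifier is initialized before any calibration data is seen.
    session.initialize()
    # Calibration: run each batch through the model so modifiers can observe it.
    for batch in calibration_batches:
        forward(batch)
    # Finalize all modifiers once calibration is complete.
    session.finalize()
```

With two modifiers and three batches, the recorded order is both initialize calls, the three forward passes, then both finalize calls.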
Initializes the Oneshot class with provided arguments.
Parses the input keyword arguments into model_args, dataset_args, and recipe_args. Performs preprocessing to initialize the model and tokenizer/processor.
Parameters:
- model_args – ModelArguments parameters, responsible for controlling model loading and saving logic.
- dataset_args – DatasetArguments parameters, responsible for controlling dataset loading, preprocessing, and dataloader creation.
- recipe_args – RecipeArguments parameters, holding recipe-related settings.
- output_dir – Path to save the output model after carrying out oneshot.
- log_dir (Optional[str], default: 'sparse_logs') – Path to save logs during the oneshot run. Nothing is logged to file if None.
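The log_dir contract (write logs to a file under the directory, or skip file logging when it is None) can be expressed with the standard logging module. The function name, the logger name oneshot_sketch, and the oneshot.log file name are assumptions for illustration, not what the library actually writes.

```python
import logging
import os
from typing import Optional


def configure_file_logging(
    log_dir: Optional[str], filename: str = "oneshot.log"
) -> Optional[str]:
    """Attach a file handler under log_dir and return the log path, or None if disabled."""
    if log_dir is None:
        # Matches the documented contract: nothing is logged to file.
        return None
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, filename)
    logger = logging.getLogger("oneshot_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.FileHandler(path))
    return path
```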
Methods:
- apply_recipe_modifiers – Applies recipe modifiers to the model during the lifecycle.
Source code in llmcompressor/entrypoints/oneshot.py
apply_recipe_modifiers
apply_recipe_modifiers(
calibration_dataloader: Optional[DataLoader],
recipe_stage: Optional[str] = None,
)
Applies recipe modifiers to the model during the lifecycle.
The modifiers are defined in the recipe and executed via lifecycle actions (initialize, finalize) through the global CompressionSession.
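A recipe can group modifiers into named stages, and recipe_stage selects which one to apply. The sketch models a recipe as a plain dict from stage name to modifier names; that representation, the helper name select_stage_modifiers, and the choice to run every stage when recipe_stage is None are all assumptions, not documented library behavior.

```python
from typing import Dict, List, Optional


def select_stage_modifiers(
    recipe: Dict[str, List[str]], recipe_stage: Optional[str] = None
) -> List[str]:
    """Return the modifier names for one stage, or for every stage if recipe_stage is None."""
    if recipe_stage is None:
        # Assumption: with no stage given, every stage runs in declaration order.
        return [name for stage in recipe.values() for name in stage]
    if recipe_stage not in recipe:
        raise KeyError(f"unknown recipe stage: {recipe_stage!r}")
    return list(recipe[recipe_stage])
```

For a recipe such as {"sparsity_stage": ["SparseGPTModifier"], "quant_stage": ["GPTQModifier"]}, passing recipe_stage="quant_stage" selects only the GPTQModifier entry.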
Source code in llmcompressor/entrypoints/oneshot.py
oneshot
oneshot(
model: Union[str, PreTrainedModel],
distill_teacher: Optional[str] = None,
config_name: Optional[str] = None,
tokenizer: Optional[
Union[str, PreTrainedTokenizerBase]
] = None,
processor: Optional[Union[str, ProcessorMixin]] = None,
cache_dir: Optional[str] = None,
use_auth_token: bool = False,
precision: str = "auto",
tie_word_embeddings: bool = False,
trust_remote_code_model: bool = False,
save_compressed: bool = True,
model_revision: str = "main",
recipe: Optional[Union[str, List[str]]] = None,
recipe_args: Optional[List[str]] = None,
clear_sparse_session: bool = False,
stage: Optional[str] = None,
dataset: Optional[
Union[str, Dataset, DatasetDict]
] = None,
dataset_config_name: Optional[str] = None,
dataset_path: Optional[str] = None,
num_calibration_samples: int = 512,
shuffle_calibration_samples: bool = True,
max_seq_length: int = 384,
pad_to_max_length: bool = True,
text_column: str = "text",
concatenate_data: bool = False,
streaming: bool = False,
overwrite_cache: bool = False,
preprocessing_num_workers: Optional[int] = None,
min_tokens_per_module: Optional[float] = None,
calibrate_moe_context: bool = False,
quantization_aware_calibration: bool = True,
output_dir: Optional[str] = None,
log_dir: Optional[str] = "sparse_logs",
**kwargs
) -> PreTrainedModel
Performs oneshot calibration on a model.
Parameters:
Model arguments
- model (Union[str, PreTrainedModel]) – A pretrained model identifier from huggingface.co/models, a path to a local model, or an already-instantiated PreTrainedModel. Required parameter.
- distill_teacher (Optional[str], default: None) – Teacher model (a trained text generation model) for distillation.
- config_name (Optional[str], default: None) – Pretrained config name or path, if different from model.
- tokenizer (Optional[Union[str, PreTrainedTokenizerBase]], default: None) – Pretrained tokenizer name, path, or instance, if different from model.
- processor (Optional[Union[str, ProcessorMixin]], default: None) – Pretrained processor name, path, or instance, if different from model.
- cache_dir (Optional[str], default: None) – Where to store pretrained models and data downloaded from huggingface.co.
- use_auth_token (bool, default: False) – Whether to use a Hugging Face auth token for private models.
- precision (str, default: 'auto') – Precision to cast model weights to.
- tie_word_embeddings (bool, default: False) – Whether the model's input and output word embeddings should be tied.
- trust_remote_code_model (bool, default: False) – Whether to allow custom models to execute their own modeling files.
- save_compressed (bool, default: True) – Whether to compress sparse models during save.
- model_revision (str, default: 'main') – The specific model version to use (a branch name, tag, or commit id).
Recipe arguments
- recipe (Optional[Union[str, List[str]]], default: None) – Path to an LLM Compressor sparsification recipe.
- recipe_args (Optional[List[str]], default: None) – List of recipe arguments to evaluate, in the format "key1=value1", "key2=value2".
- clear_sparse_session (bool, default: False) – Whether to clear CompressionSession/CompressionLifecycle data between runs.
- stage (Optional[str], default: None) – The stage of the recipe to use for oneshot.
Dataset arguments
- dataset (Optional[Union[str, Dataset, DatasetDict]], default: None) – The name of a dataset to load via the datasets library, or an already-loaded Dataset or DatasetDict.
- dataset_config_name (Optional[str], default: None) – The configuration name of the dataset to use.
- dataset_path (Optional[str], default: None) – Path to a custom dataset. Supports json, csv, dvc.
- num_calibration_samples (int, default: 512) – Number of samples to use for one-shot calibration.
- shuffle_calibration_samples (bool, default: True) – Whether to shuffle the dataset before calibration.
- max_seq_length (int, default: 384) – Maximum total input sequence length after tokenization.
- pad_to_max_length (bool, default: True) – Whether to pad all samples to max_seq_length.
- text_column (str, default: 'text') – Key to use as the text input to the tokenizer/processor.
- concatenate_data (bool, default: False) – Whether to concatenate datapoints to fill max_seq_length.
- streaming (bool, default: False) – True to stream data from a cloud dataset.
- overwrite_cache (bool, default: False) – Whether to overwrite the cached preprocessed datasets.
- preprocessing_num_workers (Optional[int], default: None) – Number of processes to use for preprocessing.
- min_tokens_per_module (Optional[float], default: None) – Minimum percentage of tokens per module, relevant for MoE models.
- calibrate_moe_context (bool, default: False) – Whether to enable the MoE calibration context for the given model during calibration. This usually involves updating all MoE modules in the model for the duration of calibration.
- quantization_aware_calibration (bool, default: True) – Whether to enable quantization-aware calibration in the sequential pipeline. When True, quantization is applied during the forward pass in calibration; when False, it is disabled during the forward pass.
Miscellaneous arguments
- output_dir (Optional[str], default: None) – Path to save the output model after calibration. Nothing is saved if None.
- log_dir (Optional[str], default: 'sparse_logs') – Path to save logs during the oneshot run. Nothing is logged to file if None.
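Several dataset arguments interact when calibration samples are prepared: how many samples are kept, whether they are shuffled first, and whether each one is truncated and padded to max_seq_length or packed together with concatenate_data. The sketch below works on already-tokenized lists of token ids. The function name, pad_token_id, seed, and the order of the steps are assumptions; the library's actual preprocessing may differ, for example in how it treats the trailing partial chunk, which this sketch drops.

```python
import random
from typing import List, Optional


def prepare_calibration_samples(
    samples: List[List[int]],
    num_calibration_samples: int = 512,
    shuffle_calibration_samples: bool = True,
    max_seq_length: int = 384,
    pad_to_max_length: bool = True,
    concatenate_data: bool = False,
    pad_token_id: int = 0,  # hypothetical; real pad ids come from the tokenizer
    seed: Optional[int] = None,
) -> List[List[int]]:
    """Select, then truncate/pad or concatenate, already-tokenized samples."""
    pool = list(samples)
    if shuffle_calibration_samples:
        random.Random(seed).shuffle(pool)
    pool = pool[:num_calibration_samples]

    if concatenate_data:
        # Pack all tokens back to back and cut into full-length chunks;
        # a trailing chunk shorter than max_seq_length is dropped.
        flat = [tok for sample in pool for tok in sample]
        return [
            flat[i : i + max_seq_length]
            for i in range(0, len(flat) - max_seq_length + 1, max_seq_length)
        ]

    prepared = []
    for sample in pool:
        ids = sample[:max_seq_length]  # truncate long samples
        if pad_to_max_length:
            ids = ids + [pad_token_id] * (max_seq_length - len(ids))
        prepared.append(ids)
    return prepared
```

With samples [[1, 2, 3], [4, 5, 6, 7, 8]], shuffling off, and max_seq_length=4, the default path gives [[1, 2, 3, 0], [4, 5, 6, 7]], while concatenate_data=True gives [[1, 2, 3, 4], [5, 6, 7, 8]].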
Returns:
- PreTrainedModel – The calibrated PreTrainedModel.
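The recipe_args parameter takes strings of the form "key1=value1". A minimal parser for that format is sketched below; parse_recipe_args is hypothetical and keeps every value as a string, whereas how the library evaluates these values and applies them to the recipe is not shown here.

```python
from typing import Dict, List, Optional


def parse_recipe_args(recipe_args: Optional[List[str]]) -> Dict[str, str]:
    """Turn ["key1=value1", "key2=value2"] into {"key1": "value1", "key2": "value2"}."""
    parsed: Dict[str, str] = {}
    for item in recipe_args or []:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected 'key=value', got {item!r}")
        parsed[key.strip()] = value.strip()
    return parsed
```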
Source code in llmcompressor/entrypoints/oneshot.py