llmcompressor.args.dataset_arguments
Dataset argument classes for LLM compression workflows.
This module defines dataclass-based argument containers for configuring dataset loading, preprocessing, and calibration parameters across different dataset sources and processing pipelines. It supports several input sources, including HuggingFace datasets, custom JSON/CSV files, and DVC-managed datasets.
Classes:
- CustomDatasetArguments – Arguments for training using custom datasets
- DVCDatasetArguments – Arguments for training using DVC
- DatasetArguments – Arguments pertaining to the data fed to the model for calibration and training
CustomDatasetArguments dataclass
CustomDatasetArguments(
    dvc_data_repository: Optional[str] = None,
    dataset_path: Optional[str] = None,
    text_column: str = "text",
    remove_columns: Union[None, str, List] = None,
    preprocessing_func: Union[None, str, Callable] = None,
    data_collator: Callable[[Any], Any] = (
        lambda: DefaultDataCollator()
    )(),
)
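The `(lambda: DefaultDataCollator())()` default shown in the signature is most likely how the documentation renderer displays a per-instance default factory; that reading is an assumption, not something this page confirms. The sketch below illustrates the pattern with the standard library only. `ToyCollator` and `ToyArguments` are hypothetical stand-ins and are not part of llmcompressor.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ToyCollator:
    """Hypothetical stand-in for DefaultDataCollator."""

    def __call__(self, batch: Any) -> Any:
        # Return the batch unchanged; a real collator would build tensors.
        return batch


@dataclass
class ToyArguments:
    # default_factory is called once per instance, so each instance
    # gets its own collator object instead of a shared one.
    data_collator: Callable[[Any], Any] = field(
        default_factory=lambda: ToyCollator()
    )


a = ToyArguments()
b = ToyArguments()
print(a.data_collator is b.data_collator)
```

Passing `data_collator=...` explicitly when constructing the dataclass bypasses the factory, which is how a custom collator would be supplied under this pattern.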
DVCDatasetArguments dataclass
Arguments for training using DVC
DatasetArguments dataclass
DatasetArguments(
    dvc_data_repository: Optional[str] = None,
    dataset_path: Optional[str] = None,
    text_column: str = "text",
    remove_columns: Union[None, str, List] = None,
    preprocessing_func: Union[None, str, Callable] = None,
    data_collator: Callable[[Any], Any] = (
        lambda: DefaultDataCollator()
    )(),
    dataset: Optional[str] = None,
    dataset_config_name: Optional[str] = None,
    max_seq_length: int = 384,
    concatenate_data: bool = False,
    raw_kwargs: Dict = dict(),
    splits: Union[None, str, List, Dict] = None,
    num_calibration_samples: Optional[int] = 512,
    calibrate_moe_context: bool = False,
    shuffle_calibration_samples: Optional[bool] = True,
    streaming: Optional[bool] = False,
    overwrite_cache: bool = False,
    preprocessing_num_workers: Optional[int] = None,
    pad_to_max_length: bool = True,
    max_train_samples: Optional[int] = None,
    min_tokens_per_module: Optional[float] = None,
    pipeline: Optional[str] = "independent",
    tracing_ignore: List[str] = (
        lambda: [
            "_update_causal_mask",
            "create_causal_mask",
            "make_causal_mask",
            "get_causal_mask",
            "mask_interface",
            "mask_function",
            "_prepare_4d_causal_attention_mask",
            "_prepare_fsmt_decoder_inputs",
            "_prepare_4d_causal_attention_mask_with_cache_position",
        ]
    )(),
    sequential_targets: Optional[List[str]] = None,
    quantization_aware_calibration: bool = True,
)
Bases: CustomDatasetArguments
Arguments pertaining to the data fed to the model for calibration and training.
Using HfArgumentParser, this class can be turned into argparse arguments so that its fields can be specified on the command line.
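To make the dataclass-to-command-line idea concrete, here is a minimal stdlib-only sketch of the mechanism. It does not use HfArgumentParser, which handles many more cases (booleans, lists, help text). `ToyDatasetArguments`, `build_parser`, and `parse_dataset_args` are hypothetical names introduced for illustration; only the three field names and their defaults are taken from the signature above.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ToyDatasetArguments:
    """Hypothetical stand-in with three fields borrowed from DatasetArguments."""

    dataset: Optional[str] = None
    max_seq_length: int = 384
    num_calibration_samples: int = 512


def build_parser(cls) -> argparse.ArgumentParser:
    """Create one --<field_name> option per dataclass field."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        # Only int and string fields exist in this toy class.
        converter = int if f.type is int else str
        parser.add_argument(f"--{f.name}", type=converter, default=f.default)
    return parser


def parse_dataset_args(argv: List[str]) -> ToyDatasetArguments:
    """Parse argv and rebuild the dataclass from the resulting namespace."""
    namespace = build_parser(ToyDatasetArguments).parse_args(argv)
    return ToyDatasetArguments(**vars(namespace))


args = parse_dataset_args(
    ["--dataset", "wikitext", "--num_calibration_samples", "128"]
)
print(args)
```

Fields that are not given on the command line keep their dataclass defaults, so `max_seq_length` stays at 384 in this call.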