llmcompressor.args.dataset_arguments

Dataset argument classes for LLM compression workflows.

This module defines dataclass-based argument containers for configuring dataset loading, preprocessing, and calibration parameters across different dataset sources and processing pipelines. It supports several input sources, including HuggingFace datasets, custom JSON/CSV files, and DVC-managed datasets.

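Classes:

CustomDatasetArguments: arguments for training using custom datasets.
DVCDatasetArguments: arguments for training using DVC.
DatasetArguments: arguments describing the data given to the model for calibration and training.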

CustomDatasetArguments dataclass

CustomDatasetArguments(
    dvc_data_repository: Optional[str] = None,
    dataset_path: Optional[str] = None,
    text_column: str = "text",
    remove_columns: Union[None, str, List] = None,
    preprocessing_func: Union[None, str, Callable] = None,
    data_collator: Callable[[Any], Any] = (
        lambda: DefaultDataCollator()
    )(),
)

Bases: DVCDatasetArguments

Arguments for training using custom datasets
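
The signature above lists `dvc_data_repository` first because dataclass inheritance places base-class fields ahead of subclass fields. The sketch below illustrates this with stdlib-only stand-ins. The class names `DVCArgsSketch` and `CustomArgsSketch` are hypothetical, and `dict` stands in for `DefaultDataCollator` so that no third-party import is needed. It also assumes that the `(lambda: ...)()` defaults shown in the signature are the rendered form of a `default_factory`; this page does not confirm that.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Callable, List, Optional, Union


@dataclass
class DVCArgsSketch:
    # Hypothetical stand-in for DVCDatasetArguments.
    dvc_data_repository: Optional[str] = None


@dataclass
class CustomArgsSketch(DVCArgsSketch):
    # Hypothetical stand-in for CustomDatasetArguments.
    dataset_path: Optional[str] = None
    text_column: str = "text"
    remove_columns: Union[None, str, List] = None
    preprocessing_func: Union[None, str, Callable] = None
    # Assumption: the real class builds its collator through a factory.
    # A factory creates a fresh default object for every instance.
    data_collator: Any = field(default_factory=dict)


# "data/train.json" is an illustrative path, not one taken from the library.
args = CustomArgsSketch(dataset_path="data/train.json")

# Inherited fields come first, followed by the subclass's own fields.
field_names = [f.name for f in fields(args)]
print(field_names)
```

Using a factory rather than a shared default object means two instances never share the same mutable default, which is why dataclasses reject a bare mutable default such as `[]`.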

DVCDatasetArguments dataclass

DVCDatasetArguments(
    dvc_data_repository: Optional[str] = None,
)

Arguments for training using DVC

DatasetArguments dataclass

DatasetArguments(
    dvc_data_repository: Optional[str] = None,
    dataset_path: Optional[str] = None,
    text_column: str = "text",
    remove_columns: Union[None, str, List] = None,
    preprocessing_func: Union[None, str, Callable] = None,
    data_collator: Callable[[Any], Any] = (
        lambda: DefaultDataCollator()
    )(),
    dataset: Optional[str] = None,
    dataset_config_name: Optional[str] = None,
    max_seq_length: int = 384,
    concatenate_data: bool = False,
    raw_kwargs: Dict = dict(),
    splits: Union[None, str, List, Dict] = None,
    num_calibration_samples: Optional[int] = 512,
    calibrate_moe_context: bool = False,
    shuffle_calibration_samples: Optional[bool] = True,
    streaming: Optional[bool] = False,
    overwrite_cache: bool = False,
    preprocessing_num_workers: Optional[int] = None,
    pad_to_max_length: bool = True,
    max_train_samples: Optional[int] = None,
    min_tokens_per_module: Optional[float] = None,
    pipeline: Optional[str] = "independent",
    tracing_ignore: List[str] = (
        lambda: [
            "_update_causal_mask",
            "create_causal_mask",
            "make_causal_mask",
            "get_causal_mask",
            "mask_interface",
            "mask_function",
            "_prepare_4d_causal_attention_mask",
            "_prepare_fsmt_decoder_inputs",
            "_prepare_4d_causal_attention_mask_with_cache_position",
        ]
    )(),
    sequential_targets: Optional[List[str]] = None,
    quantization_aware_calibration: bool = True,
)

Bases: CustomDatasetArguments

Arguments describing the data passed to the model for calibration and training.

With HfArgumentParser, this class can be turned into argparse arguments so that each field can be set on the command line.
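
To make the dataclass-to-argparse idea concrete, here is a minimal stdlib-only sketch. It is a simplified illustration, not how `HfArgumentParser` is implemented. `CalibrationArgsSketch`, `build_parser`, and `parse_into_dataclass` are hypothetical names. The three fields and their defaults are copied from the `DatasetArguments` signature above, and the dataset name passed on the command line is only an example value.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CalibrationArgsSketch:
    # Hypothetical stand-in holding three fields from DatasetArguments.
    dataset: Optional[str] = None
    max_seq_length: int = 384
    num_calibration_samples: Optional[int] = 512


# Converter applied to the raw command-line string of each field.
CONVERTERS = {
    "dataset": str,
    "max_seq_length": int,
    "num_calibration_samples": int,
}


def build_parser(cls) -> argparse.ArgumentParser:
    """Create one --flag per dataclass field, reusing the field default."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(
            f"--{f.name}", type=CONVERTERS[f.name], default=f.default
        )
    return parser


def parse_into_dataclass(cls, argv):
    """Parse argv and build an instance of cls from the parsed values."""
    namespace = build_parser(cls).parse_args(argv)
    return cls(**vars(namespace))


parsed = parse_into_dataclass(
    CalibrationArgsSketch,
    ["--dataset", "open_platypus", "--num_calibration_samples", "256"],
)
print(parsed)
```

Flags that are not passed keep the dataclass defaults, so `max_seq_length` stays at 384 here. With `transformers` installed, the equivalent real-world call would be along the lines of `HfArgumentParser(DatasetArguments).parse_args_into_dataclasses()`, which derives the argument types from the field annotations instead of a hand-written converter table.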