llmcompressor.transformers.finetune.data.peoples_speech

Classes:

PeoplesSpeech

PeoplesSpeech(
    dataset_args: DataTrainingArguments,
    split: str,
    processor: Processor,
)

Bases: TextGenerationDataset

ML Commons People's Speech audio dataset.

Due to the specialized nature of audio model preprocessing, some model-specific code must be defined here. This dataset has been tested with the WhisperForConditionalGeneration and Qwen2AudioForConditionalGeneration model classes.

Parameters:

  • dataset_args

    (DataTrainingArguments) –

    configuration settings for dataset loading

  • split

    (str) –

    split from dataset to load, for instance test or train[:5%]

  • processor

    (Processor) –

    processor or tokenizer to use on dataset

Source code in llmcompressor/transformers/finetune/data/peoples_speech.py
def __init__(self, dataset_args: "DataArgs", split: str, processor: Processor):
    dataset_args = deepcopy(dataset_args)
    dataset_args.dataset = "MLCommons/peoples_speech"
    dataset_args.dataset_config_name = "test"
    if not dataset_args.overwrite_cache:
        logger.warning(
            "Because audio processors are more complex, dataset mapping functions "
            "vary with model architecture and their results cannot be cached. "
            "Setting overwrite_cache=True"
        )
        dataset_args.overwrite_cache = True
    self.processor_type = processor.__class__.__name__

    super().__init__(dataset_args=dataset_args, split=split, processor=processor)
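The cache override in __init__ can be sketched in isolation: because mapped audio features depend on the processor's model architecture, a cached mapping produced for one model could be silently reused for another, so the constructor forces overwrite_cache on. The DatasetArgs dataclass and configure_audio_dataset helper below are illustrative stand-ins, not llmcompressor APIs:

```python
from copy import deepcopy
from dataclasses import dataclass


@dataclass
class DatasetArgs:
    # Minimal stand-in for the real dataset arguments object
    dataset: str = ""
    dataset_config_name: str = ""
    overwrite_cache: bool = False


def configure_audio_dataset(dataset_args: DatasetArgs) -> DatasetArgs:
    # Work on a copy so the caller's arguments are not mutated,
    # mirroring the deepcopy in the constructor above
    dataset_args = deepcopy(dataset_args)
    dataset_args.dataset = "MLCommons/peoples_speech"
    dataset_args.dataset_config_name = "test"
    if not dataset_args.overwrite_cache:
        # Mapping output varies with the processor class, so stale
        # cache entries from another model must be discarded
        dataset_args.overwrite_cache = True
    return dataset_args


original = DatasetArgs()
configured = configure_audio_dataset(original)
print(configured.overwrite_cache)  # True: caching is force-disabled
print(original.overwrite_cache)    # False: caller's copy is untouched
```

The deepcopy is the notable design choice: the override is local to this dataset, so a caller reusing the same arguments object for a text dataset afterwards still gets its original caching behavior.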