llmcompressor.transformers.finetune.data.base
Base classes for text generation dataset handling and processing.
This module provides the foundational TextGenerationDataset class with registry support for different dataset types. It handles dataset loading, preprocessing, tokenization, and text-generation-specific formatting for fine-tuning workflows.
Classes:

- TextGenerationDataset – Base class for text datasets; applies a sequence of transformations to prepare a dataset for loading by a dataloader.

TextGenerationDataset
Bases: RegistryMixin
Base class for text datasets. Applies the following transformations, in order, to prepare the dataset to be loaded by a dataloader:

- Load the dataset from Hugging Face or a local cache
- Preprocess the dataset according to a preprocess function or chat/dataset template
- Tokenize the dataset using the model tokenizer/processor
- Apply post-processing, such as grouping text and/or adding labels for finetuning
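The four stages above can be sketched as a plain pipeline. This is a hypothetical illustration with toy stand-ins for each stage, not the class's actual implementation (which delegates to Hugging Face `datasets` and the model's tokenizer):

```python
# Toy sketch of the four preparation stages; names are illustrative only.
def prepare_dataset(raw_rows, preprocess_fn, tokenize_fn, postprocess_fn):
    rows = [preprocess_fn(r) for r in raw_rows]   # preprocess (template/function)
    rows = [tokenize_fn(r) for r in rows]         # tokenize
    return postprocess_fn(rows)                   # post-process (e.g. add labels)

# Hypothetical stand-ins: lowercase text, "tokenize" by word lengths,
# then copy input_ids to labels for finetuning.
raw = [{"text": "Hello World"}]
pre = lambda r: {"text": r["text"].lower()}
tok = lambda r: {"input_ids": [len(w) for w in r["text"].split()]}
post = lambda rows: [{**r, "labels": r["input_ids"]} for r in rows]

prepared = prepare_dataset(raw, pre, tok, post)
# prepared[0] == {"input_ids": [5, 5], "labels": [5, 5]}
```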
Parameters:

- dataset_args (DatasetArguments) – configuration settings for dataset loading
- split (str) – split from dataset to load, for instance `test` or `train[:5%]`
- processor (Processor) – processor or tokenizer to use on dataset
Methods:

- load_dataset – Load the raw dataset from Hugging Face, using a cached copy if available.
- map – Wrapper function around Dataset.map and IterableDataset.map.
Attributes:

- preprocess (Union[Callable[[LazyRow], Any], None]) – The function must return keys which correspond to processor/tokenizer kwargs, optionally including PROMPT_KEY.
Source code in llmcompressor/transformers/finetune/data/base.py
preprocess (cached property)

The function must return keys which correspond to processor/tokenizer kwargs, optionally including PROMPT_KEY.
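A minimal sketch of what such a preprocess function might look like. The row fields (`question`, `answer`) are hypothetical, and `PROMPT_KEY` is stood in by a local constant rather than the module's own:

```python
PROMPT_KEY = "prompt"  # assumption: stand-in for the module's PROMPT_KEY constant

def preprocess(row):
    # Returns keys matching tokenizer kwargs (here `text`), plus the optional
    # PROMPT_KEY so the prompt portion can be identified downstream.
    prompt = f"Question: {row['question']} Answer:"
    return {"text": f"{prompt} {row['answer']}", PROMPT_KEY: prompt}

out = preprocess({"question": "2+2?", "answer": "4"})
# out == {"text": "Question: 2+2? Answer: 4", "prompt": "Question: 2+2? Answer:"}
```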
load_dataset

Load the raw dataset from Hugging Face, using a cached copy if available.

Parameters:

- cache_dir – disk location to search for a cached dataset

Returns:

- the requested dataset
Source code in llmcompressor/transformers/finetune/data/base.py
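The cache-first behavior can be sketched as follows. This is a simplified, hypothetical model using JSON files on disk; the real method defers to the `datasets` library's own caching:

```python
import json
import os
import tempfile

def fetch(name):
    # Stand-in for the real download (datasets.load_dataset)
    return [{"text": f"fresh copy of {name}"}]

def load_raw_dataset(name, cache_dir=None):
    # Return the copy cached on disk under `cache_dir` if one exists,
    # otherwise fall back to a fresh fetch.
    if cache_dir is not None:
        cached = os.path.join(cache_dir, f"{name}.json")
        if os.path.exists(cached):
            with open(cached) as f:
                return json.load(f)
    return fetch(name)

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "demo.json"), "w") as f:
        json.dump([{"text": "cached copy"}], f)
    hit = load_raw_dataset("demo", cache_dir=d)    # served from cache
    miss = load_raw_dataset("other", cache_dir=d)  # falls back to fetch
```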
map

    map(
        dataset: Union[Dataset, IterableDataset],
        function: Callable[[Any], Any],
        **kwargs,
    ) -> Union[Dataset, IterableDataset]

Wrapper function around Dataset.map and IterableDataset.map.

If the dataset is streaming (in the case of IterableDataset), non-applicable arguments are ignored and the dataset features are resolved.
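The kwarg-filtering idea can be illustrated with toy stand-ins for the two dataset types. The streaming check here (absence of `num_rows`) and the dropped argument names are assumptions for illustration, not the wrapper's actual logic:

```python
from typing import Any, Callable

class ToyDataset:
    # Minimal stand-in for a map-style Dataset (has num_rows,
    # accepts eager-only kwargs like num_proc)
    def __init__(self, rows):
        self.rows = rows
        self.num_rows = len(rows)

    def map(self, function, num_proc=None, desc=None):
        return ToyDataset([function(r) for r in self.rows])

class ToyIterableDataset:
    # Minimal stand-in for a streaming IterableDataset (no num_rows;
    # its map() rejects eager-only kwargs)
    def __init__(self, rows):
        self.rows = rows

    def map(self, function):
        return ToyIterableDataset([function(r) for r in self.rows])

def safe_map(dataset, function: Callable[[Any], Any], **kwargs):
    # Drop kwargs a streaming dataset's map() does not accept
    # (duck-typed streaming check is an assumption for this sketch)
    if not hasattr(dataset, "num_rows"):
        for key in ("num_proc", "desc", "load_from_cache_file"):
            kwargs.pop(key, None)
    return dataset.map(function, **kwargs)

# num_proc is silently dropped for the streaming dataset instead of raising
streamed = safe_map(ToyIterableDataset([{"x": 1}]),
                    lambda r: {"x": r["x"] + 1}, num_proc=4)
```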