Custom dataset implementation for JSON and CSV data sources.
This module provides a CustomDataset class for loading and processing local JSON and CSV files for text generation fine-tuning. It supports flexible data formats and custom preprocessing pipelines for user-provided datasets.
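For illustration, the sketch below writes a tiny CSV file and a JSON Lines file of the kind this module can load. The single `text` column is an assumption made for the example; the actual column names depend on your data and preprocessing.

```python
# Illustrative only: create a minimal local dataset on disk. The "text" column
# name is an assumption; use whatever fields your preprocessing expects.
import csv
import json

rows = [
    {"text": "LLM Compressor can load local CSV files."},
    {"text": "It can also load JSON / JSON Lines files."},
]

with open("my_dataset.csv", "w", newline="") as f:  # CSV variant
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerows(rows)

with open("my_dataset.jsonl", "w") as f:  # JSON Lines variant
    for row in rows:
        f.write(json.dumps(row) + "\n")
```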
Classes:

- CustomDataset – Child text generation class for custom local datasets, supporting loading from CSV and JSON files
CustomDataset(
    dataset_args: DatasetArguments,
    split: str,
    processor: Processor,
)
Bases: TextGenerationDataset
Child text generation class for custom local datasets, supporting loading from CSV and JSON files.
Parameters:

- dataset_args (DatasetArguments) – configuration settings for dataset loading
- split (str) – split of the dataset to load, for instance test or train[:5%]. Can also be set to None to load all splits
- processor (Processor) – processor or tokenizer to use on the dataset
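A minimal construction sketch, assuming `DatasetArguments` can point at a local file via its `dataset` field and that a Hugging Face tokenizer satisfies the `Processor` type; the import paths below are assumptions and may differ between llmcompressor versions:

```python
# Sketch only: import paths and the `dataset` field are assumptions, not taken
# from this page; max_seq_length / pad_to_max_length are the fields read by
# the __init__ shown below.
from transformers import AutoTokenizer

from llmcompressor.args import DatasetArguments  # assumed import path
from llmcompressor.transformers.finetune.data import CustomDataset  # assumed import path

dataset_args = DatasetArguments(
    dataset="my_dataset.csv",  # assumed: path to a local CSV or JSON file
    max_seq_length=512,
    pad_to_max_length=True,
)

processor = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# split uses Hugging Face split syntax, e.g. "train", "test", "train[:5%]",
# or None to load all splits
dataset = CustomDataset(
    dataset_args=dataset_args,
    split="train",
    processor=processor,
)
```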
Source code in llmcompressor/transformers/finetune/data/base.py
def __init__(
    self,
    dataset_args: DatasetArguments,
    split: str,
    processor: Processor,
):
    self.dataset_args = dataset_args
    self.split = split
    self.processor = processor

    # get tokenizer
    self.tokenizer = getattr(self.processor, "tokenizer", self.processor)

    if self.tokenizer is not None:
        # fill in pad token
        if not self.tokenizer.pad_token:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # configure sequence length
        max_seq_length = dataset_args.max_seq_length
        if dataset_args.max_seq_length > self.tokenizer.model_max_length:
            logger.warning(
                f"The max_seq_length passed ({max_seq_length}) is larger than "
                f"maximum length for model ({self.tokenizer.model_max_length}). "
                f"Using max_seq_length={self.tokenizer.model_max_length}."
            )
        self.max_seq_length = min(
            dataset_args.max_seq_length, self.tokenizer.model_max_length
        )

        # configure padding
        self.padding = (
            False
            if self.dataset_args.concatenate_data
            else "max_length"
            if self.dataset_args.pad_to_max_length
            else False
        )
    else:
        self.max_seq_length = None
        self.padding = False
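The nested conditional that sets `self.padding` above is equivalent to the plain if/else logic sketched here, and `self.max_seq_length` is simply the requested length clamped to the tokenizer's `model_max_length`; this is a restatement of the logic above, not additional library code:

```python
# Equivalent if/else form of the padding decision in __init__: concatenating
# samples disables padding entirely; otherwise pad to max_length only when
# pad_to_max_length is set.
def resolve_padding(concatenate_data: bool, pad_to_max_length: bool):
    if concatenate_data:
        return False
    if pad_to_max_length:
        return "max_length"
    return False

assert resolve_padding(True, True) is False
assert resolve_padding(False, True) == "max_length"
assert resolve_padding(False, False) is False

# The sequence length is the requested value, capped at the model's maximum:
assert min(4096, 2048) == 2048  # requested 4096, model_max_length 2048
```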