UltraChatDataset(
dataset_args: DatasetArguments,
split: str,
processor: Processor,
)
Bases: TextGenerationDataset
Child text-generation dataset class for the UltraChat 200k dataset.
Parameters:
- dataset_args (DatasetArguments) – configuration settings for dataset loading
- split (str) – split from dataset to load, for instance "test" or "train[:5%]"
- processor (Processor) – processor or tokenizer to use on dataset
Source code in llmcompressor/transformers/finetune/data/ultrachat_200k.py
def __init__(
    self, dataset_args: "DatasetArguments", split: str, processor: Processor
):
    dataset_args = deepcopy(dataset_args)
    dataset_args.dataset = "HuggingFaceH4/ultrachat_200k"
    dataset_args.text_column = "messages"

    if split in ["train", "test"]:
        split += "_sft"

    super().__init__(dataset_args=dataset_args, split=split, processor=processor)

    if (
        self.tokenizer is not None
        and getattr(self.tokenizer, "chat_template", None) is None
    ):
        # note that since tokenizer is a member of processor,
        # this change affects processor.apply_chat_template
        self.tokenizer.chat_template = self.DEFAULT_CHAT_TEMPLATE
        logger.warning(
            "tokenizer.chat_template is not set, using default chat template for "
            f"{self.__class__.__name__}"
        )
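The split handling in `__init__` can be sketched standalone: plain `"train"` and `"test"` are mapped to the dataset's actual `"train_sft"`/`"test_sft"` split names, while slice syntax such as `"train[:5%]"` is passed through unchanged (note this means a sliced split is *not* rewritten, so slices should target the `_sft` names directly). The helper name below is hypothetical, for illustration only:

```python
def resolve_ultrachat_split(split: str) -> str:
    # Mirrors the split renaming performed by UltraChatDataset.__init__:
    # bare "train"/"test" become the "_sft" splits that ultrachat_200k
    # actually ships; any other split string is returned as-is.
    if split in ["train", "test"]:
        split += "_sft"
    return split
```

For example, `resolve_ultrachat_split("train")` yields `"train_sft"`, while `resolve_ultrachat_split("train_sft[:5%]")` is returned unchanged.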