llmcompressor.transformers.finetune.data.data_helpers
Functions:

- get_custom_datasets_from_path – Get a dictionary of custom datasets from a directory path. Supports HF's load_dataset.
- get_raw_dataset – Load the raw dataset from Hugging Face, using a cached copy if available.
get_custom_datasets_from_path

Get a dictionary of custom datasets from a directory path. Supports HF's load_dataset for local folder datasets: https://huggingface.co/docs/datasets/loading

This function scans the specified directory path for files with a given extension (default '.json'). It builds a dictionary whose keys are either subdirectory names or direct dataset names (depending on the directory structure) and whose values are either single file paths (if only one file exists with that name) or lists of file paths (if multiple files exist).
Parameters:

- path (str) – The path to the directory containing the dataset files.
- ext (str, default: 'json') – The file extension to filter files by.

Returns:

- Dict[str, str] – A dictionary mapping dataset names to their file paths or lists of file paths.

Example:

dataset = get_custom_datasets_from_path("/path/to/dataset/directory", "json")

Note: If datasets are organized in subdirectories, the function builds the dictionary with lists of file paths; datasets found directly in the main directory are included under their own names. Accepted layouts:

- path/
    - train.json
    - test.json
    - val.json

- path/
    - train/
        - data1.json
        - data2.json
        - ...
    - test/
        - ...
    - val/
        - ...
Source code in llmcompressor/transformers/finetune/data/data_helpers.py
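The scanning behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation; the name `collect_custom_datasets` is hypothetical:

```python
import os

def collect_custom_datasets(path: str, ext: str = "json") -> dict:
    # Sketch: scan `path` for dataset files, mirroring the documented
    # behavior. Flat files map name -> path; subdirectories map
    # name -> list of contained paths.
    data_files = {}
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isfile(full) and entry.endswith("." + ext):
            # flat layout: "train.json" -> {"train": ".../train.json"}
            data_files[os.path.splitext(entry)[0]] = full
        elif os.path.isdir(full):
            # nested layout: "train/" -> {"train": [".../data1.json", ...]}
            files = [
                os.path.join(full, f)
                for f in sorted(os.listdir(full))
                if f.endswith("." + ext)
            ]
            if files:
                data_files[entry] = files
    return data_files
```

For the flat layout above, this yields keys "train", "test", and "val", each mapped to a single file path.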
get_raw_dataset
get_raw_dataset(
dataset_args,
cache_dir: Optional[str] = None,
streaming: Optional[bool] = False,
**kwargs
) -> Dataset
Load the raw dataset from Hugging Face, using a cached copy if available.

Parameters:

- cache_dir (Optional[str], default: None) – disk location to search for a cached dataset
- streaming (Optional[bool], default: False) – True to stream data from Hugging Face, otherwise download

Returns:

- Dataset – the requested dataset
Source code in llmcompressor/transformers/finetune/data/data_helpers.py
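The contract above amounts to forwarding the caching and streaming options to an underlying loader (in the library's case, HF's datasets.load_dataset). A minimal sketch of that forwarding, with a generic `loader` argument standing in for the real loader (the name `load_raw_dataset_sketch` is illustrative):

```python
from typing import Optional

def load_raw_dataset_sketch(loader, dataset_args: dict,
                            cache_dir: Optional[str] = None,
                            streaming: bool = False, **kwargs):
    # Sketch of the documented contract: pass cache_dir and streaming
    # through to the underlying loader, which reuses a cached copy
    # when one exists under cache_dir instead of re-downloading.
    return loader(**dataset_args, cache_dir=cache_dir,
                  streaming=streaming, **kwargs)
```

With `streaming=True` the loader returns an iterable over remote data rather than downloading the full dataset to disk.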
transform_dataset_keys

Transform dict keys to "train", "val", or "test" for the given input dict if matches exist with the existing keys. Note that there can only be one matching file name.

Example:

- Folder(train_foo.json) -> Folder(train.json)
- Folder(train1.json, train2.json) -> Same

Parameters:

- data_files (Dict[str, Any]) – The dict whose keys will be transformed
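The renaming rule described above (a key is normalized to its split name only when exactly one key matches that split) can be sketched as follows; this is an illustrative reimplementation of the documented behavior, not the library's source:

```python
def transform_keys_sketch(data_files: dict) -> dict:
    # Sketch: rename a key containing "train", "val", or "test" to the
    # bare split name, but only when exactly one key matches that split.
    # With multiple matches (e.g. train1.json and train2.json) the keys
    # are left unchanged, per the "-> Same" example above.
    renamed = dict(data_files)
    for split in ("train", "val", "test"):
        matches = [k for k in data_files if split in k]
        if len(matches) == 1 and matches[0] != split:
            renamed[split] = renamed.pop(matches[0])
    return renamed
```

So {"train_foo": ...} becomes {"train": ...}, while {"train1": ..., "train2": ...} is returned unchanged.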