Bring Your Own Dataset¶
By default, FMBench uses the LongBench dataset for testing the models, but this is not the only dataset you can test with. You may want to test with other datasets available on Hugging Face or use your own datasets for testing.
Hugging Face Data Preparation is now integrated within FMBench¶
FMBench supports direct loading of Hugging Face datasets with a simplified prefixing method. To specify a Hugging Face dataset and its split, include hf:, followed by the dataset identifier, subset name, and split name.
If you only provide the dataset-id and not the subset name and split name, the following defaults will be used:
- Subset name: default
- Split name: train
Important: If your dataset does not use the default subset name and split name listed above, provide the full dataset information in the config file in the following format: hf:dataset-id/subset-name/split-name.
Example formats:
source_data_files:
# Full specification
- hf:databricks/databricks-dolly-15k/default/train
# Using defaults (subset: default, split: train)
- hf:databricks/databricks-dolly-15k
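The prefix format and its defaults can be illustrated with a small parser. This is a minimal sketch, not FMBench's actual implementation: the helper name parse_hf_entry and the HFDatasetSpec type are hypothetical, and it assumes the dataset identifier always has the two-part owner/name form used in the examples on this page.

```python
from typing import NamedTuple

HF_PREFIX = "hf:"
DEFAULT_SUBSET = "default"  # default subset name described above
DEFAULT_SPLIT = "train"     # default split name described above


class HFDatasetSpec(NamedTuple):
    dataset_id: str
    subset: str
    split: str


def parse_hf_entry(entry: str) -> HFDatasetSpec:
    """Parse 'hf:owner/name[/subset/split]' (hypothetical helper, for illustration)."""
    if not entry.startswith(HF_PREFIX):
        raise ValueError(f"not a Hugging Face entry: {entry!r}")
    parts = entry[len(HF_PREFIX):].split("/")
    if not all(parts):
        raise ValueError(f"empty path segment in {entry!r}")
    if len(parts) == 2:
        # Only the dataset identifier was given, so fall back to the defaults.
        return HFDatasetSpec("/".join(parts), DEFAULT_SUBSET, DEFAULT_SPLIT)
    if len(parts) == 4:
        # Full specification: owner/name/subset/split.
        return HFDatasetSpec("/".join(parts[:2]), parts[2], parts[3])
    raise ValueError(
        f"expected hf:dataset-id or hf:dataset-id/subset-name/split-name, got {entry!r}"
    )
```

For example, parse_hf_entry("hf:databricks/databricks-dolly-15k") yields the subset default and the split train, while an entry that gives a subset but no split is rejected rather than guessed.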
In your configuration file, prefix the dataset name with hf: in the source_data_files section:
source_data_files:
# Format: hf:dataset-id/subset-name/split-name
- hf:THUDM/LongBench/2wikimqa_e/test
- hf:THUDM/LongBench/2wikimqa/test
- hf:THUDM/LongBench/hotpotqa_e/test
- hf:THUDM/LongBench/hotpotqa/test
- hf:THUDM/LongBench/narrativeqa/test
- hf:THUDM/LongBench/triviaqa_e/test
- hf:THUDM/LongBench/triviaqa/test
When FMBench encounters a dataset prefixed with hf:, it will:
- Automatically download the dataset from Hugging Face
- Convert it to the required JSON Lines format
- Handle both text and image datasets dynamically
- Store the processed data in either:
  - The S3 read bucket for cloud deployments
  - The /tmp/fmbench-read/source_data/ directory for local runs
Note: Private or gated datasets require a Hugging Face token to be configured in your environment.
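The download-and-convert steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not FMBench's own code: the function names and the output file naming are hypothetical, the sketch only covers text datasets (image columns would need extra handling before they can be serialized to JSON), and it relies on the third-party datasets library for the download.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable


def write_jsonl(records: Iterable[Dict[str, Any]], out_path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def download_to_jsonl(dataset_id: str, subset: str, split: str, out_dir: Path) -> Path:
    """Download one split and save it as JSON Lines (text-only sketch)."""
    # Imported lazily so write_jsonl stays usable without the library installed.
    from datasets import load_dataset

    ds = load_dataset(dataset_id, name=subset, split=split)
    # Hypothetical file naming; FMBench may name its output files differently.
    out_path = out_dir / f"{dataset_id.replace('/', '_')}_{subset}_{split}.jsonl"
    write_jsonl(ds, out_path)
    return out_path
```

For a local run, out_dir would be Path("/tmp/fmbench-read/source_data/"); for a cloud deployment the resulting file would be uploaded to the S3 read bucket instead, a step this sketch leaves out.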
Using Custom Datasets¶
If you want to use your own dataset or a pre-processed dataset, you can:
- Provide the dataset path without the hf: prefix in the config.
- Or, use the [bring_your_own_dataset](./src/fmbench/bring_your_own_dataset.ipynb) notebook to convert your custom dataset to JSON Lines format and upload it to the appropriate S3 bucket or local directory.
FMBench will use these files directly from the specified location without any preprocessing.
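Because these files are used without preprocessing, it can help to check that a custom file really is valid JSON Lines before pointing the config at it. The sketch below is a standalone check, not part of FMBench; the function name is hypothetical, and it assumes each non-blank line should hold a single JSON object.

```python
import json
from pathlib import Path
from typing import List


def find_invalid_jsonl_lines(path: Path) -> List[int]:
    """Return the 1-based numbers of lines that are not a single JSON object."""
    bad: List[int] = []
    with path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are skipped rather than reported
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if not isinstance(obj, dict):
                # Assumption: every record is expected to be a JSON object.
                bad.append(lineno)
    return bad
```

An empty return value means every non-blank line parsed as a JSON object; any reported line numbers point at records to fix before running the benchmark.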
Support for new Image and Text datasets¶
While you can use any Hugging Face dataset without preprocessing, FMBench provides configuration files for running llama3-2-11b-instruct, claude-3-sonnet, and claude-3-5-sonnet on the following image and text datasets:
- Databricks dolly dataset: config-llama-3-2-11b-databricks-dolly-15k.yml
- Multimodal ScienceQA dataset: config-llama-3-2-11b-vision-instruct-scienceqa.yml
- Multimodal marqo-GS-10M dataset: config-llama-3-2-11b-vision-instruct-marqo-GS-10M.yml
Support for Open-Orca dataset¶
For Open-Orca dataset support and the corresponding prompts for Llama3, Llama2 and Mistral, see: