sambanova/generative_data_prep
Language: Python
License: Apache-2.0
Stars: 67
Forks: 10
Open issues: 7
Created: 2023-03-28T02:04:40Z
Pushed: 2026-02-04T19:00:26Z
Default branch: main
Fork: no
Archived: no
README:  
Generative data preparation
This software package allows you to prepare datasets for training generative LLMs on SambaStudio and SambaNova's Reconfigurable Data Units (RDUs). Features include efficient multiprocessing, shuffling datasets that are too large to fit in RAM, and specifying which tokens to attend to during training.
The `pipeline.py` script streamlines the data preparation process. It takes a single input file, shuffles and splits it into train/dev/test files, tokenizes, sequences, and converts them to HDF5 format using the utilities in `data_prep.py`. The output directory contains multiple split HDF5 files that are needed to run data parallel training. This output directory will be directly used as a training dataset in SambaStudio. While this package features simple flows that work out of the box, it also supports more customization allowing for many styles of packing varied length text into tokenized sequences.
If you are an advanced user looking to process data with pre-defined splits, integrate with the package validation tools, or contribute, check out the [Advanced Usage](#advanced-usage) section below!
Table of contents
- [Requirements](#requirements)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Input](#input)
- [Formatting data for Chat/Instruction/Fine Tuned Models](#formatting-data-for-chatinstructionfine-tuned-models)
- [Output](#output)
- [Flags](#flags)
- [Examples](#examples)
- [Pre-training](#pre-training)
- [Fine-tuning](#fine-tuning)
- [Dialogue](#dialogue)
- [Meta in context learning](#meta-in-context-learning)
- [Understanding Command Outputs](#understanding-outputs)
- [FAQs](#faqs)
- [Advanced Usage](#advanced-usage)
Requirements
- Python version 3.8.10+
- Supported on Linux, macOS, and Windows.
Installation
git clone https://github.com/sambanova/generative_data_prep.git
cd generative_data_prep
pip install .
Getting Started
The following simple example will help you get started with your first processed dataset:
Example
python3 -m generative_data_prep pipeline --input_path= --output_path= --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --input_packing_config='greedy::drop' --shuffle=on_RAM
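When several datasets need the same settings, the command above can also be assembled from Python. The following is a minimal sketch, not part of the package: `build_pipeline_command` is a hypothetical helper, the two paths are placeholders, and only the flags shown in the example above are used.

```python
import sys


def build_pipeline_command(
    input_path,
    output_path,
    pretrained_tokenizer="openai-community/gpt2",
    max_seq_length=1024,
    input_packing_config="greedy::drop",
    shuffle="on_RAM",
):
    """Return the argv list for the pipeline command shown above.

    Hypothetical helper for illustration; the defaults mirror the example.
    """
    return [
        sys.executable, "-m", "generative_data_prep", "pipeline",
        f"--input_path={input_path}",
        f"--output_path={output_path}",
        f"--pretrained_tokenizer={pretrained_tokenizer}",
        f"--max_seq_length={max_seq_length}",
        f"--input_packing_config={input_packing_config}",
        f"--shuffle={shuffle}",
    ]


# Placeholder paths (assumptions); replace them with your own dataset and output directory.
command = build_pipeline_command("data/example.jsonl", "out/example_processed")
print(" ".join(command))

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting, which is why the packing config needs no quotes here.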
Here are a few important parameters to know about when running this example:
| Flag Name | Type | Description | Instructions |
| --- | --- | --- | --- |
| input_path | str | An existing file path to the dataset to be processed, or directory of files. File must be in .jsonl or .txt format. | Check out the input section for more details. |
| output_path | str | A path to the desired output location for the directory of processed dataset files. If the path doesn't exist, a new directory will be created using the provided path. | Check out the output section for more details. |
| pretrained_tokenizer | str | The model specific tokenizer to use when tokenizing the input dataset. | You can specify the tokenizer in two ways. The preferred method is to provide the directory path to the locally downloaded base checkpoint. The alternative method is to use the model ID from the Hugging Face model card, such as "mistralai/Mistral-7B-v0.1" for Mistral-7B-v0.1. If the model is gated on Hugging Face, you must request access and log in via the Hugging Face CLI before executing the data preparation command. |
| max_seq_length | int | The maximum sequence length (in tokens) that an RDU model training configuration can support. | When launching the training job on SambaStudio, under "Hyperparameters and Settings," ensure that the max_seq_length value during training matches exactly with this input flag. Note that the available max_seq_length training configurations may not align with the model’s maximum sequence length on Hugging Face. |
| input_packing_config | str | Defines the strategy used to pack the provided text data into fixed-length sequences. | For pre-training, use 'full'.<br>For fine-tuning:<br>• 'greedy::truncate_right' for efficient training with multiple data points per sequence<br>• 'single::truncate_right' for limited data with one data point per sequence<br>See input_packing_config for all options and details. |
| shuffle | str | Determines whether to shuffle the input dataset, and whether to shuffle on RAM. | There are 3 options for this flag: 'False', 'on_RAM', 'large_file'. Check out the shuffle flag below for more details. |
| apply_chat_template | bool | Whether to tokenize the data using tokenizer.apply_chat_template, adding chatML tags during tokenization (e.g., `<user>: ... <assistant>:`). | This option is typically used for instruction tuning or fine-tuning chat models. To enable this flag, the tokenizer you are loading must have a chat template defined. You can verify this by checking the tokenizer_config.json file for a chat_template key. |
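The chat template check described for apply_chat_template can be scripted. This is a minimal sketch, assuming the tokenizer was downloaded to a local checkpoint directory; `has_chat_template` is a hypothetical helper, not part of the package, and it only inspects `tokenizer_config.json` as the table suggests.

```python
import json
from pathlib import Path


def has_chat_template(checkpoint_dir):
    """Return True if tokenizer_config.json in checkpoint_dir has a non-empty chat_template key.

    Hypothetical helper for illustration; it looks only at tokenizer_config.json.
    """
    config_path = Path(checkpoint_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return False
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return bool(config.get("chat_template"))


# Placeholder path (assumption); point this at your locally downloaded checkpoint.
print(has_chat_template("path/to/local/checkpoint"))
```

If the function returns False, the tokenizer is probably not suitable for apply_chat_template as loaded from that directory.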
Input
The input_path argument must be a file or a directory containing one or more files. Each file must be in `.txt` or `.jsonl` format.
.jsonl Format
The JSON Lines format can be used for [fine-tuning](#fine-tuning), or [pre-training](#pre-training)/continual pre-training. Each line in a .jsonl file should be a JSON object with a prompt element and a completion element. For example:
{"prompt": "What did the fox do?", "completion": "The quick brown fox jumped over the lazy dog."}
{"prompt": "How much wood does a woodchuck chuck?", "completion": "A woodchuck chucks 1000 wood."}
{"prompt": "Who sells seashells by the sea shore?", "completion": "She sells seashells by the sea shore."}
We also support lists of prompt/completion pairs within a .jsonl file. This guarantees that the prompt/completion pairs in the list will be placed contiguously in the same sequence. If the input prompt/completion...
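A file in this format can be produced and sanity-checked with the standard library. The sketch below is illustrative only: `is_valid_record`, `write_jsonl`, and `read_jsonl` are hypothetical helpers, not package functions, and it assumes a line holding a list is written as a JSON array of prompt/completion objects. The package's own validation may accept or reject more than this check does.

```python
import json
import tempfile
from pathlib import Path


def is_valid_record(record):
    """Check one parsed .jsonl line: a prompt/completion object, or a non-empty list of them."""
    items = record if isinstance(record, list) else [record]
    return bool(items) and all(
        isinstance(item, dict)
        and isinstance(item.get("prompt"), str)
        and isinstance(item.get("completion"), str)
        for item in items
    )


def write_jsonl(records, path):
    """Write one JSON value per line, refusing records that fail the check above."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            if not is_valid_record(record):
                raise ValueError(f"not a prompt/completion record: {record!r}")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse every non-blank line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    # A single prompt/completion pair on its own line.
    {"prompt": "What did the fox do?", "completion": "The quick brown fox jumped over the lazy dog."},
    # A list of pairs on one line (assumed layout), meant to stay together in one sequence.
    [
        {"prompt": "How much wood does a woodchuck chuck?", "completion": "A woodchuck chucks 1000 wood."},
        {"prompt": "Who sells seashells by the sea shore?", "completion": "She sells seashells by the sea shore."},
    ],
]

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.jsonl"
    write_jsonl(records, path)
    loaded = read_jsonl(path)

print(len(loaded), "lines written and read back")
```

Using `json.dumps` for each line, rather than formatting strings by hand, keeps quotes and newlines inside prompts correctly escaped.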