What does this repo signal mean?

SambaNova Systems published sambanova/generative_data_prep (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo sambanova/generative_data_prep · language Python · new data prep tool, moderate traction (67 stars). onlylabs links this event to 1 captured evidence page and 6 related repo signals.

SambaNova Systems Repo: sambanova/generative_data_prep

Captured source

GitHub: github.com/sambanova/generative_data_prep

sambanova/generative_data_prep repository metadata

published Mar 28, 2023 · seen Jun 5 · captured Jun 11 · http 200 · method plain

sambanova/generative_data_prep

Language: Python

License: Apache-2.0

Stars: 67

Forks: 10

Open issues: 7

Created: 2023-03-28T02:04:40Z

Pushed: 2026-02-04T19:00:26Z

Default branch: main

Fork: no

Archived: no

README: ![CircleCI](https://dl.circleci.com/status-badge/redirect/gh/sambanova/generative_data_prep/tree/main) ![codecov](https://codecov.io/gh/sambanova/generative_data_prep)

Generative data preparation

This software package allows you to prepare datasets for training generative LLMs on SambaStudio and SambaNova's Reconfigurable Data Units (RDUs). Some features include efficient multiprocessing, shuffling data that outsizes RAM, and specifying tokens to attend to during training.

The `pipeline.py` script streamlines the data preparation process. It takes a single input file, shuffles and splits it into train/dev/test files, tokenizes, sequences, and converts them to HDF5 format using the utilities in `data_prep.py`. The output directory contains multiple split HDF5 files that are needed to run data parallel training. This output directory will be directly used as a training dataset in SambaStudio. While this package features simple flows that work out of the box, it also supports more customization allowing for many styles of packing varied length text into tokenized sequences.
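To make the first stage of that flow concrete, here is a minimal sketch of an in-memory shuffle followed by a train/dev/test split. It is not the package's implementation: the function name `shuffle_and_split`, the ratio parameters, and the seed are assumptions made for illustration, and the real pipeline also handles files larger than RAM, which this sketch does not.

```python
import random


def shuffle_and_split(lines, dev_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle lines in memory and split them into train/dev/test lists.

    Hypothetical helper for illustration only; not part of generative_data_prep.
    """
    if dev_ratio < 0 or test_ratio < 0 or dev_ratio + test_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")

    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)

    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]  # everything left over is training data
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = shuffle_and_split([f"example {i}" for i in range(10)])
    print(len(train), len(dev), len(test))
```

Every input line ends up in exactly one of the three lists, which is the property a split step needs before tokenization.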

If you are an advanced user looking to process data with pre-defined splits, integrate with the package validation tools, or contribute, check out the [Advanced Usage](#advanced-usage) section below!

[Requirements](#requirements)
[Installation](#installation)
[Getting Started](#getting-started)
[Input](#input)
[Formatting data for Chat/Instruction/Fine Tuned Models](#formatting-data-for-chatinstructionfine-tuned-models)
[Output](#output)
[Flags](#flags)
[Examples](#examples)
[Pre-training](#pre-training)
[Fine-tuning](#fine-tuning)
[Dialogue](#dialogue)
[Meta in context learning](#meta-in-context-learning)
[Understanding Command Outputs](#understanding-outputs)
[FAQs](#faqs)
[Advanced Usage](#advanced-usage)

Requirements

Python version 3.8.10+
Support for Linux, Mac OS and Windows.

Installation

git clone https://github.com/sambanova/generative_data_prep.git
cd generative_data_prep
pip install .

Getting Started

The following simple example will help you get started with your first processed dataset:

Example

python3 -m generative_data_prep pipeline --input_path= --output_path= --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --input_packing_config='greedy::drop' --shuffle=on_RAM
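The `--input_path=` and `--output_path=` values are empty in the command as captured here, so both must be filled in before running it. As a hedged sketch, the same invocation could be assembled from Python like this. The helper name `build_pipeline_command` and the paths `data/articles.jsonl` and `out/articles_processed` are hypothetical; the flags and default values are the ones shown in the example.

```python
import sys


def build_pipeline_command(input_path, output_path,
                           pretrained_tokenizer="openai-community/gpt2",
                           max_seq_length=1024,
                           input_packing_config="greedy::drop",
                           shuffle="on_RAM"):
    """Return the argument list for the pipeline command from the example.

    Hypothetical helper; it only builds the command and does not run it.
    """
    return [
        sys.executable, "-m", "generative_data_prep", "pipeline",
        f"--input_path={input_path}",
        f"--output_path={output_path}",
        f"--pretrained_tokenizer={pretrained_tokenizer}",
        f"--max_seq_length={max_seq_length}",
        f"--input_packing_config={input_packing_config}",
        f"--shuffle={shuffle}",
    ]


if __name__ == "__main__":
    # Both paths are placeholders chosen for this sketch.
    command = build_pipeline_command("data/articles.jsonl", "out/articles_processed")
    print(" ".join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would execute it, assuming the package is installed in the same Python environment.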

Here are a few important parameters to know about when running this example:

| Flag Name | Type | Description and Instructions |
| --- | --- | --- |
| `input_path` | str | An existing file path to the dataset to be processed, or directory of files. File must be in .jsonl or .txt format. Check out the input section for more details. |
| `output_path` | str | A path to the desired output location for the directory of processed dataset files. If the path doesn't exist, a new directory will be created using the provided path. Check out the output section for more details. |
| `pretrained_tokenizer` | str | The model specific tokenizer to use when tokenizing the input dataset. You can specify the tokenizer in two ways. The preferred method is to provide the directory path to the locally downloaded base checkpoint. The alternative method is to use the model ID from the Hugging Face model card, such as "mistralai/Mistral-7B-v0.1" for Mistral-7B-v0.1. If the model is gated on Hugging Face, you must request access and log in via the Hugging Face CLI before executing the data preparation command. |
| `max_seq_length` | int | The maximum sequence length (in tokens) that an RDU model training configuration can support. When launching the training job on SambaStudio, under "Hyperparameters and Settings," ensure that the max_seq_length value during training matches exactly with this input flag. Note that the available max_seq_length training configurations may not align with the model’s maximum sequence length on Hugging Face. |
| `input_packing_config` | str | Defines the strategy used to pack the provided text data into fixed-length sequences. For pre-training, use 'full'. For fine-tuning, use 'greedy::truncate_right' for efficient training with multiple data points per sequence, or 'single::truncate_right' for limited data with one data point per sequence. See input_packing_config for all options and details. |
| `shuffle` | str | Determines whether to shuffle the input dataset, and whether to shuffle on RAM. There are 3 options for this flag: 'False', 'on_RAM', 'large_file'. Check out the shuffle flag below for more details. |
| `apply_chat_template` | bool | Whether to tokenize the data using tokenizer.apply_chat_template, adding chatML tags during tokenization (e.g., `<user>: ... <assistant>:`). This option is typically used for instruction tuning or fine-tuning chat models. To enable this flag, the tokenizer you are loading must have a chat template defined. You can verify this by checking the tokenizer_config.json file for a chat_template key. |
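The `apply_chat_template` description suggests checking `tokenizer_config.json` for a `chat_template` key. A minimal sketch of that check, using only the standard library, might look like the following. The function name `has_chat_template` and the directory `./my_checkpoint` are hypothetical, and the sketch assumes the file sits at the top level of a locally downloaded checkpoint directory.

```python
import json
from pathlib import Path


def has_chat_template(checkpoint_dir):
    """Return True if tokenizer_config.json defines a non-empty chat_template.

    Hypothetical helper; assumes the config file is directly inside checkpoint_dir.
    """
    config_path = Path(checkpoint_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return False  # no config file, so no template can be confirmed

    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)

    # A missing key, null, or empty string all count as "no template".
    return bool(config.get("chat_template"))


if __name__ == "__main__":
    # "./my_checkpoint" is a placeholder path for this sketch.
    print(has_chat_template("./my_checkpoint"))
```

Running this before data preparation gives an early signal that enabling the flag is likely to fail for a tokenizer without a template.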

Input

The input_path argument must be a file or a directory containing one or more files. Each file must be in `.txt` or `.jsonl` format.

`.jsonl` Format

The JSON Lines format can be used for [fine-tuning](#fine-tuning), or [pre-training](#pre-training)/continual pre-training. Each line in a `.jsonl` file should be a JSON object with a `prompt` element and a `completion` element. For example:

{"prompt": "What did the fox do?", "completion": "The quick brown fox jumped over the lazy dog."}
{"prompt": "How much wood does a woodchuck chuck?", "completion": "A woodchuck chucks 1000 wood."}
{"prompt": "Who sells seashells by the sea shore?", "completion": "She sells seashells by the sea shore."}
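Before running the pipeline, it can help to confirm that every line matches this shape. The following is a minimal validation sketch, not a tool shipped with the package: the function name `validate_jsonl_lines` is hypothetical, and skipping blank lines and requiring string values are assumptions of this sketch. It accepts either a single object per line or a list of objects per line.

```python
import json


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples for lines that lack prompt/completion.

    Hypothetical helper; blank lines are skipped and values must be strings,
    both of which are assumptions made for this sketch.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue

        # Treat a single object as a one-element list of pairs.
        items = record if isinstance(record, list) else [record]
        for item in items:
            if not isinstance(item, dict):
                problems.append((number, "expected an object with prompt and completion"))
                continue
            for key in ("prompt", "completion"):
                if not isinstance(item.get(key), str):
                    problems.append((number, f"missing or non-string '{key}'"))
    return problems


if __name__ == "__main__":
    sample = ['{"prompt": "What did the fox do?", "completion": "It jumped."}']
    print(validate_jsonl_lines(sample))
```

An empty result means every non-blank line parsed and carried both keys; each reported tuple points at the 1-based line number to fix.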

We also support lists of prompt/completion pairs within a .jsonl file. This guarantees that the prompt/completion pairs in the list will be placed contiguously in the same sequence. If the input prompt/completion...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New data prep tool, moderate traction (67 stars).