meta-llama/synthetic-data-kit
Python
Captured source
source ↗meta-llama/synthetic-data-kit
Description: Tool for generating high quality Synthetic datasets
Language: Python
License: MIT
Stars: 1597
Forks: 219
Open issues: 48
Created: 2025-03-27T06:40:42Z
Pushed: 2025-10-28T20:10:55Z
Default branch: main
Fork: no
Archived: no
README:
Synthetic Data Kit
Tool for generating high-quality synthetic datasets to fine-tune LLMs.
Generate Reasoning Traces, QA Pairs, save them to a fine-tuning format with a simple CLI.
> Checkout our guide on using the tool to unlock task-specific reasoning in Llama-3 family
What does Synthetic Data Kit offer?
Fine-Tuning Large Language Models is easy. There are many mature tools that you can use to fine-tune Llama model family using various post-training techniques.
Why target data preparation?
Multiple tools support standardized formats. However, most of the times your dataset is not structured in "user", "assistant" threads or in a certain format that plays well with a fine-tuning packages.
This toolkit simplifies the journey of:
- Using a LLM (vLLM or any local/external API endpoint) to generate examples
- Modular 4 command flow
- Converting your existing files to fine-tuning friendly formats
- Creating synthetic datasets
- Supporting various formats of post-training fine-tuning
How does Synthetic Data Kit offer it?
The tool is designed to follow a simple CLI structure with 4 commands:
ingestvarious file formatscreateyour fine-tuning format:QApairs,QApairs with CoT,summaryformatcurate: Using Llama as a judge to curate high quality examples.save-as: After that you can simply save these to a format that your fine-tuning workflow requires.
You can override any parameter or detail by either using the CLI or overriding the default YAML config.
Installation
From PyPI
# Create a new environment conda create -n synthetic-data python=3.10 conda activate synthetic-data pip install synthetic-data-kit
(Alternatively) From Source
git clone https://github.com/meta-llama/synthetic-data-kit.git cd synthetic-data-kit pip install -e .
To get an overview of commands type:
synthetic-data-kit --help
1. Tool Setup
- The tool can process both individual files and entire directories.
# Create directory structure for the 4-stage pipeline
mkdir -p data/{input,parsed,generated,curated,final}
# Or use the legacy structure (still supported)
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}- You also need a LLM backend that you will utilize for generating your dataset, if using vLLM:
# Start vLLM server # Note you will need to grab your HF Authentication from: https://huggingface.co/settings/tokens vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
2. Usage
The flow follows 4 simple steps: ingest, create, curate, save-as. You can process individual files or entire directories. All data is now stored in Lance format by default.
# Check if your backend is running synthetic-data-kit system-check # SINGLE FILE PROCESSING (Original approach) # Parse a document to a Lance dataset synthetic-data-kit ingest docs/report.pdf # This saves file to data/parsed/report.lance # Generate QA pairs (default) synthetic-data-kit create data/parsed/report.lance --type qa OR # Generate Chain of Thought (CoT) reasoning examples synthetic-data-kit create data/parsed/report.txt --type cot # Both of these save file to data/generated/report_qa_pairs.json # Filter content based on quality synthetic-data-kit curate data/generated/report_qa_pairs.json # Convert to alpaca fine-tuning format and save as HF arrow file synthetic-data-kit save-as data/curated/report_cleaned.json --format alpaca --storage hf
2.1 Batch Directory Processing (New)
Process entire directories of files with a single command:
# Parse all documents in a directory synthetic-data-kit ingest ./documents/ # Processes all .pdf, .html, .docx, .pptx, .txt files # Saves parsed text files to data/parsed/ # Generate QA pairs for all text files synthetic-data-kit create ./data/parsed/ --type qa # Processes all .txt files in the directory # Saves QA pairs to data/generated/ # Curate all generated files synthetic-data-kit curate ./data/generated/ --threshold 8.0 # Processes all .json files in the directory # Saves curated files to data/curated/ # Convert all curated files to training format synthetic-data-kit save-as ./data/curated/ --format alpaca # Processes all .json files in the directory # Saves final files to data/final/
2.2 Preview Mode
Use --preview to see what files would be processed without actually processing them:
# Preview files before processing synthetic-data-kit ingest ./documents --preview # Shows: directory stats, file counts by extension, list of files synthetic-data-kit create ./data/parsed --preview # Shows: .txt files that would be processed
Configuration
The toolkit uses a YAML configuration file (default: configs/config.yaml).
Note, this can be overridden via either CLI arguments OR passing a custom YAML file
# Example configuration using vLLM llm: provider: "vllm" vllm: api_base: "http://localhost:8000/v1" model: "meta-llama/Llama-3.3-70B-Instruct" sleep_time: 0.1 generation: temperature: 0.7 chunk_size: 4000 num_pairs: 25 max_context_length: 8000 curate: threshold: 7.0 batch_size: 8
or using an API endpoint:
# Example configuration using the llama API llm: provider: "api-endpoint" api-endpoint: api_base: "https://api.llama.com/v1" api_key: "llama-api-key" model: "Llama-4-Maverick-17B-128E-Instruct-FP8" max_retries: 3 sleep_time: 0.5
Customizing Configuration
Create a overriding configuration file and use it with the -c flag:
synthetic-data-kit -c my_config.yaml ingest docs/paper.pdf
Examples
Processing a Single PDF Document
# Ingest PDF synthetic-data-kit ingest research_paper.pdf # Generate QA pairs synthetic-data-kit create data/parsed/research_paper.txt -n 30 # Curate data synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5 # Save in OpenAI fine-tuning format (JSON) synthetic-data-kit save-as data/curated/research_paper_cleaned.json -f ft # Save in OpenAI fine-tuning format (HF dataset) synthetic-data-kit save-as…
Excerpt shown — open the source for the full document.
Notability
notability 6.0/10Solid repo from Meta with decent traction.
Meta AI (Llama) has a repo signal matching data demand, evals and quality.