ByteDance-Seed/DATAMASK
Python
Description: Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning
Language: Python
License: Apache-2.0
Stars: 18
Forks: 0
Open issues: 0
Created: 2025-12-29T11:31:57Z
Pushed: 2026-01-04T03:15:45Z
Default branch: main
Fork: no
Archived: no
README:
DATAMASK
English | [中文README](README.zh_CN.md)
📖 Overview
Motivation: In this study, we revisit metric-based selection and observe that selecting samples by quality metrics (FineWeb-Edu, Ultra-FineWeb, and FineWeb-DCLM) yields severely diminishing returns during long-term pre-training, while selecting by diversity metrics (FineWeb-Semdedup) removes too many valuable high-quality samples. Both effects limit the capabilities of pre-trained LLMs.
Method: DATAMASK. To solve this problem, as shown in the pipeline figure, we propose a novel and efficient optimization framework for large-scale pre-training data selection that can optimize multiple types of metrics simultaneously in a unified process. It treats selection as a mask learning problem: it iteratively samples data masks, computes policy gradients from predefined objectives evaluated on the sampled masks, and updates the mask sampling logits.
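The sample, score, and update loop described above can be sketched on a toy problem. This is a minimal illustration, not the repository's implementation: the reward in `toy_reward`, the mean-reward baseline, and all function names are assumptions made for the example. Each sample has one logit, masks are drawn as independent Bernoulli variables, and the logits follow a REINFORCE-style gradient.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_reward(mask, scores, k):
    """Hypothetical objective: total score of kept items minus a size penalty."""
    n = len(scores)
    kept_score = sum(s for s, m in zip(scores, mask) if m)
    return (kept_score - abs(sum(mask) - k)) / n


def learn_mask(scores, select_ratio=0.5, n_steps=600, n_rollout=64, lr=0.5, seed=0):
    """Toy policy-gradient mask learning over independent Bernoulli logits."""
    rng = random.Random(seed)
    n = len(scores)
    k = max(1, round(select_ratio * n))  # target number of kept samples
    logits = [0.0] * n
    for _ in range(n_steps):
        probs = [sigmoid(z) for z in logits]
        # 1) sample a batch of masks from the current logits
        masks = [[1 if rng.random() < p else 0 for p in probs]
                 for _ in range(n_rollout)]
        # 2) score each sampled mask with the objective
        rewards = [toy_reward(m, scores, k) for m in masks]
        baseline = sum(rewards) / n_rollout  # simple variance-reduction baseline
        # 3) policy-gradient update: d log p(mask) / d logit_i = mask_i - prob_i
        for i in range(n):
            grad = sum((r - baseline) * (m[i] - probs[i])
                       for m, r in zip(masks, rewards)) / n_rollout
            logits[i] += lr * grad
    top = sorted(range(n), key=lambda i: logits[i], reverse=True)[:k]
    return sorted(top), logits
```

Calling `learn_mask([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])` returns the indices of the kept samples together with the learned logits; the final selection here simply takes the `k` largest logits, which is one possible way to turn logits into a subset.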
Results: FineWeb-Mask. Through policy gradient-based optimization and various acceleration enhancements, DATAMASK reduces selection time by 98.9% compared to a greedy algorithm (estimated on the DiSF algorithm), enabling our study to explore joint learning at the trillion-token scale. With DATAMASK, we select a subset of about 10% of the 15 trillion-token FineWeb dataset, termed FineWeb-Mask, which achieves significant improvements after pre-training with hundreds of billions of tokens, demonstrating its effectiveness.
🚀 Quick Start
0. Clone the repository:
git clone https://github.com/ByteDance-Seed/DATAMASK.git
cd DATAMASK
1. Prepare your data. Following our original code, your data should be in .parquet format. Each parquet file should provide an index, scores, and text features. The index (for example, chunk_id) identifies the selected samples in your dataset during large-scale distributed selection. In our paper, the data looks like:
| chunk_id | quality_score | feature_arr |
|--------|--------|--------|
| 0 | 8 | size(768) |
| 1 | 4 | size(768) |
| ... | ... | ... |
2. Define your optimization metrics. The code implements three types of diversity scores, which are combined with the quality score during optimization. You can change the number of metrics and how they are combined by modifying xx functions defined in utils/utils.py.
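As an illustration of how a diversity score can be combined with a quality score, the sketch below uses the standard facility location function (for every sample, the best similarity to any selected sample). The function names, the use of cosine similarity, and the `lamb * quality + (1 - lamb) * diversity` form are assumptions made for this example; the functions in utils/utils.py may define and weight these terms differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def facility_location(features, selected):
    """Mean over all samples of the best similarity to any selected sample."""
    if not selected:
        return 0.0
    total = sum(max(cosine(x, features[j]) for j in selected) for x in features)
    return total / len(features)


def joint_objective(quality, features, selected, lamb):
    """Assumed combination: lamb * mean quality + (1 - lamb) * diversity."""
    mean_quality = sum(quality[j] for j in selected) / len(selected)
    return lamb * mean_quality + (1.0 - lamb) * facility_location(features, selected)
```

With three samples where the first two are near-duplicates, a small `lamb` favors the subset that covers both directions in feature space, while a large `lamb` favors the two highest-quality samples even though they are redundant.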
3. Hyper-parameters. DATAMASK optimization introduces multiple hyper-parameters. We provide extensive ablation studies on them in the paper; please refer to it for more details.
- n_epochs: number of update steps
- partial: fraction of the total samples used in each batch for batch training
- algorithm: diversity score type, one of ["DiSF","Facility","Pair_simi"]
- max_lr and min_lr: initial learning rate and final learning rate reached at n_epochs; a linear scheduler is used
- select_ratio: selection ratio
- lamb: lambda that balances quality and diversity
- n_rollout: number of rollouts per epoch
- init: logit initialization strategy, one of same and quality
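The linear schedule between max_lr and min_lr can be sketched as follows. The function name `linear_lr` and the clamping outside the range `[0, n_epochs]` are assumptions for illustration; the scheduler in train_mask.py may be implemented differently.

```python
def linear_lr(step, n_epochs, max_lr, min_lr):
    """Linearly interpolate from max_lr at step 0 to min_lr at step n_epochs."""
    frac = min(max(step / n_epochs, 0.0), 1.0)  # clamp progress to [0, 1]
    return max_lr + (min_lr - max_lr) * frac
```

With the quick-test values `n_epochs=5000`, `max_lr=10`, and `min_lr=1`, this gives a learning rate of 10 at the first step, 5.5 at the midpoint, and 1 at the final step.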
4. Quick test. After setting up your data, optimization metrics, and hyper-parameters, you can run the following command as a quick test:
python3 train_mask.py \
    --input your_input_path.parquet \
    --output your_output_path.parquet \
    --device cuda:0 \
    --n_epochs 5000 \
    --algorithm DiSF \
    --partial 0.1 \
    --max_lr 10 \
    --min_lr 1 \
    --n_rollout 128 \
    --select_ratio 0.3 \
    --init quality \
    --lamb 0.5
🔎 Text Feature Visualizations
To visualize the dilemma, we plot the text embeddings via t-SNE on random subsets of FineWeb. White, light blue, and dark blue points correspond to the most diverse samples, the highest-quality samples, and samples selected by algorithms that exhibit both high diversity and quality, respectively. Light blue points show tighter clustering. Dark blue points are sparse for all algorithms except ours. This means that selecting samples based on quality scores (dark blue and purple data points) leads to tighter clustering than the raw data distribution, indicating higher semantic redundancy and reduced information diversity.
🌟 Optimization Curves and One Optimization Ablation
Here we show the optimization curves and one optimization ablation on the rollout number G. Ablation on choosing G: as shown in the figure, we record the optimized Facility Location values against computational time while tuning G. The results show that too small a G causes training to diverge, while a larger G incurs excessive computational cost. After tuning on the three diversity metrics, we recommend G = 128 or 256, the smallest values that remain stable and yield near-optimal values across all cases. For more ablations, please refer to the paper.
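One plausible reason a small G is unstable is that a policy-gradient estimate averaged over few rollouts is noisy. The toy sketch below measures the variance of a REINFORCE estimate for a single Bernoulli mask entry as the rollout count grows. The reward (1 when the entry is kept, 0 otherwise) and the function names are assumptions for illustration and are unrelated to the Facility Location objective used in the ablation.

```python
import random
import statistics


def grad_estimate(n_rollout, rng, p=0.5):
    """REINFORCE estimate for one Bernoulli mask entry with reward = mask value."""
    total = 0.0
    for _ in range(n_rollout):
        m = 1 if rng.random() < p else 0
        reward = float(m)          # toy reward, not the repository's objective
        total += reward * (m - p)  # reward * d log p(m) / d logit
    return total / n_rollout


def estimate_variance(n_rollout, n_trials=2000, seed=0):
    """Empirical variance of the gradient estimate across repeated trials."""
    rng = random.Random(seed)
    return statistics.pvariance(
        [grad_estimate(n_rollout, rng) for _ in range(n_trials)]
    )
```

Because the rollouts are independent, the variance of the averaged estimate shrinks roughly in proportion to 1/G, so comparing `estimate_variance(4)` with `estimate_variance(64)` shows a much noisier update at the smaller rollout count, at the price of 16 times fewer objective evaluations per step.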
📈 Detailed Performance
Below we show the detailed performance on each task during pre-training for the 1.5B dense model and the 7B MoE model. The upper part covers the dense model; the lower part covers the MoE model.
| 1.5B Dense | RACE-H | RACE-M | HellaSwag | NQ | OBQA | KQAPro | MMLU | TriviaQA | ARC-Challenge | SIQA | PIQA | WinoGrande |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| FineWeb | 40.8 | 51.2 | 58.9 | 11.1 | 48.6 | 45.1 | 33.9 | 35.2 | 31.9 | 48.9 | 75.1 | 58.5 |
| FineWeb-Semdedup | 40.5 | 51.7 | 57.2 | 11.0 | 45.0 | 43.5 | 33.4 | 30.2 | 31.6 | 48.7 | 74.9 | 57.9 |
| FineWeb-Edu | 42.3 | 51.4 | 57.6 | 12.1 | 53.6 | 42.5 | 37.8 | 37.8 | 44.5 | 49.2 | 74.2 | 59.0 |
| UltraFineWeb-en | 41.8 | 53.0 | 57.8 | 10.9 | 50.6 | 42.2 | 37.2 | 30.6 | 44.2 | 48.9 | 75.9 | 57.1 |
| FineWebPro | 43.1 | 52.2 | 61.3 | 12.4 | 51.5 | 42.0 | 36.2 | 38.6 | 43.0 | 50.1 | 75.2 | 61.0 |
| FineWeb-DCLM | 43.6 | 52.9 | 61.4 | 11.1 | 48.2 | 43.4 | 34.8 | 37.8 | 40.7 | 50.4 | 76.2 | 61.4 |
| FineWeb-Mask (Ours) | 43.8 | 53.7 | 56.4 | 14.1 | 51.4 | 47.0 | 36.5 | 47.3 | 40.9 | 51.4 | 74.4 | 59.8 |
| | | | | | | | | | | | | |
| 7B MoE | RACE-H | RACE-M | HellaSwag | NQ | OBQA | KQAPro | MMLU…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10. New repo, low traction.
ByteDance (Doubao/Seed) has a repo signal matching data demand, infrastructure, safety and policy.