nvidia/nemotron-climb-fasttext-classifiers
Model Overview
Description:
Nemotron-CLIMB FastText Classifiers are five lightweight, CPU-based fastText text classifiers — quality, advertisement, informational_value, cultural_value, and educational_value — developed by NVIDIA as part of the Nemotron-CLIMB data curation pipeline. Their sole purpose is to efficiently estimate the suitability of candidate web documents for large-language-model training at scale, enabling automated data quality control before any model training occurs. These models are ready for commercial use.
License/Terms of Use:
The fastText library used for training is developed by Meta Research and released under the MIT License. The NVIDIA-trained classifier weights are released under the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
These classifiers are intended for use by ML engineers and data scientists who are building or refining pre-training corpora for large language models. The specific use case is automated scoring and filtering of web-crawled documents across five quality dimensions (text quality, advertisement content, informational value, cultural value, and educational value) as part of a data curation pipeline.
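As a rough illustration of the filtering step, the sketch below turns per-dimension scores into a keep/drop decision. The function `keep_document` and the threshold values in `MIN_SCORES` are hypothetical; the model card does not publish recommended thresholds, and which dimensions to gate on is a pipeline-specific choice.

```python
from typing import Mapping

# The five dimensions named in the model card.
DIMENSIONS = (
    "quality",
    "advertisement",
    "informational_value",
    "cultural_value",
    "educational_value",
)

# Hypothetical minimum scores on the 0-5 scale; not taken from the model card.
MIN_SCORES = {"quality": 3, "educational_value": 2}


def keep_document(
    scores: Mapping[str, int],
    min_scores: Mapping[str, int] = MIN_SCORES,
) -> bool:
    """Return True if every thresholded dimension meets its minimum score."""
    for name, minimum in min_scores.items():
        if name not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {name}")
        if scores[name] < minimum:
            return False
    return True
```

Dimensions without an entry in `min_scores` are ignored, so a pipeline can start by gating on one dimension and add others as it gathers evidence.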
Reference(s):
- DCLM (DataComp-LM) — Source data pool derived from Common Crawl
- nvidia/Nemotron-4-340B-Instruct — Teacher LLM used for annotation
Model Architecture:
Architecture Type: Shallow Neural Network (fastText supervised classifier)
Network Architecture: fastText supervised model with bag-of-words and n-gram input representation. Each classifier is stored as a separate binary (.bin) model file.
This model was developed based on the [fastText](https://fasttext.cc/) supervised classification framework developed by Meta Research.
Number of model parameters: Each model contains 300-dimensional word embeddings over a large vocabulary derived from ~1 million web documents, resulting in a binary model file of approximately 6.8 GB per classifier.
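A back-of-envelope check relates the file size to the embedding table. This sketch assumes float32 weights, that the input embedding matrix dominates the file, and that "GB" means 10^9 bytes; none of these is stated in the model card. In fastText the input matrix holds one row per vocabulary word plus one per hashed n-gram bucket, so the estimate counts both.

```python
EMBEDDING_DIM = 300          # from the model card
BYTES_PER_FLOAT32 = 4        # assumption: weights stored as float32
FILE_SIZE_BYTES = 6.8e9      # assumption: "6.8 GB" means 6.8 * 10^9 bytes


def estimated_embedding_rows(file_size_bytes: float, dim: int = EMBEDDING_DIM) -> int:
    """Estimate the number of embedding rows if the file were only the input matrix."""
    return int(file_size_bytes // (dim * BYTES_PER_FLOAT32))


rows = estimated_embedding_rows(FILE_SIZE_BYTES)
print(f"approx. rows (words + n-gram buckets): {rows:,}")
print(f"approx. parameters: {rows * EMBEDDING_DIM:,}")
```

Under these assumptions the table has on the order of a few million rows, which is plausible for a web-scale vocabulary plus hashed bigram buckets.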
Design Choices: The classifiers were produced through a two-stage knowledge distillation process:
1. LLM-based annotation (teacher signal). Approximately 1 million web documents, sourced from the publicly available DCLM (DataComp-LM) data pool (itself derived from Common Crawl), were evaluated by nvidia/Nemotron-4-340B-Instruct. Each document was truncated to 2,048 tokens and scored on a 0–5 Likert scale across multiple quality dimensions using a detailed rubric prompt. The rubric assesses text quality, presence of promotional language, informational depth, cultural significance, and educational value.
2. FastText classifier training (student models). For each quality dimension, a separate fastText supervised classifier was trained on the LLM-generated labels. Training used an 80/10/10 train/validation/test split with the following hyperparameters: learning rate 0.289, 7 epochs, word n-grams of up to 2 words, and 300-dimensional embeddings.
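The student-training stage can be sketched with the fastText Python bindings. This is a minimal illustration, not NVIDIA's training script: the helper names (`to_fasttext_line`, `split_80_10_10`, `train_classifier`), the random seed, and the file path are hypothetical, and only the four hyperparameters listed above come from the model card.

```python
import random
from typing import List, Sequence, Tuple


def to_fasttext_line(score: int, text: str) -> str:
    """Format one example as '__label__<score> <text>' on a single line."""
    if not 0 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return f"__label__{score} " + " ".join(text.split())


def split_80_10_10(
    lines: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle and split examples into train/validation/test (80/10/10)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


def train_classifier(train_path: str):
    """Train one classifier with the hyperparameters reported in the model card."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    return fasttext.train_supervised(
        input=train_path,  # hypothetical path to a file of fastText-format lines
        lr=0.289,
        epoch=7,
        wordNgrams=2,
        dim=300,
    )
```

One such model would be trained per dimension, each on the same documents but with that dimension's teacher score as the label.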
Input(s):
Input Type(s): Text
Input Format(s):
- Text: Plain-text string (UTF-8 encoded)
Input Parameters:
- Text: One-Dimensional (1D)
Other Properties Related to Input: Documents are expected to be web-crawled text. During the teacher-annotation stage, documents were truncated to 2,048 tokens. At inference time, fastText processes the full input text. No special pre-processing is required beyond standard text normalization (lowercasing is handled internally by fastText).
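One practical detail when calling the fastText Python bindings: `predict` handles one line at a time and raises an error if the string contains a newline, so multi-line web documents are usually flattened first. A minimal sketch, where the helper name `to_single_line` is hypothetical:

```python
def to_single_line(document: str) -> str:
    """Collapse every whitespace run, including newlines and tabs, into one space."""
    return " ".join(document.split())
```

Whether additional normalization matches what was used at training time is not stated in the model card, so this sketch changes only whitespace.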
Output(s)
Output Type(s): Text (classification label with confidence score)
Output Format(s):
- Text: A predicted label in the range `__label__0` through `__label__5` (corresponding to the 0–5 Likert scale), along with an associated probability score.
Output Parameters:
- Text: One-Dimensional (1D)
Other Properties Related to Output: Each classifier outputs a discrete score from 0 to 5 representing the quality of the input document along its respective dimension (quality, advertisement, informational value, cultural value, or educational value). Higher scores indicate higher quality / value. The classifiers run at high throughput on CPU hardware and do not require GPU acceleration.
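The label-plus-probability output can be converted to an integer score as sketched below. The helper names `parse_prediction` and `score_document` are hypothetical; `model` is assumed to be an object returned by `fasttext.load_model`, whose `predict(text, k=1)` returns a tuple of labels and a sequence of probabilities.

```python
from typing import Sequence, Tuple

LABEL_PREFIX = "__label__"


def parse_prediction(labels: Sequence[str], probs: Sequence[float]) -> Tuple[int, float]:
    """Convert fastText's top-1 output into (score, probability)."""
    label = labels[0]
    if not label.startswith(LABEL_PREFIX):
        raise ValueError(f"unexpected label format: {label!r}")
    score = int(label[len(LABEL_PREFIX):])
    if not 0 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score, float(probs[0])


def score_document(model, text: str) -> Tuple[int, float]:
    """Score one document; newlines are collapsed because predict takes a single line."""
    labels, probs = model.predict(" ".join(text.split()), k=1)
    return parse_prediction(labels, probs)
```

Keeping the probability alongside the score lets a pipeline treat low-confidence predictions differently from confident ones.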
Software Integration:
Runtime Engine:
- fastText (Python bindings or command-line interface)
- Not Applicable (N/A) — No NVIDIA-specific runtime engine is required
Supported Hardware Microarchitecture Compatibility:
- CPU-only — These models are designed to run on standard x86_64 or ARM CPU hardware. No GPU is required.
Supported Operating System(s):
- Linux
- macOS
- Windows
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
| Classifier | Filename | Size | Version |
|---|---|---|---|
| Quality | best_model_quality.bin | ~6.8 GB | v1.0 |
| Advertisement | best_model_advertisement.bin | ~6.8 GB | v1.0 |
| Informational Value | best_model_informational_value.bin | ~6.8 GB | v1.0 |
| Cultural Value | best_model_cultural_value.bin | ~6.8 GB | v1.0 |
| Educational Value | best_model_educational_value.bin | ~6.8 GB | v1.0 |
All five classifiers are v1.0 releases produced from the same knowledge-distillation pipeline.
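A minimal sketch of fetching and loading one classifier follows. It assumes the files sit at the root of the Hugging Face repository `nvidia/nemotron-climb-fasttext-classifiers` under the filenames listed above, which the model card does not explicitly confirm; `load_classifier` is a hypothetical helper. Each download is roughly 6.8 GB.

```python
REPO_ID = "nvidia/nemotron-climb-fasttext-classifiers"  # assumed repository id

# Filenames as listed in the model card's version table.
MODEL_FILES = {
    "quality": "best_model_quality.bin",
    "advertisement": "best_model_advertisement.bin",
    "informational_value": "best_model_informational_value.bin",
    "cultural_value": "best_model_cultural_value.bin",
    "educational_value": "best_model_educational_value.bin",
}


def load_classifier(dimension: str):
    """Download (or reuse from cache) one classifier and load it with fastText."""
    # Imported lazily; requires `pip install fasttext huggingface_hub`.
    import fasttext
    from huggingface_hub import hf_hub_download

    if dimension not in MODEL_FILES:
        raise KeyError(f"unknown dimension: {dimension}")
    path = hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILES[dimension])
    return fasttext.load_model(path)
```

Loading all five models at once keeps about 34 GB of weights in memory, so a pipeline with limited RAM may prefer to score one dimension per pass.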
Training, Testing, and Evaluation Datasets:
Training Dataset:
Data Modality:
- Text
Training Data Size:
Text Training Data Size: Less than a Billion Tokens
Data Collection Method by dataset:
- Hybrid: Automated, Synthetic (LLM-generated labels via Nemotron-4-340B-Instruct)…