CoreWeave Sets New Standard as First NVIDIA GB200 Exemplar Cloud, Improving Upon NVIDIA’s Own Training Performance Targets
Pioneers need the most performant and reliable AI cloud, and CoreWeave has become the first cloud provider in the world to be named an NVIDIA Exemplar Cloud for training workloads running on NVIDIA GB200 NVL72, optimized by CoreWeave Mission Control™, the operating standard for the AI cloud. In collaboration with NVIDIA, CoreWeave achieved groundbreaking results, consistently exceeding target performance across critical test cases. This landmark achievement validates CoreWeave's purpose-built infrastructure for delivering unparalleled performance and reliability on the latest AI accelerators required by the most demanding AI models.
The NVIDIA Exemplar Cloud initiative was launched to provide rigorous, standardized benchmarking across cloud platforms, ensuring transparency and reproducibility. Earlier this year, CoreWeave became an NVIDIA Exemplar Cloud for NVIDIA H100 GPUs, using up to 1,024 GPUs and a 30-billion-parameter Llama 2-style model.
Why participating in the Exemplar Cloud Initiative matters
By examining benchmark data, developers, researchers, and engineers can optimize their deployments with confidence and hold providers accountable for delivering on performance. This transparent approach to performance measurement is essential for running mission-critical AI workloads.
The most recent results show how CoreWeave's approach, which combines GPU performance tuning across multiple hardware generations, full-stack observability, and continuous automated optimization, yields consistent peak performance and reliability. Customers can deploy demanding AI applications, such as large-scale pretraining or disaggregated multi-node inference, with confidence that their workloads will run efficiently and as expected. This minimizes guesswork and surprises with each new GPU generation, providing the predictability and reproducibility AI pioneers need as they evolve models and scale training.
Quantifiable performance: key benchmark results
GPU performance was measured using Model FLOPs Utilization (MFU), the industry-standard metric for efficiency in large-model training. CoreWeave's clusters achieved higher utilization than the reference targets, validating the strength of our architecture and integrated software stack. Key results include:
DeepSeek-V3 (BF16): Achieved 1.9% greater MFU than the NVIDIA reference target on 512 GPUs, confirming the impact of CoreWeave's optimizations.
Grok-1 314B (BF16): Achieved 4.7% greater MFU than the NVIDIA reference target on 512 GPUs, underscoring superior compute stability.
Llama 3.1 405B (FP8): Demonstrated 3.8% greater MFU than the NVIDIA reference target on 512 GPUs, confirming the optimization of the high-speed NVIDIA NVLink fabric and storage I/O.
Llama 3 70B (FP8): Posted 2.4% greater MFU than the NVIDIA reference target on 512 GPUs, showing enhanced efficiency even on widely adopted open-source models.
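To make the metric concrete, here is a minimal Python sketch of how MFU is commonly estimated. It assumes the widely used approximation of about 6 FLOPs per parameter per token when training a dense transformer (attention-score FLOPs ignored); for mixture-of-experts models such as DeepSeek-V3 and Grok-1, the parameter count would be the parameters active per token. The function names, the throughput figure, and the per-GPU peak value are hypothetical placeholders, not CoreWeave or NVIDIA data. The results above do not state whether "greater MFU" is a relative gain or a difference in percentage points, so the comparison helper assumes a relative gain.

```python
# Minimal MFU sketch. Helper names and all numbers are hypothetical
# placeholders, not CoreWeave's or NVIDIA's benchmark tooling or data.


def dense_transformer_flops_per_token(n_params: float) -> float:
    # Approximation for training a dense transformer: ~2 FLOPs per parameter
    # in the forward pass plus ~4 in the backward pass; attention-score
    # FLOPs are ignored.
    return 6.0 * n_params


def model_flops_utilization(n_params: float, tokens_per_second: float,
                            n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Achieved model FLOP/s divided by the cluster's aggregate peak FLOP/s."""
    achieved = dense_transformer_flops_per_token(n_params) * tokens_per_second
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak


def relative_gain_pct(measured_mfu: float, reference_mfu: float) -> float:
    """Percent by which a measured MFU exceeds a reference MFU (relative gain)."""
    return (measured_mfu - reference_mfu) / reference_mfu * 100.0


if __name__ == "__main__":
    mfu = model_flops_utilization(
        n_params=70e9,              # a 70B-class dense model
        tokens_per_second=1.0e6,    # hypothetical cluster-wide throughput
        n_gpus=512,
        peak_flops_per_gpu=2.5e15,  # placeholder; use the spec-sheet peak for your precision
    )
    print(f"MFU: {mfu:.1%}")
    print(f"Gain over a hypothetical 30% target: {relative_gain_pct(mfu, 0.30):.1f}%")
```

Because MFU counts only the FLOPs the model mathematically requires, extra work such as activation recomputation does not raise it, which is what makes it comparable across different software stacks.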
Figure: CoreWeave consistently exceeded NVIDIA's benchmark target performance.
Unlocking predictability for AI pioneers
This milestone affirms the performance customers consistently receive from CoreWeave's platform across generations of software and hardware. Predictable, highly efficient performance is essential for today's long-running training workloads, and this commitment to excellence is how CoreWeave helps customers achieve results faster and accelerate development timelines.
The certification process rigorously tests performance and stability under extreme load, focusing on:
Cluster goodput and efficiency for optimal training: Maximizing the amount of useful work the GPU cluster completes, so resources are not wasted on idle time or I/O bottlenecks.
Performance scaling for faster training completion: Demonstrating predictable, efficient scaling across large, distributed GB200 NVL72 clusters.
Hardware and system resilience for uninterrupted training: Validating the underlying infrastructure's ability to remain stable during intense, prolonged training sessions.
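The first two criteria above can be expressed as simple ratios. The Python sketch below assumes one common way of defining them: goodput as the share of reserved time spent making new training progress, and scaling efficiency as observed speedup divided by ideal linear speedup. The class, field names, and sample numbers are hypothetical; the certification's exact formulas are not described here.

```python
from dataclasses import dataclass


@dataclass
class JobTimeline:
    """Wall-clock accounting for one training run, in hours (hypothetical fields)."""
    total_hours: float          # time the cluster was reserved for the job
    lost_to_failures: float     # downtime from a failure until training resumes
    recomputed_work: float      # time spent redoing steps lost since the last checkpoint
    checkpoint_overhead: float  # time training was blocked on saving checkpoints

    def goodput(self) -> float:
        # Fraction of reserved time spent making new training progress.
        wasted = self.lost_to_failures + self.recomputed_work + self.checkpoint_overhead
        return max(0.0, (self.total_hours - wasted) / self.total_hours)


def scaling_efficiency(base_gpus: int, base_throughput: float,
                       gpus: int, throughput: float) -> float:
    """Observed speedup divided by ideal (linear) speedup; 1.0 is perfect scaling."""
    ideal_speedup = gpus / base_gpus
    observed_speedup = throughput / base_throughput
    return observed_speedup / ideal_speedup


if __name__ == "__main__":
    # Illustrative numbers only, not benchmark results.
    run = JobTimeline(total_hours=100.0, lost_to_failures=3.0,
                      recomputed_work=1.5, checkpoint_overhead=0.5)
    print(f"goodput: {run.goodput():.1%}")
    # Throughput in tokens/s when growing a run from 64 to 512 GPUs.
    print(f"scaling efficiency: {scaling_efficiency(64, 1.0e5, 512, 7.2e5):.1%}")
```

Resilience feeds directly into goodput here: every failure adds downtime and recomputed work, so fewer interruptions raise the ratio even when per-step speed is unchanged.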
Deep dive into CoreWeave's groundbreaking results
To achieve these results, we didn't rely on a "cherry-picked" performance lab environment. Instead, the Exemplar Cloud benchmarks were conducted on a standard CoreWeave cloud cluster. The environment consisted of 8 racks of NVIDIA GB200 NVL72 systems, connected by the high-speed NVIDIA Quantum-2 InfiniBand networking platform.
CoreWeave Mission Control was the key to operating the cluster at peak efficiency. It draws on a massive dataset gathered from operating hundreds of thousands of NVIDIA GPUs to enable predictive failure detection and mitigation rather than passive monitoring. The system continuously tests components, identifies underperforming ones, and replaces them proactively before they can impact a customer's workload. This ensures that only the highest-performing, most reliable hardware is ever part of a cluster.
Our architecture is built to maximize performance, resiliency, and efficiency through deep integration between hardware and software. The runtime environment uses a unified stack with performance optimizations at every layer, from metal to model:
CoreWeave Bare Metal servers are fast, reliable, and performant, providing direct access to GPU computing resources without a hypervisor layer that would add overhead and latency. In addition, the network fabric provides high-bandwidth, low-latency interconnects and rapid data access across the cluster.
CoreWeave Kubernetes Service (CKS) provides the base runtime environment and is fully integrated with CoreWeave Mission Control to minimize overhead and maximize compute performance.
CoreWeave Slurm on Kubernetes (SUNK) enables topology-aware scheduling, which optimizes performance and utilization for large-scale training clusters.
Based on a battle-tested Slurm distribution scaled to handle tens of thousands of nodes and hundreds of thousands of concurrent jobs, SUNK is also fully integrated with CoreWeave Mission Control. This allows maintenance and...
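Topology-aware scheduling, as described for SUNK above, means placing a job's nodes so they span as few high-bandwidth domains as possible. The Python sketch below is a hypothetical illustration, not SUNK's or Slurm's actual algorithm: it treats each GB200 NVL72 rack as one NVLink domain and assumes each node is a compute tray with 4 GPUs (18 nodes per 72-GPU rack). It ignores the InfiniBand switch hierarchy, which a production scheduler would also weigh.

```python
# Hypothetical rack-aware placement sketch; NOT the actual SUNK or Slurm scheduler.

GPUS_PER_NODE = 4    # assumption: one GB200 compute tray (node) holds 4 GPUs
NODES_PER_RACK = 18  # 72 GPUs per NVL72 rack / 4 GPUs per node


def place_job(free_nodes_per_rack: dict[str, int], gpus_needed: int) -> dict[str, int]:
    """Return {rack: nodes_allocated}, spanning as few racks as possible.

    Raises ValueError if there are not enough free nodes in total.
    """
    nodes_needed = -(-gpus_needed // GPUS_PER_NODE)  # ceiling division
    if sum(free_nodes_per_rack.values()) < nodes_needed:
        raise ValueError("not enough free nodes for this job")

    # Case 1: the job fits inside one rack (one NVLink domain). Pick the
    # rack with the fewest free nodes that still fits, leaving larger free
    # blocks intact for future jobs.
    fitting = [r for r, free in free_nodes_per_rack.items() if free >= nodes_needed]
    if fitting:
        best = min(fitting, key=lambda r: free_nodes_per_rack[r])
        return {best: nodes_needed}

    # Case 2: the job must span racks. Taking racks with the most free nodes
    # first minimizes how many racks the job touches.
    plan: dict[str, int] = {}
    remaining = nodes_needed
    for rack in sorted(free_nodes_per_rack, key=free_nodes_per_rack.get, reverse=True):
        take = min(free_nodes_per_rack[rack], remaining)
        if take > 0:
            plan[rack] = take
            remaining -= take
        if remaining == 0:
            break
    return plan


if __name__ == "__main__":
    racks = {f"rack-{i}": NODES_PER_RACK for i in range(8)}  # 8 fully free racks
    print(place_job(racks, 512))  # 512 GPUs -> 128 nodes
    print(place_job({"rack-a": 18, "rack-b": 10, "rack-c": 12}, 40))
```

Under these assumptions, a 512-GPU job on 8 free racks needs 128 of the 144 nodes, so it fills seven racks and places the last two nodes in the eighth; a job that fits in one rack is packed into the tightest-fitting rack instead.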