Why Distributed Training Fails at Scale
Source: CoreWeave Blog
You're 11 days into a training run on 1,024 GPUs. The job is healthy, or at least it was when you checked before leaving for the evening. Then the alert arrives at 2 a.m. The job is hung. Not a clean exit, but a stall. By the time your team realizes what has happened, finds the last valid checkpoint, and resubmits the job, hours of compute are gone, and your timeline has slipped another week.

Your team tested this run just two weeks ago, and everything was fine. But now you're operating on real-world infrastructure at 100 times the scale of your test runs. This kind of failure doesn't come from bad engineering. It comes from expecting infrastructure to do something it wasn't designed to do at this scale.

More compute, more problems?

A distributed training job across hundreds or thousands of GPUs is a coordination problem. In a tightly coupled distributed job, a node that falls behind doesn't just slow itself down; it can bring the entire job to a crawl, because every other node stalls at the next synchronization point waiting for it. Collective operations like AllReduce hit a wall at scale for the same reason: one lagging node forces the entire cluster to wait. The bigger the cluster, the more coordination surface area, and the more ways something can go wrong before anyone knows it.

Take a look at the latest report from Epoch AI, which examined a comprehensive database of over 3,500 models, tracking key factors driving machine learning progress. Before 2010, the longest AI training runs spanned about seven days. Today, most frontier models train for well over seven days, and many run for 40 days or more, which requires much larger systems.
Time isn't the only factor. Model parameter counts have scaled from millions to hundreds of billions. The clusters required to support that growth have followed, training larger models on more and newer GPUs and growing training compute (FLOP) by roughly 4.3x per year since 2010.
Model parameter counts have scaled from millions to hundreds of billions over the past decade, driving a corresponding increase in the GPU clusters, and the coordination complexity, required to train them. Source: Epoch AI, 2026

Impressive? Yes, but scale changes more than the cost and timeline. It changes the failure profile entirely. This has created a massive challenge for platform engineers: large-scale distributed training has evolved faster than the infrastructure supporting it.

The GPU coordination challenge

Most on-premises infrastructure and general-purpose cloud weren't built for this reality. General-purpose cloud scales independent workloads well, but it doesn't scale the coordination that AI training depends on, so it struggles with distributed training. As models grow and training runs lengthen, the limits of general-purpose infrastructure don't announce themselves. They show up as utilization gaps, coordination failures, and progress reports nobody wants to deliver.

The numbers reflect this. Research from Meta FAIR shows that mean time to failure (MTTF) drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs, illustrating the inverse relationship between cluster size and reliability.

1,024-GPU cluster: MTTF is approximately 7.9 hours
16,384-GPU cluster: MTTF drops sharply to just 1.8 hours
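To put those MTTF figures in terms of the 11-day run from the opening scenario, here is a minimal sketch of the arithmetic. It assumes a constant failure rate over the whole run, which is a simplification and not something the cited research claims; the function name is illustrative.

```python
# MTTF figures quoted in the text (hours), keyed by GPU count.
MTTF_HOURS = {1024: 7.9, 16384: 1.8}


def expected_failures(run_days: float, mttf_hours: float) -> float:
    """Expected number of job interruptions over a run.

    Assumption: failures arrive at a constant average rate of 1 / MTTF.
    """
    return run_days * 24 / mttf_hours


for gpus, mttf in MTTF_HOURS.items():
    count = expected_failures(run_days=11, mttf_hours=mttf)
    print(f"{gpus:>6} GPUs: ~{count:.0f} interruptions in 11 days")
# Prints ~33 interruptions for 1,024 GPUs and ~147 for 16,384 GPUs.
```

Under that assumption, an 11-day job at 1,024 GPUs should expect an interruption roughly three times a day, which is why the recovery path matters as much as the failure rate.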
More disruptions often mean more time spent troubleshooting and more compute wasted. These failures drag down key utilization metrics. For example, according to a recent Signal65 report, the current industry baseline for Model FLOPs Utilization (MFU) during large-scale AI training is generally considered to be 35–45%. Goodput, the percentage of time a system spends doing useful training work versus handling interruptions, averages ~90% across the industry.

This efficiency gap exists for predictable reasons: communication overhead across distributed nodes, memory access constraints, and synchronization requirements all limit how much theoretical peak performance actually translates to model computation. As cluster size grows, the coordination surface area expands, adding cumulative drag on top of the overhead that already limits utilization at smaller scales.

The three places distributed training breaks

Infrastructure failures in large-scale training aren't random. They cluster around three predictable failure modes. Understanding them is the first step toward designing for them.

1. Compute failures

GPU faults, thermal drift, and ECC (error-correcting code) errors that degrade silently before they crash a job. What makes these failures insidious is that they often don't cause an immediate crash. They introduce silent degradation that poisons training progress before anyone notices. By the time the job fails, the last valid checkpoint may be hours old.

Thermal throttling is a particularly common and underdiagnosed version of this. A GPU that throttles to stay within its power or temperature envelope becomes a permanent straggler, forcing every other healthy GPU in the job to wait at the next synchronization barrier. You aren't just losing one GPU; you're losing the aggregate throughput of the entire cluster.

2. Coordination failures

Straggler nodes, synchronization stalls, and AllReduce bottlenecks. These are not hardware failures in the traditional sense.
The node is still running, but the training step cannot complete until every participant finishes. At scale, the probability of encountering at least one straggler on any given step approaches certainty. The question is whether the infrastructure is built to detect, isolate, and route around it.

3. Recovery failures

What happens after something breaks matters as much as the failure itself. Infrastructure that fails slowly, checkpoints infrequently, and restarts from scratch compounds the original failure into a much larger loss of time and compute. The difference between a job that recovers in minutes and one that requires a full restart from an eight-hour-old checkpoint can translate into days of lost progress on long training runs. When every...
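Two of the claims above lend themselves to back-of-envelope arithmetic: why a straggler becomes near-certain as GPU count grows, and how checkpoint age drives the cost of each failure. The sketch below is illustrative only. The per-GPU slowdown probability, restart time, and checkpoint write time are hypothetical values, not figures from the article, and the checkpoint interval uses Young's classic approximation, which assumes random failures and a checkpoint cost much smaller than the MTTF.

```python
import math


def p_any_straggler(p_per_gpu: float, n_gpus: int) -> float:
    """Chance that at least one GPU is slow on a given step.

    Assumption: each GPU is slow independently with probability p_per_gpu.
    """
    return 1.0 - (1.0 - p_per_gpu) ** n_gpus


def lost_hours_per_failure(checkpoint_interval_h: float, restart_h: float) -> float:
    """Average work lost per failure.

    Assumption: a failure lands, on average, halfway through a checkpoint
    interval, and the job then needs restart_h hours to resume.
    """
    return checkpoint_interval_h / 2.0 + restart_h


def young_interval_h(checkpoint_cost_h: float, mttf_h: float) -> float:
    """Young's approximation for a checkpoint interval that limits lost time."""
    return math.sqrt(2.0 * checkpoint_cost_h * mttf_h)


P_SLOW = 0.001            # hypothetical per-GPU, per-step slowdown probability
RESTART_H = 0.5           # hypothetical time to reschedule and reload, in hours
CHECKPOINT_COST_H = 5 / 60  # hypothetical time to write one checkpoint, in hours
MTTF_1024_H = 7.9         # MTTF for a 1,024-GPU cluster, quoted in the text

for n in (8, 1024, 16384):
    print(f"{n:>6} GPUs: P(at least one straggler) = {p_any_straggler(P_SLOW, n):.3f}")

print(f"8 h checkpoints: ~{lost_hours_per_failure(8.0, RESTART_H):.1f} h lost per failure")
interval = young_interval_h(CHECKPOINT_COST_H, MTTF_1024_H)
print(f"Suggested interval at MTTF 7.9 h: ~{interval:.2f} h")
print(f"At that interval: ~{lost_hours_per_failure(interval, RESTART_H):.1f} h lost per failure")
```

With these placeholder numbers, checkpointing about once an hour instead of every eight hours cuts the average loss per failure from several hours to roughly one, at the price of more frequent checkpoint writes.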
Excerpt shown — open the source for the full document.