CoreWeave, published Jun 1, 2026

Choosing the Right NVIDIA Platform for Running Inference on CoreWeave

Inference has a seemingly simple job: turn tokens into answers, reliably, at the latency and throughput users expect. In practice, it gets complicated quickly. Modern inference workloads are rapidly evolving beyond simple chat applications toward reasoning, long-context processing, and agentic AI systems that require dramatically more compute, memory bandwidth, and interconnect performance.

Matching the right GPU to your workload unlocks real budget efficiency: you pay only for the memory you actually use, keep tensor cores fully engaged with optimal batch sizes, and scale replicas precisely to your p95 targets. In the end, you get more headroom to innovate and the freedom to scale on your terms.

That’s why picking the ideal NVIDIA AI platform for inference starts with your workload profile, not a spec sheet. The determining factors are model size and context window (VRAM), concurrency and batching behavior (throughput), latency SLOs (tail performance), and deployment shape (single GPU vs. multi-GPU, single node vs. multi-node).

At CoreWeave, we help you match platform selection to your use case and business goals, with an AI-native stack purpose-built to run inference efficiently at scale. In this blog, we’ll break down the NVIDIA GPUs available on CoreWeave and map each one to common inference patterns, so you can right-size performance, avoid overprovisioning, and keep cost per token predictable.

NVIDIA GB300 NVL72

NVIDIA GB300 NVL72 is purpose-built for AI reasoning and test-time scaling workloads, where model quality improves with additional compute at inference time. It’s also ideal for rack-scale deployments where maximizing utilization across tightly coupled multi-GPU infrastructure matters more than optimizing a single GPU instance.

Without blowing up p95, it can help you:

- Serve frontier-scale or very large mixture-of-experts (MoE) models
- Run high-concurrency chat endpoints
- Handle mixed request shapes (short interactive chat plus long-context summarization and generation)
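To make two of the sizing goals from the introduction concrete (replicas scaled to a p95 target, and a predictable cost per token), here is a minimal Python sketch. The function names, the headroom factor, and every number in the usage lines are hypothetical placeholders chosen for illustration; they are not CoreWeave pricing or measured throughput for any platform.

```python
import math


def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """USD per one million generated tokens for a single instance.

    utilization is the fraction of each hour the instance spends
    serving tokens (an assumption you supply, between 0 and 1).
    """
    if hourly_price_usd < 0 or tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("invalid inputs")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price_usd / tokens_per_hour * 1_000_000


def replicas_needed(peak_requests_per_second: float,
                    per_replica_rps_at_slo: float,
                    headroom: float = 0.75) -> int:
    """Replicas required to absorb peak load.

    per_replica_rps_at_slo is the request rate one replica sustains
    while still meeting the p95 target (measured by you in a load test).
    headroom keeps each replica below that limit to absorb bursts.
    """
    if peak_requests_per_second < 0 or per_replica_rps_at_slo <= 0 or not 0 < headroom <= 1:
        raise ValueError("invalid inputs")
    usable_rps = per_replica_rps_at_slo * headroom
    return max(1, math.ceil(peak_requests_per_second / usable_rps))


# Hypothetical inputs: $10/hour instance sustaining 2,500 tokens/s.
print(round(cost_per_million_tokens(10.0, 2500.0), 2))
# Hypothetical inputs: 120 req/s peak, 18 req/s per replica at the p95 target.
print(replicas_needed(120, 18, headroom=0.75))  # 9
```

Running the same two calculations for each candidate platform, with your own load-test numbers, gives a like-for-like basis for comparison before looking at a spec sheet.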

Recent SemiAnalysis InferenceX data shows that NVIDIA software optimizations and NVIDIA Blackwell Ultra GB300 NVL72 platforms deliver up to 50x higher throughput per megawatt and 35x lower token cost compared to Hopper-generation platforms. As of May 2026, GB300 NVL72 represents one of the most advanced NVIDIA platforms available for large-scale AI reasoning workloads. In 2026’s MLPerf 6.0 benchmark, CoreWeave’s GB300 NVL72 submissions led all submitters in multiple categories.

NVIDIA GB200 NVL72

NVIDIA GB200 NVL72 is ideal when single-GPU or single-node architectures become the limiting factor for model size, throughput, or latency. Built on the NVIDIA Blackwell architecture, GB200 NVL72 combines second-generation Transformer Engine innovations with high-bandwidth NVLink connectivity to support inference at scale. With 130 TB/s of NVLink Switch bandwidth, the 72 Blackwell GPUs and 36 Grace CPUs act as a single massive system, accelerating real-time inference of reasoning models. For large-scale inference on models with billions of parameters, GB200 NVL72 provides 25x lower cost and energy consumption.

NVIDIA GB200 NVL72 is a great fit for production LLM inference that needs rack-scale throughput and consistent tail latency, especially for very large models and multi-GPU serving.

NVIDIA HGX™ B300

NVIDIA HGX B300 is a strong fit for production LLM serving that mixes high concurrency, large context windows, and reasoning-heavy prompts, including multi-step tool use, agent workflows, and long-context inference. Built on the NVIDIA Blackwell Ultra architecture, NVIDIA HGX B300 doubles interconnect speed with NVIDIA Quantum-X800 InfiniBand networking, NVIDIA BlueField-3 data processing units (DPUs), and 800 Gbps NVIDIA ConnectX-8 SuperNICs. It also enhances NVFP4 inference performance and increases GPU memory capacity by 50% over the NVIDIA HGX B200.
It’s an ideal platform for maximizing tokens per second per GPU without sacrificing p95 latency, especially as the request mix shifts from short chat to long-context, higher-attention workloads.

NVIDIA HGX B200

NVIDIA HGX B200 fits teams serving a portfolio: chat and code models, RAG-backed assistants, and mid-to-large parameter LLMs with moderate-to-high concurrency. It’s a good default when you need high tokens/sec, efficient batching, and room to scale from single-node to multi-node without jumping straight to rack-scale architectures.

NVIDIA HGX B200 is a versatile “workhorse” for production inference that needs strong throughput per dollar across a broad mix of models.

NVIDIA HGX H200

NVIDIA HGX H200 shines when H100-class memory becomes the limiting factor: bigger context windows, heavier retrieval augmentation, and higher batch sizes that are constrained by cache/memory movement. If you keep running into “not enough memory” or “KV cache is the bottleneck” during performance tuning, NVIDIA HGX H200 is often the cleanest step up.

NVIDIA HGX H200 is ideal for inference workloads where performance depends on moving model weights, KV cache, activations, and data quickly through memory and across the AI stack:

- Long-context LLMs
- Large KV cache footprints
- RAG pipelines that pressure VRAM and bandwidth
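To see when “KV cache is the bottleneck” is likely to apply, a back-of-the-envelope memory estimate can help. The sketch below uses the standard per-token KV cache size for a transformer with grouped-query attention (keys plus values, per layer, per KV head). The `ModelShape` class, the function names, and the example model dimensions are hypothetical and chosen only for illustration; real serving stacks add overhead for activations, fragmentation, and runtime buffers that this ignores.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical description of a decoder-only transformer."""
    params_billion: float  # total parameters, in billions
    num_layers: int        # transformer layers
    num_kv_heads: int      # key/value heads (fewer than query heads with GQA)
    head_dim: int          # dimension of each attention head


def weight_bytes(shape: ModelShape, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights (2 bytes per parameter for FP16/BF16)."""
    return shape.params_billion * 1e9 * bytes_per_param


def kv_cache_bytes(shape: ModelShape, context_tokens: int,
                   concurrent_seqs: int, bytes_per_elem: float = 2.0) -> float:
    """Memory for the KV cache when every sequence fills its context."""
    # Factor of 2: one key vector and one value vector per token.
    per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_elem
    return per_token * context_tokens * concurrent_seqs


# Illustrative 70B-parameter shape; not taken from the article.
shape = ModelShape(params_billion=70, num_layers=80, num_kv_heads=8, head_dim=128)

weights = weight_bytes(shape)
kv = kv_cache_bytes(shape, context_tokens=32_768, concurrent_seqs=16)

print(f"weights:  {weights / GIB:.1f} GiB")
print(f"KV cache: {kv / GIB:.1f} GiB")
print(f"total:    {(weights + kv) / GIB:.1f} GiB")
```

In this illustrative case the KV cache (160 GiB) is larger than the FP16 weights, which is the situation where extra memory capacity and bandwidth tend to pay off more than extra compute.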

NVIDIA HGX H100

NVIDIA HGX H100 remains a reliable choice for high-performance inference across common LLM serving stacks, especially when you’re balancing throughput and latency on well-understood model families. It’s also a strong option when you expect to reuse the same fleet for multiple workload types, and you value the depth of existing software optimization and operational familiarity.

NVIDIA HGX H100 is ideal for proven, broadly optimized production inference (and mixed training/inference) with mature ecosystem tuning.

NVIDIA RTX PRO 6000 Blackwell Server Edition

RTX PRO 6000 Blackwell Server Edition is a strong fit for:

- Agentic AI inference
- Multimodal serving
- Enterprise workloads that blend inference with visual computing or simulation-adjacent tasks

It’s also compelling for teams that want to…
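The platform-to-workload mapping described in this post can be condensed into a small lookup, sketched below in Python. The tag names and the overlap-based ranking are assumptions made for illustration: they paraphrase the fit statements above and are not an official CoreWeave or NVIDIA selection tool. Treat the output as a shortlist to benchmark, not a recommendation.

```python
# Workload tags per platform, paraphrased from the descriptions in this post.
# Tag names are hypothetical labels invented for this sketch.
PLATFORM_FIT = {
    "NVIDIA GB300 NVL72": {"reasoning", "test_time_scaling", "rack_scale",
                           "very_large_moe", "high_concurrency", "mixed_request_shapes"},
    "NVIDIA GB200 NVL72": {"rack_scale", "very_large_models", "multi_gpu_serving",
                           "consistent_tail_latency"},
    "NVIDIA HGX B300": {"high_concurrency", "long_context", "reasoning", "agentic"},
    "NVIDIA HGX B200": {"mixed_portfolio", "rag", "single_to_multi_node"},
    "NVIDIA HGX H200": {"long_context", "kv_cache_bound", "rag"},
    "NVIDIA HGX H100": {"mature_ecosystem", "mixed_training_inference"},
    "NVIDIA RTX PRO 6000 Blackwell Server Edition": {"agentic", "multimodal",
                                                     "visual_computing"},
}


def shortlist(workload_tags):
    """Return platforms sharing at least one tag, best overlap first.

    Ties keep the order in which platforms appear in PLATFORM_FIT,
    because Python's sort is stable.
    """
    wanted = set(workload_tags)
    unknown = wanted - set().union(*PLATFORM_FIT.values())
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    scored = [(len(tags & wanted), name) for name, tags in PLATFORM_FIT.items()]
    matches = [(score, name) for score, name in scored if score > 0]
    matches.sort(key=lambda pair: -pair[0])
    return [name for _, name in matches]


print(shortlist(["long_context", "kv_cache_bound"]))
print(shortlist(["agentic"]))
```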

