DigitalOcean (GradientAI), published May 5, 2026

Powering the Inference Era: Inside the DigitalOcean AI-Native Cloud


By Vinay Kumar, Chief Product & Technology Officer

Updated: May 5, 2026 7 min read


I’ve spent the last fifteen years building cloud services: early days of AWS building S3 and EBS, helping launch Oracle Cloud Infrastructure from inception, and now building the agentic cloud at DigitalOcean for AI-natives. Every cloud I’ve worked on was designed for the workloads of its era. Those clouds were built for human-centric SaaS applications: a few users, a handful of requests per session, predictable data flows.

AI workloads break every one of those assumptions.

AI runs in loops. Agents think, then act, then think again. A single user task can span hundreds of thousands of tokens, traverse half a dozen tools, hit a knowledge base, write code, execute it, and persist state, all before returning an answer. The clouds we have weren’t built for this. Hyperscalers give you hundreds of services built for yesterday’s applications, and leave the integration to you. Inference-only providers sit on someone else’s compute and stack their margin on top. GPU rental shops (frequently referred to as “Neoclouds”) give you silicon, but not a system.

This week at Deploy 2026, we launched the DigitalOcean AI-Native Cloud, a purpose-built platform for the inference and agentic era that integrates five layers from silicon to agents into a single open stack.

We shipped fifteen products on Tuesday. Here’s what’s inside.

The shape of the stack

Our AI-Native Cloud is composed of five layers, each addressing a real workload pattern we’ve watched our customers wrestle with.

They’re independently useful and beautifully integrated:

Managed Agents: production runtime for agents, with sandboxes, durable state, and a universal data plane

Data & Learning: managed databases, vector stores, knowledge bases, and feedback loops

Inference Engine: every open and frontier model on one endpoint, optimized at the kernel

Core Cloud: compute, networking, and storage primitives, tuned for AI

Infrastructure: DigitalOcean-owned silicon and facilities, co-engineered with the industry’s best

Open source isn’t an add-on at any of these layers. It’s the foundation: PostgreSQL, MySQL, MongoDB, Valkey, OpenSearch, Kafka, Weaviate, vLLM, SGLang, OpenCode, LangGraph, CrewAI. Open all the way down. You bring your weights, your harness, your tools. We provide the runtime.

Let me walk through it, from the ground up.

Infrastructure: own the silicon, own the economics

Our global footprint now spans 19 data centers and 200+ network points of presence, with future capacity coming online in Kansas City and Memphis. That includes our first liquid-cooled racks, purpose-built for next-generation high-density GPU workloads.

Our Richmond data center is now generally available, with NVIDIA HGX™ B300 and AMD Instinct™ MI350X GPUs available alongside the H100, H200, and MI300/MI325 silicon already running across our fleet. We co-engineer at the kernel level with both NVIDIA and AMD. We don’t rent capacity. We own it. That’s why your unit economics improve as you scale on us, instead of getting worse.

Core Cloud: the foundation under every agent

Hundreds of thousands of customers already run on our core cloud every day: Droplets, Kubernetes (DOKS), VPC networking, and object/block/network file storage. We’ve extended it for AI workloads with a non-blocking RDMA fabric, RDMA-enabled NFS, and VPC-native inference out of the box.

At Deploy we announced Burstable CPU and MicroVM Droplets, currently in Private Preview. These are Firecracker-based instances that start in roughly 200 milliseconds, ideal for agent sandboxes and lightweight, spiky workloads. Agents need GPUs for thinking and CPUs for doing. We have both, and now they’re sized for how agents actually behave.

Inference Engine: every model, one endpoint

This is the layer we’ve rebuilt from the ground up. We co-developed it with design partners like Hippocratic AI, and the result is one of the highest-performing inference engines on the market today: fastest inference for Qwen 3.5 and DeepSeek V3.2 in independent Artificial Analysis benchmarks for token throughput.

Here’s what’s new:

Inference Router (Public Preview): a preference-aware control plane that picks the right model for each request, balancing cost, latency, and quality with no code changes

Dedicated Inference (General Availability): reserved capacity with predictable performance and economics for production workloads

Bring Your Own Model (BYOM) (General Availability): host your fine-tunes on our serving stack and inherit the kernel-level optimizations

Multi-modal model support (General Availability): text, vision, audio, and video on a single API

Batch Inference (General Availability): purpose-built for asynchronous workloads (document processing, eval runs, synthetic data generation) at roughly 50% of peak serverless pricing

Content Safety Guardrails (General Availability): policy controls integrated at the inference layer

Serverless Inference with multi-modal support (General Availability): single API, scale to zero, pay only for tokens consumed

Evaluations (Public Preview): automated scoring against golden datasets or built-in judge models, so you can swap models without flying blind
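To make the golden-dataset idea from the Evaluations item concrete, here is a minimal sketch of scoring two models against the same set of expected answers before swapping one for the other. It is not the Evaluations product API: the function names, the exact-match metric, the stub models, and the regression threshold are all assumptions chosen for illustration.

```python
def exact_match_score(model_fn, golden):
    """Fraction of golden examples whose output matches the expected answer.

    model_fn is any callable mapping a prompt string to an output string.
    Matching ignores case and surrounding whitespace.
    """
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = 0
    for example in golden:
        output = model_fn(example["prompt"])
        if output.strip().lower() == example["expected"].strip().lower():
            hits += 1
    return hits / len(golden)


def make_stub(answers):
    """Build a stand-in 'model' from a lookup table (hypothetical, for the demo)."""
    return lambda prompt: answers.get(prompt, "")


# Hypothetical golden dataset: prompts paired with the answers we expect.
golden = [
    {"prompt": "2 + 2 =", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Opposite of hot?", "expected": "cold"},
    {"prompt": "Largest planet?", "expected": "Jupiter"},
]

current_model = make_stub({
    "2 + 2 =": "4",
    "Capital of France?": "Paris",
    "Opposite of hot?": "Cold",
    "Largest planet?": "Jupiter",
})
candidate_model = make_stub({
    "2 + 2 =": "4",
    "Capital of France?": "paris ",
    "Opposite of hot?": "cold",
    "Largest planet?": "Saturn",  # the one regression in this demo
})

baseline = exact_match_score(current_model, golden)
candidate = exact_match_score(candidate_model, golden)

MAX_REGRESSION = 0.05  # assumed tolerance; pick one that fits your workload
safe_to_swap = candidate >= baseline - MAX_REGRESSION

print(f"baseline={baseline:.2f} candidate={candidate:.2f} safe_to_swap={safe_to_swap}")
```

Exact match only suits tasks with short, canonical answers. Open-ended outputs would need a different scorer, such as the judge-model approach the item mentions.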

The Router deserves a closer look. It’s a preference-aware control plane that picks the best model for each request, balancing cost, latency, and quality without touching application code. Unlike static routing rules, it runs on a purpose-built small language model that resolves intent in 200 milliseconds and ranks candidates against live cost and latency data, so the right model wins at 2am and at 2pm. Most AI builders start on a single frontier model. Then PMF happens, the bill scales linearly with usage, and the unit economics get painful fast. Most successful AI natives we work with run three or more models in production. The leading edge is running twenty or more. The Router makes that possible without a rewrite.
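One way to picture preference-aware ranking is a weighted score over cost, latency, and quality. The sketch below is only an illustration of that idea, not how the Inference Router is implemented: the post describes a small language model that resolves intent, which is not modeled here, and every model name, price, latency, quality score, and weight is a made-up assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical model label
    cost_per_mtok: float   # USD per million tokens (made-up number)
    p50_latency_ms: float  # median latency in ms (made-up number)
    quality: float         # 0..1 score, e.g. from an offline eval (made-up)


@dataclass(frozen=True)
class Preferences:
    cost: float     # how much the caller cares about low cost
    latency: float  # how much the caller cares about low latency
    quality: float  # how much the caller cares about high quality


def _normalize(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(candidates, prefs):
    """Return candidates sorted best-first by a weighted score.

    Quality adds to the score; cost and latency subtract from it.
    """
    cost = _normalize([c.cost_per_mtok for c in candidates])
    lat = _normalize([c.p50_latency_ms for c in candidates])
    qual = _normalize([c.quality for c in candidates])
    scored = []
    for i, c in enumerate(candidates):
        score = prefs.quality * qual[i] - prefs.cost * cost[i] - prefs.latency * lat[i]
        scored.append((score, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


candidates = [
    Candidate("small-open-model", 0.20, 150.0, 0.70),
    Candidate("large-open-model", 0.90, 400.0, 0.85),
    Candidate("frontier-model", 5.00, 900.0, 0.95),
]

# Same candidate pool, two different preference profiles.
cheap_first = rank(candidates, Preferences(cost=1.0, latency=0.5, quality=0.5))
quality_first = rank(candidates, Preferences(cost=0.1, latency=0.1, quality=2.0))

print(cheap_first[0].name)
print(quality_first[0].name)
```

With these numbers the cost-sensitive profile picks the small open model and the quality-heavy profile picks the frontier model. In a live system the cost and latency columns would be refreshed from current measurements rather than fixed constants.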

Take Celiums.AI: across 29.2M tokens processed through the Inference Router, 83% of their traffic now lands on open-source models, up from zero.

“Our AI Ethics Engine was built with open-source AI, so running it on closed-source models…

