DigitalOcean (GradientAI), published May 20, 2026

How We Built DigitalOcean Inference Router

By Adil Hafeez

Principal Engineer

Published: May 20, 2026 14 min read

Most teams building on LLMs today make a single model decision and apply it uniformly across every request. They reach for a frontier model not because every task demands it, but because building the infrastructure to do anything smarter is hard, time-consuming, and easy to get wrong. When the tooling isn’t there, the path of least resistance is to use a single model, even if it means that you end up overpaying for most tasks.

Let’s take an example. If you’re a developer building with Cursor, Claude Code, Open Code, or any other coding agent today, you’ve already felt this. In a single session, your agent does deep codebase analysis, writes new functions, fixes bugs from test output, explains methods, and searches documentation. These tasks are not equivalent, but if you’re on a single hardcoded model, you’re paying frontier rates for all of them, including the ones that don’t need it. The stakes are even higher in agentic workflows and multi-agent systems. When multiple agents run in parallel, each planning, executing, and evaluating across long-horizon tasks, the cost of uniform model selection compounds with every step. Furthermore, major AI providers are moving toward token-based billing and tighter rate limits. Inference is about to get more expensive.

The alternative is hardcoded routing logic in the application layer: an intent classifier, backed by an LLM, that adds to your cost and gets brittle fast. Even if you use a smaller model like Haiku to keep costs down, you’re now paying for a routing call on top of every inference call. Accuracy also takes a hit because the model is not purpose-built for routing, and as your task types evolve or models change, the logic breaks in ways that are hard to catch. You’ve introduced double taxation: the cost of the classifier plus the cost of maintaining brittle routing code that needs updating with every change to your stack. You’ve turned model selection into a feature you own and maintain, which is a burden most developers don’t want. What scales is routing at the infrastructure level: matching each request to the right model automatically, based on what the task actually requires and on priorities like cost and latency, without putting that logic in your code.

That’s why we built DigitalOcean’s Inference Router. It reads each request, understands the task from the conversation context, and routes to the best-fit model from your configured pool, optimizing for cost, latency, or quality depending on what you need. Drop it in with a single line change ("model": "router:software-engineering"), and your agent starts using the right model for each task automatically.

Under the hood, it’s powered by Plano, an open-source AI-native proxy originally developed at Katanemo (now part of DigitalOcean). The routing model driving intent resolution is a 30B Mixture-of-Experts (MoE) model fine-tuned for task detection over long context windows. It outperforms GPT-5.1 and Claude Sonnet 4.5 on routing accuracy across multi-turn conversations, resolving intent in ~200ms. A 4B dense variant is also available for latency-sensitive deployments. This article covers how it works: the product, the models, the ranking engine, and the infrastructure underneath.

DigitalOcean’s Inference Router

The fastest way to start routing is with a preset router. Inference Router ships with presets for common workflows like Software Engineering, General, Writing, and Knowledge Base & Document Intelligence.

Each preset’s model recommendations are based on a hybrid evaluation methodology. We combine public benchmark signals from leading leaderboards to identify top candidates per task, then validate through in-house benchmarking on curated task-specific datasets. Final recommendations are confirmed by DigitalOcean’s data science team using automated scoring and human evaluation. Where open-source models deliver comparable accuracy to closed-source alternatives for a specific task, we recommend them, and where they don’t, we recommend the frontier closed-source model.

Presets support three selection policies: Optimal (DigitalOcean’s recommended model ordering based on this hybrid evaluation), Cost Efficiency (prioritize lowest cost), and Speed Optimization (prioritize lowest latency). Pick one from the router catalog, and you’re routing in minutes.
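To make the three selection policies concrete, here is a minimal Python sketch of how a pool of candidates could be ordered under each one. It is an illustration only, not the router's actual ranking engine: the `Candidate` fields, the policy keys, and the model names, prices, and latencies in `POOL` are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One model in a preset's pool. All fields are hypothetical."""
    name: str             # placeholder model identifier
    cost_per_mtok: float  # placeholder price, USD per million tokens
    latency_ms: float     # placeholder typical latency
    rank: int             # position in a curated ordering (1 = best)


# Each policy is just a different sort key over the same pool.
POLICY_KEYS = {
    "optimal": lambda c: c.rank,                  # curated ordering
    "cost_efficiency": lambda c: c.cost_per_mtok,  # cheapest first
    "speed_optimization": lambda c: c.latency_ms,  # fastest first
}


def order_candidates(pool, policy):
    """Return the pool sorted according to the named selection policy."""
    if policy not in POLICY_KEYS:
        raise ValueError(f"unknown policy: {policy!r}")
    return sorted(pool, key=POLICY_KEYS[policy])


# Made-up numbers, chosen only so the three orderings differ.
POOL = [
    Candidate("model-a", cost_per_mtok=10.0, latency_ms=900.0, rank=1),
    Candidate("model-b", cost_per_mtok=1.5, latency_ms=400.0, rank=2),
    Candidate("model-c", cost_per_mtok=3.0, latency_ms=250.0, rank=3),
]
```

The same pool yields a different first choice under each policy, which is why the policy is selected separately from the pool itself.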

Using a preset is a drop-in replacement for any model call. Prefix the router name with router: in the model field:

    curl -s https://inference.do-ai.run/v1/chat/completions \
      -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "router:software-engineering",
        "messages": [
          {"role": "user", "content": "Write a Python function to sort a list of dictionaries by key"}
        ]
      }'

The response tells you which model was selected (in the model field) and which task matched (via the x-model-router-selected-route header). If no task matches, the request falls through to the router’s configured fallback models, tried in order.
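For callers who prefer Python to curl, the sketch below builds the same request with the standard library and pulls the routing decision out of a response. The endpoint, the `router:` prefix, and the `x-model-router-selected-route` header come from the text above. The helper names `build_request` and `routing_decision` are hypothetical, and returning `None` when the header is absent is an assumption, since the excerpt does not say what the header contains when no task matches.

```python
import json
import urllib.request

ENDPOINT = "https://inference.do-ai.run/v1/chat/completions"
ROUTE_HEADER = "x-model-router-selected-route"


def build_request(access_key, router, prompt):
    """Build (but do not send) a chat completion request for a router."""
    payload = {
        "model": f"router:{router}",  # the router: prefix selects a router
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def routing_decision(headers, body):
    """Return (selected_model, matched_route) from a response.

    headers: any mapping with .items(); header names are matched
             case-insensitively.
    body:    the decoded JSON response body.
    The route is None if the header is missing (an assumption).
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    return body.get("model"), lowered.get(ROUTE_HEADER)
```

Sending the request would look roughly like `with urllib.request.urlopen(req) as resp: model, route = routing_decision(resp.headers, json.load(resp))`. Logging both values per request is a cheap way to see how traffic is distributed across the pool.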

When you need more control or want to customize routing for your own use case, custom routers let you define your own tasks. Each task has a name, a natural-language description (used for intent matching), a pool of eligible models, and a selection policy (cheapest, fastest, or ranked order):

    curl -X POST "https://api.digitalocean.com/v2/gen-ai/models/routers" \
      -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "my-coding-router",
        "policies": [
          {
            "custom_task": {
              "name": "code-generation",
              "description": "Generate new code, write functions, or create boilerplate"
            },
            "models": ["openai-gpt-5.2", "anthropic-claude-sonnet-4.5"],
            "selection_policy": { "prefer": "fastest" }
          },
          {
            "custom_task": {
              "name": "bug-fixing",
              "description": "Identify and fix errors or bugs in user-supplied code"
            },
            "models": ["openai-gpt-5.2", "glm-5"],
            "selection_policy": { "prefer": "cheapest" }
          }
        ],
        "fallback_models": ["openai-gpt-oss-120b"]
      }'
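If you manage several routers, it can help to generate this payload from code rather than hand-editing JSON. The following is a minimal sketch that mirrors the field names in the request above. The `TaskPolicy` and `RouterConfig` classes and their validation rules are hypothetical conveniences, not part of any DigitalOcean SDK, and the sketch does not check the `prefer` value because the excerpt only shows `fastest` and `cheapest`.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TaskPolicy:
    """One routing task: name, description, model pool, and preference."""
    name: str
    description: str  # natural-language text used for intent matching
    models: list
    prefer: str       # e.g. "fastest" or "cheapest", as in the request above

    def to_payload(self):
        # Local sanity checks (assumptions, not documented API rules).
        if not self.description.strip():
            raise ValueError(f"task {self.name!r} needs a description")
        if not self.models:
            raise ValueError(f"task {self.name!r} needs at least one model")
        return {
            "custom_task": {"name": self.name, "description": self.description},
            "models": list(self.models),
            "selection_policy": {"prefer": self.prefer},
        }


@dataclass
class RouterConfig:
    """A custom router: a list of task policies plus ordered fallbacks."""
    name: str
    policies: list
    fallback_models: list = field(default_factory=list)

    def to_payload(self):
        return {
            "name": self.name,
            "policies": [policy.to_payload() for policy in self.policies],
            "fallback_models": list(self.fallback_models),
        }


router = RouterConfig(
    name="my-coding-router",
    policies=[
        TaskPolicy(
            "code-generation",
            "Generate new code, write functions, or create boilerplate",
            ["openai-gpt-5.2", "anthropic-claude-sonnet-4.5"],
            "fastest",
        ),
        TaskPolicy(
            "bug-fixing",
            "Identify and fix errors or bugs in user-supplied code",
            ["openai-gpt-5.2", "glm-5"],
            "cheapest",
        ),
    ],
    fallback_models=["openai-gpt-oss-120b"],
)

# JSON body suitable for the -d argument of the curl call above.
body = json.dumps(router.to_payload(), indent=2)
```

Because the description is what drives intent matching, the sketch rejects empty descriptions before the payload ever leaves your machine.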

You can test routing behavior in the Playground and compare a router against a single model side by side, with cost and latency differences visible per request. For systematic…
