AI Serving Platform That Adapts to Your Model
Source: Databricks Blog
Summary
What it is: A fully managed platform that runs any model in production, from a 2 MB scikit-learn classifier on one CPU core to a fine-tuned 70B LLM on eight GPUs, with no knobs.
The challenge it solves: Custom models have wildly different resource profiles and traffic patterns, so no single static config fits them all. The platform adapts instead, holding latency low while keeping every node efficient.
The results: 300K+ QPS at <10 ms p99 latency overhead and up to 90% lower infrastructure cost for customers migrating off self-managed stacks.
Challenges of Running Custom Model Inferences
When you deploy a machine learning model to production, you are committing to a contract: every request completes within a few milliseconds regardless of traffic spikes, and your bill stays low when traffic is low. Model serving is the infrastructure that keeps that contract, and for most of the industry's history, keeping it has been as hard as building the model itself.
Custom models are fundamentally different from foundation models. A platform hosting a foundation model (Llama, Mistral, a CLIP variant) knows exactly what it is running (the architecture, the memory footprint, the inference characteristics) and can optimize deeply for that one model. Custom model platforms are the opposite. The same platform has to serve a 2 MB scikit-learn classifier on a single CPU core and a fine-tuned 70B LLM on eight GPUs; a low-latency ranker that cannot tolerate queuing and an embedding model that thrives on aggressive batching. It has to serve every kind of model, and no two share the same resource profile, traffic shape, or latency budget.
Traditional platforms offload that complexity back to the customer: replica count, per-replica concurrency, autoscaling thresholds. This is still DIY, just at a higher level of abstraction. And it never stops: every new model and traffic shift means re-profiling and re-tuning, so your best engineers fire-fight production before and after shipping, and serving becomes the anchor that slows every launch. The result is the cost that matters most: models proven in dev sit for weeks before they reach production.
Our Mission: Remove the ML Stack Tax
Re-tuning serving infrastructure by hand is a tax on every model an organization runs; at scale it becomes structural, with teams standing up dedicated serving groups whose whole job is keeping models alive and performant in production. We call it the ML Stack Tax.
Databricks Custom Model Serving is a fully managed real-time inference platform for any model packaged in MLflow. Our mission is to erase that tax across three stages of a model's life so that our customers' serving teams can focus on higher-value work:
Make pre-production simple. A model trained in Databricks deploys with a single click: we match the environment exactly, with no runtime surprises, and optimize deployment time so iteration and rollback stay fast.
Make production reliable, scalable, and cost-efficient. The infrastructure adapts to each model and its traffic at run time, holding latency low and cost down with no knobs to set. (The focus of this post.)
Make post-production simple. Every endpoint emits telemetry into Unity Catalog out of the box (metrics, OTel-native logs and traces, instant inference tables capturing every request to Delta and MLflow Tracing). Genie Code sits on top of all of it to deliver first-of-its-kind agentic operational observability. Observability for AI is a context problem, and the whole context lives in one platform.
This works because Custom Model Serving is built natively into Databricks: data, features, training, MLflow packaging, serving, and agents are one governed stack, not separate systems stitched together. This post covers the second stage: how we reach 300K+ QPS at low latency across a wide variety of models with a no-knob approach. This is what makes the tax disappear.
Architecture
Three constraints shape every decision in the architecture: low latency, high scale, and cost efficiency. They pull against each other (the easy way to cut latency is to over-provision, the easy way to cut cost is to under-provision), and holding all three at once, for every kind of model, without wasting resources is the real engineering problem.
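To make that tension concrete, here is a minimal sizing sketch. It is not how the platform sizes endpoints; it applies Little's Law (average in-flight requests = arrival rate × time in system) and assumes each replica handles a fixed number of concurrent requests. The function name `replicas_needed`, the `headroom` parameter, and the example numbers are illustrative assumptions.

```python
import math


def replicas_needed(qps: float, latency_ms: float,
                    per_replica_concurrency: int,
                    headroom: float = 0.0) -> int:
    """Estimate replicas needed to sustain `qps` at `latency_ms` average latency.

    Little's Law: in-flight requests = arrival rate * time in system.
    `headroom` is extra capacity as a fraction (0.5 means 50% over-provisioned).
    All parameter names here are hypothetical, chosen for illustration.
    """
    if qps < 0 or latency_ms <= 0 or per_replica_concurrency <= 0 or headroom < 0:
        raise ValueError("inputs must be positive (qps and headroom may be zero)")
    in_flight = qps * latency_ms / 1000.0
    needed = math.ceil(in_flight * (1.0 + headroom) / per_replica_concurrency)
    # Keep at least one replica so the endpoint can answer its next request.
    return max(1, needed)


if __name__ == "__main__":
    print(replicas_needed(2000, 50, 4))                # exact fit
    print(replicas_needed(2000, 50, 4, headroom=0.5))  # over-provisioned
```

With 2,000 QPS at 50 ms and 4 concurrent requests per replica, the sketch gives 25 replicas; 50% headroom raises that to 38. Headroom buys latency margin during spikes at a proportional cost, which is exactly the trade-off described above.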
Three things make it work:
A short, isolated request path that keeps latency overhead minimal at every hop.
Automatic runtime selection: each model is served on the inference engine best suited to it.
The heart of the platform: an autoscaler that adapts to both the model and its traffic in real time, keeping latency low and scale high while driving cost down.
The first two keep a single request fast; the third keeps the whole system fast and cost-effective as models and traffic change. Most of this section is about the third.
Short, Isolated Request Path
Every serving endpoint is a fully isolated Kubernetes deployment with its own pods and a container image specific to the model version. This isolation is deliberate: one endpoint's traffic, failures, or resource pressure cannot affect another's, and it keeps custom workloads secure. The path itself is kept as short as possible, because latency is a first-class constraint at every layer. A request arrives through a PoP proxy; once authenticated, it passes through a shared load balancer for connection management and immediately lands on the pod that serves it. Each pod also runs an observability sidecar that exports metrics, logs, payload logs, and traces, for both platform monitoring and customer-facing dashboards.
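One way to see why a short path matters is to treat the overhead target as a budget that each hop spends from. The sketch below is a simple additive model, not a description of the real system: the hop names follow the text, but every millisecond value is a placeholder chosen for illustration, not a measurement, and the names `Hop`, `REQUEST_PATH`, and `within_budget` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hop:
    name: str
    overhead_ms: float  # placeholder value, NOT a measured number


# Hops named in the text; the overheads are illustrative assumptions.
REQUEST_PATH = [
    Hop("pop_proxy_and_auth", 2.0),
    Hop("shared_load_balancer", 1.5),
    Hop("pod_ingress", 1.0),
]


def total_overhead_ms(path: Sequence[Hop]) -> float:
    """Sum per-hop overhead (a simple additive approximation)."""
    return sum(hop.overhead_ms for hop in path)


def within_budget(path: Sequence[Hop], budget_ms: float = 10.0) -> bool:
    """Check the path against an overhead budget (10 ms by default)."""
    return total_overhead_ms(path) <= budget_ms


if __name__ == "__main__":
    print(total_overhead_ms(REQUEST_PATH), within_budget(REQUEST_PATH))
    longer = REQUEST_PATH + [Hop("extra_gateway", 6.0)]
    print(total_overhead_ms(longer), within_budget(longer))
```

Under these assumed numbers, a single extra hop is enough to push the path over the budget, which is the argument for keeping the hop count small.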
Efficient Model Runtime Selection
Inside each pod, the model runs on the inference engine best suited to its type: an async Gunicorn MLflow server for classic ML models, and GPU-optimized engines for large models, with support for vLLM, Triton, or a customer's own runtime. All of them sit behind one uniform serving interface. Meeting each model with the right runtime keeps per-request overhead low without hand-tuning; the specifics are shown in the diagram below.
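Runtime selection can be pictured as a small dispatch step in front of a uniform interface. The sketch below is an assumption-laden illustration, not the platform's logic: the runtime names come from the text, but the `model_kind` labels, the `has_custom_runtime` flag, and the rule that LLMs map to vLLM and other GPU models to Triton are hypothetical.

```python
from enum import Enum


class Runtime(Enum):
    MLFLOW_GUNICORN = "async Gunicorn MLflow server"
    VLLM = "vLLM"
    TRITON = "Triton"
    CUSTOM = "customer-provided runtime"


def select_runtime(model_kind: str, has_custom_runtime: bool = False) -> Runtime:
    """Pick a runtime for a model.

    The mapping rules are assumptions made for illustration only.
    """
    # A customer-supplied runtime takes precedence over automatic selection.
    if has_custom_runtime:
        return Runtime.CUSTOM
    kind = model_kind.strip().lower()
    if kind == "llm":
        return Runtime.VLLM
    if kind == "gpu":
        return Runtime.TRITON
    if kind == "classic":
        return Runtime.MLFLOW_GUNICORN
    raise ValueError(f"unknown model kind: {model_kind!r}")


if __name__ == "__main__":
    for kind in ("classic", "llm", "gpu"):
        print(kind, "->", select_runtime(kind).value)
```

Because callers only see the `Runtime` chosen for them, the serving interface stays the same whichever engine runs underneath.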
The Autoscaler: Adapting to Model and Traffic
A custom Kubernetes controller we built, the AutoPilot Pod Autoscaler (APA), sits at the center of the platform. It continuously collects signals from the load balancer (active concurrency, queue depth) and from the pods themselves (CPU utilization, GPU utilization, GPU memory, and many…
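The excerpt does not describe APA's actual algorithm, so the following is only a sketch of how signals like these could be combined. It borrows the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)) and takes the most demanding signal. The `Signals` and `Targets` types, the target values, and the replica bounds are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    active_concurrency: float  # in-flight requests, from the load balancer
    queue_depth: float         # waiting requests, from the load balancer
    cpu_utilization: float     # 0..1, averaged across pods
    gpu_utilization: float     # 0..1, averaged across pods


@dataclass(frozen=True)
class Targets:
    # Hypothetical targets; a real system would derive these per model.
    concurrency_per_replica: float = 4.0
    cpu_utilization: float = 0.6
    gpu_utilization: float = 0.7


def desired_replicas(current: int, s: Signals, t: Targets = Targets(),
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Return a replica count that satisfies the most demanding signal."""
    # Load signals: total demand divided by what one replica should carry.
    by_load = math.ceil((s.active_concurrency + s.queue_depth)
                        / t.concurrency_per_replica)
    # Utilization signals: HPA-style proportional scaling.
    by_cpu = math.ceil(current * s.cpu_utilization / t.cpu_utilization)
    by_gpu = math.ceil(current * s.gpu_utilization / t.gpu_utilization)
    wanted = max(by_load, by_cpu, by_gpu)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    burst = Signals(active_concurrency=30, queue_depth=10,
                    cpu_utilization=0.3, gpu_utilization=0.35)
    print(desired_replicas(4, burst))  # queueing drives the scale-up
```

Taking the maximum across signals means a queue building at the load balancer triggers a scale-up even while pod utilization still looks healthy, and the lower bound keeps an idle endpoint at one replica in this sketch.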
Excerpt shown — open the source for the full document.