Introducing Dedicated Container Inference: Delivering 2.6x faster inference for custom AI models
Published 2/12/2026
Authors
Sylvie Liberman, Rasul Nabiyev, Mohamad Rostami, Dulaj Disanayaka, Will Van Eaton, Nikitha Suryadevara
Summary
Dedicated Container Inference lets teams deploy custom generative media models — like video generation, avatar synthesis, and image processing — with production-grade orchestration they don't have to build themselves. Bring your Docker container; Together handles autoscaling, queuing, traffic isolation, and monitoring. For teams already training on Together's GPU Cloud, going from training to production requires zero artifact transfers. Customers like Creatify and Hedra have seen 1.4x–2.6x inference speedups, driven by both the platform architecture and hands-on optimization from Together's research team.
Since Together’s inception, we have powered large-scale LLM inference. We understand this space deeply and have built a product to optimize for stateless requests, token-based latency, and highly optimized serving paths.
Dedicated Container Inference is built for a different class of workloads. Over the last year, we have worked closely with teams deploying custom, non-LLM models into production. These go far beyond text-in, text-out APIs: video generation pipelines, avatar synthesis, large-scale image processing, and custom audio/media models with real business constraints.
What these teams consistently needed was not just GPUs or containers. They needed a way to run custom inference code in production without building their own job orchestration layer. Autoscaling, queuing, traffic isolation, retries, and monitoring all mattered, but none of them wanted to reimplement that stack in-house.
Dedicated Container Inference is our answer to that gap - bringing production-grade orchestration for custom models to the AI Native Cloud. You bring your container and inference logic. We handle deployment, autoscaling, queuing, and monitoring at the job level, built specifically for GPU-intensive workloads.
How is this different?
Most inference platforms optimize around a single abstraction, either a stateless endpoint or a large batch job. That works until you have real-time and batch traffic, customer tiers, and sudden demand spikes at the same time.
Dedicated Container Inference is built around job orchestration for your container which enables:
- Multiple independent queues instead of a single FIFO stream
- Policy-driven traffic control rather than per-request priority
- Isolation between batch, real-time, and untrusted traffic
- Predictable behavior during spikes without over-provisioning
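To make the contrast with a single FIFO stream concrete, here is a minimal sketch of policy-driven dispatch across independent queues. It is an illustration only, not Together's implementation: the `Dispatcher` and `JobQueue` names, the queue names, and the weights are all hypothetical, and the selection rule is plain smooth weighted round-robin, chosen because it is a simple, well-known way to share workers by policy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobQueue:
    """One independent FIFO queue with a policy weight (hypothetical type)."""
    name: str
    weight: int
    jobs: deque = field(default_factory=deque)
    credit: int = 0  # running balance used by smooth weighted round-robin


class Dispatcher:
    """Shares workers across queues in proportion to their weights."""

    def __init__(self, weights: dict[str, int]):
        self.queues = {name: JobQueue(name, w) for name, w in weights.items()}

    def submit(self, queue_name: str, job: str) -> None:
        self.queues[queue_name].jobs.append(job)

    def next_job(self):
        """Return (queue_name, job) for the next job to run, or None if idle."""
        ready = [q for q in self.queues.values() if q.jobs]
        if not ready:
            return None
        total = sum(q.weight for q in ready)
        for q in ready:
            q.credit += q.weight
        # Highest credit wins; ties go to the queue declared first.
        chosen = max(ready, key=lambda q: q.credit)
        chosen.credit -= total
        return chosen.name, chosen.jobs.popleft()


# Assumed policy: real-time traffic gets three dispatch slots per batch slot.
dispatcher = Dispatcher({"realtime": 3, "batch": 1})
for i in range(4):
    dispatcher.submit("realtime", f"rt-{i}")
    dispatcher.submit("batch", f"batch-{i}")

print([dispatcher.next_job()[0] for _ in range(4)])
```

With both queues backlogged, batch jobs still make progress, but they cannot crowd out real-time jobs, which is the isolation property the list above describes. The policy lives in the weights rather than in a priority attached to each request.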
The difference, though, is not just at inference time. Together is an end-to-end training-to-inference platform. Models trained on Together’s GPU Cloud can be deployed directly as Dedicated Containers without any artifact transfers or additional work. For teams building custom models, this tight loop reduces operational overhead and makes it easier to move from training to production without introducing new failure modes.
Architecture at a glance
Dedicated Container Inference is built on a container-based deployment framework where jobs and queues are first-class concepts. Instead of forcing inference into a single request-response shape, we treat your container as the unit of execution and manage everything around it.
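When queues are first-class, queue state becomes a natural signal for sizing the replica pool. The sketch below shows one simple way such a rule could work, under stated assumptions: the `desired_replicas` function, its parameters, and the default bounds are hypothetical, and the formula (enough replicas to drain the current backlog within a target wait time) is a generic heuristic rather than the platform's actual autoscaling policy.

```python
import math


def desired_replicas(
    queue_depth: int,
    avg_job_seconds: float,
    target_wait_seconds: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Replicas needed to drain the backlog within the target wait time.

    Hypothetical helper: every name and default here is an assumption
    made for illustration.
    """
    if target_wait_seconds <= 0:
        raise ValueError("target_wait_seconds must be positive")
    # Total GPU-seconds of work currently waiting in the queue.
    backlog_seconds = queue_depth * avg_job_seconds
    needed = math.ceil(backlog_seconds / target_wait_seconds)
    # Clamp to the configured bounds so spikes cannot scale without limit.
    return max(min_replicas, min(max_replicas, needed))


# 12 queued jobs at about 30 s each, with a 60 s wait target.
print(desired_replicas(queue_depth=12, avg_job_seconds=30.0, target_wait_seconds=60.0))
```

For long-running media jobs, a rule based on backlog tends to track demand more directly than raw utilization, since one queued video render can represent minutes of work.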
Figure 1: Dedicated Container Inference architecture
At a high level:
- You package your model as a Docker container. The container includes your runtime, dependencies, and inference code. You decide how inference runs and what libraries you use.
- We deploy and manage that container on GPU infrastructure. Together provisions GPUs, launches replicas, and handles networking, health checks, and monitoring. You do not manage clusters or nodes directly. For large models that require multiple GPUs, we provide built-in support for distributed inference via torchrun.
- Volume mounts for model weights. Rebuilding a 50GB container every time you update weights is slow and expensive. With volume mounts, you upload weights once and attach them at deploy time.
- Inference runs as jobs. Requests are queued and executed by workers pulled from your deployment. This supports long-running jobs, batch workloads, and mixed traffic patterns.
- Autoscaling driven by queue depth or metric of choice. Scale capacity up or down based on utilization, queue length, weighted queue priority, target wait time, or job features like video length.
- Traffic policies are explicit. You can define multiple queues and control priority by customer tier, use case, or SLA. Batch workloads do not interfere with real-time requests, and paid users are protected during spikes.
- Observability is built in. Metrics, logs, and job state are available out of the box for monitoring.
Performance that boosts production workloads
For generative media workloads, small improvements in inference speed compound quickly into large cost and latency gains. With Dedicated Container Inference, teams benefit from our research pipeline, which ranges from automatic kernel optimizations to hands-on profiling and tuning for workload-specific performance improvements.
"Infrastructure costs can kill an AI company as they scale.
Together's Dedicated Container Inference solved two critical problems for us: handling unpredictable viral traffic without over-provisioning, and taking our already-competitive model performance to the next level.
Their research team achieved significant lossless speedups that directly improved our unit economics—without sacrificing quality. They didn't just provide GPUs; they partnered with us to make our inference more efficient at…