Together AI · published Apr 2, 2026

Deepgram speech-to-text and voice models now available natively on Together AI

Published 4/2/2026

Production STT and TTS from Deepgram, available on Together AI Dedicated Model Inference for real-time voice agents.

Authors

Zain Hasan, Arielle Fidel, Rishabh Bhargava, Rajas Bansal, Sahil Yadav, Sebastien Beurnier, Hiral Jasani, Sonny Khan

  • Deepgram Nova-3, Nova-3 Multilingual, Flux, and Aura-2 now run natively on Together AI Dedicated Model Inference
  • Deepgram covers both ends of the voice pipeline, from transcription to synthesis, in one model lineup
  • Together AI gives teams a single production surface for real-time voice agents, with STT, LLM, and TTS on one platform
  • Enterprise controls include zero data retention, SOC 2 Type II, HIPAA-ready support, and data residency options

Real-time voice agents often fail when speech is treated as transcription rather than conversation. Getting the words right is only part of the challenge: the system also has to detect turn boundaries, handle interruptions and overlap, and respond quickly enough to keep the exchange feeling natural. When teams try to patch those gaps with endpointing logic, routing layers, and extra providers, they often add latency and operational fragility right back into the system.

Deepgram’s models are purpose-built for that layer, where transcription, turn-taking, and responsiveness have to work together in real time. Deepgram’s STT and TTS model lineup now runs natively on Together AI, the AI Native Cloud for building real-time voice agents, so teams can pair Deepgram transcription and synthesis with any LLM in the Together catalog and run the full voice pipeline on one production platform. For the broader architecture, see our real-time voice agents announcement.

“Voice agents live or die by latency, and every network hop between providers is a place where the experience breaks down. By hosting Deepgram’s STT and TTS natively on Together AI’s infrastructure, we’re giving developers production-grade transcription without the tradeoff. Fast, accurate, and co-located with the rest of the pipeline.”

  • Abe Pursell, VP of Partnerships, Deepgram
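To make the shape of that pipeline concrete, here is a minimal Python sketch of a single voice-agent turn: audio goes through STT, the transcript goes to an LLM, and the reply goes through TTS, with the time spent in each stage recorded. Everything here is an illustrative assumption. `run_voice_turn`, `VoiceTurnResult`, and the three stage callables are hypothetical names, not Together AI or Deepgram APIs, and the stand-in stages make no network calls. A real deployment would stream audio rather than process it in one block.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class VoiceTurnResult:
    transcript: str
    reply_text: str
    reply_audio: bytes
    stage_seconds: Dict[str, float] = field(default_factory=dict)


def run_voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],    # hypothetical STT stage
    generate_reply: Callable[[str], str],  # hypothetical LLM stage
    synthesize: Callable[[str], bytes],    # hypothetical TTS stage
) -> VoiceTurnResult:
    """Run one STT -> LLM -> TTS turn and record how long each stage took."""
    timings: Dict[str, float] = {}

    def timed(name, stage, argument):
        # Wall-clock time for one stage, including any network time it hides.
        start = time.perf_counter()
        output = stage(argument)
        timings[name] = time.perf_counter() - start
        return output

    transcript = timed("stt", transcribe, audio)
    reply_text = timed("llm", generate_reply, transcript)
    reply_audio = timed("tts", synthesize, reply_text)
    return VoiceTurnResult(transcript, reply_text, reply_audio, timings)


if __name__ == "__main__":
    # Stand-in stages so the sketch runs offline.
    result = run_voice_turn(
        b"\x00\x01",
        transcribe=lambda audio: "what is my balance",
        generate_reply=lambda text: f"You asked: {text}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(result.transcript)
    print(result.stage_seconds)
```

Timing each stage separately is a deliberate choice in this sketch: when the three stages sit behind different providers, the per-stage numbers are where extra network hops become visible.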

Flux: Conversational STT with turn detection

Accurate transcription is only part of the job. A voice agent also has to know when the speaker is actually finished, because if it misreads the turn, it either talks over the caller or waits too long and feels unresponsive. Flux is Deepgram’s conversational STT model for real-time agents, built not just to transcribe speech but to produce turn signals from conversational context rather than silence alone. That matters because many teams still rely on extra endpointing logic to bridge this gap, which adds complexity and makes latency harder to control. Flux simplifies that part of the stack and helps keep turn-taking more predictable in production with 250ms end-of-turn detection.

Nova-3: Production transcription for real-world audio

Production audio is messier than benchmark audio. Calls come with background noise, overlapping speakers, accents, telephony compression, and interruptions, and the model still has to return text the rest of the pipeline can trust. Nova-3 is built for those conditions, with support for vocabulary customization so teams can improve recognition of domain-specific terms without retraining. Nova-3 Multilingual extends that approach across multiple languages, which matters in deployments where callers switch languages mid-conversation.

Aura-2: Enterprise TTS for production voice agents

Aura-2 covers the synthesis side of the pipeline for business environments where clarity and consistency matter. Teams can use Deepgram STT and TTS together while keeping output stable for domain-specific terms and structured entities. That difference shows up in delivery. The voice has to stay clear, direct, and reliable when it reads structured information or specialized language back to the user. A voice that sounds fine in a demo is not enough if it starts to stumble once the interaction becomes operational.

Audio sample: Deepgram Aura-2, Thalia voice in English (0:19)

"Welcome to the show. Today we're exploring something truly fascinating — the power of voice. It's not just the words that matter. It's the feeling behind them, the quiet moments of reflection, and the clarity to handle the details when they count."

Like this: Dr. Sarah Chen, 450 Park Avenue, New York, 10022 — your confirmation is BX-4072 with a $14.99 copay.

That's a lot of detail, and every bit of it needs to land clearly. That's what a great voice can do."
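Returning to turn detection: the silence-only endpointing that the Flux section contrasts against can be sketched in a few lines of Python, which makes the tradeoff easier to see. This is not how Flux works, and nothing here is a Deepgram API. `silence_endpoint`, its parameters, and the energy values are hypothetical, chosen only for illustration. The function declares the turn over once speech has been heard and is followed by a fixed stretch of quiet frames.

```python
from typing import Iterable, Optional


def silence_endpoint(
    frame_energies: Iterable[float],
    frame_ms: int = 20,               # assumed frame length
    energy_threshold: float = 0.01,   # assumed speech/silence cutoff
    min_silence_ms: int = 500,        # assumed silence timeout
) -> Optional[int]:
    """Return the time in ms at which a silence-only endpointer ends the turn.

    The turn ends once speech has been heard and is then followed by at least
    `min_silence_ms` of consecutive frames below `energy_threshold`.
    Returns None if that never happens.
    """
    heard_speech = False
    silent_ms = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= min_silence_ms:
                return (index + 1) * frame_ms
    return None


if __name__ == "__main__":
    # 200 ms speech, 300 ms mid-sentence pause, 200 ms speech, 600 ms silence.
    frames = [0.5] * 10 + [0.0] * 15 + [0.5] * 10 + [0.0] * 30

    # Short timeout: fires during the mid-sentence pause (talks over the caller).
    print(silence_endpoint(frames, min_silence_ms=200))  # 400

    # Long timeout: survives the pause, but waits 500 ms after speech ends at 700 ms.
    print(silence_endpoint(frames, min_silence_ms=500))  # 1200
```

With silence as the only signal, a single timeout has to trade interruption risk against responsiveness, which is the gap that turn signals derived from conversational context are meant to close.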

Use cases

Contact center voice agents

Contact centers are inherently messy environments. Call quality varies, speakers overlap, interruptions are constant, and latency still has to stay low enough for natural back-and-forth. Deepgram’s models help agents stay in flow through those conditions, keeping conversations responsive and intelligible instead of letting them break down into delays, missed turns, or unclear responses.

Healthcare voice agents

Healthcare voice agents need accurate transcription of medication names, procedure terms, and clinical language, along with output that stays clear when reading the same terms back to patients. A transcription error at the start of the pipeline can surface later as a fluent but incorrect response, which is exactly the kind of failure these systems cannot afford. Nova-3 helps teams adapt recognition to clinical language, while Aura-2 keeps patient-facing output clear and consistent.

Financial services

Financial voice systems depend on precision. Account numbers, routing numbers, trade confirmations, and structured financial language need to be captured correctly the first time, because a…
