Together AI · published May 29, 2026

How Together AI built the world’s fastest speech-to-text stack

Inference

Published 5/29/2026

NVIDIA TensorRT multi-profile engines, conditional NVIDIA CUDA graphs, evented I/O, shared memory, and the Python GC fix behind Together’s ASR latency results.

Authors

Sebastien Beurnier

[Figure: Artificial Analysis reported speed factor (input audio seconds transcribed per second). Higher is better.]

Modality matters

A 1M-token text prompt can fit the entire Harry Potter series and still only weigh around 5 MB. That scale sounds enormous, but the input itself is compact. Text also arrives almost ready for inference: tokenize it, batch it, and move it through the model.

Audio changes the shape of the problem. The same Harry Potter corpus as audiobooks is 5 to 10 GB, roughly three orders of magnitude larger than the text. Before any of it reaches the GPU, the server has to decode the container, resample, filter noise, run VAD, segment speech, and compute audio features.

The model side flips too. LLMs these days have hundreds of billions or trillions of parameters, so serving work naturally concentrates inside the GPU: quantization, KV cache, attention kernels, batching, and parallelism. Speech-to-text models are much smaller, often in the hundreds of millions to low billions of parameters, so the surrounding data path matters much more.

That makes ASR serving a full-path systems problem spanning GPU execution, CPU preprocessing, memory movement, transport, connection scheduling, and runtime behavior. The same stack also has to serve two different regimes: offline transcription, where throughput matters most, and streaming transcription, where latency and jitter dominate.

Together’s ASR stack serves the two lowest-latency speech-to-text models ranked by Artificial Analysis: NVIDIA’s Parakeet-TDT 0.6B v3 and OpenAI’s Whisper Large v3. The faster of the two, NVIDIA Parakeet-TDT 0.6B v3, can transcribe roughly 20 hours of speech, about the runtime of the Harry Potter film franchise, in under 10 seconds.

The rest of this post breaks down the production changes behind that result: TensorRT profiles for real audio shapes, GPU-side decoder control flow, lower-copy CPU paths, evented streaming I/O, and runtime GC control.
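The two headline figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function names are made up for illustration, decimal units (1 GB = 1000 MB) are assumed, and "under 10 seconds" is treated as exactly 10 seconds, so the speed factor is a lower bound.

```python
# Figures quoted in the text; decimal units (1 GB = 1000 MB) are an assumption.
TEXT_MB = 5
AUDIO_GB_RANGE = (5, 10)


def size_ratio(audio_gb, text_mb=TEXT_MB):
    """How many times larger the audio corpus is than the text corpus."""
    return audio_gb * 1000 / text_mb


def speed_factor(audio_seconds, wall_seconds):
    """Input audio seconds transcribed per second of wall-clock time."""
    return audio_seconds / wall_seconds


ratios = [size_ratio(gb) for gb in AUDIO_GB_RANGE]
# 20 hours of speech in (at most) 10 seconds of wall-clock time.
parakeet_lower_bound = speed_factor(20 * 3600, 10)

print(ratios)                # [1000.0, 2000.0]
print(parakeet_lower_bound)  # 7200.0
```

A ratio of 1000 to 2000 is what "roughly three orders of magnitude" refers to, and 20 hours in under 10 seconds corresponds to a speed factor of at least 7200.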
Compile the encoder for real audio shapes

Parakeet uses an encoder-decoder architecture, and roughly 95% of its weights sit in the encoder. The encoder takes a variable-length speech segment and produces acoustic frames for the decoder, which made it the first place to optimize.

Audio inputs span a wide range of lengths, from a 200 ms streaming packet to 30 seconds of uninterrupted speech. A kernel plan tuned for one input shape can be substantially slower at another, so the engine needs to know the shape distribution it will see at compile time.

Before TensorRT, we were already using an optimized PyTorch path with torch.compile and CUDA graphs, tuned across the same shape profiles. That gave us a strong baseline: profile-aware execution without leaving the PyTorch stack.

TensorRT gave us a faster encoder path for production. It builds an optimized execution plan ahead of time, fusing kernels where possible, tuning memory layouts, and benchmarking kernel variants for the shape ranges we expect to serve.

The important detail is profile tuning. A single engine tuned only for the largest input shape forces shorter audio segments into a padded path, which is especially costly for streaming chunks and short utterances. A multi-profile TensorRT engine lets us keep one copy of the encoder weights in memory while selecting the right optimization profile per request.

The memory savings were modest, roughly 6 GB to 5 GB. The larger win was avoiding bad shape matches and moving from optimized PyTorch to TensorRT for tuned profiles. In the small-input regime, profile-aware TensorRT can be several times faster than sending those requests through a large padded profile.

With the encoder optimized, the decoder loop became the next bottleneck.

Remove the CPU from the decoder loop

Parakeet’s decoder iterates over the encoder’s acoustic frames and emits either a token or a BLANK for frames that do not advance the transcript.
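As an aside on the encoder profiles described above, per-request profile selection can be sketched in plain Python. This does not call TensorRT, and the bucket boundaries are hypothetical values chosen only to span the 200 ms to 30 s range mentioned in the text; the actual profiles used in production are not stated. In TensorRT each optimization profile carries a min/opt/max shape range, so the sketch simply picks the smallest bucket whose maximum covers the request.

```python
from bisect import bisect_left

# Hypothetical profile upper bounds, in seconds of audio (assumption, not from the post).
PROFILE_MAX_SECONDS = [0.5, 2.0, 8.0, 30.0]


def select_profile(duration_s):
    """Return the index of the smallest profile whose max length covers the input."""
    index = bisect_left(PROFILE_MAX_SECONDS, duration_s)
    if index == len(PROFILE_MAX_SECONDS):
        raise ValueError(f"{duration_s}s exceeds the largest profile")
    return index


def padding_fraction(duration_s, padded_to_s):
    """Fraction of a padded input that is padding rather than real audio."""
    return 1.0 - duration_s / padded_to_s


streaming_chunk = 0.2  # the 200 ms streaming packet from the text
print(select_profile(streaming_chunk))          # 0
print(select_profile(30.0))                     # 3
print(padding_fraction(streaming_chunk, 30.0))  # more than 0.99
```

Under these assumed buckets, a 200 ms chunk sent through a single 30-second shape would be more than 99% padding, which is the bad shape match that per-request profile selection avoids.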
The decoder loop is essentially:

    state = init()
    for frame in encoder_output:
        token = predict(frame, state)
        if token != BLANK:
            emit(token)
            state = update(state, token)

When profiling, we found that predict and update were both fast. The per-iteration GPU work was measured in microseconds. The expensive line was the branch:

    if token != BLANK:

That branch requires the CPU to read the token back from GPU memory to decide which path to take. This host sync prevents the decode loop from being captured as a single CUDA graph and forces every iteration to round-trip through Python. The GPU does a few microseconds of work, waits for the CPU, launches the next kernel, and repeats that pattern thousands of times per request.

Conditional CUDA graph nodes moved that branch onto the GPU. A small device-side kernel evaluates the condition and tells the CUDA runtime whether to enter the token-emission and state-update subgraph. The branch resolves without leaving the GPU, so the entire decoder loop (counter, condition, emit, and state update) can be captured and launched as one CUDA graph. The CPU leaves the decoder’s inner loop, and the result is a 2 to 3x faster decoder.
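To make the role of that branch concrete, here is a toy, CPU-only sketch. It is not the conditional CUDA graph implementation described above and uses no GPU APIs. It swaps in a simpler, related idea, predication, where the condition is computed as data (0 or 1) and applied arithmetically so the loop body runs the same steps for every frame. The `predict` stand-in, the integer state, and the `BLANK = 0` id are all assumptions made for illustration.

```python
BLANK = 0  # hypothetical id for the blank symbol


def predict(frame, state):
    """Toy stand-in for the model: a frame that repeats the last token is BLANK."""
    return BLANK if frame == state else frame


def decode_branching(frames):
    """Reference loop: the host inspects every token to choose a path."""
    state, out = BLANK, []
    for frame in frames:
        token = predict(frame, state)
        if token != BLANK:  # data-dependent branch (a host sync when token lives on a GPU)
            out.append(token)
            state = token
    return out


def decode_predicated(frames):
    """Same result, with no branch on the token inside the loop body."""
    state, n = BLANK, 0
    out = [BLANK] * len(frames)  # preallocated: at most one token per frame
    for frame in frames:
        token = predict(frame, state)
        keep = int(token != BLANK)  # 0 or 1, used as data rather than control flow
        out[n] = token              # always write; overwritten later if keep == 0
        n += keep                   # advance the write cursor only for real tokens
        state = keep * token + (1 - keep) * state
    return out[:n]


frames = [3, 3, 0, 5, 5, 5, 0, 2]
print(decode_branching(frames))   # [3, 5, 2]
print(decode_predicated(frames))  # [3, 5, 2]
```

Both functions produce the same transcript. The second shows why the decision does not have to be made by the host: once the condition is just another value, every step of the loop can stay on the device that produced it.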

Stop copying audio bytes

Once the encoder and decoder were running well, the remaining latency came from the CPU path around the model. That is where most ASR code we’ve audited spends its latency budget: redundant copies, unnecessary process hops on the hot path, and single-threaded functions that would benefit from higher parallelism.

The first lever was collapsing unnecessary process boundaries. Audio preprocessing, whether file decoding, resampling, voice activity detection (VAD), feature…
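As a generic illustration of what a redundant copy looks like in Python (a minimal sketch, not the authors' implementation), compare slicing a `bytes` buffer, which allocates a new object per frame, with slicing a `memoryview`, which only creates a window onto the original buffer. The 640-byte frame size assumes 20 ms of 16 kHz, 16-bit mono PCM, which is an assumption made for the example.

```python
FRAME_BYTES = 640  # assumed: 20 ms of 16 kHz, 16-bit mono PCM


def frames_copying(pcm, frame_bytes=FRAME_BYTES):
    """Each slice of a bytes object allocates and copies a new bytes object."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


def frames_zero_copy(pcm, frame_bytes=FRAME_BYTES):
    """Each slice of a memoryview is a window onto the same underlying buffer."""
    view = memoryview(pcm)
    return [view[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


pcm = bytes(range(256)) * 10  # 2560 bytes of stand-in audio data
copies = frames_copying(pcm)
views = frames_zero_copy(pcm)

print(len(copies), len(views))          # 4 4
print(all(v.obj is pcm for v in views)) # True
```

The views carry the same bytes as the copies, but every one of them still points at the original buffer, so segmenting the audio costs no additional memory traffic.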
