What does this model signal mean?

Baidu (ERNIE) published baidu/NAVA. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 96 HF downloads · joint audio-video generation model from Baidu. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Baidu (ERNIE) Model: baidu/NAVA

Captured source

Hugging Face/huggingface.co/baidu/NAVA

baidu/NAVA model card

published May 29, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task text-to-video · license apache-2.0 · library custom · downloads 96 · likes 125

NAVA — Native Audio-Visual Alignment for Generation

State-of-the-art audio-visual synchronization with only 6.3 B parameters.

ERNIE Team · Baidu Inc. · arXiv 2026

---

TL;DR

NAVA is a 6.3 B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt — including multi-speaker speech with reference-timbre control and image-conditioned continuations.

Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets a new state of the art on Sync-C, Sync-D, video quality, and audio WER while using roughly 1.6× to 5× fewer parameters than the open-source baselines it is compared against (10 B to 32 B versus 6.3 B).

> Highlights
> - 720p 1-min Fast Generation: 720p synchronized audio-video in ~1 minute via 8-GPU Ulysses sequence parallel.
> - Dual-Channel Audio: stereo audio (scene + speech) jointly denoised with video, no post-hoc vocoder alignment.
> - Precise Multi-Timbre Control: reference WAVs bound to ... speech spans for per-speaker voice identity.
> - Language-Described Camera Control: shot composition, motion, and pacing directly from the prompt.
> - Multi-Resolution: landscape / portrait / square aspect ratios from the same checkpoint.

---

Model Details

Quick Facts

| Property | Value |
|---|---|
| Architecture | Align-then-Fuse MMDiT (Wan2.2 backbone) |
| Parameters | 6.3 B (backbone, joint AV) |
| Modality | Joint audio + video, text-conditioned |
| Resolution | 1280×704 (recommended) · 960×960 also supported |
| Frames / FPS | 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s |
| Audio | 25 latent tokens / sec, ≤ 10 s |
| Sampling | Flow matching · UniPC scheduler · 50 default steps |
| Precision | bf16 |
| Parallelism | Single-GPU or Ulysses sequence parallel (up to 8 GPUs) |
| Base model | Wan-AI/Wan2.2-TI2V-5B |
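
The frame counts in this table only match the stated durations if they are read as latent frames rather than decoded video frames: 37 decoded frames at 24 fps would last about 1.5 s, not 6 s. The sketch below checks that reading. It assumes the 4× temporal compression listed for the Wan2.2 Video VAE and the usual causal-VAE convention that the first latent frame decodes to one video frame and every later one to four. Both the interpretation and the helper names are assumptions, not statements from the model card.

```python
FPS = 24                   # from the Quick Facts table
TEMPORAL_COMPRESSION = 4   # from the Wan2.2 Video VAE row (16x16x4)
AUDIO_TOKENS_PER_SEC = 25  # from the Quick Facts table


def decoded_frames(latent_frames: int) -> int:
    """Assumed causal-VAE convention: first latent -> 1 frame, later latents -> 4 frames each."""
    return (latent_frames - 1) * TEMPORAL_COMPRESSION + 1


def duration_seconds(latent_frames: int) -> float:
    """Clip length in seconds for a given number of latent frames."""
    return decoded_frames(latent_frames) / FPS


def audio_tokens(seconds: float) -> int:
    """Number of audio latent tokens needed to cover `seconds` of audio."""
    return round(seconds * AUDIO_TOKENS_PER_SEC)


for n in (37, 55, 61):
    print(n, decoded_frames(n), round(duration_seconds(n), 2))
```

Under these assumptions 37, 55, and 61 latent frames decode to 145, 217, and 241 video frames, which is about 6, 9, and 10 seconds, and the 10 s audio cap corresponds to 250 audio latent tokens.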

Architecture

NAVA instantiates *Native Audio-Visual Alignment* as an Align-then-Fuse MMDiT stack:

- Hierarchical Alignment Layers: 10 double-stream blocks. Video and audio keep separate QKV projections and FFNs but share a joint self-attention over concatenated `[video_tokens; audio_tokens]`, plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference.
- Unified Fusion Layers: 20 single-stream blocks. Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.
- Backbone hyperparameters: `dim=3072`, `ffn_dim=14336`, 24 attention heads, 30 layers (10 double + 20 single), `text_len=512`, patch size (1, 2, 2). RMSNorm on QK; cross-attention norm; ε = 1e-6.
- Positional encoding: 3D RoPE for video (temporal + height + width), 1D RoPE for audio, applied jointly inside the joint-attention path.
- Timbre-in-Context Conditioning: reference-WAV speaker embeddings (ReDimNet, 192-d) are injected through the context pathway and bound to ... speech spans, enabling per-speaker timbre control in multi-speaker scenes.
- 3D cross-modal CFG: independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (`video_align_guidance_scale`, `audio_align_guidance_scale`) keep AV synchronization tight at inference.
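
As a sanity check on the 6.3 B figure, the stated hyperparameters can be turned into a back-of-the-envelope parameter count. This is only a rough sketch: it assumes a plain two-matrix FFN, counts only the attention and FFN weight matrices (no biases, norms, modulation, or embedding layers), and treats the number of text cross-attention paths per double-stream block as unknown (one shared or one per modality). None of these details are confirmed by the model card.

```python
DIM = 3072        # from the backbone hyperparameters
FFN_DIM = 14336
N_DOUBLE = 10     # double-stream alignment blocks
N_SINGLE = 20     # single-stream fusion blocks


def attention_params(dim: int) -> int:
    # Q, K, V and output projections, each dim x dim (biases ignored).
    return 4 * dim * dim


def ffn_params(dim: int, ffn_dim: int) -> int:
    # Assumption: non-gated FFN with two matrices, dim -> ffn_dim -> dim.
    return 2 * dim * ffn_dim


def single_block_params() -> int:
    # Shared self-attention + one text cross-attention + shared FFN.
    return 2 * attention_params(DIM) + ffn_params(DIM, FFN_DIM)


def double_block_params(cross_attn_paths: int) -> int:
    # Separate self-attention projections and FFN for video and for audio,
    # plus an assumed number of text cross-attention paths (1 or 2).
    per_modality = attention_params(DIM) + ffn_params(DIM, FFN_DIM)
    return 2 * per_modality + cross_attn_paths * attention_params(DIM)


def backbone_estimate(cross_attn_paths: int) -> int:
    return (N_DOUBLE * double_block_params(cross_attn_paths)
            + N_SINGLE * single_block_params())


low, high = backbone_estimate(1), backbone_estimate(2)
print(f"{low / 1e9:.2f} B to {high / 1e9:.2f} B")
```

The two variants come out at roughly 6.17 B and 6.54 B, which brackets the stated 6.3 B. The layer layout is therefore at least consistent with the headline parameter count, even though the exact breakdown remains a guess.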

What's Different from Existing Open-Source AV Models

| Design axis | Typical baselines | NAVA |
|---|---|---|
| Stream layout | Dual-tower (post-hoc align) or fully unified tri-modal | Align-then-Fuse: alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | Timbre-in-Context via reference WAVs |
| Param budget | 10 B – 32 B | 6.3 B |

Components Shipped Alongside the Backbone

| Component | Description | Size |
|---|---|---|
| WanAVModel (backbone) | MMDiT, joint AV attention | 6.3 B |
| Wan2.2 Video VAE | Causal 3D ConvNet · 16×16×4 spatial-temporal compression · 48 latent channels | 2.7 GB |
| LTX Audio VAE + Vocoder | 128 latent channels · 25 tokens/sec · built-in waveform decoder | 348 MB |
| umt5-xxl Text Encoder | T5 · 4096-d embeddings | 11 GB |
| ReDimNet | Speaker embedding · 192-d | ~50 MB |
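
To get a feel for what the joint attention has to process, the compression factors above can be turned into a token count. This is a rough sketch under three assumptions: the 37-frame setting counts latent frames, patchification divides the latent grid by the (1, 2, 2) patch size given in the architecture notes, and each audio latent becomes one token. The function names are illustrative.

```python
def video_tokens(width: int, height: int, latent_frames: int,
                 spatial: int = 16, patch: tuple = (1, 2, 2)) -> int:
    """Video tokens after 16x spatial VAE compression and (t, h, w) patchification."""
    pt, ph, pw = patch
    lat_h, lat_w = height // spatial, width // spatial
    return (latent_frames // pt) * (lat_h // ph) * (lat_w // pw)


def audio_tokens(seconds: float, tokens_per_sec: int = 25) -> int:
    """Audio tokens at the stated 25 latent tokens per second."""
    return round(seconds * tokens_per_sec)


v = video_tokens(1280, 704, latent_frames=37)
a = audio_tokens(6)
print(v, a, v + a)
```

For the recommended 1280×704 setting this gives 32,560 video tokens against 150 audio tokens for a 6 s clip, so under these assumptions the joint sequence is dominated by video by more than two orders of magnitude.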

---

Evaluation

Table 1 — Verse-Bench (general AV capability)

NAVA achieves the best AV synchronization (Sync-C / Sync-D), video quality, and audio WER with the smallest parameter budget; it ranks second on IB and FD and third on PQ.

| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ovi 1.1 | 10 B | 720p | <u>7.4839</u> | 7.9791 | 0.199 | <u>0.636</u> | <u>0.102</u> | 5.8432 | 0.9418 |
| MOVA | A18B (32 B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | **7.2331** | 0.9222 |
| Davinci | 15 B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 |
| LTX 2.3 | 19 B | 512p | 7.2476 | <u>7.6902</u> | **0.337** | 0.576 | 0.106 | <u>6.9459</u> | **0.8287** |
| NAVA (ours) | 6.3 B | 720p | **7.7914** | **7.5655** | <u>0.313</u> | **0.659** | **0.099** | 6.8609 | <u>0.8328</u> |

↑ higher is better · ↓ lower is better · bold = best · underline = 2nd best.
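
The best and second-best claims can be re-derived from the numbers in Table 1. The short script below copies the table values and ranks the models per metric, honoring the ↑/↓ direction. The data structure and function names are illustrative and not part of the NAVA repository.

```python
MODELS = ["Ovi 1.1", "MOVA", "Davinci", "LTX 2.3", "NAVA"]

# metric -> (higher_is_better, values in MODELS order), copied from Table 1
METRICS = {
    "Sync-C": (True, [7.4839, 7.2888, 7.1487, 7.2476, 7.7914]),
    "Sync-D": (False, [7.9791, 7.808, 7.8158, 7.6902, 7.5655]),
    "IB": (True, [0.199, 0.269, 0.269, 0.337, 0.313]),
    "Video Quality": (True, [0.636, 0.603, 0.600, 0.576, 0.659]),
    "WER": (False, [0.102, 0.126, 0.151, 0.106, 0.099]),
    "PQ": (True, [5.8432, 7.2331, 5.9559, 6.9459, 6.8609]),
    "FD": (False, [0.9418, 0.9222, 0.9307, 0.8287, 0.8328]),
}


def ranking(metric: str) -> list:
    """Model names ordered from best to worst for the given metric."""
    higher_is_better, values = METRICS[metric]
    pairs = sorted(zip(values, MODELS), key=lambda p: p[0], reverse=higher_is_better)
    return [name for _, name in pairs]


def rank_of(model: str, metric: str) -> int:
    """1-based rank of `model` on `metric`."""
    return ranking(metric).index(model) + 1


nava_ranks = {metric: rank_of("NAVA", metric) for metric in METRICS}
print(nava_ranks)
```

On these numbers NAVA ranks first on Sync-C, Sync-D, Video Quality, and WER, second on IB and FD, and third on PQ.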

Table 2 — Seed-TTS-eval (speech quality)

Among joint AV models, NAVA delivers speech quality close to dedicated audio-only systems. Audio-only rows are listed *for reference*; they are not directly comparable.

| Category | Model | WER ↓ | Speaker Similarity ↑ |
|---|---|---|---|
| Audio-Only *(reference)* | CosyVoice | 4.29 | 60.9 |
| Audio-Only *(reference)* | Qwen2.5-Omni | 2.72 | 63.2 |
| Audio-Video Joint | DreamID-Omni | 33.44 | 34.1 |
| Audio-Video Joint | NAVA (ours) | 5.81 | 62.4 |

---

How to Use

> TL;DR command. After §1 setup is complete:
> ```bash
> bash scripts/inference.sh         # General T2AV
> bash scripts/inference_timbre.sh  # I2AV + timbre control
> ```
> Outputs land under `eval_results/`.
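
If the two scripts need to be driven from Python, for example inside a batch job, a thin wrapper along these lines may be enough. It is a minimal sketch: only the script paths and the `eval_results/` output folder come from the model card, while `build_command`, `list_outputs`, and `run` are hypothetical helper names and the scripts are assumed to take no required arguments.

```python
import subprocess
from pathlib import Path

# Script paths as given in the model card.
SCRIPTS = {
    "t2av": "scripts/inference.sh",           # general text-to-audio-video
    "timbre": "scripts/inference_timbre.sh",  # image-conditioned + timbre control
}


def build_command(mode: str) -> list:
    """Shell command for the requested mode (hypothetical helper)."""
    return ["bash", SCRIPTS[mode]]


def list_outputs(repo_root, output_dir: str = "eval_results") -> list:
    """All files currently under the output folder, sorted by path."""
    root = Path(repo_root) / output_dir
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


def run(mode: str, repo_root: str = ".") -> list:
    """Run one inference script from the repo root and return the output files."""
    subprocess.run(build_command(mode), cwd=repo_root, check=True)
    return list_outputs(repo_root)
```

Calling `run("t2av", repo_root="NAVA")` would then execute the general script inside the cloned repository and return whatever files it wrote under `eval_results/`.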

1 · Setup (once)

```bash
git clone https://github.com/ernie-research/NAVA && cd NAVA

# Python deps
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn...
```
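
Before launching inference it can help to confirm that the packages named above are importable. The check below is a small sketch using only the standard library. The mapping from pip names to import names (for example `PyYAML` to `yaml` and `flash-attn` to `flash_attn`) reflects the usual module names for these packages, and `missing_packages` is an illustrative helper, not part of the NAVA repository.

```python
import importlib.util

# pip package name -> top-level import name
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "diffusers": "diffusers",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "safetensors": "safetensors",
    "einops": "einops",
    "scipy": "scipy",
    "PyYAML": "yaml",
    "tqdm": "tqdm",
    "sentencepiece": "sentencepiece",
    "flash-attn": "flash_attn",
}


def missing_packages(required: dict = REQUIRED) -> list:
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]


print("missing:", missing_packages())
```

An empty list means every dependency resolves. The check says nothing about versions or CUDA compatibility, and the `flash-attn` line in the excerpt is truncated, so any extra install options it carries are not reflected here.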

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low traction, routine model release