baidu/NAVA
NAVA: Native Audio-Visual Alignment for Generation
State-of-the-art audio-visual synchronization with only 6.3 B parameters.
ERNIE Team · Baidu Inc. · arXiv 2026
⭐ If you find this model useful, please consider giving our GitHub repo a star! ⭐
📖 Chinese README (中文版)
---
TL;DR
NAVA is a 6.3 B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt — including multi-speaker speech with reference-timbre control and image-conditioned continuations.
Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets new SOTA on Sync-C / Sync-D / video quality / audio WER while using 2× to 5× fewer parameters than open-source baselines.
> Highlights
>
> - 720p 1-min Fast Generation: 720p synchronized audio-video in ~1 minute via 8-GPU Ulysses sequence parallel.
> - Dual-Channel Audio: stereo audio (scene + speech) jointly denoised with video, no post-hoc vocoder alignment.
> - Precise Multi-Timbre Control: reference WAVs bound to ... speech spans for per-speaker voice identity.
> - Language-Described Camera Control: shot composition, motion, and pacing directly from the prompt.
> - Multi-Resolution: landscape / portrait / square aspect ratios from the same checkpoint.
---
Model Details
Quick Facts
| | |
|---|---|
| Architecture | Align-then-Fuse MMDiT (Wan2.2 backbone) |
| Parameters | 6.3 B (backbone, joint AV) |
| Modality | Joint audio + video, text-conditioned |
| Resolution | 1280×704 (recommended) · 960×960 also supported |
| Frames / FPS | 37 frames @ 24 fps ≈ 6 s · 55–61 frames ≈ 9–10 s |
| Audio | 25 latent tokens / sec, ≤ 10 s |
| Sampling | Flow matching · UniPC scheduler · 50 default steps |
| Precision | bf16 |
| Parallelism | Single-GPU or Ulysses sequence parallel (up to 8 GPUs) |
| Base model | Wan-AI/Wan2.2-TI2V-5B |
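The frame counts and durations in the table only line up if the frame counts are read as latent frames. The following is a minimal sketch of that arithmetic, not code from the NAVA repository. It assumes the Wan2.2 VAE's 4× causal temporal compression (so n latent frames decode to 4(n − 1) + 1 pixel frames), its 16× spatial compression, and the `(1, 2, 2)` patch size listed under Architecture. All function names are illustrative.

```python
import math

FPS = 24                    # from the Quick Facts table
AUDIO_TOKENS_PER_SEC = 25   # from the Quick Facts table


def pixel_frames(latent_frames: int, temporal_stride: int = 4) -> int:
    """Assumed causal-VAE rule: the first latent frame decodes to one pixel
    frame, every later latent frame decodes to `temporal_stride` frames."""
    return (latent_frames - 1) * temporal_stride + 1


def duration_seconds(latent_frames: int) -> float:
    """Clip length implied by a latent frame count at 24 fps."""
    return pixel_frames(latent_frames) / FPS


def video_tokens(width: int, height: int, latent_frames: int,
                 spatial_stride: int = 16, patch=(1, 2, 2)) -> int:
    """Transformer tokens after VAE downsampling and patchification."""
    pt, ph, pw = patch
    lat_h, lat_w = height // spatial_stride, width // spatial_stride
    return (latent_frames // pt) * (lat_h // ph) * (lat_w // pw)


def audio_tokens(seconds: float) -> int:
    """Audio latent tokens for a clip of the given length."""
    return math.ceil(seconds * AUDIO_TOKENS_PER_SEC)


def heads_per_rank(num_heads: int = 24, world_size: int = 8) -> int:
    """Ulysses-style sequence parallelism splits attention heads across
    ranks during attention, so the head count must divide evenly."""
    if num_heads % world_size:
        raise ValueError("num_heads must be divisible by world_size")
    return num_heads // world_size


for n in (37, 55, 61):
    print(n, pixel_frames(n), round(duration_seconds(n), 2))
# 37 145 6.04
# 55 217 9.04
# 61 241 10.04
print(video_tokens(1280, 704, 37))  # 32560
print(heads_per_rank())             # 3
```

Under these assumptions, a 6 s clip at 1280×704 is dominated by video tokens (tens of thousands) with only a few hundred audio tokens alongside, which is why long-sequence attention parallelism matters for the 720p setting.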
Architecture
NAVA instantiates *Native Audio-Visual Alignment* as an Align-then-Fuse MMDiT stack:
- Hierarchical Alignment Layers: 10 double-stream blocks. Video and audio keep separate QKV projections and FFNs but share a joint self-attention over concatenated `[video_tokens; audio_tokens]`, plus dedicated cross-attention to text. This builds an alignment space where AV correspondence is learned without semantic context interference.
- Unified Fusion Layers: 20 single-stream blocks. Video and audio share QKV/FFN; a unified joint attention treats all tokens as one stream, with a single text cross-attention path. This is where context-conditioned denoising happens.
- Backbone hyperparameters. `dim=3072`, `ffn_dim=14336`, 24 attention heads, 30 layers (10 double + 20 single), `text_len=512`, patch size `(1, 2, 2)`. RMSNorm on QK; cross-attention norm; ε = 1e-6.
- Positional encoding. 3D RoPE for video (temporal + height + width), 1D RoPE for audio, applied jointly inside the joint-attention path.
- Timbre-in-Context Conditioning. Reference-WAV speaker embeddings (ReDimNet, 192-d) are injected through the context pathway and bound to ... speech spans, enabling per-speaker timbre control in multi-speaker scenes.
- 3D cross-modal CFG. Independent classifier-free guidance scales for video, audio, and the cross-modal alignment direction (`video_align_guidance_scale`, `audio_align_guidance_scale`) keep AV synchronization tight at inference.
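The difference between the double-stream and single-stream blocks can be illustrated with a toy, dependency-free sketch. This is not the NAVA implementation: it uses a single attention head and omits RoPE, QK RMSNorm, text cross-attention, and the FFNs, and every name in it is hypothetical. It only shows the routing described above: per-modality projections (double-stream) or shared projections (single-stream), followed by one attention pass over the concatenated `[video_tokens; audio_tokens]` sequence.

```python
import math

Matrix = list[list[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply row vectors x (n x d_in) by a weight matrix w (d_in x d_out)."""
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention over one token sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def joint_attention(video: Matrix, audio: Matrix, video_w: dict, audio_w: dict):
    """Project each modality, attend over [video; audio], then split back."""
    q = matmul(video, video_w["q"]) + matmul(audio, audio_w["q"])
    k = matmul(video, video_w["k"]) + matmul(audio, audio_w["k"])
    v = matmul(video, video_w["v"]) + matmul(audio, audio_w["v"])
    mixed = attention(q, k, v)
    return mixed[:len(video)], mixed[len(video):]


def double_stream(video, audio, video_w, audio_w):
    """Alignment-style block: separate projection weights per modality."""
    return joint_attention(video, audio, video_w, audio_w)


def single_stream(video, audio, shared_w):
    """Fusion-style block: one set of projection weights for all tokens."""
    return joint_attention(video, audio, shared_w, shared_w)


identity = [[1.0, 0.0], [0.0, 1.0]]
shared = {"q": identity, "k": identity, "v": identity}
video = [[1.0, 0.0], [0.0, 1.0]]   # two toy video tokens
audio = [[1.0, 1.0]]               # one toy audio token
video_out, audio_out = single_stream(video, audio, shared)
print(len(video_out), len(audio_out))  # 2 1
```

In both variants every video token can attend to every audio token and vice versa; the only difference in this sketch is whether the two modalities share projection weights.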
What's Different from Existing Open-Source AV Models
| Design axis | Typical baselines | NAVA |
|---|---|---|
| Stream layout | Dual-tower (post-hoc align) or fully unified tri-modal | Align-then-Fuse: alignment space first, context fused after |
| Speech control | Caption-only, no per-speaker timbre | Timbre-in-Context via reference WAVs |
| Param budget | 10 B – 32 B | 6.3 B |
Components Shipped Alongside the Backbone
| Component | Description | Size |
|---|---|---|
| WanAVModel (backbone) | MMDiT, joint AV attention | 6.3 B |
| Wan2.2 Video VAE | Causal 3D ConvNet · 16×16×4 spatial-temporal compression · 48 latent channels | 2.7 GB |
| LTX Audio VAE + Vocoder | 128 latent channels · 25 tokens/sec · built-in waveform decoder | 348 MB |
| umt5-xxl Text Encoder | T5 · 4096-d embeddings | 11 GB |
| ReDimNet | Speaker embedding · 192-d | ~50 MB |
---
Evaluation
Table 1: Verse-Bench (general AV capability)
NAVA achieves the best AV synchronization (Sync-C / Sync-D), video quality, and audio WER, with the smallest parameter budget.
| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ovi 1.1 | 10 B | 720p | <u>7.4839</u> | 7.9791 | 0.199 | <u>0.636</u> | <u>0.102</u> | 5.8432 | 0.9418 |
| MOVA | A18B (32 B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | **7.2331** | 0.9222 |
| Davinci | 15 B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 |
| LTX 2.3 | 19 B | 512p | 7.2476 | <u>7.6902</u> | **0.337** | 0.576 | 0.106 | <u>6.9459</u> | **0.8287** |
| NAVA (ours) | 6.3 B | 720p | **7.7914** | **7.5655** | <u>0.313</u> | **0.659** | **0.099** | 6.8609 | <u>0.8328</u> |
↑ higher is better · ↓ lower is better · bold = best · underline = 2nd best.
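The best and second-best entries per metric follow mechanically from the table values and the arrow directions. A small sketch that recomputes the ranking, with the numbers copied from Table 1 (the helper name `rank` is illustrative):

```python
# Metric name -> True if higher is better, in table column order.
HIGHER_IS_BETTER = {
    "Sync-C": True, "Sync-D": False, "IB": True, "Video Quality": True,
    "WER": False, "PQ": True, "FD": False,
}

# Values copied from Table 1, in the same column order as above.
ROWS = {
    "Ovi 1.1":     (7.4839, 7.9791, 0.199, 0.636, 0.102, 5.8432, 0.9418),
    "MOVA":        (7.2888, 7.808,  0.269, 0.603, 0.126, 7.2331, 0.9222),
    "Davinci":     (7.1487, 7.8158, 0.269, 0.600, 0.151, 5.9559, 0.9307),
    "LTX 2.3":     (7.2476, 7.6902, 0.337, 0.576, 0.106, 6.9459, 0.8287),
    "NAVA (ours)": (7.7914, 7.5655, 0.313, 0.659, 0.099, 6.8609, 0.8328),
}


def rank(metric: str) -> list[str]:
    """Return model names ordered from best to worst on one metric."""
    col = list(HIGHER_IS_BETTER).index(metric)
    return sorted(ROWS, key=lambda name: ROWS[name][col],
                  reverse=HIGHER_IS_BETTER[metric])


for metric in HIGHER_IS_BETTER:
    best, second = rank(metric)[:2]
    print(f"{metric}: best = {best}, 2nd = {second}")
```

On these numbers NAVA ranks first on Sync-C, Sync-D, Video Quality, and WER, second on IB and FD, and third on PQ.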
Table 2: Seed-TTS-eval (speech quality)
Among joint AV models, NAVA delivers speech quality close to dedicated audio-only systems. Audio-only rows are listed *for reference*; they are not directly comparable.
| Category | Model | WER ↓ | Speaker Similarity ↑ |
|---|---|---|---|
| Audio-Only *(reference)* | CosyVoice | 4.29 | 60.9 |
| Audio-Only *(reference)* | Qwen2.5-Omni | 2.72 | 63.2 |
| Audio-Video Joint | DreamID-Omni | 33.44 | 34.1 |
| Audio-Video Joint | NAVA (ours) | 5.81 | 62.4 |
---
How to Use
> TL;DR command. After §1 setup is complete:
> ```bash
> bash scripts/inference.sh          # General T2AV
> bash scripts/inference_timbre.sh   # I2AV + timbre control
> ```
> Outputs land under `eval_results/`.
1 · Setup (once)
```bash
git clone https://github.com/ernie-research/NAVA && cd NAVA

# Python deps
pip install torch torchvision torchaudio
pip install diffusers transformers accelerate safetensors einops scipy PyYAML tqdm sentencepiece
pip install flash-attn…
```
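After installing, a quick check that the packages named above are importable can save a failed first run. The sketch below is an optional helper, not part of the NAVA repository. It covers only the packages visible in the commands above, and the pip-name to import-name mapping (for example `PyYAML` imports as `yaml`, `flash-attn` as `flash_attn`) is filled in from general knowledge of those packages.

```python
from importlib.util import find_spec

# pip package name -> top-level import name
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "diffusers": "diffusers",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "safetensors": "safetensors",
    "einops": "einops",
    "scipy": "scipy",
    "PyYAML": "yaml",
    "tqdm": "tqdm",
    "sentencepiece": "sentencepiece",
    "flash-attn": "flash_attn",
}


def missing_packages(required: dict = REQUIRED) -> list:
    """Return pip names whose import module cannot be located."""
    return sorted(pip_name for pip_name, module in required.items()
                  if find_spec(module) is None)


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```

`find_spec` only locates the module without importing it, so the check stays fast and does not initialize CUDA.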