ByteDance-Seed/Triton-distributed v0.0.1-rc
Repository: ByteDance-Seed/Triton-distributed
Tag: v0.0.1-rc
Published: 2025-08-20T01:44:40Z
Prerelease: yes
Release notes:
Compiled with
- Triton v3.4
- NVSHMEM: v3.3.9
What's Changed
- feat: support mega kernel in https://github.com/ByteDance-Seed/Triton-distributed/pull/93 by @XG-zheng
- feat: support E2E MoE models like Qwen/Qwen3-235B-A22B in https://github.com/ByteDance-Seed/Triton-distributed/pull/85 by @houqi @XG-zheng @KnowingNothing @wenlei-bao @preminstrel
- feat: support GEMM+AllReduce on Hopper
- feat: GroupedGEMM+ReduceScatter supported on L20/Ampere
- feat: use NVLS ld_reduce with .acc::f32 accumulation by default for BF16/FP16 reduction, for better precision
- fix: support NVLS multimem.st in a vectorized way
- fix: resolve some hangs with cooperative_launch_grids. Closes https://github.com/ByteDance-Seed/Triton-distributed/issues/81
- fix: some bugs in AG+GroupedGEMM that could cause unexpected memory accesses
- opt: reduce AllReduce One-Shot latency to 9 us on H800x8 for very small messages. Closes https://github.com/ByteDance-Seed/Triton-distributed/issues/57
- opt: improve AllReduce Two-Shot latency by returning the symmetric buffer directly, which saves some device-to-device (d2d) copy overhead
- opt: AllReduce DoubleTree implementation is much faster but still not production-ready; better pipelining is needed
- trivial: support compiling without the CUDA toolkit and torch
- Enable rocSHMEM host API usage by @drprajap in https://github.com/ByteDance-Seed/Triton-distributed/pull/68
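The `.acc::f32` entry above is about accumulator precision: summing half-precision inputs in a wider accumulator avoids losing small contributions at every step. The sketch below is only a CPU emulation of that effect using Python's `struct` module to round to FP16 and FP32. It is not GPU code and does not use any Triton-distributed API; the helper names and the 8-rank input are assumptions made for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def reduce_fp16_acc(values):
    """Sum FP16 inputs, rounding the running sum to FP16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def reduce_fp32_acc(values):
    """Sum FP16 inputs in an FP32 accumulator, rounding to FP16 once at the end."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + to_fp16(v))
    return to_fp16(acc)


# Hypothetical input: one contribution per rank on an 8-GPU node.
# The exact sum is 2051.5; FP16 values near 2048 are spaced 2 apart.
per_rank = [2048.0] + [0.5] * 7

fp16_result = reduce_fp16_acc(per_rank)  # every 0.5 is rounded away
fp32_result = reduce_fp32_acc(per_rank)  # small terms survive until the final rounding

print(fp16_result, fp32_result)  # prints: 2048.0 2052.0
```

With the FP16 accumulator each `0.5` is lost as soon as it is added to `2048.0`, while the FP32 accumulator keeps the exact running sum and rounds only the final result.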
Known Issue
- AMD-related support is not included in the wheels. If you want to try AMD, build it from source yourself.
Full Changelog: https://github.com/ByteDance-Seed/Triton-distributed/compare/experimental...v0.0.1-rc
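For readers unfamiliar with the One-Shot and Two-Shot AllReduce variants named in the opt entries above, the following is a conceptual, single-process sketch of how those terms are commonly used: one-shot has every rank read all peers' full buffers and reduce locally, while two-shot performs a reduce-scatter followed by an all-gather. The release notes do not describe the actual kernels, so the function names and the list-based "ranks" here are hypothetical and say nothing about how Triton-distributed implements either variant.

```python
def one_shot_all_reduce(buffers):
    """Each rank reads every peer's full buffer and sums it locally (one step)."""
    world = len(buffers)
    n = len(buffers[0])
    return [
        [sum(buffers[peer][i] for peer in range(world)) for i in range(n)]
        for _rank in range(world)
    ]


def two_shot_all_reduce(buffers):
    """Reduce-scatter, then all-gather (two steps, less data read per rank)."""
    world = len(buffers)
    n = len(buffers[0])
    assert n % world == 0, "sketch assumes the buffer splits evenly across ranks"
    chunk = n // world

    # Shot 1 (reduce-scatter): rank r reduces only chunk r across all peers.
    reduced = [
        [
            sum(buffers[peer][i] for peer in range(world))
            for i in range(r * chunk, (r + 1) * chunk)
        ]
        for r in range(world)
    ]

    # Shot 2 (all-gather): every rank collects all reduced chunks.
    full = [x for part in reduced for x in part]
    return [list(full) for _rank in range(world)]


# Four simulated ranks, four elements each: rank r holds [10r, 10r+1, 10r+2, 10r+3].
buffers = [[rank * 10 + i for i in range(4)] for rank in range(4)]
expected = [60, 64, 68, 72]

one_shot = one_shot_all_reduce(buffers)
two_shot = two_shot_all_reduce(buffers)
```

Both variants produce the same result on every rank. In this model one-shot touches the full buffer of every peer in a single step, which tends to suit very small messages, whereas two-shot reads only one chunk per peer in the first step at the cost of a second step.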
Notability
5.0/10. Notable library release from a major company, but early stage and no traction data.