PaddlePaddle/FastDeploy v2.4.0

v2.4.0

Repository: PaddlePaddle/FastDeploy

Tag: v2.4.0

Published: 2026-01-23T02:20:55Z

Prerelease: no

Release notes:

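Core inference capabilities and model support enhancements

  • Support prompt_logprob for text inputs and full logprob output #4769
  • Support ZMQ-based logprobs / prompt_logprobs in offline inference, and introduce the max_logprobs parameter #4897
  • Support ZMQ-based logprobs / prompt_logprobs in online inference, and optimize the communication path #5089
  • Add a switch to control token_id decoding for logprobs / prompt_logprobs #5463
  • Add an llguidance backend for constrained decoding #5124
  • CUDAGraph supports accelerating the speculative decoding Draft Model (disabled by default)
  • [Speculative Decoding] Decouple the draft_tokens post-processing flow #5205
  • Support a Pooling model Runner
  • Support Reward models
  • General embedding interface for Pooling models #4344
  • Custom reward interface for Pooling models #4518
  • Add a reasoning_parser for the open-source model Ernie-4.5-VL-28B-A3B-Thinking, compatible with both the - and _ naming conventions #4571 #4668
  • Support toggling thinking via chat_template_kwargs.options.thinking_mode
  • Support multimodal model requests that pass prompt_token_ids, with multimodal data supplied through messages, enabling tokens-in / tokens-out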
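
For readers unfamiliar with the two quantities named in the logprob items above, here is a minimal, framework-free sketch of what logprobs and prompt_logprobs usually mean. It is not FastDeploy code: the function names are hypothetical, and how FastDeploy applies max_logprobs is not described in these notes, so the sketch just takes a plain `k`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_logprobs(logits, k):
    """Return the k most likely (token_id, logprob) pairs, best first."""
    lp = log_softmax(logits)
    ranked = sorted(range(len(lp)), key=lambda i: lp[i], reverse=True)
    return [(i, lp[i]) for i in ranked[:k]]


def prompt_logprobs(step_logits, prompt_ids):
    """Log-probability of each prompt token given the tokens before it.

    step_logits[t] holds the logits produced after consuming
    prompt_ids[:t + 1], so it scores prompt_ids[t + 1]. The first prompt
    token has no preceding context and therefore gets None.
    """
    out = [None]
    for t in range(len(prompt_ids) - 1):
        out.append(log_softmax(step_logits[t])[prompt_ids[t + 1]])
    return out
```

Sampling-time logprobs come from the logits of each generated step, while prompt logprobs reuse the logits already computed during prefill, which is why the two are often exposed as separate options.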

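Parallel architecture, scheduling, and MoE capability evolution

  • Eliminate communication overhead during EP dummy (empty) runs for GLM / Qwen models #5254
  • Support chunked MoE execution #4575
  • Support EPLB (Expert Load Balancing) #4782
  • Support EPLB rearrangement and redundant-expert strategies #5142 #5143 #5178 #5239 #5918
  • Support a routing replay mechanism
  • PD disaggregation supports EP-parallel deployment of the Deepseek V3 model #5251
  • PD disaggregation supports EP-parallel deployment of the Qwen3-MoE model #4691
  • PD disaggregation supports different TP sizes for Prefill and Decode #5296
  • Add a Python Router that supports scheduling for both centralized and disaggregated deployments #4709
  • Support multi-step MTP + CUDAGraph + PD disaggregation
  • Support lossless MTP verification
  • Support chunked MTP #5343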
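
The EPLB items above concern rearranging experts and adding redundant copies of hot experts. The sketch below illustrates that general idea with a simple greedy heuristic. It is an assumption-laden illustration, not the algorithm from the referenced PRs: the function names, the "load per replica" criterion, and the equal-slots-per-GPU constraint are all hypothetical.

```python
import heapq


def replicate_experts(loads, num_redundant):
    """Start with one replica per expert, then give each extra replica to
    whichever expert currently has the highest load per replica."""
    replicas = [1] * len(loads)
    for _ in range(num_redundant):
        hottest = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hottest] += 1
    return replicas


def place_replicas(loads, replicas, num_gpus):
    """Greedy placement: heaviest replica first, onto the least-loaded GPU
    that still has a free slot. Returns one list of expert ids per GPU."""
    total = sum(replicas)
    if total % num_gpus != 0:
        raise ValueError("sketch assumes the same number of slots on every GPU")
    slots = total // num_gpus
    items = sorted(
        ((loads[e] / replicas[e], e)
         for e in range(len(loads)) for _ in range(replicas[e])),
        reverse=True,
    )
    heap = [(0.0, g) for g in range(num_gpus)]  # (accumulated load, gpu id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_gpus)]
    for load, expert in items:
        gpu_load, g = heapq.heappop(heap)
        placement[g].append(expert)
        if len(placement[g]) < slots:  # a full GPU leaves the heap for good
            heapq.heappush(heap, (gpu_load + load, g))
    return placement
```

This sketch may put two replicas of the same expert on one GPU, which a production balancer would normally avoid. It also ignores node topology and the cost of moving weights during rearrangement.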

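Multimodal, caching, and quantization enhancements

  • Support mixed Prefill scheduling of single-batch multimodal requests and multi-batch text-only requests #4611
  • Support multimodal Prefix Cache #4803
  • Dynamic quantization supports Prefix Cache #5125
  • Fix and support enabling multimodal Prefix Cache together with CUDAGraph #4679
  • Support W4AFP8 dynamic quantization #5282
  • Support loading static C8 scales separately #4624
  • Improve Machete support for different quantization group sizes #4911
  • Support integrating the Flash Mask Attention backend #5104 #5134 #5387
  • Optimize v1 Loader loading performance #4532
  • Support precompiled packages #4729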
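
Several items above extend Prefix Cache coverage. As background, the following is a minimal sketch of how block-level prefix caching is commonly keyed: each full block of tokens is hashed together with its parent block's hash, so a key identifies the entire prefix up to that block. The class, the block size, and the string handles are hypothetical and text-only. FastDeploy's actual cache layout, and how it keys multimodal inputs, are not described in these notes.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV-cache block; illustrative value only


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash every full block so each key identifies the whole prefix."""
    keys, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys


class PrefixCache:
    """Toy cache mapping prefix-block keys to opaque KV block handles."""

    def __init__(self):
        self._blocks = {}

    def insert(self, token_ids):
        """Register every full block of a finished prefill."""
        for key in block_keys(token_ids):
            self._blocks.setdefault(key, f"kv-block-{len(self._blocks)}")

    def match(self, token_ids):
        """Return how many leading tokens already have cached KV blocks."""
        hit = 0
        for key in block_keys(token_ids):
            if key not in self._blocks:
                break
            hit += BLOCK_SIZE
        return hit
```

A scheduler using such a cache would skip prefill for the first `match(...)` tokens of a new request and compute only the remainder.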

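Expanded multi-hardware platform support

P800

  • Support multimodal Prefix Cache #5356
  • Support PD disaggregation #5179
  • Support limiting thinking intensity for thinking models #4761
  • Support TP + EP parallelism #4688 #4836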

Intel HPU

  • Add Prefix Caching support #4971
  • Add Chunked Prefill support #5289

Iluvatar GPU

  • Support ERNIE-4.5-21B-A3B and ERNIE-4.5-VL-28B-A3B-Thinking #4774 #4995
  • Fix multiple CI issues #4972 #5012 #5100

MetaX

  • Support ERNIE-4.5-VL-28B #4820
  • Add Cutlass MoE #4602 #4685 #5128
  • Support the default_v1 loader #4956 #5001
  • Optimize Flash MLA performance #4915
  • Add the default_v1 loader and quant_config for Triton MoE #5030
  • Support ENABLE_V1_KVCACHE_SCHEDULER #5163

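Performance optimization, observability, and stability fixes

Performance and communication optimization

  • AppendAttn operator supports CUDA-PDL #5072
  • Eliminate H2D transfers in DeepGemm #5262
  • Optimize centralized EP communication logic #5145
  • Remove the DtoH synchronization overhead of Append Attention under CUDA Graph
  • Support two-stage low-latency communication #4162
  • Support TP + EP hybrid parallelism #4615 #5315 #5353
  • Compile RDMA by default; reduce multimodal CUDAGraph overhead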

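Observability and security

  • Support fine-grained, request-level tracing #5458
  • Add automatic trace_id / span_id injection, with an on/off switch #4692 #5765
  • Add the --api-key parameter for permission checks #4806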
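
To make "automatic trace_id / span_id injection" concrete, here is a minimal sketch of one common way to attach such IDs to log records in Python, using `contextvars` and a `logging.Filter`. Everything in it is an assumption for illustration: the names `start_span`, `TraceContextFilter`, and `INJECT_TRACE_IDS` are hypothetical, and the ID widths follow the W3C Trace Context convention rather than anything stated in these notes.

```python
import contextvars
import logging
import secrets

_trace_id = contextvars.ContextVar("trace_id", default="-")
_span_id = contextvars.ContextVar("span_id", default="-")

INJECT_TRACE_IDS = True  # hypothetical on/off switch


def start_span(trace_id=None):
    """Begin a span: reuse the caller's trace_id or mint a new one."""
    tid = trace_id or secrets.token_hex(16)  # 32 hex characters
    sid = secrets.token_hex(8)               # 16 hex characters
    _trace_id.set(tid)
    _span_id.set(sid)
    return tid, sid


class TraceContextFilter(logging.Filter):
    """Copy the current trace/span IDs onto every log record."""

    def filter(self, record):
        record.trace_id = _trace_id.get() if INJECT_TRACE_IDS else "-"
        record.span_id = _span_id.get() if INJECT_TRACE_IDS else "-"
        return True  # never drop the record, only annotate it
```

With the filter attached to a handler, a formatter such as `logging.Formatter("%(trace_id)s %(span_id)s %(message)s")` prints the IDs on every line logged while a request is being handled.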

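Stability and bug fixes

  • Fix issues in logprob / prompt_logprob computation, serialization, and communication #4681 #4884 #5237 #5335
  • Fix stability issues across multiple inference scenarios, including EP, PD disaggregation, MTP, Prefix Cache, quantization, and multimodal
  • Fix operator and parameter-validation issues on multiple hardware platforms (XPU / MetaX / Iluvatar / P800)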

What's Changed

  • [BugFix] fix total_block_num init error in worker_process by @RichardWooSJTU in https://github.com/PaddlePaddle/FastDeploy/pull/4553
  • [BugFix] Fix graph opt test case by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4634
  • [Feature] add mm token usage by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4570
  • [XPU] Update the return value of TextImageGatherScatter by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4636
  • [Docs] Add PaddleOCR-VL-0.9B best practices by @ming1753 in https://github.com/PaddlePaddle/FastDeploy/pull/4658
  • [XPU] fix pos_emb_type bug by @cqulilujia in https://github.com/PaddlePaddle/FastDeploy/pull/4638
  • [Docs] add Qwen25vl yaml by @xjkmfa in https://github.com/PaddlePaddle/FastDeploy/pull/4662
  • [Feature] add a new reasoning parser by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4571
  • [XPU] [CI] Increase pytest timeout for XPU ep test by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4665
  • add noaux_tc to unitest fused_moe by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4656
  • [EP] fix several bugs in data parallel by @ltd0924 in https://github.com/PaddlePaddle/FastDeploy/pull/4657
  • [OP] Add InferShape&InferDtype for per_token_quant_padding by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4667
  • 【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation by @ccsuzzh in https://github.com/PaddlePaddle/FastDeploy/pull/4592
  • [UT] Add ut for speculative sampler by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4650
  • [Doc] update docs by @ApplEOFDiscord in https://github.com/PaddlePaddle/FastDeploy/pull/4675
  • [Graph Optimization] Add the CUDAGraph usage switch for Draft Model by @gongshaotian in https://github.com/PaddlePaddle/FastDeploy/pull/4601
  • [CI] Add test for paddleocr_vl by @Limerances in https://github.com/PaddlePaddle/FastDeploy/pull/4627
  • [unitest]add real gate_correction_bias weight to mock real data dispatch by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4676
  • [noauxtc_kernel] remove useless code by @zhoutianzi666 in https://github.com/PaddlePaddle/FastDeploy/pull/4643
  • [BugFix] fix offline llm chat "enable_thinking" is always "False" by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4686
  • [BugFix] fix total_block_num init error in worker_process and test_async_llm not throw error by @xyxinyang in https://github.com/PaddlePaddle/FastDeploy/pull/4687
  • [BugFix] fix --logprobs-mode raw_logits by @ckl117 in https://github.com/PaddlePaddle/FastDeploy/pull/4681
  • [XPU] xpu currently disable prefix cache for VL model by @ddchenhao66 in https://github.com/PaddlePaddle/FastDeploy/pull/4695
  • [XPU] [CI] Add Vl case by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4649
  • [BugFix] Fix finish reason in _create_chat_completion_choice by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4582
  • [Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” by @kxz2002 in https://github.com/PaddlePaddle/FastDeploy/pull/4668
  • [BugFix] fix unittest of get_save_output_v1 by @Wanglongzhi2001 in https://github.com/PaddlePaddle/FastDeploy/pull/4701
  • [XPU] [CI] Lock xvllm version by @plusNew001 in https://github.com/PaddlePaddle/FastDeploy/pull/4715
  • [Graph Optimization] SOT+CUDAGraph support ERNIE4.5T VL 28B / 424B by @DrRyanHuang in https://github.com/PaddlePaddle/FastDeploy/pull/4645
  • [Feature] support mtp distribution equivalence verification by @Deleter-D in https://github.com/PaddlePaddle/FastDeploy/pull/4699
  • [KVCache] Support kv cache scale load by @Sunny-bot1 in https://github.com/PaddlePaddle/FastDeploy/pull/4624

*...
