meituan-longcat/LongCat-Flash-Prover
Captured source
source ↗LongCat-Flash-Prover
Tech Report 📄
Introduction
We introduce LongCat-Flash-Prover, a flagship $560$-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues.
Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1\% pass rate on MiniF2F-Test using only 72 inferences per problem. On more challenging benchmarks, it solves 70.8\% of ProverBench and 41.5\% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
Key Features
🌟 Native formal reasoning
We define native formal reasoning as a core capability of LLMs, analogous to native multimodal and native tool calls. This paradigm enables the model to leverage formal operators to solve complex reasoning tasks without specialized architectural modifications. We decompose the native formal reasoning into three specific capabilities: 1) Agentic auto-formalization aims to transform the informal statement into a verified formal statement; 2) Agentic sketching aims to generate a lemma-style sketch based upon the given problem and corresponding formal statement; 3) Agentic proving aims to generate a whole-proof that completes the target theorem body, or to generate a lemma-style proof that introduces helper lemmas and finally proves the target theorem. These capabilities are further enhanced through a TIR strategy, where all experts can interact directly with the Lean4 tools for compilation and verification.
🌟 Hybrid-experts iteration framework
To facilitate native formal reasoning, we developed a framework to generate high-quality cold-start data. This framework employs several optimized expert models, each specialized in distinct domains such as auto-formalization, lemma-style sketching, and proving. We utilize this framework to synthesize a series of trajectories centered on native formal operators, using multiple verifiable formal tools as environmental feedback. By doing so, each expert is iteratively refined on these tool-assisted reasoning trajectories, emulating the human process of learning through trial, verification, and reflection.
🌟 Hierarchical Importance Sampling Policy Optimization (HisPO).
Following our prior works, we perform agentic reinforcement learning with verified reward (RLVR) by designing different tasks, including generating a formal statement based on a given informal problem, producing a proof directly from the statement, or a lemma-style sketch. To make the MoE model training stable, we introduce HisPO, which is a hierarchical clipping strategy that eliminates the gradient contributions who has large training-inference engine discrepancy by estimating sequence-wise or token-wise important sampling (IS) ratios. In addition to outcome-based rewards, we designed a legality detection strategy to explore the proof with obvious hacking features, for example, the proof that is inconsistent with the semantics of the formal statement, mismatching the pre-defined theorem conditions, containing unverified or model-created axioms that attempt to fool the Lean4 server, etc.
Evaluation Results
Auto-Formalization
Auto-formalization performance (Pass@8 metric, %) of different reasoning and specific auto-formalizer models across multiple benchmarks. Best in bold, second best in underlined.
Theorem Proving
Theorem-proving performance (Pass@32 metric, %) of different reasoning and specific prover models across multiple benchmarks. Best in bold, second best in underlined. † indicates the score is from external reports.
Theorem-proving performance (with different larger budgets, %) of different specific prover models across multiple benchmarks. Best in bold, second best in underlined. Each element a / b denotes to the accuracy a with limited budget b (i.e., Pass@b). “UNK” means unknown of the specific budget. † indicates the score is from external reports. Because different models may have different budget calculations, we directly extract the results from the report instead of conducting our own evaluations. Therefore, some benchmark results may not be available.
General Reasoning
Performance (%) comparison across multiple general reasoning benchmarks. Best in bold. The result indicates that our LongCat-Flash-Prover can retain the general reasoning ability.
Quick Start
Chat Template Overview
To support advanced tool-use scenarios and sophisticated reasoning paradigms, we have introduced significant updates to our chat template, as defined in the tokenizer_config.json file.
Basic Usage
The chat template can be applied using the ``apply_chat_template`` method. Below is a standard implementation:
text = tokenizer.apply_chat_template( messages, tools=tools, tokenize=False, enable_thinking=True, add_generation_prompt=True, save_history_reasoning_content=False )
Key Features
- Tool Declaration: Available tools are declared at the beginning of the session to activate the model's tool-use capabilities and define the scope of available actions.
- Interleaved Thinking: By default, the template employs an interleaved thinking approach. In this mode, the final response is preserved while thinking content from previous user interactions is discarded to maintain a concise context window. Tool calls and responses are retained…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Low traction, routine release