Fork · Novita AI · published Apr 22, 2026

novitalabs/MiniMax-Provider-Verifier

forked from MiniMax-AI/MiniMax-Provider-Verifier

Description: MiniMax-Provider-Verifier offers a rigorous, vendor-agnostic way to verify whether third-party deployments of the MiniMax M2 model are correct and reliable.

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-04-22T03:59:59Z

Pushed: 2026-04-01T15:31:13Z

Default branch: main

Fork: yes

Parent repository: MiniMax-AI/MiniMax-Provider-Verifier

Archived: no

README:

MiniMax-Provider-Verifier

[English](README.md) | [中文](README_CN.md)

MiniMax-Provider-Verifier offers a rigorous, vendor-agnostic way to verify whether third-party deployments of the MiniMax M2 model are correct and reliable. Since its open-source release, M2 has been widely adopted and integrated into production services by numerous users. To ensure that this user base continues to benefit from an efficient, high-quality M2 experience, and in line with our vision of "Intelligence with Everyone", this toolkit provides an objective, reproducible standard for validating model behavior.

Evaluation Metrics

We evaluate multiple dimensions of vendor deployments, including tool-calling behavior, schema correctness, and system stability (e.g., detecting potential misconfigurations like incorrect top-k settings).

The primary metrics are:

  • Query-Success-Rate: Measures the probability that a provider eventually returns a valid response when allowed up to max_retry=10 attempts.
      • query_success_rate = successful_query_count / total_query_count
  • ToolCalls-Match-Rate: Measures how well the model's decision on whether to trigger a tool call matches the expected label. Each test case is annotated with expected_tool_call (whether a tool call is expected), and this metric is the proportion of cases where the actual result matches the expected result.
      • tool_calls_match_rate = (tool_calls_finish_tool_calls + stop_finish_stop) / success_count
      • Confusion Matrix Statistics:
          • tool_calls_finish_tool_calls: expected tool_call, actual tool_call (TP)
          • tool_calls_finish_stop: expected tool_call, actual stop (FN)
          • stop_finish_tool_calls: expected stop, actual tool_call (FP)
          • stop_finish_stop: expected stop, actual stop (TN)
  • ToolCalls-Schema-Accuracy: Measures the correctness rate of tool-call payloads (e.g., function name and arguments meeting the expected schema), conditional on a tool call being triggered.
      • schema_accuracy = tool_calls_successful_count / tool_calls_finish_tool_calls
  • Response-Success-Rate Not Only Reasoning: Detects a specific error pattern in which the model outputs only Chain-of-Thought reasoning without providing valid content or the required tool calls. The presence of this pattern strongly indicates a deployment issue.
      • response_success_rate = response_not_only_reasoning_count / only_reasoning_checked_count
  • Language-Following-Success-Rate: Checks whether the model follows language requirements in scenarios involving less common languages; this is sensitive to top-k and related decoding parameters.
      • language_following_success_rate = language_following_valid_count / language_following_checked_count
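
The ratio formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation. The function names `safe_ratio` and `compute_metrics`, the snake_case output keys, the choice to return 0.0 for an empty denominator, and the sample counters are all assumptions made for the example. The counter names are taken from the formulas above.

```python
from typing import Dict


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 if the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compute_metrics(counts: Dict[str, int]) -> Dict[str, float]:
    """Apply the README's ratio formulas to one run's raw counters."""
    return {
        "query_success_rate": safe_ratio(
            counts["successful_query_count"], counts["total_query_count"]
        ),
        # (TP + TN) / success_count: how often the trigger decision matched
        # the expected label. success_count is assumed here to be the number
        # of queries that returned a valid response.
        "tool_calls_match_rate": safe_ratio(
            counts["tool_calls_finish_tool_calls"] + counts["stop_finish_stop"],
            counts["success_count"],
        ),
        # Schema-valid tool calls among the expected-and-triggered calls (TP).
        "schema_accuracy": safe_ratio(
            counts["tool_calls_successful_count"],
            counts["tool_calls_finish_tool_calls"],
        ),
        "response_success_rate": safe_ratio(
            counts["response_not_only_reasoning_count"],
            counts["only_reasoning_checked_count"],
        ),
        "language_following_success_rate": safe_ratio(
            counts["language_following_valid_count"],
            counts["language_following_checked_count"],
        ),
    }


# Illustrative counters only; these are not results from the repository.
example_counts = {
    "total_query_count": 100,
    "successful_query_count": 100,
    "success_count": 100,
    "tool_calls_finish_tool_calls": 60,  # TP
    "tool_calls_finish_stop": 2,         # FN
    "stop_finish_tool_calls": 1,         # FP
    "stop_finish_stop": 37,              # TN
    "tool_calls_successful_count": 57,
    "response_not_only_reasoning_count": 100,
    "only_reasoning_checked_count": 100,
    "language_following_valid_count": 8,
    "language_following_checked_count": 10,
}

metrics = compute_metrics(example_counts)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Note that the match rate counts both true positives and true negatives, so a provider that never triggers tool calls can still score well on cases where a plain stop is expected. The schema accuracy is what isolates payload quality once a call has been made.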

Evaluation Results

The evaluation results below are computed using our initial release of test prompts, each executed 10 times per provider, with all metrics reported as the mean over the 10-run distribution. As a baseline, minimax represents the performance of our official MiniMax Open Platform deployment, providing a reference point for interpreting other providers' results.
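
The per-run averaging described above could look roughly like the sketch below. The helper name `average_runs` and the input shape (one dictionary of metric values per run) are assumptions for illustration. The values are made up, and two runs stand in for the ten used in the reported results.

```python
from statistics import mean
from typing import Dict, List


def average_runs(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric across runs.

    Assumes every run reports the same metric names as the first run.
    """
    if not runs:
        raise ValueError("at least one run is required")
    return {name: mean(run[name] for run in runs) for name in runs[0]}


# Illustrative values only; a real evaluation would supply ten runs.
example_runs = [
    {"query_success_rate": 1.0, "language_following_success_rate": 0.8},
    {"query_success_rate": 1.0, "language_following_success_rate": 0.9},
]

averaged = average_runs(example_runs)
for name, value in averaged.items():
    print(f"{name}: {value:.2%}")
```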

MiniMax-M2.5/M2.7 Model – April 2026 Data (After Metrics Revision)

| Metric | Query-Success-Rate | ToolCalls-Match-Rate | ToolCalls-Accuracy | Response-Success-Rate | Language-Following-Success-Rate |
|--------|--------------------|----------------------|--------------------|-----------------------|---------------------------------|
| MiniMax-M2.5 | 100% | 99.29% | 95.59% | 100% | 80% |
| MiniMax-M2.7 | 100% | 99.29% | 96.55% | 100% | 90% |

MiniMax-M2.5 Model – Feb 2026 Data

| Metric | Query-Success-Rate | Finish-ToolCalls-Rate | ToolCalls-Trigger Similarity | ToolCalls-Accuracy | Response Success Rate - Not Only Reasoning | Language-Following-Success-Rate |
|--------|--------------------|-----------------------|------------------------------|--------------------|--------------------------------------------|----------------------------------|
| minimax-m2.5 | 100% | 84.75% | - | 97.26% | 100% | 90% |
| openRouter-minimax-fp8 | 100% | 84.55% | 98.98% | 97.25% | 100% | 80% |
| openRouter-minimax-highspeed | 100% | 84.14% | 99.22% | 97.24% | 100% | 80% |
| openRouter-novita-bf16 | 100% | 84.65% | 99.05% | 97.5% | 100% | 70% |
| openRouter-siliconflow/fp8 | 100% | 84.24% | 99.28% | 98.68% | 100% | 80% |
| openRouter-atlas-cloud/fp8 | 100% | 84.75% | 99.10% | 96.18% | 100% | 70% |
| openRouter-fireworks | 96.32% | 81.63% | 98.87% | 96.19% | 100% | 80% |

MiniMax-M2.1 Model – Jan 2026 Data

| Metric | Query-Success-Rate | Finish-ToolCalls-Rate | ToolCalls-Trigger Similarity | ToolCalls-Accuracy | Response Success Rate - Not Only Reasoning | Language-Following-Success-Rate |
|--------|--------------------|-----------------------|------------------------------|--------------------|--------------------------------------------|----------------------------------|
| minimax-m2.1 | 100% | 83.33% | - | 96.61% | 100% | 90.00% |
| minimax-m2.1-vllm(without topk) | 99.90% | 81.84% | 98.78% | 96.42% | 100% | 60.00% |
| minimax-m2.1-vllm | 100% | 82.83% | 98.90% | 93.91% | 100% | 90% |
| minimax-m2.1-sglang | 100% | 83.03% | 99.15% | 95.01% | 100% | 90% |
| infini-ai | 100% | 80.61% | 97.46% | 100% | 100% | 100% |
| openRouter-minimax/fp8 | 100% | 83.23% | 99.03% | 96.11% | 100% | 90% |
| openRouter-minimax/lightning | 99.90% | 83.15% | 98.97% | 96.48% | 100% | 80% |
| openRouter-gmicloud/fp8 | 83.72% | 55.5% | 81.37% | 84.58% | 100% | 70% |
| OpenRouter-novita/fp8 | 99.32% | 83.07% | 99.21% | 96.03% | 100% | 90% |
| fireworks | 100% | 81.1% | 97.77% | 94.29% | 100% | 60% |
| siliconflow | 100% | 82.42% | 98.47% | 96.19% | 100% | 60% |

MiniMax-M2 Model – Dec 2025 Data

| Metric | Query-Success-Rate | Finish-ToolCalls-Rate | ToolCalls-Trigger Similarity | ToolCalls-Accuracy | Response Success Rate - Not Only Reasoning | Language-Following-Success-Rate |…

