Repo · Moonshot AI (Kimi) · published Sep 9, 2025

MoonshotAI/K2-Vendor-Verifier

Python


Description: Verify Precision of all Kimi K2 API Vendor

Language: Python

Stars: 575

Forks: 35

Open issues: 11

Created: 2025-09-09T11:37:33Z

Pushed: 2026-02-14T05:28:46Z

Default branch: main

Fork: no

Archived: no

README:

K2 Vendor Verifier

We've updated the evaluation approach for kimi-vendor-verifier. Click here for more details.

What's K2VV

Since the release of the Kimi K2 model, we have received a great deal of feedback on the accuracy of Kimi K2's tool calls. Because K2 focuses on the agentic loop, the reliability of tool calls is of utmost importance.

We have observed significant differences in tool-call performance across open-source solutions and vendors. When selecting a provider, users often prioritize lower latency and cost, but may inadvertently overlook more subtle yet critical differences in model accuracy.

These inconsistencies affect user experience and also distort K2's results on various benchmarks. To mitigate these problems, we are launching K2 Vendor Verifier to monitor and improve the quality of all K2 APIs.

We hope K2VV can help ensure that everyone has access to a consistent, high-performing Kimi K2 model.

K2-thinking Evaluation Results

Test Time: 2025-11-15

  • temperature=1.0
  • max_tokens=64000

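The last three columns together form the ToolCall-Schema Accuracy group.

| Model Name | Provider | Api Source | ToolCall-Trigger Similarity | count_finish_reason_tool_calls | count_successful_tool_call | schema_accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| kimi-k2-thinking | MoonshotAI | https://platform.moonshot.ai | - | 1958 | 1958 | 100.00% |
| kimi-k2-thinking | Moonshot AI Turbo | https://platform.moonshot.ai | >=73% | 1984 | 1984 | 100.00% |
| kimi-k2-thinking | Fireworks | https://fireworks.ai | >=73% | 1703 | 1703 | 100.00% |
| kimi-k2-thinking | InfiniAI | https://cloud.infini-ai.com | >=73% | 1827 | 1825 | 99.89% |
| kimi-k2-thinking | SiliconFlow | https://siliconflow.cn | >=73% | 2119 | 2097 | 98.96% |
| kimi-k2-thinking | GMICloud | https://openrouter.ai | >=73% | 1850 | 1775 | 95.95% |
| kimi-k2-thinking | AtlasCloud | https://openrouter.ai | >=73% | 1878 | 1798 | 95.74% |
| kimi-k2-thinking | SGLang | https://github.com/sgl-project/sglang | >=73% | 1874 | 1790 | 95.52% |
| kimi-k2-thinking | vLLM | https://github.com/vllm-project/vllm | >=73% | 2128 | 1856 | 87.22% |
| kimi-k2-thinking | Parasail | https://openrouter.ai | >=73% | 2108 | 1837 | 87.14% |
| kimi-k2-thinking | DeepInfra | https://openrouter.ai | >=73% | 2071 | 1800 | 86.91% |
| kimi-k2-thinking | GoogleVertex | https://openrouter.ai | >=73% | 1945 | 1668 | 85.76% |
| kimi-k2-thinking | Together | https://openrouter.ai | >=73% | 1893 | 1602 | 84.63% |
| kimi-k2-thinking | NovitaAI | https://openrouter.ai | 72.22% | 1778 | 1715 | 96.46% |
| kimi-k2-thinking | Chutes | https://openrouter.ai | 68.10% | 3657 | 3037 | 83.05% |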

We ran the official API multiple times to measure the fluctuation of tool_call_f1. The lowest score was 75.81%, and the average was 76%. Given the inherent randomness of the model, we consider a tool_call_f1 score above 73% acceptable and usable as a reference.

K2 0905 Evaluation Results

Test Time: 2025-11-15

  • temperature=0.6

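The last three columns together form the ToolCall-Schema Accuracy group.

| Model Name | Provider | Api Source | ToolCall-Trigger Similarity | count_finish_reason_tool_calls | count_successful_tool_call | schema_accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| kimi-k2-0905-preview | MoonshotAI | https://platform.moonshot.ai | - | 1274 | 1274 | 100.00% |
| kimi-k2-0905-preview | Moonshot AI Turbo | https://platform.moonshot.ai | >=80% | 1398 | 1398 | 100.00% |
| kimi-k2-0905-preview | DeepInfra | https://openrouter.ai | >=80% | 1365 | 1365 | 100.00% |
| kimi-k2-0905-preview | Fireworks | https://openrouter.ai | >=80% | 1453 | 1453 | 100.00% |
| kimi-k2-0905-preview | Infinigence | https://cloud.infini-ai.com | >=80% | 1257 | 1257 | 100.00% |
| kimi-k2-0905-preview | NovitaAI | https://openrouter.ai | >=80% | 1299 | 1299 | 100.00% |
| kimi-k2-0905-preview | SiliconFlow | https://siliconflow.cn | >=80% | 1305 | 1302 | 99.77% |
| kimi-k2-0905-preview | Chutes | https://openrouter.ai | >=80% | 1271 | 1229 | 96.70% |
| kimi-k2-0905-preview | vLLM | https://github.com/vllm-project/vllm | >=80% | 1325 | 1007 | 76.00% |
| kimi-k2-0905-preview | SGLang | https://github.com/sgl-project/sglang | >=80% | 1269 | 928 | 73.13% |
| kimi-k2-0905-preview | Volc | https://www.volcengine.com | >=80% | 1330 | 969 | 72.86% |
| kimi-k2-0905-preview | Baseten | https://openrouter.ai | >=80% | 1243 | 901 | 72.49% |
| kimi-k2-0905-preview | AtlasCloud | https://openrouter.ai | >=80% | 1277 | 925 | 72.44% |
| kimi-k2-0905-preview | Together | https://openrouter.ai | >=80% | 1266 | 911 | 71.96% |
| kimi-k2-0905-preview | Groq | https://groq.com | 69.52% | 1042 | 1042 | 100.00% |
| kimi-k2-0905-preview | Nebius | https://nebius.ai | 50.60% | 644 | 544 | 84.47% |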

We ran the official API multiple times to measure the fluctuation of tool_call_f1. The lowest score was 82.71%, and the average was 84%. Given the inherent randomness of the model, we consider a tool_call_f1 score above 80% acceptable and usable as a reference.
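The two acceptance thresholds quoted in the notes above (73% for kimi-k2-thinking, 80% for kimi-k2-0905-preview) can be applied mechanically to the reported scores. The following is a minimal sketch, not code from the repository: the `THRESHOLDS` mapping and the `is_acceptable` helper are hypothetical names, and the comparison uses `>=` to match the `>=73%` and `>=80%` entries in the tables.

```python
# Hypothetical helper: thresholds are taken from the notes under each results table.
THRESHOLDS = {
    "kimi-k2-thinking": 0.73,
    "kimi-k2-0905-preview": 0.80,
}


def is_acceptable(model: str, tool_call_f1: float) -> bool:
    """Return True when a tool_call_f1 score meets the model's reference threshold."""
    return tool_call_f1 >= THRESHOLDS[model]


# Scores copied from the tables and notes above.
reported = [
    ("kimi-k2-thinking", "official API (lowest run)", 0.7581),
    ("kimi-k2-thinking", "NovitaAI", 0.7222),
    ("kimi-k2-thinking", "Chutes", 0.6810),
    ("kimi-k2-0905-preview", "official API (lowest run)", 0.8271),
    ("kimi-k2-0905-preview", "Groq", 0.6952),
    ("kimi-k2-0905-preview", "Nebius", 0.5060),
]

below_threshold = [
    provider for model, provider, score in reported if not is_acceptable(model, score)
]
print(below_threshold)
```

Only the providers listed with an explicit percentage fall below the reference line; the lowest official runs stay above it, which is consistent with how the thresholds were chosen.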

Evaluation Metrics

ToolCall-Trigger Similarity

We use tool_call_f1 to determine whether the model deployment is correct.

| Label / Metric | Formula | Meaning |
| --- | --- | --- |
| TP (True Positive) | - | Both model & official have finish_reason == "tool_calls". |
| FP (False Positive) | - | Model finish_reason == "tool_calls" while official is "stop" or "others". |
| FN (False Negative) | - | Model finish_reason == "stop" or "others" while official is "tool_calls". |
| TN (True Negative) | - | Both model & official have finish_reason == "stop" or "others". |
| tool_call_precision | TP / (TP + FP) | Proportion of triggered tool calls that should have been triggered. |
| tool_call_recall | TP / (TP + FN) | Proportion of tool calls that should have been triggered and were. |
| **`tool_call_f1`** | `2 * tool_call_precision * tool_call_recall / (tool_call_precision + tool_call_recall)` | **Harmonic mean of precision and recall (primary metric for deployment check).** |
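The definitions in the table translate directly into a few lines of Python. This is an illustrative sketch rather than the verifier's actual implementation: the function name `tool_call_f1` and the input shape (two equally long lists of `finish_reason` strings, one per request) are assumptions made for the example.

```python
from collections import Counter

TOOL_CALLS = "tool_calls"


def tool_call_f1(vendor_reasons, official_reasons):
    """Return (precision, recall, f1) for paired finish_reason values.

    Assumption: index i in both lists refers to the same request.
    """
    if len(vendor_reasons) != len(official_reasons):
        raise ValueError("both lists must cover the same requests")

    counts = Counter()
    for vendor, official in zip(vendor_reasons, official_reasons):
        vendor_called = vendor == TOOL_CALLS
        official_called = official == TOOL_CALLS
        if vendor_called and official_called:
            counts["tp"] += 1
        elif vendor_called:
            counts["fp"] += 1
        elif official_called:
            counts["fn"] += 1
        else:
            counts["tn"] += 1

    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: 2 TP, 1 FP, 1 FN, 1 TN.
vendor = ["tool_calls", "tool_calls", "stop", "stop", "tool_calls"]
official = ["tool_calls", "stop", "tool_calls", "stop", "tool_calls"]
precision, recall, f1 = tool_call_f1(vendor, official)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

True negatives are counted for completeness but do not enter the F1 score, so a vendor is neither rewarded nor penalized for requests where both sides stop without a tool call.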

ToolCall-Schema Accuracy

We use schema_accuracy to measure the robustness of the engineering.

| Label / Metric | Formula / Condition | Description |
| --- | --- | --- |
| count_finish_reason_tool_calls | - | Number of responses with finish_reason == "tool_calls". |
| count_successful_tool_call | - | Number of tool_calls responses that passed schema validation. |
| `schema_accuracy` | `count_successful_tool_call / count_finish_reason_tool_calls` | Proportion of triggered tool calls whose JSON payload satisfies the schema. |
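As a rough illustration of how these three numbers relate, the sketch below counts triggered responses and checks each tool call's arguments. It is a simplified stand-in, not the verifier's validation logic: `passes_schema` only checks that the arguments parse as a JSON object, that required keys are present, and that basic property types match, whereas a full JSON Schema validator covers far more. The response shape (`finish_reason`, `tool_calls`, `name`, `arguments`) and the `get_weather` tool are hypothetical.

```python
import json

_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def passes_schema(arguments, schema):
    """Simplified check: valid JSON object, required keys present, basic types match."""
    try:
        payload = json.loads(arguments)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(payload, dict):
        return False
    if any(key not in payload for key in schema.get("required", [])):
        return False
    for key, value in payload.items():
        expected = schema.get("properties", {}).get(key, {}).get("type")
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if expected in ("integer", "number") and isinstance(value, bool):
            return False
        if expected in _JSON_TYPES and not isinstance(value, _JSON_TYPES[expected]):
            return False
    return True


def schema_accuracy(responses, tool_schemas):
    """Return (count_finish_reason_tool_calls, count_successful_tool_call, schema_accuracy)."""
    triggered = [r for r in responses if r["finish_reason"] == "tool_calls"]
    successful = sum(
        bool(r["tool_calls"])
        and all(
            call["name"] in tool_schemas
            and passes_schema(call["arguments"], tool_schemas[call["name"]])
            for call in r["tool_calls"]
        )
        for r in triggered
    )
    accuracy = successful / len(triggered) if triggered else 0.0
    return len(triggered), successful, accuracy


# Hypothetical tool definition and responses for illustration only.
schemas = {
    "get_weather": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    }
}
responses = [
    {"finish_reason": "tool_calls",
     "tool_calls": [{"name": "get_weather", "arguments": '{"city": "Paris"}'}]},
    {"finish_reason": "tool_calls",
     "tool_calls": [{"name": "get_weather", "arguments": '{"city": '}]},  # truncated JSON
    {"finish_reason": "stop", "tool_calls": []},
    {"finish_reason": "tool_calls",
     "tool_calls": [{"name": "get_weather", "arguments": '{"town": "Oslo"}'}]},  # missing key
]
result = schema_accuracy(responses, schemas)
print(result)
```

With real counts the same ratio reproduces the tables: for example, 1825 successful calls out of 1827 triggered gives 99.89%.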

How we do the test

We test tool-call responses over a set of 4,000 requests. Each provider's responses are collected and compared against those of the official Moonshot AI API.
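Comparing a provider against the official API requires lining up the two sets of responses request by request. The sketch below shows one way this could be done, under stated assumptions: results are stored as JSON Lines, and each record carries a `request_id` and a `finish_reason` field. Both the storage format and the field names are hypothetical and are not taken from the repository.

```python
import json


def load_finish_reasons(jsonl_text):
    """Map each (hypothetical) request_id to its finish_reason, one JSON object per line."""
    reasons = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            reasons[record["request_id"]] = record["finish_reason"]
    return reasons


def pair_with_official(vendor, official):
    """Return (vendor, official) finish_reason pairs plus the request ids the vendor lacks."""
    shared = sorted(vendor.keys() & official.keys())
    missing = sorted(official.keys() - vendor.keys())
    return [(vendor[i], official[i]) for i in shared], missing


# In-memory stand-ins for two result files.
official_jsonl = """
{"request_id": "r1", "finish_reason": "tool_calls"}
{"request_id": "r2", "finish_reason": "stop"}
{"request_id": "r3", "finish_reason": "tool_calls"}
"""
vendor_jsonl = """
{"request_id": "r1", "finish_reason": "tool_calls"}
{"request_id": "r2", "finish_reason": "tool_calls"}
"""

pairs, missing = pair_with_official(
    load_finish_reasons(vendor_jsonl), load_finish_reasons(official_jsonl)
)
print(pairs)
print(missing)
```

Reporting the missing ids separately keeps failed or dropped requests visible instead of silently shrinking the comparison set.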

K2 vendors are periodically evaluated. If you are not on the list and would like to be included, feel free to contact us.

Sample Data: Detailed samples and MoonshotAI results are available in tool-calls-dataset (50% of the test set).

Suggestions to Vendors

1. Use the Correct Versions

Some vendors may not meet the requirements because they are running incorrect versions. We recommend the following versions or newer:

Excerpt shown — open the source for the full document.
