Fork · Novita AI · published Apr 20, 2026

novitalabs/Kimi-Vendor-Verifier

forked from MoonshotAI/Kimi-Vendor-Verifier

Description: Kimi-Vendor-Verifier

Language: Python

License: MIT

Stars: 0

Forks: 0

Open issues: 1

Created: 2026-04-20T07:50:31Z

Pushed: 2026-06-08T08:38:49Z

Default branch: main

Fork: yes

Parent repository: MoonshotAI/Kimi-Vendor-Verifier

Archived: no

README:

Kimi Vendor Verifier

English | [中文](README_zh.md)

A model evaluation tool built on the inspect-ai framework for benchmarking Kimi models.

Supported Benchmarks

| Benchmark | Description | Dataset |
|-----------|-------------|---------|
| AIME 2025 | American Invitational Mathematics Examination | math-ai/aime25 |
| MMMU Pro Vision | Multimodal understanding (vision, 10-way multiple choice) | MMMU/MMMU_Pro |
| OCRBench | OCR text recognition | echo840/OCRBench |

Required Parameters

| Benchmark | Mode | Temperature | TopP | Max Tokens | Epochs |
|-----------|------|-------------|------|------------|--------|
| OCRBench | Non-Thinking | 0.6 | 0.95 | 8192 | 1 |
| OCRBench | Thinking | 1.0 | 0.95 | 16384 | 1 |
| MMMU | Non-Thinking | 0.6 | 0.95 | 16384 | 1 |
| MMMU | Thinking | 1.0 | 0.95 | 65536 | 1 |
| AIME 2025 | Non-Thinking | 0.6 | 0.95 | 16384 | 32 |
| AIME 2025 | Thinking | 1.0 | 0.95 | 98304 | 32 |
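The required parameters can also be kept as a small lookup table, which makes it easy to check a planned run before launching it. The sketch below is illustrative only: `SamplingParams`, `REQUIRED_PARAMS`, and `required_params` are hypothetical names that are not part of this repository, and the benchmark keys reuse the task names accepted by eval.py.

```python
from typing import NamedTuple


class SamplingParams(NamedTuple):
    temperature: float
    top_p: float
    max_tokens: int
    epochs: int


# Values copied from the Required Parameters table.
# Keys are (benchmark task name, thinking enabled).
REQUIRED_PARAMS = {
    ("ocrbench", False): SamplingParams(0.6, 0.95, 8192, 1),
    ("ocrbench", True): SamplingParams(1.0, 0.95, 16384, 1),
    ("mmmu", False): SamplingParams(0.6, 0.95, 16384, 1),
    ("mmmu", True): SamplingParams(1.0, 0.95, 65536, 1),
    ("aime2025", False): SamplingParams(0.6, 0.95, 16384, 32),
    ("aime2025", True): SamplingParams(1.0, 0.95, 98304, 32),
}


def required_params(benchmark: str, thinking: bool) -> SamplingParams:
    """Return the required sampling parameters for a benchmark and mode."""
    key = (benchmark.lower(), thinking)
    if key not in REQUIRED_PARAMS:
        raise KeyError(f"no required parameters recorded for {benchmark!r}")
    return REQUIRED_PARAMS[key]
```

For example, `required_params("aime2025", True).max_tokens` gives the value to pass as --max-tokens for a thinking-mode AIME run.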

Setup

1. Install Dependencies

uv sync && uv pip install -e .

2. Configure Environment

export KIMI_API_KEY="your-api-key"
export KIMI_BASE_URL="your-base-url"

Or copy .env.example to .env and fill in the values.
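A quick way to confirm the configuration before running anything is to check that both variables are set. This is a minimal sketch and not a script shipped with the repository: `missing_env_vars` is a hypothetical helper, and it only inspects the process environment, so values that live solely in a .env file will not be seen unless they are loaded first.

```python
import os

# The two variables named in the setup step above.
REQUIRED_ENV_VARS = ("KIMI_API_KEY", "KIMI_BASE_URL")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Typical use: report anything that still needs to be exported.
missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("KIMI_API_KEY and KIMI_BASE_URL are set.")
```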

3. Pre-flight Check

Before running benchmarks, verify that the API correctly enforces parameter constraints:

# Kimi Official API
uv run python verify_params.py --model kimi/your-model-id --think-mode kimi --all

# Open-source deployments (vLLM/SGLang/KTransformers)
uv run python verify_params.py --model your-model-id --think-mode opensource --all

This checks that immutable parameters (temperature, top_p, etc.) are correctly enforced. All tests must pass before proceeding with benchmark evaluations.

Running Evaluations

OCRBench (Quick Validation)

Non-Thinking

uv run python eval.py ocrbench --model kimi/your-model-id \
--think-mode kimi --max-tokens 8192 --stream

Thinking

uv run python eval.py ocrbench --model kimi/your-model-id \
--thinking --think-mode kimi --max-tokens 16384 --stream

MMMU Pro Vision

Non-Thinking

uv run python eval.py mmmu --model kimi/your-model-id \
--think-mode kimi --max-tokens 16384 --stream

Thinking

uv run python eval.py mmmu --model kimi/your-model-id \
--thinking --think-mode kimi --max-tokens 65536 --stream

AIME 2025

Non-Thinking

uv run python eval.py aime2025 --model kimi/your-model-id \
--think-mode kimi --max-tokens 16384 --stream

Thinking

uv run python eval.py aime2025 --model kimi/your-model-id \
--thinking --think-mode kimi --max-tokens 98304 --stream
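The six commands above differ only in the benchmark name, the --thinking flag, and the --max-tokens value, so they can be generated rather than typed. The following is a sketch under that assumption: `eval_command` is a hypothetical helper, not part of the repository, and it uses only the flags shown in the commands above.

```python
import shlex


def eval_command(benchmark, model, max_tokens, thinking=False,
                 think_mode="kimi", stream=True):
    """Build the argv list for one eval.py run."""
    argv = ["uv", "run", "python", "eval.py", benchmark, "--model", model]
    if thinking:
        argv.append("--thinking")
    argv += ["--think-mode", think_mode, "--max-tokens", str(max_tokens)]
    if stream:
        argv.append("--stream")
    return argv


# Print the non-thinking OCRBench command as a copy-pastable shell line.
print(shlex.join(eval_command("ocrbench", "kimi/your-model-id", 8192)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with unusual model identifiers.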

> Tip: Run OCRBench first for quick validation (~10 min). Once verified, proceed with MMMU and AIME full evaluations.

Reference

Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| benchmark | Task: ocrbench, mmmu, aime2025 | ocrbench |
| --model | Model identifier, e.g., kimi/your-model-id | Required |
| --max-tokens | Max output tokens (see Required Parameters) | Required |
| --thinking | Enable thinking mode (requires --think-mode kimi/opensource) | Off |
| --think-mode | Thinking param format: kimi or opensource (vLLM/SGLang/KTransformers) | kimi |
| --temperature | Sampling temperature | thinking: 1.0, non-thinking: 0.6 |
| --top-p | Top-p sampling | 0.95 |
| --stream | Enable streaming (recommended for long inference) | Off |
| --max-connections | Max concurrent connections | Per benchmark |
| --epochs | Number of sampling epochs | Per benchmark |
| --client-timeout | HTTP timeout in seconds | 86400 |

Thinking Mode Parameters

| Model Type | Parameters | extra_body |
|------------|------------|------------|
| Kimi Official + thinking off | --think-mode kimi | {"thinking": {"type": "disabled"}} |
| Kimi Official + thinking on | --thinking --think-mode kimi | {"thinking": {"type": "enabled"}} |
| Opensource + thinking off | --think-mode opensource | {"chat_template_kwargs": {"thinking": false}} |
| Opensource + thinking on | --thinking --think-mode opensource | {"chat_template_kwargs": {"thinking": true}} |
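The mapping in this table can be written as a single function. This is a sketch of the mapping only, not the repository's own code: `build_extra_body` is a hypothetical name, and the payloads are copied from the table.

```python
def build_extra_body(thinking: bool, think_mode: str = "kimi") -> dict:
    """Return the extra_body payload for a thinking flag and think mode."""
    if think_mode == "kimi":
        # Kimi Official API: thinking is toggled by a type string.
        return {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if think_mode == "opensource":
        # vLLM/SGLang/KTransformers: thinking is a chat template kwarg.
        return {"chat_template_kwargs": {"thinking": thinking}}
    raise ValueError(f"unknown think mode: {think_mode!r}")
```

Note that both modes send an explicit value when thinking is off, instead of omitting the field.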

View Results

# Use inspect view to browse logs
uv run inspect view

# Logs are saved in logs/ directory

Resume Interrupted Evaluations

uv run inspect eval-retry logs/.eval

Notes

AIME 2025 Evaluation

AIME evaluation generates many output tokens. Keep in mind:

1. Timeout Settings

  • Client: Default --client-timeout 86400 (24h), usually no change needed
  • Server: Ensure the server timeout is also set long enough
  • Gateway/Proxy: If using nginx/ALB, adjust proxy_read_timeout etc.

2. Streaming

  • Using --stream is strongly recommended
  • Non-streaming requests may time out in thinking mode
  • Streaming keeps the connection alive, avoiding gateway timeouts

3. Concurrency Control

  • Default max_connections=100; adjust based on server capacity
  • If you see many 429s or RemoteProtocolError exceptions, reduce concurrency

4. Quick Validation

  • First run with --epochs 1 to verify configuration
  • Then run full --epochs 32 evaluation

# Step 1: Quick validation (30 samples x 1 epoch)
uv run python eval.py aime2025 --model kimi/your-model-id \
--thinking --think-mode kimi --max-tokens 98304 --stream --epochs 1

# Step 2: Full evaluation (30 samples x 32 epochs)
uv run python eval.py aime2025 --model kimi/your-model-id \
--thinking --think-mode kimi --max-tokens 98304 --stream
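To see why the one-epoch run is a cheap check compared with the full run, it helps to compare the number of generations each step requests. The sketch below only does that arithmetic: the sample count and token limit come from the commands above, the function names are hypothetical, and the token figure is an upper bound because most responses stop well before --max-tokens.

```python
AIME_SAMPLES = 30            # samples per epoch, from the step comments above
THINKING_MAX_TOKENS = 98304  # --max-tokens for AIME 2025 in thinking mode


def total_generations(samples: int, epochs: int) -> int:
    """Number of model responses requested across all epochs."""
    return samples * epochs


def max_output_tokens(samples: int, epochs: int, max_tokens: int) -> int:
    """Upper bound on output tokens if every response hit the limit."""
    return total_generations(samples, epochs) * max_tokens


for epochs in (1, 32):
    runs = total_generations(AIME_SAMPLES, epochs)
    bound = max_output_tokens(AIME_SAMPLES, epochs, THINKING_MAX_TOKENS)
    print(f"epochs={epochs}: {runs} generations, at most {bound:,} output tokens")
```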

Automatic Retry

The following network errors are automatically retried (exponential backoff, 1-60s):

| Error Type | Description |
|------------|-------------|
| RateLimitError / 429 | Server rate limiting |
| APIConnectionError | Connection failure |
| ReadError / RemoteProtocolError | Network read error |
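For readers unfamiliar with the pattern, exponential backoff bounded to 1-60 seconds can be sketched as follows. This illustrates the general technique and is not the tool's actual retry code: the doubling factor, the attempt limit, and the names `backoff_delay` and `call_with_retry` are assumptions, and the built-in `ConnectionError` stands in for the client-library exceptions listed above.

```python
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at 60s."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(func, retryable=(ConnectionError,), max_attempts=8,
                    sleep=time.sleep):
    """Call func(), retrying on retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            # Give up once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Exceptions outside `retryable` propagate immediately, so only the listed network errors trigger a retry.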

> Non-network errors (e.g., model…
