Repo · Reka AI · published Feb 16, 2026

reka-ai/vllm-reka

Python

Description: vLLM plugin for Reka models

Language: Python

License: Apache-2.0

Stars: 9

Forks: 0

Open issues: 1

Created: 2026-02-16T14:59:03Z

Pushed: 2026-05-18T16:26:50Z

Default branch: main

Fork: no

Archived: no

README:

vllm-reka

This plugin serves Reka Edge — a 7B multimodal model with frontier-class image understanding, video analysis, object detection, and tool use — via vLLM.

It registers model architectures, a custom tokenizer, and HuggingFace configs so that vLLM can load and serve Reka checkpoints out of the box.

Quickstart

# 1. Install the plugin
uv sync

# 2. Download model weights (~14 GB)
pip install huggingface_hub
hf download RekaAI/reka-edge-2603 --local-dir ./models/reka-edge-2603

# 3. Start the server
bash ./serve.sh ./models/reka-edge-2603

# 4. Query it
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"reka-edge-2603","messages":[{"role":"user","content":"Hello!"}]}'

Requirements

  • GPU: NVIDIA GPU, ideally with ≥24 GB VRAM. Tested on RTX 3090 GPUs at 40-50 tokens/s.
  • OS: Linux with CUDA. macOS is not supported for serving.
  • Python: 3.10 ≤ x < 3.14
  • vLLM: 0.15.x (0.15.0 ≤ x < 0.16.0)
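
A quick way to check an environment against these version bounds is sketched below. It assumes the intended ranges are 3.10 ≤ Python < 3.14 and vLLM 0.15.x; the helper names are illustrative and not part of the plugin.

```python
import sys
from importlib import metadata


def python_supported(version=None):
    """Return True if the interpreter is within 3.10 <= x < 3.14."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 14)


def vllm_supported(version):
    """Return True if a vLLM version string belongs to the 0.15.x series."""
    return version.split(".")[:2] == ["0", "15"]


def installed_vllm_version():
    """Return the installed vLLM version string, or None if it is missing."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_supported())
    found = installed_vllm_version()
    print("vLLM:", found, "OK:", bool(found) and vllm_supported(found))
```

The check covers only the Python and vLLM versions. GPU, CUDA, and OS requirements still need to be verified separately.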

Supported Models

| Model | Architecture | Vision Encoder | Description |
|---|---|---|---|
| Reka Edge | Yasa2ForConditionalGeneration | ConvNextV2 | 7B multimodal model (image + video) |

Installation

Recommended (reproducible, uses uv.lock):

uv sync

Fallback with pip:

pip install -e .

Or with Poetry:

poetry install

The plugin registers itself via the vllm.general_plugins entry point — vLLM discovers it automatically once installed.

Serving

serve.sh (recommended)

Use serve.sh as the default entrypoint. It applies the plugin-specific defaults that this repo is tested with.

bash ./serve.sh

Example with explicit host/port:

HOST=0.0.0.0 PORT=8000 bash ./serve.sh ./models/reka-edge-2603

You can also pass through additional vllm serve flags:

bash ./serve.sh ./models/reka-edge-2603 --max-num-seqs 32

serve.sh configuration

Common environment variables:

| Variable | Default | Description |
|---|---|---|
| HOST | 0.0.0.0 | Bind address |
| PORT | 8000 | API port |
| SERVED_MODEL_NAME | reka-edge-2603 | Model name exposed to OpenAI-compatible clients |
| GPU_MEM | 0.95 | --gpu-memory-utilization |
| MAX_LEN | 16384 | --max-model-len |
| MAX_BATCH_TOKENS | 20000 | --max-num-batched-tokens |
| MAX_IMAGES | 6 | Per-prompt image cap |
| MAX_VIDEOS | 3 | Per-prompt video cap |
| VIDEO_NUM_FRAMES | 6 | Frames sampled per video. Higher values improve temporal understanding but increase latency and memory usage. |
| VIDEO_SAMPLING | chunk | Video sampling strategy |
| TP_SIZE | 1 | Tensor parallel size |
| DTYPE | bfloat16 | vLLM dtype |
| QUANTIZATION | bitsandbytes | Quantization backend (see [Quantization](#quantization)) |
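
The sketch below shows how variables like these could translate into a vllm serve command line. It is not the contents of serve.sh. Only the GPU_MEM, MAX_LEN, and MAX_BATCH_TOKENS flag pairings come from the table; the remaining pairings are assumptions based on standard vllm serve flags, and the multimodal caps and tool-use flags are left out. Dropping a flag when its variable is empty is also an assumption, chosen to match the QUANTIZATION="" usage.

```python
import os
import shlex

# Defaults copied from the table above.
DEFAULTS = {
    "HOST": "0.0.0.0",
    "PORT": "8000",
    "SERVED_MODEL_NAME": "reka-edge-2603",
    "GPU_MEM": "0.95",
    "MAX_LEN": "16384",
    "MAX_BATCH_TOKENS": "20000",
    "TP_SIZE": "1",
    "DTYPE": "bfloat16",
    "QUANTIZATION": "bitsandbytes",
}

# Variable -> vllm serve flag. The first three pairings are stated in the
# table; the rest are assumptions based on standard vllm serve flags.
FLAGS = {
    "GPU_MEM": "--gpu-memory-utilization",
    "MAX_LEN": "--max-model-len",
    "MAX_BATCH_TOKENS": "--max-num-batched-tokens",
    "HOST": "--host",
    "PORT": "--port",
    "SERVED_MODEL_NAME": "--served-model-name",
    "TP_SIZE": "--tensor-parallel-size",
    "DTYPE": "--dtype",
    "QUANTIZATION": "--quantization",
}


def build_command(model_path, env=None):
    """Build an argv list for vllm serve from serve.sh-style variables."""
    env = os.environ if env is None else env
    argv = ["vllm", "serve", model_path]
    for var, flag in FLAGS.items():
        value = env.get(var, DEFAULTS[var])
        if value == "":
            # Assumption: an empty value drops the flag entirely.
            continue
        argv += [flag, value]
    return argv


if __name__ == "__main__":
    print(shlex.join(build_command("./models/reka-edge-2603")))
```

Printing the command before running it makes it easier to see which defaults are in effect for a given environment.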

Optional runtime env vars:

  • VLLM_TORCH_PROFILER_DIR (only exported when set)
  • USE_IMAGE_PATCHING (default 1)
  • VLLM_VIDEO_LOADER_BACKEND (default yasa)
  • VLLM_USE_V1 (default 1)
  • VLLM_FLASH_ATTN_VERSION (default 3)
  • VLLM_HTTP_TIMEOUT_KEEP_ALIVE (default 300)

Quantization

The server defaults to 4-bit bitsandbytes quantization, which reduces VRAM usage enough to run on consumer GPUs (e.g., RTX 4090 with 24 GB). To run at full precision instead:

QUANTIZATION="" bash ./serve.sh ./models/reka-edge-2603

Full precision requires more VRAM (~14 GB for the bfloat16 weights alone, before KV cache and activations) but avoids any quantization-related quality loss.
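
The ~14 GB figure follows from simple arithmetic on the parameter count, as the sketch below shows. It assumes exactly 7 billion parameters and counts the weights only. KV cache, activations, quantization metadata, and any layers that bitsandbytes leaves unquantized are not included, so real usage will be higher.

```python
def weight_gigabytes(num_params, bits_per_param):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 7e9  # "7B" from the model description; the exact count is assumed

bf16_gb = weight_gigabytes(PARAMS, 16)
int4_gb = weight_gigabytes(PARAMS, 4)

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")  # ~14.0 GB
print(f"4-bit weights:    ~{int4_gb:.1f} GB")  # ~3.5 GB
```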

Advanced: direct vllm serve

Prefer serve.sh unless you need full manual control. Minimal direct command:

vllm serve ./models/reka-edge-2603 \
--tokenizer-mode yasa \
--chat-template-content-format openai \
--trust-remote-code

Examples

Once the server is running, it exposes an OpenAI-compatible API at http://localhost:8000 (or your configured PORT).
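
Before sending requests, it can help to confirm the server is up by listing the models it exposes. The sketch below assumes the standard OpenAI-style /v1/models endpoint is available, as it is in upstream vLLM; the function names are illustrative.

```python
import json
import urllib.request


def models_url(host="localhost", port=8000):
    """Build the URL of the OpenAI-style model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def served_model_ids(host="localhost", port=8000, timeout=5.0):
    """Return the model ids the server reports, e.g. to check it is up."""
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    print(served_model_ids())
```

With the default serve.sh settings, the returned list should contain the SERVED_MODEL_NAME value.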

Text

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{"role": "user", "content": "Hello!"}]
}'

Image understanding

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Describe this image in detail."}
]
}]
}'
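
For images on local disk, a common approach is to send the file as a base64 data URL in the same image_url field. The sketch below assumes the server accepts data URLs, as upstream vLLM's OpenAI-compatible server does; this is not confirmed for the plugin here, and the helper names are hypothetical.

```python
import base64
import mimetypes


def image_part_from_bytes(data, mime_type="image/jpeg"):
    """Wrap raw image bytes as an image_url content part using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def image_message(path, prompt):
    """Build a user message with one local image followed by a text prompt."""
    mime_type = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as handle:
        part = image_part_from_bytes(handle.read(), mime_type)
    return {"role": "user", "content": [part, {"type": "text", "text": prompt}]}
```

The returned dictionary can be passed as one element of the messages list in a chat completion request.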

Video analysis

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
{"type": "text", "text": "Summarize what happens in this video."}
]
}]
}'

Object detection

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "Detect: eye, ear"}
]
}]
}'
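
The detection request differs from the image example only in its prompt, which lists the target labels after "Detect:". A small helper for building that message might look like the sketch below. The function names are hypothetical, and the response format is not shown in this excerpt, so no parsing is attempted.

```python
def detection_prompt(labels):
    """Format labels in the "Detect: a, b" style used in the example."""
    cleaned = [label.strip() for label in labels if label.strip()]
    if not cleaned:
        raise ValueError("at least one label is required")
    return "Detect: " + ", ".join(cleaned)


def detection_message(image_url, labels):
    """Build a user message asking the model to detect the given labels."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": detection_prompt(labels)},
        ],
    }
```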

Tool use / function calling

serve.sh enables tool use by default (--enable-auto-tool-choice --tool-call-parser hermes). Pass tools via the standard OpenAI tools parameter:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reka-edge-2603",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location.",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
}'

The model will return a tool_calls response when it decides to invoke a function.
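
Once a tool_calls response arrives, the client runs the requested function and sends the result back as a tool message. The sketch below follows the standard OpenAI chat format and works on plain dictionaries. The get_weather implementation and the TOOLS registry are hypothetical placeholders.

```python
import json


def get_weather(location):
    """Hypothetical local implementation of the get_weather tool."""
    return {"location": location, "forecast": "placeholder"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message):
    """Execute each requested tool and return the follow-up tool messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string, not a parsed object.
        arguments = json.loads(function["arguments"] or "{}")
        output = TOOLS[function["name"]](**arguments)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results
```

Appending the assistant message and the returned tool messages to the conversation, then calling the endpoint again, lets the model produce its final answer.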

Python client

The server is compatible with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Text query
response = client.chat.completions.create(…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low-star repo from reka, minor