novitalabs/autotuner
Description: Optimize the performance of LLM inference engines by automatically tuning parameters for a specific model.
Language: Python
License: MIT
Stars: 11
Forks: 3
Open issues: 4
Created: 2025-10-21T11:17:53Z
Pushed: 2026-06-10T20:58:22Z
Default branch: main
Fork: no
Archived: no
README:
LLM Autotuner (for inference)
Automated parameter tuning for LLM inference engines (SGLang, vLLM) for best performance, while respecting SLOs and hardware constraints.
Why Autotuner?
Quantization and parameter tuning can unlock 60%+ performance gains. LLM inference engines like SGLang and vLLM ship with conservative defaults that work everywhere but are optimized for nowhere.
Performance Impact: Real-World Data
Tested on an NVIDIA RTX 4090 (24 GB) with typical production workloads (mixed prefill/decode).
See detailed benchmarks: [Baseline Benchmarks](docs/qwen-benchmarks.md)
| What You Get | Manual Tuning | Autotuner |
|--------------|---------------|-----------|
| Time to optimal config | Hours to Days | Minutes |
| Parameter combinations tested | ~10 (limited by patience) | 50-100+ (automated) |
| Performance gain | Unknown (untested) | 60%+ throughput (quantization + tuning) |
| Reproducibility | Low (manual errors) | High (versioned configs) |
| Cross-hardware portability | Manual rework | Re-run task (one command) |
How to Use
CLI Mode
Web UI Mode
Agent Mode
Core Concepts
- Task: A tuning job containing model config, parameter ranges, SLOs, and optimization strategy
- Experiment: Individual trial with specific parameter values; multiple experiments per task
- ARQ Worker: Background processor that deploys models, runs benchmarks, and scores results
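
To make the Task/Experiment relationship concrete, here is a minimal sketch of how a task's parameter ranges could expand into individual experiments under grid search. The class names, field names, parameter names (`tp_size`, `mem_fraction`), and the model string are illustrative assumptions, not the project's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import product
from typing import Any


@dataclass
class Experiment:
    """One trial: a concrete value for every tuned parameter (hypothetical shape)."""
    params: dict[str, Any]
    score: float | None = None  # would be filled in after benchmarking


@dataclass
class Task:
    """A tuning job: model, candidate parameter values, and SLO limits (hypothetical shape)."""
    model: str
    parameter_ranges: dict[str, list[Any]]
    slo: dict[str, float] = field(default_factory=dict)

    def grid_experiments(self) -> list[Experiment]:
        """Expand the ranges into one Experiment per combination (grid search)."""
        names = list(self.parameter_ranges)
        combos = product(*(self.parameter_ranges[n] for n in names))
        return [Experiment(params=dict(zip(names, combo))) for combo in combos]


task = Task(
    model="example-model",  # placeholder, not a model named in the README
    parameter_ranges={
        "tp_size": [1, 2],                 # assumed parameter name
        "mem_fraction": [0.8, 0.85, 0.9],  # assumed parameter name
    },
    slo={"p90_latency_s": 5.0},            # assumed SLO key
)
experiments = task.grid_experiments()
print(len(experiments))       # 2 * 3 combinations
print(experiments[0].params)
```

In this sketch a worker would pick up each `Experiment`, deploy the model with its `params`, benchmark it, and write back a `score`.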
Features
- Multiple Deployment Modes: Docker, Local (direct GPU), OME (Kubernetes)
- Web UI: React frontend with real-time monitoring
- Agent Assistant: LLM-powered assistant for task management and troubleshooting
- Optimization Strategies: Grid search, Bayesian optimization
- SLO-Aware Scoring: Exponential penalties for constraint violations
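
The idea behind SLO-aware scoring with exponential penalties can be sketched as follows. This is one plausible formulation, not the project's actual formula (see docs/features/slo-scoring.md for that): the function names, the `steepness` constant, and the metric key are assumptions made for illustration.

```python
import math


def slo_penalty(observed: float, limit: float, steepness: float = 5.0) -> float:
    """Return a multiplier >= 1 that grows exponentially with the relative overshoot."""
    overshoot = max(0.0, (observed - limit) / limit)  # 0 when the SLO is met
    return math.exp(steepness * overshoot)


def score(throughput: float, metrics: dict[str, float], slo: dict[str, float]) -> float:
    """Throughput divided by the product of per-metric penalties (higher is better)."""
    penalty = 1.0
    for name, limit in slo.items():
        penalty *= slo_penalty(metrics[name], limit)
    return throughput / penalty


slo = {"p90_latency_s": 5.0}  # assumed SLO: p90 latency of at most 5 seconds
within = score(1000.0, {"p90_latency_s": 4.0}, slo)  # meets the SLO, no penalty
over = score(1200.0, {"p90_latency_s": 6.0}, slo)    # 20% over the limit
print(within, over)
```

With these numbers, the configuration with higher raw throughput ends up with the lower score because it violates the latency limit, which is the behavior an exponential penalty is meant to produce.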
Quick Start
→ [Get started in 5 minutes with Docker](docs/getting-started/quickstart.md)
    # Install
    pip install -r requirements.txt && pip install genai-bench

    # Run
    python src/run_autotuner.py examples/docker_task.yaml --mode docker
Web UI
    # Start backend + worker
    ./scripts/start_dev.sh

    # Start frontend (separate terminal)
    cd frontend && npm run dev
Access at http://localhost:5173
Documentation
Project Overview
- [ROADMAP.md](docs/architecture/roadmap.md) - Product roadmap with completed milestones and future plans
Setup & Deployment
- [Installation Guide](docs/getting-started/installation.md) - Complete installation guide
- [Quick Start](docs/getting-started/quickstart.md) - Quick start tutorial
- [Docker Mode](docs/user-guide/docker-mode.md) - Docker deployment guide
- [Kubernetes/OME](docs/user-guide/kubernetes.md) - Kubernetes/OME setup
Features & Configuration
- [SLO Scoring](docs/features/slo-scoring.md) - SLO-aware scoring with exponential penalties
- [Parallel Execution](docs/features/parallel-execution.md) - Parallel experiment execution
- [WebSocket Implementation](docs/features/websocket.md) - Real-time updates via WebSocket
- [Quantization Parameters](docs/UNIFIED_QUANTIZATION_PARAMETERS.md) - Quantization configuration
- [Parameter Presets](docs/user-guide/presets.md) - Parameter preset system
- [Bayesian Optimization](docs/features/bayesian-optimization.md) - Bayesian optimization strategy
- [GPU Tracking](docs/features/gpu-tracking.md) - Intelligent GPU scheduling
Operations & Troubleshooting
- [Troubleshooting](docs/troubleshooting.md) - Common issues and solutions
Contributing
See [DEVELOPMENT](docs/DEVELOPMENT.md) for development guidelines and project architecture.
Notability
3.0/10: Routine new repo with minimal stars