deepinfra/tensorrtllm_backend
forked from triton-inference-server/tensorrtllm_backend
Description: The Triton TensorRT-LLM Backend
Language: Python
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2023-12-15T23:21:37Z
Pushed: 2025-05-08T17:37:45Z
Default branch: main
Fork: yes
Parent repository: triton-inference-server/tensorrtllm_backend
Archived: no
README:
TensorRT-LLM Backend
The Triton backend for TensorRT-LLM. You can learn more about Triton backends in the backend repo. The goal of TensorRT-LLM Backend is to let you serve TensorRT-LLM models with Triton Inference Server. The [inflight_batcher_llm](./inflight_batcher_llm/) directory contains the C++ implementation of the backend supporting inflight batching, paged attention and more.
Where can I ask general questions about Triton and Triton backends?
Be sure to read all the information below as well as the general Triton documentation available in the main server repo. If you don't find your answer there, you can ask questions on the issues page.
Table of Contents
- [TensorRT-LLM Backend](#tensorrt-llm-backend)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [Quick Start](#quick-start)
- [Launch Triton TensorRT-LLM container](#launch-triton-tensorrt-llm-container)
- [Prepare TensorRT-LLM engines](#prepare-tensorrt-llm-engines)
- [Prepare the Model Repository](#prepare-the-model-repository)
- [Modify the Model Configuration](#modify-the-model-configuration)
- [Serving with Triton](#serving-with-triton)
- [Send an Inference Request](#send-an-inference-request)
- [Using the generate endpoint](#using-the-generate-endpoint)
- [Using the client scripts](#using-the-client-scripts)
- [Early stopping](#early-stopping)
- [Return context logits and/or generation logits](#return-context-logits-andor-generation-logits)
- [Requests with batch size \> 1](#requests-with-batch-size--1)
- [Building from Source](#building-from-source)
- [Supported Models](#supported-models)
- [Model Config](#model-config)
- [Model Deployment](#model-deployment)
- [TRT-LLM Multi-instance Support](#trt-llm-multi-instance-support)
- [Leader Mode](#leader-mode)
- [Orchestrator Mode](#orchestrator-mode)
- [Running Multiple Instances of LLaMa Model](#running-multiple-instances-of-llama-model)
- [Multi-node Support](#multi-node-support)
- [Model Parallelism](#model-parallelism)
- [Tensor Parallelism, Pipeline Parallelism and Expert Parallelism](#tensor-parallelism-pipeline-parallelism-and-expert-parallelism)
- [MIG Support](#mig-support)
- [Scheduling](#scheduling)
- [Key-Value Cache](#key-value-cache)
- [Decoding](#decoding)
- [Decoding Modes - Top-k, Top-p, Top-k Top-p, Beam Search, Medusa, ReDrafter, Lookahead and Eagle](#decoding-modes---top-k-top-p-top-k-top-p-beam-search-medusa-redrafter-lookahead-and-eagle)
- [Speculative Decoding](#speculative-decoding)
- [Chunked Context](#chunked-context)
- [Quantization](#quantization)
- [LoRa](#lora)
- [Launch Triton server *within Slurm based clusters*](#launch-triton-server-within-slurm-based-clusters)
- [Prepare some scripts](#prepare-some-scripts)
- [Submit a Slurm job](#submit-a-slurm-job)
- [Triton Metrics](#triton-metrics)
- [Benchmarking](#benchmarking)
- [Testing the TensorRT-LLM Backend](#testing-the-tensorrt-llm-backend)
Getting Started
Quick Start
Below is an example of how to serve a TensorRT-LLM model with the Triton TensorRT-LLM Backend on a 4-GPU environment. The example uses the GPT model from the TensorRT-LLM repository with the NGC Triton TensorRT-LLM container. Make sure you are cloning the same version of TensorRT-LLM backend as the version of TensorRT-LLM in the container. Please refer to the support matrix to see the aligned versions.
In this example, we will use Triton 24.07 with TensorRT-LLM v0.11.0.
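The version pairing can be checked in a script before any image is pulled. The following is a minimal sketch, not part of this repository: `SUPPORTED_PAIRS` and `check_versions` are hypothetical names, and the table records only the single pairing stated in this example. The support matrix remains the authoritative list.

```python
# Hypothetical helper: only the pairing named in this README is recorded here.
# Extend the table from the official support matrix before relying on it.
SUPPORTED_PAIRS = {"24.07": "v0.11.0"}  # Triton release -> TensorRT-LLM version


def check_versions(triton: str, trtllm: str) -> bool:
    """Return True only if the pair appears in the local table."""
    return SUPPORTED_PAIRS.get(triton) == trtllm


if __name__ == "__main__":
    print(check_versions("24.07", "v0.11.0"))
    print(check_versions("24.07", "v0.10.0"))
```

An unknown Triton release returns `False` rather than raising, so the check can gate a deployment script without extra error handling.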
Launch Triton TensorRT-LLM container
Launch the Triton Docker container `nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3`, which includes the TensorRT-LLM backend.
Create an engines folder outside Docker so that the engines can be reused in future runs. Make sure to replace `<xx.yy>` with the version of Triton that you want to use.
    docker run --rm -it --net host --shm-size=2g \
        --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
        -v </path/to/engines>:/engines \
        nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
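When the launch is scripted, it can help to assemble the same command from two inputs: the Triton version and the host engines folder. The sketch below is an illustration under assumptions, not tooling shipped with the backend. `docker_run_command` is a hypothetical name, and the flags are copied from the command above.

```python
import shlex
from pathlib import Path


def docker_run_command(triton_version: str, engines_dir: str) -> list[str]:
    """Assemble the `docker run` argument list used in this Quick Start."""
    image = f"nvcr.io/nvidia/tritonserver:{triton_version}-trtllm-python-py3"
    # Docker bind mounts need an absolute host path.
    host_dir = Path(engines_dir).expanduser().resolve()
    return [
        "docker", "run", "--rm", "-it", "--net", "host",
        "--shm-size=2g",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        "--gpus", "all",
        "-v", f"{host_dir}:/engines",
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(docker_run_command("24.07", "./engines")))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.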
Prepare TensorRT-LLM engines
You can skip this step if you already have the engines ready. Follow the guide in the TensorRT-LLM repository for more details on how to prepare the engines for all the supported models. You can also check out the tutorials to see more examples of serving TensorRT-LLM models.
    cd /app/tensorrt_llm/examples/gpt
    # Download weights from HuggingFace Transformers
    rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
    pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd
    # Convert weights from HF Transformers to TensorRT-LLM checkpoint
    python3 convert_checkpoint.py --model_dir gpt2 \
        --dtype float16 \
        --tp_size 4 \
        --output_dir ./c-model/gpt2/fp16/4-gpu
    # Build TensorRT engines
    trtllm-build --checkpoint_dir ./c-model/gpt2/fp16/4-gpu \
        --gpt_attention_plugin float16 \
        --remove_input_padding enable \
        --kv_cache_type paged \
        --gemm_plugin float16 \
        --output_dir /engines/gpt/fp16/4-gpu
See here for more details on the parameters.
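The two build steps share values that must stay in sync: the checkpoint directory written by `convert_checkpoint.py` is the one read by `trtllm-build`, and the tensor-parallel size and dtype appear in both the arguments and the directory names. The sketch below derives both commands from a single `tp_size` and `dtype`. It is an illustration only: `engine_build_commands` is a hypothetical name, the directory layout simply mirrors the example above, and the flags are the ones shown there.

```python
import shlex


def engine_build_commands(tp_size: int, dtype: str = "float16") -> list[list[str]]:
    """Return [convert_cmd, build_cmd] with consistent paths and settings."""
    # Assumed naming convention, taken from the example paths:
    # float16 -> "fp16", 4 ranks -> "4-gpu".
    tag = {"float16": "fp16"}.get(dtype, dtype)
    ckpt_dir = f"./c-model/gpt2/{tag}/{tp_size}-gpu"
    engine_dir = f"/engines/gpt/{tag}/{tp_size}-gpu"
    convert = [
        "python3", "convert_checkpoint.py",
        "--model_dir", "gpt2",
        "--dtype", dtype,
        "--tp_size", str(tp_size),
        "--output_dir", ckpt_dir,
    ]
    build = [
        "trtllm-build",
        "--checkpoint_dir", ckpt_dir,  # must match the convert step's output
        "--gpt_attention_plugin", dtype,
        "--remove_input_padding", "enable",
        "--kv_cache_type", "paged",
        "--gemm_plugin", dtype,
        "--output_dir", engine_dir,
    ]
    return [convert, build]


if __name__ == "__main__":
    for cmd in engine_build_commands(tp_size=4):
        print(shlex.join(cmd))
```

Calling it with `tp_size=4` reproduces the paths used in this example. Other values are untested assumptions about how the naming would extend.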
Prepare the Model Repository
Next, create the model repository that will be used by the Triton server. The models can be found in the [all_models](./all_models) folder. The folder contains two groups of models:
- [gpt](./all_models/gpt): Using TensorRT-LLM pure Python runtime.
-…