Fork · DeepInfra · published Dec 15, 2023

deepinfra/tensorrtllm_backend

forked from triton-inference-server/tensorrtllm_backend

Description: The Triton TensorRT-LLM Backend

Language: Python

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2023-12-15T23:21:37Z

Pushed: 2025-05-08T17:37:45Z

Default branch: main

Fork: yes

Parent repository: triton-inference-server/tensorrtllm_backend

Archived: no

README:

TensorRT-LLM Backend

The Triton backend for TensorRT-LLM. You can learn more about Triton backends in the backend repo. The goal of TensorRT-LLM Backend is to let you serve TensorRT-LLM models with Triton Inference Server. The [inflight_batcher_llm](./inflight_batcher_llm/) directory contains the C++ implementation of the backend supporting inflight batching, paged attention and more.

Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the general Triton documentation available in the main server repo. If you don't find your answer there you can ask questions on the issues page.

Table of Contents

  • [TensorRT-LLM Backend](#tensorrt-llm-backend)
  • [Table of Contents](#table-of-contents)
  • [Getting Started](#getting-started)
  • [Quick Start](#quick-start)
  • [Launch Triton TensorRT-LLM container](#launch-triton-tensorrt-llm-container)
  • [Prepare TensorRT-LLM engines](#prepare-tensorrt-llm-engines)
  • [Prepare the Model Repository](#prepare-the-model-repository)
  • [Modify the Model Configuration](#modify-the-model-configuration)
  • [Serving with Triton](#serving-with-triton)
  • [Send an Inference Request](#send-an-inference-request)
  • [Using the generate endpoint](#using-the-generate-endpoint)
  • [Using the client scripts](#using-the-client-scripts)
  • [Early stopping](#early-stopping)
  • [Return context logits and/or generation logits](#return-context-logits-andor-generation-logits)
  • [Requests with batch size \> 1](#requests-with-batch-size--1)
  • [Building from Source](#building-from-source)
  • [Supported Models](#supported-models)
  • [Model Config](#model-config)
  • [Model Deployment](#model-deployment)
  • [TRT-LLM Multi-instance Support](#trt-llm-multi-instance-support)
  • [Leader Mode](#leader-mode)
  • [Orchestrator Mode](#orchestrator-mode)
  • [Running Multiple Instances of LLaMa Model](#running-multiple-instances-of-llama-model)
  • [Multi-node Support](#multi-node-support)
  • [Model Parallelism](#model-parallelism)
  • [Tensor Parallelism, Pipeline Parallelism and Expert Parallelism](#tensor-parallelism-pipeline-parallelism-and-expert-parallelism)
  • [MIG Support](#mig-support)
  • [Scheduling](#scheduling)
  • [Key-Value Cache](#key-value-cache)
  • [Decoding](#decoding)
  • [Decoding Modes - Top-k, Top-p, Top-k Top-p, Beam Search, Medusa, ReDrafter, Lookahead and Eagle](#decoding-modes---top-k-top-p-top-k-top-p-beam-search-medusa-redrafter-lookahead-and-eagle)
  • [Speculative Decoding](#speculative-decoding)
  • [Chunked Context](#chunked-context)
  • [Quantization](#quantization)
  • [LoRa](#lora)
  • [Launch Triton server *within Slurm based clusters*](#launch-triton-server-within-slurm-based-clusters)
  • [Prepare some scripts](#prepare-some-scripts)
  • [Submit a Slurm job](#submit-a-slurm-job)
  • [Triton Metrics](#triton-metrics)
  • [Benchmarking](#benchmarking)
  • [Testing the TensorRT-LLM Backend](#testing-the-tensorrt-llm-backend)

Getting Started

Quick Start

Below is an example of how to serve a TensorRT-LLM model with the Triton TensorRT-LLM Backend in a 4-GPU environment. The example uses the GPT model from the TensorRT-LLM repository with the NGC Triton TensorRT-LLM container. Make sure the version of the TensorRT-LLM backend you clone matches the version of TensorRT-LLM in the container. Refer to the support matrix for the aligned versions.

In this example, we will use Triton 24.07 with TensorRT-LLM v0.11.0.
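The version pinning described above can be sketched as a small lookup. This is only an illustration: the `SUPPORT_MATRIX` dictionary and both function names are hypothetical, and the only pair filled in is the one stated in this guide (Triton 24.07 with TensorRT-LLM v0.11.0). Any other pair would have to be copied from the official support matrix.

```python
# Hypothetical helper: these names are illustrative and not part of the backend.
SUPPORT_MATRIX = {"24.07": "v0.11.0"}  # only the pair stated in this guide


def image_tag(triton_version: str) -> str:
    """Return the NGC image name for a Triton release such as '24.07'."""
    return f"nvcr.io/nvidia/tritonserver:{triton_version}-trtllm-python-py3"


def trtllm_version_for(triton_version: str) -> str:
    """Look up the TensorRT-LLM version aligned with a Triton release."""
    try:
        return SUPPORT_MATRIX[triton_version]
    except KeyError:
        raise ValueError(
            f"no known TensorRT-LLM version for Triton {triton_version}"
        ) from None


if __name__ == "__main__":
    print(image_tag("24.07"), trtllm_version_for("24.07"))
```

Failing loudly on an unknown release is a deliberate choice here: a silent fallback could pair a container with a mismatched backend checkout.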

Launch Triton TensorRT-LLM container

Launch the Triton Docker container `nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3`, which includes the TensorRT-LLM backend.

Create an engines folder outside the container so that engines can be reused in future runs. Make sure to replace `<xx.yy>` with the version of Triton that you want to use.

docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v </path/to/engines>:/engines \
nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
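For scripted launches, the same command can be assembled as an argument list. The sketch below is a minimal example under stated assumptions: `docker_run_args` is a hypothetical helper name, and the host engines directory is whatever path you choose. The path is resolved to an absolute one because Docker treats a non-absolute `-v` source as a named volume rather than a bind mount.

```python
import shlex
from pathlib import Path


def docker_run_args(engines_dir: str, triton_version: str = "24.07") -> list[str]:
    """Build the docker run arguments used in this guide (hypothetical helper)."""
    # Resolve to an absolute path so Docker performs a bind mount.
    host_dir = Path(engines_dir).expanduser().resolve()
    return [
        "docker", "run", "--rm", "-it", "--net", "host", "--shm-size=2g",
        "--ulimit", "memlock=-1", "--ulimit", "stack=67108864",
        "--gpus", "all",
        "-v", f"{host_dir}:/engines",
        f"nvcr.io/nvidia/tritonserver:{triton_version}-trtllm-python-py3",
    ]


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(docker_run_args("~/engines")))
```

To actually start the container, the list could be passed to `subprocess.run`; printing it first makes it easy to confirm the mount and image tag.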

Prepare TensorRT-LLM engines

You can skip this step if you already have the engines ready. Follow the guide in the TensorRT-LLM repository for more details on how to prepare the engines for all the supported models. You can also check out the tutorials to see more examples of serving TensorRT-LLM models.

cd /app/tensorrt_llm/examples/gpt

# Download weights from HuggingFace Transformers
rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd

# Convert weights from HF Transformers to TensorRT-LLM checkpoint
python3 convert_checkpoint.py --model_dir gpt2 \
--dtype float16 \
--tp_size 4 \
--output_dir ./c-model/gpt2/fp16/4-gpu

# Build TensorRT engines
trtllm-build --checkpoint_dir ./c-model/gpt2/fp16/4-gpu \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--kv_cache_type paged \
--gemm_plugin float16 \
--output_dir /engines/gpt/fp16/4-gpu

See here for more details on the parameters.
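The convert and build steps share two values: the tensor-parallel size and the checkpoint directory that the first step writes and the second step reads. A minimal sketch of keeping them in sync is shown below. `build_commands` is a hypothetical helper; every flag and path is copied from the commands above, and only `tp_size` is parameterized.

```python
import shlex


def build_commands(tp_size: int = 4) -> tuple[list[str], list[str]]:
    """Return the convert and build commands for a given tensor-parallel size."""
    if tp_size < 1:
        raise ValueError("tp_size must be a positive integer")
    # Both steps must agree on this directory.
    ckpt_dir = f"./c-model/gpt2/fp16/{tp_size}-gpu"
    engine_dir = f"/engines/gpt/fp16/{tp_size}-gpu"
    convert = [
        "python3", "convert_checkpoint.py", "--model_dir", "gpt2",
        "--dtype", "float16",
        "--tp_size", str(tp_size),
        "--output_dir", ckpt_dir,
    ]
    build = [
        "trtllm-build", "--checkpoint_dir", ckpt_dir,
        "--gpt_attention_plugin", "float16",
        "--remove_input_padding", "enable",
        "--kv_cache_type", "paged",
        "--gemm_plugin", "float16",
        "--output_dir", engine_dir,
    ]
    return convert, build


if __name__ == "__main__":
    for command in build_commands(4):
        print(shlex.join(command))
```

Deriving both paths from one `tp_size` avoids the common mistake of building engines from a checkpoint that was converted for a different GPU count.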

Prepare the Model Repository

Next, create the model repository that will be used by the Triton server. The models can be found in the [all_models](./all_models) folder. The folder contains two groups of models:

  • [gpt](./all_models/gpt): Using TensorRT-LLM pure Python runtime.

-…
