Fork · DeepInfra · published Mar 13, 2024 · seen Jun 5

deepinfra/TensorRT-LLM

forked from NVIDIA/TensorRT-LLM

GitHub/github.com/deepinfra/TensorRT-LLM

deepinfra/TensorRT-LLM repository metadata

published Mar 13, 2024 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

Description: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Language: Python

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 1

Created: 2024-03-13T18:57:47Z

Pushed: 2026-06-04T23:29:39Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/TensorRT-LLM

Archived: no

README:

TensorRT LLM
===========================

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

[Ask DeepWiki](https://deepwiki.com/NVIDIA/TensorRT-LLM)

Architecture | Performance | Examples | Documentation | Roadmap

---

Tech Blogs

[02/06] Accelerating Long-Context Inference with Skip Softmax Attention ✨ ➡️ link
[01/09] Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs ✨ ➡️ link
[10/13] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary) ✨ ➡️ link
[09/26] Inference Time Compute Implementation in TensorRT LLM ✨ ➡️ link
[09/19] Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly ✨ ➡️ link
[08/29] ADP Balance Strategy ✨ ➡️ link
[08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT LLM ✨ ➡️ link
[08/01] Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization) ✨ ➡️ link
[07/26] N-Gram Speculative Decoding in TensorRT LLM ✨ ➡️ link
[06/19] Disaggregated Serving in TensorRT LLM ✨ ➡️ link
[06/05] Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP) ✨ ➡️ link
[05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers ✨ ➡️ link
[05/23] DeepSeek R1 MTP Implementation and Optimization ✨ ➡️ link
[05/16] Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs ✨ ➡️ link

Latest News

[08/05] 🌟 TensorRT LLM delivers Day-0 support for OpenAI's latest open-weights models: GPT-OSS-120B ➡️ link and GPT-OSS-20B ➡️ link
[07/15] 🌟 TensorRT LLM delivers Day-0 support for LG AI Research's latest model, EXAONE 4.0 ➡️ link
[06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ ➡️ link
[05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick ✨ ➡️ link
[04/10] TensorRT LLM DeepSeek R1 performance benchmarking best practices now published. ✨ ➡️ link
[04/05] TensorRT LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!
[03/22] TensorRT LLM is now fully open-source, with developments moved to GitHub!
[03/18] 🚀🚀 NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance with TensorRT LLM ➡️ Link
[02/28] 🌟 NAVER Place Optimizes SLM-Based Vertical Services with TensorRT LLM ➡️ Link
[02/25] 🌟 DeepSeek-R1 performance now optimized for Blackwell ➡️ Link
[02/20] Explore the complete guide to achieve great accuracy, high throughput, and low latency at the lowest cost for your business here.
[02/18] Unlock #LLM inference with auto-scaling on @AWS EKS ✨ ➡️ link
