ForkNovita AINovita AIpublished May 11, 2026seen 5d

novitalabs/gateway-api-inference-extension

forked from kubernetes-sigs/gateway-api-inference-extension

Open original ↗

Captured source

source ↗

novitalabs/gateway-api-inference-extension

Description: Gateway API Inference Extension

Language: Go

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-05-11T07:47:54Z

Pushed: 2026-05-21T07:12:06Z

Default branch: main

Fork: yes

Parent repository: kubernetes-sigs/gateway-api-inference-extension

Archived: no

README: ![Go Report Card](https://goreportcard.com/report/sigs.k8s.io/gateway-api-inference-extension) ![Go Reference](https://pkg.go.dev/sigs.k8s.io/gateway-api-inference-extension)

Gateway API Inference Extension

Gateway API Inference Extension optimizes self-hosting Generative Models on Kubernetes. This is achieved by leveraging Envoy's [External Processing] (ext-proc) to extend any gateway that supports both ext-proc and [Gateway API] into an [inference gateway].

[Inference Gateway]:#concepts-and-definitions

New!

Inference Gateway has partnered with vLLM to accelerate LLM serving optimizations with llm-d!

> [!IMPORTANT] > The Endpoint Picker (EPP), InferenceObjective and InferenceModelRewrite APIs, and Body Based Router (BBR) packages have moved to new repositories: > - EPP and associated APIs: llm-d/llm-d-inference-scheduler > - BBR: llm-d/llm-d-inference-payload-processor > > No new code will be accepted to these packages in this repository, and they will be archived soon. This move was proposed and discussed in issue #2430. > > This repository will continue to host the lightweight EPP (LWEPP) and the InferencePool API, and will remain the primary location for the development and maintenance of conformance tests.

Concepts and Definitions

The following specific terms to this project:

  • Inference Gateway (IGW): A proxy/load-balancer which has been coupled with an

Endpoint Picker. It provides optimized routing and load balancing for serving Kubernetes self-hosted generative Artificial Intelligence (AI) workloads. It simplifies the deployment, management, and observability of AI inference workloads.

  • Inference Scheduler: An extendable component that makes decisions about which endpoint is optimal (best cost /

best performance) for an inference request based on Metrics and Capabilities from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).

  • Metrics and Capabilities: Data provided by model serving platforms about

performance, availability and capabilities to optimize routing. Includes things like [Prefix Cache] status or [LoRA Adapters] availability.

  • Endpoint Picker(EPP): An implementation of an Inference Scheduler with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP here.
  • Body Based Router(BBR): An optional additional ext-proc server that parses the http body of the inference prompt message and extracts information (currently the model name for OpenAI API style messages) into a format which can then be used by the gateway for routing purposes. Additional info here and in the documentation user guides.

The following are key industry terms that are important to understand for this project:

  • Model: A generative AI model that has learned patterns from data and is

used for inference. Models vary in size and architecture, from smaller domain-specific models to massive multi-billion parameter neural networks that are optimized for diverse language tasks.

  • Inference: The process of running a generative AI model, such as a large

language model, diffusion model etc, to generate text, embeddings, or other outputs from input data.

  • Model server: A service (in our case, containerized) responsible for

receiving inference requests and returning predictions from a model.

  • Accelerator: specialized hardware, such as Graphics Processing Units

(GPUs) that can be attached to Kubernetes nodes to speed up computations, particularly for training and inference tasks.

For deeper insights and more advanced concepts, refer to our [proposals](/docs/proposals).

[Inference]:https://www.digitalocean.com/community/tutorials/llm-inference-optimization [Gateway API]:https://github.com/kubernetes-sigs/gateway-api [Prefix Cache]:https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html [LoRA Adapters]:https://docs.vllm.ai/en/stable/features/lora.html [External Processing]:https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter

Technical Overview

This extension upgrades an ext-proc capable proxy or gateway - such as Envoy Gateway, kgateway, or the GKE Gateway - to become an [inference gateway] - supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.

The Inference Gateway:

  • Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling algorithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
  • Provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

routine fork of API extension