ForkCoreWeaveCoreWeavepublished Jul 24, 2024seen 6d

coreweave/exllamav2

forked from turboderp-org/exllamav2

Open original ↗

Captured source

source ↗
published Jul 24, 2024seen 6dcaptured 15hhttp 200method plain

coreweave/exllamav2

Description: A fast inference library for running LLMs locally on modern consumer-class GPUs

Language: Python

License: MIT

Stars: 0

Forks: 0

Open issues: 2

Created: 2024-07-24T20:45:32Z

Pushed: 2024-12-20T20:47:40Z

Default branch: master

Fork: yes

Parent repository: turboderp-org/exllamav2

Archived: no

README:

ExLlamaV2

ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.

The official and recommended backend server for ExLlamaV2 is TabbyAPI, which provides an OpenAI-compatible API for local or remote inference, with extended features like HF model downloading, embedding model support and support for HF Jinja2 chat templates.

See the wiki for help getting started.

New in v0.1.0+:

  • ExLlamaV2 now supports paged attention via Flash Attention 2.5.7+
  • New generator with dynamic batching, smart prompt caching, K/V cache deduplication and simplified API

![alt_text](doc/dynamic_gen.gif)

Dynamic generator

The dynamic generator supports all inference, sampling and speculative decoding features of the previous two generators, consolidated into one API (with the exception of FP8 cache, though the Q4 cache mode is supported and performs better anyway, see [here](doc/qcache_eval.md).)

The generator is explained in detail [here](doc/dynamic.md).

  • Single generation:
output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)
  • Batched generation:
outputs = generator.generate(
prompt = [
"Hello, my name is",
"Once upon a time,",
"Large language models are",
],
max_new_tokens = 200
)
  • Streamed generation with asyncio:
job = ExLlamaV2DynamicJobAsync(
generator,
input_ids = tokenizer.encode("You can lead a horse to water"),
banned_strings = ["make it drink"],
gen_settings = ExLlamaV2Sampler.Settings.greedy(),
max_new_tokens = 200
)
async for result in job:
text = result.get("text", "")
print(text, end = "")

See the full, updated examples here.

Performance

Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future, and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:

| Model | Mode | Size | grpsz | act | 3090Ti | 4090 | |------------|--------------|-------|-------|-----|---------|-------------| | Llama | GPTQ | 7B | 128 | no | 181 t/s | 205 t/s | | Llama | GPTQ | 13B | 128 | no | 110 t/s | 114 t/s | | Llama | GPTQ | 33B | 128 | yes | 44 t/s | 48 t/s | | OpenLlama | GPTQ | 3B | 128 | yes | 259 t/s | 296 t/s | | CodeLlama | EXL2 4.0 bpw | 34B | - | - | 44 t/s | 50 t/s | | Llama2 | EXL2 3.0 bpw | 7B | - | - | 217 t/s | 257 t/s | | Llama2 | EXL2 4.0 bpw | 7B | - | - | 185 t/s | 211 t/s | | Llama2 | EXL2 5.0 bpw | 7B | - | - | 164 t/s | 179 t/s | | Llama2 | EXL2 2.5 bpw | 70B | - | - | 33 t/s | 38 t/s | | TinyLlama | EXL2 3.0 bpw | 1.1B | - | - | 656 t/s | 770 t/s | | TinyLlama | EXL2 4.0 bpw | 1.1B | - | - | 602 t/s | 700 t/s |

How to

To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio on Windows). Also make sure you have an appropriate version of PyTorch, then run:

git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .

python test_inference.py -m -p "Once upon a time,"
# Append the '--gpu_split auto' flag for multi-GPU inference

A simple console chatbot is included. Run it with:

python examples/chat.py -m -mode llama -gs auto

The -mode argument chooses the prompt format to use. raw will produce a simple chatlog-style chat that works with base models and various other finetunes. Run with -modes for a list of all available prompt formats. You can also provide a custom system prompt with -sp.

Integration and APIs

  • TabbyAPI is a FastAPI-based server that provides an OpenAI-style web API

compatible with SillyTavern and other frontends.

  • ExUI is a simple, standalone single-user web UI that serves an ExLlamaV2 instance

directly with chat and notebook modes.

and exllamav2_HF loaders.

  • lollms-webui supports ExLlamaV2 through the exllamav2 binding.

Installation

Method 1: Install from source

To install the current dev version, clone the repo and run the setup script:

git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .

By default this will also compile and install the Torch C++ extension (exllamav2_ext) that the library relies on. You can skip this step by setting the EXLLAMA_NOCOMPILE environment variable:

EXLLAMA_NOCOMPILE= pip install .

This will install the "JIT version" of the package, i.e. it will install the Python components without building the C++ extension in the process. Instead, the extension will be built the first time the library is used, then cached in ~/.cache/torch_extensions for subsequent use.

Method 2: Install from release (with prebuilt extension)

Releases are available here, with prebuilt wheels that contain the extension binaries. Make sure to grab the right version, matching your platform, Python version (cp) and CUDA version. Crucially, you must also match the prebuilt wheel with your PyTorch version, since the Torch C++ extension ABI breaks with every new version of PyTorch.

Either download an appropriate wheel or install directly from the appropriate URL:

pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl

The py3-none-any.whl version is the JIT version which will build the extension on first launch. The .tar.gz file can also be installed this way, and it will build the extension while installing.

Method 3: Install from PyPI

A PyPI package is available as…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine fork; low impact.