Fork · Upstage (Solar) · published Jan 17, 2025

UpstageAI/tensorizer

forked from coreweave/tensorizer

Description: Module, Model, and Tensor Serialization/Deserialization

Language: Python

License: MIT

Stars: 0

Forks: 0

Open issues: 1

Created: 2025-01-17T07:12:34Z

Pushed: 2025-12-02T08:49:28Z

Default branch: main

Fork: yes

Parent repository: coreweave/tensorizer

Archived: no

README:

tensorizer

Module, Model, and Tensor Serialization/Deserialization

TLDR

Extremely fast model loads from HTTP/HTTPS, Redis, and S3 endpoints. GPT-J (20GB) loads at wire-speed (~5GB/s) on a 40GbE network, and is only bottlenecked by the Linux kernel TCP stack.
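As a quick sanity check on the figures above, the following sketch converts a link speed in gigabits per second to gigabytes per second and estimates a best-case transfer time. It assumes decimal units (1 GB = 10^9 bytes) and ignores protocol overhead. The function names are illustrative and are not part of tensorizer.

```python
def wire_speed_gb_per_s(link_gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to GB/s (8 bits per byte)."""
    return link_gbit_per_s / 8.0


def best_case_load_seconds(model_gb: float, link_gbit_per_s: float) -> float:
    """Lower bound on transfer time; real loads add protocol overhead."""
    return model_gb / wire_speed_gb_per_s(link_gbit_per_s)


if __name__ == "__main__":
    # A 40GbE link carries at most 5 GB/s.
    print(wire_speed_gb_per_s(40))         # 5.0
    # A 20GB model therefore needs at least 4 seconds on that link.
    print(best_case_load_seconds(20, 40))  # 4.0
```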

Rationale

CoreWeave and our customers use KNative to deploy models as serverless functions. How long a model takes to load is a major factor in the latency of KNative scale-up. tensorizer is a tool to serialize models and their associated tensors into a single file that can be loaded quickly and efficiently off an HTTP/HTTPS or S3 endpoint.

By not embedding the model in the container image, we can reduce the container image size and the time it takes to load the model. This is especially important for large models such as EleutherAI/gpt-neox-20B, which weighs in at ~40GB.

This decoupling of the model from the container image also allows us to update the model without having to rebuild the container image. This allows us to quickly iterate on the model and deploy new versions without having to wait for the container image to build or for the container image cache to be populated.

tensorizer has S3 support, so we can store the serialized model in S3 object storage and stream it directly from S3 into the container, without first downloading it to the container's local filesystem. The same applies to HTTP/HTTPS endpoints, since S3 is itself accessed over HTTP/HTTPS.

tensorizer also has support for loading models from a local filesystem, so you can use it to serialize models locally and load them locally. This is extremely fast, as the same principles that make it fast for HTTP/HTTPS and S3 endpoints also apply to local filesystems.
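A minimal sketch of such a local round trip is shown below. It uses the TensorSerializer and TensorDeserializer classes with their write_module and load_into_module methods. The tiny torch.nn.Linear module, the temporary file name, and the function name round_trip_locally are assumptions made for illustration, and passing device="cpu" is a choice made here so the sketch does not need a GPU.

```python
import os
import tempfile


def round_trip_locally() -> bool:
    """Serialize a tiny module to a temp file, load it back, compare weights."""
    # Imported lazily so this file can be imported without torch/tensorizer.
    import torch
    from tensorizer import TensorDeserializer, TensorSerializer

    source = torch.nn.Linear(4, 2)
    target = torch.nn.Linear(4, 2)  # same architecture, different random weights

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "linear.tensors")

        # Write the source module's tensors to a single local file.
        serializer = TensorSerializer(path)
        try:
            serializer.write_module(source)
        finally:
            serializer.close()

        # Read the tensors back into a module with the same structure.
        deserializer = TensorDeserializer(path, device="cpu")
        try:
            deserializer.load_into_module(target)
        finally:
            deserializer.close()

    # The round trip succeeded if every tensor matches exactly.
    return all(
        torch.equal(a, b)
        for a, b in zip(source.state_dict().values(), target.state_dict().values())
    )


if __name__ == "__main__":
    print("round trip ok:", round_trip_locally())
```

The try/finally blocks ensure that both the serializer and the deserializer are closed even if writing or loading fails.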

tensorizer has preliminary support for Redis, but it is not recommended for model deployment due to the lack of distributed caching. It is intended for sharing state between inference pods, or for loading data on a per-request basis from a Redis cache.

Speed

tensorizer's deserialization speed is primarily network-bound.

The following graph presents data collected with the scripts and Kubernetes manifests in [examples/benchmark_buffer_size](examples/benchmark_buffer_size). It compares the deserialization modes available in tensorizer release 2.5.0 against the raw network speed and the speed of torch.load().

[Figure: A letter-value plot comparing 7 deserialization modes and their respective deserialization speeds with a granularity of 0.125 GiB/sec. For local files, "torch.load()" has a median speed between 1.875 and 2.000 GiB/sec; "tensorizer file" has a median of 2.250; "tensorizer file, plaid_mode" has a median of about 4.625; "tensorizer file, lazy_load" has a median between 1.750 and 1.875. The raw network speed is also listed on the chart with a median between 1.250 and 1.375. For HTTP streaming, "tensorizer http" has a median between 0.875 and 1.000; "tensorizer http, plaid_mode" has a median between 1.000 and 1.125; and "tensorizer http, lazy_load" has a median between 0.875 and 1.000.]
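To put the medians described above in relative terms, the sketch below divides each local-file median by the torch.load() median, and each HTTP median by the raw network median. Where the description gives a range, the midpoint is used. That is an assumption made here for illustration, not a measured value, so the resulting ratios are rough.

```python
# Median speeds in GiB/sec taken from the plot description above.
# Ranges are replaced by their midpoints (an assumption, not a measurement).
LOCAL_MEDIANS = {
    "torch.load()": (1.875 + 2.000) / 2,
    "tensorizer file": 2.250,
    "tensorizer file, plaid_mode": 4.625,
    "tensorizer file, lazy_load": (1.750 + 1.875) / 2,
}
RAW_NETWORK_MEDIAN = (1.250 + 1.375) / 2
HTTP_MEDIANS = {
    "tensorizer http": (0.875 + 1.000) / 2,
    "tensorizer http, plaid_mode": (1.000 + 1.125) / 2,
    "tensorizer http, lazy_load": (0.875 + 1.000) / 2,
}


def relative_to(baseline: float, speeds: dict) -> dict:
    """Express each speed as a multiple of the baseline speed."""
    return {name: speed / baseline for name, speed in speeds.items()}


if __name__ == "__main__":
    for name, ratio in relative_to(LOCAL_MEDIANS["torch.load()"], LOCAL_MEDIANS).items():
        print(f"{name}: {ratio:.2f}x torch.load()")
    for name, ratio in relative_to(RAW_NETWORK_MEDIAN, HTTP_MEDIANS).items():
        print(f"{name}: {ratio:.0%} of raw network speed")
```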

Installation

From PyPI

tensorizer can be installed from PyPI with pip:

python -m pip install tensorizer

From Source

You can also install tensorizer from source using pip.

To clone the repository and install tensorizer in editable mode, run:

git clone https://github.com/coreweave/tensorizer
cd tensorizer
python -m pip install -e .

Or, run the following for pip to install tensorizer directly from GitHub:

python -m pip install git+https://github.com/coreweave/tensorizer
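After either installation route, one way to confirm that the package is visible to the current interpreter is to query its installed distribution metadata. The helper below is a small sketch that uses only the standard library's importlib.metadata; the function name is illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "tensorizer") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"tensorizer {version}" if version else "tensorizer is not installed")
```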

Basic Usage

Serialization is done with the TensorSerializer class. It takes a path_uri argument that can be a local filesystem path, an HTTP/HTTPS endpoint, or an S3 endpoint.
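The three kinds of path_uri can be told apart by their URI scheme. The helper below is a hypothetical illustration of that distinction using the standard library. It is not how tensorizer itself dispatches on the argument, and the example URIs are placeholders.

```python
from urllib.parse import urlparse


def endpoint_kind(path_uri: str) -> str:
    """Classify a path_uri as 'local', 'http', or 's3' by its scheme."""
    scheme = urlparse(path_uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme == "s3":
        return "s3"
    if scheme in ("", "file"):
        # No scheme means a plain filesystem path.
        return "local"
    raise ValueError(f"unrecognized scheme {scheme!r} in {path_uri!r}")


if __name__ == "__main__":
    for uri in (
        "/tmp/gpt-j-6B.tensors",
        "https://example.com/gpt-j-6B.tensors",
        "s3://bucket/gpt-j-6B.tensors",
    ):
        print(uri, "->", endpoint_kind(uri))
```

This sketch treats any other scheme as an error. It also does not handle Windows drive-letter paths, which urlparse reads as a one-letter scheme.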

write_module is the main method of the TensorSerializer class. It takes a torch.nn.Module and serializes the tensors to the path_uri endpoint.

The below example serializes the EleutherAI/gpt-j-6B model to an S3 endpoint. It assumes that you have already configured your S3 credentials in ~/.s3cfg.

NOTE: Loading and serializing gpt-j-6B takes a lot of CPU RAM, up to ~20GB. Additionally, loading gpt-j-6B into a GPU requires about 16GB of VRAM. If you don't have that much RAM or VRAM, you can use the smaller gpt-neo-125M model instead.

NOTE2: The below examples require the transformers and accelerate libraries. You can install them with pip:

python -m pip install transformers accelerate

[serialize.py](examples/serialize.py)

import torch
from tensorizer import TensorSerializer
from transformers import AutoModelForCausalLM

model_ref = "EleutherAI/gpt-j-6B"
# For less intensive requirements, swap above with the line below:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"

model = AutoModelForCausalLM.from_pretrained(
    model_ref,
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

serializer = TensorSerializer(s3_uri)
serializer.write_module(model)
serializer.close()

Conversely, deserialization is done with the TensorDeserializer class. It takes a path_uri argument that can be a local filesystem path, an HTTP/HTTPS endpoint, or an S3 endpoint.

load_into_module is the main method of the TensorDeserializer class. It takes a torch.nn.Module and loads the tensors from the path_uri endpoint into the torch.nn.Module.
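Because load time is the quantity of interest here, it can help to wrap the load call in a small timing helper. The sketch below uses only the standard library. The name timed_load and its contract (the callable performs the load and returns the number of bytes loaded) are assumptions for illustration, not part of tensorizer.

```python
import time
from typing import Callable, Tuple

GIB = 1 << 30  # bytes per GiB


def timed_load(load: Callable[[], int]) -> Tuple[float, float]:
    """Run load(), which returns the bytes it loaded.

    Returns (elapsed seconds, throughput in GiB/sec).
    """
    start = time.perf_counter()
    total_bytes = load()
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very fast calls.
    rate = (total_bytes / GIB) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


if __name__ == "__main__":
    # Stand-in workload: pretend that 1 GiB was loaded.
    seconds, gib_per_s = timed_load(lambda: GIB)
    print(f"loaded in {seconds:.6f}s at {gib_per_s:.2f} GiB/s")
```

In practice, the callable would invoke load_into_module on a real model and return the size of the tensors it read.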

The below example loads the EleutherAI/gpt-j-6B model from an S3 endpoint.

[deserialize-simple.py](examples/deserialize-simple.py)

import time
import torch
from tensorizer import TensorDeserializer
from tensorizer.utils import…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine fork of an unremarkable repo.