Model · StepFun · published Aug 14, 2025

stepfun-ai/NextStep-1-f8ch16-Tokenizer

license apache-2.0 · downloads 53 · likes 15

Improved Image Tokenizer

This is an improved image tokenizer for NextStep-1: the encoder is kept frozen and only the decoder is fine-tuned. The decoder refinement improves performance while preserving robust reconstruction quality. We recommend this tokenizer for best results with NextStep-1 models.

Usage

import torch
from PIL import Image
import numpy as np
import torchvision.transforms as transforms

from modeling_flux_vae import AutoencoderKL

device = "cuda"
dtype = torch.bfloat16

model_path = "/path/to/vae_dir"
vae = AutoencoderKL.from_pretrained(model_path).to(device=device, dtype=dtype)

pil2tensor = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)

# convert to RGB so grayscale or RGBA files match the 3-channel Normalize
image = Image.open("/path/to/image.jpg").convert("RGB")
pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)

# encode
latents = vae.encode(pixel_values).latent_dist.sample()

# decode
sampled_images = vae.decode(latents).sample
sampled_images = sampled_images.detach().cpu().to(torch.float32)

def tensor_to_pil(tensor):
    image = tensor.detach().cpu().to(torch.float32)
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.mul(255).round().to(dtype=torch.uint8)
    image = image.permute(1, 2, 0).numpy()
    return Image.fromarray(image, mode="RGB")

rec_image = tensor_to_pil(sampled_images[0])
rec_image.save("/path/to/output.jpg")
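
As a quick sanity check on the `latents` tensor, the sketch below computes the latent shape one would expect for a given input size. It assumes that `f8ch16` in the model name means a spatial downsampling factor of 8 and 16 latent channels, which is consistent with the 32×32×16 latent reported for 256×256 inputs in the evaluation table, and that latents are laid out channel-first as is usual in PyTorch. The helper name `expected_latent_shape` is hypothetical and not part of the released code.

```python
def expected_latent_shape(height, width, factor=8, channels=16):
    """Return the assumed (channels, height // factor, width // factor) latent shape.

    `factor` and `channels` are read off the model name "f8ch16";
    this is an interpretation, not a documented guarantee.
    """
    if height % factor or width % factor:
        # Rejected here for simplicity; the real model may handle such sizes differently.
        raise ValueError("height and width should be multiples of the downsampling factor")
    return (channels, height // factor, width // factor)


print(expected_latent_shape(256, 256))  # (16, 32, 32)
```

With the usage script, comparing this against `tuple(latents.shape[1:])` is one way to confirm that the checkpoint loaded as intended.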

Evaluation

Reconstruction Performance on ImageNet-1K 256×256

| Tokenizer                 | Latent Shape | PSNR ↑ | SSIM ↑ |
| ------------------------- | ------------ | ------ | ------ |
| Discrete Tokenizers       |              |        |        |
| SBER-MoVQGAN (270M)       | 32×32        | 27.04  | 0.74   |
| LlamaGen                  | 32×32        | 24.44  | 0.77   |
| VAR                       | 680          | 22.12  | 0.62   |
| TiTok-S-128               | 128          | 17.52  | 0.44   |
| Selftok                   | 1024         | 26.30  | 0.81   |
| Continuous Tokenizers     |              |        |        |
| Stable Diffusion 1.5      | 32×32×4      | 25.18  | 0.73   |
| Stable Diffusion XL       | 32×32×4      | 26.22  | 0.77   |
| Stable Diffusion 3 Medium | 32×32×16     | 30.00  | 0.88   |
| Flux.1-dev                | 32×32×16     | 31.64  | 0.91   |
| NextStep-1                | 32×32×16     | 30.60  | 0.89   |
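
For readers who want to compare their own reconstructions against the PSNR column, here is a minimal sketch of the standard PSNR definition for 8-bit images. The source does not describe the evaluation protocol (resizing, cropping, color space), so this is the textbook formula and not necessarily the script used to produce the table; the function name `psnr` is illustrative.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    if ref.shape != rec.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value**2 / mse))


# Toy example: every pixel is off by exactly 1, so the MSE is 1.
a = np.zeros((4, 4, 3), dtype=np.uint8)
b = np.ones((4, 4, 3), dtype=np.uint8)
print(round(psnr(a, b), 2))  # 48.13
```

With the usage script, `psnr(np.asarray(image), np.asarray(rec_image))` would give a per-image score, provided both images are RGB and have the same size.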

Robustness of NextStep-1-f8ch16-Tokenizer

Impact of Noise Perturbation on Image Tokenizer Performance. The top panel displays quantitative metrics (rFID↓, PSNR↑, and SSIM↑) versus noise intensity. The bottom panel presents qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.
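
The caption does not say where the noise is injected. The sketch below assumes additive Gaussian noise applied to the latents before decoding, with the two standard deviations named in the caption. The helper `perturb_latents` is hypothetical and uses NumPy only so that it can run without a GPU or the model weights.

```python
import numpy as np


def perturb_latents(latents, sigma, seed=0):
    """Return latents plus zero-mean Gaussian noise of standard deviation `sigma`.

    Assumption: the robustness test perturbs latents; the source does not confirm this.
    """
    rng = np.random.default_rng(seed)
    latents = np.asarray(latents, dtype=np.float32)
    noise = rng.standard_normal(latents.shape).astype(np.float32)
    return latents + sigma * noise


# Stand-in for a batch of latents with 16 channels on a 32x32 grid.
clean = np.zeros((1, 16, 32, 32), dtype=np.float32)
for sigma in (0.2, 0.5):
    noisy = perturb_latents(clean, sigma)
    print(sigma, noisy.shape)
```

In a real experiment the perturbed array would be converted back to a tensor and passed to `vae.decode`, and the reconstructions scored with rFID, PSNR, and SSIM at each noise level.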
