Repo: Tencent Hunyuan (published Apr 13, 2026)

Tencent-Hunyuan/UniCom


Language: Python

License: NOASSERTION

Stars: 33

Forks: 4

Open issues: 0

Created: 2026-04-13T09:24:30Z

Pushed: 2026-04-13T10:38:53Z

Default branch: main

Fork: no

Archived: no

README:

UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

Official code for the paper UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations.

UniCom is a unified large-scale multimodal model that performs generation directly over compressed visual embeddings. This repository includes the inference pipeline for text-to-image generation, image editing, and image reconstruction.

![UniCom Framework](./UniCom/assets/framework.png)

*Figure: We compare different unified modeling choices in terms of convergence speed and consistency on editing tasks, and ultimately build UniCom with the Path I transfusion-style formulation rather than the Path II query-guided design.*

🔥 Key Contributions

  • Model: We propose UniCom, a unified large-scale multimodal model that performs generation directly over compressed visual embeddings and serves as a unified interface for both understanding and generation.
  • Paradigm: We establish an effective paradigm for unifying visual understanding and generation by predicting continuous compressed visual embeddings, and show that compressing visual features along the channel dimension is a particularly effective way to preserve both semantics and fine-grained details.
  • Results: UniCom achieves state-of-the-art or competitive performance across image reconstruction, text-to-image generation, and challenging image editing tasks, with especially strong performance on editing benchmarks.

Setup

1. Download Checkpoints

Download all checkpoints at once via huggingface-cli:

huggingface-cli download tencent/Unicom-Unified-Multimodal-Modeling-via-Compressed-Continuous-Semantic-Representations --repo-type model --local-dir ./model_zoo/ --resume-download

You can also download each component separately:

| Component | Local Path | Link |
| --- | --- | --- |
| UniCom (text → SigLIP) | model_zoo/unicom_hf_model/ | Download |
| Decoder Transformer (SigLIP → image) | model_zoo/unicom_decoder_transformer.pt | Download |
| Flux VAE | model_zoo/flux-vae/ | Download |
| SigLIP2 | model_zoo/siglip2-so400m-patch16-naflex/ | Download |

After downloading, verify the expected directory layout:

model_zoo/
├── unicom_hf_model/
├── unicom_decoder_transformer.pt
├── flux-vae/
└── siglip2-so400m-patch16-naflex/
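The layout check can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it only tests that each entry exists with the right type (directory or file), not that the checkpoints are complete or uncorrupted.

```python
from pathlib import Path

# Entries taken from the layout above; True marks a directory, False a file.
EXPECTED = {
    "unicom_hf_model": True,
    "unicom_decoder_transformer.pt": False,
    "flux-vae": True,
    "siglip2-so400m-patch16-naflex": True,
}


def missing_entries(model_zoo="./model_zoo"):
    """Return the expected entries that are absent or of the wrong type."""
    root = Path(model_zoo)
    missing = []
    for name, is_dir in EXPECTED.items():
        path = root / name
        ok = path.is_dir() if is_dir else path.is_file()
        if not ok:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_entries()
    print("layout OK" if not problems else f"missing: {problems}")
```

Run it from the repository root, or pass a different `model_zoo` path if the checkpoints were downloaded elsewhere.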

2. Environment Setup

conda create -n unicom python=3.12 -y
conda activate unicom

Install PyTorch first according to your CUDA version. Example for CUDA 12.8:

pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
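To confirm the install matches the pinned versions, a small check along these lines may help. It is a sketch under assumptions: the helper names are hypothetical, the pins are copied from the CUDA 12.8 example above, and the local build suffix (such as `+cu128`) is ignored when comparing.

```python
from importlib import metadata

# Pins copied from the CUDA 12.8 example above; adjust for other CUDA builds.
EXPECTED_VERSIONS = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "torchaudio": "2.8.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED_VERSIONS):
    """Map each package to a tuple (installed, expected, matches)."""
    report = {}
    for name, want in expected.items():
        have = installed_version(name)
        # Wheels from the PyTorch index carry a local suffix such as "+cu128".
        matches = have is not None and have.split("+")[0] == want
        report[name] = (have, want, matches)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected={want} {status}")
```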

🚀 Usage

Case 1: Text-to-image generation

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--prompt "A ginger kitten tangled in a ball of wool, looking puzzled." \
--output-dir ./output/t2i_demo \
--diff-infer-steps 50 \
--seed 42 \
--image-size auto \
--n-samples-per-prompt 4

Case 2: Single-image editing

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--prompt "Add a blue baseball cap on the boy's head" \
--image ./UniCom/assets/demo_imgs/input_0.jpg \
--image-size auto \
--seed 42 \
--output-dir ./output/ti2i_demo \
--diff-infer-steps 50

Case 3: Multi-image editing

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--prompt "Place the chair from the second image onto the snow in the third image, and then place the coffee cup from the first image onto the chair." \
--image ./UniCom/assets/demo_imgs/input_1_0.png ./UniCom/assets/demo_imgs/input_1_1.png ./UniCom/assets/demo_imgs/input_1_2.png \
--image-size auto \
--seed 42 \
--output-dir ./output/ti2i_multi_demo \
--diff-infer-steps 50
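Cases 1 to 3 differ only in whether `--image` is passed and how many paths follow it. For scripted runs, the command can be assembled in Python. This is a sketch: `build_command` is a hypothetical helper, and it uses only the flags that appear in the commands above, without assuming anything else about the script's interface.

```python
import shlex


def build_command(prompt, output_dir, images=(), steps=50, seed=42,
                  model_path="./model_zoo/unicom_hf_model", n_samples=None):
    """Assemble the argument list using only flags shown in the cases above."""
    cmd = ["python", "run_unicom_decoder_pipeline.py",
           "--model-path", model_path,
           "--prompt", prompt]
    if images:
        # One path edits a single image; several paths give multi-image editing.
        cmd += ["--image", *images]
    cmd += ["--image-size", "auto",
            "--seed", str(seed),
            "--output-dir", output_dir,
            "--diff-infer-steps", str(steps)]
    if n_samples is not None:
        cmd += ["--n-samples-per-prompt", str(n_samples)]
    return cmd


if __name__ == "__main__":
    cmd = build_command("Add a blue baseball cap on the boy's head",
                        "./output/ti2i_demo",
                        images=["./UniCom/assets/demo_imgs/input_0.jpg"])
    # Print a shell-quoted version to inspect before launching.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a shell string avoids quoting problems with prompts that contain apostrophes or commas.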

Case 4: CSV-based batch inference

python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--csv-path ./UniCom/eval/t2i.csv \
--output-dir ./output/t2i_demo_csv \
--num-gpus 8 \
--decoder-device 0,1,2,3,4,5,6,7 \
--image-size auto \
--diff-infer-steps 50 \
--n-samples-per-prompt 4

# Variant without CoT: vanilla task and system prompt
python run_unicom_decoder_pipeline.py \
--model-path ./model_zoo/unicom_hf_model \
--csv-path ./UniCom/eval/t2i.csv \
--output-dir ./output/t2i_demo_csv_nocot \
--num-gpus 8 \
--decoder-device 0,1,2,3,4,5,6,7 \
--image-size auto \
--diff-infer-steps 50 \
--bot-task vanilla \
--use-system-prompt en_vanilla \
--n-samples-per-prompt 4

Output structure

The pipeline first exports latent representations, then decodes them into images:

output_dir/
|-- latents/
|   |-- results.csv
|   `-- *.pt
`-- images/
    `-- *.png
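A quick way to check that a run produced both stages is to count the files in each folder. The sketch below uses a hypothetical helper, `summarize_output`. It assumes `results.csv` begins with a header row. The column names are not documented here, so the code only reports whatever it finds.

```python
import csv
from pathlib import Path


def summarize_output(output_dir):
    """Count latents and images, and read the header of latents/results.csv."""
    root = Path(output_dir)
    latents = sorted((root / "latents").glob("*.pt"))
    images = sorted((root / "images").glob("*.png"))
    columns = []
    results = root / "latents" / "results.csv"
    if results.is_file():
        with results.open(newline="", encoding="utf-8") as fh:
            # Assumption: the first row is a header; it is reported as-is.
            columns = next(csv.reader(fh), [])
    return {"latents": len(latents), "images": len(images), "columns": columns}


if __name__ == "__main__":
    print(summarize_output("./output/t2i_demo"))
```

If the latent count is non-zero but the image count is zero, the export stage finished and the decoding stage did not.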

🧩 Reconstruction

UniCom_Decoder also supports reconstruction directly from input images.

Reconstruction demo

bash UniCom_Decoder/scripts/run.sh \
--config-file UniCom_Decoder/configs/reconstruction_demo.yaml

The demo images are stored in UniCom_Decoder/assets/demo_recon_imgs/.

Each saved output is a side-by-side comparison:

  • left: input image
  • right: reconstructed image

Recommended reconstruction settings

The default demo config already uses the recommended settings:

  • mode: eval_gt
  • aba_mode: compression_64_siglip
  • condition_mode: siglip2
  • cfg_scale: 1.0
  • infer_steps: 50
  • flow_shift: 3.0
  • siglip2_max_num_patches: 1024
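When adapting the demo config, it can be useful to see which values have drifted from the recommended ones. The sketch below uses a hypothetical helper, `diff_settings`, and assumes the settings are top-level keys of a plain dictionary. The actual YAML may nest them differently, in which case the lookup would need adjusting.

```python
# Recommended values copied from the list above.
RECOMMENDED = {
    "mode": "eval_gt",
    "aba_mode": "compression_64_siglip",
    "condition_mode": "siglip2",
    "cfg_scale": 1.0,
    "infer_steps": 50,
    "flow_shift": 3.0,
    "siglip2_max_num_patches": 1024,
}


def diff_settings(config):
    """Return {key: (found, recommended)} for every key that differs."""
    return {
        key: (config.get(key), value)
        for key, value in RECOMMENDED.items()
        if config.get(key) != value
    }


if __name__ == "__main__":
    # Example: a config that lowers the step count for a faster preview.
    custom = {**RECOMMENDED, "infer_steps": 25}
    print(diff_settings(custom))  # {'infer_steps': (25, 50)}
```

With PyYAML installed, the dictionary can come from `yaml.safe_load(open(path))`. An empty result means every recommended value is in place.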

🙏 Acknowledgement

This project builds upon several excellent open-source projects and research efforts.

-…
