Repo: Reka AI, published Jul 2, 2025

reka-ai/rekaquant

Language: Python

Stars: 63

Forks: 6

Open issues: 0

Created: 2025-07-02T10:11:47Z

Pushed: 2025-07-10T16:29:16Z

Default branch: main

Fork: no

Archived: no

README:

Reka Quant

Reka Quant is a model quantization library. It supports:

  • NF4 and GGML (llama.cpp) quantization primitives. GGML primitives are added directly from its source code through Python cffi bindings, making it easy to incorporate new ones.
  • Exporting of GGML quantized models to native GGUF format, for easy integration with the existing ecosystem.
  • Activation-aware quantization by leveraging precomputed activation statistics from a text sample, through the LDLQ method from QuIP.
  • Further error reduction through self-distillation from the BF16 model, while quantizing the network gradually.
  • Fast multi-node training through full or hybrid FSDP, as well as fast parallel proxy Hessian computation for LDLQ.
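
To make the first bullet concrete, here is a minimal pure-Python sketch of block-wise NF4 quantization. It is not Reka Quant's implementation: the 16-level codebook is the one published with QLoRA, while the block size of 64 and the function names nf4_quantize and nf4_dequantize are assumptions chosen for illustration.

```python
# NF4 codebook as published with QLoRA: 16 levels spanning [-1, 1].
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def nf4_quantize(values, block_size=64):
    """Return (codes, scales): one 4-bit code per value, one absmax scale per block.

    block_size=64 is an assumed default, not taken from the library.
    """
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Scale each block by its largest magnitude; fall back to 1.0 for all-zero blocks.
        scale = max(abs(v) for v in block) or 1.0
        scales.append(scale)
        for v in block:
            x = v / scale
            # Pick the index of the nearest codebook level.
            codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x)))
    return codes, scales


def nf4_dequantize(codes, scales, block_size=64):
    """Map each code back to its level and undo the per-block scaling."""
    return [NF4_LEVELS[c] * scales[i // block_size] for i, c in enumerate(codes)]
```

The block maximum and exact zeros round-trip losslessly because the codebook contains -1, 0 and 1; every other value snaps to the nearest level.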

Installation

Clone the library with submodules:

git clone --recurse-submodules git@github.com:reka-ai/quantization.git

Install requirements:

poetry install

Build the shared library in csrc, which the Python bindings need:

cd csrc
gcc -shared -o quantize.so -fPIC quantize.c
cd ..

Exporting to GGUF formats requires a patch to the llama.cpp library. Apply it, then build the library:

cd third_party/llama.cpp
git apply ../../patches/RekaQuant.patch
cmake -B build
cmake --build build --config Release
cd ../..

Usage

The main script is train.py. The training data should be in jsonl format with documents in the "text" field.

torchrun \
...distributed flags.. \
python3 src/train.py \
--model_path $model_path \
--ref_model $ref_model \
--out_path $out_path \
--train_data $train_data \
--hessian_corr 1e-1 \
--hessian_train_seq 4096 \
--total_train_steps 1800 \
--lr 1e-5 \
--global_batch_size 512 \
--seq_len 8192 \
--micro_batch_size 1 \
--checkpoint_iters 100 \
--valid_seq 64 \
--quant_strategy typewise_Q3_K_S \
--use_checkpointing
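
As a rough sanity check on the flags above, the following sketch works out the token budget and gradient-accumulation steps they imply. It assumes that --global_batch_size counts sequences per optimizer step and that every sequence is filled to --seq_len. The world size of 64 GPUs is a hypothetical value, not something the README specifies.

```python
def training_budget(global_batch_size, seq_len, total_train_steps,
                    micro_batch_size, world_size):
    """Derive accumulation steps and token counts from the training flags.

    Assumes the global batch is measured in sequences (an assumption,
    not confirmed by the README).
    """
    sequences_per_pass = micro_batch_size * world_size
    if global_batch_size % sequences_per_pass != 0:
        raise ValueError("global batch must be a multiple of micro_batch_size * world_size")
    tokens_per_step = global_batch_size * seq_len
    return {
        "grad_accum_steps": global_batch_size // sequences_per_pass,
        "tokens_per_step": tokens_per_step,
        "total_tokens": tokens_per_step * total_train_steps,
    }


# Flag values from the command above; world_size=64 is a made-up example.
budget = training_budget(global_batch_size=512, seq_len=8192,
                         total_train_steps=1800, micro_batch_size=1,
                         world_size=64)
print(budget)
# {'grad_accum_steps': 8, 'tokens_per_step': 4194304, 'total_tokens': 7549747200}
```

Under these assumptions the run sees roughly 7.5 billion tokens, which is a helpful lower bound when sizing the training jsonl file.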

An example Slurm script can be found in [run_train.slurm](run_train.slurm):

export REF_MODEL_PATH=/path/to/model
export OUT_PATH=/path/to/output
export TRAIN_DATA=/path/to/train.jsonl

sbatch run_train.slurm
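
TRAIN_DATA points at a jsonl file with one document per line under the "text" key, as described earlier. A minimal sketch of writing and reading such a file follows. The helper names and any sample documents are placeholders, and the actual loader in train.py may apply further preprocessing.

```python
import json


def write_jsonl(path, documents):
    """Write one JSON object per line, each with the document under "text"."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the "text" field back from every non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]
```

Reading the file back with read_jsonl before launching a job is a cheap way to catch malformed lines or a missing "text" field.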

When training smaller models, you can enable the --use_hybrid flag to use hybrid FSDP (shard intra-node, replicate across nodes), which reduces communication and improves efficiency. You can also remove the --use_checkpointing flag to disable activation checkpointing.
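
To illustrate what "shard intra-node, replicate across nodes" means for rank layout, the sketch below lists the process groups a hybrid setup would use. It is plain Python rather than the library's code, and it assumes ranks are numbered contiguously within each node. The gpus_per_node parameter and the function name are hypothetical.

```python
def hybrid_groups(world_size, gpus_per_node):
    """Return (shard_groups, replicate_groups) for a hybrid-sharded layout.

    Assumes rank r lives on node r // gpus_per_node.
    """
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node
    # Ranks on the same node each hold a different shard of the parameters,
    # so parameter all-gathers stay on the fast intra-node interconnect.
    shard_groups = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]
    # Ranks with the same local index hold the same shard on different nodes;
    # only gradient all-reduces cross node boundaries.
    replicate_groups = [
        list(range(local, world_size, gpus_per_node))
        for local in range(gpus_per_node)
    ]
    return shard_groups, replicate_groups


shard, replicate = hybrid_groups(world_size=8, gpus_per_node=4)
print(shard)      # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(replicate)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With full FSDP, by contrast, all ranks form a single shard group, so parameter all-gathers also cross node boundaries. That costs more communication but lowers per-GPU memory, which is why the hybrid mode suits smaller models.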

Once the model is trained, if you used GGUF quants you will need to export it to a native GGUF file. See [scripts/prepare_ckpt.sh](scripts/prepare_ckpt.sh) for an example of how to do this:

cd scripts
bash prepare_ckpt.sh $OUT_PATH/iter_001800/ #GGUF ckpt saved under $OUT_PATH/iter_001800/hf_model/Q3_K_S_RekaQuant_hf

NOTE: GGML K-Quants require tensors to have a number of columns divisible by 256. You can use the helper script in [scripts/pad_intermediate.py](scripts/pad_intermediate.py) if needed to preprocess models.
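
The divisibility requirement can be checked, and a zero-padded width computed, with a few lines of Python. This is only a sketch of the idea: it does not reproduce scripts/pad_intermediate.py, and the helper names and the example width of 11000 are made up for illustration.

```python
K_QUANT_MULTIPLE = 256  # column multiple stated in the note above


def padded_width(num_cols, multiple=K_QUANT_MULTIPLE):
    """Round num_cols up to the next multiple (ceiling division)."""
    return -(-num_cols // multiple) * multiple


def pad_rows(matrix, multiple=K_QUANT_MULTIPLE):
    """Right-pad every row of a list-of-lists matrix with zeros."""
    if not matrix:
        return []
    target = padded_width(len(matrix[0]), multiple)
    return [list(row) + [0.0] * (target - len(row)) for row in matrix]


print(padded_width(4096))   # 4096, already a multiple of 256
print(padded_width(11000))  # 11008, the next multiple of 256
```

Padding the columns of one matrix changes the input size it expects, so the layer that produces that input has to be padded to match. Zero padding on both sides leaves the layer's output unchanged as long as the activation maps zero to zero.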
