RepoDatabricks (DBRX)Databricks (DBRX)published Dec 20, 2024seen 5d

databricks/compose-rl

Python

Open original ↗

Captured source

source ↗
published Dec 20, 2024seen 5dcaptured 16hhttp 200method plain

databricks/compose-rl

Language: Python

License: Apache-2.0

Stars: 58

Forks: 18

Open issues: 3

Created: 2024-12-20T20:54:05Z

Pushed: 2026-03-25T04:50:36Z

Default branch: main

Fork: no

Archived: no

README:

Compose RL

Compose RL is a framework for Reinforcement Learning with Human Feedback (RLHF), designed to streamline the integration of various RLHF techniques. By leveraging the flexibility of MosaicML's Composer and LLM Foundry, Compose RL enables modular and composable experimentation, catering to researchers and practitioners exploring RLHF methodologies.

Note: This repository is currently in alpha. Expect frequent changes and potential instabilities. Use at your own risk!

Key Features

  • Composable RLHF Training:
  • Experiment with Policy Proximal Optimization (PPO), Direct Preference Optimization (DPO) and its variants, and reward model training.
  • Seamlessly integrate custom components or replace default implementations.
  • Powered by Databricks MosaicML:

This repo utilizes Composer, an open-source deep learning training library optimized for scalability and usability. Composer simplifies the implementation of distributed training workflows on large-scale clusters, abstracting complexities like parallelism techniques, distributed data loading, and memory optimization. Additionally, this repo leverages LLM Foundry, an open-source repository containing code for training large language models (LLMs) to help enable rapid experimentation with the latest techniques.

  • Modularity:
  • Decoupled components for policy optimization, preference modeling, and reward model training.
  • Flexible configurations to adapt to a variety of RLHF workflows.

You'll find in this repo:

  • compose-rl - source code for models, datasets, utilities
  • scripts - scripts for generating data
  • yamls - yamls for training runs

Installation

Clone the repository and install the dependencies:

git clone https://github.com/databricks/compose-rl.git
cd compose-rl
pip install -e .[gpu]
python3 -m spacy download en_core_web_sm

We *highly recommend* you use the llm-foundry images and install Compose RL and LLM Foundry and its dependencies on top of the images.

Note: when using the LLM Foundry images, please install LLM Foundry as it's not natively included with the images.

Quickstart

Here is an end-to-end workflow of performing data preparation for training along with how to launch model training.

Data preparation

Below is the set of commands to run to prepare datasets into the appropriate Mosaic Data Shard (MDS) format, which is either a pre-tokenized version of the data or the raw chat-message data, that we will use for training.

Below is the command to prepare preference data -- which can be used for reward model or offline RL (e.g. DPO) training:

cd scripts
python data/unified_tokenize_dataset.py --dataset_name allenai/ultrafeedback_binarized_cleaned \
--local_dir pref_data \
--dataset_type preference \
--tokenizer_name meta-llama/Llama-3.1-8B-Instruct \
--split train_prefs

Below is the command to prepare single-turn message data -- which can be used for online RL (e.g. PPO) training:

cd scripts
python data/messages_dataset_to_mds.py --dataset_name allenai/ultrafeedback_binarized_cleaned \
--local_dir ultrafeedback_summarization_data \
--split train_prefs

To further enable online RL with verifiable rewards you can use the following command to prepare the chat-message data and their corresponding verifiable answers:

cd scripts
python data/messages_dataset_to_mds.py --dataset_path \
--local_dir verifiable_data \
--split train \

For RLVR, We currently support the following two HuggingFace datasets for verifiable rewards:

  • GMS8k: openai/gsm8k
  • MATH: DigitalLearningGmbH/MATH-lighteval

The data preparation scripts also supports additional arguments for specifying the subset of the HuggingFace dataset --subset and max sequence length --max_length

For custom datasets, you can create a custom preprocessing function in the compose-rl/scripts/data/messages_preprocessing_utils.py file, or you can preprocess your own dataset directly, save it locally as a .jsonl file, and then use the following command to convert it to the MDS format:

cd scripts
python data/messages_dataset_to_mds.py --dataset_path \
--local_dir custom_dataset \

Model training

Below are the scripts to launch training runs assuming you ran the data preparation scripts above. Additionally, these scripts assume that we are in the root directory where Compose RL and LLM Foundry were cloned. This is because we utilize LLM Foundry's Registry System in order to take advantage of existing features in LLM Foundry.

Reward Model Training

Below is the command to run reward model training:

composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_reward.yaml \
train_loader.dataset.local=/compose-rl/scripts/pref_data/ \
train_loader.dataset.split=train_prefs

DPO Training

Below is the command to run for DPO training (along with its variants):

composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_dpo.yaml \
train_loader.dataset.local=/compose-rl/scripts/pref_data/ \
train_loader.dataset.split=train_prefs

For DPO we support other variants of DPO including: Reward Aware Preference Optimization (RPO), REgression to RElative REward Based RL (REBEL), Identity-PO (IPO), and Kahneman-Tversky Optimization (KTO).

PPO Training

Below is the command to run Online PPO training:

composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_ppo.yaml \
train_loader.dataset.local=/compose-rl/scripts/ultrafeedback_summarization_data/ \
train_loader.dataset.split=train_prefs

Helpful code pointers

Adding new data processing In the...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low stars, routine new repo