OpenBMB (MiniCPM), published Dec 2, 2022

BMTrain, an efficient toolkit for big model training, can save up to 90% of training costs


Dec 2, 2022


Background

In 2018, pre-trained language models emerged and triggered a performance revolution in artificial intelligence. Research shows that increasing parameter count and data scale is an effective way to further improve language model performance. Models with one billion, ten billion, or even one hundred billion parameters have become a hot topic in the industry, sparking fierce competition among global research institutions and Internet companies and pushing the scale and performance of big models to new heights. In addition to well-known institutions such as Google and OpenAI, Chinese research institutions and companies have also emerged in recent years, accelerating the research and application of big models. Artificial intelligence has thus entered the “era of big models”.

However, in the “era of big models”, the huge demand for parameters and computing resources has created serious problems for training and tuning:

▶ The cost of computing resources is high

The huge number of parameters in a big model cannot be stored and computed on a single graphics card. OpenAI used thousands of GPUs to train GPT-3 (about 175 billion parameters), while Google used 6,144 TPU v4 chips to train PaLM (about 540 billion parameters). Training is expensive, and large computing clusters are difficult to configure.
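To make the storage claim concrete, here is a back-of-the-envelope sketch. It assumes 2 bytes per parameter (fp16 weights) and an accelerator with 80 GB of memory; both figures are illustrative assumptions, not numbers from this article.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

GPT3_PARAMS = 175e9    # parameter count quoted above
GPU_MEMORY_GB = 80     # assumption: a single 80 GB accelerator

weights_gb = weight_memory_gb(GPT3_PARAMS)
min_gpus = math.ceil(weights_gb / GPU_MEMORY_GB)

print(f"fp16 weights alone: {weights_gb:.0f} GB")
print(f"GPUs needed just to hold the weights: {min_gpus}")
```

Gradients and optimizer states multiply this figure several times over, so real training needs far more devices than this lower bound suggests.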

▶ Programming is difficult

To use distributed computing power to speed up the training and tuning of big models, programmers need to write complex distributed programs. Existing frameworks such as DeepSpeed support distributed training well, but still require fairly complex programming and configuration from the user.

▶ The computing speed is low

Distributed training requires frequent communication between computing nodes, so communication bandwidth becomes the training bottleneck. Unless the communication strategy is well designed and the code implements it correctly and efficiently, training time grows dramatically and the computing devices cannot be fully utilized.
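A rough estimate shows how strongly bandwidth can dominate. The sketch below models one common pattern, a ring all-reduce of gradients in data-parallel training, where each device sends about 2(N-1)/N times the payload. This is a generic illustration and not a description of BMTrain's own communication strategy; the model size and link speeds are hypothetical, and latency and compute overlap are ignored.

```python
def ring_allreduce_seconds(payload_bytes: float, num_devices: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    Each device sends 2 * (N - 1) / N of the payload over its link.
    """
    if num_devices < 2:
        return 0.0  # nothing to exchange on a single device
    traffic = 2 * (num_devices - 1) / num_devices * payload_bytes
    return traffic / bandwidth_bytes_per_s

# Hypothetical workload: 1 billion parameters with fp16 gradients (2 bytes each).
GRAD_BYTES = 1e9 * 2

# Hypothetical link speeds, chosen only to show the spread.
for name, bandwidth in [("10 GB/s link", 10e9), ("100 GB/s link", 100e9)]:
    seconds = ring_allreduce_seconds(GRAD_BYTES, num_devices=8,
                                     bandwidth_bytes_per_s=bandwidth)
    print(f"{name}: {seconds * 1000:.0f} ms of communication per step")
```

If a training step computes for only a few hundred milliseconds, the slower link in this toy model would leave the devices waiting for a large share of every step.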

To solve these problems, we released BMTrain, an efficient training toolkit for big models, and ModelCenter, a model repository built on it.

BMTrain & ModelCenter


The framework of the full-process acceleration tools from OpenBMB

Aiming to accelerate big model training, BMTrain is a core tool in the framework of the full-process acceleration tools. It has the following features:

▶ Efficient

As the “engine” of big model training, BMTrain can perform efficient big model pre-training and tuning on any number of GPUs. It optimizes the communication overhead of the distributed training algorithm. In super-large model training scenarios, BMTrain can save up to 90% of computing resource costs compared with DeepSpeed and other frameworks.

▶ Low-resource Requirement

To enable more labs and companies to train big models, we aim to minimize hardware requirements while maintaining training efficiency. With BMTrain, anyone can fine-tune BERT-Large on a single consumer-level GPU and train GPT-3 on a small A100 cluster (with 64 A100 GPUs).
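The two hardware claims can be sanity-checked with simple arithmetic. The sketch below rests on assumptions that are not stated in this article: mixed-precision Adam training with about 16 bytes of model state per parameter, and an even split of that state across GPUs (ZeRO-style sharding). Activations, buffers, and any CPU offloading are ignored. The parameter counts are the ones quoted in this article (175 billion for GPT-3, 300 million for BERT-Large).

```python
# Assumption: mixed-precision Adam keeps about 16 bytes of state per parameter
# (2 for the fp16 weight, 2 for the fp16 gradient, and 12 for the fp32 master
# weight, momentum, and variance).
BYTES_PER_PARAM = 16

def state_gb_per_gpu(num_params: float, num_gpus: int) -> float:
    """Model-state memory per GPU in GB, assuming an even split across GPUs."""
    return num_params * BYTES_PER_PARAM / num_gpus / 1e9

bert_large_gb = state_gb_per_gpu(300e6, num_gpus=1)   # BERT-Large on one GPU
gpt3_gb = state_gb_per_gpu(175e9, num_gpus=64)        # GPT-3 on 64 GPUs

print(f"BERT-Large, 1 GPU: {bert_large_gb:.1f} GB of model states")
print(f"GPT-3, 64 GPUs:    {gpt3_gb:.1f} GB of model states per GPU")
```

Under these assumptions BERT-Large needs about 4.8 GB of model state, while each of the 64 GPUs would hold roughly 44 GB for GPT-3, so whether it fits depends on the A100 memory variant (40 GB or 80 GB) and on how much room activations need.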

▶ Extendable

We aim to provide concise and effective wrappers that lower programming difficulty. By replacing only a few lines of code, users get the same programming experience as with native PyTorch. BMTrain also supports one-click installation, which simplifies configuration.


The BMTrain Architecture

We also implemented a series of big models based on BMTrain and integrated them into ModelCenter. BMTrain and ModelCenter together form an efficient distributed pre-training framework, which is usable for any Transformer structure and can be run on a small number of GPUs. The framework is highly compatible with PyTorch and transformers libraries, and is very easy to learn. We have implemented typical English models such as BERT, GPT, T5, RoBERTa and Chinese models such as CPM-1 and CPM-2.

Easy-to-use

Following the design principle of being simple and easy to use, BMTrain keeps its wrappers concise and effective. By replacing only a few lines of code, users get the same programming experience as with native PyTorch. We have also implemented ModelCenter, which lets users quickly start using big models with typical architectures.

▶ Simple Replacement

The user experience is similar to PyTorch, so the barrier to using BMTrain is low, and training can be accelerated through a few simple replacements:

  • bmtrain.DistributedParameter replaces torch.nn.Parameter
  • bmtrain.DistributedModule replaces torch.nn.Module
  • bmtrain.CheckpointBlock replaces the module in torch.nn.ModuleList

A simple comparison diagram is provided below to illustrate the simplicity of using BMTrain (original code on the left, replacement code on the right).
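In code form, the three replacements might look like the following sketch. The class names (ScaleLayer, ToyModel) and the toy layer itself are hypothetical, and only the three BMTrain names listed above are taken from this article; newer BMTrain releases may differ. The sketch also assumes that BMTrain's distributed environment has already been initialised in a multi-GPU launch before build_model() is called. Imports are kept inside the function so the file can be loaded on a machine without torch or bmtrain installed.

```python
def build_model(num_layers: int = 2, dim: int = 1024):
    """Build a toy model using the three replacements listed above.

    Assumption: the BMTrain distributed environment is already initialised,
    since distributed parameters are sharded across GPUs at construction.
    """
    import torch
    import bmtrain as bmt

    class ScaleLayer(bmt.DistributedModule):            # was: torch.nn.Module
        def __init__(self):
            super().__init__()
            # was: torch.nn.Parameter(torch.ones(dim))
            self.scale = bmt.DistributedParameter(torch.ones(dim))

        def forward(self, x):
            return x * self.scale

    class ToyModel(bmt.DistributedModule):               # was: torch.nn.Module
        def __init__(self):
            super().__init__()
            self.layers = torch.nn.ModuleList([
                bmt.CheckpointBlock(ScaleLayer())        # was: ScaleLayer()
                for _ in range(num_layers)
            ])

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    return ToyModel()
```

The forward methods and the overall model structure stay exactly as they would be in plain PyTorch; only the base class, the parameter constructor, and the wrapping of each block change.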


▶ One-line Conversion

Models implemented in PyTorch can be wrapped automatically with the BMTrainModelWrapper provided by BMTrain to gain distributed computing acceleration.

# Automatically wrap a model in a BMTrain model
bmt_model = BMTrainModelWrapper(model)  # model: torch.nn.Module

▶ Model Center

We have implemented ModelCenter based on BMTrain, which can be directly imported and used. The use of ModelCenter is similar to that of HuggingFace Transformers, and ModelCenter is compatible with various data processing interfaces in Transformers.


Excellent Performance

BMTrain has excellent performance on large-scale model training under different scales of computing power.

▶ Consumer-Level (Single 2080Ti)

On a consumer-level 2080Ti GPU, users can fine-tune BERT-Large (300 million parameters, sample length 512) with BMTrain.


▶ Entry-Level (Single V100)

Under the condition of entry-level computing power (the GPU is V100 32GB),…
