BMTrain, an efficient toolkit for big model training, can save up to 90% of training costs
Dec 2, 2022
Background
In 2018, pre-trained language models emerged and triggered a performance revolution in the field of artificial intelligence. Research shows that increasing the number of parameters and the scale of the data is an effective way to further improve the performance of language models. Big models with one billion, ten billion, or even one hundred billion parameters have become a hot topic in the industry, triggering fierce competition between global research institutions and Internet enterprises and pushing the scale and performance of big models to new heights. In addition to well-known institutions such as Google and OpenAI, Chinese research institutions and companies have also emerged in recent years, driving a boom in the research and application of big models. Artificial intelligence has thus entered the “era of big models”.
However, in the “era of big models”, the huge demands on parameter scale and computing resources have created serious problems for training and tuning:
▶ The cost of computing resources is high
A model with this many parameters cannot be stored or computed on a single graphics card. OpenAI used thousands of GPUs to train GPT-3 (about 175 billion parameters), while Google used 6,144 TPU v4 chips to train PaLM (about 540 billion parameters). Training is expensive, and large computing clusters are difficult to configure.
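To make the single-card limit concrete, here is a rough back-of-the-envelope sketch. It assumes fp16 weights (2 bytes per parameter) and the commonly cited 16 bytes per parameter for mixed-precision Adam training; the 80 GB card size is an illustrative assumption, and activations are ignored entirely, so real requirements are higher.

```python
import math

GB = 10 ** 9  # decimal gigabyte, to keep the arithmetic easy to follow

GPT3_PARAMS = 175_000_000_000  # "about 175 billion parameters", as stated above

# Assumed bytes per parameter (not taken from the article):
FP16_WEIGHTS = 2            # fp16 weights only, e.g. for inference
MIXED_PRECISION_ADAM = 16   # fp16 weights (2) + fp16 grads (2)
                            # + fp32 master weights, momentum, variance (4 + 4 + 4)

ASSUMED_GPU_MEMORY_GB = 80  # hypothetical card size, for illustration only


def model_state_gb(num_params: int, bytes_per_param: int) -> int:
    """Memory needed for model states alone, in whole decimal gigabytes."""
    return num_params * bytes_per_param // GB


weights_gb = model_state_gb(GPT3_PARAMS, FP16_WEIGHTS)
training_gb = model_state_gb(GPT3_PARAMS, MIXED_PRECISION_ADAM)
min_gpus = math.ceil(training_gb / ASSUMED_GPU_MEMORY_GB)

print(f"fp16 weights only:      {weights_gb} GB")
print(f"training model states:  {training_gb} GB")
print(f"80 GB cards needed just to hold the states: {min_gpus}")
```

Even before activations and communication buffers are counted, the model states alone exceed the memory of any single card by a wide margin under these assumptions.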
▶ Programming is difficult
To use distributed computing power to speed up the training and tuning of big models, programmers need to write complex distributed programs. Existing frameworks such as DeepSpeed support distributed training well, but they still require fairly complex programming and configuration from the user.
▶ The computing speed is low
In distributed training, different computing nodes must communicate frequently, so communication bandwidth becomes the training bottleneck. Unless the communication strategy is designed well and the code is implemented efficiently and correctly, training time grows dramatically and the computing devices cannot be fully utilized.
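A small worked example shows why bandwidth matters. The sketch below assumes plain data-parallel training in which gradients are synchronized with a ring all-reduce, where each GPU sends 2 × (N − 1) / N times the payload size. This is a generic illustration, not a description of BMTrain's own communication strategy, and the model size and link bandwidth are hypothetical.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: int, num_gpus: int) -> float:
    """Bytes each GPU sends during one ring all-reduce of `payload_bytes`."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to synchronize
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def transfer_seconds(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: ignores latency and any compute overlap."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical setup: 10-billion-parameter model, fp16 gradients, 8 GPUs.
grad_bytes = 10_000_000_000 * 2
sent_per_gpu = ring_allreduce_bytes_per_gpu(grad_bytes, num_gpus=8)

# Hypothetical link speed of 25 GB/s per GPU.
seconds_per_step = transfer_seconds(sent_per_gpu, 25e9)

print(f"each GPU sends {sent_per_gpu / 1e9:.1f} GB per step")
print(f"about {seconds_per_step:.2f} s per step spent on gradient traffic alone")
```

If that time is not hidden behind computation, it is added to every training step, which is why the choice of communication strategy has such a large effect on total training time.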
To solve these problems, we released BMTrain, an efficient training toolkit for big models, and ModelCenter, a model repository.
BMTrain & ModelCenter
The framework of the full-process acceleration tools from OpenBMB
Aiming to accelerate big model training, BMTrain is a core tool in the framework of the full-process acceleration tools. It has the following features:
▶ Efficient
As the “engine” of big model training, BMTrain can perform efficient big model pre-training and tuning on any number of GPUs. It reduces the communication overhead of distributed training algorithms. In super-large model training scenarios, BMTrain can save up to 90% of computing resource costs compared with DeepSpeed and other frameworks.
▶ Low-resource Requirement
To enable more labs and companies to train big models, we aim to minimize hardware requirements while maintaining training efficiency. With BMTrain, anyone can fine-tune BERT-Large on a single consumer-level GPU and train GPT-3 on a small A100 cluster (with 64 A100 GPUs).
▶ Extendable
We aim to keep the packaging concise and effective so that programming is easier. By replacing only a few lines of code, users get the same programming experience as with native PyTorch. BMTrain also supports one-click installation, which makes configuration easier.
The BMTrain Architecture
We also implemented a series of big models based on BMTrain and integrated them into ModelCenter. BMTrain and ModelCenter together form an efficient distributed pre-training framework, which is usable for any Transformer structure and can be run on a small number of GPUs. The framework is highly compatible with PyTorch and transformers libraries, and is very easy to learn. We have implemented typical English models such as BERT, GPT, T5, RoBERTa and Chinese models such as CPM-1 and CPM-2.
Easy-to-use
Following the design principle of being simple and easy to use, BMTrain keeps its packaging concise and effective. By replacing only a few lines of code, users get the same programming experience as with native PyTorch. We have also implemented ModelCenter, which makes it easier to get started quickly with big models that use typical model architectures.
▶ Simple Replacement
BMTrain offers a user experience similar to PyTorch, so the barrier to entry is low, and training can be accelerated with a few simple replacements:
bmtrain.DistributedParameter replaces torch.nn.Parameter
bmtrain.DistributedModule replaces torch.nn.Module
bmtrain.CheckpointBlock replaces the module in torch.nn.ModuleList
A simple comparison diagram is provided below to illustrate the simplicity of using BMTrain (original code on the left, replacement code on the right).
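In code, the three replacements might look like the following minimal sketch. The class names `Block` and `Model`, the sizes, and the `build_model` helper are hypothetical and only for illustration; the `bmtrain` names are the ones listed above plus `bmt.init_distributed`, and they should be checked against the BMTrain version you install. The imports are deferred into the function so the file can be loaded on a machine without PyTorch or BMTrain. Actually calling `build_model()` requires both packages, GPUs, and a job started with a distributed launcher such as `torchrun`.

```python
def build_model(hidden_size: int = 1024, num_layers: int = 2):
    """Build a toy model with BMTrain's drop-in replacements (illustrative only)."""
    # Deferred imports: only needed when the model is actually built.
    import torch
    import bmtrain as bmt

    # BMTrain must be initialized before distributed parameters are created.
    bmt.init_distributed(seed=0)

    class Block(bmt.DistributedModule):  # was: torch.nn.Module
        def __init__(self):
            super().__init__()
            # was: torch.nn.Parameter(torch.empty(hidden_size, hidden_size))
            # The tensor is left uninitialized here; initialize it before training.
            self.weight = bmt.DistributedParameter(
                torch.empty(hidden_size, hidden_size)
            )

        def forward(self, x):
            return x @ self.weight

    class Model(bmt.DistributedModule):  # was: torch.nn.Module
        def __init__(self):
            super().__init__()
            # Each module in the ModuleList is wrapped in a CheckpointBlock.
            self.blocks = torch.nn.ModuleList(
                [bmt.CheckpointBlock(Block()) for _ in range(num_layers)]
            )

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return x

    return Model()
```

The rest of the model code, including the `forward` methods, stays the same as in the plain PyTorch version, which is the point of the replacement approach.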
▶ One-line Conversion
Models implemented in PyTorch can be wrapped automatically with the BMTrainModelWrapper provided by BMTrain to enable distributed acceleration.
# Automatically wrap a model in a BMTrain model
bmt_model = BMTrainModelWrapper(model)  # model: torch.nn.Module
▶ Model Center
We have implemented ModelCenter based on BMTrain, which can be directly imported and used. The use of ModelCenter is similar to that of HuggingFace Transformers, and ModelCenter is compatible with various data processing interfaces in Transformers.
Excellent Performance
BMTrain has excellent performance on large-scale model training under different scales of computing power.
▶ Consumer-Level (Single 2080Ti)
On a consumer-level 2080Ti GPU, users can fine-tune BERT-Large (300 million parameters, sample length 512) with BMTrain.
▶ Entry-Level (Single V100)
Under the condition of entry-level computing power (the GPU is V100 32GB),…