Doubling Neural Network Finetuning Efficiency with 16-bit Precision Techniques
Table of Contents
Finetuning a Vision Transformer on a Single GPU
Introducing the Open Source Fabric Library
16-bit Mixed Precision Training
Full 16-bit Precision Training
Full 16-bit Precision Training with BFloat-16
Conclusion
Takeaways
This guide targets PyTorch model training, illustrating how you can adjust the floating-point precision to drastically enhance training speed and halve memory consumption, all without compromising the prediction accuracy.
Figure: An excerpt of the improvements we can gain from leveraging the techniques introduced in this article.
In this article, we will work with a vision transformer from PyTorch’s Torchvision library, providing simple code examples that you can execute on your own machine without downloading and installing numerous code and dataset dependencies. The self-contained baseline training script comprises approximately 100 lines of code, excluding whitespace and code comments. All benchmarks were executed using the open source Lightning 2.1 package with PyTorch 2.1 and CUDA 12.1 on a single A100 GPU. You can find all code examples here on GitHub.
Finetuning a Vision Transformer on a Single GPU
While we are working with a vision transformer here (the ViT-L-16 model from the paper An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale), all the techniques used in this article transfer to other models as well: convolutional networks, large language models (LLMs), and others. Note that, to optimize predictive performance, we are finetuning a pretrained model for classification instead of training it from scratch.
Let’s begin with a simple baseline in PyTorch. The complete code, which implements and finetunes a vision transformer, is available here on GitHub. The core code for implementing this vision transformer is as follows:
# Import torch and torchvision
import torch
from torchvision.models import vit_l_16
from torchvision.models import ViT_L_16_Weights

# Initialize the pretrained vision transformer
model = vit_l_16(weights=ViT_L_16_Weights.IMAGENET1K_V1)

# Replace the output layer (1024 input features, 10 outputs)
model.heads.head = torch.nn.Linear(in_features=1024, out_features=10)

# Finetune the model

The relevant benchmark numbers for this baseline are as follows:
Training runtime: 16.88 min
GPU memory: 16.70 GB
Test accuracy: 94.06%
Introducing the Open Source Fabric Library
To simplify the PyTorch code for the experiments, we will be introducing the open-source Fabric library, which allows us to apply various advanced PyTorch techniques (automatic mixed-precision training, multi-GPU training, tensor sharding, etc.) with a handful of lines of code instead of dozens. The difference between simple PyTorch code and the version modified to use Fabric is subtle and involves only minor modifications, as highlighted in the code below.
Figure: It requires only a handful of changes to utilize the open-source Fabric library for PyTorch model training.
As mentioned above, these minor changes now provide a gateway to advanced features in PyTorch, as we will see in a bit, without restructuring any more of the existing code. The figure above summarizes the three main steps for converting plain PyTorch code to PyTorch+Fabric. Since Fabric is a wrapper around PyTorch, it should not affect the runtime of our code, which we can confirm via the performance benchmarks below.
Figure: Training time, memory usage, and prediction accuracy are the same for plain PyTorch and PyTorch augmented with Fabric.
Note that any minor differences in the bar plots above can be attributed to the randomness inherent in training neural networks and to machine fluctuations. If we were to repeat the runs multiple times and examine the averaged results, we would expect the bar plots to be essentially identical.
16-bit Mixed Precision Training
In the previous section, we modified our PyTorch code using Fabric. Why go through all this hassle? As we will see below, we can now try advanced techniques, like mixed-precision training, by changing only one line of code. (Similarly, we can enable distributed training in one line of code, but this is a topic for a different article.)
Using Mixed-Precision Training
We can use mixed-precision training with only one small modification, changing
fabric = Fabric(accelerator="cuda", devices=1)
to the following:
fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")
As we can see in the charts below, mixed-precision training cuts the training time by more than 30%. It also reduces peak memory consumption by more than 25% while maintaining the same prediction accuracy. In my personal experience, the gains are even larger when working with larger models.
Figure: Mixed-precision training significantly reduces training time and memory consumption.
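To make the effect of the precision flag more tangible, the following sketch uses PyTorch's torch.autocast context directly and inspects the resulting tensor dtypes. Two assumptions are worth flagging: that Fabric's mixed-precision setting builds on this same autocast mechanism, and that bfloat16 on the CPU is an acceptable stand-in here, chosen only so the example runs without a GPU (the "16-mixed" setting in the article uses float16 on a CUDA device). The small linear model is hypothetical.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy model; its parameters are created in 32-bit precision
model = torch.nn.Linear(16, 4)
features = torch.randn(8, 16)
targets = torch.randint(0, 4, (8,))

# Stand-in assumption: bfloat16 on CPU instead of float16 on CUDA,
# so that the sketch runs on machines without a GPU
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(features)  # the matrix multiplication runs in 16-bit

# Compute the loss in 32-bit precision for numerical stability
loss = torch.nn.functional.cross_entropy(logits.float(), targets)
loss.backward()

print("parameter dtype: ", model.weight.dtype)
print("activation dtype:", logits.dtype)
print("gradient dtype:  ", model.weight.grad.dtype)
```

The parameters and their gradients remain 32-bit while the layer output is 16-bit, which is the sense in which the precision is "mixed".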
What Is Mixed-Precision Training?
Mixed-precision training uses both 16-bit and 32-bit precision so that the speedup does not come at the cost of accuracy. Computing gradients in a 16-bit representation is much faster than in 32-bit format and also saves a significant amount of memory. This strategy is particularly beneficial when we are constrained by memory or computational resources.
The term “mixed” rather than “low” precision training is used because not all parameters and operations are transferred to 16-bit floats. Instead, we alternate between 32-bit and 16-bit operations during training.
As illustrated in the figure below, mixed-precision training involves converting weights to lower precision (16-bit floats, or FP16) for faster computation, calculating gradients, converting gradients back to higher precision (FP32) for numerical stability, and updating the original weights with the scaled gradients. This approach enables efficient training while maintaining the accuracy and stability of the neural network.
Figure: An overview of mixed-precision training.
For additional details, I also cover this concept in Unit 9.1 of my Deep Learning Fundamentals class.
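The four steps described above can be sketched by hand on a toy linear regression problem. This is an illustration only: it uses NumPy so that it runs on any CPU, it is not how PyTorch or Fabric implement mixed precision internally, and the data, learning rate, and variable names are hypothetical. "Scaled gradients" is interpreted here simply as gradients multiplied by the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: y = X @ w_true
X = rng.standard_normal((64, 4)).astype(np.float32)
w_true = np.array([0.5, -1.0, 2.0, 0.25], dtype=np.float32)
y = X @ w_true

master_w = np.zeros(4, dtype=np.float32)  # FP32 copy of the weights
lr = 0.1

# The inputs only need to be converted to FP16 once
X16, y16 = X.astype(np.float16), y.astype(np.float16)


def mse(w):
    """Mean squared error, evaluated in FP32."""
    return float(np.mean((X @ w - y) ** 2))


initial_loss = mse(master_w)

for _ in range(100):
    # 1) Convert the FP32 weights to FP16 for faster computation
    w16 = master_w.astype(np.float16)
    # 2) Forward pass and gradient computation in FP16
    err16 = X16 @ w16 - y16
    grad16 = X16.T @ (err16 * np.float16(2.0 / len(X)))
    # 3) Convert the gradients back to FP32 for numerical stability
    grad32 = grad16.astype(np.float32)
    # 4) Update the FP32 weights with the learning-rate-scaled gradients
    master_w -= lr * grad32

final_loss = mse(master_w)
print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.6f}")

# Why the update happens in FP32: near 1.0, FP16 cannot represent
# a change as small as 1e-4, whereas FP32 can
lost_in_fp16 = np.float16(1.0) + np.float16(1e-4) == np.float16(1.0)
kept_in_fp32 = np.float32(1.0) + np.float32(1e-4) != np.float32(1.0)
print("small update lost in FP16:", bool(lost_in_fp16))
print("small update kept in FP32:", bool(kept_in_fp32))
```

The final check hints at why the weight update is kept in 32-bit precision: a small learning-rate-scaled gradient can vanish entirely when added to a 16-bit weight.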
Full 16-bit Precision Training
We can also take it a step further and attempt running with “full” lower 16-bit precision, as opposed to mixed precision, which converts intermediate results back to a 32-bit representation. We can enable lower-precision training by changing fabric = Fabric(accelerator="cuda",…
Excerpt shown — open the source for the full document.