Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments
Takeaways
LoRA is one of the most widely used, parameter-efficient finetuning techniques for training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.
Introduction: Getting the Most out of LoRA
I’ve run hundreds, if not thousands, of experiments involving LoRA over the past few months. A few weeks ago, I took the time to delve deeper into some of the hyperparameter choices. This is more of an experimental diary presented in sequential order. I hope it proves useful to some. Specifically, I aim to address questions about the value of QLoRA, whether to replace AdamW with SGD, the potential use of a scheduler, and how to adjust the LoRA hyperparameters.
There’s a lot to discuss on the experimental side, so I’ll keep the introduction to LoRA brief. In short, LoRA (Low-Rank Adaptation; Hu et al., 2021) adds a small number of trainable parameters to the model while the original model parameters remain frozen. Rather than updating a full weight matrix directly, LoRA represents the weight update as the product of two much smaller matrices, as illustrated below, to approximate full supervised finetuning in a more parameter-efficient manner.
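To make the low-rank idea concrete, here is a minimal pure-Python sketch. It is not the Lit-GPT implementation; the function names are hypothetical, and the `alpha / r` scaling follows the convention from the LoRA paper. It counts trainable parameters for one weight matrix and shows a forward pass in which the frozen weight `W` is combined with the low-rank update `B (A x)`.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out            # updating W (d_out x d_in) directly
    lora = r * d_in + d_out * r    # A is (r x d_in), B is (d_out x r)
    return full, lora


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank path
    scale = alpha / r                   # scaling convention assumed from the LoRA paper
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

For a 4096 × 4096 matrix and r = 8, this works out to 65,536 trainable parameters instead of 16,777,216. If `B` starts at zero, as is common, the layer initially behaves exactly like the pretrained one.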
For more details about LoRA, please see my in-depth article Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA). The topics we are going to cover in this article are organized as follows:
1. Evaluation Tasks and Dataset
2. Code Framework
3. Choosing a Good Base Model
4. Evaluating the LoRA Defaults
5. Memory Savings with QLoRA
6. Learning Rate Schedulers and SGD
7. Iterating Over the Dataset Multiple Times
8. LoRA Hyperparameter Tuning Part 1: LoRA for All Layers
9. LoRA Hyperparameter Tuning Part 2: Increasing R
10. LoRA Hyperparameter Tuning Part 3: Changing Alpha
11. LoRA Hyperparameter Tuning Part 3: Very Large R
12. Leaderboard Submission
13. Conclusion
Evaluation Tasks and Dataset
The focus of this article is on selecting the optimal settings. To stay within a reasonable scope, I’m keeping the dataset fixed and focusing solely on supervised instruction-finetuning of LLMs. (Modifications to the dataset or finetuning for classification might be addressed in future articles.)
For the model evaluation, I selected a small subset of tasks from EleutherAI’s Evaluation Harness, including TruthfulQA, BLiMP Causative, MMLU Global Facts, and simple arithmetic tasks with two digits (arithmetic 2ds) and four digits (arithmetic 4ds). In each benchmark, the model performance score is normalized between 0 and 1, where 1 is a perfect score. TruthfulQA reports two scores, which are defined as follows:
MC1 (Single-true): Given a question and 4-5 answer choices, select the only correct answer. The model’s selection is the answer choice to which it assigns the highest log-probability of completion following the question, independent of the other answer choices. The score is the simple accuracy across all questions.
MC2 (Multi-true): Given a question and multiple true / false reference answers, the score is the normalized total probability assigned to the set of true answers.
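The two TruthfulQA scores can be sketched for a single question as follows. This is an illustrative reading of the definitions above, not the Evaluation Harness code itself; the function names and the log-probability values are hypothetical.

```python
import math


def mc1_score(logprobs, correct_index):
    """1.0 if the single correct answer has the highest log-probability, else 0.0."""
    best = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return 1.0 if best == correct_index else 0.0


def mc2_score(logprobs, is_true):
    """Share of the total probability mass placed on the true reference answers."""
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, t in zip(probs, is_true) if t)
    return true_mass / sum(probs)


# Hypothetical completion log-probabilities for the answer choices of one question.
logprobs = [-1.2, -0.4, -2.5, -3.0]
print(mc1_score(logprobs, correct_index=1))
print(mc2_score(logprobs, is_true=[True, True, False, False]))
```

The reported benchmark value is then the average of these per-question scores over all questions.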
For reference, the 175B GPT-3 model has TruthfulQA MC1 and MC2 values of 0.21 and 0.33, respectively.
Below are two examples to illustrate the difference between arithmetic 2ds and arithmetic 4ds:
Arithmetic 2ds: prompt “What is 59 minus 38”, expected answer “21”.
Arithmetic 4ds: prompt “What is 2762 plus 2751”, expected answer “5513”.
As mentioned above, I kept the dataset fixed, using the widely used Alpaca dataset for supervised instruction finetuning. Of course, many other datasets are available for instruction finetuning, including LIMA, Dolly, LongForm, FLAN, and more. Exploring training on multiple datasets and dataset mixes will be an interesting topic for future studies.
The Alpaca dataset consists of approximately 50k instruction-response pairs for training, with a median input length of 110 tokens (using the Llama 2 SentencePiece tokenizer), as shown in the histogram below.
The dataset tasks themselves can be structured as shown in the figure below.
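As a rough illustration of that structure, the sketch below turns one Alpaca-style record (with `instruction`, `input`, and `output` fields) into a training prompt. The template text follows the one published with the Stanford Alpaca dataset as I recall it, so treat the exact wording as an assumption; the function name and the example record are hypothetical.

```python
def format_alpaca_prompt(example):
    """Build the prompt text for one Alpaca-style record (dict with string fields)."""
    if example.get("input"):
        # Records with extra context use the longer template with an Input section.
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:"
        )
    # Records without context omit the Input section entirely.
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:"
    )


# Hypothetical record; during finetuning the model learns to produce `output`.
record = {
    "instruction": "Rewrite the sentence in the past tense.",
    "input": "She walks to the station.",
    "output": "She walked to the station.",
}
print(format_alpaca_prompt(record))
```

During supervised finetuning, the prompt is followed by the record's `output` text, which serves as the target response.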
Code Framework
The custom LLM finetuning code I used for this article is based on the open-source Lit-GPT repository. To keep the preamble of this article brief, I won’t go into the usage details, but you can find a more detailed guide in the Lit-GPT tutorials section. In brief, the usage is as follows:
1) Clone the repository and install the requirements
    git clone https://github.com/Lightning-AI/lit-gpt
    cd lit-gpt
    pip install -r requirements.txt
2) Download and prepare a model checkpoint
    python scripts/download.py \
      --repo_id mistralai/Mistral-7B-Instruct-v0.1
    # there are many other supported models
    python scripts/convert_hf_checkpoint.py \
      --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
3) Prepare a dataset
    python scripts/prepare_alpaca.py \
      --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
    # or from a custom CSV file
    python scripts/prepare_csv.py \
      --csv_dir MyDataset.csv \
      --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1
4) Finetune
    python finetune/lora.py \
      --checkpoint_dir checkpoints/mistralai/Mistral-7B-Instruct-v0.1/ \
      --precision bf16-true
5) Merge LoRA weights
    python scripts/merge_lora.py \
      --checkpoint_dir "checkpoints/mistralai/Mistral-7B-Instruct-v0.1" \
      --lora_path "out/lora/alpaca/Mistral-7B-Instruct-v0.1/lit_model_lora_finetuned.pth" \
      --out_dir "out/lora_merged/Mistral-7B-Instruct-v0.1/"
    cp checkpoints/mistralai/Mistral-7B-Instruct-v0.1/*.json \
      out/lora_merged/Mistral-7B-Instruct-v0.1/
6) Evaluate
    python eval/lm_eval_harness.py \
      --checkpoint_dir "out/lora_merged/Mistral-7B-Instruct-v0.1/" \
      --eval_tasks "[arithmetic_2ds, ..., truthfulqa_mc]" \
      --precision...