[ICML 2024 (Oral)] DoRA: Weight-Decomposed Low-Rank Adaptation


1NVIDIA Research
2HKUST

*Work done during internship at NVIDIA Research


DoRA consistently outperforms LoRA on the LLaMA family models for commonsense reasoning tasks.

Abstract

Among the widely used parameter-efficient finetuning (PEFT) methods, LoRA and its variants have gained considerable popularity because they avoid additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT based on these findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.

Method Overview

An overview of our proposed DoRA, which decomposes the pre-trained weight into magnitude and direction components for fine-tuning, while employing LoRA to efficiently update the direction component.
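
Concretely, DoRA reparameterizes the fine-tuned weight as W' = m * (W0 + BA) / ||W0 + BA||, where only the magnitude vector m and the low-rank factors B and A are trained. Below is a minimal sketch of this reparameterization, assuming PyTorch's (out_features, in_features) weight layout; the function name is ours and is meant only as an illustration:

    import torch

    def dora_reparameterize(W0, A, B, m, scaling):
        # Direction component: the frozen weight plus the low-rank (LoRA) update.
        V = W0 + scaling * (B @ A)
        # Normalize each output row, then rescale it by the learnable magnitude m.
        norm = torch.linalg.norm(V, dim=1, keepdim=True)
        return m.view(-1, 1) * (V / norm)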

Results

Finetuning LLaMA-7B/13B, LLaMA2-7B and LLaMA3-8B for the commonsense reasoning tasks



Instruction tuning LLaMA-7B and LLaMA2-7B with the cleaned Alpaca dataset.



Finetuning VL-BART for the image/video-text understanding tasks



Visual instruction tuning (LLaVA-1.5-7B)

QDoRA vs. QLoRA vs. FT

Implementation of DoRA compared to LoRA in PyTorch-like code


  ## Both snippets assume F = torch.nn.functional and a transpose() helper that
  ## returns the weight (or its transpose) depending on fan_in_fan_out.

  ## LoRA forward pass
  def forward(self, x: torch.Tensor):
    # Frozen pre-trained weight (and bias) applied to the input.
    base_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    dropout_x = self.lora_dropout(x)

    # Add the low-rank update B(A(x)), scaled as in LoRA.
    result = base_result + (self.lora_B(self.lora_A(dropout_x.to(self.lora_A.weight.dtype)))) * self.scaling
    return result

  ## DoRA forward pass
  def forward(self, x: torch.Tensor):
    base_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out))
    dropout_x = self.lora_dropout(x)

    # Direction component V + delta_V, where delta_V = B @ A is the LoRA update.
    new_weight_v = self.weight + (self.lora_B.weight @ self.lora_A.weight) * self.scaling
    # Magnitude m divided by the (detached) per-row norm of the direction component.
    norm_scale = self.weight_m_wdecomp.weight.view(-1) / (torch.linalg.norm(new_weight_v, dim=1)).detach()
    # With dropout disabled, the next two lines equal norm_scale * ((W + scaling * B A) x),
    # i.e. m * (W0 + delta_V) / ||W0 + delta_V|| applied to x.
    result = base_result + (norm_scale - 1) * (F.linear(dropout_x, transpose(self.weight, self.fan_in_fan_out)))
    result += (norm_scale * (self.lora_B(self.lora_A(dropout_x.to(self.lora_A.weight.dtype))))) * self.scaling
    if self.bias is not None:
      result += self.bias.view(1, -1).expand_as(result)
    return result
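
Because the magnitude and direction can be folded back into a single weight matrix once training is finished, DoRA incurs no extra inference cost. Below is a hedged sketch of such a merge, reusing the attribute names from the snippet above; merge_dora_weights itself is an illustrative helper of ours, not part of the official implementation:

    import torch

    @torch.no_grad()
    def merge_dora_weights(layer):
        # Direction component after training: W0 + (B @ A) * scaling.
        new_weight_v = layer.weight + (layer.lora_B.weight @ layer.lora_A.weight) * layer.scaling
        # Rescale each output row by m / ||row||, mirroring norm_scale above.
        norm = torch.linalg.norm(new_weight_v, dim=1, keepdim=True)
        merged = layer.weight_m_wdecomp.weight.view(-1, 1) * (new_weight_v / norm)
        # The layer then behaves like a plain linear layer (y = x @ merged.T + bias),
        # so inference costs exactly as much as with the original pre-trained weight.
        layer.weight.copy_(merged)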
  

Quick Start with FSDP/QDoRA from Answer.AI


QDoRA + Fully Sharded Data Parallel (FSDP) is now supported by Answer.AI
git clone https://github.com/AnswerDotAI/fsdp_qlora
Start finetuning LLMs on consumer GPUs!

Quick Start with Hugging Face PEFT and Diffusers


Hugging Face PEFT


DoRA is now supported by the Hugging Face PEFT package. You can install the PEFT package using

pip install git+https://github.com/huggingface/peft.git -q

After PEFT is installed, you can simply set the use_dora argument of LoraConfig to True to apply DoRA. For example:


    from peft import LoraConfig, get_peft_model

    # Initialize DoRA configuration
    config = LoraConfig(
        ...
        use_dora=True,
        ...
    )
  

Please refer to the official documentation for more details.
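
For reference, here is a minimal end-to-end sketch of wrapping a causal language model with a DoRA adapter; the model name and hyperparameter values below are illustrative, not the settings used in the paper:

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model and hyperparameters; adjust to your setup.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
    )

    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        use_dora=True,          # switch from plain LoRA to DoRA
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()
    # The wrapped model can now be trained with any standard training loop or Trainer.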

Hugging Face Diffusers


You can also try DoRA for finetuning diffusion models; see huggingface/diffusers. Another good tutorial is this Colab notebook from Linoy Tsaban.
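
As a starting point, here is a minimal sketch of attaching a DoRA adapter to a Stable Diffusion UNet through the Diffusers/PEFT integration; the model id, rank, and target module names are illustrative assumptions, not settings from the paper:

    import torch
    from diffusers import StableDiffusionPipeline
    from peft import LoraConfig

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    )

    # Illustrative DoRA config targeting the UNet attention projections.
    unet_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
        use_dora=True,
    )
    pipe.unet.add_adapter(unet_config)  # Diffusers' PEFT integration injects the DoRA layers

The adapter parameters can then be trained with a DreamBooth/LoRA-style training loop such as the ones in the Diffusers examples.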

Some DoRA vs. LoRA diffusion finetuning results

Example from Linoy Tsaban (images generated by DoRA are on the left, those from LoRA on the right):


Example from merve



BibTeX

@article{liu2024dora,
  title={DoRA: Weight-Decomposed Low-Rank Adaptation},
  author={Liu, Shih-Yang and Wang, Chien-Yi and Yin, Hongxu and Molchanov, Pavlo and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Chen, Min-Hung},
  journal={arXiv preprint arXiv:2402.09353},
  year={2024}
}