
LoRA, or Low-Rank Adaptation, is a parameter-efficient fine-tuning technique designed to adapt large language models (LLMs) to specific tasks without the high cost of full retraining[1][3]. Traditional fine-tuning updates billions of model parameters, which makes it resource-intensive and expensive[3].
Instead of modifying the original weights of a pre-trained model, LoRA freezes them and injects smaller, trainable rank decomposition matrices into the model's layers[1][4]. This approach significantly reduces the number of trainable parameters, often by thousands of times, which lowers GPU memory requirements and speeds up the training process[1][4].
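To make the parameter savings concrete, here is a minimal sketch of the arithmetic. The matrix dimensions and rank below are illustrative, not taken from any particular model: for a frozen weight matrix of shape (d, k), LoRA trains only two small matrices A (r × k) and B (d × r), whose product B·A has the same shape as the original weight.

```python
# Illustrative comparison of trainable parameter counts.
# d, k, and r are hypothetical values chosen for the example.

d, k = 4096, 4096   # size of one (hypothetical) projection matrix
r = 8               # LoRA rank, with r much smaller than min(d, k)

full_finetune_params = d * k       # full fine-tuning updates the whole matrix
lora_params = r * (d + k)          # LoRA trains only A (r x k) and B (d x r)

print(full_finetune_params)        # 16777216
print(lora_params)                 # 65536
print(full_finetune_params // lora_params)  # 256x fewer trainable parameters
```

For this single matrix the reduction is 256×; applied across many layers of a large model, and with optimizer state shrinking proportionally, this is how LoRA reaches reductions of thousands of times in practice.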
At deployment time, these smaller matrices can be merged into the frozen pre-trained weights, so inference incurs no additional latency[1][4]. Practitioners also often use extensions like QLoRA, which combines quantization with LoRA to further reduce memory usage by holding the base model in 4-bit precision[3][5]. Libraries such as Hugging Face's PEFT make it simple to configure and attach these adapters to a wide range of models[1][4].
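The merge step can be sketched in a few lines. After training, the merged weight W_merged = W + (alpha / r) · (B·A) replaces W, so a forward pass needs one matrix multiply instead of a separate adapter path. The toy matrices and the alpha/r scaling convention below are illustrative assumptions; real code would use a tensor library rather than pure Python.

```python
# Sketch: merging LoRA matrices into the frozen weight for inference.
# All values here are made up for illustration.

def matmul(X, Y):
    # Plain nested-list matrix multiplication.
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y, scale=1.0):
    # Elementwise X + scale * Y.
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weight (2 x 2)
B = [[1.0], [0.5]]             # trainable, shape (2 x r) with r = 1
A = [[0.2, -0.1]]              # trainable, shape (r x 2)
alpha, r = 2.0, 1              # hypothetical LoRA scaling hyperparameters

W_merged = add(W, matmul(B, A), scale=alpha / r)

x = [[0.5], [1.5]]             # a column-vector input
# Unmerged path: base output plus the scaled adapter output.
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)), scale=alpha / r)
# Merged path: a single matrix multiply, as at deployment.
merged = matmul(W_merged, x)

# The two paths agree up to floating-point rounding.
assert all(abs(m[0] - u[0]) < 1e-9 for m, u in zip(merged, unmerged))
```

Because the merged weight has the same shape as the original, the deployed model is structurally identical to the base model, which is why no extra inference latency is introduced.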