In deep learning, fine-tuning is an approach to transfer learning in which the parameters of a pre-trained model are trained on new data. [1] Fine-tuning can be done on the entire neural network, or on only a subset of its layers, in which case the layers that are not being fine-tuned are "frozen" (not updated during the backpropagation step). [2] A model may also be augmented with "adapters" that consist of far fewer parameters than the original model, and fine-tuned in a parameter-efficient way by tuning the weights of the adapters while leaving the rest of the model's weights frozen. [3]
For some architectures, such as convolutional neural networks, it is common to keep the earlier layers (those closest to the input layer) frozen because they capture lower-level features, while later layers often discern high-level features that can be more related to the task that the model is trained on. [2] [4]
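Freezing can be illustrated with a minimal NumPy sketch. It assumes a toy two-layer network rather than a real convolutional model, and the names `W1`, `W2`, and `loss_and_grad_W2`, the random data, and the learning rate are all hypothetical. Only the later layer receives gradient updates; the earlier layer is used for feature extraction and left unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a "pre-trained" network: x -> tanh(x @ W1) -> (...) @ W2
W1 = rng.normal(size=(4, 8))   # earlier layer, kept frozen
W2 = rng.normal(size=(8, 1))   # later layer, fine-tuned

# Toy data for the new task (random values, for illustration only)
X = rng.normal(size=(32, 4))
y = rng.normal(size=(32, 1))

def loss_and_grad_W2(W1, W2, X, y):
    """Mean squared error and its gradient with respect to W2 only."""
    h = np.tanh(X @ W1)                 # features from the frozen layer
    err = h @ W2 - y
    loss = float(np.mean(err ** 2))
    grad_W2 = 2.0 * h.T @ err / len(X)
    return loss, grad_W2

W1_before = W1.copy()
initial_loss, _ = loss_and_grad_W2(W1, W2, X, y)

for _ in range(200):
    _, grad = loss_and_grad_W2(W1, W2, X, y)
    W2 -= 0.05 * grad                   # only the unfrozen layer is updated

final_loss, _ = loss_and_grad_W2(W1, W2, X, y)
```

No gradient is ever computed for `W1`, which is what makes the layer frozen in this sketch. Deep learning frameworks typically achieve the same effect by excluding the frozen parameters from the optimizer or from gradient tracking.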
Models that are pre-trained on large and general corpora are usually fine-tuned by reusing the model's parameters as a starting point and adding a task-specific layer trained from scratch. [5] Fine-tuning the full model is common as well and often yields better results, but it is more computationally expensive. [6]
Fine-tuning is typically accomplished with supervised learning, but there are also techniques to fine-tune a model using weak supervision. [7] Fine-tuning can be combined with an objective based on reinforcement learning from human feedback (RLHF) to produce language models like ChatGPT (fine-tuned from a model in the GPT-3.5 series) and Sparrow. [8] [9]
Fine-tuning can degrade a model's robustness to distribution shifts. [10] [11] One mitigation is to linearly interpolate a fine-tuned model's weights with the weights of the original model, which can greatly increase out-of-distribution performance while largely retaining the in-distribution performance of the fine-tuned model. [12]
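The interpolation described above can be sketched in a few lines of NumPy. This assumes each model is represented as a dictionary mapping parameter names to arrays of matching shape. The function name `interpolate_weights`, the mixing coefficient `alpha`, and the toy values are hypothetical.

```python
import numpy as np

def interpolate_weights(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned for each parameter."""
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Toy "models" with two parameters each (values chosen for illustration)
pretrained = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
finetuned = {"w": np.array([3.0, 0.0]), "b": np.array([1.0])}

merged = interpolate_weights(pretrained, finetuned, alpha=0.5)
```

With `alpha=0` the result is the original model and with `alpha=1` it is the fine-tuned model. Intermediate values trade off between the two, and a suitable value would in practice be chosen on held-out data.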
Low-rank adaptation (LoRA) is an adapter-based technique for efficiently fine-tuning models. The basic idea is to train a low-rank matrix that is then added to the original weight matrix. [13] An adapter, in this context, is a collection of low-rank matrices which, when added to a base model, produces a fine-tuned model. LoRA allows for performance that approaches full-model fine-tuning while requiring less storage. A language model with billions of parameters may be LoRA fine-tuned by training only a few million parameters.
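As a rough illustration of the idea, the following NumPy sketch adds a rank-`r` update, written as the product of two small matrices `B` and `A`, to a frozen weight matrix `W`. The sizes, the names `A`, `B`, `scale`, and `lora_forward`, and the initialization (small random `A`, zero `B`) are assumptions chosen for illustration rather than details taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4               # illustrative sizes; r is the adapter rank

W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init (assumed)
B = np.zeros((d_out, r))                 # trainable, zero init so the update starts at 0

def lora_forward(x, W, A, B, scale=1.0):
    """Compute x @ (W + scale * B @ A).T without forming the full update matrix."""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(5, d_in))
y = lora_forward(x, W, A, B)

full_params = W.size                     # parameters in the original matrix
adapter_params = A.size + B.size         # parameters that would be trained
```

Here the adapter holds 384 values against 2,048 in `W`. The gap widens as the matrix dimensions grow relative to `r`, which is why the adapter for a very large model can be comparatively small.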
LoRA-based fine-tuning has become popular in the Stable Diffusion community. [14] Support for LoRA was integrated into the Diffusers library from Hugging Face. [15] Support for LoRA and similar techniques is also available for a wide range of other models through Hugging Face's Parameter-Efficient Fine-Tuning (PEFT) package. [16]
Representation fine-tuning (ReFT) is a novel technique developed by researchers at Stanford University aimed at fine-tuning large language models (LLMs) by modifying less than 1% of their representations. Unlike traditional parameter-efficient fine-tuning (PEFT) methods, which mainly focus on updating weights, ReFT targets specific parts of the model relevant to the task being fine-tuned. This approach is based on the understanding that deep learning models encode rich semantic information in their representations, suggesting that modifying representations might be a more effective strategy than updating weights. [17]
ReFT methods operate on a frozen base model and learn task-specific interventions that manipulate a small fraction of the model's hidden representations, steering the model's behavior toward solving downstream tasks at inference time. One specific method within the ReFT family is Low-rank Linear Subspace ReFT (LoReFT), which intervenes on hidden representations in the linear subspace spanned by a low-rank projection matrix. [17] LoReFT can be seen as the representation-based equivalent of low-rank adaptation (LoRA).
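A subspace intervention of this kind can be sketched in NumPy under one stated assumption: that the edit takes the form h + Rᵀ(Wh + b − Rh), where R is a low-rank projection with orthonormal rows and W and b are learned. This form, and the names `R`, `W`, `b`, and `loreft`, are assumptions for illustration. The parameters below are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                              # hidden size and subspace rank (illustrative)

# R: r x d projection with orthonormal rows, obtained here from a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))   # Q has orthonormal columns, shape (d, r)
R = Q.T
W = rng.normal(size=(r, d))               # stands in for a learned linear map
b = rng.normal(size=r)                    # stands in for a learned bias

def loreft(h, R, W, b):
    """Edit h only within the subspace spanned by the rows of R."""
    target = h @ W.T + b                  # desired coordinates in the subspace
    current = h @ R.T                     # current coordinates in the subspace
    return h + (target - current) @ R

h = rng.normal(size=(3, d))               # a batch of hidden representations
h_new = loreft(h, R, W, b)
delta = h_new - h
```

The base model's weights play no part in the edit. The change `delta` lies entirely in the `r`-dimensional subspace, and the component of `h` orthogonal to that subspace passes through unchanged.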
Fine-tuning is common in natural language processing (NLP), especially in the domain of language modeling. Large language models like OpenAI's series of GPT foundation models can be fine-tuned on data for specific downstream NLP tasks (tasks that use a pre-trained model) to improve performance over the unmodified pre-trained model. [6]
Commercially offered large language models can sometimes be fine-tuned if the provider offers a fine-tuning API. As of June 19, 2023, language model fine-tuning APIs are offered by OpenAI and Microsoft Azure's Azure OpenAI Service for a subset of their models, as well as by Google Cloud Platform for some of their PaLM models, and by others. [18] [19] [20] Not all commercial models currently support fine-tuning.
Companies such as Meta (Llama LLM family), Alibaba (Qwen LLM family) and Mistral AI (Mixtral) have published open-source large language models of different sizes on GitHub, which can be fine-tuned. Open-source models can be advantageous for companies in terms of data security, because the companies can control where the model is hosted.