Delta Models¶
LoRA¶
- class LoraModel(backbone_model: Module, lora_r=8, lora_alpha=16, lora_dropout=0.0, modified_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]¶
The implementation of LoRA: Low-Rank Adaptation of Large Language Models. Thanks to the authors for their loralib.
Note
In our implementation, we do not use loralib.linear to replace the linear layers of the backbone model. Instead, we insert a parallel module into the backbone. In other words, we treat \((W + A^TB) X\) as \(WX + A^TBX\), and insert \(A^TBX\) as a parallel insertion module. If you want to use the original implementation, please refer to lora_old.py.
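The parallel decomposition above can be checked numerically. The following is a minimal pure-Python sketch (our own illustration with nested lists, not the library's implementation) verifying that \((W + A^TB)X = WX + A^TBX\) for small random matrices:

```python
import random

def matmul(P, Q):
    """Multiply two matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def add(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def transpose(P):
    return [list(col) for col in zip(*P)]

random.seed(0)
d, r = 4, 2  # hidden size and LoRA rank (r << d)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # r x d
B = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # r x d
X = [[random.gauss(0, 1)] for _ in range(d)]                    # d x 1

# Fused form: (W + A^T B) X
fused = matmul(add(W, matmul(transpose(A), B)), X)
# Parallel form used here: W X + A^T B X
parallel = add(matmul(W, X), matmul(matmul(transpose(A), B), X))

assert all(abs(f[0] - p[0]) < 1e-9 for f, p in zip(fused, parallel))
```

Because the two forms are algebraically identical, the parallel insertion leaves the backbone's weights untouched while adding only the low-rank path as new parameters.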
class attributes:
default_modified_modules = ["attn.q", "attn.v"] According to the paper, the q and v matrices in the attention layer are modified. However, other linear layers can also be modified, which may lead to better performance.
Note
modified_modules should point to linear layers. We currently don't support broadcasting to all linear layers among a module's child modules.
delta_type = "lora"
- Parameters
backbone_model (transformers.PretrainedModels) – The backbone model to be modified.
lora_r (int, optional) – The rank of the LoRA parameters. The smaller lora_r is, the fewer parameters LoRA has.
lora_alpha (int, optional) – A hyper-parameter controlling the initial scale of loralib.linear.
lora_dropout (float, optional) – The dropout rate in lora.linear.
modified_modules (List[str]) – The modules to be modified; for LoRA, they should refer to linear layers (see the note above).
unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the LoRA parameters.
common_structure (bool) – Whether to use name-based addressing with a common structure mapping.
backend (str) – The backend of the PLM: 'hf' for HuggingFace Transformers, 'bmt' for BMTrain.
- config_class¶
alias of LoraConfig
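As a back-of-the-envelope illustration of how lora_r controls the parameter budget, the sketch below (our own helper, not part of OpenDelta) counts the parameters LoRA adds to a single d_out x d_in linear layer:

```python
def lora_param_count(d_in: int, d_out: int, lora_r: int) -> int:
    """LoRA adds A (lora_r x d_in) and B (d_out x lora_r): r * (d_in + d_out) parameters."""
    return lora_r * (d_in + d_out)

print(lora_param_count(768, 768, 8))   # 12288 extra parameters at the default rank 8
print(lora_param_count(768, 768, 4))   # halving lora_r halves the added parameters
print(lora_param_count(768, 768, 8) / (768 * 768))  # roughly 2% of the full weight's size
```

This is why the docstring notes that smaller lora_r means fewer parameters: the added count grows linearly in the rank, while the frozen backbone weight stays quadratic in the hidden size.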
BitFit¶
- class BitFitModel(backbone_model: Module, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]¶
The implementation of BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. Unfreezes the bias terms (or adds bias terms if they are absent in the backbone, e.g., T5) in the modules of a transformer block.
Note
Broadcast to Submodule: We modify all potential positions of the specified modified_modules. That is to say, if we specify attn in modified_modules, then a bias layer is added (or unfrozen) at every position in the attention layer, including the q, k, v, and output linear layers. The potential positions are determined according to equations (1)-(5) and the previous three equations.
- class attributes:
default_modified_modules = ["attn", "ff", "layer_norm", "lm_head.proj"] According to the paper and the implementation in Compacter's baseline, we modify the bias terms in the above modules.
delta_type = "bitfit"
- Parameters
backbone_model (transformers.PretrainedModels) – The backbone model to be modified.
modified_modules (List[str]) – The modules whose bias terms will be unfrozen (or added).
unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the bias parameters.
common_structure (bool) – Whether to use name-based addressing with a common structure mapping.
- config_class¶
alias of BitFitConfig
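Conceptually, BitFit amounts to selecting only the bias parameters inside the specified modules as trainable. A framework-free sketch (the helper and parameter names are made up for illustration):

```python
def bitfit_trainable(param_names, modified_modules):
    """Return the bias parameters inside the specified modules; everything else stays frozen."""
    return [n for n in param_names
            if n.endswith(".bias") and any(m in n for m in modified_modules)]

params = [
    "encoder.block.0.attn.q.weight", "encoder.block.0.attn.q.bias",
    "encoder.block.0.ff.wi.weight",  "encoder.block.0.ff.wi.bias",
    "encoder.block.0.layer_norm.weight",
]
print(bitfit_trainable(params, ["attn", "ff"]))
# only the two .bias entries under attn and ff are selected as trainable
```

In the real library, a backbone such as T5 has no bias terms at all, so BitFitModel inserts fresh bias layers at these positions instead of merely unfreezing them.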
Adapter¶
- class AdapterModel(backbone_model: Module, bottleneck_dim: Optional[int] = 24, non_linearity: Optional[str] = 'gelu_new', modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[bool] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]¶
The implementation of Adapter (Parameter-Efficient Transfer Learning for NLP). Adds an adapter to the designated modified_modules. In the sequential paradigm, the modules' output is then passed into the adapter's post_forward.
Note
We assume the output of the modified module is the hidden state, or a tuple whose first element is the hidden state. This holds for most PLMs. However, we admit that this is currently not rigorous, and we will improve it in the next version. For now, if you encounter an error here for your backbone, you can modify the code to retrieve the hidden state.
- class attributes:
default_modified_modules = ["attn", "ff"] According to the Adapter paper, we add adapters to the attention layer and the feed-forward layer.
delta_type = "adapter"
- Parameters
backbone_model (transformers.PretrainedModels) – The backbone model to be modified.
bottleneck_dim (int) – The dimension of the adapter's bottleneck.
non_linearity (str) – The non-linearity of the adapter.
modified_modules (List[str]) – The modules after which to add adapters.
unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the adapter parameters.
common_structure (bool) – Whether to use name-based addressing with a common structure mapping.
backend (str) – The backend of the PLM: 'hf' for HuggingFace Transformers, 'bmt' for BMTrain.
- config_class¶
alias of AdapterConfig
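The sequential adapter described above can be sketched in a few lines of framework-free Python (our own illustration; ReLU stands in for the library's default 'gelu_new' non-linearity). With a zero-initialized up-projection, the adapter is the identity at initialization:

```python
import random

def linear(x, W, b):
    """Apply a dense layer y = W x + b to a plain list x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def adapter_forward(hidden, W_down, b_down, W_up, b_up):
    """Sequential adapter: down-project, non-linearity, up-project, residual add.
    ReLU is a stand-in for the default 'gelu_new'."""
    z = [max(0.0, v) for v in linear(hidden, W_down, b_down)]
    out = linear(z, W_up, b_up)
    return [h + o for h, o in zip(hidden, out)]

random.seed(0)
hidden_dim, bottleneck_dim = 8, 2  # bottleneck_dim << hidden_dim
h = [random.gauss(0, 1) for _ in range(hidden_dim)]
W_down = [[random.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(bottleneck_dim)]
b_down = [0.0] * bottleneck_dim
W_up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]  # zero init of the up-projection
b_up = [0.0] * hidden_dim

# With a zero up-projection the residual branch contributes nothing: output == input.
assert adapter_forward(h, W_down, b_down, W_up, b_up) == h
```

The residual connection is what makes adapters safe to insert into a pretrained network: training starts from a function identical to the unmodified backbone.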
LowRankAdapter¶
- class LowRankAdapterModel(backbone_model: Module, reduction_factor=32, non_linearity='gelu_new', low_rank_w_init='glorot-uniform', low_rank_rank=1, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]¶
The implementation of LowRankAdapter, proposed as a baseline in Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. We found that it uses very few parameters while achieving competitive performance, so we added it to OpenDelta. The low-rank adapter parameterizes each adapter weight as a product of two low-rank (rank-one) weights.
Adds a low-rank adapter layer to the designated modified_modules. In the sequential paradigm, the modules' output is then passed into the low-rank adapter's post_forward.
Note
We assume the output of the modified module is the hidden state, or a tuple whose first element is the hidden state. This holds for most PLMs. However, we admit that this is currently not rigorous, and we will improve it in the next version. For now, if you encounter an error here for your backbone, you can modify the code to retrieve the hidden state.
All the hyperparameters are adopted from the Compacter code base.
- class attributes:
default_modified_modules = ["attn", "ff"] According to the Compacter paper, we add low-rank adapters to the attention layer and the feed-forward layer.
delta_type = "lowrankadapter"
- Parameters
backbone_model (transformers.PretrainedModels) – The backbone model to be modified.
reduction_factor (int, optional, default to 32) – bottleneck_dim = hidden_dim // reduction_factor.
non_linearity (str, optional, default to "gelu_new") – The non-linearity activation used between the down projector and the up projector.
low_rank_w_init (str, optional, default to "glorot-uniform") – The weight initialization method of the factorized linear weights.
low_rank_rank (int, optional, default to 1) – The rank of the low-rank decomposition.
modified_modules (List[str]) – The modules after which to add low-rank adapters.
unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the adapter parameters.
common_structure (bool, optional, default to None) – Whether to use name-based addressing with a common structure mapping.
- config_class¶
alias of LowRankAdapterConfig
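With low_rank_rank=1, each projection weight is parameterized as an outer product u v^T, so a d_out x d_in matrix costs only d_out + d_in parameters instead of d_out * d_in. A small sketch (our own notation) showing that the materialized matrix and the cheap factorized path agree:

```python
def outer(u, v):
    """Rank-one matrix u v^T from two vectors."""
    return [[ui * vj for vj in v] for ui in u]

def matvec(W, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

d_in, d_out = 6, 4
u = [float(i + 1) for i in range(d_out)]  # d_out parameters
v = [float(j + 1) for j in range(d_in)]   # d_in parameters
x = [1.0] * d_in

# Materializing W = u v^T and applying it ...
full = matvec(outer(u, v), x)
# ... equals the factorized path: u * (v . x), never forming W explicitly.
dot = sum(vj * xj for vj, xj in zip(v, x))
factorized = [ui * dot for ui in u]

assert full == factorized
# Parameter count: d_out * d_in = 24 entries for the full matrix
# versus d_out + d_in = 10 for the rank-one factorization.
```

This is the sense in which the low-rank adapter "enjoys very few parameters": both the down and up projections of the adapter are stored in this factorized form.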
Compacter¶
- class CompacterModel(backbone_model, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf', reduction_factor=16, non_linearity='gelu_new', phm_c_init='normal', hypercomplex_division=4, learn_phm=True, hypercomplex_nonlinearity='glorot-uniform', shared_phm_rule=False, factorized_phm=True, shared_W_phm=False, factorized_phm_rule=False, phm_rank=1, phm_init_range=0.0001, kronecker_prod=None, use_bias_up_sampler=True, use_bias_down_sampler=True)[source]¶
The implementation of Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. Adds a compacter layer to the designated modified_modules. In the sequential paradigm, the modules' output is then passed into the compacter's post_forward.
Note
We assume the output of the modified module is the hidden state, or a tuple whose first element is the hidden state. This holds for most PLMs. However, we admit that this is currently not rigorous, and we will improve it in the next version. For now, if you encounter an error here for your backbone, you can modify the code to retrieve the hidden state.
All the hyperparameters are adopted from the Compacter code base.
- class attributes:
default_modified_modules = ["attn", "ff"] According to the Compacter paper, we add compacters to the attention layer and the feed-forward layer.
delta_type = "compacter"
- Parameters
backbone_model (transformers.PretrainedModels) – The backbone model to be modified.
modified_modules (List[str]) – The modules after which to add compacter layers.
unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the compacter parameters.
common_structure (bool, optional, default to None) – Whether to use name-based addressing with a common structure mapping.
backend (str) – The backend of the PLM: 'hf' for HuggingFace Transformers, 'bmt' for BMTrain.
reduction_factor (int, optional, default to 16) – bottleneck_dim = hidden_dim // reduction_factor.
non_linearity (str, optional, default to "gelu_new") – The non-linearity activation used between the down projector and the up projector.
phm_c_init (str, optional, default to "normal") – The initialization method of the C in compacter.
hypercomplex_division (int, optional, default to 4) – The n in the paper; the number of divisions along a dimension in compacter.
learn_phm (bool, optional, default to True) – Whether the PHM rule requires_grad. Note that we did not check the performance of learn_phm=False.
hypercomplex_nonlinearity (str, optional, default to "glorot-uniform") – The initialization method of the W in compacter.
shared_phm_rule (bool, optional, default to False) – Whether the PHM rule is shared across layers.
factorized_phm (bool, optional, default to True) – Whether to factorize the PHM into a low-rank product.
shared_W_phm (bool, optional, default to False) – Whether W_phm is shared across layers.
factorized_phm_rule (bool, optional, default to False) – Whether to factorize the PHM rule into a low-rank product.
phm_rank (int, optional, default to 1) – The rank of the low-rank decomposition of the PHM.
phm_init_range (float, optional, default to 0.0001) – The range of the PHM initialization.
kronecker_prod (bool, optional, default to False) – Whether to perform kronecker_prod in matvec_product, proposed by Parameterization of Hypercomplex Multiplications.
use_bias_up_sampler (bool, optional, default to True) – Whether to add a bias to the up projector. Note that this bias is a hidden_dim vector.
use_bias_down_sampler (bool, optional, default to True) – Whether to add a bias to the down projector. Note that this bias is a bottleneck_dim vector.
- config_class¶
alias of CompacterConfig
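At the core of Compacter, a PHM layer parameterizes each weight as a sum of Kronecker products, W = sum_i A_i ⊗ B_i, with n = hypercomplex_division terms. The sketch below (our own illustration, omitting the factorization, sharing, and initialization options above) builds such a weight for n = 2:

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    rows = []
    for arow in A:
        for brow in B:
            rows.append([a * b for a in arow for b in brow])
    return rows

def mat_add(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

n = 2  # hypercomplex_division
# "Rule" matrices A_i are n x n; the B_i carry the actual (d_out/n) x (d_in/n) shape.
A = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
B = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]

# W = A_1 (x) B_1 + A_2 (x) B_2, a 4x4 matrix built from n*(n^2 + 4) numbers.
W = kron(A[0], B[0])
for Ai, Bi in zip(A[1:], B[1:]):
    W = mat_add(W, kron(Ai, Bi))
```

For realistic layer sizes the B_i dominate the parameter count at roughly 1/n of a full weight matrix, which is where Compacter's savings come from; factorized_phm then further decomposes each B_i into a low-rank product.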
Prefix tuning¶
- class PrefixModel(backbone_model: Module, prefix_token_num=6, reparameterize=True, embed_dim: Optional[int] = 512, mid_dim: Optional[int] = 512, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False)[source]¶
The implementation of Prefix-Tuning: Optimizing Continuous Prompts for Generation. However, as the attention blocks of different PLMs differ substantially (e.g., in their input arguments and the naming convention of past_key_value), we have to implement a different prefix layer for each PLM. Given this inconvenience at the code level, we only support several commonly used backbone models (currently: T5, DistilBert, Bert, Roberta, GPT2, BART). If you are trying to apply delta tuning to other backbone models, we suggest trying other delta models, or implementing the prefix layer yourself and making a pull request.
Experimental Feature:
Support inserting prefix tokens before selected layers only, e.g., layers 3, 4, 6, and 10, leaving the other layers untouched.
Note
If reparameterization is used, the parameters live in a reparameterization network rather than in the prefix itself, and we attach that network to the first prefix layer. In the next version, we will add a function to save only the generated prefix parameters.
- Parameters
backbone_model (transformers.PretrainedModels) – The backbone model to be modified.
prefix_token_num (int) – The number of prefix tokens.
reparameterize (bool) – Whether to use reparameterization for prefix tuning.
embed_dim (int) – The embedding dimension of the prefix tokens when using reparameterization.
mid_dim (int) – The hidden dimension of the reparameterization network.
modified_modules (List[str]) – For prefix tuning, it must refer to an attention layer (currently, only the implemented ones).
unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the prefix parameters.
common_structure (bool) – Whether to use name-based addressing with a common structure mapping.
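At the shape level, prefix tuning prepends prefix_token_num learned key/value vectors to each attention layer's past_key_value, so attention runs over prefix_token_num + seq_len positions. A framework-free sketch with placeholder values (not the per-PLM prefix layers the library implements):

```python
prefix_token_num, seq_len, hidden = 6, 10, 16

# Learned prefix keys/values, one vector per prefix position (zeros as placeholders;
# in practice these are trainable, or generated by the reparameterization network).
prefix_keys   = [[0.0] * hidden for _ in range(prefix_token_num)]
prefix_values = [[0.0] * hidden for _ in range(prefix_token_num)]

# Keys/values computed from the actual input tokens by the frozen backbone.
input_keys   = [[1.0] * hidden for _ in range(seq_len)]
input_values = [[1.0] * hidden for _ in range(seq_len)]

# The prefix is concatenated in front, as if it were cached past_key_value state.
keys = prefix_keys + input_keys
values = prefix_values + input_values
attention_mask = [1] * (prefix_token_num + seq_len)  # prefix positions are attendable

assert len(keys) == len(values) == prefix_token_num + seq_len
```

Because only the prepended key/value vectors are trainable, the backbone's own projections stay frozen, and the prefix never occupies positions in the input token sequence itself.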
Soft Prompt Tuning¶
- class SoftPromptModel(backbone_model: Module, soft_token_num=100, init_range=0.5, token_init=True, other_expand_ids={'attention_mask': 1, 'token_type_ids': 0}, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False)[source]¶
This is the implementation of The Power of Scale for Parameter-Efficient Prompt Tuning. Similar to PrefixTuningTemplate, this template does not need any textual template: additional tokens are directly concatenated into the input. There are two initializations for the new tokens: (1) random initialization; (2) initialization with tokens of the PLM (we simply take the first n tokens, similar to their implementation). Note that this template can also be achieved with SoftManualTemplate, in which setting a template of n_token <soft> tokens before the <text_a> will give the same result.
- Parameters
backbone_model (transformers.PretrainedModels) – The backbone model to be modified.
soft_token_num (int, optional) – The number of new tokens to prepend to the input.
init_range (float, optional) – If the new tokens are initialized randomly, the range of the uniform distribution.
token_init (bool, optional, default to True) – Whether to initialize the new tokens with tokens of the PLM.
other_expand_ids (dict, optional, default to {'attention_mask': 1, 'token_type_ids': 0}) – The names and default fill values of the other inputs that must expand along with the input sequence. For example, when you prepend 100 tokens to input_ids, the attention_mask should be extended with 1s and the token_type_ids with 0s.
modified_modules (List[str]) – The modules to be modified.
unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the soft prompt parameters.
common_structure (bool) – Whether to use name-based addressing with a common structure mapping.
- config_class¶
alias of SoftPromptConfig
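The role of other_expand_ids can be illustrated with a small sketch (expand_inputs is a hypothetical helper for illustration, not OpenDelta API): each listed input is padded on the left with its configured fill value so that it keeps pace with the prepended soft tokens:

```python
def expand_inputs(batch, soft_token_num, other_expand_ids):
    """Prepend soft-token placeholders: pad each listed input on the left
    with its configured fill value (hypothetical helper for illustration)."""
    out = dict(batch)
    for name, fill in other_expand_ids.items():
        if name in out:
            out[name] = [fill] * soft_token_num + out[name]
    return out

batch = {"input_ids": [101, 2023, 102],
         "attention_mask": [1, 1, 1],
         "token_type_ids": [0, 0, 0]}
expanded = expand_inputs(batch, soft_token_num=4,
                         other_expand_ids={"attention_mask": 1, "token_type_ids": 0})

assert expanded["attention_mask"] == [1, 1, 1, 1, 1, 1, 1]
assert len(expanded["token_type_ids"]) == 7
# input_ids is left untouched in this sketch: only the side inputs listed in
# other_expand_ids are expanded to match the lengthened sequence.
```

Choosing fill value 1 for attention_mask marks the soft-token positions as attendable, while 0 for token_type_ids keeps them in the first segment.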