Delta Models

Lora

class LoraModel(backbone_model: Module, lora_r=8, lora_alpha=16, lora_dropout=0.0, modified_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]

The implementation of LoRA: Low-Rank Adaptation of Large Language Models. Thanks to their loralib.

Note

In our implementation, we do not use loralib.linear to replace the linear layer of the backbone model. Instead, we insert a parallel module into the backbone. In other words, we treat \((W + A^TB)X\) as \(WX + A^TBX\), and insert the \(A^TBX\) term as a parallel insertion module. If you want to use the original implementation, please refer to lora_old.py.
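As a rough illustration of this parallel decomposition (a minimal sketch, not OpenDelta's internal implementation; the module and attribute names are hypothetical), the inserted branch only computes the low-rank term that is added to the frozen linear layer's output:

```python
import torch
import torch.nn as nn

class ParallelLowRankBranch(nn.Module):
    """Sketch of the parallel LoRA branch: computes the low-rank term added to W x."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # down projection
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))        # up projection, zero-initialized
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the backbone still computes W x; this branch only returns the low-rank correction,
        # which is summed with the frozen output outside of this module
        return (self.dropout(x) @ self.lora_A.T @ self.lora_B.T) * self.scaling
```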

class attributes:

  • default_modified_modules = ['attn.q', 'attn.v'] According to the paper, LoRA modifies the q and v projection matrices in the attention layer. However, other linear layers can also be modified, which may lead to better performance.

Note

modified_modules should point to a linear layer. We currently don't support broadcasting to all linear layers among a module's child modules.

  • delta_type = “lora”

Parameters
  • backbone_model (transformers.PretrainedModels) – The backbone model to be modified.

  • lora_r (int, optional) – The rank of the LoRA parameters. The smaller lora_r is, the fewer parameters LoRA has.

  • lora_alpha (int, optional) – A hyper-parameter to control the init scale of loralib.linear.

  • lora_dropout (float, optional) – The dropout rate in loralib.linear.

  • modified_modules (List[str]) – The modules to be modified; each entry must refer to a linear layer (see the note above).

  • unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the LoRA parameters.

  • common_structure (bool) – Whether to use name-based addressing with a common structure mapping.

  • backend (str) – Choose the backend of the PLM: 'hf' for HuggingFace Transformers, 'bmt' for BMTrain.

config_class

alias of LoraConfig
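A minimal usage sketch based on the constructor documented above. The t5-base checkpoint is only an example, and freeze_module / log come from OpenDelta's general workflow rather than from this section, so treat those two calls as illustrative:

```python
from transformers import AutoModelForSeq2SeqLM
from opendelta import LoraModel

# any supported backbone works; "t5-base" is used purely as an example
backbone = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# insert parallel LoRA modules at the default positions (attn.q and attn.v)
delta_model = LoraModel(backbone_model=backbone, lora_r=8, lora_alpha=16, lora_dropout=0.0)

# typical next steps in OpenDelta's workflow: freeze everything except the delta
# parameters, then inspect the modified structure and trainable-parameter ratio
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)
delta_model.log()
```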

BitFit

class BitFitModel(backbone_model: Module, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]

The implementation of BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. It unfreezes the bias terms (or adds bias terms if they are absent from the backbone, e.g., T5) in the modules of a transformer block.

Note

Broadcast to Submodules: We modify all potential positions within the specified modified_modules. That is to say, if attn is specified in modified_modules, then bias terms are added to (or unfrozen in) all positions, including the q, k, v, and output linear layers of the attention layer. The potential positions are determined according to equations (1)-(5) and the preceding three equations.

class attributes:
  • default_modified_modules = ["attn", "ff", "layer_norm", "lm_head.proj"] According to the paper and the implementation in Compacter's baseline, we modify the bias terms in the above modules.

  • delta_type = “bitfit”

Parameters
  • backbone_model (transformers.PretrainedModels) – The backbone model to be modified.

  • modified_modules (List[str]) – The modules to be modified; bias terms in all potential positions within them are added or unfrozen (see the note above).

  • unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the bias parameters.

  • common_structure (bool) – Whether to use name-based addressing with a common structure mapping.

config_class

alias of BitFitConfig

add_bias_to_modules_have_bias_or_known_type(c)[source]

If the module has a bias, unfreeze it. If it doesn't have a bias: if it is a Linear or LayerNorm layer, add a bias to it; otherwise, do nothing.
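A rough sketch of this rule as a standalone helper (hypothetical function, not the actual method body):

```python
import torch
import torch.nn as nn

def unfreeze_or_add_bias(module: nn.Module) -> None:
    """Sketch of the BitFit rule applied to a single submodule."""
    if getattr(module, "bias", None) is not None:
        module.bias.requires_grad = True                                  # bias exists: unfreeze it
    elif isinstance(module, nn.Linear):
        module.bias = nn.Parameter(torch.zeros(module.out_features))      # no bias: add a trainable one
    elif isinstance(module, nn.LayerNorm):
        module.bias = nn.Parameter(torch.zeros(module.normalized_shape))  # no bias: add a trainable one
    # any other module type without a bias is left unchanged
```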

detach(module)[source]

Not implemented for BitFit yet. Please wait for the next version.

attach(module)[source]

Not implemented for BitFit yet. Please wait for the next version.

Adapter

class AdapterModel(backbone_model: Module, bottleneck_dim: Optional[int] = 24, non_linearity: Optional[str] = 'gelu_new', modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[bool] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]

The implementation of Adapter (Parameter-Efficient Transfer Learning for NLP). It adds adapters to the designated modified_modules. In the sequential paradigm, each module's output is passed into the adapter's post_forward.

Note

We assume the output of the modified module is the hidden state, or a tuple whose first element is the hidden state. This is true for most PLMs. However, we admit that this is currently not rigorous, and we will improve it in the next version. For now, if you encounter an error here for your backbone, you can modify the code to extract the hidden state.
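A minimal sketch of such a sequential bottleneck adapter, assuming the hidden-state convention described in the note; the class and nn.GELU (standing in for 'gelu_new') are illustrative, not OpenDelta's actual implementation:

```python
import torch
import torch.nn as nn

class BottleneckAdapterSketch(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add the residual."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 24):
        super().__init__()
        self.down_proj = nn.Linear(hidden_dim, bottleneck_dim)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck_dim, hidden_dim)

    def post_forward(self, output):
        # the modified module may return the hidden state directly, or a tuple
        # whose first element is the hidden state (see the note above)
        hiddens = output[0] if isinstance(output, tuple) else output
        adapted = hiddens + self.up_proj(self.activation(self.down_proj(hiddens)))
        if isinstance(output, tuple):
            return (adapted,) + output[1:]
        return adapted
```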

class attributes:
  • default_modified_modules = ["attn", "ff"] According to the Adapter paper, we add adapters to the attention layer and the feed-forward layer.

  • delta_type = “adapter”

Parameters
  • backbone_model (transformers.PretrainedModels) – The backbone model to be modified.

  • bottleneck_dim (int) – The dimension of the adapter’s bottleneck.

  • non_linearity (str) – The non-linearity of the adapter.

  • modified_modules (List[str]) – The modules after which adapters are added.

  • unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the adapter parameters.

  • common_structure (bool) – Whether to use name-based addressing with a common structure mapping.

  • backend (str) – Choose the backend of the PLM: 'hf' for HuggingFace Transformers, 'bmt' for BMTrain.

config_class

alias of AdapterConfig

LowRankAdapter

class LowRankAdapterModel(backbone_model: Module, reduction_factor=32, non_linearity='gelu_new', low_rank_w_init='glorot-uniform', low_rank_rank=1, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf')[source]

The implementation of LowRankAdapter, proposed as a baseline in Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. We found that it uses very few parameters yet achieves competitive performance, so we added it to OpenDelta. The low-rank adapter parameterizes each adapter weight as a product of two low-rank (rank-one) weights; a sketch is given after the note below.

It adds low-rank adapter layers to the designated modified_modules. In the sequential paradigm, each module's output is passed into the low-rank adapter's post_forward.

Note

We assume the output of the modified module is the hidden state, or a tuple whose first element is the hidden state. This is true for most PLMs. However, we admit that this is currently not rigorous, and we will improve it in the next version. For now, if you encounter an error here for your backbone, you can modify the code to extract the hidden state.

All the hyper-parameters are adopted from the Compacter code base.
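As referenced above, a minimal sketch of a projection whose weight is parameterized as a product of two low-rank factors (hypothetical module names; the real implementation also handles initialization via low_rank_w_init):

```python
import torch
import torch.nn as nn

class LowRankLinearSketch(nn.Module):
    """A linear projection whose weight is the product of two low-rank factors."""
    def __init__(self, in_features: int, out_features: int, rank: int = 1):
        super().__init__()
        self.left = nn.Parameter(torch.randn(out_features, rank) * 0.02)   # factor U
        self.right = nn.Parameter(torch.randn(rank, in_features) * 0.02)   # factor V
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.left @ self.right   # full (out_features, in_features) weight of rank <= `rank`
        return nn.functional.linear(x, weight, self.bias)

# a low-rank adapter uses two such projections (down and up) around a non-linearity,
# with bottleneck_dim = hidden_dim // reduction_factor, and adds the result residually
```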

class attributes:
  • default_modified_modules = ["attn", "ff"] According to the Compacter paper, we add low-rank adapters to the attention layer and the feed-forward layer.

  • delta_type = “lowrankadapter”

Parameters
  • backbone_model (transformers.PretrainedModels) – The backbone model to be modified.

  • reduction_factor (int, optional, default to 32) – The bottleneck dimension is computed as bottleneck_dim = hidden_dim // reduction_factor.

  • non_linearity (str, optional, default to "gelu_new") – The non-linearity activation used between the down projector and the up projector.

  • low_rank_w_init (str, optional, default to "glorot-uniform") – The weight initialization method of the factorized linear weights.

  • low_rank_rank (int, optional, default to 1) – The rank of the low-rank decomposition.

  • modified_modules (List[str]) – The modules after which low-rank adapters are added.

  • unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the low-rank adapter parameters.

  • common_structure (bool, optional, default to None) – Whether to use name-based addressing with a common structure mapping.

config_class

alias of LowRankAdapterConfig

Compacter

class CompacterModel(backbone_model, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False, backend: Optional[str] = 'hf', reduction_factor=16, non_linearity='gelu_new', phm_c_init='normal', hypercomplex_division=4, learn_phm=True, hypercomplex_nonlinearity='glorot-uniform', shared_phm_rule=False, factorized_phm=True, shared_W_phm=False, factorized_phm_rule=False, phm_rank=1, phm_init_range=0.0001, kronecker_prod=None, use_bias_up_sampler=True, use_bias_down_sampler=True)[source]

The implementation of Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. It adds compacter layers to the designated modified_modules. In the sequential paradigm, each module's output is passed into the compacter's post_forward.

Note

We assume the output of the modified module is the hidden state, or a tuple whose first element is the hidden state. This is true for most PLMs. However, we admit that this is currently not rigorous, and we will improve it in the next version. For now, if you encounter an error here for your backbone, you can modify the code to extract the hidden state.

All the hyper-parameters are adopted from the Compacter code base.
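A minimal sketch of the parameterized hypercomplex multiplication (PHM) idea behind the compacter layer: the full projection weight is a sum of Kronecker products between small rule matrices and per-layer matrices. The module below is a simplified illustration; the actual implementation additionally supports sharing (shared_phm_rule, shared_W_phm) and low-rank factorization (factorized_phm, factorized_phm_rule) of these factors:

```python
import torch
import torch.nn as nn

class PHMLinearSketch(nn.Module):
    """W = sum_i kron(A_i, B_i), with A_i of shape (n, n) and B_i of shape (out/n, in/n)."""
    def __init__(self, in_features: int, out_features: int, n: int = 4):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.phm_rule = nn.Parameter(torch.randn(n, n, n) * 0.01)                          # the A_i (rule)
        self.W = nn.Parameter(torch.randn(n, out_features // n, in_features // n) * 0.01)  # the B_i
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the sum of Kronecker products reconstructs the (out_features, in_features) weight
        weight = sum(torch.kron(self.phm_rule[i], self.W[i]) for i in range(self.phm_rule.shape[0]))
        return nn.functional.linear(x, weight, self.bias)
```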

class attributes:
  • default_modified_modules = ["attn", "ff"] According to the Compacter paper, we add compacters to the attention layer and the feed-forward layer.

  • delta_type = “compacter”

Parameters
  • backbone_model (transformers.PretrainedModels) – The backbone model to be modified.

  • modified_modules (List[str]) – The modules after which compacter layers are added.

  • unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the compacter parameters.

  • common_structure (bool, optional, default to None) – Whether to use name-based addressing with a common structure mapping.

  • backend (str) – Choose the backend of the PLM: 'hf' for HuggingFace Transformers, 'bmt' for BMTrain.

  • reduction_factor (int, optional, default to 16) – The bottleneck dimension is computed as bottleneck_dim = hidden_dim // reduction_factor.

  • non_linearity (str, optional, default to "gelu_new") – The non-linearity activation used between the down projector and the up projector.

  • phm_c_init (str, optional, default to "normal") – The initialization method of the C in compacter.

  • hypercomplex_division (int, optional, default to 4) – The n in the paper: the number of divisions along a dimension in compacter.

  • learn_phm (bool, optional, default to True) – Whether the phm rule has requires_grad=True. Note that we didn't check the performance of learn_phm=False.

  • hypercomplex_nonlinearity (str, optional, default to "glorot-uniform") – The initialization method of the W in compacter.

  • shared_phm_rule (bool, optional, default to False) – Whether the phm rule is shared across layers.

  • factorized_phm (bool, optional, default to True) – Whether to factorize the phm into a low-rank product.

  • shared_W_phm (bool, optional, default to False) – Whether W_phm is shared across layers.

  • factorized_phm_rule (bool, optional, default to False) – Whether to factorize the phm rule into a low-rank product.

  • phm_rank (int, optional, default to 1) – The rank of the low-rank decomposition of the phm.

  • phm_init_range (float, optional, default to 0.0001) – The range of the phm initialization.

  • kronecker_prod (bool, optional, default to False) – Whether to perform the Kronecker product in matvec_product, as proposed by Parameterization of Hypercomplex Multiplications.

  • use_bias_up_sampler (bool, optional, default to True) – Whether to add a bias to the up projector. Note that this bias is a hidden_dim vector.

  • use_bias_down_sampler (bool, optional, default to True) – Whether to add a bias to the down projector. Note that this bias is a bottleneck_dim vector.

config_class

alias of CompacterConfig

Prefix tuning

class PrefixModel(backbone_model: Module, prefix_token_num=6, reparameterize=True, embed_dim: Optional[int] = 512, mid_dim: Optional[int] = 512, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False)[source]

The implementation of Prefix-Tuning: Optimizing Continuous Prompts for Generation. However, as the attention blocks of different PLMs differ substantially (e.g., in their input arguments and the naming convention of past_key_value), we have to implement a different prefix layer for each PLM. Given this inconvenience at the code level, we only support several commonly used backbone models (currently: T5, DistilBERT, BERT, RoBERTa, GPT-2, BART). If you are trying to apply delta tuning to other backbone models, we suggest trying other delta models, or implementing it yourself and making a pull request.

Experimental Feature:

Support inserting prefix tokens before only some of the layers, for example before layers 3, 4, 6, and 10, leaving the other layers untouched.

Note

If reparameterize is used, the trainable parameters reside in a reparameterization network rather than in the prefix itself; we attach this network to the first prefix layer. We will add a function to save only the generated prefix parameters in the next version.
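A minimal sketch of the reparameterization idea, assuming a generic multi-head attention layout; the shapes, names, and MLP structure below are illustrative, since the actual prefix layers are backbone-specific as noted above:

```python
import torch
import torch.nn as nn

class PrefixReparameterizationSketch(nn.Module):
    """Maps trainable prefix embeddings to per-layer key/value prefixes."""
    def __init__(self, prefix_token_num: int = 6, embed_dim: int = 512, mid_dim: int = 512,
                 num_layers: int = 12, num_heads: int = 12, head_dim: int = 64):
        super().__init__()
        self.prefix_embedding = nn.Parameter(torch.randn(prefix_token_num, embed_dim) * 0.02)
        # the reparameterization network holds the trainable parameters
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mid_dim),
            nn.Tanh(),
            nn.Linear(mid_dim, num_layers * 2 * num_heads * head_dim),
        )
        self.num_layers, self.num_heads, self.head_dim = num_layers, num_heads, head_dim

    def forward(self, batch_size: int):
        # (prefix_token_num, num_layers * 2 * num_heads * head_dim)
        prefix = self.mlp(self.prefix_embedding)
        prefix = prefix.view(-1, self.num_layers, 2, self.num_heads, self.head_dim)
        # one (key, value) pair per layer, each of shape (batch, heads, prefix_len, head_dim),
        # to be prepended to the attention's past_key_value
        past_key_values = [
            (prefix[:, l, 0].permute(1, 0, 2).unsqueeze(0).expand(batch_size, -1, -1, -1),
             prefix[:, l, 1].permute(1, 0, 2).unsqueeze(0).expand(batch_size, -1, -1, -1))
            for l in range(self.num_layers)
        ]
        return past_key_values
```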

Parameters
  • backbone_model (transformers.PretrainedModels) – The backbone model to be modified.

  • prefix_token_num (int) – The number of prefix tokens.

  • reparameterize (bool) – Whether to use reparameterization for prefix tuning.

  • embed_dim (int) – The embedding dimension of the prefix tokens when using reparameterization.

  • mid_dim (int) – The hidden dimension of the reparameterization network.

  • modified_modules (List[str]) – For prefix tuning, each entry must refer to an attention layer (currently, only the implemented ones).

  • unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the prefix parameters.

  • common_structure (bool) – Whether to use name-based addressing with a common structure mapping.

Soft Prompt Tuning

class SoftPromptModel(backbone_model: Module, soft_token_num=100, init_range=0.5, token_init=True, other_expand_ids={'attention_mask': 1, 'token_type_ids': 0}, modified_modules: Optional[List[str]] = None, exclude_modules: Optional[List[str]] = None, unfrozen_modules: Optional[List[str]] = None, common_structure: Optional[bool] = None, interactive_modify: Optional[Union[bool, int]] = False)[source]

This is the implementation of The Power of Scale for Parameter-Efficient Prompt Tuning. Similar to PrefixTuningTemplate, this template does not need any textual template; additional tokens are directly concatenated into the input ids. There are two ways to initialize the new tokens: (1) random initialization; (2) initialization with the tokens of the PLM (we simply take the first n_tokens, similar to their implementation).

Note that this template can also be achieved with SoftManualTemplate: setting n_token <soft> tokens before <text_a> in the template gives the same result.

Parameters
  • backbone_model (transformers.PretrainedModels) – The backbone model to be modified.

  • soft_token_num (int, optional) – The number of new tokens to add in front of the input.

  • init_range (float, optional) – The range of the uniform distribution used when the new tokens are initialized randomly.

  • token_init (bool, optional, default to True) – Whether to initialize the new tokens with tokens of the PLM.

  • other_expand_ids (dict, optional, default to {'attention_mask': 1, 'token_type_ids': 0}) – The names and default values of other inputs that should expand along with the input sequence. For example, when you prepend 100 tokens to the input_ids, the attention_mask and the token_type_ids should be extended as well.

  • modified_modules (List[str]) – The modules to be modified.

  • unfrozen_modules (List[str], optional, default to None) – The modules that should be unfrozen together with the soft prompt parameters.

  • common_structure (bool) – Whether to use name-based addressing with a common structure mapping.

config_class

alias of SoftPromptConfig
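A minimal sketch of what soft prompt tuning does to the model inputs (illustrative only; tensor names follow common HuggingFace conventions, and the attention_mask expansion mirrors the role of other_expand_ids):

```python
import torch
import torch.nn as nn

class SoftPromptSketch(nn.Module):
    """Prepends trainable soft-token embeddings to the input embeddings."""
    def __init__(self, word_embeddings: nn.Embedding, soft_token_num: int = 100,
                 init_range: float = 0.5, token_init: bool = True):
        super().__init__()
        if token_init:
            # initialize from the first soft_token_num rows of the PLM's embedding table
            init = word_embeddings.weight[:soft_token_num].detach().clone()
        else:
            init = torch.empty(soft_token_num, word_embeddings.embedding_dim).uniform_(-init_range, init_range)
        self.soft_embeds = nn.Parameter(init)
        self.word_embeddings = word_embeddings

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        batch_size = input_ids.shape[0]
        inputs_embeds = self.word_embeddings(input_ids)
        soft = self.soft_embeds.unsqueeze(0).expand(batch_size, -1, -1)
        inputs_embeds = torch.cat([soft, inputs_embeds], dim=1)  # prepend the soft tokens
        # expand attention_mask with ones for the new positions (cf. other_expand_ids)
        pad = torch.ones(batch_size, self.soft_embeds.shape[0],
                         dtype=attention_mask.dtype, device=attention_mask.device)
        attention_mask = torch.cat([pad, attention_mask], dim=1)
        return inputs_embeds, attention_mask
```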