Merge pull request #2 from TobiasHeOl/seq_input
updated sequence input documentation
TobiasHeOl authored Feb 25, 2024
2 parents 1a2aa7b + 4b15f10 commit 3176f0a
Showing 2 changed files with 72 additions and 65 deletions.
18 changes: 10 additions & 8 deletions README.md
@@ -5,12 +5,14 @@

# AbLang-2
## Addressing the antibody germline bias and its effect on language models for improved antibody design

[![DOI:10.1101/2022.01.20.477061](http://img.shields.io/badge/DOI-10.1101/2022.01.20.477061-B31B1B.svg)](https://doi.org/10.1101/2024.02.02.578678)

</div>

**Motivation:** The versatile pathogen-binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive and time-consuming task, with the final antibody needing to not only have strong and specific binding, but also be minimally impacted by any developability issues. The success of transformer-based language models for language tasks and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody discovery and design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, or from a small number of mutations away from the germline outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias towards germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline.
**Motivation:** The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive and time-consuming task, with the final antibody needing to not only have strong and specific binding, but also be minimally impacted by any developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody discovery and design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a small number of mutations away from the germline outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias towards germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline.

**Results:** In this study, we explored the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We developed and trained a series of new antibody-specific language models optimised for predicting non-germline residues. We then compared our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability. AbLang-2 is trained on both unpaired and paired data, and is the first freely available paired VH-VL language model (https://github.com/oxpig/AbLang2.git).
**Results:** In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimised for predicting non-germline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability. AbLang-2 is trained on both unpaired and paired data, and is freely available (https://github.com/oxpig/AbLang2.git).

**Availability and implementation:** AbLang2 is a python package available at https://github.com/oxpig/AbLang2.git.

@@ -57,12 +59,12 @@ import ablang2
ablang = ablang2.pretrained(model_to_use='ablang2-paired', random_init=False, ncpu=1, device='cpu')
seq = [
'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS',
'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK'
'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS', # The heavy chain (VH) needs to be the first element
'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK' # The light chain (VL) needs to be the second element
]
# Tokenize input sequences
seqs = [f"{seq[0]}|{seq[1]}"] # Input needs to be a list
seqs = [f"{seq[0]}|{seq[1]}"] # Input needs to be a list, with | used to separate the VH and VL
tokenized_seq = ablang.tokenizer(seqs, pad=True, w_extra_tkns=False, device="cpu")
# Generate rescodings
@@ -74,7 +76,7 @@ with torch.no_grad():
likelihoods = ablang.AbLang(tokenized_seq)
```
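The input convention used above (VH first, VL second, `|` as the separator, `*` for masked residues, and an empty string for a chain that is not known) can be sketched with two small helpers. This is only an illustrative sketch: `format_record`, `mask_positions` and `ALLOWED` are hypothetical names that are not part of ablang2, the sequences are shortened toy fragments, and whether the tokenizer accepts a record with an empty side exactly as written here is an assumption.

```python
# Hypothetical helpers (not part of ablang2) for preparing [VH, VL] records.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY*")  # 20 standard amino acids plus the '*' mask


def format_record(record):
    """Join a [VH, VL] record into one 'VH|VL' string, VH first."""
    if len(record) != 2:
        raise ValueError("expected [VH, VL]; use '' for a chain that is not known")
    vh, vl = record
    for name, chain in (("VH", vh), ("VL", vl)):
        unexpected = set(chain) - ALLOWED
        if unexpected:
            raise ValueError(f"{name} contains unexpected characters: {sorted(unexpected)}")
    return f"{vh}|{vl}"


def mask_positions(chain, positions):
    """Return a copy of chain with the given 0-based positions replaced by '*'."""
    residues = list(chain)
    for index in positions:
        residues[index] = "*"
    return "".join(residues)


vh = "EVQLLESGG"  # shortened toy fragment, for illustration only
vl = "DIQLTQSPL"
masked_vh = mask_positions(vh, [3, 4, 5])
paired = format_record([masked_vh, vl])
print(paired)  # EVQ***SGG|DIQLTQSPL
```

Validating the characters before joining makes a swapped or malformed chain fail early, rather than producing a tokenization that is hard to interpret.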

**We have built a wrapper for specific use cases, which can be explored via the following [Jupyter notebook](https://github.com/TobiasHeOl/AbLang2/blob/main/notebooks/pretrained_module.ipynb).**
**We have built a wrapper for specific use cases, which can be explored via the following [Jupyter notebook](https://github.com/oxpig/AbLang2/blob/main/notebooks/pretrained_module.ipynb).**



@@ -83,7 +85,7 @@ with torch.no_grad():
@article{Olsen2024,
title={Addressing the antibody germline bias and its effect on language models for improved antibody design},
author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
journal={in-preparation},
doi={},
journal={bioRxiv},
doi={10.1101/2024.02.02.578678},
year={2024}
}
119 changes: 62 additions & 57 deletions notebooks/pretrained_module.ipynb
@@ -15,6 +15,20 @@
"import ablang2"
]
},
{
"cell_type": "markdown",
"id": "10801511-770d-46ac-a15d-a02d4ef9ec87",
"metadata": {},
"source": [
"# **0. Sequence input and its format**\n",
"\n",
"AbLang2 takes as input either the individual heavy variable domain (VH), light variable domain (VL), or the full variable domain (Fv).\n",
"\n",
"Each record (antibody) needs to be a list with the VH as the first element and the VL as the second. If either the VH or VL is not known, leave an empty string.\n",
"\n",
"An asterisk (\\*) is used for masking. It is recommended to mask residues which you are interested in mutating."
]
},
{
"cell_type": "code",
"execution_count": 2,
@@ -23,23 +37,23 @@
"outputs": [],
"source": [
"seq1 = [\n",
" 'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS',\n",
" 'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK'\n",
" 'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS', # VH sequence\n",
" 'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK' # VL sequence\n",
"]\n",
"seq2 = [\n",
" 'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTT',\n",
" 'PVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK'\n",
"]\n",
"seq3 = [\n",
" 'EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS',\n",
" ''\n",
" '' # The VL sequence is not known, so an empty string is left instead. \n",
"]\n",
"seq4 = [\n",
" '',\n",
" 'DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKISNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK'\n",
"]\n",
"seq5 = [\n",
" 'EVQ***SGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCAR**PGHGAAFMDVWGTGTTVTVSS',\n",
" 'EVQ***SGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCAR**PGHGAAFMDVWGTGTTVTVSS', # (*) is used to mask certain residues\n",
" 'DIQLTQSPLSLPVTLGQPASISCRSS*SLEASDTNIYLSWFQQRPGQSPRRLIYKI*NRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK'\n",
"]\n",
"\n",
@@ -59,7 +73,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 3,
"id": "0e7419e4-db22-49ea-8e12-6db2b3681545",
"metadata": {},
"outputs": [],
@@ -68,7 +82,7 @@
"ablang = ablang2.pretrained(model_to_use='ablang2-paired', random_init=False, ncpu=1, device='cpu')\n",
"\n",
"# Tokenize input sequences\n",
"seq = f\"{seq1[0]}|{seq1[1]}\"\n",
"seq = f\"{seq1[0]}|{seq1[1]}\" # VH first, VL second, with | used to separate the two sequences\n",
"tokenized_seq = ablang.tokenizer([seq], pad=True, w_extra_tkns=False, device=\"cpu\")\n",
" \n",
"# Generate rescodings\n",
@@ -303,58 +317,49 @@
" '101 ' '102 ' '103 ' '104 ' '105 ' '106 ' '107 ' '108 ' '109 ' '114 '\n",
" '115 ' '116 ' '117 ' '118 ' '119 ' '120 ' '121 ' '122 ' '123 ' '124 '\n",
" '125 ' '126 ' '127 ' '>']\n",
"['<EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS>|<DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKI-SNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK>', '<EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTT----->|<-----------PVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKI-SNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK>', '<------SGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCAR**PGHGAAFMDVWGTGTTVTVSS>|<DIQLTQSPLSLPVTLGQPASISCRSS*SLEASDTNIYLSWFQQRPGQSPRRLIYKI*N-RDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK>']\n"
"['<EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTTVTVSS>|<DIQLTQSPLSLPVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKI-SNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK>', '<EVQLLESGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCARDVPGHGAAFMDVWGTGTT----->|<-----------PVTLGQPASISCRSSQSLEASDTNIYLSWFQQRPGQSPRRLIYKI-SNRDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK>', '<------SGGEVKKPGASVKVSCRASGYTFRNYGLTWVRQAPGQGLEWMGWISAYNGNTNYAQKFQGRVTLTTDTSTSTAYMELRSLRSDDTAVYFCAR**PGHGAAFMDVWGTGTTVTVSS>|<DIQLTQSPLSLPVTLGQPASISCRSS*SLEASDTNIYLSWFQQRPGQSPRRLIYKI*N-RDSGVPDRFSGSGSGTHFTLRISRVEADDVAVYYCMQGTHWPPAFGQGTKVDIK>']\n",
"[[[ 9.31621552 -3.42184424 -3.59398293 ... -14.73707485 -6.8935895\n",
" -0.23662716]\n",
" [ -3.54718328 -5.8486681 -4.02423763 ... -12.9396677 -9.56145287\n",
" -4.48474121]\n",
" [-11.94997597 -2.2455442 -5.69481659 ... -15.1963892 -17.97455025\n",
" -12.56952667]\n",
" ...\n",
" [ -8.94505119 -0.42261413 -4.95588017 ... -16.66817665 -15.22247696\n",
" -10.37267685]\n",
" [-11.65150261 -5.44477367 -2.95585799 ... -16.25555801 -9.75158882\n",
" -11.75897026]\n",
" [ 1.79469967 -1.95846725 -3.59784651 ... -14.95585823 -7.47080421\n",
" -0.95226705]]\n",
"\n",
" [[ 8.55518723 -3.83663583 -2.33596039 ... -13.87456799 -8.14840603\n",
" -0.42472461]\n",
" [ -4.4070158 -5.53201628 -3.69397473 ... -12.97877884 -9.86258984\n",
" -4.95414734]\n",
" [-11.95642948 -3.86210847 -5.80935097 ... -14.89213085 -16.94556236\n",
" -11.36959457]\n",
" ...\n",
" [ -7.75924206 -0.66524088 -4.08643246 ... -16.16580582 -14.76506901\n",
" -8.35070801]\n",
" [-11.91039467 -4.86995649 -2.74777317 ... -16.07694817 -8.44974518\n",
" -10.45223522]\n",
" [ 0.86006927 -2.37964129 -3.58130884 ... -15.35423565 -7.7303524\n",
" -1.11989462]]\n",
"\n",
" [[ -4.37903118 -7.55587101 1.21958244 ... -15.48622799 -6.02184772\n",
" -3.7964797 ]\n",
" [ 0. 0. 0. ... 0. 0.\n",
" 0. ]\n",
" [ 0. 0. 0. ... 0. 0.\n",
" 0. ]\n",
" ...\n",
" [ -8.94207573 -0.51090133 -5.09760666 ... -16.69521904 -15.45450687\n",
" -10.50823021]\n",
" [-11.92355251 -5.55152798 -2.87667084 ... -16.40608025 -10.19431782\n",
" -12.13288021]\n",
" [ 2.42199802 -2.01573205 -3.61701035 ... -14.9590435 -7.19029284\n",
" -0.89830101]]]\n"
]
},
{
"data": {
"text/plain": [
"array([[[ 9.31621552, -3.42184424, -3.59398293, ..., -14.73707485,\n",
" -6.8935895 , -0.23662716],\n",
" [ -3.54718328, -5.8486681 , -4.02423763, ..., -12.9396677 ,\n",
" -9.56145287, -4.48474121],\n",
" [-11.94997597, -2.2455442 , -5.69481659, ..., -15.1963892 ,\n",
" -17.97455025, -12.56952667],\n",
" ...,\n",
" [ -8.94505119, -0.42261413, -4.95588017, ..., -16.66817665,\n",
" -15.22247696, -10.37267685],\n",
" [-11.65150261, -5.44477367, -2.95585799, ..., -16.25555801,\n",
" -9.75158882, -11.75897026],\n",
" [ 1.79469967, -1.95846725, -3.59784651, ..., -14.95585823,\n",
" -7.47080421, -0.95226705]],\n",
"\n",
" [[ 8.55518723, -3.83663583, -2.33596039, ..., -13.87456799,\n",
" -8.14840603, -0.42472461],\n",
" [ -4.4070158 , -5.53201628, -3.69397473, ..., -12.97877884,\n",
" -9.86258984, -4.95414734],\n",
" [-11.95642948, -3.86210847, -5.80935097, ..., -14.89213085,\n",
" -16.94556236, -11.36959457],\n",
" ...,\n",
" [ -7.75924206, -0.66524088, -4.08643246, ..., -16.16580582,\n",
" -14.76506901, -8.35070801],\n",
" [-11.91039467, -4.86995649, -2.74777317, ..., -16.07694817,\n",
" -8.44974518, -10.45223522],\n",
" [ 0.86006927, -2.37964129, -3.58130884, ..., -15.35423565,\n",
" -7.7303524 , -1.11989462]],\n",
"\n",
" [[ -4.37903118, -7.55587101, 1.21958244, ..., -15.48622799,\n",
" -6.02184772, -3.7964797 ],\n",
" [ 0. , 0. , 0. , ..., 0. ,\n",
" 0. , 0. ],\n",
" [ 0. , 0. , 0. , ..., 0. ,\n",
" 0. , 0. ],\n",
" ...,\n",
" [ -8.94207573, -0.51090133, -5.09760666, ..., -16.69521904,\n",
" -15.45450687, -10.50823021],\n",
" [-11.92355251, -5.55152798, -2.87667084, ..., -16.40608025,\n",
" -10.19431782, -12.13288021],\n",
" [ 2.42199802, -2.01573205, -3.61701035, ..., -14.9590435 ,\n",
" -7.19029284, -0.89830101]]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [