Vitale, R., Bugnon, L. A., Fenoy, E. L., Milone, D. H. & Stegmayer, G. Evaluating large language models for annotating proteins. Brief. Bioinform. 25, bbae177 (2024).
Quintana, F., Treangen, T. & Kavraki, L. Leveraging large language models for predicting microbial virulence from protein structure and sequence. In Proc. 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 103 (Association for Computing Machinery, 2023).
Zhou, K., Lei, C., Zheng, J., Huang, Y. & Zhang, Z. Pre-trained protein language model sheds new light on the prediction of Arabidopsis protein–protein interactions. Plant Methods 19, 141 (2023).
Snider, J. et al. Fundamentals of protein interaction network mapping. Mol. Syst. Biol. 11, 848 (2015).
Cafarelli, T. M. et al. Mapping, modeling, and characterization of protein-protein interactions on a proteomic scale. Curr. Opin. Struct. Biol. 44, 201–210 (2017).
Low, T. Y. et al. Recent progress in mass spectrometry-based strategies for elucidating protein-protein interactions. Cell. Mol. Life Sci. 78, 5325–5339 (2021).
Szklarczyk, D. et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 51, D638–D646 (2022).
Park, Y. & Marcotte, E. M. Flaws in evaluation schemes for pair-input computational predictions. Nat. Methods 9, 1134–1136 (2012).
Bernett, J., Blumenthal, D. B. & List, M. Cracking the black box of deep sequence-based protein-protein interaction prediction. Brief. Bioinform. 25, bbae076 (2024).
Hamp, T. & Rost, B. More challenges for machine-learning protein interactions. Bioinformatics 31, 1521–1525 (2015).
Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110 (2022).
Elnaggar, A. et al. ProtTrans: towards understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Bepler, T. & Berger, B. Learning the protein language: evolution, structure, and function. Cell Syst. 12, 654–669.e3 (2021).
Verkuil, R. et al. Language models generalize beyond natural proteins. Preprint at bioRxiv (2022).
Szymborski, J. & Emad, A. RAPPPID: towards generalizable protein interaction prediction with AWD-LSTM twin networks. Bioinformatics 38, 3958–3967 (2022).
Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314 (2019).
Sledzieski, S., Singh, R., Cowen, L. & Berger, B. D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions. Cell Syst. 12, 969–982.e6 (2021).
Richoux, F., Servantie, C., Borès, C. & Téletchéa, S. Comparing two deep learning sequence-based models for protein–protein interaction prediction. Preprint at (2019).
Li, Y. & Ilie, L. SPRINT: ultrafast protein-protein interaction prediction of the entire human interactome. BMC Bioinformatics 18, 485 (2017).
Iandola, F. N., Shaw, A. E., Krishna, R. & Keutzer, K. W. SqueezeBERT: what can computer vision teach NLP about efficient neural networks? Preprint at (2020).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at (2019).
The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Järvelin, K. & Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 422–446 (2002).
Wu, X.-Z. & Zhou, Z.-H. A unified view of multi-label performance measures. Preprint at (2016).
McHugh, M. L. Interrater reliability: the kappa statistic. Biochem. Med. 22, 276–282 (2012).
Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46 (1960).
Fallon, T. R. et al. Giant polyketide synthase enzymes in the biosynthesis of giant marine polyether toxins. Science 385, 671–678 (2024).
Gordon, D. E. et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature 583, 459–468 (2020).
Jankauskaitė, J., Jiménez-García, B., Dapkūnas, J., Fernández-Recio, J. & Moal, I. H. SKEMPI 2.0: an updated benchmark of changes in protein–protein binding energy, kinetics and thermodynamics upon mutation. Bioinformatics 35, 462–469 (2019).
Szymborski, J. & Emad, A. INTREPPPID—an orthologue-informed quintuplet network for cross-species prediction of protein–protein interaction. Brief. Bioinform. 25, bbae405 (2024).
Anfinsen, C. B. Principles that govern the folding of protein chains. Science 181, 223–230 (1973).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Bolouri, N., Szymborski, J. & Emad, A. Multi-modal protein representation learning with CLASP. Preprint at bioRxiv (2025).
Szymborski, J. Datasets used in the INTREPPPID manuscript. Zenodo (2024).
Szymborski, J. Emad-COMBINE-lab/ppi_origami: preprint V1. Zenodo (2024).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Iandola, F. N. et al. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5MB model size. Preprint at (2016).
Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). Preprint at (2023).
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proc. 7th International Conference on Learning Representations (2019).
Smith, L. N. & Topin, N. Super-convergence: very fast training of neural networks using large learning rates. Preprint at (2018).
Misra, D. Mish: a self regularized non-monotonic activation function. Preprint at (2019).
Wan, L., Zeiler, M., Zhang, S., Le Cun, Y. & Fergus, R. Regularization of neural networks using DropConnect. In Proc. 30th International Conference on Machine Learning 1058–1066 (Proceedings of Machine Learning Research, 2013).
Ridnik, T. et al. Asymmetric loss for multi-label classification. In Proc. 2021 IEEE/CVF International Conference on Computer Vision 82–91 (IEEE, 2021).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proc. 2017 IEEE International Conference on Computer Vision 2999–3007 (IEEE, 2017).
Strokach, A., Lu, T. Y. & Kim, P. M. ELASPIC2 (EL2): combining contextualized language models and graph neural networks to predict effects of mutations. J. Mol. Biol. 433, 166810 (2021).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Szymborski, J. Data for “A flaw in using pre-trained pLMs in protein-protein interaction inference models”. Zenodo (2025).
Szymborski, J. Emad-COMBINE-lab/pllm-ppi-data-leakage: v1. Zenodo (2025).



