{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T21:09:55Z","timestamp":1769202595035,"version":"3.49.0"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T00:00:00Z","timestamp":1740787200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T00:00:00Z","timestamp":1740787200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"UKRI Medical Research Council (MRC) career development fellowship","award":["MR\/V010182\/1"],"award-info":[{"award-number":["MR\/V010182\/1"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>The volume of protein sequence data has grown exponentially in recent years, driven by advancements in metagenomics. Despite this, a substantial proportion of these sequences remain poorly annotated, underscoring the need for robust bioinformatics tools to facilitate efficient characterisation and annotation for functional studies.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>We present PyPropel, a Python-based computational tool developed to streamline the large-scale analysis of protein data, with a particular focus on applications in machine learning. 
PyPropel integrates sequence and structural data pre-processing, feature generation, and post-processing for model performance evaluation and visualisation, offering a comprehensive solution for handling complex protein datasets.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>PyPropel provides added value over existing tools by offering a unified workflow that encompasses the full spectrum of protein research, from raw data pre-processing to functional annotation and model performance analysis, thereby supporting efficient protein function studies.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/s12859-025-06079-3","type":"journal-article","created":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T06:32:24Z","timestamp":1740810744000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["PyPropel: a Python-based tool for efficiently processing and characterising protein data"],"prefix":"10.1186","volume":"26","author":[{"given":"Jianfeng","family":"Sun","sequence":"first","affiliation":[]},{"given":"Jinlong","family":"Ru","sequence":"additional","affiliation":[]},{"given":"Adam P.","family":"Cribbs","sequence":"additional","affiliation":[]},{"given":"Dapeng","family":"Xiong","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,3,1]]},"reference":[{"key":"6079_CR1","doi-asserted-by":"publisher","first-page":"D506","DOI":"10.1093\/nar\/gky1049","volume":"47","author":"Consortium TU","year":"2019","unstructured":"Consortium TU. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506\u201315.","journal-title":"Nucleic Acids Res"},{"key":"6079_CR2","doi-asserted-by":"publisher","first-page":"434","DOI":"10.1016\/j.csbj.2021.12.030","volume":"20","author":"Q Hou","year":"2022","unstructured":"Hou Q, Pucci F, Pan F, Xue F, Rooman M, Feng Q. 
Using metagenomic data to boost protein structure prediction and discovery. Comput Struct Biotechnol J. 2022;20:434\u201342.","journal-title":"Comput Struct Biotechnol J"},{"key":"6079_CR3","doi-asserted-by":"publisher","first-page":"D480","DOI":"10.1093\/nar\/gkaa1100","volume":"49","author":"Consortium TU","year":"2020","unstructured":"Consortium TU. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2020;49:D480\u20139.","journal-title":"Nucleic Acids Res"},{"key":"6079_CR4","doi-asserted-by":"publisher","first-page":"204","DOI":"10.1089\/cmb.2022.0241","volume":"30","author":"A Pande","year":"2023","unstructured":"Pande A, Patiyal S, Lathwal A, Arora C, Kaur D, Dhall A, et al. Pfeature: a tool for computing wide range of protein features and building prediction models. J Comput Biol. 2023;30:204\u201322.","journal-title":"J Comput Biol"},{"key":"6079_CR5","doi-asserted-by":"publisher","first-page":"D202","DOI":"10.1093\/nar\/gkm998","volume":"36","author":"S Kawashima","year":"2007","unstructured":"Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2007;36:D202\u20135.","journal-title":"Nucleic Acids Res"},{"key":"6079_CR6","doi-asserted-by":"publisher","first-page":"1512","DOI":"10.1016\/j.csbj.2021.03.005","volume":"19","author":"J Sun","year":"2021","unstructured":"Sun J, Frishman D. Improved sequence-based prediction of interaction sites in \u03b1-helical transmembrane proteins by deep learning. Comput Struct Biotechnol J. 2021;19:1512\u201330.","journal-title":"Comput Struct Biotechnol J"},{"key":"6079_CR7","doi-asserted-by":"publisher","first-page":"581","DOI":"10.1002\/humu.23961","volume":"41","author":"A Kulandaisamy","year":"2020","unstructured":"Kulandaisamy A, Zaucha J, Sakthivel R, Frishman D, Gromiha MM. Pred-MutHTP: prediction of disease-causing and neutral mutations in human transmembrane proteins. Hum Mutat. 
2020;41:581\u201390.","journal-title":"Hum Mutat"},{"key":"6079_CR8","doi-asserted-by":"publisher","first-page":"2499","DOI":"10.1093\/bioinformatics\/bty140","volume":"34","author":"Z Chen","year":"2018","unstructured":"Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018;34:2499\u2013502.","journal-title":"Bioinformatics"},{"key":"6079_CR9","doi-asserted-by":"publisher","first-page":"e60","DOI":"10.1093\/nar\/gkab122","volume":"49","author":"Z Chen","year":"2021","unstructured":"Chen Z, Zhao P, Li C, Li F, Xiang D, Chen Y-Z, et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021;49:e60\u2013e60.","journal-title":"Nucleic Acids Res"},{"key":"6079_CR10","doi-asserted-by":"publisher","first-page":"1047","DOI":"10.1093\/bib\/bbz041","volume":"21","author":"Z Chen","year":"2019","unstructured":"Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2019;21:1047\u201357.","journal-title":"Brief Bioinform"},{"key":"6079_CR11","doi-asserted-by":"publisher","first-page":"884","DOI":"10.1093\/bioinformatics\/btt607","volume":"30","author":"U Omasits","year":"2013","unstructured":"Omasits U, Ahrens CH, M\u00fcller S, Wollscheid B. Protter: interactive protein feature visualization and integration with experimental proteomic data. Bioinformatics. 2013;30:884\u20136.","journal-title":"Bioinformatics"},{"key":"6079_CR12","doi-asserted-by":"publisher","first-page":"796","DOI":"10.1016\/j.csbj.2022.12.044","volume":"21","author":"D Guevara-Barrientos","year":"2023","unstructured":"Guevara-Barrientos D, Kaundal R. 
ProFeatX: a parallelized protein feature extraction suite for machine learning. Comput Struct Biotechnol J. 2023;21:796\u2013801.","journal-title":"Comput Struct Biotechnol J"},{"key":"6079_CR13","doi-asserted-by":"publisher","first-page":"e0253411","DOI":"10.1371\/journal.pone.0253411","volume":"16","author":"B Faezov","year":"2021","unstructured":"Faezov B, Dunbrack RL Jr. PDBrenum: a webserver and program providing protein data bank files renumbered according to their UniProt sequences. PLoS One. 2021;16:e0253411.","journal-title":"PLoS One"},{"key":"6079_CR14","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1038\/nmeth.1818","volume":"9","author":"M Remmert","year":"2012","unstructured":"Remmert M, Biegert A, Hauser A, S\u00f6ding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9:173\u20135.","journal-title":"Nat Methods"},{"key":"6079_CR15","doi-asserted-by":"publisher","first-page":"2592","DOI":"10.1093\/bioinformatics\/btu352","volume":"30","author":"CN Magnan","year":"2014","unstructured":"Magnan CN, Baldi P. SSpro\/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics. 2014;30:2592\u20137.","journal-title":"Bioinformatics"},{"key":"6079_CR16","doi-asserted-by":"publisher","first-page":"152","DOI":"10.3390\/physchem1020010","volume":"1","author":"M Pons","year":"2021","unstructured":"Pons M. Basic residue clusters in intrinsically disordered regions of peripheral membrane proteins: modulating 2D diffusion on cell membranes. Physchem. 2021;1:152\u201362.","journal-title":"Physchem"},{"key":"6079_CR17","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbad288","author":"J Sun","year":"2023","unstructured":"Sun J, Kulandaisamy A, Ru J, Gromiha MM, Cribbs AP. TMKit: a Python interface for computational analysis of transmembrane proteins. Brief Bioinform. 2023. 
https:\/\/doi.org\/10.1093\/bib\/bbad288.","journal-title":"Brief Bioinform"},{"key":"6079_CR18","doi-asserted-by":"publisher","first-page":"D572","DOI":"10.1093\/nar\/gkad897","volume":"52","author":"L Dobson","year":"2023","unstructured":"Dobson L, Gerd\u00e1n C, Tusn\u00e1dy S, Szekeres L, Kuffa K, Lang\u00f3 T, et al. UniTmp: unified resources for transmembrane proteins. Nucleic Acids Res. 2023;52:D572\u20138.","journal-title":"Nucleic Acids Res"},{"issue":"suppl_1","key":"6079_CR19","doi-asserted-by":"publisher","first-page":"D234","DOI":"10.1093\/nar\/gkm751","volume":"36","author":"GE Tusn\u00e1dy","year":"2007","unstructured":"Tusn\u00e1dy GE, Kalm\u00e1r L, Simon I. TOPDB: topology data bank of transmembrane proteins. Nucleic Acids Res. 2007;36(suppl_1):D234\u20139.","journal-title":"Nucleic Acids Res"},{"key":"6079_CR20","doi-asserted-by":"publisher","first-page":"1452","DOI":"10.1093\/bioinformatics\/btab813","volume":"38","author":"S Bittrich","year":"2021","unstructured":"Bittrich S, Rose Y, Segura J, Lowe R, Westbrook JD, Duarte JM, et al. RCSB protein data bank: improved annotation, search and visualization of membrane protein structures archived in the PDB. Bioinformatics. 2021;38:1452\u20134.","journal-title":"Bioinformatics"},{"key":"6079_CR21","doi-asserted-by":"publisher","first-page":"1582","DOI":"10.1093\/bioinformatics\/bty862","volume":"35","author":"TA Hopf","year":"2018","unstructured":"Hopf TA, Green AG, Schubert B, Mersmann S, Sch\u00e4rfe CPI, Ingraham JB, et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics. 2018;35:1582\u20134.","journal-title":"Bioinformatics"},{"key":"6079_CR22","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1038\/s42003-023-04462-5","volume":"6","author":"Z Hou","year":"2023","unstructured":"Hou Z, Yang Y, Ma Z, Wong K, Li X. Learning the protein language of proteome-wide protein-protein binding sites via explainable ensemble deep learning. Commun Biol. 
2023;6:73.","journal-title":"Commun Biol"},{"key":"6079_CR23","doi-asserted-by":"publisher","first-page":"3312","DOI":"10.1093\/bioinformatics\/btm515","volume":"23","author":"A Fuchs","year":"2007","unstructured":"Fuchs A, Martin-Galiano AJ, Kalman M, Fleishman S, Ben-Tal N, Frishman D. Co-evolving residues in membrane proteins. Bioinformatics. 2007;23:3312\u20139.","journal-title":"Bioinformatics"},{"key":"6079_CR24","doi-asserted-by":"publisher","first-page":"119","DOI":"10.1002\/prot.23160","volume":"79","author":"B Monastyrskyy","year":"2011","unstructured":"Monastyrskyy B, Fidelis K, Tramontano A, Kryshtafovych A. Evaluation of residue\u2013residue contact predictions in CASP9. Proteins Struct Funct Bioinform. 2011;79:119\u201325.","journal-title":"Proteins Struct Funct Bioinform"},{"issue":"2\u20133","key":"6079_CR25","doi-asserted-by":"publisher","first-page":"565","DOI":"10.1111\/j.1432-1033.1982.tb07002.x","volume":"128","author":"P Argos","year":"1982","unstructured":"Argos P, Rao JKM, Hargrave PA. Structural prediction of membrane-bound proteins. Eur J Biochem. 1982;128(2\u20133):565\u201375.","journal-title":"Eur J Biochem"},{"issue":"4154","key":"6079_CR26","first-page":"862","volume":"185","author":"R Grantham","year":"1974","unstructured":"Grantham R. Amino acid difference formula to help explain protein evolution. Science. 1974;185(4154):862\u20134.","journal-title":"Science"},{"key":"6079_CR27","doi-asserted-by":"crossref","unstructured":"Hopp TP, Woods KR. Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci USA. 1981;78:3824\u20138.","DOI":"10.1073\/pnas.78.6.3824"},{"key":"6079_CR28","doi-asserted-by":"crossref","unstructured":"Betts MJ, Russell RB. Amino Acid Properties and Consequences of Substitutions. Bioinformatics for Geneticists. 2003. pp. 
289\u2013316.","DOI":"10.1002\/0470867302.ch14"},{"key":"6079_CR29","doi-asserted-by":"publisher","first-page":"999","DOI":"10.1093\/bioinformatics\/btu791","volume":"31","author":"DT Jones","year":"2015","unstructured":"Jones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics. 2015;31:999\u20131006.","journal-title":"Bioinformatics"},{"key":"6079_CR30","doi-asserted-by":"publisher","first-page":"1875","DOI":"10.1093\/bioinformatics\/btm270","volume":"23","author":"JA Capra","year":"2007","unstructured":"Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23:1875\u201382.","journal-title":"Bioinformatics"},{"key":"6079_CR31","doi-asserted-by":"publisher","first-page":"1400","DOI":"10.1038\/s42003-024-07066-9","volume":"7","author":"Y Zhang","year":"2024","unstructured":"Zhang Y, Dong M, Deng J, Wu J, Zhao Q, Gao X, et al. Graph masked self-distillation learning for prediction of mutation impact on protein\u2013protein interactions. Commun Biol. 2024;7:1400.","journal-title":"Commun Biol"},{"key":"6079_CR32","doi-asserted-by":"publisher","DOI":"10.1038\/s41587-024-02428-4","author":"D Xiong","year":"2024","unstructured":"Xiong D, Qiu Y, Zhao J, Zhou Y, Lee D, Gupta S, et al. A structurally informed human protein\u2013protein interactome reveals proteome-wide perturbations caused by disease mutations. Nat Biotechnol. 2024. https:\/\/doi.org\/10.1038\/s41587-024-02428-4.","journal-title":"Nat Biotechnol"},{"key":"6079_CR33","doi-asserted-by":"publisher","first-page":"W344","DOI":"10.1093\/nar\/gkw408","volume":"44","author":"H Ashkenazy","year":"2016","unstructured":"Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, et al. ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res. 
2016;44:W344\u201350.","journal-title":"Nucleic Acids Res"},{"key":"6079_CR34","doi-asserted-by":"publisher","DOI":"10.1038\/s41587-023-01773-0","author":"M van Kempen","year":"2023","unstructured":"van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2023. https:\/\/doi.org\/10.1038\/s41587-023-01773-0.","journal-title":"Nat Biotechnol"},{"key":"6079_CR35","doi-asserted-by":"publisher","first-page":"2577","DOI":"10.1002\/bip.360221211","volume":"22","author":"W Kabsch","year":"1983","unstructured":"Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577\u2013637.","journal-title":"Biopolymers"},{"key":"6079_CR36","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1186\/s13321-018-0270-2","volume":"10","author":"J Dong","year":"2018","unstructured":"Dong J, Yao Z-J, Zhang L, Luo F, Lin Q, Lu A-P, et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform. 2018;10:16.","journal-title":"J Cheminform"},{"key":"6079_CR37","unstructured":"Rideout JR, Caporaso G, Bolyen E, McDonald D, Baeza YV, Alastuey JC, et al. scikit-bio\/scikit-bio: scikit-bio 0.6.2. 2024."},{"key":"6079_CR38","doi-asserted-by":"publisher","first-page":"172","DOI":"10.1016\/j.neucom.2021.07.102","volume":"484","author":"AM Sequeira","year":"2022","unstructured":"Sequeira AM, Lousa D, Rocha M. ProPythia: a Python package for protein classification based on machine and deep learning. Neurocomputing. 2022;484:172\u201382.","journal-title":"Neurocomputing"},{"key":"6079_CR39","doi-asserted-by":"publisher","first-page":"960","DOI":"10.1093\/bioinformatics\/btt072","volume":"29","author":"D-S Cao","year":"2013","unstructured":"Cao D-S, Xu Q-S, Liang Y-Z. propy: a tool to generate various modes of Chou\u2019s PseAAC. 
Bioinformatics. 2013;29:960\u20132.","journal-title":"Bioinformatics"},{"key":"6079_CR40","unstructured":"Mckenna A. protPy. GitHub repository. 2024."},{"key":"6079_CR41","doi-asserted-by":"publisher","unstructured":"Kozlova E, Valentin A, Khadhraoui A, Nakhaee-Zadeh Gutierrez D. ProteinFlow: a Python Library to Pre-Process Protein Structure Data for Deep Learning Applications. bioRxiv. 2023. https:\/\/doi.org\/10.1101\/2023.09.25.559346.","DOI":"10.1101\/2023.09.25.559346"},{"key":"6079_CR42","doi-asserted-by":"publisher","first-page":"4","DOI":"10.32614\/RJ-2015-001","volume":"7","author":"D Osorio","year":"2015","unstructured":"Osorio D, Rond\u00f3n-Villarreal P, Torres R. Peptides: a package for data mining of antimicrobial peptides. R J. 2015;7:4\u201314.","journal-title":"R J"},{"key":"6079_CR43","doi-asserted-by":"publisher","first-page":"3831","DOI":"10.1093\/bioinformatics\/btz165","volume":"35","author":"R Muhammod","year":"2019","unstructured":"Muhammod R, Ahmed S, Md Farid D, Shatabda S, Sharma A, Dehzangi A. PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences. Bioinformatics. 
2019;35:3831\u20133.","journal-title":"Bioinformatics"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-025-06079-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-025-06079-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-025-06079-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T06:32:32Z","timestamp":1740810752000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-025-06079-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,1]]},"references-count":43,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["6079"],"URL":"https:\/\/doi.org\/10.1186\/s12859-025-06079-3","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,1]]},"assertion":[{"value":"21 October 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 February 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 March 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"This study does not report on or involve the use of any animal or human data or tissue, therefore the ethics approval and consent to participate are not 
applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"This study does not contain data from any individual person, therefore the consent for publication is not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"A.P.C is cofounder of Caeruleus Genomics Ltd and inventor on several patents related to sequencing technologies filed by Oxford University Innovations. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"70"}}