{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T12:31:34Z","timestamp":1777984294395,"version":"3.51.4"},"reference-count":22,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T00:00:00Z","timestamp":1771545600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T00:00:00Z","timestamp":1771545600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Chalmers Gender Initiative for Excellence"},{"name":"Wallenberg AI, Autonomous Systems, and Software Program"},{"DOI":"10.13039\/501100002835","name":"Chalmers University of Technology","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100002835","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules composed of an E3 ligase ligand, a linker, and a warhead targeting a protein of interest. Despite their modular structure, accurately identifying and annotating these components in PROTACs is challenging and typically relies on manual curation and predefined substructure matching. To address this, we developed PROTAC-Splitter, a machine learning framework designed for automated annotation of PROTAC substructures. To address data scarcity, we generated and openly released a synthetic dataset containing approximately 1.3 million PROTAC structures with annotated ligand splits. 
Leveraging this dataset, we developed two complementary approaches for PROTAC substructure annotation: a Transformer-based sequence-to-sequence model and a graph-based XGBoost model. We evaluated both approaches on held-out public data and structurally novel PROTACs from AstraZeneca\u2019s proprietary collection. The Transformer-based model achieved high exact-match accuracy (86%) on public data but dropped significantly (18%) on structurally novel internal PROTACs due to occasional hallucinations. In contrast, the XGBoost model can ensure chemical validity and perfect reassembly accuracy on both sets, with lower exact-match accuracy on open-data (42.2%) but comparable performance on the internal set (23%). To improve reliability, we implemented a wrapper function for the Transformer (Transformer-\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$\\Delta$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mi>\u0394<\/mml:mi>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    ), which corrects partial prediction errors, raising reassembly accuracy to 96% on public and 70% on internal datasets. Combining the strengths of both models, we propose a hybrid approach that reliably annotates PROTACs across diverse chemical spaces. 
PROTAC-Splitter provides a robust, scalable tool to facilitate automated PROTAC analysis and is available open-source at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/ribesstefano\/PROTAC-Splitter\" ext-link-type=\"uri\">https:\/\/github.com\/ribesstefano\/PROTAC-Splitter<\/jats:ext-link>\n                  <\/jats:p>","DOI":"10.1186\/s13321-025-01135-9","type":"journal-article","created":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T11:04:47Z","timestamp":1771585487000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["PROTAC-Splitter: a machine learning framework for automated identification of PROTAC substructures"],"prefix":"10.1186","volume":"18","author":[{"given":"Stefano","family":"Ribes","sequence":"first","affiliation":[]},{"given":"Ranxuan","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"T\u00e9lio","family":"Cropsal","sequence":"additional","affiliation":[]},{"given":"Anders","family":"K\u00e4llberg","sequence":"additional","affiliation":[]},{"given":"Christian","family":"Tyrchan","sequence":"additional","affiliation":[]},{"given":"Eva","family":"Nittinger","sequence":"additional","affiliation":[]},{"given":"Roc\u00edo","family":"Mercado","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,2,20]]},"reference":[{"key":"1135_CR1","doi-asserted-by":"publisher","unstructured":"Hinterndorfer M, Spiteri VA, Ciulli A, Winter GE (2025) Targeted protein degradation for cancer therapy. Nature Reviews Cancer 1\u201324. 
https:\/\/doi.org\/10.1038\/s41568-025-00817-8","DOI":"10.1038\/s41568-025-00817-8"},{"key":"1135_CR2","doi-asserted-by":"publisher","DOI":"10.1039\/D4DD00177J","author":"Y Gharbi","year":"2024","unstructured":"Gharbi Y, Mercado R (2024) A comprehensive review of emerging approaches in machine learning for de novo PROTAC design. Digital Discovery. https:\/\/doi.org\/10.1039\/D4DD00177J","journal-title":"Digital Discovery"},{"issue":"4","key":"1135_CR3","doi-asserted-by":"publisher","first-page":"681","DOI":"10.1021\/acsmedchemlett.5c00068","volume":"16","author":"V Poongavanam","year":"2025","unstructured":"Poongavanam V, Peintner S, Abeje Y, K\u00f6lling F, Meibom D, Erdelyi M, Kihlberg J (2025) Linker-Determined Folding and Hydrophobic Interactions Explain a Major Difference in PROTAC Cell Permeability. ACS Med Chem Lett 16(4):681\u2013687. https:\/\/doi.org\/10.1021\/acsmedchemlett.5c00068","journal-title":"ACS Med Chem Lett"},{"key":"1135_CR4","unstructured":"London N, Prilusky J PROTACpedia. https:\/\/protacpedia.weizmann.ac.il\/. Accessed: 2025\/01\/21"},{"issue":"D1","key":"1135_CR5","doi-asserted-by":"publisher","first-page":"1367","DOI":"10.1093\/nar\/gkac946","volume":"51","author":"G Weng","year":"2023","unstructured":"Weng G, Cai X, Cao D, Du H, Shen C, Deng Y, He Q, Yang B, Li D, Hou T (2023) PROTAC-DB 2.0: an updated database of PROTACs. Nucleic Acids Research 51(D1):1367\u20131372. https:\/\/doi.org\/10.1093\/nar\/gkac946","journal-title":"Nucleic Acids Research"},{"issue":"1","key":"1135_CR6","doi-asserted-by":"publisher","DOI":"10.1088\/2632-2153\/ac3ffb","volume":"3","author":"R Irwin","year":"2022","unstructured":"Irwin R, Dimitriadis S, He J, Bjerrum EJ (2022) Chemformer: a pre-trained transformer for computational chemistry. Mach Learn Sci Technol 3(1):015022. 
https:\/\/doi.org\/10.1088\/2632-2153\/ac3ffb","journal-title":"Mach Learn Sci Technol"},{"key":"1135_CR7","doi-asserted-by":"publisher","unstructured":"Ahmad W, Simon E, Chithrananda S, Grand G, Ramsundar B (2022) ChemBERTa-2: Towards Chemical Foundation Models. https:\/\/doi.org\/10.48550\/arXiv.2209.01712","DOI":"10.48550\/arXiv.2209.01712"},{"issue":"31","key":"1135_CR8","doi-asserted-by":"publisher","first-page":"8312","DOI":"10.1039\/D0SC03126G","volume":"11","author":"Y Yang","year":"2020","unstructured":"Yang Y, Zheng S, Su S, Zhao C, Xu J, Chen H (2020) SyntaLinker: automatic fragment linking with deep conditional transformer neural networks. Chem Sci 11(31):8312\u20138322. https:\/\/doi.org\/10.1039\/D0SC03126G","journal-title":"Chem Sci"},{"issue":"9","key":"1135_CR9","doi-asserted-by":"publisher","first-page":"739","DOI":"10.1038\/s42256-022-00527-y","volume":"4","author":"S Zheng","year":"2022","unstructured":"Zheng S, Tan Y, Wang Z, Li C, Zhang Z, Sang X, Chen H, Yang Y (2022) Accelerated rational PROTAC design via deep learning and molecular simulations. Nat Mach Intell 4(9):739\u2013748. https:\/\/doi.org\/10.1038\/s42256-022-00527-y","journal-title":"Nat Mach Intell"},{"issue":"23","key":"1135_CR10","doi-asserted-by":"publisher","first-page":"5907","DOI":"10.1021\/acs.jcim.2c00982","volume":"62","author":"Y Tan","year":"2022","unstructured":"Tan Y, Dai L, Huang W, Guo Y, Zheng S, Lei J, Chen H, Yang Y (2022) DRlinker: deep reinforcement learning for optimization in fragment linking design. J Chem Inf Model 62(23):5907\u20135917. https:\/\/doi.org\/10.1021\/acs.jcim.2c00982","journal-title":"J Chem Inf Model"},{"key":"1135_CR11","doi-asserted-by":"publisher","DOI":"10.1016\/j.ailsci.2024.100104","volume":"6","author":"S Ribes","year":"2024","unstructured":"Ribes S, Nittinger E, Tyrchan C, Mercado R (2024) Modeling PROTAC degradation activity with machine learning. Art Int Life Sci 6:100104. 
https:\/\/doi.org\/10.1016\/j.ailsci.2024.100104","journal-title":"Art Int Life Sci"},{"issue":"1","key":"1135_CR12","doi-asserted-by":"publisher","first-page":"7133","DOI":"10.1038\/s41467-022-34807-3","volume":"13","author":"F Li","year":"2022","unstructured":"Li F, Hu Q, Zhang X, Sun R, Liu Z, Wu S, Tian S, Ma X, Dai Z, Yang X, Gao S, Bai F (2022) DeepPROTACs is a deep learning-based targeted degradation predictor for PROTACs. Nat Commun 13(1):7133. https:\/\/doi.org\/10.1038\/s41467-022-34807-3","journal-title":"Nat Commun"},{"key":"1135_CR13","doi-asserted-by":"publisher","unstructured":"Chen Z, Gu C, Tan S, Wang X, Li Y, He M, Lu R, Sun S, Hsieh C-Y, Yao X et al (2024) Interpretable PROTAC degradation prediction with structure-informed deep ternary attention framework. bioRxiv, 2024\u201311. https:\/\/doi.org\/10.1101\/2024.11.05.622005","DOI":"10.1101\/2024.11.05.622005"},{"key":"1135_CR14","doi-asserted-by":"publisher","unstructured":"Liu J, Roy M, Isbel L, Li F (2025) Accurate PROTAC targeted degradation prediction with DegradeMaster. bioRxiv, 2025\u201302. https:\/\/doi.org\/10.1101\/2025.02.03.636343","DOI":"10.1101\/2025.02.03.636343"},{"issue":"D1","key":"1135_CR15","doi-asserted-by":"publisher","first-page":"1510","DOI":"10.1093\/nar\/gkae768","volume":"53","author":"J Ge","year":"2024","unstructured":"Ge J, Li S, Weng G, Wang H, Fang M, Sun H, Deng Y, Hsieh C-Y, Li D, Hou T (2024) PROTAC-DB 3.0: an updated database of PROTACs with extended pharmacokinetic parameters. Nucleic Acids Research 53(D1):1510\u20131515. https:\/\/doi.org\/10.1093\/nar\/gkae768","journal-title":"Nucleic Acids Research"},{"key":"1135_CR16","doi-asserted-by":"publisher","unstructured":"Chen T, Guestrin C (2016) XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD \u201916. ACM, San Francisco, CA, USA. 
https:\/\/doi.org\/10.1145\/2939672.2939785","DOI":"10.1145\/2939672.2939785"},{"key":"1135_CR17","doi-asserted-by":"publisher","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention Is All You Need. arXiv. https:\/\/doi.org\/10.48550\/arXiv.1706.03762","DOI":"10.48550\/arXiv.1706.03762"},{"issue":"2","key":"1135_CR18","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1080\/0022250X.2001.9990249","volume":"25","author":"U Brandes","year":"2001","unstructured":"Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163\u2013177. https:\/\/doi.org\/10.1080\/0022250X.2001.9990249","journal-title":"J Math Sociol"},{"key":"1135_CR19","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic Minority Over-sampling TEchnique. J Artif Intell Res 16:321\u2013357. https:\/\/doi.org\/10.1613\/jair.953","journal-title":"J Artif Intell Res"},{"issue":"1","key":"1135_CR20","doi-asserted-by":"publisher","first-page":"5764","DOI":"10.1038\/s41467-024-49979-3","volume":"15","author":"G Peteani","year":"2024","unstructured":"Peteani G, Huynh MTD, Gerebtzoff G, Rodr\u00edguez-P\u00e9rez R (2024) Application of machine learning models for property prediction to targeted protein degraders. Nat Commun 15(1):5764. https:\/\/doi.org\/10.1038\/s41467-024-49979-3","journal-title":"Nat Commun"},{"issue":"3","key":"1135_CR21","doi-asserted-by":"publisher","first-page":"1641","DOI":"10.1021\/acs.jmedchem.3c01835","volume":"67","author":"TT Dean","year":"2024","unstructured":"Dean TT, Jel\u00fa-Reyes J, Allen AC, Moore TW (2024) Peptide-drug conjugates: an emerging direction for the next generation of peptide therapeutics. 
J Med Chem 67(3):1641\u20131661 (acs.jmedchem.3c01835)","journal-title":"J Med Chem"},{"issue":"9","key":"1135_CR22","doi-asserted-by":"publisher","first-page":"937","DOI":"10.1038\/s41589-021-00770-1","volume":"17","author":"G Ahn","year":"2021","unstructured":"Ahn G, Banik SM, Miller CL, Riley NM, Cochran JR, Bertozzi CR (2021) LYTACs that engage the asialoglycoprotein receptor for targeted protein degradation. Nat Chem Biol 17(9):937\u2013946. https:\/\/doi.org\/10.1038\/s41589-021-00770-1","journal-title":"Nat Chem Biol"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-025-01135-9","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01135-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01135-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T12:02:18Z","timestamp":1775736138000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1186\/s13321-025-01135-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,20]]},"references-count":22,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,12]]}},"alternative-id":["1135"],"URL":"https:\/\/doi.org\/10.1186\/s13321-025-01135-9","relation":{"has-preprint":[{"id-type":"doi","id":"10.26434\/chemrxiv-2025-bn1nv","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,20]]},"assertion":[{"value":"8 July 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article 
History"}},{"value":"28 November 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 February 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 April 2026","order":6,"name":"change_date","label":"Change Date","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Update","order":7,"name":"change_type","label":"Change Type","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The Supplementary material has been updated","order":8,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"CT and EN are employees of AstraZeneca and may own stock options.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"30"}}