{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T15:10:16Z","timestamp":1759158616950,"version":"3.44.0"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T00:00:00Z","timestamp":1759104000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T00:00:00Z","timestamp":1759104000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Automatic extraction of molecules from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of molecule and Markush structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting fingerprints directly from images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables the retrieval of molecules and Markush structures. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecule and Markush structure depictions. 
The benchmark datasets, models, and inference code are publicly available.<\/jats:p>","DOI":"10.1186\/s13321-025-01091-4","type":"journal-article","created":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T14:39:01Z","timestamp":1759156741000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Subgrapher: visual fingerprinting of chemical structures"],"prefix":"10.1186","volume":"17","author":[{"given":"Lucas","family":"Morin","sequence":"first","affiliation":[]},{"given":"Gerhard Ingmar","family":"Meijer","sequence":"additional","affiliation":[]},{"given":"Val\u00e9ry","family":"Weber","sequence":"additional","affiliation":[]},{"given":"Luc","family":"Van Gool","sequence":"additional","affiliation":[]},{"given":"Peter W. J.","family":"Staar","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,9,29]]},"reference":[{"key":"1091_CR1","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1038\/s41524-025-01538-0","volume":"11","author":"EO Pyzer-Knapp","year":"2025","unstructured":"Pyzer-Knapp EO et al (2025) Foundation models for materials discovery-current state and future directions. npj Computat Mater 11:61","journal-title":"npj Computat Mater"},{"key":"1091_CR2","unstructured":"Livathinos N et al (2025) Docling: an Efficient Open-Source Toolkit for AI-driven Document Conversion. arxiv:https:\/\/arxiv.org\/abs\/2501.17887"},{"key":"1091_CR3","unstructured":"Auer C et al (2024) Docling Technical Report. https:\/\/arxiv.org\/abs\/2408.09869"},{"key":"1091_CR4","unstructured":"Nassar A et al (2025) SmolDocling: an ultra-compact vision-language model for end-to-end multi-modal document conversion. 
arxiv:https:\/\/arxiv.org\/abs\/2503.11576"},{"key":"1091_CR5","doi-asserted-by":"publisher","first-page":"6532","DOI":"10.1038\/s41467-024-50779-y","volume":"15","author":"L Morin","year":"2024","unstructured":"Morin L, Weber V, Meijer GI, Yu F, Staar PWJ (2024) PatCID: an open-access dataset of chemical structures in patent documents. Nat Commun 15:6532","journal-title":"Nat Commun"},{"key":"1091_CR6","doi-asserted-by":"publisher","first-page":"D1220","DOI":"10.1093\/nar\/gkv1253","volume":"44","author":"G Papadatos","year":"2015","unstructured":"Papadatos G et al (2015) SureChEMBL: a large-scale, chemically annotated patent document database. Nucleic Acids Res 44:D1220\u2013D1228","journal-title":"Nucleic Acids Res"},{"key":"1091_CR7","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","volume":"28","author":"D Weininger","year":"1988","unstructured":"Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31\u201336","journal-title":"J Chem Inf Comput Sci"},{"key":"1091_CR8","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1021\/ci800067r","volume":"49","author":"IV Filippov","year":"2009","unstructured":"Filippov IV, Nicklaus MC (2009) Optical structure recognition software to recover chemical information: OSRA, an open source solution. J Chem Inf Model 49:740\u2013743","journal-title":"J Chem Inf Model"},{"key":"1091_CR9","doi-asserted-by":"crossref","unstructured":"Smolov V, Zentsev F, Rybalkin M, Voorhees EM, Buckland LP (eds) (2011) Imago: Open-Source Toolkit for 2D Chemical Structure Image Recognition. In: Voorhees EM, Buckland LP (eds.) Text Retrieval Conference National Institute of Standards and Technology (NIST)","DOI":"10.6028\/NIST.SP.500-296.chemical-GGA"},{"key":"1091_CR10","unstructured":"Peryea T et al (2013) MolVec. https:\/\/github.com\/ncats\/molvec . 
Accessed: January (2025)"},{"key":"1091_CR11","doi-asserted-by":"crossref","unstructured":"Morin L, Danelljan M, Agea MI, Nassar A, Weber V, Meijer I, Staar P, Yu F (2023) MolGrapher: Graph-based Visual Recognition of Chemical Structures. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV) pp 19552\u201319561","DOI":"10.1109\/ICCV51070.2023.01791"},{"key":"1091_CR12","doi-asserted-by":"publisher","first-page":"4506","DOI":"10.1021\/acs.jcim.0c00459","volume":"60","author":"M Oldenhof","year":"2020","unstructured":"Oldenhof M, Arany A, Moreau Y, Simm J (2020) ChemGrapher: optical Graph Recognition of Chemical Compounds by Deep Learning. J Chem Inf Model 60:4506\u20134517","journal-title":"J Chem Inf Model"},{"key":"1091_CR13","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1186\/s13321-021-00538-8","volume":"13","author":"K Rajan","year":"2021","unstructured":"Rajan K, Zielesny A, Steinbeck C (2021) DECIMER 1.0: deep learning for chemical image recognition using transformers. J Cheminform 13:61","journal-title":"J Cheminform"},{"key":"1091_CR14","doi-asserted-by":"publisher","first-page":"1925","DOI":"10.1021\/acs.jcim.2c01480","volume":"63","author":"Y Qian","year":"2023","unstructured":"Qian Y et al (2023) Molscribe: robust molecular structure recognition with image-to-graph generation. J Chem Inf Model 63:1925\u20131934","journal-title":"J Chem Inf Model"},{"key":"1091_CR15","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1016\/S0172-2190(03)00073-5","volume":"25","author":"ES Simmons","year":"2003","unstructured":"Simmons ES (2003) Markush structure searching over the years. 
World Patent Inf 25:195\u2013202","journal-title":"World Patent Inf"},{"key":"1091_CR16","doi-asserted-by":"publisher","first-page":"5002","DOI":"10.1021\/acs.jcim.9b00798","volume":"59","author":"M Yang","year":"2019","unstructured":"Yang M et al (2019) Machine learning models based on molecular fingerprints and an extreme gradient boosting method lead to the discovery of jak2 inhibitors. J Chem Inf Model 59:5002\u20135012","journal-title":"J Chem Inf Model"},{"key":"1091_CR17","doi-asserted-by":"publisher","first-page":"71","DOI":"10.1186\/s13321-022-00650-3","volume":"14","author":"N Wen","year":"2022","unstructured":"Wen N et al (2022) A fingerprints based molecular property prediction method using the BERT model. J Cheminform 14:71","journal-title":"J Cheminform"},{"key":"1091_CR18","doi-asserted-by":"publisher","first-page":"1273","DOI":"10.1021\/ci010132r","volume":"42","author":"JL Durant","year":"2002","unstructured":"Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42:1273\u20131280","journal-title":"J Chem Inf Comput Sci"},{"key":"1091_CR19","doi-asserted-by":"publisher","first-page":"D1373","DOI":"10.1093\/nar\/gkac956","volume":"51","author":"S Kim","year":"2022","unstructured":"Kim S et al (2022) PubChem 2023 update. Nucleic Acids Res 51:D1373\u2013D1380","journal-title":"Nucleic Acids Res"},{"key":"1091_CR20","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1186\/s13321-017-0225-z","volume":"9","author":"P Ertl","year":"2017","unstructured":"Ertl P (2017) An algorithm to identify functional groups in organic molecules. J Cheminform 9:36","journal-title":"J Cheminform"},{"key":"1091_CR21","doi-asserted-by":"publisher","first-page":"983","DOI":"10.1021\/ci9800211","volume":"38","author":"P Willett","year":"1998","unstructured":"Willett P, Barnard JM, Downs GM (1998) Chemical Similarity Searching. 
J Chem Inf Comput Sci 38:983\u2013996","journal-title":"J Chem Inf Comput Sci"},{"key":"1091_CR22","doi-asserted-by":"publisher","first-page":"742","DOI":"10.1021\/ci100050t","volume":"50","author":"D Rogers","year":"2010","unstructured":"Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742\u2013754","journal-title":"J Chem Inf Model"},{"key":"1091_CR23","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1186\/s13321-018-0321-8","volume":"10","author":"D Probst","year":"2018","unstructured":"Probst D, Reymond J-L (2018) A probabilistic molecular fingerprint for big data settings. J Cheminform 10:66","journal-title":"J Cheminform"},{"key":"1091_CR24","doi-asserted-by":"publisher","first-page":"1256","DOI":"10.1038\/s42256-022-00580-7","volume":"4","author":"J Ross","year":"2022","unstructured":"Ross J et al (2022) Large-scale chemical language representations capture molecular structure and properties. Nature Mach Intell 4:1256\u20131264","journal-title":"Nature Mach Intell"},{"key":"1091_CR25","unstructured":"Fan S et al (2025) OCSU: optical Chemical Structure Understanding for Molecule-centric Scientific Discovery. arxiv:https:\/\/arxiv.org\/abs\/2501.15415"},{"key":"1091_CR26","unstructured":"Fang X et al (2025) MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild. arxiv:https:\/\/arxiv.org\/abs\/2411.11098"},{"key":"1091_CR27","doi-asserted-by":"publisher","first-page":"1004","DOI":"10.1038\/s42256-022-00557-6","volume":"4","author":"X Zeng","year":"2022","unstructured":"Zeng X et al (2022) Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nature Mach Intell 4:1004\u20131016","journal-title":"Nature Mach Intell"},{"key":"1091_CR28","doi-asserted-by":"crossref","unstructured":"He K, Gkioxari G, Doll\u00e1r P, Girshick RB (2017) Mask R-CNN. 
arXiv:1703.06870","DOI":"10.1109\/ICCV.2017.322"},{"key":"1091_CR29","unstructured":"Daylight Chemical Information\u00a0Systems, I. Daylight theory manual: SMARTS: a language for describing molecular patterns. http:\/\/www.daylight.com\/dayhtml\/doc\/theory\/theory.smarts.html. Accessed: January (2025)"},{"key":"1091_CR30","doi-asserted-by":"publisher","first-page":"8408","DOI":"10.1021\/acs.jmedchem.0c00754","volume":"63","author":"P Ertl","year":"2020","unstructured":"Ertl P, Altmann E, McKenna JM (2020) The most common functional groups in bioactive molecules and how their popularity has evolved over time. J Med Chem 63:8408\u20138418","journal-title":"J Med Chem"},{"key":"1091_CR31","doi-asserted-by":"publisher","first-page":"E1","DOI":"10.3390\/molecules21010001","volume":"21","author":"ES Salmina","year":"2015","unstructured":"Salmina ES, Haider N, Tetko IV (2015) Extended functional groups (efg): an efficient set for chemical characterization and structure-activity relationship studies of chemical compounds. Molecules 21:E1","journal-title":"Molecules"},{"key":"1091_CR32","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1186\/s13321-024-00813-4","volume":"16","author":"C Manelfi","year":"2024","unstructured":"Manelfi C et al (2024) DompeKeys: a set of novel substructure-based descriptors for efficient chemical space mapping, development and structural interpretation of machine learning models, and indexing of large databases. J Cheminform 16:21","journal-title":"J Cheminform"},{"key":"1091_CR33","unstructured":"Landrum G (2025) RDKit: open-Source Cheminformatics Software. http:\/\/www.rdkit.org\/. Accessed: January"},{"key":"1091_CR34","unstructured":"ChemAxon Extended SMILES (CXSMILES) Documentation. (2025) https:\/\/docs.chemaxon.com\/display\/docs\/formats_chemaxon-extended-smiles-and-smarts-cxsmiles-and-cxsmarts. 
Accessed: January"},{"key":"1091_CR35","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1186\/s13321-017-0220-4","volume":"9","author":"EL Willighagen","year":"2017","unstructured":"Willighagen EL et al (2017) The chemistry development kit (cdk) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform 9:33","journal-title":"J Cheminform"},{"key":"1091_CR36","unstructured":"Fujiyoshi A, Nakagawa K, Suzuki M (2011) Robust method of segmentation and recognition of chemical structure images in cheminfty. In: Pre-proceedings of the 9th IAPR international workshop on graphics recognition, GREC"},{"key":"1091_CR37","doi-asserted-by":"crossref","unstructured":"Morin L et al (2025) MarkushGrapher: joint Visual and Textual Recognition of Markush Structures. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp 14505\u201314515","DOI":"10.1109\/CVPR52734.2025.01352"},{"key":"1091_CR38","doi-asserted-by":"publisher","first-page":"244","DOI":"10.1021\/ci00007a012","volume":"32","author":"A Dalby","year":"1992","unstructured":"Dalby A et al (1992) Description of several chemical structure file formats used by computer programs developed at molecular design limited. J Chem Inf Comput Sci 32:244\u2013255","journal-title":"J Chem Inf Comput Sci"},{"key":"1091_CR39","unstructured":"Japan Patent Office (2025) https:\/\/www.jpo.go.jp Accessed: January"},{"key":"1091_CR40","unstructured":"United States Patent and Trademark Office (2025) http:\/\/uspto.gov Accessed: January"},{"key":"1091_CR41","doi-asserted-by":"publisher","first-page":"5045","DOI":"10.1038\/s41467-023-40782-0","volume":"14","author":"K Rajan","year":"2023","unstructured":"Rajan K, Brinkhaus HO, Agea MI, Zielesny A, Steinbeck C (2023) DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. 
Nat Commun 14:5045","journal-title":"Nat Commun"},{"key":"1091_CR42","doi-asserted-by":"publisher","first-page":"1925","DOI":"10.1021\/acs.jcim.2c01480","volume":"63","author":"Y Qian","year":"2023","unstructured":"Qian Y et al (2023) MolScribe: robust Molecular Structure Recognition with Image-to-Graph Generation. J Chem Inf Model 63:1925\u20131934","journal-title":"J Chem Inf Model"},{"key":"1091_CR43","unstructured":"Tanimoto T (1958) An Elementary Mathematical Theory of Classification and Prediction. In: International Business Machines Corporation"},{"key":"1091_CR44","doi-asserted-by":"publisher","DOI":"10.1016\/j.wpi.2021.102055","volume":"66","author":"J Ohms","year":"2021","unstructured":"Ohms J (2021) Current methodologies for chemical compound searching in patents: a case study. World Patent Inf 66:102055","journal-title":"World Patent Inf"},{"key":"1091_CR45","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1186\/s13321-021-00496-1","volume":"13","author":"K Rajan","year":"2021","unstructured":"Rajan K, Brinkhaus HO, Sorokina M, Zielesny A, Steinbeck C (2021) DECIMER-Segmentation: automated extraction of chemical structure depictions from scientific literature. J Cheminform 13:20","journal-title":"J Cheminform"},{"key":"1091_CR46","unstructured":"ChemAxon Marvin JS (2025) https:\/\/marvinjs-demo.chemaxon.com\/latest\/demo.html. 
(Accessed: January)"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01091-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-025-01091-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01091-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T14:39:05Z","timestamp":1759156745000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-025-01091-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,29]]},"references-count":46,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1091"],"URL":"https:\/\/doi.org\/10.1186\/s13321-025-01091-4","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,29]]},"assertion":[{"value":"7 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 August 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 September 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"149"}}