{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T22:03:09Z","timestamp":1765231389774,"version":"3.41.2"},"reference-count":51,"publisher":"World Scientific Pub Co Pte Ltd","issue":"02","funder":[{"DOI":"10.13039\/100014718","name":"Innovative Research Group Project of the National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62372326"],"award-info":[{"award-number":["62372326"]}],"id":[{"id":"10.13039\/100014718","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100014718","name":"Innovative Research Group Project of the National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62172300"],"award-info":[{"award-number":["62172300"]}],"id":[{"id":"10.13039\/100014718","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Bioinform. Comput. Biol."],"published-print":{"date-parts":[[2025,4]]},"abstract":"<jats:p> This paper introduces M<jats:sup>3<\/jats:sup>-20M, a large-scale Multi-Modal Molecule dataset that contains over 20 million molecules, with the data mainly being integrated from existing databases and partially generated by large language models. Designed to support AI-driven drug design and discovery, M<jats:sup>3<\/jats:sup>-20M is 71 times more in the number of molecules than the largest existing dataset, providing an unprecedented scale that can highly benefit the training or fine-tuning of models, including large language models for drug design and discovery tasks. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M<jats:sup>3<\/jats:sup>-20M in drug design and discovery, we conduct extensive experiments on two key tasks: molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, GPT-4, and Llama3-8b. Our experimental results show that M<jats:sup>3<\/jats:sup>-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than existing single-modal datasets, which validates the value and potential of M<jats:sup>3<\/jats:sup>-20M in supporting AI-driven drug design and discovery. The dataset is available at https:\/\/github.com\/bz99bz\/M-3. <\/jats:p>","DOI":"10.1142\/s0219720025500064","type":"journal-article","created":{"date-parts":[[2025,5,9]],"date-time":"2025-05-09T09:24:42Z","timestamp":1746782682000},"source":"Crossref","is-referenced-by-count":2,"title":["M<sup>3<\/sup>-20M: A large-scale multi-modal molecule dataset for AI-driven drug design and discovery"],"prefix":"10.1142","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0378-830X","authenticated-orcid":false,"given":"Siyuan","family":"Guo","sequence":"first","affiliation":[{"name":"Department of Computer Science and Technology, Tongji University, No. 4800 Cao\u2019an Road, Shanghai 201804, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8778-0217","authenticated-orcid":false,"given":"Lexuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tongji University, No. 4800 Cao\u2019an Road, Shanghai 201804, China"}]},{"given":"Chang","family":"Jin","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tongji University, No. 4800 Cao\u2019an Road, Shanghai 201804, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3391-1532","authenticated-orcid":false,"given":"Jinxian","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Key Lab of Intelligent Information Processing and School of Computer Science, Fudan University, 2005 Songhu Road, Shanghai 200438, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6519-8538","authenticated-orcid":false,"given":"Han","family":"Peng","sequence":"additional","affiliation":[{"name":"Shanghai Key Lab of Intelligent Information Processing and School of Computer Science, Fudan University, 2005 Songhu Road, Shanghai 200438, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0870-6832","authenticated-orcid":false,"given":"Huayang","family":"Shi","sequence":"additional","affiliation":[{"name":"Shanghai Key Lab of Intelligent Information Processing and School of Computer Science, Fudan University, 2005 Songhu Road, Shanghai 200438, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8768-6740","authenticated-orcid":false,"given":"Wengen","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tongji University, No. 4800 Cao\u2019an Road, Shanghai 201804, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2313-7635","authenticated-orcid":false,"given":"Jihong","family":"Guan","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tongji University, No. 4800 Cao\u2019an Road, Shanghai 201804, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1949-2768","authenticated-orcid":false,"given":"Shuigeng","family":"Zhou","sequence":"additional","affiliation":[{"name":"Shanghai Key Lab of Intelligent Information Processing and School of Computer Science, Fudan University, 2005 Songhu Road, Shanghai 200438, China"}]}],"member":"219","published-online":{"date-parts":[[2025,6,9]]},"reference":[{"key":"S0219720025500064BIB001","doi-asserted-by":"publisher","DOI":"10.1038\/nrd.2016.230"},{"key":"S0219720025500064BIB002","doi-asserted-by":"publisher","DOI":"10.1111\/j.1476-5381.2010.01127.x"},{"key":"S0219720025500064BIB003","doi-asserted-by":"publisher","DOI":"10.14573\/altex.1610101"},{"key":"S0219720025500064BIB004","doi-asserted-by":"publisher","DOI":"10.1039\/C7SC02664A"},{"key":"S0219720025500064BIB005","doi-asserted-by":"publisher","DOI":"10.1016\/j.omtn.2023.02.019"},{"key":"S0219720025500064BIB006","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbad014"},{"key":"S0219720025500064BIB007","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-021-27137-3"},{"key":"S0219720025500064BIB008","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btac377"},{"key":"S0219720025500064BIB009","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-024-45461-2"},{"key":"S0219720025500064BIB011","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkae236"},{"key":"S0219720025500064BIB012","first-page":"48548","volume":"36","author":"Goldman S","year":"2023","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB013","first-page":"12559","volume":"33","author":"Rong Y","year":"2020","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB014","first-page":"21342","volume":"35","author":"Gao W","year":"2022","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB015","first-page":"26856","volume":"35","author":"Franke J","year":"2022","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB016","first-page":"2550","volume":"35","author":"Kong X","year":"2022","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB017","first-page":"7924","volume":"34","author":"Yang S","year":"2021","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB018","volume":"36","author":"Song Y","year":"2024","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB019","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jcim.1c00600"},{"key":"S0219720025500064BIB023","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2024.108073"},{"key":"S0219720025500064BIB025","first-page":"21497","volume-title":"Int Conf Machine Learning","author":"Liu S","year":"2023"},{"key":"S0219720025500064BIB026","first-page":"1822","volume":"3","author":"Ai Q","year":"2024","journal-title":"Dig Disc"},{"volume-title":"Twelfth Int Conf Learning Representations","year":"2023","author":"Yu Q","key":"S0219720025500064BIB027"},{"key":"S0219720025500064BIB028","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbae693"},{"key":"S0219720025500064BIB029","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-023-00759-6"},{"key":"S0219720025500064BIB030","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-022-28494-3"},{"key":"S0219720025500064BIB031","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkv951"},{"key":"S0219720025500064BIB032","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.26"},{"key":"S0219720025500064BIB033","first-page":"27730","volume":"35","author":"Ouyang L","year":"2022","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB036","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403104"},{"key":"S0219720025500064BIB037","doi-asserted-by":"publisher","DOI":"10.1109\/TCBB.2024.3434461"},{"key":"S0219720025500064BIB038","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-023-00640-6"},{"key":"S0219720025500064BIB039","doi-asserted-by":"publisher","DOI":"10.1186\/s13321-020-00460-5"},{"key":"S0219720025500064BIB040","doi-asserted-by":"publisher","DOI":"10.1186\/1758-2946-2-1"},{"issue":"1","key":"S0219720025500064BIB041","first-page":"4","volume":"1","author":"Landrum G","year":"2013","journal-title":"Release"},{"key":"S0219720025500064BIB042","doi-asserted-by":"publisher","DOI":"10.1186\/1758-2946-3-33"},{"key":"S0219720025500064BIB043","doi-asserted-by":"publisher","DOI":"10.1016\/j.drudis.2022.103373"},{"key":"S0219720025500064BIB045","doi-asserted-by":"publisher","DOI":"10.1021\/ci600423u"},{"key":"S0219720025500064BIB046","doi-asserted-by":"publisher","DOI":"10.1016\/j.sbi.2021.10.001"},{"key":"S0219720025500064BIB047","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-022-01288-4"},{"key":"S0219720025500064BIB048","doi-asserted-by":"publisher","DOI":"10.1186\/s13321-024-00843-y"},{"key":"S0219720025500064BIB049","doi-asserted-by":"publisher","DOI":"10.3389\/fphar.2020.565644"},{"key":"S0219720025500064BIB050","doi-asserted-by":"publisher","DOI":"10.1021\/ci049714+"},{"key":"S0219720025500064BIB052","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkr777"},{"key":"S0219720025500064BIB053","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-024-00977-6"},{"key":"S0219720025500064BIB054","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2014.22"},{"key":"S0219720025500064BIB055","doi-asserted-by":"publisher","DOI":"10.1021\/ci300415d"},{"key":"S0219720025500064BIB057","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00276"},{"key":"S0219720025500064BIB060","first-page":"1877","volume":"33","author":"Brown T","year":"2020","journal-title":"Adv Neural Inf Process Syst"},{"key":"S0219720025500064BIB062","doi-asserted-by":"publisher","DOI":"10.1002\/cmdc.200800178"},{"key":"S0219720025500064BIB063","doi-asserted-by":"publisher","DOI":"10.1021\/jm9602928"},{"key":"S0219720025500064BIB064","doi-asserted-by":"publisher","DOI":"10.1021\/jm901137j"}],"container-title":["Journal of Bioinformatics and Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S0219720025500064","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,11]],"date-time":"2025-06-11T01:43:32Z","timestamp":1749606212000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/10.1142\/S0219720025500064"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4]]},"references-count":51,"journal-issue":{"issue":"02","published-print":{"date-parts":[[2025,4]]}},"alternative-id":["10.1142\/S0219720025500064"],"URL":"https:\/\/doi.org\/10.1142\/s0219720025500064","relation":{},"ISSN":["0219-7200","1757-6334"],"issn-type":[{"type":"print","value":"0219-7200"},{"type":"electronic","value":"1757-6334"}],"subject":[],"published":{"date-parts":[[2025,4]]},"article-number":"2550006"}}