{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T06:45:01Z","timestamp":1768977901182,"version":"3.49.0"},"reference-count":36,"publisher":"IOP Publishing","issue":"1","license":[{"start":{"date-parts":[[2025,1,29]],"date-time":"2025-01-29T00:00:00Z","timestamp":1738108800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2025,1,29]],"date-time":"2025-01-29T00:00:00Z","timestamp":1738108800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/iopscience.iop.org\/info\/page\/text-and-data-mining"}],"content-domain":{"domain":["iopscience.iop.org"],"crossmark-restriction":false},"short-container-title":["Mach. Learn.: Sci. Technol."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Recent advancements in large foundation models have revealed impressive capabilities in mastering complex chemical language representations. These models undergo a task-agnostic learning phase, characterized by pre-training on extensive unlabeled corpora followed by fine-tuning on specific downstream tasks. This methodology reduces reliance on labeled data, facilitating data acquisition and broadening the scope of chemical language representation. However, real-world scenarios often pose challenges due to domain shift, a phenomenon where the data distribution in downstream tasks differs from that of the pre-training phase, potentially degrading model performance. To address this, we present a novel causal-based framework for feature selection and domain adaptation to enhance the performance of chemical foundation models on downstream tasks. Our approach employs a multi-stage feature selection method that identifies physico-chemical features based on their direct causal-effect over specific downstream properties. 
By employing Mordred descriptors and Markov blanket causal graphs, our approach provides insight into the causal relationships between features and target properties for prediction tasks. We evaluate our approach on various foundation model architectures and datasets, demonstrating performance improvements, which showcases the robustness and the agnostic nature of our approach.<\/jats:p>","DOI":"10.1088\/2632-2153\/adabb1","type":"journal-article","created":{"date-parts":[[2025,1,17]],"date-time":"2025-01-17T22:55:03Z","timestamp":1737154503000},"page":"015017","update-policy":"https:\/\/doi.org\/10.1088\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Causality-driven feature selection and domain adaptation for enhancing chemical foundation models in downstream tasks"],"prefix":"10.1088","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2634-8270","authenticated-orcid":true,"given":"Eduardo","family":"Soares","sequence":"first","affiliation":[]},{"given":"Victor Yukio","family":"Shirasuna","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4982-3836","authenticated-orcid":true,"given":"Emilio Vital","family":"Brazil","sequence":"additional","affiliation":[]},{"given":"Karen Fiorella","family":"Aquino Gutierrez","sequence":"additional","affiliation":[]},{"given":"Renato","family":"Cerqueira","sequence":"additional","affiliation":[]},{"given":"Dmitry","family":"Zubarev","sequence":"additional","affiliation":[]},{"given":"Kristin","family":"Schmidt","sequence":"additional","affiliation":[]},{"given":"Daniel P","family":"Sanders","sequence":"additional","affiliation":[]}],"member":"266","published-online":{"date-parts":[[2025,1,29]]},"reference":[{"key":"mlstadabb1bib1","doi-asserted-by":"publisher","first-page":"457","DOI":"10.1038\/s41570-023-00502-0","article-title":"The future of chemistry is language","volume":"7","author":"White","year":"2023","journal-title":"Nat. Rev. 
Chem."},{"key":"mlstadabb1bib2","article-title":"ChemBERTa-2: towards chemical foundation models","author":"Ahmad","year":"2022"},{"key":"mlstadabb1bib3","doi-asserted-by":"publisher","DOI":"10.1088\/2632-2153\/ac3ffb","article-title":"Chemformer: a pre-trained transformer for computational chemistry","volume":"3","author":"Irwin","year":"2022","journal-title":"Mach. Learn.: Sci. Technol."},{"key":"mlstadabb1bib4","article-title":"On the opportunities and risks of foundation models","author":"Bommasani","year":"2021"},{"key":"mlstadabb1bib5","first-page":"pp 3833","article-title":"Rethinking pre-training and self-training","volume":"vol 33","author":"Zoph","year":"2020"},{"key":"mlstadabb1bib6","article-title":"ChatGPT is not enough: enhancing large language models with knowledge graphs for fact-aware language modeling","author":"Yang","year":"2023"},{"key":"mlstadabb1bib7","first-page":"pp 1","article-title":"Efficient domain adaptation for speech foundation models","author":"Li","year":"2023"},{"key":"mlstadabb1bib8","doi-asserted-by":"publisher","first-page":"3370","DOI":"10.1021\/acs.jcim.9b00237","article-title":"Analyzing learned molecular representations for property prediction","volume":"59","author":"Yang","year":"2019","journal-title":"J. Chem. Inform. Mod."},{"key":"mlstadabb1bib9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13321-018-0258-y","article-title":"Mordred: a molecular descriptor calculator","volume":"10","author":"Moriwaki","year":"2018","journal-title":"J. Chem."},{"key":"mlstadabb1bib10","doi-asserted-by":"publisher","first-page":"1466","DOI":"10.1002\/jcc.21707","article-title":"PaDEL-Descriptor: an open source software to calculate molecular descriptors and fingerprints","volume":"32","author":"Yap","year":"2011","journal-title":"J. Comput. 
Chem."},{"key":"mlstadabb1bib11","doi-asserted-by":"publisher","first-page":"754","DOI":"10.1038\/s42256-023-00683-9","article-title":"Application of variational graph encoders as an effective generalist algorithm in computer-aided drug design","volume":"5","author":"Lam","year":"2023","journal-title":"Nat. Mach. Intell."},{"key":"mlstadabb1bib12","doi-asserted-by":"publisher","DOI":"10.1016\/j.fuel.2022.123836","article-title":"A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties","volume":"321","author":"Comesana","year":"2022","journal-title":"Fuel"},{"key":"mlstadabb1bib13","first-page":"21","article-title":"HITON: a novel Markov blanket algorithm for optimal variable selection AMIA annual symposium proceedings","volume":"21","author":"Aliferis","year":"2003","journal-title":"Am. Med. Inform. Assoc."},{"key":"mlstadabb1bib14","first-page":"pp 79","article-title":"Causal Feature Selection","author":"Guyon","year":"2007"},{"key":"mlstadabb1bib15","first-page":"pp 6430","article-title":"Multi-label causal feature selection","volume":"vol 34","author":"Wu","year":"2020"},{"key":"mlstadabb1bib16","article-title":"Causal feature selection with dual correction","author":"Guo","year":"2022"},{"key":"mlstadabb1bib17","first-page":"171","article-title":"Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: algorithms and empirical evaluation","volume":"11","author":"Aliferis","year":"2010","journal-title":"J. Mach. Learn. Res."},{"key":"mlstadabb1bib18","doi-asserted-by":"publisher","first-page":"1256","DOI":"10.1038\/s42256-022-00580-7","article-title":"Large-scale chemical language representations capture molecular structure and properties","volume":"4","author":"Ross","year":"2022","journal-title":"Nat. Mach. 
Intell."},{"key":"mlstadabb1bib19","article-title":"MHG-GNN: combination of molecular hypergraph grammar with graph neural network","author":"Kishimoto","year":"2023"},{"key":"mlstadabb1bib20","article-title":"Mamba: linear-time sequence modeling with selective state spaces","author":"Gu","year":"2023"},{"key":"mlstadabb1bib21","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1039\/C7SC02664A","article-title":"MoleculeNet: a benchmark for molecular machine learning","volume":"9","author":"Wu","year":"2018","journal-title":"Chem. sci."},{"key":"mlstadabb1bib22","doi-asserted-by":"publisher","first-page":"3649","DOI":"10.1021\/acsomega.1c06274","article-title":"A comparative study of the performance for predicting biodegradability classification: the quantitative structure\u2013activity relationship model vs the graph convolutional network","volume":"7","author":"Lee","year":"2022","journal-title":"ACS Omega"},{"key":"mlstadabb1bib23","doi-asserted-by":"publisher","first-page":"5793","DOI":"10.1021\/acs.jcim.1c01204","article-title":"Uncertainty-informed deep transfer learning of perfluoroalkyl and polyfluoroalkyl substance toxicity","volume":"61","author":"Feinstein","year":"2021","journal-title":"J. Chem. Inform. Mod."},{"key":"mlstadabb1bib24","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1023\/A:1023247831238","article-title":"An analysis of the autocorrelation descriptor for molecules","volume":"33","author":"Hollas","year":"2003","journal-title":"J. Math. 
Chem."},{"key":"mlstadabb1bib25","article-title":"Toward optimal feature selection","author":"Koller","year":"1996"},{"key":"mlstadabb1bib26","article-title":"PPFS: predictive permutation feature selection","author":"Hassan","year":"2021"},{"key":"mlstadabb1bib27","article-title":"A large encoder-decoder family of foundation models for chemical language","author":"Soares","year":"2024"},{"key":"mlstadabb1bib28","article-title":"A Mamba-based foundation model for chemistry","author":"Brazil","year":"2024"},{"key":"mlstadabb1bib29","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1021\/acscentsci.6b00367","article-title":"Low data drug discovery with one-shot learning","volume":"3","author":"Altae-Tran","year":"2017","journal-title":"ACS Cent. Sci."},{"key":"mlstadabb1bib30","doi-asserted-by":"publisher","first-page":"8749","DOI":"10.1021\/acs.jmedchem.9b00959","article-title":"Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism","volume":"63","author":"Xiong","year":"2019","journal-title":"J. Med. Chem."},{"key":"mlstadabb1bib31","first-page":"12559","article-title":"Self-supervised graph transformer on large-scale molecular data","volume":"vol 33","author":"Rong","year":"2020"},{"key":"mlstadabb1bib32","doi-asserted-by":"publisher","first-page":"2054","DOI":"10.1021\/acs.est.1c05398","article-title":"Predicting solute descriptors for organic chemicals by a deep neural network (DNN) using basic chemical structures and a surrogate metric","volume":"56","author":"Zhang","year":"2022","journal-title":"Environ. Sci. Technol."},{"key":"mlstadabb1bib33","doi-asserted-by":"publisher","first-page":"127","DOI":"10.1038\/s42256-021-00438-4","article-title":"Geometry-enhanced molecular representation learning for property prediction","volume":"4","author":"Fang","year":"2022","journal-title":"Nat. Mach. 
Intell."},{"key":"mlstadabb1bib34","doi-asserted-by":"crossref","DOI":"10.21203\/rs.3.rs-2425375\/v1","article-title":"Bidirectional generation of structure and properties through a single molecular foundation model","author":"Chang","year":"2023"},{"key":"mlstadabb1bib35","doi-asserted-by":"crossref","DOI":"10.26434\/chemrxiv-2022-jjm0j-v4","article-title":"Uni-Mol: a universal 3D molecular representation learning framework","author":"Zhou","year":"2023"},{"key":"mlstadabb1bib36","first-page":"pp 1263","article-title":"Neural message passing for quantum chemistry","author":"Gilmer","year":"2017"}],"container-title":["Machine Learning: Science and Technology"],"original-title":[],"link":[{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1","content-type":"text\/html","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"similarity-checking"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1\/pdf","content-
type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,29]],"date-time":"2025-01-29T10:42:43Z","timestamp":1738147363000},"score":1,"resource":{"primary":{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adabb1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,29]]},"references-count":36,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,1,29]]},"published-print":{"date-parts":[[2025,3,31]]}},"URL":"https:\/\/doi.org\/10.1088\/2632-2153\/adabb1","relation":{},"ISSN":["2632-2153"],"issn-type":[{"value":"2632-2153","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,29]]},"assertion":[{"value":"Causality-driven feature selection and domain adaptation for enhancing chemical foundation models in downstream tasks","name":"article_title","label":"Article Title"},{"value":"Machine Learning: Science and Technology","name":"journal_title","label":"Journal Title"},{"value":"paper","name":"article_type","label":"Article Type"},{"value":"\u00a9 2025 The Author(s). Published by IOP Publishing Ltd","name":"copyright_information","label":"Copyright Information"},{"value":"2024-09-25","name":"date_received","label":"Date Received","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-01-17","name":"date_accepted","label":"Date Accepted","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-01-29","name":"date_epub","label":"Online publication date","group":{"name":"publication_dates","label":"Publication dates"}}]}}