{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T06:57:11Z","timestamp":1770706631077,"version":"3.49.0"},"reference-count":64,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T00:00:00Z","timestamp":1770595200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T00:00:00Z","timestamp":1770595200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Ministry of Science, Technology and Innovation of the Republic of Serbia"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>We propose that word embeddings of atoms derived from scientific literature are revisited as autonomous machine learning predictors in materials design. If static word embeddings encode comprehensive physicochemical information, joined embeddings of chemical elements constituting a chemical compound represent a viable source of physicochemical knowledge. Nevertheless, static word embeddings are susceptible to variability due to information heterogeneity within training material. We analysed whether variability occurs in embeddings affiliated with physicochemical entities, including explicit atoms, and whether it affects therein-encoded domain-specialized information or inhibits the information transfer. Results demonstrate the substantial variability in individual atomic embeddings, which is highly dependent on vocabulary terms selected for language modelling. 
Regardless, variability does not obstruct the mapping of materials' composite predictors into physicochemical properties when joined atomic embeddings are implemented within a regression model estimating the compound stability by predicting its formation energy. Moreover, the encoded information and the model's predictive performance maintained stability following compound vector calibration via dimensional reduction.<\/jats:p>\n                  <jats:p>Scientific contribution<\/jats:p>\n                  <jats:p>The magnitude of variability in word embeddings of physicochemical entities, including chemical elements, occurring due to information heterogeneity in complementary training material of materials science, chemistry, and physics scientific literature was observed and quantified. The research shows that notable variability of vectorial representations of chemical elements does not obstruct the underlying statistical properties, nor does it inhibit the information transfer. Accordingly, regardless of their origin, conjoined atomic embeddings representing chemical compounds facilitate stable predictive performance when implemented within a regression model.<\/jats:p>","DOI":"10.1186\/s13321-025-01149-3","type":"journal-article","created":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T13:26:40Z","timestamp":1770643600000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Word embeddings as autonomous predictors in materials design\u2014the effect of inherent variability on information 
transfer"],"prefix":"10.1186","volume":"18","author":[{"given":"Jana","family":"Radakovi\u0107","sequence":"first","affiliation":[]},{"given":"Katarina","family":"Batalovi\u0107","sequence":"additional","affiliation":[]},{"given":"Nikola","family":"Novakovi\u0107","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,2,9]]},"reference":[{"key":"1149_CR1","doi-asserted-by":"publisher","first-page":"599","DOI":"10.1021\/acs.accounts.9b00470","volume":"53","author":"JM Cole","year":"2020","unstructured":"Cole JM (2020) A design-to-device pipeline for data-driven materials discovery. Acc Chem Res 53:599\u2013610","journal-title":"Acc Chem Res"},{"key":"1149_CR2","doi-asserted-by":"publisher","first-page":"813","DOI":"10.1039\/D1RE00560J","volume":"7","author":"SP Batchu","year":"2022","unstructured":"Batchu SP, Hernandez B, Malhotra A, Fang H, Ierapetritou M, Vlachos DG (2022) Accelerating manufacturing for biomass conversion via integrated process and bench digitalization: a perspective. React Chem Eng 7:813\u2013832","journal-title":"React Chem Eng"},{"key":"1149_CR3","doi-asserted-by":"publisher","first-page":"7217","DOI":"10.1021\/acs.chemmater.1c01368","volume":"33","author":"CJ Court","year":"2021","unstructured":"Court CJ, Jain A, Cole JM (2021) Inverse design of materials that exhibit the magnetocaloric effect by text-mining of the scientific literature and generative deep learning. Chem Mater 33:7217\u20137231","journal-title":"Chem Mater"},{"key":"1149_CR4","doi-asserted-by":"publisher","first-page":"102","DOI":"10.1038\/s41524-022-00784-w","volume":"8","author":"T Gupta","year":"2022","unstructured":"Gupta T, Zaki M, Krishnan NMA, Mausam (2022) MatSciBERT: A materials domain language model for text mining and information extraction. npj Comput Mater 8:102. 
https:\/\/doi.org\/10.1038\/s41524-022-00784-w","journal-title":"npj Comput Mater"},{"key":"1149_CR5","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1021\/acs.jcim.7b00616","volume":"58","author":"S Jaeger","year":"2018","unstructured":"Jaeger S, Fulle S, Turk S (2018) Mol2vec: unsupervised machine learning approach with chemical intuition. J Chem Inf Model 58:27\u201335","journal-title":"J Chem Inf Model"},{"key":"1149_CR6","doi-asserted-by":"publisher","first-page":"5470","DOI":"10.1039\/D1TA10688K","volume":"10","author":"Z Shu","year":"2022","unstructured":"Shu Z, Yan H, Chen H, Cai Y (2022) Mutual modulation via charge transfer and unpaired electrons of catalytic sites for the superior intrinsic activity of N2 reduction: from high-throughput computation assisted with a machine learning perspective. J Mater Chem A 10:5470\u20135478","journal-title":"J Mater Chem A"},{"key":"1149_CR7","doi-asserted-by":"publisher","first-page":"4821","DOI":"10.1021\/acs.chemmater.2c00445","volume":"34","author":"A Smith","year":"2022","unstructured":"Smith A, Bhat V, Ai Q, Risko C (2022) Challenges in information-mining the materials literature: a case study and perspective. Chem Mater 34:4821\u20134827","journal-title":"Chem Mater"},{"key":"1149_CR8","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1038\/s41586-019-1335-8","volume":"571","author":"V Tshitoyan","year":"2019","unstructured":"Tshitoyan V, Dagdelen J, Weston L, Dunn A, Rong Z, Kononova O et al (2019) Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571:95\u201398","journal-title":"Nature"},{"key":"1149_CR9","doi-asserted-by":"publisher","first-page":"781","DOI":"10.1093\/bib\/bbaa296","volume":"22","author":"LL Wang","year":"2021","unstructured":"Wang LL, Lo K (2021) Text mining approaches for dealing with the rapidly expanding literature on COVID-19. 
Brief Bioinform 22:781\u2013799","journal-title":"Brief Bioinform"},{"key":"1149_CR10","doi-asserted-by":"publisher","first-page":"1233","DOI":"10.1039\/D3DD00113J","volume":"2","author":"KM Jablonka","year":"2023","unstructured":"Jablonka KM, Ai Q, Al-Feghali A, Badhwar S, Bocarsly JD, Bran AM et al (2023) 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon. Digital Discovery 2:1233\u20131250","journal-title":"Digital Discovery"},{"key":"1149_CR11","doi-asserted-by":"publisher","first-page":"368","DOI":"10.1039\/D2DD00087C","volume":"2","author":"AD White","year":"2023","unstructured":"White AD, Hocky GM, Gandhi HA, Ansari M, Cox S, Wellawatte GP et al (2023) Assessment of chemistry knowledge in large language models that generate code. Digital Discovery 2:368\u2013376","journal-title":"Digital Discovery"},{"key":"1149_CR12","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1038\/s42256-023-00788-1","volume":"6","author":"KM Jablonka","year":"2024","unstructured":"Jablonka KM, Schwaller P, Ortega-Guerrero A, Smit B (2024) Leveraging large language models for predictive chemistry. Nat Mach Intell 6:161\u2013169","journal-title":"Nat Mach Intell"},{"key":"1149_CR13","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1039\/D1DD00009H","volume":"1","author":"GM Hocky","year":"2022","unstructured":"Hocky GM, White AD (2022) Natural language processing models that automate programming will transform chemistry research and teaching. Digit Discov 1:79\u201383","journal-title":"Digit Discov"},{"key":"1149_CR14","doi-asserted-by":"publisher","DOI":"10.1016\/j.isci.2024.110780","volume":"27","author":"D Cao","year":"2024","unstructured":"Cao D, Chan MK (2024) Enhancing chemical synthesis research with NLP: Word embeddings for chemical reagent identification\u2014a case study on nano-FeCu. 
iScience 27:110780","journal-title":"iScience"},{"key":"1149_CR15","first-page":"372","volume":"2","author":"M Yoshitake","year":"2022","unstructured":"Yoshitake M, Sato F, Kawano H, Teraoka H (2022) MaterialBERT for natural language processing of materials science texts. Science and Technology of Advanced Materials: Methods 2:372\u2013380","journal-title":"Science and Technology of Advanced Materials: Methods"},{"key":"1149_CR16","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.15759","author":"X Lei","year":"2023","unstructured":"Lei X, Kim E, Baibakova V, Sun S (2023) Lessons in Reproducibility: Insights from NLP Studies in Materials Science. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2307.15759","journal-title":"arXiv"},{"key":"1149_CR17","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2411.05466","author":"L Zhang","year":"2024","unstructured":"Zhang L, Banko L, Schuhmann W, Ludwig A, Stricker M (2024) Composition-property extrapolation for compositionally complex solid solutions based on word embeddings. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2411.05466","journal-title":"arXiv"},{"key":"1149_CR18","doi-asserted-by":"publisher","DOI":"10.1186\/s13321-025-00984-8","volume":"17","author":"S Ishida","year":"2025","unstructured":"Ishida S, Sato T, Honma T, Terayama K (2025) Large language models open new way of AI-assisted molecule design for chemists. J Cheminform 17:36","journal-title":"J Cheminform"},{"key":"1149_CR19","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2021.104259","volume":"131","author":"SM Ali Shah","year":"2021","unstructured":"Ali Shah SM, Taju SW, Ho Q-T, Nguyen T-T-D, Ou Y-Y (2021) GT-finder: classify the family of glucose transporters with pre-trained BERT language models. 
Comput Biol Med 131:104259","journal-title":"Comput Biol Med"},{"key":"1149_CR20","doi-asserted-by":"publisher","DOI":"10.1186\/s13321-018-0313-8","volume":"10","author":"P Corbett","year":"2018","unstructured":"Corbett P, Boyle J (2018) Chemlistem: chemical named entity recognition using recurrent neural networks. J Cheminform 10:59","journal-title":"J Cheminform"},{"key":"1149_CR21","doi-asserted-by":"crossref","unstructured":"Chiu B, Crichton G, Korhonen A, Pyysalo S. 2016. How to Train good Word Embeddings for Biomedical NLP. In: Cohen K B, Demner-Fushman D, Tsujii J (eds) Proceedings of the 15th Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics. Berlin. p 166.","DOI":"10.18653\/v1\/W16-2922"},{"key":"1149_CR22","doi-asserted-by":"publisher","first-page":"2035","DOI":"10.1021\/acs.jcim.1c00284","volume":"62","author":"J Guo","year":"2022","unstructured":"Guo J, Ibanez-Lopez AS, Gao H, Quach V, Coley CW, Jensen KF, Barzilay R (2022) Automated chemical reaction extraction from scientific literature. J Chem Inf Model 62:2035\u20132045","journal-title":"J Chem Inf Model"},{"key":"1149_CR23","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-018-35934-y","volume":"8","author":"D Jha","year":"2018","unstructured":"Jha D, Ward L, Paul A, Liao W, Choudhary A, Wolverton C et al (2018) ElemNet: deep learning the chemistry of materials from only elemental composition. Sci Rep 8:17593","journal-title":"Sci Rep"},{"key":"1149_CR24","doi-asserted-by":"publisher","DOI":"10.1016\/j.yjbinx.2019.100057","volume":"100","author":"FK Khattak","year":"2019","unstructured":"Khattak FK, Jeblee S, Pou-Prom C, Abdalla M, Meaney C, Rudzicz F (2019) A survey of word embeddings for clinical text. 
J Biomed Inform 100:100057","journal-title":"J Biomed Inform"},{"key":"1149_CR25","doi-asserted-by":"publisher","DOI":"10.1186\/gb-2008-9-s2-s8","volume":"9","author":"M Krallinger","year":"2008","unstructured":"Krallinger M, Valencia A, Hirschman L (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 9:S8","journal-title":"Genome Biol"},{"key":"1149_CR26","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-022-33397-4","volume":"13","author":"D Miller","year":"2022","unstructured":"Miller D, Stern A, Burstein D (2022) Deciphering microbial gene function using natural language processing. Nat Commun 13:5731","journal-title":"Nat Commun"},{"key":"1149_CR27","doi-asserted-by":"publisher","DOI":"10.1186\/1758-2946-7-S1-S9","volume":"7","author":"T Munkhdalai","year":"2015","unstructured":"Munkhdalai T, Li M, Batsuren K, Park HA, Choi NH, Ryu KH (2015) Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations. J Cheminform 7:S9","journal-title":"J Cheminform"},{"key":"1149_CR28","doi-asserted-by":"crossref","unstructured":"Neumann M, King D, Beltagy I, Ammar W (2019) ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In: Demner-Fushman, D, Cohen K B, Ananiadou S, Tsujii, J (eds) Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistics. Florence. 319\u2013327.","DOI":"10.18653\/v1\/W19-5034"},{"key":"1149_CR29","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1706.02241","author":"D Newman-Griffis","year":"2017","unstructured":"Newman-Griffis D, Lai AM, Fosler-Lussier E (2017) Insights into analogy completion from the biomedical domain. arXiv. 
https:\/\/doi.org\/10.48550\/arXiv.1706.02241","journal-title":"arXiv"},{"key":"1149_CR30","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1301.3781","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv. https:\/\/doi.org\/10.48550\/arXiv.1301.3781","journal-title":"arXiv"},{"key":"1149_CR31","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1310.4546","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. arXiv. https:\/\/doi.org\/10.48550\/arXiv.1310.4546","journal-title":"arXiv"},{"key":"1149_CR32","doi-asserted-by":"crossref","unstructured":"Levy O, Goldberg Y (2014) Linguistic Regularities in Sparse and Explicit Word Representations. In: Roser M, Scott W Y (eds) Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics. Ann Arbor, p 171\u2013180.","DOI":"10.3115\/v1\/W14-1618"},{"key":"1149_CR33","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1162\/tacl_a_00134","volume":"3","author":"O Levy","year":"2015","unstructured":"Levy O, Goldberg Y, Dagan I (2015) Improving distributional similarity with lessons learned from word embeddings. TACL 3:211\u2013225","journal-title":"TACL"},{"key":"1149_CR34","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1162\/tacl_a_00008","volume":"6","author":"M Antoniak","year":"2018","unstructured":"Antoniak M, Mimno D (2018) Evaluating the stability of embedding-based word similarities. Trans Assoc Comput Linguist 6:107\u2013119","journal-title":"Trans Assoc Comput Linguist"},{"key":"1149_CR35","doi-asserted-by":"crossref","unstructured":"Burdick L, Kummerfeld JK, Mihalcea R. (2021) Analyzing the Surprising Variability in Word Embedding Stability Across Languages. 
In: Marie-Francine M, Xuanjing H, Lucia S, Scott W Y (eds) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. 5891\u20135901.","DOI":"10.18653\/v1\/2021.emnlp-main.476"},{"key":"1149_CR36","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2003.04983","author":"M Leszczynski","year":"2020","unstructured":"Leszczynski M, May A, Zhang J, Wu S, Aberger CR, R\u00e9 C (2020) Understanding the downstream instability of word embeddings. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2003.04983","journal-title":"arXiv"},{"key":"1149_CR37","doi-asserted-by":"crossref","unstructured":"Pierrejean B, Tanguy L. (2018) Towards Qualitative Word Embeddings Evaluation: Measuring Neighbors Variation. In: Silvio R C, Shereen O, Umashanthi P, Kyeongmin R (eds) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics. New Orleans. p 32.","DOI":"10.18653\/v1\/N18-4005"},{"key":"1149_CR38","doi-asserted-by":"crossref","unstructured":"Wendlandt L, Kummerfeld J K, Mihalcea R (2018) Factors Influencing the Surprising Instability of Word Embeddings In: Marilyn W, Heng J, Amanda S (eds) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. New Orleans. p 2092.","DOI":"10.18653\/v1\/N18-1190"},{"key":"1149_CR39","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"crossref","first-page":"812","DOI":"10.1007\/978-3-030-03991-2_73","volume-title":"AI 2018: Advances in Artificial Intelligence","author":"M Chugh","year":"2018","unstructured":"Chugh M, Whigham PA, Dick G (2018) Stability of Word Embeddings Using Word2vec. 
In: Mitrovic T, Xue B, Li X (eds) AI 2018: Advances in Artificial Intelligence. Lecture Notes in Computer Science. Springer, Cham, p 812"},{"key":"1149_CR40","doi-asserted-by":"crossref","unstructured":"Pennington J, Socher R, Manning C (2014) Glove: Global Vectors for Word Representation. In: Alessandro M, Bo P, Walter D (eds) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. Qatar. p 1532.","DOI":"10.3115\/v1\/D14-1162"},{"key":"1149_CR41","doi-asserted-by":"publisher","first-page":"391","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9","volume":"41","author":"S Deerwester","year":"1990","unstructured":"Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41:391\u2013407","journal-title":"Journal of the American Society for Information Science"},{"key":"1149_CR42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s43246-024-00449-9","volume":"5","author":"J Choi","year":"2024","unstructured":"Choi J, Lee B (2024) Accelerating materials language processing with large language models. Commun Mater 5:1\u201311","journal-title":"Commun Mater"},{"key":"1149_CR43","doi-asserted-by":"publisher","first-page":"1257","DOI":"10.1039\/D4DD00074A","volume":"3","author":"G Lei","year":"2024","unstructured":"Lei G, Docherty R, Cooper SJ (2024) Materials science in the era of large language models: a perspective. Digit Discov 3:1257\u20131272","journal-title":"Digit Discov"},{"key":"1149_CR44","doi-asserted-by":"publisher","DOI":"10.1038\/s43246-024-00449-9","volume":"5","author":"J Choi","year":"2024","unstructured":"Choi J, Lee B (2024) Accelerating materials language processing with large language models. 
Commun Mater 5:13","journal-title":"Commun Mater"},{"key":"1149_CR45","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-021-88027-8","volume":"11","author":"AS Krishnapriyan","year":"2021","unstructured":"Krishnapriyan AS, Montoya J, Haranczyk M, Hummelsh\u00f8j J, Morozov D (2021) Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks. Sci Rep 11:8888","journal-title":"Sci Rep"},{"key":"1149_CR46","doi-asserted-by":"crossref","unstructured":"Pierrejean B, Tanguy L. 2018. In: Nissim M, Berant J, Lenci A (eds) Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics. Predicting Word Embeddings Variability. Association for Computational Linguistics. New Orleans. p 154.","DOI":"10.18653\/v1\/S18-2019"},{"key":"1149_CR47","doi-asserted-by":"publisher","first-page":"245","DOI":"10.1016\/S0004-3702(97)00063-5","volume":"97","author":"AL Blum","year":"1997","unstructured":"Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97:245\u2013271","journal-title":"Artif Intell"},{"key":"1149_CR48","doi-asserted-by":"publisher","first-page":"16028","DOI":"10.1038\/npjcompumats.2016.28","volume":"2","author":"L Ward","year":"2016","unstructured":"Ward L, Agrawal A, Choudhary A, Wolverton C (2016) A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2:16028","journal-title":"npj Comput Mater"},{"key":"1149_CR49","doi-asserted-by":"crossref","unstructured":"Lo K, Wang L L, Neumann M, Kinney R, Weld D. 2020. The semantic scholar open research corpus. In: Jurafsky D, Chai J, Schluter N, Tetreault J (eds) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 
p 4969","DOI":"10.18653\/v1\/2020.acl-main.447"},{"key":"1149_CR50","doi-asserted-by":"crossref","unstructured":"Lo K, Wang L L. 2020. S2ORC: the semantic scholar open research corpus https:\/\/github.com\/allenai\/s2orc. Accessed 25 Jan 2021","DOI":"10.18653\/v1\/2020.acl-main.447"},{"key":"1149_CR51","doi-asserted-by":"publisher","first-page":"1894","DOI":"10.1021\/acs.jcim.6b00207","volume":"56","author":"MC Swain","year":"2016","unstructured":"Swain MC, Cole JM (2016) ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J Chem Inf Model 56:1894\u20131904","journal-title":"J Chem Inf Model"},{"key":"1149_CR52","unstructured":"Rehurek R, Sojka P (2010) Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, p 45"},{"key":"1149_CR53","doi-asserted-by":"crossref","unstructured":"\u00c1cs J, K\u00e1d\u00e1r \u00c1, Kornai A (2021) Subword Pooling Makes a Difference. In: Merlo P, Tiedemann J, Tsarfaty R (eds) Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, Association for Computational Linguistics, p 2284\u20132295.","DOI":"10.18653\/v1\/2021.eacl-main.194"},{"key":"1149_CR54","doi-asserted-by":"crossref","unstructured":"Tenney I, Dipanjan D, and Pavlick E (2019) BERT Rediscovers the Classical NLP Pipeline. In: Korhonen A, Traum D, M\u00e0rquez L (eds) Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy. Association for Computational Linguistics, p 4593\u20134601","DOI":"10.18653\/v1\/P19-1452"},{"key":"1149_CR55","doi-asserted-by":"crossref","unstructured":"Peters M E, Neumann M, Zettlemoyer L, Yih W (2018) Dissecting Contextual Word Embeddings: Architecture and Representation. 
In: Riloff E, Chiang D, Hockenmaier J, Tsujii J (eds) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Brussels, p 1499\u20131509.","DOI":"10.18653\/v1\/D18-1179"},{"key":"1149_CR56","doi-asserted-by":"crossref","unstructured":"Jawahar G, Sagot B, Seddah Dj (2019) What Does BERT Learn about the Structure of Language?. In: Korhonen A, Traum D, M\u00e0rquez L (eds) Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. Florence, p 3651\u20133657","DOI":"10.18653\/v1\/P19-1356"},{"key":"1149_CR57","doi-asserted-by":"crossref","unstructured":"Liu N F, Gardner M, Belinkov Y, Peters M E, Smith N A (2019) Linguistic Knowledge and Transferability of Contextual Representations. In: Burstein J, Doran C, Solorio T (eds) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. Minneapolis, p 1073\u20131094","DOI":"10.18653\/v1\/N19-1112"},{"key":"1149_CR58","doi-asserted-by":"publisher","first-page":"625","DOI":"10.1038\/s41586-024-07421-0","volume":"630","author":"S Farquhar","year":"2024","unstructured":"Farquhar S, Kossen J, Kuhn L, Gal Y (2024) Detecting hallucinations in large language models using semantic entropy. Nature 630:625\u2013630","journal-title":"Nature"},{"key":"1149_CR59","doi-asserted-by":"publisher","first-page":"249","DOI":"10.5586\/asbp.1934.012","volume":"11","author":"D Szymkiewicz","year":"1934","unstructured":"Szymkiewicz D (1934) Une contribution statistique \u00e0 la g\u00e9ographie floristique. 
Acta Soc Bot Pol 11:249\u2013265","journal-title":"Acta Soc Bot Pol"},{"key":"1149_CR60","first-page":"547","volume":"37","author":"P Jaccard","year":"1901","unstructured":"Jaccard P (1901) \u00c9tude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat 37:547\u2013579","journal-title":"Bull Soc Vaudoise Sci Nat"},{"key":"1149_CR61","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1016\/j.commatsci.2018.05.018","volume":"152","author":"L Ward","year":"2018","unstructured":"Ward L, Dunn A, Faghaninia A, Zimmermann NER, Bajaj S, Wang Q et al (2018) Matminer: an open source toolkit for materials data mining. Comput Mater Sci 152:60\u201369","journal-title":"Comput Mater Sci"},{"key":"1149_CR62","doi-asserted-by":"publisher","DOI":"10.1063\/1.4812323","volume":"1","author":"A Jain","year":"2013","unstructured":"Jain A, Ong SP, Hautier G, Chen W, Richards WD, Dacek S et al (2013) Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Mater 1:011002","journal-title":"APL Mater"},{"key":"1149_CR63","doi-asserted-by":"publisher","first-page":"8767","DOI":"10.1021\/acs.chemmater.3c02491","volume":"35","author":"Y Gogotsi","year":"2023","unstructured":"Gogotsi Y (2023) The future of MXenes. Chem Mater 35:8767\u20138770","journal-title":"Chem Mater"},{"key":"1149_CR64","doi-asserted-by":"crossref","unstructured":"John G H, Kohavi R, Pfleger K (1994) Irrelevant Features and the Subset Selection Problem. In: Cohen WW, Hirsh H (eds) Machine Learning Proceedings. Morgan Kaufmann Publishers. San Francisco. 
p 121","DOI":"10.1016\/B978-1-55860-335-6.50023-4"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01149-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-025-01149-3","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-025-01149-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T13:26:51Z","timestamp":1770643611000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1186\/s13321-025-01149-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,9]]},"references-count":64,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,12]]}},"alternative-id":["1149"],"URL":"https:\/\/doi.org\/10.1186\/s13321-025-01149-3","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,9]]},"assertion":[{"value":"9 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 December 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 February 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"20"}}