{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T17:41:51Z","timestamp":1772818911734,"version":"3.50.1"},"reference-count":34,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2024,6,27]],"date-time":"2024-06-27T00:00:00Z","timestamp":1719446400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Science and Higher Education of the Republic of Kazakhstan","award":["AP19677756"],"award-info":[{"award-number":["AP19677756"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>This study introduces an unsupervised term extraction approach that combines non-negative matrix factorization (NMF) with word embeddings. Inspired by a pioneering semantic NMF method that employs regularization to jointly optimize document\u2013word and word\u2013word matrix factorizations for document clustering, we adapt this strategy for term extraction. Typically, a word\u2013word matrix representing semantic relationships between words is constructed using cosine similarities between word embeddings. However, it has been established that transformer encoder embeddings tend to reside within a narrow cone, leading to consistently high cosine similarities between words. To address this issue, we replace the conventional word\u2013word matrix with a word\u2013seed submatrix, restricting columns to \u2018domain seeds\u2019\u2014specific words that encapsulate the essential semantic features of the domain. Therefore, we propose a modified NMF framework that jointly factorizes the document\u2013word and word\u2013seed matrices, producing more precise encoding vectors for words, which we utilize to extract high-relevancy topic-related terms. Our modification significantly improves term extraction effectiveness, marking the first implementation of semantically enhanced NMF, designed specifically for the task of term extraction. Comparative experiments demonstrate that our method outperforms both traditional NMF and advanced transformer-based methods such as KeyBERT and BERTopic. To support further research and application, we compile and manually annotate two new datasets, each containing 1000 sentences, from the \u2018Geography and History\u2019 and \u2018National Heroes\u2019 domains. These datasets are useful for both term extraction and document classification tasks. All related code and datasets are freely available.<\/jats:p>","DOI":"10.3390\/bdcc8070072","type":"journal-article","created":{"date-parts":[[2024,6,27]],"date-time":"2024-06-27T13:26:43Z","timestamp":1719494803000},"page":"72","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Semantic Non-Negative Matrix Factorization for Term Extraction"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5522-4421","authenticated-orcid":false,"given":"Aliya","family":"Nugumanova","sequence":"first","affiliation":[{"name":"Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, Astana 010000, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-8083-2366","authenticated-orcid":false,"given":"Almas","family":"Alzhanov","sequence":"additional","affiliation":[{"name":"Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, Astana 010000, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9076-0722","authenticated-orcid":false,"given":"Aiganym","family":"Mansurova","sequence":"additional","affiliation":[{"name":"Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, Astana 010000, Kazakhstan"}]},{"given":"Kamilla","family":"Rakhymbek","sequence":"additional","affiliation":[{"name":"Laboratory of Digital Technologies and Modeling, Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk 070000, Kazakhstan"}]},{"given":"Yerzhan","family":"Baiburin","sequence":"additional","affiliation":[{"name":"Laboratory of Digital Technologies and Modeling, Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk 070000, Kazakhstan"}]}],"member":"1968","published-online":{"date-parts":[[2024,6,27]]},"reference":[{"key":"ref_1","unstructured":"QasemiZadeh, B. (2015). Investigating the Use of Distributional Semantic Models for Co-Hyponym Identification in Special Corpora. [Ph.D. Thesis, National University of Ireland]."},{"key":"ref_2","first-page":"1","article-title":"Computational terminology and filtering of terminological information: Introduction to the special issue","volume":"24","author":"Drouin","year":"2018","journal-title":"Terminology"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Fusco, F., Staar, P., and Antognini, D. (2022). Unsupervised Term Extraction for Highly Technical Domains. arXiv.","DOI":"10.18653\/v1\/2022.emnlp-industry.1"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"3607","DOI":"10.18653\/v1\/2021.findings-acl.316","article-title":"Transforming term extraction: Transformer-based approaches to multilingual term extraction across domains","volume":"2021","author":"Lang","year":"2021","journal-title":"Find. Assoc. Comput. Linguist. ACL-IJCNLP"},{"key":"ref_5","first-page":"254","article-title":"HAMLET: Hybrid adaptable machine learning approach to extract terminology","volume":"27","author":"Terryn","year":"2021","journal-title":"Terminol. Int. J. Theor. Appl. Issues Spec. Commun."},{"key":"ref_6","unstructured":"Hazem, A., Bouhandi, M., Boudin, F., and Daille, B. (2022, January 20\u201325). Cross-lingual and cross-domain transfer learning for automatic term extraction from low resource data. Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Vukovic, R., Heck, M., Ruppik, B.M., van Niekerk, C., Zibrowius, M., and Ga\u0161i\u0107, M. (2022). Dialogue term extraction using transfer learning and topological data analysis. arXiv.","DOI":"10.18653\/v1\/2022.sigdial-1.53"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Qin, Y., Zheng, D., Zhao, T., and Zhang, M. (2013). Chinese terminology extraction using EM-based transfer learning method. Computational Linguistics and Intelligent Text Proceedings of the 14th International Conference, CICLing 2013, Samos, Greece, 24\u201330 March 2013, Springer. Part I.","DOI":"10.1007\/978-3-642-37247-6_12"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"117179","DOI":"10.1016\/j.eswa.2022.117179","article-title":"NMF-based approach to automatic term extraction","volume":"199","author":"Nugumanova","year":"2022","journal-title":"Expert Syst. Appl."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1016\/j.neucom.2022.04.122","article-title":"Improving NMF clustering by leveraging contextual relationships among words","volume":"495","author":"Febrissy","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_11","unstructured":"Lee, D.D., and Seung, H.S. (2000, January 1). Algorithms for non-negative matrix factorization. Proceedings of the Neural Information Processing Systems (NIPS), Denver, CO, USA."},{"key":"ref_12","unstructured":"Gao, J., He, D., Tan, X., Qin, T., Wang, L., and Liu, T.Y. (2019). Representation degeneration problem in training natural language generation models. arXiv."},{"key":"ref_13","unstructured":"Grootendorst, M. (2024, April 29). KeyBERT: Minimal keyword extraction with BERT. Available online: https:\/\/zenodo.org\/records\/8388690."},{"key":"ref_14","unstructured":"Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv."},{"key":"ref_15","first-page":"4","article-title":"Semi-supervised nonnegative matrix factorization","volume":"17","author":"Lee","year":"2009","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"366","DOI":"10.20965\/jaciii.2014.p0366","article-title":"Hierarchical semi-supervised factorization for learning the semantics","volume":"18","author":"Shen","year":"2014","journal-title":"J. Adv. Comput. Intell. Intell. Inform."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Vangara, R., Skau, E., Chennupati, G., Djidjev, H., Tierney, T., Smith, J.P., Bhattarai, M., Stanev, V.G., and Alexandrov, B.S. (2020, January 14\u201317). Semantic nonnegative matrix factorization with automatic model determination for topic modeling. Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA.","DOI":"10.1109\/ICMLA51294.2020.00060"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"117217","DOI":"10.1109\/ACCESS.2021.3106879","article-title":"Finding the number of latent topics with semantic non-negative matrix factorization","volume":"9","author":"Vangara","year":"2021","journal-title":"IEEE Access"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Eren, M.E., Solovyev, N., Bhattarai, M., Rasmussen, K.\u00d8., Nicholas, C., and Alexandrov, B.S. (2022, January 20\u201323). SeNMFk-split: Large corpora topic modeling by semantic non-negative matrix factorization with automatic model selection. Proceedings of the 22nd ACM Symposium on Document Engineering, San Jose, CA, USA.","DOI":"10.1145\/3558100.3563844"},{"key":"ref_20","unstructured":"Budahazy, R., Cheng, L., Huang, Y., Johnson, A., Li, P., Vendrow, J., Wu, Z., Molitor, D., Rebrova, E., and Needell, D. (2021). Analysis of Legal Documents via Non-negative Matrix Factorization Methods. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Vendrow, J., Haddock, J., Rebrova, E., and Needell, D. (2021, January 6\u201311). On a guided nonnegative matrix factorization. Proceedings of the ICASSP 2021\u20132021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413656"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Li, P., Tseng, C., Zheng, Y., Chew, J.A., Huang, L., Jarman, B., and Needell, D. (2022). Guided semi-supervised non-negative matrix factorization. Algorithms, 15.","DOI":"10.3390\/a15050136"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"545","DOI":"10.1007\/s10898-014-0247-2","article-title":"SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering","volume":"62","author":"Kuang","year":"2015","journal-title":"J. Glob. Optim."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"2550","DOI":"10.1109\/TCYB.2020.2969684","article-title":"Semisupervised adaptive symmetric non-negative matrix factorization","volume":"51","author":"Jia","year":"2020","journal-title":"IEEE Trans. Cybern."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Jing, L., Yu, J., Zeng, T., and Zhu, Y. (2012, January 4\u20137). Semi-supervised clustering via constrained symmetric non-negative matrix factorization. Proceedings of the Brain Informatics: International Conference, Macau, China.","DOI":"10.1007\/978-3-642-35139-6_29"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"54","DOI":"10.1007\/s42452-019-1836-y","article-title":"Novel semantic tagging detection algorithms based non-negative matrix factorization","volume":"2","author":"Gadelrab","year":"2020","journal-title":"SN Appl. Sci."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Esposito, F. (2021). A review on initialization methods for nonnegative matrix factorization: Towards omics data experiments. Mathematics, 9.","DOI":"10.3390\/math9091006"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"2217","DOI":"10.1016\/j.patcog.2004.02.013","article-title":"Improving non-negative matrix factorizations through structured initialization","volume":"37","author":"Wild","year":"2004","journal-title":"Pattern Recognit."},{"key":"ref_29","unstructured":"Nannen, V. (2003). The Paradox of Overfitting. [Master\u2019s Thesis, Faculty of Science and Engineering, Rijksuniversiteit Groningen]. Available online: https:\/\/fse.studenttheses.ub.rug.nl\/id\/eprint\/8664."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1109\/TPAMI.2006.60","article-title":"Nonsmooth nonnegative matrix factorization (nsNMF)","volume":"28","author":"Carazo","year":"2006","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Manning, D., Raghavan, P., and Sch\u00fctze, H. (2008). Introduction to Information Retrieval, Cambridge University Press.","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_32","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_34","unstructured":"Lopes, L., Vieira, R., and Fernandes, P. (2012, January 16\u201319). Domain term relevance through tf-dcf. Proceedings of the 2012 International Conference on Artificial Intelligence (ICAI), Las Vegas, NV, USA."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/7\/72\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:06:35Z","timestamp":1760108795000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/7\/72"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,27]]},"references-count":34,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2024,7]]}},"alternative-id":["bdcc8070072"],"URL":"https:\/\/doi.org\/10.3390\/bdcc8070072","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,27]]}}}