{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T00:39:58Z","timestamp":1759970398869,"version":"build-2065373602"},"reference-count":41,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2025,1,20]],"date-time":"2025-01-20T00:00:00Z","timestamp":1737331200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003359","name":"Generalitat Valenciana (Spain)","doi-asserted-by":"publisher","award":["PROMETEO CIPROM\/2023\/32"],"award-info":[{"award-number":["PROMETEO CIPROM\/2023\/32"]}],"id":[{"id":"10.13039\/501100003359","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>We present a new word embedding technique in a (non-linear) metric space based on the shared membership of terms in a corpus of textual documents, where the metric is naturally defined by the Boolean algebra of all subsets of the corpus and a measure \u03bc defined on it. Once the metric space is constructed, a new term (a noun, an adjective, a classification term) can be introduced into the model and analyzed by means of semantic projections, which in turn are defined as indexes using the measure \u03bc and the word embedding tools. We formally define all necessary elements and prove the main results about the model, including a compatibility theorem for estimating the representability of semantically meaningful external terms in the model (which are written as real Lipschitz functions in the metric space), proving the relation between the semantic index and the metric of the space (Theorem 1). Our main result proves the universality of our word-set embedding, proving mathematically that every word embedding based on linear space can be written as a word-set embedding (Theorem 2). Since we adopt an empirical point of view for the semantic issues, we also provide the keys for the interpretation of the results using probabilistic arguments (to facilitate the subsequent integration of the model into Bayesian frameworks for the construction of inductive tools), as well as in fuzzy set-theoretic terms. We also show some illustrative examples, including a complete computational case using big-data-based computations. Thus, the main advantages of the proposed model are that the results on distances between terms are interpretable in semantic terms once the semantic index used is fixed and, although the calculations could be costly, it is possible to calculate the value of the distance between two terms without the need to calculate the whole distance matrix. \u201cWovon man nicht sprechen kann, dar\u00fcber muss man schweigen\u201d. Tractatus Logico-Philosophicus. L. Wittgenstein.<\/jats:p>","DOI":"10.3390\/computers14010030","type":"journal-article","created":{"date-parts":[[2025,1,20]],"date-time":"2025-01-20T12:32:52Z","timestamp":1737376372000},"page":"30","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Set-Word Embeddings and Semantic Indices: A New Contextual Model for Empirical Language Analysis"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0347-7280","authenticated-orcid":false,"given":"Pedro","family":"Fern\u00e1ndez de C\u00f3rdoba","sequence":"first","affiliation":[{"name":"Instituto Universitario de Matem\u00e1tica Pura y Aplicada, Universitat Polit\u00e8cnica de Val\u00e8ncia, 46022 Val\u00e8ncia, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6829-1150","authenticated-orcid":false,"given":"Carlos A.","family":"Reyes P\u00e9rez","sequence":"additional","affiliation":[{"name":"Instituto Universitario de Matem\u00e1tica Pura y Aplicada, Universitat Polit\u00e8cnica de Val\u00e8ncia, 46022 Val\u00e8ncia, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-1245-9289","authenticated-orcid":false,"given":"Claudia","family":"S\u00e1nchez Arnau","sequence":"additional","affiliation":[{"name":"E.T.S. Ingenier\u00eda, Universitat de Val\u00e8ncia, 46100 Val\u00e9ncia, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8854-3154","authenticated-orcid":false,"given":"Enrique A.","family":"S\u00e1nchez P\u00e9rez","sequence":"additional","affiliation":[{"name":"Instituto Universitario de Matem\u00e1tica Pura y Aplicada, Universitat Polit\u00e8cnica de Val\u00e8ncia, 46022 Val\u00e8ncia, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,1,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Lappin, S., and Fox, C. (2015). Vector space models of lexical meaning. The Handbook of Contemporary Semantics, Blackwell.","DOI":"10.1002\/9781118882139"},{"key":"ref_2","unstructured":"Burstein, J., Doran, C., and Solorio, T. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"418","DOI":"10.1016\/j.inffus.2022.08.024","article-title":"Beyond word embeddings: A survey","volume":"89","author":"Incitti","year":"2023","journal-title":"Inf. Fusion"},{"key":"ref_4","unstructured":"Mikolov, T., Chen, K., Corrado, G.S., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_6","unstructured":"Radford, A., and Narasimhan, K. (2024, December 23). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/www.semanticscholar.org\/paper\/Improving-Language-Understanding-by-Generative-Radford-Narasimhan\/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035."},{"key":"ref_7","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"975","DOI":"10.1038\/s41562-022-01316-8","article-title":"Semantic projection recovers rich human knowledge of multiple object features from word embeddings","volume":"6","author":"Grand","year":"2022","journal-title":"Nat. Hum. Behav."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Manetti, A., Ferrer-Sapena, A., S\u00e1nchez-P\u00e9rez, E.A., and Lara-Navarra, P. (2021). Design Trend Forecasting by Combining Conceptual Analysis and Semantic Projections: New Tools for Open Innovation. J. Open Innov. Technol. Mark. Complex., 7.","DOI":"10.3390\/joitmc7010092"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1016\/S0020-0255(71)80004-X","article-title":"Quantitative fuzzy semantics","volume":"3","author":"Zadeh","year":"1971","journal-title":"Inf. Sci."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1080\/01969727208542910","article-title":"A Fuzzy-Set-Theoretic Interpretation of Linguistic Hedges","volume":"2","author":"Zadeh","year":"1972","journal-title":"J. Cybern."},{"key":"ref_12","unstructured":"Saranya, M., and Amutha, B. (2024, January 14\u201315). A Survey of Machine Learning Technique for Topic Modeling and Word Embedding. Proceedings of the 2024 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India."},{"key":"ref_13","unstructured":"Hongliu, C.A.O. (2024). Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Arunkumar, A., and Khashabi, D. (2022). Super-naturalinstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv.","DOI":"10.18653\/v1\/2022.emnlp-main.340"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Georgila, K. (2024, January 18\u201320). Comparing Pre-Trained Embeddings and Domain-Independent Features for Regression-Based Evaluation of Task-Oriented Dialogue Systems. Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Kyoto, Japan.","DOI":"10.18653\/v1\/2024.sigdial-1.52"},{"key":"ref_16","unstructured":"Baroni, M., and Zamparelli, R. (2010, January 9\u201311). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"635","DOI":"10.1002\/lnco.362","article-title":"Vector space models of word meaning and phrase meaning: A survey","volume":"6","author":"Erk","year":"2012","journal-title":"Lang. Linguist. Compass"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"397","DOI":"10.2140\/pjm.1956.6.397","article-title":"On embedding uniform and topological spaces","volume":"6","author":"Arens","year":"1956","journal-title":"Pac. J. Math"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"36","DOI":"10.1016\/j.patrec.2018.12.007","article-title":"A note on the triangle inequality for the Jaccard distance","volume":"120","author":"Kosub","year":"2019","journal-title":"Pattern Recognit. Lett."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Deza, M.M., and Deza, E. (2009). Encyclopedia of Distances, Springer. [1st ed.].","DOI":"10.1007\/978-3-642-00234-2"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Gardner, A., Kanno, J., Duncan, C.A., and Selmic, R. (2014, January 23\u201328). Measuring distance between unordered sets of different sizes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.25"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Cobza\u015f, C. (2012). Functional Analysis in Asymmetric Normed Spaces, Springer Science & Business Media.","DOI":"10.1007\/978-3-0348-0478-3"},{"key":"ref_23","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013, January 5\u20138). Distributed representations of words and phrases and their compositionality. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_24","unstructured":"Mikolov, T., Yih, W.T., and Zweig, G. (2013, January 9\u201314). Linguistic regularities in continuous space word representations. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"2697","DOI":"10.1016\/j.jfa.2019.02.003","article-title":"Isomorphisms between spaces of Lipschitz functions","volume":"277","author":"Candido","year":"2019","journal-title":"J. Funct. Anal."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Cobza\u015f, C., Miculescu, R., and Nicolae, A. (2019). Lipschitz Functions, Springer.","DOI":"10.1007\/978-3-030-16489-8"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1","DOI":"10.15388\/namc.2022.27.27493","article-title":"Index spaces and standard indices in metric modelling","volume":"27","author":"Erdogan","year":"2022","journal-title":"Nonlinear Anal. Model. Control"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"534","DOI":"10.4064\/fm-25-1-534-545","article-title":"Quelques probl\u00e8mes concernant les espaces m\u00e9triques non-s\u00e9parables","volume":"25","author":"Kuratowski","year":"1935","journal-title":"Fundam. Math."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Ruas, T., and Grosky, W. (2017, January 7\u201310). Keyword extraction through contextual semantic analysis of documents. Proceedings of the 9th International Conference on Management of Digital EcoSystems, Bangkok, Thailand.","DOI":"10.1145\/3167020.3167043"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Shi, F., Qing, P., Yang, D., Wang, N., Lei, Y., Lu, H., Lin, X., and Li, D. (2023). Prompt space optimizing few-shot reasoning success with large language models. arXiv.","DOI":"10.18653\/v1\/2024.findings-naacl.119"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wan, X., Sun, R., Dai, H., Arik, S.O., and Pfister, T. (2023). Better zero-shot reasoning with self-adaptive prompting. arXiv.","DOI":"10.18653\/v1\/2023.findings-acl.216"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"2444","DOI":"10.1016\/j.neucom.2017.11.019","article-title":"Corpus-based topic diffusion for short text clustering","volume":"275","author":"Zheng","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3582688","article-title":"A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities","volume":"55","author":"Song","year":"2023","journal-title":"Acm Comput. Surv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Xu, S., Pang, L., Shen, H., Cheng, X., and Chua, T.S. (2024, January 13\u201317). Search-in-the-Chain: Interactively Enhancing Large Language Models with Search for Knowledge-intensive Tasks. Proceedings of the ACM on Web Conference 2024, Singapore.","DOI":"10.1145\/3589334.3645363"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"4558","DOI":"10.1109\/TSC.2024.3451185","article-title":"When search engine services meet large language models: Visions and challenges","volume":"17","author":"Xiong","year":"2024","journal-title":"IEEE Trans. Serv. Comput."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"1719","DOI":"10.1007\/s10618-023-00933-9","article-title":"Benchmarking and Survey of Explanation Methods for Black Box Models","volume":"37","author":"Bodria","year":"2023","journal-title":"Data Min. Knowl. Disc."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"52138","DOI":"10.1109\/ACCESS.2018.2870052","article-title":"Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)","volume":"6","author":"Adadi","year":"2018","journal-title":"IEEE Access"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"657","DOI":"10.1162\/coli_a_00511","article-title":"Towards Faithful Model Explanation in NLP: A Survey","volume":"50","author":"Lyu","year":"2024","journal-title":"Comput. Linguist."},{"key":"ref_39","unstructured":"Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024). Multilingual E5 Text Embeddings: A Technical Report. arXiv."},{"key":"ref_40","unstructured":"Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020, January 6\u201312). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Proceedings of the Advances in Neural Information Processing Systems, Virtual."},{"key":"ref_41","unstructured":"Lin, Y., Ding, B., Jagadish, H.V., and Zhou, J. (2023). SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions. arXiv."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/1\/30\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,8]],"date-time":"2025-10-08T10:32:29Z","timestamp":1759919549000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/1\/30"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,20]]},"references-count":41,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,1]]}},"alternative-id":["computers14010030"],"URL":"https:\/\/doi.org\/10.3390\/computers14010030","relation":{},"ISSN":["2073-431X"],"issn-type":[{"type":"electronic","value":"2073-431X"}],"subject":[],"published":{"date-parts":[[2025,1,20]]}}}