{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:03:11Z","timestamp":1760148191947,"version":"build-2065373602"},"reference-count":27,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2023,4,5]],"date-time":"2023-04-05T00:00:00Z","timestamp":1680652800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>To solve the problem of text clustering according to semantic groups, we suggest using a model of a unified lexico-semantic bond between texts and a similarity matrix based on it. Using lexico-semantic analysis methods, we can create \u201cterm\u2013document\u201d matrices based both on the occurrence frequencies of words and n-grams and the determination of the degrees of nodes in their semantic network, followed by calculating the cosine metrics of text similarity. In the process of the construction of the text similarity matrix using lexical or semantic analysis methods, the cosine of the angle for a vector pair describing such texts will determine the degree of similarity in the lexical or semantic presentation, respectively. Based on the averaging procedure described in this paper, we can obtain a matrix of cosine metric values that describes the lexico-semantic bonds between texts. We propose an algorithm for solving text clustering problems. This algorithm allows one to use the statistical characteristics of the distribution functions of element values in the rows of the cosine metric value matrix in the model of the lexico-semantic bond between documents. In addition, this algorithm allows one to separately describe the matrix of the cosine metric values obtained separately based on the lexical or semantic properties of texts. Our research has shown that the developed model for the lexico-semantic presentation of texts allows one to slightly increase the accuracy of their subsequent clustering. The statistical text clustering algorithm based on this model shows excellent results that are comparable to those of the widely used affinity propagation algorithm. Additionally, our algorithm does not require specification of the degree of similarity for combining vectors into a common cluster and other configuration parameters. The suggested model and algorithm significantly expand the list of known approaches for determining text similarity metrics and their clustering.<\/jats:p>","DOI":"10.3390\/a16040198","type":"journal-article","created":{"date-parts":[[2023,4,6]],"date-time":"2023-04-06T01:10:27Z","timestamp":1680743427000},"page":"198","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Model of Lexico-Semantic Bonds between Texts for Creating Their Similarity Metrics and Developing Statistical Clustering Algorithm"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4516-3746","authenticated-orcid":false,"given":"Liliya","family":"Demidova","sequence":"first","affiliation":[{"name":"Institute of Information Technology, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia"}]},{"given":"Dmitry","family":"Zhukov","sequence":"additional","affiliation":[{"name":"Institute of Cybersecurity and Digital Technologies, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6418-6797","authenticated-orcid":false,"given":"Elena","family":"Andrianova","sequence":"additional","affiliation":[{"name":"Institute of Information Technology, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia"}]},{"given":"Vladimir","family":"Kalinin","sequence":"additional","affiliation":[{"name":"Institute of Radio Electronics and Informatics, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2023,4,5]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Kadhim, A.I., Cheah, Y.-N., and Ahamed, N.H. (June, January 12). Text Document Preprocessing and Dimension Reduction Techniques for Text Document Clustering. Proceedings of the 2014 4th International Conference on Artificial Intelligence with Applications in Engineering and Technology, ICAIET 2014, Kota Kinabalu, Malaysia.","DOI":"10.1109\/ICAIET.2014.21"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"427","DOI":"10.1007\/978-981-10-3223-3_41","article-title":"A novel map-reduce based augmented clustering algorithm for big text datasets","volume":"542","author":"Kanimozhi","year":"2018","journal-title":"Adv. Intell. Syst. Comput."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1016\/j.patrec.2016.11.004","article-title":"Improved TFIDF in big news retrieval: An empirical study","volume":"93","author":"Chen","year":"2017","journal-title":"Pattern Recognit. Lett."},{"key":"ref_4","unstructured":"Bouras, C., and Tsogkas, V. (2013, January 29\u201331). Enhancing news articles clustering using word N-grams. Proceedings of the 2nd International Conference on Data Technologies and Applications, Reykjav\u00edk, Iceland."},{"key":"ref_5","first-page":"3592","article-title":"Improvement tfidf for news document using efficient similarity","volume":"4","author":"Elahi","year":"2012","journal-title":"Res. J. Appl. Sci. Eng. Technol."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1016\/j.knosys.2017.07.010","article-title":"Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy","volume":"133","author":"Yaohui","year":"2017","journal-title":"Knowl.-Based Syst."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1016\/j.eswa.2017.03.057","article-title":"Variable Global Feature Selection Scheme for automatic classification of text documents","volume":"81","author":"Agnihotri","year":"2017","journal-title":"Expert Syst. Appl."},{"key":"ref_8","unstructured":"Al-Fath, A.M.U., Saleh, W.K.R., and Sa\u2019Adah, S. (2016, January 25\u201327). Implementation of MCL algorithm in clustering digital news with graph representation. Proceedings of the 4th International Conference on Information and Communication Technology, ICoICT, Bandung, Indonesia."},{"key":"ref_9","first-page":"466","article-title":"Thematic Clustering Methods Applied to News Texts Analysis","volume":"Volume 466","author":"Kravets","year":"2014","journal-title":"Knowledge-Based Software Engineering, Proceedings of the JCKBSE 2014. Communications in Computer and Information Science, Volgograd, Russia, 17\u201320 September 2014"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1563","DOI":"10.1109\/TKDE.2017.2681669","article-title":"Sparse Poisson Latent Block Model for Document Clustering","volume":"29","author":"Ailem","year":"2017","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1016\/j.eswa.2017.05.002","article-title":"Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering","volume":"84","author":"Abualigah","year":"2017","journal-title":"Expert Syst. Appl."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Rahmawati, D., Saptawati, G.A.P., and Widyani, Y. (2016, January 25\u201326). Document clustering using sequential pattern (SP): Maximal frequent sequences (MFS) as SP representation. Proceedings of the 2015 International Conference on Data and Software Engineering, ICODSE 2015, Yogyakarta, Indonesia.","DOI":"10.1109\/ICODSE.2015.7436979"},{"key":"ref_13","first-page":"527","article-title":"A Graph-based Approach to Text Genre Analysis","volume":"20","author":"Nabhan","year":"2016","journal-title":"Comput. Sist."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ali, I., and Melton, A. (February, January 31). Semantic-Based Text Document Clustering Using Cognitive Semantic Learning and Graph Theory. Proceedings of the 2018 IEEE 12th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA.","DOI":"10.1109\/ICSC.2018.00042"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1016\/j.ins.2016.12.027","article-title":"Clustering coefficients of large networks","volume":"382\u2013383","author":"Li","year":"2017","journal-title":"Inf. Sci."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"759391","DOI":"10.1155\/2014\/759391","article-title":"Geometric Assortative Growth Model for Small-World Networks","volume":"2014","author":"Shang","year":"2014","journal-title":"Sci. World J."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"012051","DOI":"10.1088\/1742-6596\/1703\/1\/012051","article-title":"Using semantic field model to create information search engines","volume":"1703","author":"Sachkov","year":"2020","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wazarkar, S.V., and Manjrekar, A.A. (2014, January 24\u201327). HFRECCA for clustering of text data from travel guide articles. Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2014, Delhi, India.","DOI":"10.1109\/ICACCI.2014.6968349"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Hamroun, M., Gouider, M.S., and Said, L.B. (2015, January 19\u201321). Lexico Semantic Patterns for Customer Intentions Analysis of Microblogging. Proceedings of the 2015 11th International Conference on Semantics, Knowledge and Grids (SKG), Beijing, China.","DOI":"10.1109\/SKG.2015.40"},{"key":"ref_20","first-page":"315","article-title":"SRDF: A Novel Lexical Knowledge Graph for Whole Sentence Knowledge Extraction","volume":"Volume 10318","author":"Gracia","year":"2017","journal-title":"Language, Data, and Knowledge. LDK 2017"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"7","DOI":"10.32362\/2500-316X-2021-9-1-7-17","article-title":"Research of unstructured data interpretation problems","volume":"9","author":"Tomashevskaya","year":"2021","journal-title":"Russ. Technol. J."},{"key":"ref_22","unstructured":"Lemaire, B., and Denhiere, G. (2004, January 4\u20137). Incremental Construction of an Associative Network from a Corpus. Proceedings of the 26th Annual Meeting of the Cognitive Science Society, Chicago, IL, USA."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"188","DOI":"10.3758\/BF03200643","article-title":"From simple associations to the building blocks of language: Modeling meaning in memory with the HAL model","volume":"30","author":"Burgess","year":"1998","journal-title":"Behav. Res. Methods Instrum. Comput."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1017\/S0257543400001061","article-title":"Explorations in the derivation of semantic representations from word co-occurrence statistics","volume":"10","author":"Levy","year":"1998","journal-title":"South Pac. J. Psychol."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv.","DOI":"10.21105\/joss.00861"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Demidova, L.A., and Gorchakov, A.V. (2022). Fuzzy Information Discrimination Measures and Their Application to Low Dimensional Embedding Construction in the UMAP Algorithm. J. Imaging, 8.","DOI":"10.3390\/jimaging8040113"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"972","DOI":"10.1126\/science.1136800","article-title":"Clustering by passing messages between data points","volume":"315","author":"Frey","year":"2007","journal-title":"Science"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/4\/198\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:10:31Z","timestamp":1760123431000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/4\/198"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,5]]},"references-count":27,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2023,4]]}},"alternative-id":["a16040198"],"URL":"https:\/\/doi.org\/10.3390\/a16040198","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2023,4,5]]}}}