{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T21:35:15Z","timestamp":1767908115599,"version":"3.49.0"},"reference-count":34,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2020,6,19]],"date-time":"2020-06-19T00:00:00Z","timestamp":1592524800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MTI"],"abstract":"<jats:p>Increasingly, the web produces massive volumes of texts, alone or associated with images, videos, photographs, together with some metadata, indispensable for their finding and retrieval. Keywords\/keyphrases that characterize the semantic content of documents should be, automatically or manually, extracted, and\/or associated with them. The paper presents a novel method to address the problem of the automatic unsupervised extraction of keywords\/phrases from texts, expressed both in English and in Italian. The main feature of this approach is the integration of two methods that have given interesting results: word embedding models, such as Word2Vec or GloVe able to capture the semantics of words and their context, and clustering algorithms, able to identify the essence of the terms and choose the more significant one(s), to represent the contents of a text. In the paper, the datasets used are presented, together with the method implemented and the results obtained. These results will be discussed, commented, and compared with those obtained in previous experimentations, using TextRank, Rapid Automatic Keyword Extraction (RAKE), and TF-IDF.<\/jats:p>","DOI":"10.3390\/mti4020030","type":"journal-article","created":{"date-parts":[[2020,6,19]],"date-time":"2020-06-19T10:43:58Z","timestamp":1592563438000},"page":"30","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["Semantic Unsupervised Automatic Keyphrases Extraction by Integrating Word Embedding with Clustering Methods"],"prefix":"10.3390","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4667-919X","authenticated-orcid":false,"given":"Isabella","family":"Gagliardi","sequence":"first","affiliation":[{"name":"Institute for Applied Mathematics and Information Technologies \u201cEnrico Magenes\u201d (IMATI), National Research Council\u2014CNR, Via Bassini, 15, 20133 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Maria Teresa","family":"Artese","sequence":"additional","affiliation":[{"name":"Institute for Applied Mathematics and Information Technologies \u201cEnrico Magenes\u201d (IMATI), National Research Council\u2014CNR, Via Bassini, 15, 20133 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2020,6,19]]},"reference":[{"key":"ref_1","first-page":"35","article-title":"Modern information retrieval: A brief overview","volume":"24","author":"Singhal","year":"2001","journal-title":"IEEE Data Eng. Bull."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Manning, C.D., Raghavan, P., and Sch\u00fctze, H. (2008). Introduction to Information Retrieval, Cambridge University Press.","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_3","first-page":"1","article-title":"An overview of graph-based keyword extraction methods and approaches","volume":"39","author":"Beliga","year":"2015","journal-title":"J. Inf. Organ. Sci."},{"key":"ref_4","first-page":"1169","article-title":"Automatic keyword extraction from documents using conditional random fields","volume":"4","author":"Zhang","year":"2008","journal-title":"J. Comput. Inf. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Hasan, K.S., and Ng, V. (2014, January 23\u201325). Automatic Keyphrase Extraction: A Survey of the State of the Art. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland.","DOI":"10.3115\/v1\/P14-1119"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Merrouni, Z.A., Frikh, B., and Ouhbi, B. (2016, January 24\u201326). Automatic keyphrase extraction: An overview of the state of the art. Proceedings of the 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), Tangier, Morocco.","DOI":"10.1109\/CIST.2016.7805062"},{"key":"ref_7","first-page":"18","article-title":"Keyword and keyphrase extraction techniques: A literature review","volume":"109","author":"Siddiqi","year":"2015","journal-title":"Int. J. Comput. Appl."},{"key":"ref_8","unstructured":"Mihalcea, R., and Tarau, P. (2004). Textrank: Bringing order into text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Berry, M.W., and Kogan, J. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, Wiley.","DOI":"10.1002\/9780470689646"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wan, X., and Xiao, J. (2008, January 18\u201322). CollabRank: Towards a collaborative approach to single-document keyphrase extraction. Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, UK.","DOI":"10.3115\/1599081.1599203"},{"key":"ref_11","unstructured":"Wan, X., and Xiao, J. (2008, January 13\u201317). Single Document Keyphrase Extraction Using Neighborhood Knowledge. Proceedings of the AAAI, Chicago, IL, USA."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Liu, Z., Li, P., Zheng, Y., and Sun, M. (2009, January 6\u20137). Clustering to find exemplar terms for keyphrase extraction. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore.","DOI":"10.3115\/1699510.1699544"},{"key":"ref_13","first-page":"222","article-title":"SemCluster: Unsupervised automatic keyphrase extraction using affinity propagation","volume":"Volume 650","author":"Chao","year":"2017","journal-title":"Advances in Computational Intelligence Systems"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"188","DOI":"10.1002\/aris.1440380105","article-title":"Latent semantic analysis","volume":"38","author":"Dumais","year":"2004","journal-title":"Annu. Rev. Inf. Sci. Technol."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Landauer, T.K. (2006). Latent semantic analysis. Encyclopedia of Cognitive Science, Wiley.","DOI":"10.1002\/0470018860.s00561"},{"key":"ref_16","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Comito, C., Forestiero, A., and Pizzuti, C. (2019, January 14\u201317). Word Embedding based Clustering to Detect Topics in Social Media. Proceedings of the 2019 IEEE\/WIC\/ACM International Conference on Web Intelligence (WI), Thessaloniki, Greece.","DOI":"10.1145\/3350546.3352518"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Hu, J., Li, S., Yao, Y., Yu, L., Yang, G., and Hu, J. (2018). Patent keyword extraction algorithm based on distributed representation for patent classification. Entropy, 20.","DOI":"10.3390\/e20020104"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Artese, M.T., and Gagliardi, I. (2018, January 16\u201318). What is this painting about? Experiments on Unsupervised Keyphrases Extraction algorithms. Proceedings of the IOP Conference Series: Materials Science and Engineering, Florence, Italy.","DOI":"10.1088\/1757-899X\/364\/1\/012050"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Artese, M.T., and Gagliardi, I. (2020). Unsupervised Automatic Keyphrases Extraction Algorithms: Multilingual Experimentations, Encyclopedia of Information Science and Technology, [5th ed.]. in press.","DOI":"10.2352\/issn.2168-3204.2019.1.0.36"},{"key":"ref_21","unstructured":"Schmid, G. (1994). Treetagger-a Language Independent Part-of-Speech Tagger, Institut f\u00fcr Maschinelle Sprachverarbeitung, Universit\u00e4t Stuttgart."},{"key":"ref_22","unstructured":"Gabrilovich, E., and Markovitch, S. (2007, January 6\u201312). Computing semantic relatedness using wikipedia-based explicit semantic analysis. Proceedings of the IJcAI 2007, Hyderabad, India."},{"key":"ref_23","first-page":"316","article-title":"Measuring Text-Based Semantics Relatedness Using WordNet","volume":"13","author":"Khan","year":"2019","journal-title":"Int. J. Cogn. Lang. Sci."},{"key":"ref_24","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013, January 5\u201310). Distributed representations of words and phrases and their compositionality. Proceedings of the 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA."},{"key":"ref_25","unstructured":"Mikolov, T., Chen, K., Corrado, G., Dean, J., Sutskever, L., and Zweig, G. (2020, March 27). Tool for Computing Continuous Distributed Representations of Words: word2vec. Available online: https:\/\/code.google.com\/p\/word2vec."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Aggarwal, C.C., and Zhai, C. (2012). A survey of text clustering algorithms. Mining Text Data, Springer.","DOI":"10.1007\/978-1-4614-3223-4"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"972","DOI":"10.1126\/science.1136800","article-title":"Clustering by passing messages between data points","volume":"315","author":"Frey","year":"2007","journal-title":"Science"},{"key":"ref_29","first-page":"281","article-title":"Some methods for classification and analysis of multivariate observations","volume":"Volume 1","author":"Neyman","year":"1967","journal-title":"Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Loper, E., and Bird, S. (2002). NLTK: The Natural Language Toolkit. arXiv.","DOI":"10.3115\/1118108.1118117"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Taylor, A., Marcus, M., and Santorini, B. (2003). The Penn treebank: An overview. Treebanks, Springer.","DOI":"10.1007\/978-94-010-0201-1_1"},{"key":"ref_32","unstructured":"Bontcheva, K., and Zhu, J. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics."},{"key":"ref_33","first-page":"2063","article-title":"Pattern for python","volume":"13","author":"Daelemans","year":"2012","journal-title":"J. Mach. Learn. Res."},{"key":"ref_34","unstructured":"\u0158eh\u016f\u0159ek, R., and Sojka, P. (2020, June 19). Gensim\u2014Statistical Semantics in Python. Available online: https:\/\/radimrehurek.com\/gensim\/."}],"container-title":["Multimodal Technologies and Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2414-4088\/4\/2\/30\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:40:56Z","timestamp":1760175656000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2414-4088\/4\/2\/30"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,6,19]]},"references-count":34,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2020,6]]}},"alternative-id":["mti4020030"],"URL":"https:\/\/doi.org\/10.3390\/mti4020030","relation":{},"ISSN":["2414-4088"],"issn-type":[{"value":"2414-4088","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,6,19]]}}}