{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T20:25:14Z","timestamp":1760300714981,"version":"build-2065373602"},"reference-count":32,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2022,5,4]],"date-time":"2022-05-04T00:00:00Z","timestamp":1651622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"NSERC Discovery Grant","award":["RGPIN\/341811-2012"],"award-info":[{"award-number":["RGPIN\/341811-2012"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>We propose a multi-layer data mining architecture for web services discovery using word embedding and clustering techniques to improve the web service discovery process. The proposed architecture consists of five layers: web services description and data preprocessing; word embedding and representation; syntactic similarity; semantic similarity; and clustering. In the first layer, we identify the steps to parse and preprocess the web services documents. In the second layer, Bag of Words with Term Frequency\u2013Inverse Document Frequency and three word-embedding models are employed for web services representation. In the third layer, four distance measures, namely, Cosine, Euclidean, Minkowski, and Word Mover, are considered to find the similarities between Web services documents. In layer four, WordNet and Normalized Google Distance are employed to represent and find the similarity between web services documents. Finally, in the fifth layer, three clustering algorithms, namely, affinity propagation, K-means, and hierarchical agglomerative clustering, are investigated for clustering of web services based on observed similarities in documents. We demonstrate how each component of the five layers is employed in web services clustering using randomly selected web services documents. We conduct experimental analysis to cluster web services using a collected dataset consisting of web services documents and evaluate their clustering performances. Using a ground truth for evaluation purposes, we observe that clusters built based on the word embedding models performed better than those built using the Bag of Words with Term Frequency\u2013Inverse Document Frequency model. Among the three word embedding models, the pre-trained Word2Vec\u2019s skip-gram model reported higher performance in clustering web services. Among the three semantic similarity measures, path-based WordNet similarity reported higher clustering performance. By considering the different word representations models and syntactic and semantic similarity measures, we found that the affinity propagation clustering technique performed better in discovering similarities among Web services.<\/jats:p>","DOI":"10.3390\/data7050057","type":"journal-article","created":{"date-parts":[[2022,5,4]],"date-time":"2022-05-04T08:21:25Z","timestamp":1651652485000},"page":"57","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Multi-Layer Web Services Discovery Using Word Embedding and Clustering Techniques"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5086-3950","authenticated-orcid":false,"given":"Waeal","family":"J. Obidallah","sequence":"first","affiliation":[{"name":"College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11673, Saudi Arabia"}]},{"given":"Bijan","family":"Raahemi","sequence":"additional","affiliation":[{"name":"Knowledge Discovery and Data Mining Lab, Telfer School of Management University of Ottawa, Ottawa, ON K1H 8M5, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5045-7322","authenticated-orcid":false,"given":"Waleed","family":"Rashideh","sequence":"additional","affiliation":[{"name":"College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11673, Saudi Arabia"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"144","DOI":"10.1016\/j.scico.2008.02.002","article-title":"Easy web service discovery: A query-by-example approach","volume":"71","author":"Crasso","year":"2008","journal-title":"Sci. Comput. Program."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Klusch, M. (2014). Service Discovery. Encyclopedia of Social Networks and Mining (ESNAM), Springer.","DOI":"10.1007\/978-1-4614-6170-8_121"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Grefenstette, G. (1999). Tokenization. Syntactic Wordclass Tagging, Number October, Springer.","DOI":"10.1007\/978-94-015-9273-4_9"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1007\/s42979-019-0026-8","article-title":"Clustering and Association Rules for Web Service Discovery and Recommendation: A Systematic Literature Review","volume":"1","author":"Obidallah","year":"2020","journal-title":"SN Comput. Sci."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Obidallah, W.J., Ruhi, U., and Raahemi, B. (2016, January 13\u201316). Current Landscape of Web Service Discovery: A Typology Based on Five Characteristics. Proceedings of the 2016 IEEE\/WIC\/ACM International Conference on Web Intelligence (WI), Omaha, NE, USA.","DOI":"10.1109\/WI.2016.0121"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Obidallah, W.J., and Raahemi, B. (April, January 29). A Taxonomy to Characterize Web Service Discovery Approaches, Looking at Five Perspectives. Proceedings of the 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE), Oxford, UK.","DOI":"10.1109\/SOSE.2016.13"},{"key":"ref_7","unstructured":"Richardson, L. (2022, April 01). Beautiful Soup Documentation. Available online: https:\/\/buildmedia.readthedocs.org\/media\/pdf\/beautiful-soup-4\/latest\/beautiful-soup-4.pdf."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Rasmussen, E. (2009). Stoplists. Encyclopedia of Database Systems, Springer.","DOI":"10.1007\/978-0-387-39940-9_955"},{"key":"ref_9","unstructured":"Bird, S. NLTK: The Natural Language Toolkit. Proceedings of the COLING\/ACL on Interactive Presentation Sessions, Available online: https:\/\/dl.acm.org\/doi\/10.3115\/1225403.1225421."},{"key":"ref_10","unstructured":"Stanford-University (2022, April 01). Stemming and Lemmatization. Available online: https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/stemming-and-lemmatization-1.html."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"613","DOI":"10.1145\/361219.361220","article-title":"A vector space model for automatic indexing","volume":"18","author":"Salton","year":"1975","journal-title":"Commun. ACM"},{"key":"ref_12","unstructured":"Almeida, F., and Xex\u00e9o, G. (2019). Word Embeddings: A Survey. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Yan, J. (2009). Text Representation. Encyclopedia of Database Systems, Springer.","DOI":"10.1007\/978-0-387-39940-9_420"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"150","DOI":"10.1016\/j.patrec.2016.06.012","article-title":"Representation learning for very short texts using weighted word embedding aggregation","volume":"80","author":"Demeester","year":"2016","journal-title":"Pattern Recognit. Lett."},{"key":"ref_15","first-page":"331","article-title":"Chapter 11\u2014Information Retrieval: Concepts, Models, and Systems","volume":"Volume 38","author":"Gudivada","year":"2018","journal-title":"Handbook of Statistics: Computational Analysis and Understanding of Natural Languages: Principles, Methods and Applications"},{"key":"ref_16","unstructured":"Wikimedia (2022, April 01). Wikimedia Downloads. Available online: https:\/\/dumps.wikimedia.org\/backup-index.html."},{"key":"ref_17","unstructured":"Parker, R., Graff, D., Kong, J., Chen, K., and Maeda, K. (2011). Gigaword. English Gigaword Fifth Edition\u2014Linguistic Data Consortium, Linguistic Data Consortium."},{"key":"ref_18","unstructured":"Crawl, C. (2022, April 01). Common Crawl. Available online: https:\/\/commoncrawl.org\/."},{"key":"ref_19","unstructured":"Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.Q. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26, Curran Associates, Inc."},{"key":"ref_20","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv, 1\u201312."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"743","DOI":"10.1613\/jair.1.11259","article-title":"From word to sense embeddings: A survey on vector representations of meaning","volume":"63","author":"Pilehvar","year":"2018","journal-title":"J. Artif. Intell. Res."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Han, J., Kamber, M., and Pei, J. (2012). Getting to Know Your Data. Data Mining, Elsevier.","DOI":"10.1016\/B978-0-12-381479-1.00002-2"},{"key":"ref_23","unstructured":"Kusner, M.J., Sun, Y., Kolkin, N.I., and Weinberger, K.Q. (2015, January 6\u201311). From Word Embeddings to Document Distances. Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1145\/219717.219748","article-title":"WordNet: A lexical database for English","volume":"38","author":"Miller","year":"1995","journal-title":"Commun. ACM"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"370","DOI":"10.1109\/TKDE.2007.48","article-title":"The Google Similarity Distance","volume":"19","author":"Cilibrasi","year":"2007","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_26","unstructured":"Vit\u00e1nyi, P.M., Cilibrasi, R.L., and Vitanyi, P.M.B. (2009). Normalized web distance and word similarity. arXiv."},{"key":"ref_27","unstructured":"Wu, Z., and Palmer, M. Verbs semantics and lexical selection. Proceedings of the 32nd annual meeting on Association for Computational Linguistics."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"972","DOI":"10.1126\/science.1136800","article-title":"Clustering by Passing Messages Between Data Points","volume":"315","author":"Frey","year":"2007","journal-title":"Science"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Everitt, B.S., Landau, S., Leese, M., and Stahl, D. (2011). Hierarchical Clustering. Cluster Analysis, John Wiley & Sons, Inc.","DOI":"10.1002\/9780470977811"},{"key":"ref_30","first-page":"281","article-title":"Some methods for classification and analysis of multivariate observations","volume":"Volume 1","author":"MacQueen","year":"1967","journal-title":"Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability"},{"key":"ref_31","unstructured":"Gensim (2022, April 01). Gensim: Topic Modelling for Humans. Available online: https:\/\/radimrehurek.com\/gensim\/."},{"key":"ref_32","first-page":"2825","article-title":"Scikit-Learn: Machine Learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/7\/5\/57\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:05:58Z","timestamp":1760137558000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/7\/5\/57"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,4]]},"references-count":32,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2022,5]]}},"alternative-id":["data7050057"],"URL":"https:\/\/doi.org\/10.3390\/data7050057","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2022,5,4]]}}}