{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T02:45:01Z","timestamp":1768445101936,"version":"3.49.0"},"reference-count":38,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2024,2,22]],"date-time":"2024-02-22T00:00:00Z","timestamp":1708560000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004895","name":"European Union","doi-asserted-by":"publisher","award":["GINOP-2.3.2-15-2016-00005"],"award-info":[{"award-number":["GINOP-2.3.2-15-2016-00005"]}],"id":[{"id":"10.13039\/501100004895","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004895","name":"European Union","doi-asserted-by":"publisher","award":["TKP2021-NKTA-34"],"award-info":[{"award-number":["TKP2021-NKTA-34"]}],"id":[{"id":"10.13039\/501100004895","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004895","name":"European Union","doi-asserted-by":"publisher","award":["2020-1.1.2-PIACI-KFI-2021-00223"],"award-info":[{"award-number":["2020-1.1.2-PIACI-KFI-2021-00223"]}],"id":[{"id":"10.13039\/501100004895","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Research, Development, and Innovation Fund of Hungary","award":["GINOP-2.3.2-15-2016-00005"],"award-info":[{"award-number":["GINOP-2.3.2-15-2016-00005"]}]},{"name":"National Research, Development, and Innovation Fund of Hungary","award":["TKP2021-NKTA-34"],"award-info":[{"award-number":["TKP2021-NKTA-34"]}]},{"name":"National Research, Development, and Innovation Fund of Hungary","award":["2020-1.1.2-PIACI-KFI-2021-00223"],"award-info":[{"award-number":["2020-1.1.2-PIACI-KFI-2021-00223"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>The efficiency of natural language processing has improved dramatically with the advent of machine learning models, particularly neural network-based solutions. However, some tasks are still challenging, especially when considering specific domains. This paper presents a model that can extract insights from customer reviews using machine learning methods integrated into a pipeline. For topic modeling, our composite model uses transformer-based neural networks designed for natural language processing, vector-embedding-based keyword extraction, and clustering. The elements of our model have been integrated and tailored to better meet the requirements of efficient information extraction and topic modeling of the extracted information for opinion mining. Our approach was validated and compared with other state-of-the-art methods using publicly available benchmark datasets. The results show that our system performs better than existing topic modeling and keyword extraction methods in this task.<\/jats:p>","DOI":"10.3390\/bdcc8030020","type":"journal-article","created":{"date-parts":[[2024,2,22]],"date-time":"2024-02-22T03:30:26Z","timestamp":1708572626000},"page":"20","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["A Machine Learning-Based Pipeline for the Extraction of Insights from Customer Reviews"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-9349-3077","authenticated-orcid":false,"given":"R\u00f3bert","family":"Lakatos","sequence":"first","affiliation":[{"name":"Department of Data Science and Visualization, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"},{"name":"Doctoral School of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]},{"given":"Gerg\u0151","family":"Bogacsovics","sequence":"additional","affiliation":[{"name":"Department of Data Science and Visualization, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"},{"name":"Doctoral School of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4405-2040","authenticated-orcid":false,"given":"Bal\u00e1zs","family":"Harangi","sequence":"additional","affiliation":[{"name":"Department of Data Science and Visualization, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]},{"given":"Istv\u00e1n","family":"Lakatos","sequence":"additional","affiliation":[{"name":"Department of Data Science and Visualization, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"},{"name":"Doctoral School of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]},{"given":"Attila","family":"Tiba","sequence":"additional","affiliation":[{"name":"Department of Data Science and Visualization, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1958-5144","authenticated-orcid":false,"given":"J\u00e1nos","family":"T\u00f3th","sequence":"additional","affiliation":[{"name":"Department of Data Science and Visualization, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]},{"given":"Marianna","family":"Szab\u00f3","sequence":"additional","affiliation":[{"name":"Doctoral School of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"},{"name":"Department of Applied Mathematics and Probability Theory, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1718-9770","authenticated-orcid":false,"given":"Andr\u00e1s","family":"Hajdu","sequence":"additional","affiliation":[{"name":"Department of Data Science and Visualization, Faculty of Informatics, University of Debrecen, H-4032 Debrecen, Hungary"}]}],"member":"1968","published-online":{"date-parts":[[2024,2,22]]},"reference":[{"key":"ref_1","first-page":"1","article-title":"Statistical language models for information retrieval","volume":"1","author":"Zhai","year":"2008","journal-title":"Synth. Lect. Hum. Lang. Technol."},{"key":"ref_2","first-page":"1","article-title":"Dependency parsing","volume":"1","author":"McDonald","year":"2009","journal-title":"Synth. Lect. Hum. Lang. Technol."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1145\/331499.331504","article-title":"Data clustering: A review","volume":"31","author":"Jain","year":"1999","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Xu, R., and Wunsch, D. (2008). Clustering, John Wiley & Sons.","DOI":"10.1002\/9780470382776"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1567","DOI":"10.1093\/genetics\/164.4.1567","article-title":"Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies","volume":"164","author":"Falush","year":"2003","journal-title":"Genetics"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"945","DOI":"10.1093\/genetics\/155.2.945","article-title":"Inference of population structure using multilocus genotype data","volume":"155","author":"Pritchard","year":"2000","journal-title":"Genetics"},{"key":"ref_7","unstructured":"Angelov, D. (2020). Top2vec: Distributed representations of topics. arXiv."},{"key":"ref_8","unstructured":"Grootendorst, M. (2022). BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Rajaraman, A., and Ullman, J.D. (2011). Mining of Massive Datasets, Cambridge University Press.","DOI":"10.1017\/CBO9781139058452"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1177\/1536867X1801800205","article-title":"Content analysis: Frequency distribution of words","volume":"18","author":"Dicle","year":"2018","journal-title":"Stata J."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C.D. (2014, January 25\u201329). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_12","unstructured":"Goldberg, Y., and Levy, O. (2014). word2vec Explained: Deriving Mikolov et al.\u2019s negative-sampling word-embedding method. arXiv."},{"key":"ref_13","unstructured":"Joulin, A., Grave, E., Bojanowski, P., Douze, M., J\u00e9gou, H., and Mikolov, T. (2016). Fasttext. zip: Compressing text classification models. arXiv."},{"key":"ref_14","unstructured":"Raschka, S. (2015). Python Machine Learning, Packt Publishing Ltd."},{"key":"ref_15","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_16","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA."},{"key":"ref_17","unstructured":"Arthur, D., and Vassilvitskii, S. (2006). k-Means++: The Advantages of Careful Seeding, Soda. Technical Report."},{"key":"ref_18","unstructured":"Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996, January 2\u20134). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2733381","article-title":"Hierarchical density estimates for data clustering, visualization, and outlier detection","volume":"10","author":"Campello","year":"2015","journal-title":"ACM Trans. Knowl. Discov. Data (TKDD)"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1080\/14786440109462720","article-title":"LIII. On lines and planes of closest fit to systems of points in space","volume":"2","author":"Pearson","year":"1901","journal-title":"Lond. Edinb. Dublin Philos. Mag. J. Sci."},{"key":"ref_21","unstructured":"Halko, N., Martinsson, P.G., and Tropp, J.A. (2009). Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. arXiv."},{"key":"ref_22","unstructured":"Szlam, A., Kluger, Y., and Tygert, M. (2014). An implementation of a randomized algorithm for principal component analysis. arXiv."},{"key":"ref_23","first-page":"1609","article-title":"A unifying probabilistic perspective for spectral dimensionality reduction: Insights and new models","volume":"13","author":"Lawrence","year":"2012","journal-title":"J. Mach. Learn. Res."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Lee, J.A., and Verleysen, M. (2007). Nonlinear Dimensionality Reduction, Springer Science & Business Media.","DOI":"10.1007\/978-0-387-39351-3"},{"key":"ref_25","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Hinton","year":"2008","journal-title":"J. Mach. Learn. Res."},{"key":"ref_26","first-page":"1303","article-title":"Stochastic variational inference","volume":"14","author":"Hoffman","year":"2013","journal-title":"J. Mach. Learn. Res."},{"key":"ref_27","first-page":"856","article-title":"Online learning for latent dirichlet allocation","volume":"23","author":"Hoffman","year":"2010","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"146","DOI":"10.1080\/00437956.1954.11659520","article-title":"Distributional structure","volume":"10","author":"Harris","year":"1954","journal-title":"Word"},{"key":"ref_29","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Ni, J., Li, J., and McAuley, J. (2019, January 3\u20137). Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1018"},{"key":"ref_31","unstructured":"Rennie, J. (2024, January 22). 20 Newsgroups Dataset. Available online: http:\/\/qwone.com\/~jason\/20Newsgroups\/."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Schuster, M., and Nakajima, K. (2012, January 25\u201330). Japanese and Korean voice search. Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan.","DOI":"10.1109\/ICASSP.2012.6289079"},{"key":"ref_33","unstructured":"Foundation, A.S. (2024, January 22). Apache Hadoop. Available online: https:\/\/hadoop.apache.org\/."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1147\/rd.14.0309","article-title":"A statistical approach to mechanized encoding and searching of literary information","volume":"1","author":"Luhn","year":"1957","journal-title":"IBM J. Res. Dev."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1108\/eb026526","article-title":"A statistical interpretation of term specificity and its application in retrieval","volume":"28","author":"Jones","year":"1972","journal-title":"J. Doc."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.cosrev.2017.10.002","article-title":"The evolution of sentiment analysis\u2014A review of research topics, venues, and top cited papers","volume":"27","author":"Graziotin","year":"2018","journal-title":"Comput. Sci. Rev."},{"key":"ref_37","unstructured":"Grootendorst, M. (2024, January 22). KeyBERT. Available online: https:\/\/maartengr.github.io\/KeyBERT\/."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv.","DOI":"10.18653\/v1\/D19-1410"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/3\/20\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:02:55Z","timestamp":1760104975000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/3\/20"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,22]]},"references-count":38,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,3]]}},"alternative-id":["bdcc8030020"],"URL":"https:\/\/doi.org\/10.3390\/bdcc8030020","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,22]]}}}