{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T14:38:47Z","timestamp":1780497527879,"version":"3.54.1"},"reference-count":54,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2025,4,27]],"date-time":"2025-04-27T00:00:00Z","timestamp":1745712000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>This paper examines the ability of ChatGPT to generate synthetic comment datasets that mimic those produced by humans. To this end, a collection of datasets containing human comments, freely available in the Kaggle repository, was compared to comments generated via ChatGPT. The latter were based on prompts designed to provide the necessary context for approximating human results. It was hypothesized that the responses obtained from ChatGPT would demonstrate a high degree of similarity with the human-generated datasets with regard to vocabulary usage. Two categories of prompts were analyzed, depending on whether they specified the desired length of the generated comments. The evaluation of the results primarily focused on the vocabulary used in each comment dataset, employing several analytical measures. This analysis yielded noteworthy observations, which reflect the current capabilities of ChatGPT in this particular task domain. It was observed that ChatGPT typically employs a reduced number of words compared to human respondents and tends to provide repetitive answers. Furthermore, the responses of ChatGPT have been observed to vary considerably when the length is specified. It is noteworthy that ChatGPT employs a smaller vocabulary, which does not always align with human language. 
Furthermore, the proportion of non-stop words in ChatGPT\u2019s output is higher than that found in human communication. Finally, the vocabulary of ChatGPT is more closely aligned with human language than the similarity between the two configurations of ChatGPT. This alignment is particularly evident in the use of stop words. While it does not fully achieve the intended purpose, the generated vocabulary serves as a reasonable approximation, enabling specific applications such as the creation of word clouds.<\/jats:p>","DOI":"10.3390\/computers14050162","type":"journal-article","created":{"date-parts":[[2025,4,28]],"date-time":"2025-04-28T05:21:31Z","timestamp":1745817691000},"page":"162","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["A Black-Box Analysis of the Capacity of ChatGPT to Generate Datasets of Human-like Comments"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4579-3556","authenticated-orcid":false,"given":"Alejandro","family":"Rosete","sequence":"first","affiliation":[{"name":"Facultad de Ingenier\u00eda Inform\u00e1tica, Universidad Tecnol\u00f3gica de La Habana Jos\u00e9 Antonio Echeverr\u00eda (Cujae), Marianao, La Habana 19390, Cuba"},{"name":"Avangenio S.R.L., 5ta B. esq. 
6, Miramar, Playa, La Habana 11300, Cuba"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7793-896X","authenticated-orcid":false,"given":"Guillermo","family":"Sosa-G\u00f3mez","sequence":"additional","affiliation":[{"name":"Facultad de Ciencias Econ\u00f3micas y Empresariales, Universidad Panamericana, \u00c1lvaro del Portillo 49, Zapopan 45010, Jalisco, Mexico"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0681-3833","authenticated-orcid":false,"given":"Omar","family":"Rojas","sequence":"additional","affiliation":[{"name":"Facultad de Ciencias Econ\u00f3micas y Empresariales, Universidad Panamericana, \u00c1lvaro del Portillo 49, Zapopan 45010, Jalisco, Mexico"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2025,4,27]]},"reference":[{"key":"ref_1","unstructured":"OpenAI (2024, April 12). GPT-4 Technical Report. Available online: https:\/\/cdn.openai.com\/papers\/gpt-4.pdf."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"2275851","DOI":"10.1080\/23311975.2023.2275851","article-title":"ChatGPT: A brief narrative review","volume":"10","author":"Gupta","year":"2023","journal-title":"Cogent Bus. Manag."},{"key":"ref_3","first-page":"3","article-title":"The Coming ChatGPT","volume":"144","author":"Salloum","year":"2024","journal-title":"Stud. Big Data"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Najafov, E. (2024). Understanding ChatGPT. ChatGPT for Marketing, Apress.","DOI":"10.1007\/979-8-8688-0312-3"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"St\u00f6ckl, A. (2024). Information visualization with ChatGPT. Artificial Intelligence and Visualization: Advancing Visual Knowledge Discovery, Springer.","DOI":"10.1007\/978-3-031-46549-9_17"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Naznin, K., Mahmud, A.A., Nguyen, M.T., and Chua, C. (2025). 
ChatGPT Integration in Higher Education for Personalized Learning, Academic Writing, and Coding Tasks: A Systematic Review. Computers, 14.","DOI":"10.3390\/computers14020053"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1080\/13691457.2024.2377786","article-title":"Confronting and managing ethical dilemmas in social work using ChatGPT","volume":"28","author":"Segal","year":"2024","journal-title":"Eur. J. Soc. Work"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"80218","DOI":"10.1109\/ACCESS.2023.3300381","article-title":"From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy","volume":"11","author":"Gupta","year":"2023","journal-title":"IEEE Access"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"483","DOI":"10.1038\/s41551-024-01284-6","article-title":"Simple and effective embedding model for single-cell biology built from ChatGPT","volume":"9","author":"Chen","year":"2024","journal-title":"Nat. Biomed. Eng"},{"key":"ref_10","first-page":"7","article-title":"Evaluating research quality with Large Language Models: An analysis of ChatGPT\u2019s effectiveness with different settings and inputs","volume":"10","author":"Thelwall","year":"2024","journal-title":"J. Data Inf. Sci."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Thelwall, M., and Kousha, K. (2024). Journal Quality Factors from ChatGPT: More meaningful than Impact Factors?. J. Data Inf. Sci.","DOI":"10.2478\/jdis-2025-0016"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Thelwall, M. (2025). Is Google Gemini better than ChatGPT at evaluating research quality?. J. Data Inf. Sci.","DOI":"10.2478\/jdis-2025-0014"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Fuller, K.A., Morbitzer, K.A., Zeeman, J.M., Persky, A.M., Savage, A.C., and McLaughlin, J.E. (2024). Exploring the use of ChatGPT to analyze student course evaluation comments. BMC Med. 
Educ., 24.","DOI":"10.1186\/s12909-024-05316-2"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"23837","DOI":"10.1007\/s12144-024-06140-z","article-title":"Can ChatGPT provide a better support: A comparative analysis of ChatGPT and dataset responses in mental health dialogues","volume":"43","author":"Naher","year":"2024","journal-title":"Curr. Psychol."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"825","DOI":"10.1038\/s41597-024-03661-x","article-title":"A dataset of synthetic art dialogues with ChatGPT","volume":"11","year":"2024","journal-title":"Sci. Data"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"18403","DOI":"10.1007\/s10639-024-12589-z","article-title":"Machine learning model for chatGPT usage detection in students\u2019 answers to open-ended questions: Case of Lithuanian language","volume":"29","year":"2024","journal-title":"Educ. Inf. Technol."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Kauchak, D., Song, V., Mishra, P., Leroy, G., Harber, P., Rains, S., Hamre, J., and Morgenstein, N. (2024). Automatic Generation of a large multiple-choice question-answer corpus. Intelligent Systems and Applications, Springer.","DOI":"10.1007\/978-3-031-66428-1_4"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Franke, S., Pott, C., Rutinowski, J., Pauly, M., Reining, C., and Kirchheim, A. (2025). Can ChatGPT Solve Undergraduate Exams from Warehousing Studies? An Investigation. Computers, 14.","DOI":"10.20944\/preprints202501.0843.v1"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Montenegro-Rueda, M., Fern\u00e1ndez-Cerero, J., Fern\u00e1ndez-Batanero, J.M., and L\u00f3pez-Meneses, E. (2023). Impact of the Implementation of ChatGPT in Education: A Systematic Review. Computers, 12.","DOI":"10.3390\/computers12080153"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Iio, J. (2023). Analysis of critical comments on ChatGPT. 
Advances in Network-Based Information Systems, Springer.","DOI":"10.1007\/978-3-031-40978-3_48"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1007\/s12109-024-09998-w","article-title":"This Book is Written by ChatGPT: A Quantitative Analysis of ChatGPT Authorships Through Amazon.com","volume":"40","year":"2024","journal-title":"Publ. Res. Q."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Bucol, J.L., and Sangkawong, N. (2024). Exploring ChatGPT as a Writing Assessment Tool. Innov. Educ. Teach. Int., 1\u201316.","DOI":"10.1080\/14703297.2024.2363901"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"704","DOI":"10.1057\/s41599-023-02119-6","article-title":"ChatGPT as a COBUILD lexicographer","volume":"10","author":"Lew","year":"2023","journal-title":"Humanit. Soc. Sci. Commun."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"3219","DOI":"10.1007\/s00405-024-08524-0","article-title":"ChatGPT vs. web search for patient questions: What does ChatGPT do better?","volume":"281","author":"Shen","year":"2024","journal-title":"Eur. Arch. Otorhinolaryngol."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"779","DOI":"10.1007\/s00062-024-01426-y","article-title":"Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases","volume":"34","author":"Horiuchi","year":"2024","journal-title":"Clin. Neuroradiol."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1790","DOI":"10.1007\/s11695-023-06603-5","article-title":"Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery","volume":"33","author":"Samaan","year":"2023","journal-title":"Obes. 
Surg."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"26839","DOI":"10.1109\/ACCESS.2024.3365742","article-title":"A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges","volume":"12","author":"Raiaan","year":"2024","journal-title":"IEEE Access"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Kumar, S., Balachandran, V., Njoo, L., Anastasopoulos, A., and Tsvetkov, Y. (2023, January 2\u20136). Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia.","DOI":"10.18653\/v1\/2023.eacl-main.241"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Sable, R., Baviskar, V., Gupta, S., Pagare, D., Kasliwal, E., Bhosale, D., and Jade, P. (2023). AI Content Detection. Advanced Computing, Springer.","DOI":"10.1007\/978-3-031-56700-1_22"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1706","DOI":"10.1057\/s41599-024-04219-3","article-title":"Does ChatGPT show gender bias in behavior detection?","volume":"11","author":"Wu","year":"2024","journal-title":"Humanit. Soc. Sci. Commun."},{"key":"ref_31","first-page":"4","article-title":"Detecting LLM-assisted writing in scientific communication: Are we there yet?","volume":"9","author":"Lazebnik","year":"2024","journal-title":"J. Data Inf. Sci."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Rawashdeh, A., Rawashdeh, O., and Rawashdeh, M. (2024, January 14\u201315). ChatGPT and ChatGPT API: An Experiment with Evaluating ChatGPT Answers. Proceedings of the Future Technologies Conference (FTC), London, UK. Lecture Notes in Networks and Systems.","DOI":"10.1007\/978-3-031-73125-9_33"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Pieper, T., Ballout, M., Krumnack, U., Heidemann, G., and K\u00fchnberger, K. (2024). 
Enhancing small language models via ChatGPT and dataset augmentation. Natural Language Processing and Information Systems, Springer.","DOI":"10.1007\/978-3-031-70242-6_26"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Vinora, A., Bojiah, J., and Alfiras, M. (2025). Sentiment analysis of reviews on AI interface ChatGPT: An interpretative study. Business Sustainability with Artificial Intelligence (AI): Challenges and Opportunities, Springer.","DOI":"10.1007\/978-3-031-71318-7_30"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1038\/s41586-024-07566-y","article-title":"AI models collapse when trained on recursively generated data","volume":"631","author":"Shumailov","year":"2024","journal-title":"Nature"},{"key":"ref_36","unstructured":"Laxman, D. (2024, November 04). Kaggle: Amazon Reviews Dataset. Available online: https:\/\/www.kaggle.com\/datasets\/dongrelaxman\/amazon-reviews-dataset."},{"key":"ref_37","unstructured":"Jagathratchakan, J. (2024, November 04). Kaggle: Indian Airlines Customer Reviews. Available online: https:\/\/www.kaggle.com\/datasets\/jagathratchakan\/indian-airlines-customer-reviews."},{"key":"ref_38","unstructured":"Elgiriyewithana, N. (2024, November 04). Kaggle: McDonald\u2019s Store Reviews. Available online: https:\/\/www.kaggle.com\/datasets\/nelgiriyewithana\/mcdonalds-store-reviews."},{"key":"ref_39","unstructured":"Nicapotato, N. (2024, November 04). Kaggle: Women\u2019s E-Commerce Clothing Reviews. Available online: https:\/\/www.kaggle.com\/datasets\/nicapotato\/womens-ecommerce-clothing-reviews."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Volkova, S. (2023). An overview on data augmentation for machine learning. Digital and Information Technologies in Economics and Management, Springer.","DOI":"10.1007\/978-3-031-55349-3_12"},{"key":"ref_41","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., and Zhou, D. 
(December, January 28). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"140","DOI":"10.1111\/nyas.15007","article-title":"Holistic Evaluation of Language Models","volume":"1525","author":"Bommasani","year":"2023","journal-title":"Ann. N. Y. Acad. Sci."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1007\/s10462-024-10903-2","article-title":"Contrasting Linguistic Patterns in Human and LLM-Generated News Text","volume":"57","author":"Vilares","year":"2024","journal-title":"Artif. Intell. Rev."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Botana, F., Recio, T., and V\u00e9lez, M.P. (2024). On Using GeoGebra and ChatGPT for Geometric Discovery. Computers, 13.","DOI":"10.3390\/computers13080187"},{"key":"ref_45","unstructured":"Selvio\u011flu, A., Adanova, V., and Atagoziev, M. (2025). Feature Extraction and Analysis for GPT-Generated Text. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Kumar, V., Choudhary, A., and Cho, E. (2020, January 4\u20137). Data Augmentation Using Pre-Trained Transformer Models. Proceedings of the 2nd Workshop on Life-Long Learning for Spoken Language Systems, Suzhou, China.","DOI":"10.18653\/v1\/2020.lifelongnlp-1.3"},{"key":"ref_47","unstructured":"Lajcinova, B., Valabek, P., and Spisiak, M. (2025, March 29). Named Entity Recognition for Address Extraction in Speech-to-Text Transcriptions Using Synthetic Data. 
Available online: https:\/\/www.eurocc-access.eu\/success-stories\/named-entity-recognition-for-address-extraction-in-speech-to-text-transcriptions-using-synthetic-data\/."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"826","DOI":"10.1162\/tacl_a_00492","article-title":"Generate, annotate, and learn: NLP with synthetic text","volume":"10","author":"He","year":"2022","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1016\/j.aiopen.2022.03.001","article-title":"Data augmentation approaches in natural language processing: A survey","volume":"3","author":"Li","year":"2022","journal-title":"AI Open"},{"key":"ref_50","first-page":"8","article-title":"Effective Listings of Function Stop words for Twitter","volume":"3","author":"Choy","year":"2012","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_51","unstructured":"(2024, November 30). Google Code Archive. Available online: https:\/\/code.google.com\/archive\/p\/stop-words\/downloads."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Novotn\u00fd, V. (2018, January 22\u201326). Implementation Notes for the Soft Cosine Measure. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy.","DOI":"10.1145\/3269206.3269317"},{"key":"ref_53","unstructured":"Buda, A., and Jarynowski, A. (2010). Implementation Notes for the Soft Cosine Measure, Wydawnictwo Niezale\u017cne."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"567","DOI":"10.1016\/S0378-4371(01)00355-7","article-title":"Beyond the Zipf\u2013Mandelbrot law in quantitative linguistics","volume":"300","author":"Montemurro","year":"2001","journal-title":"Phys. A Stat. Mech. 
Its Appl."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/5\/162\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:22:33Z","timestamp":1760030553000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/5\/162"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,27]]},"references-count":54,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2025,5]]}},"alternative-id":["computers14050162"],"URL":"https:\/\/doi.org\/10.3390\/computers14050162","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,27]]}}}