{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T19:12:53Z","timestamp":1767985973463,"version":"3.49.0"},"reference-count":52,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2025,2,18]],"date-time":"2025-02-18T00:00:00Z","timestamp":1739836800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004285","name":"St. Petersburg State University (SPbSU)","doi-asserted-by":"publisher","award":["124032900006-1"],"award-info":[{"award-number":["124032900006-1"]}],"id":[{"id":"10.13039\/501100004285","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Previously, it was suggested that the \u201cpersona-driven\u201d approach can contribute to producing sufficiently diverse synthetic training data for Large Language Models (LLMs), which are currently about to run out of real natural language texts. In our paper, we explore whether personas evoked from LLMs through HCI-style descriptions could indeed imitate human-like differences in authorship. To this end, we ran an associative experiment with 50 human participants and four artificial personas evoked from the popular LLM-based services: GPT-4(o) and YandexGPT Pro. For each of the five stimulus words selected from university websites\u2019 homepages, we asked both groups of subjects to come up with 10 short associations (in Russian). We then used cosine similarity and Mahalanobis distance to measure the distance between the association lists produced by different humans and personas. While the difference in the similarity was significant for different human associators and different gender and age groups, this was not the case for the different personas evoked from ChatGPT or YandexGPT. 
Our findings suggest that the LLM-based services so far fall short of imitating the associative thesauri of different authors. The outcome of our study might be of interest to computer linguists, as well as AI\/ML scientists and prompt engineers.<\/jats:p>","DOI":"10.3390\/bdcc9020046","type":"journal-article","created":{"date-parts":[[2025,2,18]],"date-time":"2025-02-18T12:16:37Z","timestamp":1739880997000},"page":"46","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users\u2019 Associative Thesauri"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1889-0692","authenticated-orcid":false,"given":"Maxim","family":"Bakaev","sequence":"first","affiliation":[{"name":"Department of Data Collection and Processing Systems, Novosibirsk State Technical University, Novosibirsk 630073, Russia"}]},{"given":"Svetlana","family":"Gorovaia","sequence":"additional","affiliation":[{"name":"Department of Mathematical Linguistics, Saint-Petersburg State University, St. Petersburg 199034, Russia"}]},{"given":"Olga","family":"Mitrofanova","sequence":"additional","affiliation":[{"name":"Department of Mathematical Linguistics, Saint-Petersburg State University, St. Petersburg 199034, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1109\/MITP.2023.3262923","article-title":"How Many Data Does Machine Learning in Human\u2013Computer Interaction Need?: Re-Estimating the Dataset Size for Convolutional Neural Network-Based Models of Visual Perception","volume":"25","author":"Bakaev","year":"2023","journal-title":"IT Prof."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Edwards, C. (2024, October 06). 
Data Quality May Be All You Need: Model Size Is Not Everything. Available online: https:\/\/cacm.acm.org\/news\/data-quality-may-be-all-you-need\/.","DOI":"10.1145\/3647631"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Liu, Y., Cao, J., Liu, C., Ding, K., and Jin, L. (2024). Datasets for large language models: A comprehensive survey. arXiv.","DOI":"10.21203\/rs.3.rs-3996137\/v1"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Yu, P., Xu, H., Hu, X., and Deng, C. (2023). Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare, 11.","DOI":"10.3390\/healthcare11202776"},{"key":"ref_5","unstructured":"Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., and Hobbhahn, M. (2024, January 21\u201327). Position: Will we run out of data? Limits of LLM scaling based on human-generated data. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria."},{"key":"ref_6","first-page":"50358","article-title":"Scaling data-constrained language models","volume":"36","author":"Muennighoff","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_7","unstructured":"Zhou, Y., Guo, C., Wang, X., Chang, Y., and Wu, Y. (2024). A survey on data augmentation in large model era. arXiv."},{"key":"ref_8","unstructured":"Liu, R., Wei, J., Liu, F., Si, C., Zhang, Y., Rao, J., Zheng, S., Peng, D., Yang, D., and Zhou, D. (2024, January 7\u20139). Best Practices and Lessons Learned on Synthetic Data. Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Ding, B., Qin, C., Zhao, R., Luo, T., Li, X., Chen, G., Xia, W., Hu, J., Luu, A.T., and Joty, S. (2024). Data augmentation using LLMs: Data perspectives, learning paradigms and challenges. 
arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.97"},{"key":"ref_10","unstructured":"Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Sleight, H., Hughes, J., Korbak, T., Agrawal, R., Pai, D., and Gromov, A. (2024). Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. arXiv."},{"key":"ref_11","unstructured":"Zhang, J., Qiao, D., Yang, M., and Wei, Q. (2024). Regurgitative Training: The Value of Real Data in Training Large Language Models. arXiv."},{"key":"ref_12","unstructured":"Jain, A., Montanari, A., and Sasoglu, E. (2024). Scaling laws for learning with real and surrogate data. arXiv."},{"key":"ref_13","unstructured":"Ferbach, D., Bertrand, Q., Bose, A.J., and Gidel, G. (2024). Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Goyal, M., and Mahmoud, Q.H. (2024). A systematic review of synthetic data generation techniques using generative AI. Electronics, 13.","DOI":"10.3390\/electronics13173509"},{"key":"ref_15","unstructured":"Nakada, R., Xu, Y., Li, L., and Zhang, L. (2024). Synthetic Oversampling: Theory and a Practical Approach Using LLMs to Address Data Imbalance. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Chan, X., Wang, X., Yu, D., Mi, H., and Yu, D. (2024). Scaling synthetic data creation with 1,000,000,000 personas. arXiv.","DOI":"10.14218\/JCTH.2023.00464"},{"key":"ref_17","unstructured":"Schreiber, W., White, J., and Schmidt, D.C. (2024, October 06). A Pattern Language for Persona-based Interactions with LLMs. Available online: https:\/\/www.dre.vanderbilt.edu\/~schmidt\/PDF\/Persona-Pattern-Language.pdf."},{"key":"ref_18","unstructured":"Bakaev, M., Gorovaia, S., and Mitrofanova, O. (2024, January 24\u201326). Multiple Personalities by Order: Can ChatGPT Simulate Personas for User-Centered Design?. 
Proceedings of the International Conference on Internet and Modern Society (IMS 2024), St. Petersburg, Russia. in print."},{"key":"ref_19","unstructured":"Karaulov, Y.N. (2010). Russian Language and Linguistic Personality, LKI Publishing House."},{"key":"ref_20","unstructured":"Weng, Y., He, S., Liu, K., Liu, S., and Zhao, J. (2024). ControlLM: Crafting diverse personalities for language models. arXiv."},{"key":"ref_21","first-page":"10622","article-title":"Evaluating and inducing personality in pre-trained language models","volume":"36","author":"Jiang","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_22","unstructured":"Frisch, I., and Giulianelli, M. (2024). LLM Agents in Interaction: Measuring Personality Consistency and Linguistic Alignment in Interacting Populations of Large Language Models. arXiv."},{"key":"ref_23","unstructured":"Dehghani, M., and Boyd, R.L. (2022). Handbook of Language Analysis in Psychology, Guilford Publications."},{"key":"ref_24","unstructured":"Kumar, S., Gupta, R., Akhtar, M.S., and Chakraborty, T. (2024, January 20\u201325). Adding SPICE to Life: Speaker Profiling in Multiparty Conversations. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1140\/epjds\/s13688-022-00333-x","article-title":"Predicting subjective well-being in a high-risk sample of Russian mental health app users","volume":"11","author":"Panicheva","year":"2022","journal-title":"EPJ Data Sci."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"102674","DOI":"10.1016\/j.ipm.2021.102674","article-title":"Detecting ethnicity-targeted hate speech in Russian social media texts","volume":"58","author":"Pronoza","year":"2021","journal-title":"Inf. Process. 
Manag."},{"key":"ref_27","first-page":"61","article-title":"Individual differences in the associative meaning of a word through the lens of the language model and semantic differential, Research Result","volume":"10","author":"Litvinova","year":"2024","journal-title":"Theor. Appl. Linguist."},{"key":"ref_28","unstructured":"Shaposhnikova, I.V., and Romanenko, A.A. (2022). The Russian Regional Associative Dictionary: Siberia and the Far East: In 2 Volumes, CPI NSU. (In Russian)."},{"key":"ref_29","first-page":"193","article-title":"Associative Lexicography and Studies of Language Consciousness","volume":"4","author":"Ufimtseva","year":"2014","journal-title":"Philol. Cult."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1016\/j.cogsys.2020.12.007","article-title":"Conceptual processing system for a companion robot","volume":"67","author":"Kotov","year":"2021","journal-title":"Cogn. Syst. Res."},{"key":"ref_31","first-page":"71","article-title":"Written vs. generated text: \u00abnaturalness\u00bb as a textual and psycholinguistic category, Research Result","volume":"10","author":"Kolmogorova","year":"2024","journal-title":"Theor. Appl. Linguist."},{"key":"ref_32","unstructured":"Xu, P. (2024). Distinguishing 19th Century British Novels by Women Authors Using Natural Language Processing. Intell. Planet J. Math. Its Appl., 1."},{"key":"ref_33","unstructured":"Chen, N., Wang, Y., Deng, Y., and Li, J. (2024). The Oscars of AI theater: A survey on role-playing with language models. arXiv."},{"key":"ref_34","unstructured":"Chen, J., Wang, X., Xu, R., Yuan, S., Zhang, Y., Shi, W., Xie, J., Li, S., Yang, R., and Zhu, T. (2024). From persona to personalization: A survey on role-playing language agents. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Chen, N., Wang, Y., Jiang, H., Cai, D., Li, Y., Chen, Z., Wang, L., and Li, J. (2022). 
Large Language Models Meet Harry Potter: A Bilingual Dataset for Aligning Dialogue Agents with Characters. arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.570"},{"key":"ref_36","unstructured":"Cooper, A., and Reimann, R. (2003). About Face 2.0: The Essentials of Interaction Design, John Wiley & Sons."},{"key":"ref_37","unstructured":"Nielsen, J. (2024, October 06). User Research with Humans vs. AI. Available online: https:\/\/jakobnielsenphd.substack.com\/p\/research-humans-ai."},{"key":"ref_38","first-page":"491","article-title":"Soft similarity and soft cosine measure: Similarity of features in vector space model","volume":"18","author":"Sidorov","year":"2014","journal-title":"Comput. Sist."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Azarpanah, H., and Farhadloo, M. (2021, January 10). Measuring Biases of Word Embeddings: What Similarity Measures and Descriptive Statistics to Use?. Proceedings of the First Workshop on Trustworthy Natural Language Processing, Online.","DOI":"10.18653\/v1\/2021.trustnlp-1.2"},{"key":"ref_40","first-page":"86","article-title":"Efficiency assessment of Euclidean and Makhalanobis distances for solving a major text classification problem","volume":"44","author":"Glazkova","year":"2017","journal-title":"Her. Dagestan State Tech. Univ. Tech. Sci."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Sitikhu, P., Pahi, K., Thapa, P., and Shakya, S. (2019, January 5). A Comparison of Semantic Similarity Methods for Maximum Human Interpretability. Proceedings of the IEEE International Conference on Artificial Intelligence for Transforming Business and Society, Kathmandu, Nepal.","DOI":"10.1109\/AITB48515.2019.8947433"},{"key":"ref_42","unstructured":"Pishchalnikova, V.A., Kardanova-Biryukova, K.S., Panarina, N.S., Stepykin, N.I., Khlopova, A.I., and Shevchenko, S.N. (2019). Associative Experiment: Theoretical and Applied Perspectives of Psycholinguistics, Valent."},{"key":"ref_43","unstructured":"Ufimtseva, N.V. 
(2011). Linguistic Consciousness: Dynamics and Variability, Institute of Linguistics of the Russian Academy of Sciences."},{"key":"ref_44","first-page":"27","article-title":"Comparative Study of Word Associations in Social Networks Corpora by means of Distributional Semantics Models for Russian","volume":"8","author":"Antipenko","year":"2020","journal-title":"Int. J. Open Inf. Technol."},{"key":"ref_45","unstructured":"Li, K., Liu, T., Bashkansky, N., Bau, D., Vi\u00e9gas, F., Pfister, H., and Wattenberg, M. (2024, January 7\u20139). Measuring and Controlling Instruction (In) Stability in Language Model Dialogs. Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1038\/s41586-024-07566-y","article-title":"AI models collapse when trained on recursively generated data","volume":"631","author":"Shumailov","year":"2024","journal-title":"Nature"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Fan, L., Chen, K., Krishnan, D., Katabi, D., Isola, P., and Tian, Y. (2024, January 17\u201321). Scaling laws of synthetic images for model training... for now. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00705"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Hang, C.-N., Yu, P.-D., Morabito, R., and Tan, C.-W. (2024). Large Language Models Meet Next-Generation Networking Technologies: A Review. Future Internet, 16.","DOI":"10.3390\/fi16100365"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Whitney, C.D., and Norman, J. (2024, January 3\u20136). Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention. 
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, Rio de Janeiro, Brazil.","DOI":"10.1145\/3630106.3659002"},{"key":"ref_50","unstructured":"Khalil, M., Vadiee, F., Shakya, R., and Liu, Q. (2025). Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation. arXiv."},{"key":"ref_51","first-page":"55734","article-title":"Large language model as attributed training data generator: A tale of diversity and bias","volume":"36","author":"Yu","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_52","first-page":"145","article-title":"The Problems of LLM-generated Data in Social Science Research","volume":"18","author":"Rossi","year":"2024","journal-title":"Sociologica"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/2\/46\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:37:13Z","timestamp":1760027833000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/2\/46"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,18]]},"references-count":52,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,2]]}},"alternative-id":["bdcc9020046"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9020046","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,18]]}}}