{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T02:26:28Z","timestamp":1770085588449,"version":"3.49.0"},"reference-count":37,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T00:00:00Z","timestamp":1769904000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>The increasing ability of Large Language Models (LLMs) to generate fluent and coherent text has heightened the need for resources to analyze and detect synthetic content, particularly in Spanish, where the scarcity of datasets hinders the development of reliable detection systems. This work presents a Spanish-language dataset of 18,236 synthetic news descriptions generated from real journalistic headlines using a fully reproducible, open-source pipeline. The methodology used to produce the dataset includes both a Retrieval Augmented Generation (RAG) approach, which incorporates contextual information from recent news descriptions, and a NO-RAG approach, which relies solely on the headline. Texts were generated with the instruction-tuned Mistral 7B Instruct model, systematically varying temperature to explore the effect of generation parameters. The dataset includes detailed metadata linking each synthetic description to its source headline, generation settings, and, when applicable, retrieved contextual content. By combining contextual grounding, controlled parameter variation, and source-level traceability, this dataset provides a reproducible and richly annotated resource that supports research in Spanish synthetic text and evaluation of LLM-based generation.<\/jats:p>","DOI":"10.3390\/data11020029","type":"journal-article","created":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T12:49:44Z","timestamp":1770036584000},"page":"29","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Controlled Generation of Synthetic Spanish Texts: A Dataset Using LLMs with and Without Contextual Retrieval"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8989-6920","authenticated-orcid":false,"given":"Jos\u00e9 M.","family":"Garc\u00eda-Campos","sequence":"first","affiliation":[{"name":"Department of Telematics Engineering, University of Seville, Camino de los Descubrimientos s\/n, 41092 Seville, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4809-5654","authenticated-orcid":false,"given":"Agust\u00edn W.","family":"Lara-Romero","sequence":"additional","affiliation":[{"name":"Department of Telematics Engineering, University of Seville, Camino de los Descubrimientos s\/n, 41092 Seville, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8461-1102","authenticated-orcid":false,"given":"Vicente","family":"Mayor","sequence":"additional","affiliation":[{"name":"Department of Telematics Engineering, University of Seville, Camino de los Descubrimientos s\/n, 41092 Seville, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1277-3310","authenticated-orcid":false,"given":"Jorge","family":"Calvillo-Arbizu","sequence":"additional","affiliation":[{"name":"Department of Telematics Engineering, University of Seville, Camino de los Descubrimientos s\/n, 41092 Seville, Spain"},{"name":"Biomedical Engineering Group, University of Seville, Camino de los Descubrimientos s\/n, 41092 Seville, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,2,1]]},"reference":[{"key":"ref_1","unstructured":"Wang, K., Zhu, J., Ren, M., Liu, Z., Li, S., Zhang, Z., Zhang, C., Wu, X., Zhan, Q., and Liu, Q. (2024). A Survey on Data Synthesis and Augmentation for Large Language Models. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Yu, X., Zhang, Z., Niu, F., Hu, X., Xia, X., and Grundy, J. (2024, January 7\u201311). What Makes a High-Quality Training Dataset for Large Language Models: A Practitioners\u2019 Perspective. Proceedings of the 39th IEEE\/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA.","DOI":"10.1145\/3691620.3695061"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"275","DOI":"10.1162\/coli_a_00549","article-title":"A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions","volume":"51","author":"Wu","year":"2025","journal-title":"Comput. Linguist."},{"key":"ref_4","unstructured":"Calzolari, N., Huang, C.-R., Kim, H., Pustejovsky, J., Wanner, L., Choi, K.-S., Ryu, P.-M., Chen, H.-H., Donatelli, L., and Ji, H. (2022). Threat Scenarios and Best Practices to Detect Neural Fake News. Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12\u201317 October 2022, International Committee on Computational Linguistics."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3571730","article-title":"Survey of Hallucination in Natural Language Generation","volume":"55","author":"Ji","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"100793","DOI":"10.1016\/j.cosrev.2025.100793","article-title":"AI-generated text detection: A comprehensive review of methods, datasets, and applications","volume":"58","author":"Kehkashan","year":"2025","journal-title":"Comput. Sci. Rev."},{"key":"ref_7","unstructured":"Yang, X., Chen, W., Wu, Y., Petzold, L., Wang, W.Y., and Chen, H. (2023). DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Yu, X., Chen, K., Yang, Q., Zhang, W., and Yu, N. (2024). Text Fluoroscopy: Detecting LLM-Generated Text through Intrinsic Features. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, PR, USA, 10\u201314 November 2024, Association for Computing Linguistics.","DOI":"10.18653\/v1\/2024.emnlp-main.885"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Soto-Osorio, D., Sidorov, G., Chanona-Hern\u00e1ndez, L., and L\u00f3pez-Ram\u00edrez, B.C. (2024). Identification of Scientific Texts Generated by Large Language Models Using Machine Learning. Computers, 13.","DOI":"10.3390\/computers13120346"},{"key":"ref_10","unstructured":"Mitchell, E., Lee, Y., Khazatsky, A., Manning, C.D., and Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature. arXiv."},{"key":"ref_11","unstructured":"Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., Geiping, G., and Goldstein, T. (2024). Spotting LLMs with binoculars: Zero-shot detection of machine-generated text. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kumar, B.P., Ahmed, M.S., and Sadanandam, M. (2024). DistilBERT: A Novel Approach to Detect Text Generated by Large Language Models (LLM). Res. Sq.","DOI":"10.21203\/rs.3.rs-3909387\/v1"},{"key":"ref_13","unstructured":"Hernandez, D.I., Hope, T., and Li, M. (2024). LLM-DetectAIve: A Tool for Fine-Grained Machine-Generated Text Detection. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Miami, PR, USA, 7\u201311 December 2024, Association for Computing Linguistics."},{"key":"ref_14","unstructured":"Moens, M.-F., Huang, X., Specia, L., and Yih, S.W. (2021). TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16\u201320 November 2021, Association for Computing Machinery."},{"key":"ref_15","unstructured":"Duh, K., Gomez, H., and Bethard, S. (2024). Ghostbuster: Detecting Text Ghostwritten by Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 13\u201318 April 2024, Association for Computing Linguistics."},{"key":"ref_16","unstructured":"Duh, K., Gomez, H., and Bethard, S. (2024). LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?. Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 7\u201312 June 2024, Association for Computing Linguistics."},{"key":"ref_17","unstructured":"Ku, L.-W., Martins, A., and Srikumar, V. (2024). MAGE: Machine-Generated Text Detection in the Wild. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 10\u201315 May 2024, Association for Computing Linguistics."},{"key":"ref_18","unstructured":"Ku, L.-W., Martins, A., and Srikumar, V. (2024). RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 10\u201315 May 2024, Association for Computing Linguistics."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Almeman, K. (2025). Automated Building of a Multidialectal Parallel Arabic Corpus Using Large Language Models. Data, 10.","DOI":"10.3390\/data10120208"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.-S., and Li, Q. (2024, January 25\u201329). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD \u201924), Barcelona, Spain.","DOI":"10.1145\/3637528.3671470"},{"key":"ref_21","unstructured":"Huang, Y., and Huang, J. (2024). A survey on retrieval-augmented text generation for large language models. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Andrzejewski, M., Dubicka, N., Podolak, J., Kowal, M., and Si\u0142ka, J. (2025). Automated Test Generation Using Large Language Models. Data, 10.","DOI":"10.3390\/data10100156"},{"key":"ref_23","unstructured":"(2026, January 09). RSS 2.0 Specification. Available online: https:\/\/www.rssboard.org\/rss-specification."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Palanisamy, S., and SuvithaVani, P. (2020, January 22\u201324). A Survey on RDBMS and NoSQL Databases: MySQL vs MongoDB. Proceedings of the 2020 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India.","DOI":"10.1109\/ICCCI48352.2020.9104047"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Tu, H. (2024, January 19\u201321). Cassandra vs. MongoDB: A Systematic Review of Two NoSQL Data Stores in Their Industry Uses. Proceedings of the IEEE 7th International Conference on Big Data and Artificial Intelligence (BDAI), Beijing, China.","DOI":"10.1109\/BDAI62182.2024.10692676"},{"key":"ref_26","first-page":"169","article-title":"Instruction Tuning for Large Language Models: A Survey","volume":"58","author":"Zhang","year":"2025","journal-title":"ACM Comput. Surv."},{"key":"ref_27","unstructured":"Rambow, O., Wanner, L., and Apidianaki, M. (2025). Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation. Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19\u201324 January 2025, Association for Computational Linguistics."},{"key":"ref_28","unstructured":"Garc\u00eda-Campos, J.M., Lara, A., Mayor, V., and Calvillo-Arbizu, J. (2026, January 14). Controlled-News-Generation-Es. Available online: https:\/\/github.com\/jmgarcam\/controlled-news-generation-es."},{"key":"ref_29","unstructured":"Jiang, D., Liu, Y., Liu, S., Zhao, J., Zhang, H., Gao, Z., Zhang, X., Li, J., and Xiong, H. (2023). From clip to dino: Visual encoders shout in multi-modal large language models. arXiv."},{"key":"ref_30","unstructured":"Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2019). The curious case of neural text degeneration. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"381","DOI":"10.3758\/BRM.42.2.381","article-title":"MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment","volume":"42","author":"McCarthy","year":"2010","journal-title":"Behav. Res. Methods"},{"key":"ref_32","first-page":"29","article-title":"Medidas sencillas de lecturabilidad","volume":"214","year":"1959","journal-title":"Consigna"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1037\/h0057532","article-title":"A new readability yardstick","volume":"32","author":"Flesch","year":"1948","journal-title":"J. Appl. Psychol."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Li, J., Sun, A., Han, J., and Li, C. (2023). A Survey on Deep Learning for Named Entity Recognition: Extended Abstract. Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3\u20137 April 2023, IEEE.","DOI":"10.1109\/ICDE55515.2023.00335"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1091","DOI":"10.1109\/TPAMI.2007.1078","article-title":"A Normalized Levenshtein Distance Metric","volume":"29","author":"Yujian","year":"2007","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Shuster, K., Poff, S., Chen, M., Kiela, D., and Weston, J. (2021). Retrieval augmentation reduces hallucination in conversation. arXiv.","DOI":"10.18653\/v1\/2021.findings-emnlp.320"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wi\u0119ckowska, B., Kubiak, K.B., J\u00f3\u017awiak, P., Moryson, W., and Stawi\u0144ska-Witoszy\u0144ska, B. (2022). Cohen\u2019s Kappa Coefficient as a Measure to Assess Classification Improvement following the Addition of a New Marker to a Regression Model. Int. J. Environ. Res. Public Health, 19.","DOI":"10.3390\/ijerph191610213"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/2\/29\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T13:18:06Z","timestamp":1770038286000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/2\/29"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,1]]},"references-count":37,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2026,2]]}},"alternative-id":["data11020029"],"URL":"https:\/\/doi.org\/10.3390\/data11020029","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,1]]}}}