{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T22:36:30Z","timestamp":1776465390263,"version":"3.51.2"},"reference-count":54,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,7,19]],"date-time":"2025-07-19T00:00:00Z","timestamp":1752883200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001711","name":"Swiss National Science Foundation","doi-asserted-by":"publisher","award":["197864"],"award-info":[{"award-number":["197864"]}],"id":[{"id":"10.13039\/501100001711","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>We present a pipeline for synthetic simplification of text in French that combines large language models with structured semantic guidance. Our approach enhances data generation by integrating contextual knowledge from Wikipedia and Vikidia articles and injecting symbolic control through lightweight knowledge graphs. To construct document-level representations, we implement a progressive summarization process that incrementally builds running summaries and extracts key ideas. Simplifications are generated iteratively and assessed using semantic comparisons between input and output graphs, enabling targeted regeneration when critical information is lost. Our system is implemented using LangChain\u2019s orchestration framework, allowing modular and extensible coordination of LLM components. 
Evaluation shows that context-aware prompting and semantic feedback improve simplification quality across successive iterations.<\/jats:p>","DOI":"10.3390\/make7030068","type":"journal-article","created":{"date-parts":[[2025,7,21]],"date-time":"2025-07-21T09:33:53Z","timestamp":1753090433000},"page":"68","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Towards Robust Synthetic Data Generation for Simplification of Text in French"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4195-599X","authenticated-orcid":false,"given":"Nikos","family":"Tsourakis","sequence":"first","affiliation":[{"name":"Department of Translation Technology, TIM\/FTI, University of Geneva, Bd du Pont-d\u2019Arve 40, 1205 Gen\u00e8ve, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Alu\u00edsio, S.M., Specia, L., Pardo, T.A., Maziero, E.G., and Fortes, R.P. (2008, January 16\u201319). Towards Brazilian Portuguese Automatic Text Simplification Systems. Proceedings of the 8th ACM Symposium on Document Engineering, Sao Paulo, Brazil.","DOI":"10.1145\/1410140.1410191"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"259","DOI":"10.1075\/itl.165.2.06sid","article-title":"A Survey of Research on Text Simplification","volume":"165","author":"Siddharthan","year":"2014","journal-title":"ITL-Int. J. Appl. Linguist."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1016\/j.csl.2016.12.001","article-title":"Source Sentence Simplification for Statistical Machine Translation","volume":"45","author":"Hasler","year":"2017","journal-title":"Comput. Speech Lang."},{"key":"ref_4","unstructured":"Vickrey, D., and Koller, D. (2008, January 16\u201318). Sentence Simplification for Semantic Role Labeling. 
Proceedings of the ACL-08: HLT, Columbus, OH, USA."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"957","DOI":"10.1007\/s10772-024-10146-0","article-title":"Automatic Text Simplification for French: Model Fine-Tuning for Simplicity Assessment and Simpler Text Generation","volume":"27","author":"Ormaechea","year":"2024","journal-title":"Int. J. Speech Technol."},{"key":"ref_6","unstructured":"Ormaechea, L., Tsourakis, N., Schwab, D., Bouillon, P., and Lecouteux, B. (2023, January 16\u201317). Simple, Simpler and Beyond: A Fine-Tuning BERT-Based Approach to Enhance Sentence Complexity Assessment for Text Simplification. Proceedings of the International Conference on Natural Language and Speech Processing, Trento, Italy."},{"key":"ref_7","unstructured":"Le Scao, T., Fan, A., Akiki, C., Pavlick, E., Ili\u0107, S., Hesslow, D., Castagn\u00e9, R., Luccioni, A.S., Yvon, F., and Gall\u00e9, M. (2023). Bloom: A 176b-Parameter Open-Access Multilingual Language Model. arXiv."},{"key":"ref_8","unstructured":"OpenAI (2025, July 03). GPT-4 Technical Report. Available online: https:\/\/openai.com\/research\/gpt-4."},{"key":"ref_9","first-page":"46534","article-title":"Self-Refine: Iterative Refinement with Self-Feedback","volume":"Volume 36","author":"Oh","year":"2023","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Long, L., Wang, R., Xiao, R., Zhao, J., Ding, X., Chen, G., and Wang, H. (2024). On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.658"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Reynolds, L., and McDonell, K. (2021, January 8\u201313). Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. 
Proceedings of the Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan.","DOI":"10.1145\/3411763.3451760"},{"key":"ref_12","unstructured":"Rogers, A., Boyd-Graber, J., and Okazaki, N. (2023, January 9\u201314). Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"494","DOI":"10.1109\/TNNLS.2021.3070843","article-title":"A Survey on Knowledge Graphs: Representation, Acquisition, and Applications","volume":"33","author":"Ji","year":"2021","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3512467","article-title":"A Survey of Knowledge-Enhanced Text Generation","volume":"54","author":"Yu","year":"2022","journal-title":"ACM Comput. Surv."},{"key":"ref_15","first-page":"46595","article-title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","volume":"36","author":"Zheng","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_16","unstructured":"Carroll, J.A., Minnen, G., Pearce, D., Canning, Y., Devlin, S., and Tait, J. (1999, January 8\u201312). Simplifying Text for Language-Impaired Readers. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway."},{"key":"ref_17","unstructured":"Siddharthan, A. (2002, January 13\u201315). An Architecture for a Text Simplification System. Proceedings of the Language Engineering Conference, Hyderabad, India."},{"key":"ref_18","first-page":"161","article-title":"The Use of a Psycholinguistic Database in the Simplification of Text for Aphasic Readers","volume":"77","author":"Devlin","year":"1998","journal-title":"Linguist. 
Databases"},{"key":"ref_19","unstructured":"Nisioi, S., \u0160tajner, S., Ponzetto, S.P., and Dinu, L.P. (August, January 30). Exploring Neural Text Simplification Models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, BC, Canada."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1162\/tacl_a_00139","article-title":"Problems in Current Text Simplification Research: New Data Can Help","volume":"3","author":"Xu","year":"2015","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zhang, X., and Lapata, M. (2017). Sentence Simplification with Deep Reinforcement Learning. arXiv.","DOI":"10.18653\/v1\/D17-1062"},{"key":"ref_22","unstructured":"Mallinson, J., and Lapata, M. (2019). Controllable Sentence Simplification: Employing Syntactic and Lexical Constraints. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Maddela, M., Xu, W., and Preotiuc-Pietro, D. (2021, January 7\u201311). Controllable Text Simplification with Explicit Paraphrasing. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic.","DOI":"10.18653\/v1\/2021.naacl-main.277"},{"key":"ref_24","unstructured":"Tsourakis, N. (2022). Machine Learning Techniques for Text: Apply Modern Techniques with Python, Packt Publishing Ltd."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Jamet, H., Shrestha, Y.R., and Vlachos, M. (2024). Difficulty Estimation and Simplification of French Text Using LLMs. Generative Intelligence and Intelligent Tutoring Systems, Springer Nature.","DOI":"10.1007\/978-3-031-63028-6_34"},{"key":"ref_26","unstructured":"Gala, N., Tack, A., Javourey-Drevet, L., Fran\u00e7ois, T., and Ziegler, J.C. (2020, January 11\u201316). Alector: A parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers. 
Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France."},{"key":"ref_27","unstructured":"Martin, L., Fan, A., De La Clergerie, E., Bordes, A., and Sagot, B. (2020). MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D., and Hajishirzi, H. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv.","DOI":"10.18653\/v1\/2023.acl-long.754"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Sahu, G., Vechtomova, O., and Laradji, I.H. (2025). A Guide To Effectively Leveraging LLMs for Low-Resource Text Summarization: Data Augmentation and Semi-Supervised Approaches. arXiv.","DOI":"10.18653\/v1\/2025.findings-naacl.86"},{"key":"ref_30","unstructured":"Li, Z., Wang, X., Zhao, J., Yang, S., Du, G., Hu, X., Zhang, B., Ye, Y., Li, Z., and Zhao, R. (2024). PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-Consistency. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Akella, A., Manatkar, A., Chavda, B., and Patel, H. (2024, January 16\u201321). An Automatic Prompt Generation System for Tabular Data Tasks. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Volume 6: Industry Track, Mexico City, Mexico.","DOI":"10.18653\/v1\/2024.naacl-industry.16"},{"key":"ref_32","unstructured":"Chai, Y., Xie, H., and Qin, J.S. (2025). Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities. arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Sufi, F. (2024). Addressing Data Scarcity in the Medical Domain: A GPT-Based Approach for Synthetic Data Generation and Feature Extraction. 
Information, 15.","DOI":"10.3390\/info15050264"},{"key":"ref_34","unstructured":"Guo, X., and Chen, Y. (2024). Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Liu, N.F., Zhang, T., and Liang, P. (2023). Evaluating Verifiability in Generative Search Engines. arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.467"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Zhao, S., Meng, R., He, D., Andi, S., and Bambang, P. (2018). Integrating Transformer and Paraphrase Rules for Sentence Simplification. arXiv.","DOI":"10.18653\/v1\/D18-1355"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/coli_a_00370","article-title":"Data-Driven Sentence Simplification: Survey and Benchmark","volume":"46","author":"Scarton","year":"2020","journal-title":"Comput. Linguist."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"e2305016120","DOI":"10.1073\/pnas.2305016120","article-title":"ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks","volume":"120","author":"Gilardi","year":"2023","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Ding, B., Qin, C., Zhao, R., Luo, T., Li, X., Chen, G., Xia, W., Hu, J., Luu, A.T., and Joty, S. (2024). Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.97"},{"key":"ref_40","unstructured":"Ormaechea, L., Tsourakis, N., Bouillon, P., Lecouteux, B., and Schwab, D. (2025, January 17\u201321). Towards High-Quality LLM-Based Data for French Spontaneous Speech Simplification: An Exo-Refinement Approach. Proceedings of the Interspeech, Rotterdam, The Netherlands."},{"key":"ref_41","unstructured":"Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X., and Zhou, D. (2023). Large Language Models Cannot Self-Correct Reasoning Yet. 
arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"3580","DOI":"10.1109\/TKDE.2024.3352100","article-title":"Unifying Large Language Models and Knowledge Graphs: A Roadmap","volume":"36","author":"Pan","year":"2024","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_43","unstructured":"LangChain (2025, July 03). Summarization Use Cases (Python). Available online: https:\/\/python.langchain.com\/v0.1\/docs\/use_cases\/summarization\/."},{"key":"ref_44","unstructured":"LangChain (2025, July 03). RefineDocumentsChain (JavaScript). Available online: https:\/\/js.langchain.com\/v0.1\/docs\/modules\/chains\/document\/refine\/."},{"key":"ref_45","unstructured":"LangChain (2025, July 03). MapReduceDocumentsChain for Summarization. Available online: https:\/\/python.langchain.com\/docs\/how_to\/summarize_map_reduce\/."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 6\u201312). BLEU: A method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_47","unstructured":"Lin, C.Y. (2004, January 25\u201326). ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Text Summarization Branches Out, Barcelona, Spain."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1162\/tacl_a_00107","article-title":"Optimizing Statistical Machine Translation for Text Simplification","volume":"4","author":"Xu","year":"2016","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_49","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019, January 3\u20137). 
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Speer, R., Chin, J., and Havasi, C. (2018). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. arXiv.","DOI":"10.1609\/aaai.v31i1.11164"},{"key":"ref_52","unstructured":"Shi, P., and Lin, J. (2019). Simple BERT Models for Relation Extraction and Semantic Role Labeling. arXiv."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Bai, X., Chen, Y., and Zhang, Y. (2022). Graph Pre-Training for AMR Parsing and Generation. arXiv.","DOI":"10.18653\/v1\/2022.acl-long.415"},{"key":"ref_54","unstructured":"Gupta, S., Ranjan, R., and Singh, S.N. (2024). A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. arXiv."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/3\/68\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:12:42Z","timestamp":1760033562000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/3\/68"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,19]]},"references-count":54,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["make7030068"],"URL":"https:\/\/doi.org\/10.3390\/make7030068","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,19]]}}}