{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:40:49Z","timestamp":1781534449040,"version":"3.54.5"},"reference-count":28,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2026,4,26]],"date-time":"2026-04-26T00:00:00Z","timestamp":1777161600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the Ministry of Education industry\u2013university cooperative education project","award":["231101418285337"],"award-info":[{"award-number":["231101418285337"]}]},{"award":["231101418285337"],"award-info":[{"award-number":["231101418285337"]}],"id":[{"id":"https:\/\/ror.org\/05tqgjy81","id-type":"ROR","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100009002","name":"Shanghai University","doi-asserted-by":"publisher","award":["22H00324"],"award-info":[{"award-number":["22H00324"]}],"id":[{"id":"10.13039\/501100009002","id-type":"DOI","asserted-by":"publisher"}]},{"award":["22H00324"],"award-info":[{"award-number":["22H00324"]}],"id":[{"id":"https:\/\/ror.org\/006teas31","id-type":"ROR","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Dense retrievers rely heavily on high-quality training triplets, yet existing data construction strategies remain inadequate for reasoning-intensive retrieval tasks involving multi-hop reasoning, entity relation tracing, and implicit evidence composition. Positive samples are often based on shallow semantic relevance and fail to capture explicit reasoning chains, while negative samples are typically sampled from lexical overlap or random candidates and therefore provide limited supervision for learning clear decision boundaries. 
To address these issues, we propose S-Gens, a structure-aware synthetic data generation framework for enhancing reasoning-intensive dense retrieval. S-Gens uses relation paths in an external knowledge graph to synthesize queries and structurally consistent positive samples, and further constructs semantically similar but structurally inconsistent hard negatives. To improve data reliability, we introduce a Siamese graph neural network-based consistency filtering mechanism. Because S-Gens operates entirely during offline supervision construction, it remains model-agnostic, preserves the original inference architecture, and is complementary to graph-guided retrieval or RAG pipelines that inject structure online. Experiments on five benchmark datasets show that S-Gens consistently improves multiple trainable retrievers, with the largest gains on multi-hop reasoning tasks such as WebQSP and HotpotQA. These results indicate that structure-aware synthetic supervision can effectively improve dense retrieval in reasoning-intensive settings.<\/jats:p>","DOI":"10.3390\/info17050413","type":"journal-article","created":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T11:33:37Z","timestamp":1777376017000},"page":"413","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["S-Gens: Structure-Aware Synthetic Data Generation for Enhancing Reasoning-Intensive Dense Retrieval"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-6824-991X","authenticated-orcid":false,"given":"Zhou","family":"Lei","sequence":"first","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8713-8853","authenticated-orcid":false,"given":"Yanqi","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, 
Shanghai University, Shanghai 200444, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-5880-7259","authenticated-orcid":false,"given":"Shengbo","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2026,4,26]]},"reference":[{"key":"ref_1","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive NLP tasks","volume":"Volume 33","author":"Lewis","year":"2020","journal-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6\u201312 December 2020"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Izacard, G., and Grave, E. (2021). Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19\u201323 April 2021, Association for Computational Linguistics.","DOI":"10.18653\/v1\/2021.eacl-main.74"},{"key":"ref_3","unstructured":"Xiong, W., Li, X.L., Iyer, S., Du, J., Lewis, P., Wang, W.Y., Mehdad, Y., Yih, S., Riedel, S., and Kiela, D. (2021). Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval. Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3\u20137 May 2021, OpenReview.net."},{"key":"ref_4","unstructured":"Asai, A., Wu, Z., Wang, Y., Sil, A., and Hajishirzi, H. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. 
Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 7\u201311 May 2024, OpenReview.net."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.T. (2020, January 16\u201320). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online.","DOI":"10.18653\/v1\/2020.emnlp-main.550"},{"key":"ref_6","unstructured":"Xiong, L., Xiong, C., Li, Y., Tang, K., Liu, J., Bennett, P.N., Ahmed, J., and Overwijk, A. (2021). Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3\u20137 May 2021, OpenReview.net."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C.D. (November, January 31). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1259"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Qu, Y., Ding, Y., Liu, J., Liu, K., Ren, R., Zhao, W.X., Dong, D., Wu, H., and Wang, H. (2021). RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6\u201311 June 2021, Association for Computational Linguistics.","DOI":"10.18653\/v1\/2021.naacl-main.466"},{"key":"ref_9","unstructured":"Hofst\u00e4tter, S., Althammer, S., Schr\u00f6der, M., Sertkan, M., and Hanbury, A. (2020). Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. 
arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Bonifacio, L., Abonizio, H., Fadaee, M., and Nogueira, R. (2022, January 11\u201315). InPars: Unsupervised Dataset Generation for Information Retrieval. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA.","DOI":"10.1145\/3477495.3531863"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Wang, L., Yang, N., and Wei, F. (2023, January 6\u201310). Query2doc: Query Expansion with Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.585"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Gao, L., Ma, X., Lin, J., and Callan, J. (2023, January 9\u201314). Precise Zero-Shot Dense Retrieval without Relevance Labels. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada.","DOI":"10.18653\/v1\/2023.acl-long.99"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Lee, H., and Lim, S. (2026). Hybrid Retrieval-Augmented Generation: Semantic and Structural Integration for Large Language Model Reasoning. Appl. Sci., 16.","DOI":"10.3390\/app16052244"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Zhu, X., Xie, Y., Liu, Y., Li, Y., and Hu, W. (May, January 29). Knowledge Graph-Guided Retrieval Augmented Generation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA.","DOI":"10.18653\/v1\/2025.naacl-long.449"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Hofst\u00e4tter, S., Lin, S.C., Yang, J.H., Lin, J., and Hanbury, A. (2021, January 11\u201315). Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. 
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA.","DOI":"10.1145\/3404835.3462891"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Sun, H., Dhingra, B., Zaheer, M., Mazaitis, K., Salakhutdinov, R., and Cohen, W. (November, January 31). Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1455"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Sun, H., Bedrax-Weiss, T., and Cohen, W. (2019, January 3\u20137). PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China.","DOI":"10.18653\/v1\/D19-1242"},{"key":"ref_18","unstructured":"Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., and Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Guti\u00e9rrez, B.J., Shu, Y., Gu, Y., Yasunaga, M., and Su, Y. (2024, January 10\u201315). HippoRAG: Neurobiologically inspired long-term memory for large language models. Proceedings of the 38th International Conference on Neural Information Processing Systems, Red Hook, NY, USA.","DOI":"10.52202\/079017-1902"},{"key":"ref_20","unstructured":"Dai, Z., Zhao, V.Y., Ma, J., Luan, Y., Ni, J., Lu, J., Bakalov, A., Guu, K., Hall, K.B., and Chang, M. (2023). Promptagator: Few-shot Dense Retrieval from 8 Examples. 
Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1\u20135 May 2023, OpenReview.net."},{"key":"ref_21","unstructured":"Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S. (2013). Semantic Parsing on Freebase from Question-Answer Pairs. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18\u201321 October 2013, Association for Computational Linguistics."},{"key":"ref_22","first-page":"1","article-title":"MS MARCO: A Human Generated MAchine Reading COmprehension Dataset","volume":"Volume 1773","author":"Nguyen","year":"2016","journal-title":"Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016 Co-Located with the 30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 9 December 2016"},{"key":"ref_23","first-page":"452","article-title":"Natural Questions: A Benchmark for Question Answering Research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_24","unstructured":"Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. (August, January 30). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1002\/asi.4630270302","article-title":"Relevance weighting of search terms","volume":"27","author":"Robertson","year":"1976","journal-title":"J. Am. Soc. Inf. Sci. Technol."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. (2024, January 11\u201316). M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. 
Proceedings of the Findings of the Association for Computational Linguistics, Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.findings-acl.137"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024, January 11\u201316). Improving Text Embeddings with Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.acl-long.642"},{"key":"ref_28","unstructured":"Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., and Ping, W. (2025). NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. Proceedings of the 13th International Conference on Learning Representations, Singapore, 24\u201328 April 2025, OpenReview.net."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/5\/413\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,14]],"date-time":"2026-05-14T04:17:46Z","timestamp":1778732266000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/5\/413"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,26]]},"references-count":28,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2026,5]]}},"alternative-id":["info17050413"],"URL":"https:\/\/doi.org\/10.3390\/info17050413","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,26]]}}}