{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T14:25:31Z","timestamp":1777127131610,"version":"3.51.4"},"reference-count":28,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,6,13]],"date-time":"2025-06-13T00:00:00Z","timestamp":1749772800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,13]],"date-time":"2025-06-13T00:00:00Z","timestamp":1749772800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100009567","name":"Budapest University of Technology and Economics","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100009567","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Discov Computing"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs, however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max\u2013Min semantic chunking, a novel method utilizing semantic similarity and a Max\u2013Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. 
Across the datasets, Max\u2013Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.<\/jats:p>","DOI":"10.1007\/s10791-025-09638-7","type":"journal-article","created":{"date-parts":[[2025,6,13]],"date-time":"2025-06-13T07:37:40Z","timestamp":1749800260000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Max\u2013Min semantic chunking of documents for RAG application"],"prefix":"10.1007","volume":"28","author":[{"given":"Csaba","family":"Kiss","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marcell","family":"Nagy","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"P\u00e9ter","family":"Szil\u00e1gyi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,6,13]]},"reference":[{"key":"9638_CR1","first-page":"9459","volume":"33","author":"P Lewis","year":"2020","unstructured":"Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, K\u00fcttler H, Lewis M, Yih W-T, Rockt\u00e4schel T, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inform Process Syst. 2020;33:9459\u201374.","journal-title":"Adv Neural Inform Process Syst"},{"issue":"1","key":"9638_CR2","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1007\/s10791-023-09420-7","volume":"26","author":"S Francis","year":"2023","unstructured":"Francis S, Moens M-F. Investigating better context representations for generative question answering. Inform Retr J. 
2023;26(1):7.","journal-title":"Inform Retr J"},{"key":"9638_CR3","unstructured":"Wadhwa H, Seetharaman R, Aggarwal S, Ghosh R, Basu S, Srinivasan S, Zhao W, Chaudhari S, Aghazadeh E. From RAGs to rich parameters: probing how language models utilize external knowledge over parametric information for factual queries. 2024."},{"key":"9638_CR4","unstructured":"Shi F, Chen X, Misra K, Scales N, Dohan D, Chi EH, Sch\u00e4rli N, Zhou D. Large language models can be easily distracted by irrelevant context. In: International Conference on Machine Learning. PMLR; 2023. pp. 31210\u201327."},{"key":"9638_CR5","unstructured":"Qian H, Liu Z, Mao K, Zhou Y, Dou Z. Grounding language model with chunking-free in-context retrieval. arxiv:2402.09760 [Preprint]. 2024."},{"key":"9638_CR6","unstructured":"LlamaIndex: Sentence splitter (Parse text with a preference for complete sentences.). 2024. https:\/\/docs.llamaindex.ai\/en\/stable\/api_reference\/node_parsers\/sentence_splitter\/. Accessed 4 Oct 2024."},{"key":"9638_CR7","unstructured":"LlamaIndex: Semantic Nodeparser (Splits a document into Nodes, with each node being a group of semantically related sentences.). 2024. https:\/\/docs.llamaindex.ai\/en\/v0.10.17\/api\/llama_index.core.node_parser.SemanticSplitterNodeParser.html. Accessed 4 Oct 2024."},{"key":"9638_CR8","unstructured":"Jimeno-Yepes A, You Y, Milczek J, Laverde S, Li R. Financial report chunking for effective retrieval augmented generation. ArXiv abs\/2402.05131. 2024."},{"key":"9638_CR9","unstructured":"Juvekar K, Purwar A. Introducing a new hyper-parameter for RAG: context window utilization. arxiv:2407.19794 [Preprint]. 2024."},{"key":"9638_CR10","doi-asserted-by":"publisher","first-page":"1316","DOI":"10.1162\/tacl_a_00605","volume":"11","author":"O Ram","year":"2023","unstructured":"Ram O, Levine Y, Dalmedigos I, Muhlgay D, Shashua A, Leyton-Brown K, Shoham Y. In-context retrieval-augmented language models. Trans Assoc Comput Linguist. 2023;11:1316\u201331. 
https:\/\/doi.org\/10.1162\/tacl_a_00605.","journal-title":"Trans Assoc Comput Linguist"},{"key":"9638_CR11","unstructured":"Guu K, Lee K, Tung Z, Pasupat P, Chang M-W. REALM: retrieval-augmented language model pre-training. arxiv:2002.08909 [Preprint]. 2020."},{"key":"9638_CR12","unstructured":"Schwaber-Cohen R. Chunking strategies for LLM applications. 2023. https:\/\/www.pinecone.io\/learn\/chunking-strategies\/. Accessed 9 Jan 2025."},{"issue":"251","key":"9638_CR13","first-page":"1","volume":"24","author":"G Izacard","year":"2023","unstructured":"Izacard G, Lewis P, Lomeli M, Hosseini L, Petroni F, Schick T, Dwivedi-Yu J, Joulin A, Riedel S, Grave E. Atlas: few-shot learning with retrieval augmented language models. J Mach Learn Res. 2023;24(251):1\u201343.","journal-title":"J Mach Learn Res"},{"key":"9638_CR14","unstructured":"Zhou X, Li G, Liu Z. LLM as DBA. 2023."},{"key":"9638_CR15","unstructured":"Kamradt G. Semantic chunking. 2024. https:\/\/www.youtube.com\/watch?v=8OJC21T2SL4&t=1933s. Accessed 9 Jan 2025."},{"key":"9638_CR16","unstructured":"LangChain: Semantic Chunker. 2024. https:\/\/api.python.langchain.com\/en\/latest\/text_splitter\/langchain_experimental.text_splitter.SemanticChunker.html. Accessed 9 Jan 2025. 2024"},{"key":"9638_CR17","doi-asserted-by":"crossref","unstructured":"Qu R, Tu R, Bao F. Is semantic chunking worth the computational cost?. 2024.","DOI":"10.32388\/M7YKDZ"},{"key":"9638_CR18","doi-asserted-by":"crossref","unstructured":"Liu Z, Simon C-E, Caspani F. Passage segmentation of documents for extractive question answering. 2025.","DOI":"10.1007\/978-3-031-88714-7_33"},{"key":"9638_CR19","unstructured":"Pesl RD, Mathew JG, Mecella M, Aiello M. Advanced system integration: analyzing OpenAPI chunking for retrieval-augmented generation. arxiv:2411.19804 [Preprint]. 2024."},{"key":"9638_CR20","doi-asserted-by":"crossref","unstructured":"Xiao S, Liu Z, Zhang P, Muennighoff N. 
C-pack: packaged resources to advance general Chinese embedding. 2023.","DOI":"10.1145\/3626772.3657878"},{"key":"9638_CR21","doi-asserted-by":"crossref","unstructured":"Khattab O, Zaharia M. Colbert: efficient and effective passage search via contextualized late interaction over bert. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020. pp. 39\u201348.","DOI":"10.1145\/3397271.3401075"},{"key":"9638_CR22","unstructured":"Nikbakht R, Benzaghta M, Geraci G. TSpec-LLM: an open-source dataset for LLM understanding of 3GPP specifications. 2024."},{"key":"9638_CR23","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"9638_CR24","doi-asserted-by":"crossref","unstructured":"Vinh N, Epps J, Bailey J. Information theoretic measures for clusterings comparison: variants. properties, normalization and correction for chance. 2009;18.","DOI":"10.1145\/1553374.1553511"},{"key":"9638_CR25","unstructured":"Rosenberg A, Hirschberg J. V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007. pp. 410\u201320."},{"key":"9638_CR26","doi-asserted-by":"publisher","first-page":"386","DOI":"10.1037\/1082-989X.9.3.386","volume":"9","author":"D Steinley","year":"2004","unstructured":"Steinley D. Properties of the hubert-arabie adjusted rand index. Psychol Methods. 2004;9:386\u201396. 
https:\/\/doi.org\/10.1037\/1082-989X.9.3.386.","journal-title":"Psychol Methods"},{"issue":"7","key":"9638_CR27","doi-asserted-by":"publisher","first-page":"2928","DOI":"10.4249\/scholarpedia.2928","volume":"4","author":"S Singer","year":"2009","unstructured":"Singer S, Nelder J. Nelder-mead algorithm. Scholarpedia. 2009;4(7):2928.","journal-title":"Scholarpedia"},{"key":"9638_CR28","unstructured":"Roychowdhury S, Soman S, Ranjani HG, Gunda N, Chhabra V, Bala SK. Evaluation of RAG metrics for question answering in the telecom domain. arxiv:2407.12873 [Preprint]. 2024."}],"container-title":["Discover Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10791-025-09638-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10791-025-09638-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10791-025-09638-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,13]],"date-time":"2025-06-13T07:37:48Z","timestamp":1749800268000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10791-025-09638-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,13]]},"references-count":28,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["9638"],"URL":"https:\/\/doi.org\/10.1007\/s10791-025-09638-7","relation":{},"ISSN":["2948-2992"],"issn-type":[{"value":"2948-2992","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,13]]},"assertion":[{"value":"19 February 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 May 
2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 June 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest as defined by Discover, or other interests that might be perceived to influence the results and\/or discussion reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"117"}}