{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T23:20:21Z","timestamp":1768951221893,"version":"3.49.0"},"reference-count":55,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2025,6,16]],"date-time":"2025-06-16T00:00:00Z","timestamp":1750032000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computation"],"abstract":"<jats:p>Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines unsupervised semantic text chunking, based on the similarities between sentence embeddings, with pre-trained language models (PLMs), specifically BERT fine-tuned on the semantic textual similarity (STS) task, to provide flexible and effective semantic text chunking. We evaluated the proposed method in English and Arabic. To the best of our knowledge, no Arabic dataset has been created to assess semantic text chunking at this level. Therefore, we created AraWiki50k, a dataset inspired by an existing English dataset, to evaluate our proposed text chunking method.
Our experiments showed that exploiting the fine-tuned pre-trained BERT on STS enhances results over unsupervised semantic chunking by an average of 7.4 in the PK metric and by an average of 11.19 in the WindowDiff metric on four English evaluation datasets, and 0.12 in the PK and 2.29 in the WindowDiff for the Arabic dataset.<\/jats:p>","DOI":"10.3390\/computation13060151","type":"journal-article","created":{"date-parts":[[2025,6,16]],"date-time":"2025-06-16T05:28:33Z","timestamp":1750051713000},"page":"151","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-3855-5209","authenticated-orcid":false,"given":"Mai","family":"Alammar","sequence":"first","affiliation":[{"name":"Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia"},{"name":"Department of Computer Science, College of Computer and Information Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11564, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2457-9961","authenticated-orcid":false,"given":"Khalil","family":"El Hindi","sequence":"additional","affiliation":[{"name":"Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7328-4935","authenticated-orcid":false,"given":"Hend","family":"Al-Khalifa","sequence":"additional","affiliation":[{"name":"Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia"}]}],"member":"1968","published-online":{"date-parts":[[2025,6,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Aumiller, D., Almasian, S., Lackner, S., and Gertz, M. 
(2021, January 21\u201325). Structural text segmentation of legal documents. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, S\u00e3o Paulo, Brazil.","DOI":"10.1145\/3462757.3466085"},{"key":"ref_2","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst., 30."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Maraj, A., Martin, M.V., and Makrehchi, M. (2021, January 5\u201310). A more effective sentence-wise text segmentation approach using bert. Proceedings of the Document Analysis and Recognition\u2013ICDAR 2021: 16th International Conference, Lausanne, Switzerland. Proceedings, Part IV 16.","DOI":"10.1007\/978-3-030-86337-1_16"},{"key":"ref_4","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January July). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Florence, Italy."},{"key":"ref_5","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1162\/tacl_a_00261","article-title":"SECTOR: A neural model for coherent topic segmentation and classification","volume":"7","author":"Arnold","year":"2019","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"78","DOI":"10.31577\/cai_2022_1_78","article-title":"Semantic segmentation of text using deep learning","volume":"2022","author":"Lattisi","year":"2022","journal-title":"Comput. 
Inform."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Alshanqiti, A.M., Albouq, S., Alkhodre, A.B., Namoun, A., and Nabil, E. (2022). Employing a multilingual transformer model for segmenting unpunctuated arabic text. Appl. Sci., 12.","DOI":"10.20944\/preprints202208.0451.v1"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Koshorek, O., Cohen, A., Mor, N., Rotman, M., and Berant, J. (2018). Text segmentation as a supervised learning task. arXiv.","DOI":"10.18653\/v1\/N18-2075"},{"key":"ref_10","unstructured":"Somasundaran, S. (2020, January 7\u201312). Two-level transformer and auxiliary coherence modeling for improved text segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Maraj, A., Martin, M.V., and Makrehchi, M. (2024, January 15\u201316). Words That Stick: Using Keyword Cohesion to Improve Text Segmentation. Proceedings of the 28th Conference on Computational Natural Language Learning, Miami, FL, USA.","DOI":"10.18653\/v1\/2024.conll-1.1"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Badjatiya, P., Kurisinkel, L.J., Gupta, M., and Varma, V. (2018). Attention-based neural text segmentation. European Conference on Information Retrieval, Springer.","DOI":"10.1007\/978-3-319-76941-7_14"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Xing, L., Hackinen, B., Carenini, G., and Trebbi, F. (2020). Improving context modeling in neural topic segmentation. arXiv.","DOI":"10.18653\/v1\/2020.aacl-main.63"},{"key":"ref_14","unstructured":"Ahmad, S.R. (2024). Enhancing multilingual information retrieval in mixed human resources environments: A RAG model implementation for multicultural enterprise. arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Barrow, J., Jain, R., Morariu, V., Manjunatha, V., Oard, D.W., and Resnik, P. (2020, January 5\u201310). 
A joint model for document segmentation and segment labeling. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.29"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Lo, K., Jin, Y., Tan, W., Liu, M., Du, L., and Buntine, W. (2021). Transformer over pre-trained transformer for neural text segmentation with enhanced topic coherence. arXiv.","DOI":"10.18653\/v1\/2021.findings-emnlp.283"},{"key":"ref_17","first-page":"33","article-title":"Text tiling: Segmenting text into multi-paragraph subtopic passages","volume":"23","author":"Hearst","year":"1997","journal-title":"Comput. Linguist."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Brants, T., Chen, F., and Tsochantaridis, I. (2002, January 4\u20139). Topic-based document segmentation with probabilistic latent semantic analysis. Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, VA, USA.","DOI":"10.1145\/584826.584829"},{"key":"ref_19","unstructured":"Choi, F.Y. (2000). Advances in domain independent linear text segmentation. arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Glava\u0161, G., Nanni, F., and Ponzetto, S.P. (2016, January 11\u201312). Unsupervised text segmentation using semantic relatedness graphs. Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, Berlin, Germany.","DOI":"10.18653\/v1\/S16-2016"},{"key":"ref_21","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"46673","DOI":"10.1007\/s11042-023-15509-4","article-title":"Siamese bert architecture model with attention mechanism for textual semantic similarity","volume":"82","author":"Li","year":"2023","journal-title":"Multimed. 
Tools Appl."},{"key":"ref_23","first-page":"647","article-title":"Semantic textual similarity methods, tools, and applications: A survey","volume":"20","author":"Majumder","year":"2016","journal-title":"Comput. Sist."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"73","DOI":"10.52783\/jes.1099","article-title":"Semantic similarity caculating based on bert","volume":"20","author":"Yang","year":"2024","journal-title":"J. Electr. Syst."},{"key":"ref_26","unstructured":"Riedl, M., and Biemann, C. (2012, January 8\u201314). TopicTiling: A text segmentation algorithm based on LDA. Proceedings of the ACL 2012 Student Research Workshop, Jeju, Republic of Korea."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Malioutov, I.I.M. (2006). Minimum Cut Model for Spoken Lecture Segmentation. [Ph.D. Thesis, Massachusetts Institute of Technology].","DOI":"10.3115\/1220175.1220179"},{"key":"ref_28","unstructured":"Du, L., Buntine, W., and Johnson, M. (2013, January 9\u201314). Topic segmentation with a structured topic model. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia."},{"key":"ref_29","unstructured":"Malmasi, S., Dras, M., Johnson, M., Du, L., and Wolska, M. (August, January 30). Unsupervised text segmentation based on native language characteristics. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Sehikh, I., Fohr, D., and Illina, I. (2017, January 16\u201320). Topic segmentation in ASR transcripts using bidirectional RNNs for change detection. 
Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan.","DOI":"10.1109\/ASRU.2017.8268979"},{"key":"ref_31","unstructured":"Solbiati, A., Heffernan, K., Damaskinos, G., Poddar, S., Modi, S., and Cali, J. (2021). Unsupervised topic segmentation of meetings with bert embeddings. arXiv."},{"key":"ref_32","unstructured":"Glava\u0161, G., Ganesh, A., and Somasundaran, S. (2021, January 19\u201320). Training and domain adaptation for supervised text segmentation. Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, Online."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ma, J., Han, S., Ye, S., and Wang, G. (2024, January 19\u201321). Enhanced Document Segmentation at Paragraph Level. Proceedings of the 2024 6th International Conference on Electronics and Communication, Network and Computer Technology (ECNCT), Guangzhou, China.","DOI":"10.1109\/ECNCT63103.2024.10704470"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"100061","DOI":"10.1016\/j.nlp.2024.100061","article-title":"Improving paragraph segmentation using BERT with additional information from probability density function modeling of segmentation distances","volume":"6","author":"Yoo","year":"2024","journal-title":"Nat. Lang. Process. J."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. (2015, January 7\u201313). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.11"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. (2017). Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. 
arXiv.","DOI":"10.18653\/v1\/S17-2001"},{"key":"ref_37","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J. Mach. Learn. Res."},{"key":"ref_38","unstructured":"FastText (2025, May 12). Fast Text. Available online: https:\/\/fasttext.cc\/."},{"key":"ref_39","unstructured":"Antoun, W., Baly, F., and Hajj, H. (2020). Arabert: Transformer-based model for arabic language understanding. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Mueller, J., and Thyagarajan, A. (2016, January 12\u201317). Siamese recurrent architectures for learning sentence similarity. Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.","DOI":"10.1609\/aaai.v30i1.10350"},{"key":"ref_41","unstructured":"Loshchilov, I., and Hutter, F. (2017). Decoupled weight decay regularization. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Vasileiou, A., and Eberle, O. (2024). Explaining text similarity in transformer models. arXiv.","DOI":"10.18653\/v1\/2024.naacl-long.435"},{"key":"ref_43","unstructured":"Kamradt, G. (2025, June 12). 5 Levels of Text Splitting. Available online: https:\/\/github.com\/FullStackRetrieval-com\/RetrievalTutorials."},{"key":"ref_44","unstructured":"(2025, May 10). OpenAI Platform. Available online: https:\/\/platform.openai.com\/docs\/guides\/embeddings."},{"key":"ref_45","unstructured":"(2025, June 10). Sentence Transformers. Available online: https:\/\/huggingface.co\/sentence-transformers."},{"key":"ref_46","unstructured":"(2025, May 12). Universal AnglE Embeddings. Available online: https:\/\/huggingface.co\/collections\/WhereIsAI\/universal-angle-embeddings-663b0618ade1a39663e48190."},{"key":"ref_47","unstructured":"(2025, June 13). Voyage AI. Available online: https:\/\/www.voyageai.com\/."},{"key":"ref_48","unstructured":"(2025, June 13). Cohere Embeddings. 
Available online: https:\/\/docs.cohere.com\/v2\/docs\/embeddings."},{"key":"ref_49","unstructured":"(2025, June 13). GATE-AraBert-v1. Available online: https:\/\/huggingface.co\/Omartificial-Intelligence-Space\/GATE-AraBert-v1."},{"key":"ref_50","unstructured":"Kazantseva, A., and Szpakowicz, S. (2011, January 27\u201331). Linear text segmentation using affinity propagation. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK."},{"key":"ref_51","unstructured":"(2025, June 13). The Standard Corpus of Present-Day Edited American English (the Brown Corpus). Available online: https:\/\/varieng.helsinki.fi\/CoRD\/corpora\/BROWN\/."},{"key":"ref_52","unstructured":"(2025, June 13). Project Gutenberg. Available online: https:\/\/gutenberg.org\/\/."},{"key":"ref_53","unstructured":"(2025, May 20). Arabic Wikipedia Dump 2021. Available online: https:\/\/www.kaggle.com\/datasets\/z3rocool\/arabic-wikipedia-dump-2021."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1023\/A:1007506220214","article-title":"Statistical models for text segmentation","volume":"34","author":"Beeferman","year":"1999","journal-title":"Mach. Learn."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1162\/089120102317341756","article-title":"A critique and improvement of an evaluation metric for text segmentation","volume":"28","author":"Pevzner","year":"2002","journal-title":"Comput. 
Linguist."}],"container-title":["Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-3197\/13\/6\/151\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:52:38Z","timestamp":1760032358000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-3197\/13\/6\/151"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,16]]},"references-count":55,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,6]]}},"alternative-id":["computation13060151"],"URL":"https:\/\/doi.org\/10.3390\/computation13060151","relation":{},"ISSN":["2079-3197"],"issn-type":[{"value":"2079-3197","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,16]]}}}