{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T15:59:15Z","timestamp":1762531155389,"version":"build-2065373602"},"reference-count":51,"publisher":"Springer Science and Business Media LLC","issue":"11","license":[{"start":{"date-parts":[[2025,7,19]],"date-time":"2025-07-19T00:00:00Z","timestamp":1752883200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,19]],"date-time":"2025-07-19T00:00:00Z","timestamp":1752883200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001554","name":"Massey University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001554","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Knowl Inf Syst"],"published-print":{"date-parts":[[2025,11]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Multilingual Pre-trained Language models (multiPLMs), trained on the Masked Language Modelling (MLM) objective are commonly being used for cross-lingual tasks such as bitext mining. However, the performance of these models is still suboptimal for low-resource languages (LRLs). To improve the language representation of a given multiPLM, it is possible to further pre-train it. This is known as continual pre-training. Previous research has shown that continual pre-training with MLM and subsequently with Translation Language Modelling (TLM) improves the cross-lingual representation of multiPLMs. However, during masking, both MLM and TLM give equal weight to all tokens in the input sequence, irrespective of the linguistic properties of the tokens. In this paper, we introduce a novel masking strategy,\n                    <jats:italic>Linguistic Entity Masking<\/jats:italic>\n                    (LEM) to be used in the continual pre-training step to further improve the cross-lingual representations of existing multiPLMs. In contrast to MLM and TLM, LEM limits masking to the linguistic entity types nouns, verbs and named entities, which hold a higher prominence in a sentence. Secondly, we limit masking to a single token within the linguistic entity span thus keeping more context, whereas, in MLM and TLM, tokens are masked randomly. We evaluate the effectiveness of LEM using three downstream tasks, namely bitext mining, parallel data curation and code-mixed sentiment analysis using three low-resource language pairs English-Sinhala, English-Tamil, and Sinhala-Tamil. 
Experiment results show that continually pre-training a multiPLM with LEM outperforms a multiPLM continually pre-trained with MLM+TLM for all three tasks.\n                  <\/jats:p>","DOI":"10.1007\/s10115-025-02520-4","type":"journal-article","created":{"date-parts":[[2025,7,19]],"date-time":"2025-07-19T06:27:22Z","timestamp":1752906442000},"page":"9905-9946","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Linguistic entity masking to improve cross-lingual representation of multilingual language models for low-resource languages"],"prefix":"10.1007","volume":"67","author":[{"given":"Aloka","family":"Fernando","sequence":"first","affiliation":[]},{"given":"Surangika","family":"Ranathunga","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,7,19]]},"reference":[{"key":"2520_CR1","doi-asserted-by":"crossref","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, Volume 1 (Long and Short Papers). p. 4171\u20134186","DOI":"10.18653\/v1\/N19-1423"},{"key":"2520_CR2","doi-asserted-by":"crossref","unstructured":"Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzm\u00e1n F, et\u00a0al (2020) unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics 8440\u20138451","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"2520_CR3","unstructured":"\u00c1cs J, L\u00e9vai D, Kornai A (2021) Evaluating Transferability of BERT Models on Uralic Languages. In: Proceedings of the seventh international workshop on computational linguistics of Uralic Languages p. 8\u201317"},{"key":"2520_CR4","unstructured":"Dhananjaya V, Demotte P, Ranathunga S, Jayasena S (2022) BERTifying Sinhala-A comprehensive analysis of pre-trained language models for sinhala text classification. In: Proceedings of the thirteenth language resources and evaluation conference 7377\u20137385"},{"key":"2520_CR5","unstructured":"Hu J, Ruder S, Siddhant A, Neubig G, Firat O, Johnson M (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalization. In: Proceedings of the 37th international conference on machine learning p. 4411\u20134421"},{"key":"2520_CR6","doi-asserted-by":"crossref","unstructured":"Hu J, Johnson M, Firat O, Siddhant A, Neubig G (2021) Explicit Alignment Objectives for Multilingual Bidirectional Encoders. In: Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: human language technologies p. 3633\u20133643","DOI":"10.18653\/v1\/2021.naacl-main.284"},{"key":"2520_CR7","unstructured":"Conneau A, Lample G (2019) Cross-lingual language model pretraining. Adva Neural Inf proc Syst. 32"},{"key":"2520_CR8","doi-asserted-by":"crossref","unstructured":"Nastase V, Merlo P (2024) Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification. In: Proceedings of the 9th workshop on representation learning for NLP (RepL4NLP-2024) p. 
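The abstract specifies the masking rule precisely enough to sketch: only nouns, verbs and named entities are candidates for masking, and at most one token inside each entity span is replaced. Below is a minimal illustrative sketch, not the authors' implementation: it assumes spaCy's English pipeline in place of the language-specific POS taggers and NER models the paper relies on for Sinhala, Tamil and English, and the function name `lem_mask`, the 15% masking rate and the example sentence are all assumptions made for illustration.

```python
import random

import spacy

# Assumption: spaCy's small English model stands in for the paper's
# language-specific taggers/NER; install with
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def lem_mask(text: str, mask_token: str = "[MASK]",
             mask_prob: float = 0.15, seed: int = 0) -> str:
    """Sketch of Linguistic Entity Masking (LEM).

    Only nouns, verbs and named entities are maskable, and at most one
    token inside each selected entity span is replaced, so most of the
    span's context survives.
    """
    rng = random.Random(seed)
    doc = nlp(text)

    # Candidate spans: each named entity is one (possibly multi-token) span;
    # each noun/verb outside an entity is a singleton span. Treating proper
    # nouns (PROPN) as nouns is an assumption made for this sketch.
    ent_token_ids = {tok.i for ent in doc.ents for tok in ent}
    spans = [[tok.i for tok in ent] for ent in doc.ents]
    spans += [[tok.i] for tok in doc
              if tok.pos_ in {"NOUN", "PROPN", "VERB"}
              and tok.i not in ent_token_ids]

    # Select spans at the masking rate (15% is the conventional MLM rate,
    # assumed here), then mask exactly ONE token per selected span --
    # unlike MLM/TLM, which mask tokens independently at random.
    masked = {rng.choice(span) for span in spans if rng.random() < mask_prob}

    return "".join((mask_token if tok.i in masked else tok.text)
                   + tok.whitespace_ for tok in doc)


print(lem_mask("John Smith translated the ancient manuscript in Colombo.",
               seed=4))
# Possible output: "John Smith translated the ancient [MASK] in Colombo."
```

Masking one token per span, rather than the whole span, forces the model to predict the missing piece of an entity from the intact remainder of that entity plus the surrounding sentence, which is the extra context the abstract says LEM preserves over MLM/TLM.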
203\u2013214","DOI":"10.18653\/v1\/2024.repl4nlp-1.15"},{"key":"2520_CR9","doi-asserted-by":"crossref","unstructured":"Nastase V, Merlo P (2023) Grammatical information in BERT sentence embeddings as two-dimensional arrays. In: Proceedings of the 8th workshop on representation learning for NLP (RepL4NLP 2023); 22\u201339","DOI":"10.18653\/v1\/2023.repl4nlp-1.3"},{"key":"2520_CR10","doi-asserted-by":"crossref","unstructured":"Aoyama T, Schneider N (2022) Probe-less probing of BERT\u2019s layer-wise linguistic knowledge with masked word prediction. In: Proceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies: student research workshop; 195\u2013201","DOI":"10.18653\/v1\/2022.naacl-srw.25"},{"key":"2520_CR11","unstructured":"Sun Y, Wang S, Li Y, Feng S, Chen X, Zhang H, et\u00a0al (2019) Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. 2019"},{"key":"2520_CR12","doi-asserted-by":"publisher","first-page":"64","DOI":"10.1162\/tacl_a_00300","volume":"8","author":"M Joshi","year":"2020","unstructured":"Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) SpanBERT: improving pre-training by representing and predicting spans. Trans Assoc Comput Linguist 8:64\u201377","journal-title":"Trans Assoc Comput Linguist"},{"key":"2520_CR13","unstructured":"Levine Y, Lenz B, Lieber O, Abend O, Leyton-Brown K, Tennenholtz M, et\u00a0al (2020) PMI-Masking: Principled masking of correlated spans. In: International conference on learning representations"},{"key":"2520_CR14","doi-asserted-by":"crossref","unstructured":"Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ Questions for machine comprehension of text. In: Proceedings of the 2016 conference on empirical methods in natural language processing; 2383\u20132392","DOI":"10.18653\/v1\/D16-1264"},{"key":"2520_CR15","doi-asserted-by":"crossref","unstructured":"Lai G, Xie Q, Liu H, Yang Y, Hovy E (2017) RACE: Large-scale ReAding comprehension dataset from examinations. In: Proceedings of the 2017 conference on empirical methods in natural language processing 785\u2013794","DOI":"10.18653\/v1\/D17-1082"},{"key":"2520_CR16","doi-asserted-by":"crossref","unstructured":"Wang A, Singh A, Michael J, Hill F, Levy O, Bowman S (2018) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP 353\u2013355","DOI":"10.18653\/v1\/W18-5446"},{"key":"2520_CR17","doi-asserted-by":"crossref","unstructured":"Zhuang Y (2023) Heuristic masking for text representation pretraining. In: ICASSP 2023-2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE p. 1\u20135","DOI":"10.1109\/ICASSP49357.2023.10095445"},{"key":"2520_CR18","doi-asserted-by":"crossref","unstructured":"Golchin S, Surdeanu M, Tavabi N, Kiapour A (2023) Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords. In: Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023) 13\u201321","DOI":"10.18653\/v1\/2023.repl4nlp-1.2"},{"key":"2520_CR19","doi-asserted-by":"crossref","unstructured":"Wettig A, Gao T, Zhong Z, Chen D (2023) Should you mask 15% in masked language modeling? In: Proceedings of the 17th conference of the european chapter of the association for computational linguistics; 2023. p. 
2977\u20132992","DOI":"10.18653\/v1\/2023.eacl-main.217"},{"key":"2520_CR20","doi-asserted-by":"crossref","unstructured":"Feng F, Yang Y, Cer D, Arivazhagan N, Wang W (2022) Language-agnostic BERT Sentence Embedding. In: Proceedings of the 60th annual meeting of the association for computational linguistics (Volume 1: Long Papers); 2022. p. 878\u2013891","DOI":"10.18653\/v1\/2022.acl-long.62"},{"key":"2520_CR21","doi-asserted-by":"crossref","unstructured":"Ranathunga S, Ranasinghea A, Shamala J, Dandeniyaa A, Galappaththia R, Samaraweeraa M (2024) A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala. arXiv preprint arXiv:2412.02056","DOI":"10.1016\/j.nlp.2025.100160"},{"key":"2520_CR22","unstructured":"Akbik A, Blythe D, Vollgraf R (2018) Contextual String Embeddings for Sequence Labeling. In: COLING 2018, 27th International conference on computational linguistics 1638\u20131649"},{"key":"2520_CR23","doi-asserted-by":"crossref","unstructured":"Fernando S, Ranathunga S (2018) Evaluation of different classifiers for sinhala pos tagging. In, (2018) Moratuwa Eng Res Conf (MERCon). IEEE 96\u2013101","DOI":"10.1109\/MERCon.2018.8421997"},{"key":"2520_CR24","unstructured":"Fernando S, Ranathunga S, Jayasena S, Dias G (2016) Comprehensive part-of-speech tag set and svm based pos tagger for sinhala. In: Proceedings of the 6th workshop on south and southeast asian natural language processing (WSSANLP2016) 173\u2013182"},{"key":"2520_CR25","unstructured":"Sarveswaran K, Dias G (2020) ThamizhiUDp: A dependency parser for tamil. In: Proceedings of the 17th international conference on natural language processing (ICON) 200\u2013207"},{"key":"2520_CR26","doi-asserted-by":"crossref","unstructured":"Artetxe M, Schwenk H (2019) Margin-based parallel corpus mining with multilingual sentence embeddings. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for computational linguistics 3197\u20133203","DOI":"10.18653\/v1\/P19-1309"},{"issue":"2","key":"2520_CR27","doi-asserted-by":"publisher","first-page":"571","DOI":"10.1007\/s10115-022-01761-x","volume":"65","author":"A Fernando","year":"2023","unstructured":"Fernando A, Ranathunga S, Sachintha D, Piyarathna L, Rajitha C (2023) Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages. Knowled Inf Syst 65(2):571\u2013612","journal-title":"Knowled Inf Syst"},{"key":"2520_CR28","doi-asserted-by":"publisher","first-page":"597","DOI":"10.1162\/tacl_a_00288","volume":"7","author":"M Artetxe","year":"2019","unstructured":"Artetxe M, Schwenk H (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans Assoc Comput Linguist 7:597\u2013610","journal-title":"Trans Assoc Comput Linguist"},{"key":"2520_CR29","doi-asserted-by":"crossref","unstructured":"Yang Y, Cer D, Ahmad A, Guo M, Law J, Constant N, et\u00a0al (2020) Multilingual universal sentence encoder for semantic retrieval. In: Proceedings of the 58th annual meeting of the association for computational linguistics: system demonstrations 87\u201394","DOI":"10.18653\/v1\/2020.acl-demos.12"},{"key":"2520_CR30","doi-asserted-by":"crossref","unstructured":"Schwenk H, Wenzek G, Edunov S, Grave \u00c9, Joulin A, Fan A (2021) CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web. 
In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers) 6490\u20136500","DOI":"10.18653\/v1\/2021.acl-long.507"},{"key":"2520_CR31","unstructured":"Costa-juss\u00e0 MR, Cross J, \u00c7elebi O, Elbayad M, Heafield K, Heffernan K, et\u00a0al (2022) No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672"},{"key":"2520_CR32","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1162\/tacl_a_00447","volume":"10","author":"J Kreutzer","year":"2022","unstructured":"Kreutzer J, Caswell I, Wang L, Wahab A, van Esch D, Ulzii-Orshikh N et al (2022) Quality at a glance: an audit of web-crawled multilingual datasets. Trans Assoc Comput Linguist 10:50\u201372. https:\/\/doi.org\/10.1162\/tacl_a_00447","journal-title":"Trans Assoc Comput Linguist"},{"key":"2520_CR33","doi-asserted-by":"crossref","unstructured":"Ranathunga S, De\u00a0Silva N, Menan V, Fernando A, Rathnayake C (2024) Quality does matter: a detailed look at the quality and utility of web-mined parallel corpora. In: Graham Y, Purver M, editors. Proceedings of the 18th conference of the european chapter of the association for computational linguistics (Volume 1: Long Papers). St. Julian\u2019s, Malta: association for computational linguistics; 2024. p. 860\u2013880. Available from: https:\/\/aclanthology.org\/2024.eacl-long.52","DOI":"10.18653\/v1\/2024.eacl-long.52"},{"key":"2520_CR34","doi-asserted-by":"crossref","unstructured":"Post M (2018) A Call for Clarity in Reporting BLEU Scores. In: Proceedings of the third conference on machine translation: research papers. belgium, brussels: association for computational linguistics 186\u2013191. Available from: https:\/\/www.aclweb.org\/anthology\/W18-6319","DOI":"10.18653\/v1\/W18-6319"},{"key":"2520_CR35","doi-asserted-by":"crossref","unstructured":"Popovi\u0107 M (2015) chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the tenth workshop on statistical machine translation 392\u2013395","DOI":"10.18653\/v1\/W15-3049"},{"key":"2520_CR36","doi-asserted-by":"crossref","unstructured":"Popovi\u0107 M (2017) chrF++: words helping character n-grams. In: Proceedings of the second conference on machine translation 612\u2013618","DOI":"10.18653\/v1\/W17-4770"},{"key":"2520_CR37","doi-asserted-by":"publisher","first-page":"522","DOI":"10.1162\/tacl_a_00474","volume":"10","author":"N Goyal","year":"2022","unstructured":"Goyal N, Gao C, Chaudhary V, Chen PJ, Wenzek G, Ju D et al (2022) The flores-101 evaluation benchmark for low-resource and multilingual machine translation. Trans Assoc Comput Linguist 10:522\u2013538","journal-title":"Trans Assoc Comput Linguist"},{"key":"2520_CR38","unstructured":"Barbieri F, Anke LE, Camacho-Collados J (2022) XLM-T: Multilingual language models in twitter for sentiment analysis and beyond. In: Proceedings of the thirteenth language resources and evaluation conference 258\u2013266"},{"key":"2520_CR39","unstructured":"Myers-Scotton C, Jake J Duelling languages. Grammatical structure in Codeswitching\u2013Clarendon Press. Oxford"},{"key":"2520_CR40","doi-asserted-by":"crossref","unstructured":"Joshi P, Santy S, Budhiraja A, Bali K, Choudhury M (2020) The state and fate of linguistic diversity and inclusion in the NLP world. 
In: Proceedings of the 58th annual meeting of the association for computational linguistics 6282\u20136293","DOI":"10.18653\/v1\/2020.acl-main.560"},{"key":"2520_CR41","doi-asserted-by":"crossref","unstructured":"Ranathunga S, de\u00a0Silva N (2022) Some languages are more equal than others: probing deeper into the linguistic disparity in the NLP world. In: Proceedings of the 2nd conference of the asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing 823\u2013848","DOI":"10.18653\/v1\/2022.aacl-main.62"},{"key":"2520_CR42","unstructured":"de\u00a0Silva N (2023) Survey on publicly available sinhala natural language processing tools and research. arXiv preprint arXiv:1906.02358v20"},{"key":"2520_CR43","unstructured":"Kudugunta S, Caswell I, Zhang B, Garcia X, Xin D, Kusupati A, et\u00a0al (2024) Madlad-400: A multilingual and document-level large audited dataset. Adv Neural Inf Proc Syst 36"},{"key":"2520_CR44","unstructured":"Fernando A, Ranathunga S, Dias G (2020) Data augmentation and terminology integration for domain-specific sinhala-english-tamil statistical machine translation. arXiv preprint arXiv:2011.02821"},{"key":"2520_CR45","doi-asserted-by":"crossref","unstructured":"El-Kishky A, Chaudhary V, Guzm\u00e1n F, Koehn P (2020) CCAligned: a massive collection of cross-lingual web-document pairs. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) 5960\u20135969","DOI":"10.18653\/v1\/2020.emnlp-main.480"},{"key":"2520_CR46","doi-asserted-by":"crossref","unstructured":"Ba\u00f1\u00f3n M, Chen P, Haddow B, Heafield K, Hoang H, Espl\u00e0-Gomis M, et\u00a0al (2020) ParaCrawl: web-scale acquisition of parallel corpora. In: Proceedings of the 58th annual meeting of the association for computational linguistics 4555\u20134567","DOI":"10.18653\/v1\/2020.acl-main.417"},{"issue":"7","key":"2520_CR47","doi-asserted-by":"publisher","first-page":"1937","DOI":"10.1007\/s10115-022-01698-1","volume":"64","author":"H Rathnayake","year":"2022","unstructured":"Rathnayake H, Sumanapala J, Rukshani R, Ranathunga S (2022) Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification. Knowled Inf Syst 64(7):1937\u20131966","journal-title":"Knowled Inf Syst"},{"key":"2520_CR48","unstructured":"Koloski B, \u0160krlj B, Robnik-\u0160ikonja M, Pollak S (2023) Measuring catastrophic forgetting in cross-lingual transfer paradigms: exploring tuning strategies. arXiv preprint arXiv:2309.06089"},{"issue":"5","key":"2520_CR49","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1007\/s11280-024-01302-2","volume":"27","author":"P Udawatta","year":"2024","unstructured":"Udawatta P, Udayangana I, Gamage C, Shekhar R, Ranathunga S (2024) Use of prompt-based learning for code-mixed and code-switched text classification. World Wide Web 27(5):63","journal-title":"World Wide Web"},{"key":"2520_CR50","doi-asserted-by":"crossref","unstructured":"Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, et\u00a0al (2019) fairseq: A fast, extensible toolkit for sequence modeling. In: Proceedings of NAACL-HLT 2019: demonstrations 48\u201353","DOI":"10.18653\/v1\/N19-4009"},{"key":"2520_CR51","doi-asserted-by":"crossref","unstructured":"Kocmi T, Zouhar V, Federmann C, Post M (2024) Navigating the metrics maze: reconciling score magnitudes and accuracies. In: Ku LW, Martins A, Srikumar V, editors. 
Proceedings of the 62nd annual meeting of the association for computational linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics 1999\u20132014. Available from: https:\/\/aclanthology.org\/2024.acl-long.110","DOI":"10.18653\/v1\/2024.acl-long.110"}],"container-title":["Knowledge and Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10115-025-02520-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10115-025-02520-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10115-025-02520-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T15:51:50Z","timestamp":1762530710000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10115-025-02520-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,19]]},"references-count":51,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,11]]}},"alternative-id":["2520"],"URL":"https:\/\/doi.org\/10.1007\/s10115-025-02520-4","relation":{},"ISSN":["0219-1377","0219-3116"],"issn-type":[{"type":"print","value":"0219-1377"},{"type":"electronic","value":"0219-3116"}],"subject":[],"published":{"date-parts":[[2025,7,19]]},"assertion":[{"value":"12 May 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 April 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 June 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 July 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}