{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:36:34Z","timestamp":1760060194704,"version":"build-2065373602"},"reference-count":44,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2025,8,11]],"date-time":"2025-08-11T00:00:00Z","timestamp":1754870400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>The increasing use of Arabic and Urdu on social media platforms, particularly Twitter, has created a growing need for robust Named Entity Recognition (NER) systems capable of handling noisy, informal, and code-mixed content. However, both languages remain significantly underrepresented in NER research, especially in social media contexts. To address this gap, this study makes four key contributions: (1) We introduced a manual entity consolidation step to enhance the consistency and accuracy of named entity annotations. In the original datasets, entities such as person names and organization names were often split into multiple tokens (e.g., first name and last name labeled separately). We manually refined the annotations to merge these segments into unified entities, ensuring improved coherence for both training and evaluation. (2) We selected two publicly available datasets from GitHub\u2014one in Arabic and one in Urdu\u2014and applied two novel strategies to tackle low-resource challenges: a joint multilingual approach and a translation-based approach. The joint approach involved merging both datasets to create a unified multilingual corpus, while the translation-based approach utilized automatic translation to generate cross-lingual datasets, enhancing linguistic diversity and model generalizability. 
(3) We presented a comprehensive and reproducible pseudocode-driven framework that integrates translation, manual refinement, dataset merging, preprocessing, and multilingual model fine-tuning. (4) We designed, implemented, and evaluated a customized XLM-RoBERTa model integrated with a novel attention mechanism, specifically optimized for the morphological and syntactic complexities of Arabic and Urdu. Based on the experiments, our proposed model (XLM-RoBERTa) achieves 0.98 accuracy across Arabic, Urdu, and multilingual datasets. While it shows a 7\u20138% improvement over traditional baselines (RF), it also achieves a 2.08% improvement over a deep learning baseline (BiLSTM = 0.96), highlighting the effectiveness of our cross-lingual, resource-efficient approach for NER in low-resource, code-mixed social media text.<\/jats:p>","DOI":"10.3390\/computers14080323","type":"journal-article","created":{"date-parts":[[2025,8,11]],"date-time":"2025-08-11T14:32:36Z","timestamp":1754922756000},"page":"323","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Multilingual Named Entity Recognition in Arabic and Urdu Tweets Using Pretrained Transfer Learning Models"],"prefix":"10.3390","volume":"14","author":[{"given":"Fida","family":"Ullah","sequence":"first","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n (CIC), Instituto Polit\u00e9cnico Nacional (IPN), Mexico City 07700, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8799-8212","authenticated-orcid":false,"given":"Muhammad","family":"Ahmad","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n (CIC), Instituto Polit\u00e9cnico Nacional (IPN), Mexico City 07700, 
Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3901-3522","authenticated-orcid":false,"given":"Grigori","family":"Sidorov","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n (CIC), Instituto Polit\u00e9cnico Nacional (IPN), Mexico City 07700, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0241-7902","authenticated-orcid":false,"given":"Ildar","family":"Batyrshin","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n (CIC), Instituto Polit\u00e9cnico Nacional (IPN), Mexico City 07700, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9828-3568","authenticated-orcid":false,"given":"Edgardo Manuel Felipe","family":"River\u00f3n","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n (CIC), Instituto Polit\u00e9cnico Nacional (IPN), Mexico City 07700, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7845-9039","authenticated-orcid":false,"given":"Alexander","family":"Gelbukh","sequence":"additional","affiliation":[{"name":"Centro de Investigaci\u00f3n en Computaci\u00f3n (CIC), Instituto Polit\u00e9cnico Nacional (IPN), Mexico City 07700, Mexico"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,8,11]]},"reference":[{"key":"ref_1","unstructured":"Pinto, A., Gon\u00e7alo Oliveira, H., and Oliveira Alves, A. (2016, January 20\u201321). Comparing the performance of different NLP toolkits in formal and social media text. 
Proceedings of the 5th Symposium on Languages, Applications and Technologies (SLATE\u201916), Schloss Dagstuhl\u2013Leibniz-Zentrum fuer Informatik, Wadern, Germany."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Camacho-Collados, J., Rezaee, K., Riahi, T., Ushio, A., Loureiro, D., Antypas, D., Boisson, J., Anke, L.E., Liu, F., and C\u00e1mara, E.M. (2022). TweetNLP: Cutting-edge natural language processing for social media. arXiv.","DOI":"10.18653\/v1\/2022.emnlp-demos.5"},{"key":"ref_3","first-page":"503","article-title":"Using large language models for sentiment analysis of health-related social media data: Empirical evaluation and practical tips","volume":"Volume 2024","author":"He","year":"2024","journal-title":"AMIA Annual Symposium Proceedings"},{"key":"ref_4","first-page":"949","article-title":"Leveraging transfer learning for detecting misinformation on social media","volume":"16","author":"Reshi","year":"2024","journal-title":"Int. J. Inf. Technol."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Kaur, N., Saha, A., Swami, M., Singh, M., and Dalal, R. (July, January 29). Bert-Ner: A Transformer-Based Approach For Named Entity Recognition. Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kanpur, India.","DOI":"10.1109\/ICCCNT61001.2024.10724703"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Deshmukh, P., Kulkarni, N., Kulkarni, S., Manghani, K., Khadkikar, P.A., and Joshi, R. (2024, January 13\u201314). Named entity recognition for Indic languages: A comprehensive survey. 
Proceedings of the 2024 1st International Conference on Trends in Engineering Systems and Technologies (ICTEST), Mumbai, India.","DOI":"10.1109\/ICTEST60614.2024.10576183"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1007\/s11063-024-11576-2","article-title":"CLSTM-SNP: Convolutional neural network to enhance spiking neural P systems for named entity recognition based on long short-term memory network","volume":"56","author":"Deng","year":"2024","journal-title":"Neural Process. Lett."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"113471","DOI":"10.1016\/j.knosys.2025.113471","article-title":"Enhancing named entity recognition with external knowledge from large language model","volume":"318","author":"Li","year":"2025","journal-title":"Knowl. Based Syst."},{"key":"ref_9","unstructured":"Peddavenkatagari, C. (2024). EMPOWERING INFORMATION RETRIEVAL: A FRAMEWORK FOR EFFECTIVE DATA SUMMARIZATION USING NLP AND SBERT. Int. Res. J. Mod. Eng. Technol. Sci., 6."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1145\/3714457","article-title":"Natural language processing methods for symbolic music generation and information retrieval: A survey","volume":"57","author":"Le","year":"2025","journal-title":"ACM Comput. Surv."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"166","DOI":"10.1007\/s10462-025-11162-5","article-title":"BERT applications in natural language processing: A review","volume":"58","author":"Gardazi","year":"2025","journal-title":"Artif. Intell. Rev."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1145\/3711680","article-title":"Natural language understanding and inference with mllm in visual question answering: A survey","volume":"57","author":"Kuang","year":"2025","journal-title":"ACM Comput. 
Surv."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"100130","DOI":"10.1016\/j.nlp.2025.100130","article-title":"Tweet question classification for enhancing Tweet Question Answering System","volume":"10","author":"Mallikarjuna","year":"2025","journal-title":"Nat. Lang. Process. J."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"104","DOI":"10.29207\/resti.v9i1.6163","article-title":"Question Answering through Transfer Learning on Closed-Domain Educational Websites","volume":"9","author":"Laugiwa","year":"2025","journal-title":"J. RESTI (Rekayasa Sist. Dan Teknol. Inf.)"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Ahmad, M., Ameer, I., Sharif, W., Usman, S., Muzamil, M., Hamza, A., Jalal, M., Batyrshin, I., and Sidorov, G. (2025). Multilingual hope speech detection from tweets using transfer learning models. Sci. Rep., 15.","DOI":"10.1038\/s41598-025-88687-w"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"31","DOI":"10.17323\/jle.2024.22443","article-title":"Hope Speech Detection Using Social Media Discourse (Posi-Vox-2024): A Transfer Learning Approach","volume":"10","author":"Ahmad","year":"2024","journal-title":"J. Lang. Educ."},{"key":"ref_17","unstructured":"Ullah, F., Zamir, M.T., Ahmad, M., Sidorov, G., and Gelbukh, A. (2024, January 18\u201320). Hope: A multilingual approach to identifying positive communication in social media. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), Salamanca, Spain."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"21249","DOI":"10.1007\/s11042-024-19793-6","article-title":"Opinion mining for stock trend prediction using deep learning","volume":"84","author":"Albahli","year":"2025","journal-title":"Multimed. Tools Appl."},{"key":"ref_19","first-page":"420","article-title":"Slang Word Identification on Twitter","volume":"7","author":"Shroff","year":"2016","journal-title":"Int. J. Comput. Technol. 
Appl."},{"key":"ref_20","first-page":"89","article-title":"Data preprocessing in sentiment analysis using twitter data","volume":"3","author":"Prakash","year":"2019","journal-title":"Int. Educ. Appl. Res. J."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Saba, T., Almazyad, A.S., and Rehman, A. (2015, January 1\u20133). Language independent rule based classification of printed & handwritten text. Proceedings of the 2015 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS), Douai, France.","DOI":"10.1109\/EAIS.2015.7368806"},{"key":"ref_22","unstructured":"Shoukat, E., Irfan, R., Basharat, I., Tahir, M.A., and Shaukat, S. (2025). Attention based Bidirectional GRU hybrid model for inappropriate content detection in Urdu language. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"89","DOI":"10.1007\/s10462-024-11071-z","article-title":"Textual variations in social media text processing applications: Challenges, solutions, and trends","volume":"58","author":"Khan","year":"2025","journal-title":"Artif. Intell. Rev."},{"key":"ref_24","unstructured":"Wikipedia (2025, June 13). Arabic. Available online: https:\/\/en.wikipedia.org\/wiki\/Arabic."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1145\/3705002","article-title":"A comprehensive analysis dashboard for detecting similar saudi twitter accounts by using stylometric features","volume":"24","author":"Bagies","year":"2025","journal-title":"ACM Trans. Asian Low Resour. Lang. Inf. Process."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Salah, R.E., and Zakaria, L.Q.B. (2018, January 26\u201328). Building the classical Arabic named entity recognition corpus (CANERCorpus). 
Proceedings of the 2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP), Kota Kinabalu, Malaysia.","DOI":"10.1109\/INFRKM.2018.8464820"},{"key":"ref_27","first-page":"1323","article-title":"A Feature-Rich Vietnamese Named Entity Recognition Model","volume":"26","year":"2022","journal-title":"Comput. Sist."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"21918","DOI":"10.48084\/etasr.10205","article-title":"Exploring the Impact of Annotation Schemes on Arabic Named Entity Recognition across General and Specific Domains","volume":"15","author":"Loqman","year":"2025","journal-title":"Eng. Technol. Appl. Sci. Res."},{"key":"ref_29","first-page":"681","article-title":"Comparing Pre-Trained Language Model for Arabic Hate Speech Detection","volume":"28","author":"Daouadi","year":"2024","journal-title":"Comput. Sist."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"100904","DOI":"10.1109\/ACCESS.2025.3576784","article-title":"A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition","volume":"13","author":"Ahmad","year":"2025","journal-title":"IEEE Access"},{"key":"ref_31","unstructured":"Ullah, F., Ahmad, M., Zamir, M.T., Arif, M., River\u00f3n, E.M.F., and Gelbukh, A. (2025). EDU-NER-2025: Named Entity Recognition in Urdu Educational Texts using XLM-RoBERTa with X (formerly Twitter). arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Azhar, N., Latif, S., and Arshad, S. (2024, January 23\u201325). Fine-tuning Urdu NER Models Using Context-Aware Embeddings. Proceedings of the 2024 14th International Conference on Software Technology and Engineering (ICSTE), Shanghai, China.","DOI":"10.1109\/ICSTE63875.2024.00030"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ullah, F., Gelbukh, A., Zamir, M.T., River\u00f3n, E.M.F., and Sidorov, G. (2024). 
Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu. Computers, 13.","DOI":"10.3390\/computers13100258"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Ali, M.N., Tan, G., and Hussain, A. (2018). Bidirectional recurrent neural network approach for Arabic named entity recognition. Future Internet, 10.","DOI":"10.3390\/fi10120123"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Albahli, S. (2025). An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities. Appl. Sci., 15.","DOI":"10.3390\/app15063073"},{"key":"ref_36","unstructured":"Al-Duwais, M., Al-Khalifa, H., and Al-Salman, A. (2024). CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Hamdan, N., Hamoud, H., Abou Chakra, C., Al Mraikhat, O.R., Albared, D., and Zaraket, F.A. (2024, January 15\u201316). DRU at WojoodNER 2024: ICL LLM for Arabic NER. Proceedings of the Second Arabic Natural Language Processing Conference, London, UK.","DOI":"10.18653\/v1\/2024.arabicnlp-1.106"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Sadallah, A.B., Ahmed, O., Mohamed, S., Hatem, O., Hesham, D., and Yousef, A.H. (2023, January 13\u201315). ANER: Arabic and Arabizi named entity recognition using transformer-based approach. Proceedings of the 2023 Intelligent Methods, Systems, and Applications (IMSA), Tunis, Tunisia.","DOI":"10.1109\/IMSA58542.2023.10217635"},{"key":"ref_39","first-page":"8","article-title":"Urdu named entity recognition: Corpus generation and deep learning applications","volume":"19","author":"Kanwal","year":"2019","journal-title":"ACM Trans. Asian Low Resour. Lang. Inf. 
Process."},{"key":"ref_40","first-page":"2","article-title":"Urdu named entity recognition and classification system using artificial neural network","volume":"17","author":"Malik","year":"2017","journal-title":"ACM Trans. Asian Low Resour. Lang. Inf. Process."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1856","DOI":"10.1093\/comjnl\/bxac047","article-title":"Urdu named entity recognition system using deep learning approaches","volume":"66","author":"Haq","year":"2023","journal-title":"Comput. J."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"90","DOI":"10.4218\/etrij.2018-0553","article-title":"Deep recurrent neural networks with word embeddings for Urdu named entity recognition","volume":"42","author":"Khan","year":"2020","journal-title":"ETRI J."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Anam, R., Anwar, M.W., Jamal, M.H., Bajwa, U.I., Diez, I.d.l.T., Alvarado, E.S., Flores, E.S., Ashraf, I., and Khan, H.U. (2024). A deep learning approach for Named Entity Recognition in Urdu language. PLoS ONE, 19.","DOI":"10.1371\/journal.pone.0300725"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Ahmad, M., Sidorov, G., Amjad, M., Ameer, I., and Batyrshin, I. (2025). Opioid Crisis Detection in Social Media Discourse Using Deep Learning Approach. 
Information, 16.","DOI":"10.3390\/info16070545"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/8\/323\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:24:53Z","timestamp":1760034293000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/8\/323"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,11]]},"references-count":44,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2025,8]]}},"alternative-id":["computers14080323"],"URL":"https:\/\/doi.org\/10.3390\/computers14080323","relation":{},"ISSN":["2073-431X"],"issn-type":[{"type":"electronic","value":"2073-431X"}],"subject":[],"published":{"date-parts":[[2025,8,11]]}}}