{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T02:35:11Z","timestamp":1769826911048,"version":"3.49.0"},"reference-count":56,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2023,3,20]],"date-time":"2023-03-20T00:00:00Z","timestamp":1679270400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language. Usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation on the essence of learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used methods for word embeddings, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings. We also analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. 
We conducted a detailed empirical analysis on the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than stem-based and root-based text representations, and fastText outperforms other word embeddings on word-based corpus.<\/jats:p>","DOI":"10.3390\/info14030195","type":"journal-article","created":{"date-parts":[[2023,3,20]],"date-time":"2023-03-20T03:09:37Z","timestamp":1679281777000},"page":"195","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Learned Text Representation for Amharic Information Retrieval and Natural Language Processing"],"prefix":"10.3390","volume":"14","author":[{"given":"Tilahun","family":"Yeshambel","sequence":"first","affiliation":[{"name":"IT Doctorial Program, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9273-2193","authenticated-orcid":false,"given":"Josiane","family":"Mothe","sequence":"additional","affiliation":[{"name":"Componsante INSPE, IRIT, UMR5505 CNRS, Universit\u00e9 de Toulouse Jean-Jaur\u00e8s, 118 Rte de Narbonne, F31400 Toulouse, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7591-9298","authenticated-orcid":false,"given":"Yaregal","family":"Assabie","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,3,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., and Sun, M. (2020). Representation Learning for Natural Language Processing, Springer. 
Available online: https:\/\/link.springer.com\/book\/10.1007\/978-981-15-5573-2.","DOI":"10.1007\/978-981-15-5573-2"},{"key":"ref_2","unstructured":"Manning, C., Raghavan, P., and Schutze, H. (2010). Introduction to Information Retrieval, Cambridge University Press. Available online: https:\/\/nlp.stanford.edu\/IR-book\/pdf\/irbookprint.pdf."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/505282.505283","article-title":"Machine learning in automated text categorization","volume":"34","author":"Sebastiani","year":"2002","journal-title":"ACM Comput. Surv."},{"key":"ref_4","unstructured":"Tellex, S., Katz, B., Lin, J., Fernandes, A., and Marton, G. (August, January 28). Quantitative evaluation of passage retrieval algorithms for question answering. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, ON, Canada."},{"key":"ref_5","unstructured":"Turian, J., Ratinov, L., and Yoshua, B. (2010, January 11\u201316). Word representations: A simple and general method for semi-supervised learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden."},{"key":"ref_6","unstructured":"Socher, R., Bauer, J., Manning, C., and Ng, A.Y. (2013, January 4\u20139). Parsing with compositional vector grammars. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Babic, K., Martin\u010di\u0107-Ip\u0161i\u0107, S., and Me\u0161trovi\u0107, A. (2020). Survey of neural text representation models. Information, 11.","DOI":"10.3390\/info11110511"},{"key":"ref_8","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013, January 5\u201310). Distributed representations of words and phrases and their compositionality. 
Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_9","unstructured":"Le, Q., and Mikolov, T. (2014, January 21\u201326). Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning, Beijing, China."},{"key":"ref_10","unstructured":"Logeswaran, L., and Lee, H. (May, January 30). An efficient framework for learning sentence representations. Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada."},{"key":"ref_11","unstructured":"Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015, January 6\u201311). From word embeddings to document distances. Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhou, G., He, T., Zhao, J., and Hu, P. (2015, January 27\u201329). Learning continuous word embedding with metadata for question retrieval in community question answering. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China.","DOI":"10.3115\/v1\/P15-1025"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Athiwaratkun, B., Gordon, A., and Anandkumar, A. (2018, January 15\u201320). Probabilistic fastText for multi-sense word embeddings. 
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1001"},{"key":"ref_15","unstructured":"Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019, January 3\u20135). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota."},{"key":"ref_16","unstructured":"Antoun, W., Bal, F., and Hajj, H. (2020, January 12). Arabert: Transformer-based model for Arabic language understanding. Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Delobelle, P., Winters, T., and Berendt, B. (2020, January 16\u201318). RobBERT: A Dutch RoBERTa-based language model. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online.","DOI":"10.18653\/v1\/2020.findings-emnlp.292"},{"key":"ref_18","unstructured":"Polignano, M., Basile, P., Gemmis, M., Semeraro, G., and Basile, V. (2019, January 13\u201315). AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-It 2019), Bari, Italy."},{"key":"ref_19","unstructured":"Terumi, E., Vitor, J., Knafou, J., Copara, J., Oliveira, L., Gumiel, Y., Oliveira, L., Teodoro, D., Cabrera, E., and Moro, C. (2020, January 19). BioBERTpt-A Portuguese neural language model for clinical named entity recognition. Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online."},{"key":"ref_20","unstructured":"Kuratov, Y., and Arkhipov, M. (June, January 29). Adaptation of deep bidirectional multilingual transformers for Russian language. 
Proceedings of the Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference \u201cDialogue 2019\u201d, Neural Networks and Deep Learning Lab, Moscow, Russia."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Martin, L., Muller, B., Javier, P., Su\u00e1rez, O., Dupont, Y., Romary, L., Villemonte, \u00c9., Clergerie, D., Seddah, D., and Sagot, B. (2020, January 5\u201310). CamemBERT: A Tasty French language model. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.645"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1016\/j.aiopen.2021.08.002","article-title":"Pre-trained models: Past, present and future","volume":"2","author":"Han","year":"2022","journal-title":"AI Open"},{"key":"ref_23","unstructured":"Yeshambel, T., Mothe, J., and Assabie, Y. (2020, January 2\u20134). Evaluation of corpora, resources and tools for Amharic information retrieval. Proceedings of the ICAST2020, Bahir Dar, Ethiopia."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Diaz, F., Mitra, B., and Craswell, N. (2016, January 7\u201312). Query expansion with locally-trained word embeddings. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1035"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Aklouche, B., Bounhas, I., and Slimani, Y. (2018, January 14\u201316). Query expansion based on NLP and word embeddings. Proceedings of the 27th Text REtrieval Conference (TREC 2018), Gaithersburg, MD, USA.","DOI":"10.6028\/NIST.SP.500-331.core-JARIR"},{"key":"ref_26","unstructured":"Getnet, B., and Assabie, Y. (2021, January 2\u20134). Amharic information retrieval based on query expansion using semantic vocabulary. 
Proceedings of the 8th EAI International Conference on Advancements of Science and Technology, Bahir Dar, Ethiopia."},{"key":"ref_27","unstructured":"Deho, O., Agangiba, W., Aryeh, F., and Ansah, J. (2018, January 22\u201324). Sentiment analysis with word embedding. Proceedings of the 2018 IEEE 7th International Conference on Adaptive Science & Technology (ICAST), Accra, Ghana."},{"key":"ref_28","unstructured":"Acosta, J., Norissa, L., Mingxiao, L., Ezra, F., and Andreea, C. (2017, January 5). Sentiment analysis of twitter messages using word2Vec. Proceedings of the Student Faculty Research Day, CSIS, New York, NY, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Medved, M., and Hor\u00e1k, A. (2018, January 16\u201318). Sentence and Word embedding employed in Open question-Answering. Proceedings of the 10th International Conference on Agents and Artificial Intelligence (ICAART 2018), Funchal, Portugal.","DOI":"10.5220\/0006595904860492"},{"key":"ref_30","unstructured":"Sun, Y., Zheng, Y., Hao, C., and Qiu, H. (2022, January 12\u201317). NSP-BERT: A Prompt-based few-shot learner through an original pre-training task\u2014Next sentence prediction. Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Shi, W., and Demberg, V. (2019, January 3\u20137). Next sentence prediction helps implicit discourse relation classification within and across domains. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China.","DOI":"10.18653\/v1\/D19-1586"},{"key":"ref_32","unstructured":"Bai, H., and Zhao, H. (2018, January 20\u201326). Deep enhanced representation for implicit discourse relation recognition. 
Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA."},{"key":"ref_33","unstructured":"Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., and Webber, B. (2008, January 28\u201330). The Penn Discourse TreeBank 2.0. Proceedings of the Sixth Conference on International Language Resources and Evaluation (LREC-2008), Marrakech, Morocco."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"3504","DOI":"10.1109\/TASLP.2021.3124365","article-title":"Pre-training with whole word masking for Chinese BERT","volume":"29","author":"Cui","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_35","first-page":"e10","article-title":"Classification of fake news by fine-tuning deep bidirectional transformers based language model","volume":"27","author":"Aggarwal","year":"2020","journal-title":"EAI Endorsed Trans. Scalable Inf. Syst."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Protasha, N., Sam, A., Kowsher, M., Murad, S., Bairagi, A., Masud, M., and Baz, M. (2022). Transfer learning for sentiment analysis using BERT based supervised fine-tuning. Sensors, 22.","DOI":"10.3390\/s22114157"},{"key":"ref_37","unstructured":"Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. (2018, January 7\u201312). Learning word vectors for 157 languages. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"358","DOI":"10.22161\/ijaers.78.39","article-title":"Learning word and sub-word vectors for Amharic (Less Resourced Language)","volume":"7","author":"Eshetu","year":"2020","journal-title":"Int. J. Adv. Eng. Res. Sci. (IJAERS)"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Muhie, S., Ayele, A., Venkatesh, G., Gashaw, I., and Biemann, C. (2021). 
Introducing various semantic models for Amharic: Experimentation and evaluation with multiple tasks and datasets. Future Internet, 13.","DOI":"10.3390\/fi13110275"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Yeshambel, T., Mothe, J., and Assabie, Y. (2021). Amharic adhoc information retrieval system based on morphological features. Appl. Sci., 12.","DOI":"10.3390\/app12031294"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzm\u00e1n, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020, January 5\u201310). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"ref_42","unstructured":"Yeshambel, T., Mothe, J., and Assabie, Y. (2020, January 2\u20134). Construction of morpheme-based Amharic stopword list for information retrieval system. Proceedings of the 8th EAI International Conference on Advancements of Science and Technology, Bahir Dar, Ethiopia."},{"key":"ref_43","first-page":"254","article-title":"The effectiveness of stemming for information retrieval in Amharic","volume":"37","author":"Alemayehu","year":"2003","journal-title":"Program Electron. Libr. Inf. Syst."},{"key":"ref_44","unstructured":"Yimam, B. (2000). Yamarigna Sewasiw (Amharic Grammar), CASE. [2nd ed.]."},{"key":"ref_45","unstructured":"Wolf, L. (1995). Reference Grammar of Amharic, Otto Harrassowitz. [1st ed.]."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Yeshambel, T., Mothe, J., and Assabie, Y. (2020, January 2\u20134). Amharic document representation for adhoc retrieval. Proceedings of the 12th International Conference on Knowledge Discovery and Information Retrieval, Online.","DOI":"10.5220\/0010177301240134"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Arora, P., Foster, J., and Jones, G. 
(2017, January 11\u201314). Query expansion for sentence retrieval using pseudo relevance feedback and word embedding. Proceedings of the CLEF 2017, Dublin, Ireland.","DOI":"10.1007\/978-3-319-65813-1_8"},{"key":"ref_48","unstructured":"Sun, C., Qiu, X., Xu, Y., and Huang, X. (2020, January 14\u201316). How to fine-tune BERT for text classification? Proceedings of the 21st China National Conference on Chinese Computational Linguistics, Nanchang, China."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Palotti, J., Scells, H., and Zuccon, G. (2019, January 21\u201325). TrecTools: An Open-source Python Library for Information Retrieval Practitioners Involved in TREC-like Campaigns. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR\u201919), Paris, France.","DOI":"10.1145\/3331184.3331399"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Yeshambel, T., Mothe, J., and Assabie, Y. (2020, January 22\u201325). 2AIRTC: The Amharic Adhoc information retrieval test collection. Proceedings of the CLEF 2020, LNCS 12260, Thessaloniki, Greece.","DOI":"10.1007\/978-3-030-58219-7_5"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Yeshambel, T., Mothe, J., and Assabie, Y. (2021, January 11\u201315). Morphologically annotated Amharic text corpora. Proceedings of the 44th ACM SIGIR Conference on Research and Development in Information Retrieval, Online.","DOI":"10.1145\/3404835.3463237"},{"key":"ref_52","unstructured":"Kingma, D., and Ba, J. (2015, January 7\u20139). Adam: A Method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA."},{"key":"ref_53","unstructured":"Lin, J., Nogueira, R., and Yates, A. (2021, January 6\u201311). Pre-trained transformers for text ranking: BERT and Beyond. 
Proceedings of the NAACL-HLT, Mexico City, Mexico."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Limsopatham, N. (2021, January 10). Effectively leveraging BERT for legal document classification. Proceedings of the Natural Legal Language Processing Workshop 2021, Punta Cana, Dominican Republic.","DOI":"10.18653\/v1\/2021.nllp-1.22"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"34046","DOI":"10.1109\/ACCESS.2022.3162614","article-title":"A long-text classification method of Chinese news based on BERT and CNN","volume":"10","author":"Chen","year":"2022","journal-title":"IEEE Access"},{"key":"ref_56","first-page":"1429","article-title":"Automatic query expansion using word embedding based on fuzzy graph connectivity measures","volume":"5","author":"Goyal","year":"2021","journal-title":"Int. J. Trend Sci. Res. Dev. (IJTSRD)"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/3\/195\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:58:59Z","timestamp":1760122739000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/3\/195"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,20]]},"references-count":56,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2023,3]]}},"alternative-id":["info14030195"],"URL":"https:\/\/doi.org\/10.3390\/info14030195","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,20]]}}}