{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,25]],"date-time":"2026-02-25T00:18:29Z","timestamp":1771978709179,"version":"3.50.1"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,4,6]],"date-time":"2023-04-06T00:00:00Z","timestamp":1680739200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Meridian Institute through Lacuna Fund","award":["0393-S-001"],"award-info":[{"award-number":["0393-S-001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2023,4,30]]},"abstract":"<jats:p>The need for question-answering (QA) datasets in low-resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset (KenSwQuAD). This dataset is annotated from raw story texts of Swahili, a low-resource language that is predominantly spoken in eastern Africa and in other parts of the world. Question-answering datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard question-answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 texts from the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept on applying the set to the QA task confirmed that the dataset can be usable for such tasks. 
KenSwQuAD has also contributed to resourcing of the Swahili language.<\/jats:p>","DOI":"10.1145\/3578553","type":"journal-article","created":{"date-parts":[[2023,1,17]],"date-time":"2023-01-17T11:55:26Z","timestamp":1673956526000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["KenSwQuAD\u2014A Question Answering Dataset for Swahili Low-resource Language"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0198-3179","authenticated-orcid":false,"given":"Barack W.","family":"Wanjawa","sequence":"first","affiliation":[{"name":"University of Nairobi, Nairobi, Kenya"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6718-900X","authenticated-orcid":false,"given":"Lilian D. A.","family":"Wanzare","sequence":"additional","affiliation":[{"name":"Maseno University, Maseno, Kenya"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4285-770X","authenticated-orcid":false,"given":"Florence","family":"Indede","sequence":"additional","affiliation":[{"name":"Maseno University, Maseno, Kenya"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1276-0395","authenticated-orcid":false,"given":"Owen","family":"Mconyango","sequence":"additional","affiliation":[{"name":"Maseno University, Maseno, Kenya"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5911-5679","authenticated-orcid":false,"given":"Lawrence","family":"Muchemi","sequence":"additional","affiliation":[{"name":"University of Nairobi, Nairobi, Kenya"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1787-0681","authenticated-orcid":false,"given":"Edward","family":"Ombui","sequence":"additional","affiliation":[{"name":"Africa Nazarene University, Nairobi, 
Kenya"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,4,6]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_1_3_2","unstructured":"J. Devlin M.-W. Chang K. Lee and K. Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805."},{"key":"e_1_3_1_4_2","unstructured":"J. Libovick\u00fd R. Rosa and A. Fraser. 2019. How language-neutral is multilingual BERT?. arXiv:1911.03310. Retrieved from https:\/\/arxiv.org\/abs\/1911.03310."},{"key":"e_1_3_1_5_2","doi-asserted-by":"crossref","unstructured":"P. Rajpurkar J. Zhang K. Lopyrev and P. Liang. 2016. Squad: 100 000+ questions for machine comprehension of text. arXiv:1606.05250. Retrieved from https:\/\/arxiv.org\/abs\/1606.05250.","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_1_6_2","first-page":"193","volume-title":"\u2013Proceedings of the Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (EMNLP\u201913)","author":"Richardson M.","year":"2013","unstructured":"M. Richardson, C. J. C. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In \u2013Proceedings of the Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (EMNLP\u201913). 193\u2013203."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1237"},{"key":"e_1_3_1_8_2","volume-title":"Proceedings of LREC\u20192000 Workshop on Using Evaluation within HLT Programs: Results and Trends","author":"Voorhees E. M.","year":"2000","unstructured":"E. M. Voorhees and D. M. Tice. 2000. Implementing a question answering evaluation. In Proceedings of LREC\u20192000 Workshop on Using Evaluation within HLT Programs: Results and Trends."},{"key":"e_1_3_1_9_2","unstructured":"J. H. 
Clark et\u00a0al. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. arXiv:2003.05002. Retrieved from https:\/\/arxiv.org\/abs\/2003.05002."},{"key":"e_1_3_1_10_2","article-title":"TETEYEQ: Amharic question","author":"Yimam S. M.","year":"2009","unstructured":"S. M. Yimam and M. Libsie. 2009. TETEYEQ: Amharic question answering for factoid questions. Proceedings of Information Retrieval and Information Extraction for Less Resourced Languages (IE-IR-LRL), 3, 4 (2009), 17--25.","journal-title":"Proceedings of Information Retrieval and Information Extraction for Less Resourced Languages (IE-IR-LRL)"},{"key":"e_1_3_1_11_2","unstructured":"L. Marais. 2021. Approximating a Zulu GF concrete syntax with a neural network for natural language understanding."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1090"},{"key":"e_1_3_1_13_2","unstructured":"A. B. E. Mabrouk M. B. H. Hmida C. Fourati H. Haddad and A. Messaoudi. 2021. A multilingual african embedding for FAQ chatbots. arXiv:2103.09185. Retrieved from https:\/\/arxiv.org\/abs\/2103.09185."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2018.10.467"},{"key":"e_1_3_1_15_2","unstructured":"B. Wanjawa L. Wanzare F. Indede O. McOnyango L. Muchemi and E. Ombui. 2022. Kencorpus\u2014Kenyan languages corpus. Retrieved May 5 2022 from https:\/\/kencorpus.co.ke\/."},{"key":"e_1_3_1_16_2","volume-title":"Ethnologue: Languages of the World","author":"Eberhard D. M.","year":"2021","unstructured":"D. M. Eberhard, G. F. Simons, and C. D. Fennig. 2021. Ethnologue: Languages of the World. SIL International, Dallas, TX."},{"key":"e_1_3_1_17_2","unstructured":"Wikipedia. 2022. Swahili language\u2014Wikipedia. Retrieved January 20 2022 from https:\/\/en.wikipedia.org\/wiki\/Swahili_language."},{"key":"e_1_3_1_18_2","unstructured":"omniglot. 2021. Swahili alphabet pronunciation and language. 
Retrieved January 26 2022 from https:\/\/omniglot.com\/writing\/swahili.htm."},{"key":"e_1_3_1_19_2","unstructured":"Wikipedia. Retrieved November 27 2020 from https:\/\/www.wikipedia.org."},{"key":"e_1_3_1_20_2","unstructured":"V. Berment. 2004. M\u00e9thodes pour informatiser les langues et les groupes de langues peu dot\u00e9es. PhD Thesis. Ufr D'informatique Et Math\u00e9matiques Appliqu\u00e9es Universit\u00e9 Joseph Fourier-Grenoble I."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2013.07.008"},{"key":"e_1_3_1_22_2","first-page":"1","volume-title":"Proceedings of the IST-Africa Conference (IST-Africa\u201921)","author":"Wanjawa B.","year":"2021","unstructured":"B. Wanjawa and L. Muchemi. 2021. Model for semantic network generation from low resource languages as applied to question answering\u2013case of swahili. In Proceedings of the IST-Africa Conference (IST-Africa\u201921). 1\u20138."},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.aaa8685"},{"key":"e_1_3_1_24_2","volume-title":"Compilers\u2016: Institute for Asian and African Studies","author":"Hurskainen A.","year":"2004","unstructured":"A. Hurskainen. 2004. Helsinki corpus of Swahili. Compilers\u2016: Institute for Asian and African Studies (University of Helsinki) and CSC- Scientific Computing Ltd."},{"key":"e_1_3_1_25_2","unstructured":"aflat. 2020. Kiswahili Part-of-Speech Tagger\u2014Demo. Retrieved September 2020 from https:\/\/www.aflat.org\/swatag."},{"key":"e_1_3_1_26_2","volume-title":"Proceedings of the Language Resources and Evaluation Conference (LREC'10)","unstructured":"K. Chege, P. Wagacha, G. De Pauw, L. Muchemi, and W. Ng'ang'a. 2010. Developing an open source spell checker for G\u0131kuyu. In Proceedings of the Language Resources and Evaluation Conference (LREC'10)."},{"key":"e_1_3_1_27_2","unstructured":"D. I. Adelani et\u00a0al. 2021. MasakhaNER: Named entity recognition for african languages. arXiv:2103.11811. 
Retrieved January 29 2022."},{"key":"e_1_3_1_28_2","volume-title":"Proceedings of the 10th Dutch-Belgian Information Retrieval Workshop","author":"Muhie S.","year":"2010","unstructured":"S. Muhie and M. Libsie. 2010. Amharic question answering (AQA). In Proceedings of the 10th Dutch-Belgian Information Retrieval Workshop (2010)."},{"key":"e_1_3_1_29_2","first-page":"110","volume-title":"Proceedings of the Workshop on Widening NLP","author":"Taffa T. A.","year":"2019","unstructured":"T. A. Taffa and M. Libsie. 2019. Amharic question answering for biography, definition, and description questions. In Proceedings of the Workshop on Widening NLP, 110\u2013113."},{"key":"e_1_3_1_30_2","unstructured":"K. H. Amare. 2016. Tigrigna question answering system for factoid questions. MSc. Thesis College of Natural Sciences Addis Ababa University."},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","unstructured":"F. Faisal S. Keshava M. M. ibn Alam and A. Anastasopoulos. 2021. SD-QA: Spoken dialectal question answering for the real world. arXiv:2109.12072. Retrieved from https:\/\/arxiv.org\/abs\/2109.12072.","DOI":"10.18653\/v1\/2021.findings-emnlp.281"},{"key":"e_1_3_1_32_2","volume-title":"Proceedings of the Text Retrieval Conference (TREC\u201902)","author":"Oard D. W.","year":"2002","unstructured":"D. W. Oard and F. C. Gey. 2002. The TREC 2002 Arabic\/English CLIR Track. In Proceedings of the Text Retrieval Conference (TREC\u201902)."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-59569-6_46"},{"key":"e_1_3_1_34_2","first-page":"1","volume-title":"Proceedings of the Conference and Labs of the Evaluation Forum (CLEF\u201911)","author":"Pe\u00f1as A.","year":"2011","unstructured":"A. Pe\u00f1as et\u00a0al. 2011. Overview of QA4MRE at CLEF 2011: Question answering for machine reading evaluation. In Proceedings of the Conference and Labs of the Evaluation Forum (CLEF\u201911). 
1\u201320."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.3115\/1621969.1621976"},{"key":"e_1_3_1_36_2","first-page":"278","volume-title":"Proceedings of the International Conference on Applied Human Factors and Ergonomics","author":"Wanjawa B.","year":"2020","unstructured":"B. Wanjawa and L. Muchemi. 2020. Using semantic networks for question answering-case of low-resource languages such as swahili. In Proceedings of the International Conference on Applied Human Factors and Ergonomics. 278\u2013285."},{"key":"e_1_3_1_37_2","unstructured":"A. Singhal. 2017. Introducing the Knowledge Graph: Things Not Strings. Retrieved November 05 2017 from http:\/\/insidesearch.blogspot.com\/2012\/05\/introducing-knowledge-graph-things-not.html."},{"key":"e_1_3_1_38_2","volume-title":"Proceedings of the 5th USENIX Workshop on Hot Topics in Cloud Computing","author":"Wang R.","year":"2013","unstructured":"R. Wang, C. Conrad, and S. Shah. 2013. Using set cover to optimize a large-scale low latency distributed graph. In Proceedings of the 5th USENIX Workshop on Hot Topics in Cloud Computing."},{"key":"e_1_3_1_39_2","unstructured":"S. Sankar S. Lassen and M. Curtiss. 2017. Under the Hood: Building out the infrastructure for Graph Search. Retrieved November 06 2017 from http:\/\/www.facebook.com\/notes\/facebook-engineering\/under-the-hood-building-out-the-infrastructure-for-graph-search\/10151347573598920\/."},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jksuci.2014.12.004"},{"key":"e_1_3_1_41_2","first-page":"7059","volume-title":"Advances in Neural Information Processing Systems.","author":"Conneau A.","year":"2019","unstructured":"A. Conneau and G. Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems. 7059\u20137069."},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.naacl-main.23"},{"key":"e_1_3_1_43_2","unstructured":"RDF Grapher. [n.d.]. 
Retrieved November 2021 from https:\/\/www.ldf.fi\/service\/rdf-grapher."},{"key":"e_1_3_1_44_2","unstructured":"Pytext. [n.d.]. 2022. XLM-RoBERTa. Retrieved November 15 2022 from https:\/\/pytext.readthedocs.io\/en\/master\/xlm_r.html."},{"key":"e_1_3_1_45_2","unstructured":"Paperswithcode. [n.d.]. 2022. Question Answering on SQuAD2.0. Retrieved November 14 2022 from https:\/\/paperswithcode.com\/sota\/question-answering-on-squad20."},{"key":"e_1_3_1_46_2","unstructured":"I. Orife et\u00a0al. [n.d.]. Masakhane\u2013Machine Translation for Africa. arXiv:2003.11529. Retrieved from https:\/\/arxiv.org\/abs\/2003.11529."}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3578553","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3578553","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:08:37Z","timestamp":1750183717000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3578553"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,6]]},"references-count":45,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,4,30]]}},"alternative-id":["10.1145\/3578553"],"URL":"https:\/\/doi.org\/10.1145\/3578553","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,6]]},"assertion":[{"value":"2022-06-25","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-22","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-04-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}