{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,25]],"date-time":"2026-06-25T05:36:46Z","timestamp":1782365806292,"version":"3.54.5"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,12,27]],"date-time":"2022-12-27T00:00:00Z","timestamp":1672099200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Centre for Human Language Technology"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>Named entity recognition has been one of the most widely researched natural language processing technologies over the past two decades. For the South African languages, however, relatively little research and development work has been done. This changed with the release of the NCHLT named entity annotated resources, a collection of named entity annotated data and Conditional Random Field-based named entity recognisers for ten of the official languages.<\/jats:p>\n          <jats:p>In this work, we provide a detailed description and linguistic analysis of the named entity (NE) annotated data for the agglutinative isiXhosa language, by analysing the morphosyntactic features relevant to the three main types of NE, viz. person, location, and organisation. From the data, we identify suffix and capitalisation features that may be good predictors of the different NE types. Based on these features, we describe the named entity recogniser and feature set developed as part of the NCHLT release. The recogniser has high precision, 0.9713 overall, but relatively low recall, 0.7409, especially for person names, 0.5963, resulting in an overall F-score of 0.8406. Although there are various avenues to improve the named entity recogniser, this is a significant release for a historically under-resourced language.<\/jats:p>","DOI":"10.1145\/3531478","type":"journal-article","created":{"date-parts":[[2022,6,2]],"date-time":"2022-06-02T11:28:37Z","timestamp":1654169317000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["IsiXhosa Named Entity Recognition Resources"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8612-5175","authenticated-orcid":false,"given":"Roald","family":"Eiselen","sequence":"first","affiliation":[{"name":"Centre for Text Technology, North-West University, Potchefstroom, South Africa"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6667-4599","authenticated-orcid":false,"given":"Andiswa","family":"Bukula","sequence":"additional","affiliation":[{"name":"South African Centre for Digital Language Resources, North-West University, Potchefstroom, South Africa"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2022,12,27]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-85287-2_42"},{"key":"e_1_3_2_3_2","first-page":"3349","volume-title":"Proceedings of the European Language Resources Association (LREC\u201916)","author":"Ehrmann Maud","year":"2016","unstructured":"Maud Ehrmann, Damien Nouvel, and Sophie Rosset. 2016. Named entity resources-overview and outlook. In Proceedings of the European Language Resources Association (LREC\u201916). 3349\u20133356."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.3115\/1609822.1609823"},{"key":"e_1_3_2_5_2","first-page":"52","volume-title":"Proceedings of the Association for Computational Linguistics.","author":"Balahur Alexandra","year":"2012","unstructured":"Alexandra Balahur and Marco Turchi. 2012. Multilingual sentiment analysis using machine translation? In Proceedings of the Association for Computational Linguistics. 52\u201360."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.3115\/1219840.1219885"},{"key":"e_1_3_2_7_2","first-page":"287","volume-title":"Proceedings of the DARPA Speech Recognition and Natural Language Workshop. Morgan Kaufmann","author":"Kubala Francis","year":"1998","unstructured":"Francis Kubala, Richard Schwartz, Rebecca Stone, and Ralph Weischedel. 1998. Named entity extraction from speech. In Proceedings of the DARPA Speech Recognition and Natural Language Workshop. Morgan Kaufmann. 287\u20131992."},{"key":"e_1_3_2_8_2","first-page":"3344","volume-title":"Proceedings of the European Language Resources Association (LREC\u201916)","author":"Eiselen Roald","year":"2016","unstructured":"Roald Eiselen. 2016. Government domain named entity recognition for South African languages. In Proceedings of the European Language Resources Association (LREC\u201916). 3344\u20133348."},{"key":"e_1_3_2_9_2","volume-title":"Proceedings of the Pattern Recognition Association of South Africa Conference","author":"Fourie W.","year":"2014","unstructured":"W. Fourie, J. V. Du Toit, and D. P. Snyman. 2014. Comparing support vector machine and multinomial naive Bayes for named entity classification of South African languages. In Proceedings of the Pattern Recognition Association of South Africa Conference. Retrieved from http:\/\/hdl.handle.net\/10394\/16239."},{"key":"e_1_3_2_10_2","volume-title":"Benoemde\u2013Entiteitherkenning Vir Afrikaans","author":"Matthew Gordon","year":"2013","unstructured":"Gordon Matthew. 2013. Benoemde\u2013Entiteitherkenning Vir Afrikaans. PhD Thesis, North-West University, Vanderbijlpark."},{"key":"e_1_3_2_11_2","volume-title":"Outomatiese Afrikaanse Tekseenheididentifisering","author":"Puttkammer Martin J.","year":"2006","unstructured":"Martin J. Puttkammer. 2006. Outomatiese Afrikaanse Tekseenheididentifisering. PhD Thesis, North-West University, Potchefstroom."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.3115\/1596374.1596399"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.4314\/jlt.v48i2.7"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.3115\/1072399.1072402"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.3115\/1220575.1220631"},{"key":"e_1_3_2_16_2","first-page":"15","volume-title":"Proceedings of the DARPA Speech Recognition Workshop.","author":"Garofolo John S.","year":"1997","unstructured":"John S. Garofolo, Jonathan G. Fiscus, and William M. Fisher. 1997. Design and preparation of the 1996 Hub-4 broadcast news benchmark test corpora. In Proceedings of the DARPA Speech Recognition Workshop. Morgan Kaufmann, 15\u201321."},{"key":"e_1_3_2_17_2","first-page":"1","volume-title":"Proceedings of the 7th Message Understanding Conference (MUC\u201997).","author":"Chinchor Nancy","year":"1997","unstructured":"Nancy Chinchor and Patricia Robinson. 1997. MUC-7 named entity task definition. In Proceedings of the 7th Message Understanding Conference (MUC\u201997). 1\u201321."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.3115\/1118853.1118877"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.3115\/1119176.1119195"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1075\/li.30.1.03nad"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","unstructured":"Alexei Baevski Sergey Edunov Yinhan Liu Luke Zettlemoyer and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. Retrieved from https:\/\/arXiv:1903.07785.","DOI":"10.18653\/v1\/D19-1539"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1367"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Myle Ott Sergey Edunov Alexei Baevski Angela Fan Sam Gross Nathan Ng David Grangier and Michael Auli. 2019. FAIRSEQ: A fast extensible toolkit for sequence modeling. Retrieved from https:\/\/arXiv:1904.01038.","DOI":"10.18653\/v1\/N19-4009"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1519"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1002\/asi.21090"},{"key":"e_1_3_2_26_2","unstructured":"Ilias G. Maglogiannis. 2007. Emerging artificial intelligence applications in computer engineering: Real word AI systems with applications in eHealth HCI information retrieval and pervasive technologies. IOS Press Amsterdam."},{"key":"e_1_3_2_27_2","doi-asserted-by":"crossref","unstructured":"Guillaume Lample Miguel Ballesteros Sandeep Subramanian Kazuya Kawakami and Chris Dyer. 2016. Neural architectures for named entity recognition. Retrieved from https:\/\/arXiv:1603.01360.","DOI":"10.18653\/v1\/N16-1030"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.3115\/1119176.1119206"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1310"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-4013"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2016.2516590"},{"key":"e_1_3_2_32_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https:\/\/arXiv:1810.04805."},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","unstructured":"Matthew E. Peters Mark Neumann Mohit Iyyer Matt Gardner Christopher Clark Kenton Lee and Luke Zettlemoyer. 2018. Deep contextualized word representations. Retrieved from https:\/\/arXiv:1802.05365.","DOI":"10.18653\/v1\/N18-1202"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-4010"},{"key":"e_1_3_2_35_2","first-page":"1638","volume-title":"Proceedings of the Association for Computational Linguistics (COLING\u201918)","author":"Akbik Alan","year":"2018","unstructured":"Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the Association for Computational Linguistics (COLING\u201918). 1638\u20131649."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.523"},{"key":"e_1_3_2_37_2","first-page":"4615","volume-title":"Proceedings of the European Language Resources Association (LREC\u201920)","author":"Luoma Jouni","year":"2020","unstructured":"Jouni Luoma, Miika Oinonen, Maria Pyyk\u00f6nen, Veronika Laippala, and Sampo Pyysalo. 2020. A Broad-coverage corpus for finnish named entity recognition. In Proceedings of the European Language Resources Association (LREC\u201920). 4615\u20134624."},{"key":"e_1_3_2_38_2","first-page":"105","volume-title":"Proceedings of the Association for Computational Linguistics","author":"Yeniterzi Reyyan","year":"2011","unstructured":"Reyyan Yeniterzi. 2011. Exploiting morphology in turkish named entity recognition system. In Proceedings of the Association for Computational Linguistics. 105\u2013110."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/IALP.2015.7451557"},{"key":"e_1_3_2_40_2","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1007\/978-3-642-28604-9_26","volume-title":"Computational Linguistics and Intelligent Text Processing","author":"Abdallah Sherief","year":"2012","unstructured":"Sherief Abdallah, Khaled Shaalan, and Muhammad Shoaib. 2012. Integrating rule-based system with classification for Arabic named entity recognition. In Computational Linguistics and Intelligent Text Processing, A. Gelbukh (ed.). Springer, Berlin, 311\u2013322."},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00178"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.3115\/1564508.1564526"},{"key":"e_1_3_2_43_2","volume-title":"Introduction to the Phonology of the Bantu Languages, Trans","author":"Meinhof Carl","year":"1932","unstructured":"Carl Meinhof. 1932. Introduction to the Phonology of the Bantu Languages, Trans. Reimer, Berlin."},{"key":"e_1_3_2_44_2","first-page":"114","volume-title":"Proceedings of the Conference on Generative Approaches to Language Acquisition (GALANA\u201907)","author":"Gxilishe Sandile","year":"2007","unstructured":"Sandile Gxilishe, Peter de Villiers, Jill de Villiers, A. Belikova, L. Meroni, and Mari Umeda. 2007. The acquisition of subject agreement in xhosa. In Proceedings of the Conference on Generative Approaches to Language Acquisition (GALANA\u201907). Citeseer, 114\u2013123."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.3115\/1564508.1564522"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.4324\/9780203987926"},{"key":"e_1_3_2_47_2","unstructured":"Travis W. Perry. 2020. Isixhosa noun classes. Retrieved from http:\/\/facweb.furman.edu\/\u223cperrytravis\/courses\/bio39\/Academics\/Isixhosa\/nounclasses.html."},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1075\/z.121"},{"key":"e_1_3_2_49_2","volume-title":"Revision of Isixhosa Orthography and other Editorial Matters","author":"Mini Buyiswa","year":"2005","unstructured":"Buyiswa Mini and Nonkosi Tyolwana. 2005. Revision of Isixhosa Orthography and other Editorial Matters. PANSALB, Pretoria."},{"key":"e_1_3_2_50_2","unstructured":"K. Podile and R. Eiselen. 2016. NCHLT isiXhosa named entity annotated corpus. Dataset. Centre for Text Technology. Retrieved from https:\/\/hdl.handle.net\/20.500.12185\/312."},{"key":"e_1_3_2_51_2","unstructured":"Martin Puttkammer Martin Schlemmer Wikus Pienaar and Ruan Bekker. 2014. NCHLT isiXhosa Text Corpora. Dataset. Centre for Text Technology. Retrieved from https:\/\/hdl.handle.net\/20.500.12185\/314."},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-4008"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40585-3_20"},{"key":"e_1_3_2_54_2","unstructured":"Taku Kudo. 2013. CRF++: Yet another CRF toolkit. Version 0.58. Retrieved from https:\/\/taku910.github.io\/crfpp\/."},{"key":"e_1_3_2_55_2","first-page":"3698","volume-title":"Proceedings of the European Language Resources Association (LREC\u201914)","author":"Eiselen Roald","year":"2014","unstructured":"Roald Eiselen and Martin J. Puttkammer. 2014. Developing text resources for ten South African languages. In Proceedings of the European Language Resources Association (LREC\u201914). 3698\u20133703."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.3390\/info11010041"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3531478","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3531478","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:21Z","timestamp":1750183761000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3531478"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,27]]},"references-count":55,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3531478"],"URL":"https:\/\/doi.org\/10.1145\/3531478","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,27]]},"assertion":[{"value":"2021-09-07","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-12","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}