{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,9]],"date-time":"2026-06-09T16:05:50Z","timestamp":1781021150559,"version":"3.54.1"},"reference-count":31,"publisher":"World Scientific Pub Co Pte Ltd","issue":"01","funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["JP23K11227"],"award-info":[{"award-number":["JP23K11227"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Int. J. As. Lang. Proc."],"published-print":{"date-parts":[[2024,3]]},"abstract":"<jats:p> The Lhasa dialect, the most widely spoken Tibetan dialect in Tibet, is also renowned for its rich historical archive of written scripts. Exploring speech recognition methodologies specific to the Lhasa dialect is paramount in safeguarding Tibet\u2019s distinct linguistic heritage. Previous studies in Tibetan speech recognition have been largely confined to academic research using nonpublic datasets, focusing on elements such as the selection of phone-level acoustic modeling units and the integration of tonal information. However, these studies have not significantly benefited the community due to the scarcity of available data. To mitigate the challenge posed by limited data resources, we present the NICT-Tib1 (phase 1) dataset, a new open-source dataset collected from native speakers dedicated to investigating speech recognition for the Lhasa dialect. Speech recognition with deep neural networks (DNNs) evolved three generations from systems hybrid with hidden Markov model (HMM) (e.g., DNN-HMM) to End-to-End systems (e.g., Transformer), and finally to self-supervised learning (SSL) systems (e.g., Wav2Vec2.0), each generation improving accuracy and simplifying the training process, with the latest generation achieving state-of-the-art performance, especially for low-resource languages. Besides the early DNN-HMM-based system using Kaldi, we further update benchmark systems with the Conformer and Wav2Vec2.0 trained by ESPnet and Huggingface on this dataset, respectively. Experimental results show that these state-of-the-art models outperformed the models in previous work. <\/jats:p>","DOI":"10.1142\/s2717554524500012","type":"journal-article","created":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T07:42:34Z","timestamp":1717141354000},"source":"Crossref","is-referenced-by-count":1,"title":["Voices of the Himalayas: Benchmarking Speech Recognition Systems for the Tibetan Language"],"prefix":"10.1142","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7636-3797","authenticated-orcid":false,"given":"Sheng","family":"Li","sequence":"first","affiliation":[{"name":"National Institute of Information and Communications Technology, Soraku-gun, Kyoto 619-0289, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4997-3850","authenticated-orcid":false,"given":"Jiyi","family":"Li","sequence":"additional","affiliation":[{"name":"University of Yamanashi, Kofu, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9848-6384","authenticated-orcid":false,"given":"Chenhui","family":"Chu","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"219","published-online":{"date-parts":[[2024,7,23]]},"reference":[{"key":"S2717554524500012BIB001","doi-asserted-by":"publisher","DOI":"10.1109\/5.18626"},{"key":"S2717554524500012BIB002","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2011.2134090"},{"key":"S2717554524500012BIB003","doi-asserted-by":"publisher","DOI":"10.1109\/APSIPA.2016.7820795"},{"key":"S2717554524500012BIB004","doi-asserted-by":"publisher","DOI":"10.1109\/ISCSLP.2016.7918447"},{"key":"S2717554524500012BIB005","first-page":"1764","volume-title":"Int. Conf. Machine Learning","author":"Graves A.","year":"2014"},{"key":"S2717554524500012BIB006","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2015.7404790"},{"key":"S2717554524500012BIB007","volume-title":"Advances in Neural Information Processing Systems 28 (NIPS 2015)","author":"Chorowski J.","year":"2015"},{"key":"S2717554524500012BIB008","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"S2717554524500012BIB009","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2017.2763455"},{"key":"S2717554524500012BIB011","doi-asserted-by":"publisher","DOI":"10.1109\/APSIPAASC47483.2019.9023100"},{"key":"S2717554524500012BIB012","doi-asserted-by":"publisher","DOI":"10.1186\/s13636-021-00233-4"},{"key":"S2717554524500012BIB013","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-10015"},{"key":"S2717554524500012BIB014","doi-asserted-by":"publisher","DOI":"10.1109\/O-COCOSDA202257103.2022.9997917"},{"key":"S2717554524500012BIB015","volume-title":"Advances in Neural Information Processing Systems 34 (NeurIPS 2021)","author":"Baevski A.","year":"2021"},{"key":"S2717554524500012BIB016","first-page":"12449","volume-title":"Advances in Neural Information Processing Systems 33 (NeurIPS 2020)","author":"Baevski A.","year":"2020"},{"key":"S2717554524500012BIB017","doi-asserted-by":"publisher","DOI":"10.2307\/2718544"},{"key":"S2717554524500012BIB019","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1423"},{"key":"S2717554524500012BIB020","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1456"},{"key":"S2717554524500012BIB021","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462576"},{"key":"S2717554524500012BIB022","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1296"},{"key":"S2717554524500012BIB023","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462506"},{"key":"S2717554524500012BIB026","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1107"},{"key":"S2717554524500012BIB028","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-374"},{"key":"S2717554524500012BIB029","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2104"},{"key":"S2717554524500012BIB030","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2092"},{"key":"S2717554524500012BIB031","doi-asserted-by":"publisher","DOI":"10.1109\/APSIPAASC47483.2019.9023137"},{"key":"S2717554524500012BIB032","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054233"},{"key":"S2717554524500012BIB034","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6854049"},{"key":"S2717554524500012BIB035","volume-title":"IEEE 2011 Workshop on Automatic Speech Recognition and Understanding","author":"Povey D.","year":"2011"},{"key":"S2717554524500012BIB036","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"S2717554524500012BIB037","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472152"}],"container-title":["International Journal of Asian Language Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S2717554524500012","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T07:56:01Z","timestamp":1722930961000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/10.1142\/S2717554524500012"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3]]},"references-count":31,"journal-issue":{"issue":"01","published-print":{"date-parts":[[2024,3]]}},"alternative-id":["10.1142\/S2717554524500012"],"URL":"https:\/\/doi.org\/10.1142\/s2717554524500012","relation":{},"ISSN":["2717-5545","2424-791X"],"issn-type":[{"value":"2717-5545","type":"print"},{"value":"2424-791X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3]]},"article-number":"2450001"}}