{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T17:09:16Z","timestamp":1754154556615,"version":"3.41.2"},"reference-count":15,"publisher":"Association for Computing Machinery (ACM)","issue":"7","funder":[{"name":"National Science and Technology Council, Taiwan","award":["NSTC 113-2223-E-007-019"],"award-info":[{"award-number":["NSTC 113-2223-E-007-019"]}]},{"name":"Google\u2019s TPU Research Cloud (TRC) program"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2025,7,31]]},"abstract":"<jats:p>\n            In this article, we present our works to create the first Hong Kong content-based public pre-training dataset and the experiments which resulted in the creation of ELECTRA-based models for commonly used languages in Hong Kong. The creation of pre-training dataset is required for us to study the effect of diglossia on Hong Kong language model, and this is the first ever study on the effect starting all the way from dataset creation phase. Our experiment shows that removing diglossia from pre-training data hurts model performance. 
We will release our data and models to encourage future studies in Hong Kong languages.\n            <jats:xref ref-type=\"fn\">\n              <jats:sup>1<\/jats:sup>\n            <\/jats:xref>\n          <\/jats:p>","DOI":"10.1145\/3744341","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:18:34Z","timestamp":1750281514000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-0979-4509","authenticated-orcid":false,"given":"Yiu Cheong","family":"Yung","sequence":"first","affiliation":[{"name":"National Cheng Kung University","place":["Tainan, Taiwan"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4347-0232","authenticated-orcid":false,"given":"Ying-Jia","family":"Lin","sequence":"additional","affiliation":[{"name":"National Cheng Kung University","place":["Tainan, Taiwan"]},{"name":"Department of Artificial Intelligence, Chang Gung University","place":["Tainan, Taiwan"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8890-8544","authenticated-orcid":false,"given":"Hung-Yu","family":"Kao","sequence":"additional","affiliation":[{"name":"National Cheng Kung University","place":["Tainan, Taiwan"]},{"name":"Department of Computer Science, National Tsing Hua University","place":["Tainan, Taiwan"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,7,24]]},"reference":[{"key":"e_1_3_3_2_1","unstructured":"2023. Hong Kong. 
Retrieved February 5, 2023 from https:\/\/www.cia.gov\/the-world-factbook\/countries\/hong-kong\/#people-and-society"},{"key":"e_1_3_3_3_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00041"},{"key":"e_1_3_3_4_1","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"29","author":"Bolukbasi Tolga","year":"2016","unstructured":"Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the Advances in Neural Information Processing Systems. D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29, Curran Associates, Inc. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2016\/file\/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf"},{"key":"e_1_3_3_5_1","volume-title":"Proceedings of the ICLR","author":"Clark Kevin","year":"2020","unstructured":"Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/pdf?id=r1xMH1BtvB"},{"key":"e_1_3_3_6_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"e_1_3_3_7_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_3_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458723"},{"key":"e_1_3_3_9_1","doi-asserted-by":"publisher","unstructured":"Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. DOI:10.48550\/arXiv.1805.03677. arXiv:1805.03677 [cs].","DOI":"10.48550\/arXiv.1805.03677"},{"key":"e_1_3_3_10_1","doi-asserted-by":"publisher","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. 
RoBERTa: A Robustly Optimized BERT Pretraining Approach. DOI:10.48550\/arXiv.1907.11692. arXiv:1907.11692 [cs].","DOI":"10.48550\/arXiv.1907.11692"},{"key":"e_1_3_3_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1202"},{"key":"e_1_3_3_12_1","doi-asserted-by":"publisher","unstructured":"Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, and Sam Tsai. 2019. DRCD: A Chinese Machine Reading Comprehension Dataset. DOI:10.48550\/arXiv.1806.00920. arXiv:1806.00920 [cs].","DOI":"10.48550\/arXiv.1806.00920"},{"key":"e_1_3_3_13_1","unstructured":"Robyn Speer. 2017. ConceptNet Numberbatch 17.04: Better Less-stereotyped Word Vectors. Retrieved 20 January 2023 from http:\/\/blog.conceptnet.io\/posts\/2017\/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors\/"},{"key":"e_1_3_3_14_1","volume-title":"Proceedings of the Advances in Neural Information Processing Systems.","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, Curran Associates, Inc. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_3_15_1","first-page":"4003","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference.","author":"Wenzek Guillaume","year":"2020","unstructured":"Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzm\u00e1n, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. 
In Proceedings of the Twelfth Language Resources and Evaluation Conference. Nicoletta Calzolari, Fr\u00e9d\u00e9ric B\u00e9chet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, H\u00e9l\u00e8ne Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.), European Language Resources Association, Marseille, France, 4003\u20134012. Retrieved from https:\/\/aclanthology.org\/2020.lrec-1.494\/"},{"key":"e_1_3_3_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3744341","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T12:26:01Z","timestamp":1753359961000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3744341"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,24]]},"references-count":15,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,7,31]]}},"alternative-id":["10.1145\/3744341"],"URL":"https:\/\/doi.org\/10.1145\/3744341","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2025,7,24]]},"assertion":[{"value":"2023-10-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-08","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-24","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}