{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,12]],"date-time":"2026-04-12T06:50:10Z","timestamp":1775976610290,"version":"3.50.1"},"reference-count":122,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,6,8]],"date-time":"2024-06-08T00:00:00Z","timestamp":1717804800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,6,8]],"date-time":"2024-06-08T00:00:00Z","timestamp":1717804800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Hong Kong Polytechnic University"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Lang Resources &amp; Evaluation"],"published-print":{"date-parts":[[2025,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared to other major languages. This limitation has excluded it from the current \u201cpre-training and fine-tuning\u201d paradigm that is dominated by Transformer architectures. In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models. 
We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: <jats:italic>colloquialism<\/jats:italic> and <jats:italic>multilinguality<\/jats:italic>\n          <\/jats:p>","DOI":"10.1007\/s10579-024-09744-w","type":"journal-article","created":{"date-parts":[[2024,6,8]],"date-time":"2024-06-08T05:01:47Z","timestamp":1717822907000},"page":"1747-1773","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Cantonese natural language processing in the transformers era: a survey and current challenges"],"prefix":"10.1007","volume":"59","author":[{"given":"Rong","family":"Xiang","sequence":"first","affiliation":[]},{"given":"Emmanuele","family":"Chersoni","sequence":"additional","affiliation":[]},{"given":"Yixia","family":"Li","sequence":"additional","affiliation":[]},{"given":"Jing","family":"Li","sequence":"additional","affiliation":[]},{"given":"Chu-Ren","family":"Huang","sequence":"additional","affiliation":[]},{"given":"Yushan","family":"Pan","sequence":"additional","affiliation":[]},{"given":"Yushi","family":"Li","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,6,8]]},"reference":[{"key":"9744_CR1","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, FL., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S. et\u00a0al. (2023) GPT-4 Technical Report. arXiv preprint arXiv:2303.08774"},{"key":"9744_CR2","unstructured":"Ahrens, K. (2015) Corpus of Political Speeches. Hong Kong Baptist University Library, URL https:\/\/digital.lib.hkbu.edu.hk\/corpus\/"},{"key":"9744_CR3","unstructured":"Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, \u00c9., Hesslow, D., Launay, J., & Malartic, Q. et\u00a0al (2023) The Falcon Series of Open Language Models. 
arXiv preprint arXiv:2311.16867"},{"key":"9744_CR4","unstructured":"Bai J., Bai S., Chu Y., Cui Z., Dang K., Deng X., Fan Y., Ge W., Han Y., Huang F. et\u00a0al (2023) Qwen Technical Report. arXiv preprint arXiv:2309.16609"},{"key":"9744_CR5","doi-asserted-by":"crossref","unstructured":"Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., Do, Q.V., Xu, Y., & Fung, P. (2023) A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv preprint arXiv:2302.04023","DOI":"10.18653\/v1\/2023.ijcnlp-main.45"},{"issue":"1","key":"9744_CR6","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1515\/glochi-2018-0006","volume":"4","author":"RS Bauer","year":"2018","unstructured":"Bauer, R. S. (2018). Cantonese as written language in Hong Kong. Global Chinese, 4(1), 103\u2013142.","journal-title":"Global Chinese"},{"key":"9744_CR7","doi-asserted-by":"crossref","unstructured":"Black S., Biderman S., Hallahan E., Anthony Q., Gao L., Golding L., He H., Leahy C., McDonell K., Phang J. et\u00a0al (2022) GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745","DOI":"10.18653\/v1\/2022.bigscience-1.9"},{"key":"9744_CR8","unstructured":"Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020) Language models are few-shot learners. 
In: Larochelle H., Ranzato M., Hadsell R., Balcan M., Lin H (eds) Advances in neural information processing systems"},{"key":"9744_CR9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3641289","volume":"15","author":"Y Chang","year":"2023","unstructured":"Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al. (2023). A survey on evaluation of large language models. ACM Trans Intel Syst Technol, 15, 1\u201345.","journal-title":"ACM Trans Intel Syst Technol"},{"key":"9744_CR10","doi-asserted-by":"crossref","unstructured":"Chen J., Liu Y., Zhang G., Cai Y., Wang T., Min H (2013) Sentiment analysis for Cantonese opinion mining. In: International Conference on Emerging Intelligent Data and Web Technologies., IEEE","DOI":"10.1109\/EIDWT.2013.89"},{"issue":"5","key":"9744_CR11","doi-asserted-by":"publisher","first-page":"541","DOI":"10.1007\/s12652-014-0237-8","volume":"6","author":"J Chen","year":"2015","unstructured":"Chen, J., Huang, D. P., Hu, S., Liu, Y., Cai, Y., & Min, H. (2015). An opinion mining framework for Cantonese reviews. Journal of Ambient Intelligence and Humanized Computing, 6(5), 541\u2013547.","journal-title":"Journal of Ambient Intelligence and Humanized Computing"},{"issue":"20","key":"9744_CR12","doi-asserted-by":"publisher","first-page":"7093","DOI":"10.3390\/app10207093","volume":"10","author":"X Chen","year":"2020","unstructured":"Chen, X., Ke, L., Lu, Z., Su, H., & Wang, H. (2020). A novel hybrid model for Cantonese rumor detection on Twitter. Applied Sciences, 10(20), 7093.","journal-title":"Applied Sciences"},{"key":"9744_CR13","volume-title":"The dictionary of Hong Kong Cantonese","author":"LY Cheung","year":"2018","unstructured":"Cheung, L. Y., Ngai, L. W., & Poon, L. M. (2018). The dictionary of Hong Kong Cantonese. Cosmo Books."},{"key":"9744_CR14","unstructured":"Chin A (2015) A Linguistics Corpus of Mid-20th Century Hong Kong Cantonese. 
Department of Linguistics and Modern Language Studies., The Hong Kong Institute of Education., Retrieved 23(3):2015"},{"key":"9744_CR15","doi-asserted-by":"crossref","unstructured":"Choi, H., Kim, J., Joe, S., Min, S., & Gwon, Y. (2021) Analyzing Zero-shot cross-lingual transfer in supervised NLP tasks. In: International Conference on Pattern Recognition., IEEE., pp 9608\u20139613","DOI":"10.1109\/ICPR48806.2021.9412570"},{"key":"9744_CR16","unstructured":"Clark K., Luong MT., Le QV., Manning CD (2020) ELECTRA: Pre-training text encoders as discriminators rather than generators. In: Proceedings of the International Conference on Learning Representations"},{"key":"9744_CR17","doi-asserted-by":"crossref","unstructured":"Conneau A., Khandelwal K., Goyal N., Chaudhary V., Wenzek G., Guzm\u00e1n F., Grave E., Ott M., Zettlemoyer L., & Stoyanov, V. (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of ACL","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"9744_CR18","unstructured":"Cui Y., Yang Z., Yao X (2023) Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca. arXiv preprint arXiv:2304.08177"},{"key":"9744_CR19","doi-asserted-by":"publisher","first-page":"3504","DOI":"10.1109\/TASLP.2021.3124365","volume":"29","author":"Y Cui","year":"2021","unstructured":"Cui, Y., Che, W., Liu, T., Qin, B., & Yang, Z. (2021). Pre-training with whole word masking for Chinese BERT. IEEE\/ACM Transactions on Audio Speech and Language Processing, 29, 3504\u20133514.","journal-title":"IEEE\/ACM Transactions on Audio Speech and Language Processing"},{"key":"9744_CR20","doi-asserted-by":"crossref","unstructured":"Dai, N., Liang, J., Qiu, X., & Huang, X. (2019). Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation. arXiv preprint arXiv:1905.05621","DOI":"10.18653\/v1\/P19-1601"},{"key":"9744_CR21","unstructured":"Dare, M., Fajardo\u00a0Diaz, V., So, AHZ., Wang, Y., Zhang, S. 
(2023) Unsupervised Mandarin-Cantonese Machine Translation . arXiv preprint arXiv:2301.03971"},{"key":"9744_CR22","unstructured":"De Marneffe, M.C., Dozat, T., Silveira, N., Haverinen, K., Ginter, F., Nivre, J., Manning, C.D. (2014) Universal Stanford Dependencies: A Cross-linguistic Typology. In: Proceedings of LREC"},{"key":"9744_CR23","first-page":"318","volume":"35","author":"T Dettmers","year":"2022","unstructured":"Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). GPT3. int8: 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35, 318\u2013332.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"9744_CR24","unstructured":"Devlin J., Chang MW., Lee K., Toutanova K (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL"},{"key":"9744_CR25","doi-asserted-by":"crossref","unstructured":"Ding, H., Zhang, Y., Liu, H., Huang, C.R. (2017) A preliminary phonetic investigation of alphabetic words in Mandarin Chinese. In: Interspeech., pp 3028\u20133032","DOI":"10.21437\/Interspeech.2017-876"},{"key":"9744_CR26","volume-title":"Ethnologue: Languages of the World","author":"DM Eberhard","year":"2022","unstructured":"Eberhard, D. M., Simons, G. F., & Fennig, C. D. (2022). Ethnologue: Languages of the World. SIL International."},{"key":"9744_CR27","unstructured":"Eckart\u00a0de Castilho R., Dore G., Margoni T., Labropoulou P., Gurevych I (2018) A legal perspective on training models for natural language processing. In: Proceedings of LREC"},{"key":"9744_CR28","unstructured":"Fung, G., Debosschere, M., Wang, D., Li, B., Zhu, J., & Wong, K.F. (2017) NLPTEA 2017 shared task\u2013Chinese spelling check. 
In: Proceedings of the IJCNLP Workshop on Natural Language Processing Techniques for Educational Applications"},{"key":"9744_CR29","doi-asserted-by":"crossref","unstructured":"Gao T., Yao X., Chen D (2021) SIMCSE: Simple contrastive learning of sentence embeddings. In: Proceedings of EMNLP","DOI":"10.18653\/v1\/2021.emnlp-main.552"},{"key":"9744_CR30","volume-title":"The multilingual apple: Languages in New York City","author":"O Garc\u00eda","year":"2011","unstructured":"Garc\u00eda, O., & Fishman, J. A. (2011). The multilingual apple: Languages in New York City. Walter de Gruyter."},{"key":"9744_CR31","unstructured":"Gulzar, M.A., Peng, N., Kim, M. et\u00a0al. (2022). Sibylvariant transformations for robust text classification. In: Findings of ACL."},{"key":"9744_CR32","doi-asserted-by":"crossref","unstructured":"Hale, J. (2001). A probabilistic earley parser as a psycholinguistic model. In: Proceedings of NAACL-HLT","DOI":"10.3115\/1073336.1073357"},{"issue":"9","key":"9744_CR33","doi-asserted-by":"publisher","first-page":"397","DOI":"10.1111\/lnc3.12196","volume":"10","author":"J Hale","year":"2016","unstructured":"Hale, J. (2016). Information-theoretical complexity metrics. Language and Linguistics Compass, 10(9), 397\u2013412.","journal-title":"Language and Linguistics Compass"},{"key":"9744_CR34","doi-asserted-by":"crossref","unstructured":"Hollenstein N., Chersoni E., Jacobs CL., Oseki Y., Pr\u00e9vot L., Santus E. (2022). CMCL 2022 shared task on multilingual and Crosslingual Prediction of Human Reading Behavior. In: Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics","DOI":"10.18653\/v1\/2022.cmcl-1.14"},{"key":"9744_CR35","doi-asserted-by":"crossref","unstructured":"Hollenstein N., Pirovano F., Zhang C., J\u00e4ger L., Beinborn L. (2021b). Multilingual language models predict human reading behavior. 
In: Proceedings of NAACL","DOI":"10.18653\/v1\/2021.naacl-main.10"},{"key":"9744_CR36","doi-asserted-by":"crossref","unstructured":"Hollenstein, N., Chersoni E., Jacobs CL., Oseki Y., Pr\u00e9vot L., Santus E. (2021a). CMCL 2021 shared task on eye-tracking prediction. In: Proceedings of the NAACL Workshop on Cognitive Modeling and Computational Linguistics","DOI":"10.18653\/v1\/2021.cmcl-1.7"},{"key":"9744_CR37","unstructured":"Huang, C.R. (2009). Tagged Chinese Gigaword Version 2.0. Linguistic Data Consortium"},{"key":"9744_CR38","doi-asserted-by":"crossref","unstructured":"Huang, C.R., & Chen, K.j. (1992). A Chinese corpus for linguistic research. In: Proceedings of COLING","DOI":"10.3115\/992424.992467"},{"key":"9744_CR39","doi-asserted-by":"crossref","unstructured":"Huang, G., Gorin, A., Gauvain, J.L., & Lamel, L. (2016). Machine translation based data augmentation for Cantonese keyword spotting. In: IEEE International Conference on Acoustics., Speech and Signal Processing., IEEE., pp 6020\u20136024","DOI":"10.1109\/ICASSP.2016.7472833"},{"key":"9744_CR40","unstructured":"Jiang, AQ., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, DS., Casas, Ddl., Bressand, F., Lengyel, G., Lample, G., & Saulnier, L. et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825"},{"key":"9744_CR41","doi-asserted-by":"crossref","unstructured":"Jin, Z., Jin, D., Mueller, J., Matthews, N., Santus, E. (2019). IMaT: Unsupervised text attribute transfer via iterative matching and translation. In: Proceedings of EMNLP","DOI":"10.18653\/v1\/D19-1306"},{"key":"9744_CR42","unstructured":"Johnson, K.A., Babel, M., Fong, I., Yiu, N. (2020) SpiCE: A new open-access corpus of conversational bilingual speech in Cantonese and English. In: Proceedings of LREC"},{"key":"9744_CR43","doi-asserted-by":"crossref","unstructured":"Ke, L., Chen, X., Lu, Z., Su, H., Wang, H. (2020). A novel approach for cantonese rumor detection based on deep neural network. 
In: 2020 IEEE International Conference on Systems., Man., and Cybernetics (SMC)., IEEE., pp 1610\u20131615","DOI":"10.1109\/SMC42975.2020.9283056"},{"key":"9744_CR44","unstructured":"Klyueva, N., Long, Y., Huang, CR., & Lu, Q. (2018) Food-related sentiment analysis for Cantonese. In: Proceedings of the PACLIC Joint Workshop on Linguistics and Language Processing"},{"key":"9744_CR45","unstructured":"Kwong, O.O .(2015) Toward a corpus of Cantonese Verbal Comments and their classification by multi-dimensional analysis. In: Proceedings of PACLIC"},{"key":"9744_CR46","doi-asserted-by":"crossref","unstructured":"Lai, H.M. (2004) Becoming Chinese American: A History of Communities and Institutions., vol\u00a013. Rowman Altamira","DOI":"10.5771\/9780759115545"},{"key":"9744_CR47","unstructured":"Lai, R., & Winterstein, G. (2020). Cifu: A frequency Lexicon of Hong Kong Cantonese. In: Proceedings of LREC"},{"key":"9744_CR48","unstructured":"Lau, C.M., Chan, G.W.y., Tse RKw., Chan LSy. (2022a). Words.hk: A comprehensive cantonese dictionary dataset with definitions., Translations and transliterated examples. In: Proceedings of the LREC Workshop on Dataset Creation for Lower-Resourced Languages"},{"key":"9744_CR49","unstructured":"Lau, M., Zhong, M., Lau, C.M., Su, J., Chan, H., Cheung, B. (2022b). Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon. LDC2022L01. Web Download. Philadelphia: Linguistic Data Consortium."},{"key":"9744_CR50","unstructured":"Lee J., Chen L., Lam C., Lau CM., Tsui, TH., (2022). PyCantonese: Cantonese Linguistics and NLP in Python. In: Proceedings of LREC"},{"key":"9744_CR51","unstructured":"Lee, J.S. (2011). Toward a parallel corpus of spoken cantonese and written Chinese. In: Proceedings of IJCNLP"},{"key":"9744_CR52","unstructured":"Lee, J., (2019) An emotion detection system for Cantonese. In: Proceedings of FLAIRS"},{"key":"9744_CR53","unstructured":"Lee, J., Cai, T., Xie, W., & Xing, L. (2020). A counselling corpus in Cantonese. 
In: Proceedings of the LREC Joint Workshop on Spoken Language Technologies for Under-resourced languages and Collaboration and Computing for Under-Resourced Languages"},{"key":"9744_CR54","doi-asserted-by":"crossref","unstructured":"Lee, J.S., Liang, B., & Fong, H. (2021). Restatement and question generation for counsellor chatbot. In: Proceedings of the Workshop on NLP for Positive Impact","DOI":"10.18653\/v1\/2021.nlp4posimpact-1.1"},{"key":"9744_CR55","doi-asserted-by":"publisher","DOI":"10.4324\/9781315707211","volume-title":"Multilingualism online","author":"C Lee","year":"2016","unstructured":"Lee, C. (2016). Multilingualism online. Routledge."},{"issue":"2","key":"9744_CR56","doi-asserted-by":"publisher","first-page":"211","DOI":"10.3406\/clao.1998.1535","volume":"27","author":"T Lee","year":"1998","unstructured":"Lee, T., & Wong, C. (1998). CANCORP: The Hong Kong Cantonese Child Language Corpus. Cahiers de Linguistique Asie Orientale, 27(2), 211\u2013228.","journal-title":"Cahiers de Linguistique Asie Orientale"},{"key":"9744_CR57","first-page":"277","volume":"2","author":"A Lenci","year":"2023","unstructured":"Lenci, A. (2023). Understanding natural language understanding systems. A critical analysis. Sistemi Intelligenti, 2, 277\u2013302.","journal-title":"Sistemi Intelligenti"},{"issue":"2","key":"9744_CR58","doi-asserted-by":"publisher","first-page":"305","DOI":"10.1075\/ijcl.6.2.06leu","volume":"6","author":"MT Leung","year":"2001","unstructured":"Leung, M. T., & Law, S. P. (2001). HKCAC: The Hong Kong Cantonese Adult Language Corpus. International Journal of Corpus Linguistics, 6(2), 305\u2013325.","journal-title":"International Journal of Corpus Linguistics"},{"issue":"3","key":"9744_CR59","doi-asserted-by":"publisher","first-page":"1126","DOI":"10.1016\/j.cognition.2007.05.006","volume":"106","author":"R Levy","year":"2008","unstructured":"Levy, R. (2008). Expectation-Based Syntactic Comprehension. 
Cognition, 106(3), 1126\u20131177.","journal-title":"Cognition"},{"key":"9744_CR60","doi-asserted-by":"crossref","unstructured":"Li, J., Peng, B., Hsu, Y.Y., & Chersoni, E. (2023). Comparing and predicting eye-tracking data of mandarin and Cantonese. In: Proceedings of the EACL Workshop for Similar Languages., Varieties and Dialects","DOI":"10.18653\/v1\/2023.vardial-1.12"},{"issue":"3","key":"9744_CR61","doi-asserted-by":"publisher","first-page":"305","DOI":"10.1111\/1467-971X.00181","volume":"19","author":"DC Li","year":"2000","unstructured":"Li, D. C. (2000). Cantonese-English Code-Switching Research in Hong Kong: A Y2K Review. World Englishes, 19(3), 305\u2013322.","journal-title":"World Englishes"},{"key":"9744_CR62","volume-title":"Maritime silk road","author":"Q Li","year":"2006","unstructured":"Li, Q. (2006). Maritime silk road. Intercontinental Press."},{"key":"9744_CR63","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-44195-5","volume-title":"Multilingual Hong Kong: Languages Literacies and Identities","author":"DC Li","year":"2017","unstructured":"Li, D. C. (2017). Multilingual Hong Kong: Languages Literacies and Identities. Springer."},{"issue":"1","key":"9744_CR64","first-page":"77","volume":"37","author":"DC Li","year":"2009","unstructured":"Li, D. C., & Costa, V. (2009). Punning in Hong Kong Chinese media: Forms and functions. Journal of Chinese Linguistics, 37(1), 77\u2013107.","journal-title":"Journal of Chinese Linguistics"},{"key":"9744_CR65","unstructured":"Liesenfeld, A.M. (2018). MYCanCor: A Video Corpus of Spoken Malaysian Cantonese. In: Proceedings of LREC"},{"key":"9744_CR66","doi-asserted-by":"crossref","unstructured":"Liu ,Y., & Lapata, M. (2019). Text Summarization with Pretrained Encoders. arXiv preprint arXiv:1908.08345","DOI":"10.18653\/v1\/D19-1387"},{"key":"9744_CR67","unstructured":"Liu, E.K.Y. (2022) Low-resource neural machine translation: A Case Study of Cantonese. 
In: Proceedings of the COLING Workshop on NLP for Similar Languages., Varieties and Dialects"},{"key":"9744_CR68","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692"},{"key":"9744_CR69","doi-asserted-by":"publisher","first-page":"726","DOI":"10.1162\/tacl_a_00343","volume":"8","author":"Y Liu","year":"2020","unstructured":"Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., & Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8, 726\u2013742.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"9744_CR70","unstructured":"Luke, K. (1995). Between big words and small talk: The Writing System in Cantonese Paperbacks in Hong Kong."},{"issue":"2015","key":"9744_CR71","first-page":"309","volume":"25","author":"KK Luke","year":"2015","unstructured":"Luke, K. K., & Wong, M. L. (2015). The Hong Kong Cantonese corpus: Design and uses. Journal of Chinese Linguistics, 25(2015), 309\u2013330.","journal-title":"Journal of Chinese Linguistics"},{"key":"9744_CR72","volume-title":"Cantonese: A comprehensive grammar","author":"S Matthews","year":"2011","unstructured":"Matthews, S., & Yip, V. (2011). Cantonese: A comprehensive grammar. Routledge Grammars."},{"key":"9744_CR73","doi-asserted-by":"crossref","unstructured":"Min, J., McCoy, R.T., Das, D., Pitler, E., & Linzen, T. (2020). Syntactic data augmentation increases robustness to inference heuristics. In: Proceedings of ACL","DOI":"10.18653\/v1\/2020.acl-main.212"},{"key":"9744_CR74","unstructured":"Misra, K. (2022) minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models. 
arXiv preprint arXiv:2203.13112"},{"key":"9744_CR75","doi-asserted-by":"crossref","unstructured":"Ng RW., Kwan AC., Lee T., Hain., T. (2017). Shefce: A cantonese-english bilingual speech corpus for pronunciation assessment. In: IEEE International Conference on Acoustics., Speech and Signal Processing (ICASSP)., IEEE., pp 5825\u20135829","DOI":"10.1109\/ICASSP.2017.7953273"},{"key":"9744_CR76","unstructured":"Ngai E., Lee M., Choi Y., Chai, P., (2018) Multiple-domain sentiment classification for cantonese using a combined approach. In: Proceedings of PACIS"},{"key":"9744_CR77","doi-asserted-by":"crossref","unstructured":"Nguyen DQ., Vu T., Nguyen, A.T. (2020). BERTweet: a pre-trained language model for English tweets. arXiv preprint arXiv:2005.10200","DOI":"10.18653\/v1\/2020.emnlp-demos.2"},{"key":"9744_CR78","doi-asserted-by":"crossref","unstructured":"Nivre J., De\u00a0Marneffe MC., Ginter F., Goldberg Y., Hajic J., Manning CD., McDonald R., Petrov S., Pyysalo S., Silveira, N. (2016). Universal dependencies v1: A multilingual treebank collection. In: Proceedings of LREC","DOI":"10.1162\/coli_a_00402"},{"key":"9744_CR79","unstructured":"Ouyang, J. (1993). Putonghua Guangzhouhua De Bijiao Yu Xuexi (The Comparison and Learning of Mandarin and Cantonese)"},{"key":"9744_CR80","doi-asserted-by":"crossref","unstructured":"Pan, J. (2019). The Chinese\/English Political Interpreting Corpus (CEPIC): A New Electronic Resource for Translators and Interpreters. In: Proceedings of the Human-Informed Translation and Interpreting Technology Workshop., pp 82\u201388","DOI":"10.26615\/issn.2683-0078.2019_010"},{"key":"9744_CR81","unstructured":"Parker R., Graff D., Chen K., Kong J., Kazuaki, M. (2011). Chinese Gigaword. In: Web Download. Philadelphia: Linguistic Data Consortium"},{"key":"9744_CR82","doi-asserted-by":"crossref","unstructured":"Pfeiffer J., Goyal N., Lin XV., Li X., Cross J., Riedel S., Artetxe, M. (2022). 
Lifting the curse of Multilinguality by pre-training modular transformers. In: Proceedings of NAACL","DOI":"10.18653\/v1\/2022.naacl-main.255"},{"key":"9744_CR83","doi-asserted-by":"crossref","unstructured":"Pires, T., Schlinger, E., Garrette, D. (2019). How multilingual is multilingual BERT? In: Proceedings of ACL","DOI":"10.18653\/v1\/P19-1493"},{"key":"9744_CR84","unstructured":"Ren X., Zhou P., Meng X., Huang X., Wang Y., Wang W., Li P., Zhang X., Podolskiy A., Arshinov G. et\u00a0al. (2023). PanGu-$$\\Sigma$$: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing. arXiv preprint arXiv:2303.10845"},{"issue":"95","key":"9744_CR85","first-page":"130","volume":"26","author":"GT Sachs","year":"2007","unstructured":"Sachs, G. T., & Li, D. C. (2007). Cantonese as an additional language in Hong Kong. Multilingua, 26(95), 130.","journal-title":"Multilingua"},{"key":"9744_CR86","doi-asserted-by":"crossref","unstructured":"\u015eahin GG., Steedman M (2018) Data augmentation via dependency tree morphing for low-resource languages. In: Proceedings of EMNLP","DOI":"10.18653\/v1\/D18-1545"},{"key":"9744_CR87","doi-asserted-by":"crossref","unstructured":"Salazar, J., Liang, D., Nguyen, TQ., & Kirchhoff, K. (2020) Masked language model scoring. In: Proceedings of ACL","DOI":"10.18653\/v1\/2020.acl-main.240"},{"key":"9744_CR88","unstructured":"Scao TL., Fan A., Akiki C., Pavlick E., Ili\u0107 S., Hesslow D., Castagn\u00e9 R., Luccioni AS., Yvon F., Gall\u00e9 M. et\u00a0al. (2022). BLOOM: A 176B-parameter Open-access Multilingual Language Model. arXiv preprint arXiv:2211.05100"},{"key":"9744_CR89","doi-asserted-by":"crossref","unstructured":"Sedghamiz H., Raval S., Santus E., Alhanai T., Ghassemi, M. (2021). SupCL-Seq: Supervised contrastive learning for downstream optimized sequence representations. 
In: Findings of EMNLP","DOI":"10.18653\/v1\/2021.findings-emnlp.289"},{"issue":"1","key":"9744_CR90","doi-asserted-by":"publisher","first-page":"58","DOI":"10.14569\/SpecialIssue.2014.040109","volume":"4","author":"M Shardlow","year":"2014","unstructured":"Shardlow, M. (2014). A survey of automated text simplification. International Journal of Advanced Computer Science and Applications, 4(1), 58\u201370.","journal-title":"International Journal of Advanced Computer Science and Applications"},{"key":"9744_CR91","doi-asserted-by":"crossref","unstructured":"Shi, H., Livescu, K., & Gimpel, K. (2021). Substructure substitution: Structured data augmentation for NLP. arXiv preprint arXiv:2101.00411","DOI":"10.18653\/v1\/2021.findings-acl.307"},{"key":"9744_CR92","unstructured":"Shliazhko O., Fenogenova A., Tikhonova M., Mikhailov V., Kozlova A., Shavrina, T. (2022) mGPT: Few-shot learners go multilingual. arXiv preprint arXiv:2204.07580"},{"key":"9744_CR93","unstructured":"Sio, J.U.S., Da\u00a0Costa, L.M. (2019) Building the Cantonese Wordnet. In: Proceedings of the Global WordNet Conference"},{"key":"9744_CR94","doi-asserted-by":"publisher","DOI":"10.1515\/9789882200531","volume-title":"Cantonese as written language: The growth of a written Chinese vernacular","author":"D Snow","year":"2004","unstructured":"Snow, D. (2004). Cantonese as written language: The growth of a written Chinese vernacular. Hong Kong University Press."},{"key":"9744_CR95","unstructured":"Taori R., Gulrajani I., Zhang T., Dubois Y., Li X., Guestrin C., Liang P., Hashimoto, T.B. (2023). Stanford alpaca: An instruction-following LLaMA model"},{"key":"9744_CR96","unstructured":"Touvron H., Lavril T., Izacard G., Martinet X., Lachaux MA., Lacroix T., Rozi\u00e8re B., Goyal N., Hambro E., Azhar F., Rodriguez A., Joulin A., Edouard G., Lample, G., (2023a). Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:2302.13971"},{"key":"9744_CR97","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov ,N., Batra, S., Bhargava, P., & Bhosale, S. et\u00a0al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288"},{"key":"9744_CR98","unstructured":"Wang Y., Chen H., Tang Y., Guo T., Han K., Nie Y., Wang X., Hu H., Bai Z., Wang Y. et\u00a0al (2023) PanGu-$$\\pi$$: Enhancing Language Model Architectures via Nonlinearity Compensation. arXiv preprint arXiv:2312.17276"},{"key":"9744_CR99","unstructured":"Wang H., Li M., Zhou Z., Fung GPC., Wong, K.F. (2020). KddRES: A multi-level Knowledge-driven dialogue dataset for restaurant towards customized dialogue system. arXiv preprint arXiv:2011.08772"},{"key":"9744_CR100","doi-asserted-by":"crossref","unstructured":"Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S.R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461","DOI":"10.18653\/v1\/W18-5446"},{"key":"9744_CR101","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1905.00537","author":"A Wang","year":"2019","unstructured":"Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2019). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems. https:\/\/doi.org\/10.48550\/arXiv.1905.00537","journal-title":"Advances in Neural Information Processing Systems"},{"key":"9744_CR102","doi-asserted-by":"crossref","unstructured":"Wei J., Zou, K. (2019) EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of EMNLP-IJCNLP","DOI":"10.18653\/v1\/D19-1670"},{"key":"9744_CR103","unstructured":"Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou ,D., & Metzler, D. et\u00a0al. (2022). 
Emergent abilities of large language models. arXiv preprint arXiv:2206.07682"},{"key":"9744_CR104","unstructured":"Winterstein, G., Vergnaud, D., Lupien, J., Laperle, S., Yu, H., Davis, C., Luk, P.S.Z. (2023). An Empirical, Corpus-based, Approach to Cantonese Nominal Expressions. In: Proceedings of PACLIC"},{"key":"9744_CR105","unstructured":"Wong Ts., Gerdes K., Leung H., Lee, J.S. (2017) Quantitative comparative syntax on the cantonese-mandarin parallel dependency treebank. In: Proceedings of Depling"},{"key":"9744_CR106","unstructured":"Wong Ts., Lee, J.S. (2018). Register-sensitive translation: A case study of mandarin and cantonese. In: Proceedings of the Conference of the Association for Machine Translation in the Americas, pp 89\u201396"},{"issue":"1","key":"9744_CR107","doi-asserted-by":"publisher","first-page":"21","DOI":"10.4018\/jthi.2006010102","volume":"2","author":"PW Wong","year":"2006","unstructured":"Wong, P. W. (2006). The Specification of POS Tagging of the Hong Kong University Cantonese Corpus. International Journal of Technology and Human Interaction, 2(1), 21\u201338.","journal-title":"International Journal of Technology and Human Interaction"},{"key":"9744_CR108","unstructured":"Wu Y., Li X., Lun, S.C. (2006). A structural-based approach to Cantonese-English machine translation. In: International Journal of Computational Linguistics & Chinese Language Processing"},{"key":"9744_CR109","doi-asserted-by":"crossref","unstructured":"Wu, D. (1994). Aligning a parallel English-Chinese corpus statistically with lexical criteria. arXiv preprint cmp-lg\/9406007","DOI":"10.3115\/981732.981744"},{"key":"9744_CR110","doi-asserted-by":"crossref","unstructured":"Xiang R., Chersoni E., Long Y., Lu Q., Huang, C.R. (2020a). Lexical data augmentation for text classification in deep learning. In: Canadian Conference on Artificial Intelligence, Springer","DOI":"10.1007\/978-3-030-47358-7_53"},{"key":"9744_CR111","unstructured":"Xiang R., Jiao Y., Lu, Q. 
(2019). Sentiment-augmented attention Network for Cantonese restaurant review analysis. In: Proceedings of KDD Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM)"},{"key":"9744_CR112","doi-asserted-by":"crossref","unstructured":"Xiang R., Tan H., Li J., Wan M., Wong, K.F. (2022). When Cantonese NLP Meets Pre-training: Progress and Challenges. In: Proceedings of AACL-IJCNLP: Tutorials","DOI":"10.18653\/v1\/2022.aacl-tutorials.3"},{"key":"9744_CR113","doi-asserted-by":"crossref","unstructured":"Xiang R., Wan M., Su Q., Huang CR., Lu, Q. (2020b). Sina Mandarin Alphabetical Words: A Web-driven Code-mixing Lexical Resource. In: Proceedings of AACL-IJCNLP","DOI":"10.18653\/v1\/2020.aacl-main.84"},{"issue":"11","key":"9744_CR114","doi-asserted-by":"publisher","first-page":"1432","DOI":"10.1002\/asi.24493","volume":"72","author":"R Xiang","year":"2021","unstructured":"Xiang, R., Chersoni, E., Lu, Q., Huang, C. R., Li, W., & Long, Y. (2021). Lexical data augmentation for sentiment analysis. Journal of the Association for Information Science and Technology, 72(11), 1432\u20131447.","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"9744_CR115","unstructured":"Yang Z., Xu Z., Cui Y., Wang B., Lin M., Wu D., Chen, Z. (2022). CINO: A Chinese minority pre-trained language model. In: Proceedings of COLING"},{"key":"9744_CR116","unstructured":"Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237"},{"key":"9744_CR117","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511620744","volume-title":"The bilingual child. Early development and language contact","author":"V Yip","year":"2007","unstructured":"Yip, V., & Matthews, S. (2007). The bilingual child. Early development and language contact. 
Cambridge University Press."},{"key":"9744_CR118","first-page":"124","volume-title":"Routledge Handbook of the Chinese Diaspora","author":"H Yu","year":"2013","unstructured":"Yu, H. (2013). Mountains of Gold: Canada, North America, and the Cantonese Pacific. Routledge Handbook of the Chinese Diaspora (pp. 124\u2013137). Routledge."},{"issue":"3","key":"9744_CR119","first-page":"292","volume":"1","author":"A Yue-Hashimoto","year":"1991","unstructured":"Yue-Hashimoto, A. (1991). The Yue dialect. Journal of Chinese Linguistics Monograph Series, 1(3), 292\u2013322.","journal-title":"Journal of Chinese Linguistics Monograph Series"},{"key":"9744_CR120","doi-asserted-by":"crossref","unstructured":"Zhang, X. (1998). Dialect MT: A Case Study between Cantonese and Mandarin. In: Proceedings of COLING","DOI":"10.3115\/980432.980807"},{"issue":"6","key":"9744_CR121","doi-asserted-by":"publisher","first-page":"7674","DOI":"10.1016\/j.eswa.2010.12.147","volume":"38","author":"Z Zhang","year":"2011","unstructured":"Zhang, Z., Ye, Q., Zhang, Z., & Li, Y. (2011). Sentiment classification of internet restaurant reviews written in Cantonese. Expert Systems with Applications, 38(6), 7674\u20137682.","journal-title":"Expert Systems with Applications"},{"key":"9744_CR122","unstructured":"Zhao WX., Zhou K., Li J., Tang T., Wang X., Hou Y., Min Y., Zhang B., Zhang J., Dong Z. et\u00a0al. (2023). A survey of large language models. 
arXiv preprint arXiv:2303.18223"}],"container-title":["Language Resources and Evaluation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-024-09744-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10579-024-09744-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-024-09744-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,18]],"date-time":"2025-05-18T15:03:01Z","timestamp":1747580581000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10579-024-09744-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,8]]},"references-count":122,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6]]}},"alternative-id":["9744"],"URL":"https:\/\/doi.org\/10.1007\/s10579-024-09744-w","relation":{},"ISSN":["1574-020X","1574-0218"],"issn-type":[{"value":"1574-020X","type":"print"},{"value":"1574-0218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,8]]},"assertion":[{"value":"19 April 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 June 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they do not have any competing financial and non-financial interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}