{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,23]],"date-time":"2026-04-23T07:12:50Z","timestamp":1776928370459,"version":"3.51.2"},"reference-count":69,"publisher":"Cambridge University Press (CUP)","issue":"6","license":[{"start":{"date-parts":[[2020,3,9]],"date-time":"2020-03-09T00:00:00Z","timestamp":1583712000000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper proposes a robust text classification and correspondence analysis approach to identification of similar languages. In particular, we propose to use the readily available information of clauses and word length distribution to model similar languages. The modeling and classification are based on the hypothesis that languages are self-adaptive complex systems and hence can be classified by dynamic features describing the system, especially in terms of distributional relations of constituents of a system. For similar languages whose grammatical differences are often subtle, classification based on dynamic system features should be more effective. To test this hypothesis, we considered both regional and genre varieties of Mandarin Chinese for classification. The data are extracted from two comparable balanced corpora to minimize possible confounding factors. The two corpora are the Sinica Corpus from Taiwan and the Lancaster Corpus of Mandarin Chinese from Mainland China, and the two genres are reportage and review. Our text classification and correspondence analysis results show that the linguistically felicitous two-level constituency model combining power functions between word and clauses effectively classifies the two varieties of Chinese for both genres. 
In addition, we found that genres do have compounding effect on classification of regional varieties. In particular, reportage in two varieties is more likely to be classified than review, corroborating the complex system view of language variations. That is, language variations and changes typically do not take place evenly across the board for the complete language system. This further enhances our hypothesis that dynamic complex system features, such as the power functions captured by the Menzerath\u2013Altmann law, provide effective models in classifications of similar languages.<\/jats:p>","DOI":"10.1017\/s1351324920000121","type":"journal-article","created":{"date-parts":[[2020,3,9]],"date-time":"2020-03-09T11:05:01Z","timestamp":1583751901000},"page":"613-640","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":8,"title":["Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora"],"prefix":"10.1017","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2510-6277","authenticated-orcid":false,"given":"Renkui","family":"Hou","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8526-5520","authenticated-orcid":false,"given":"Chu-Ren","family":"Huang","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2020,3,9]]},"reference":[{"key":"S1351324920000121_ref2","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511801686"},{"key":"S1351324920000121_ref59","first-page":"791","volume-title":"Quantitative Linguistics. 
An International Handbook","author":"Wimmer","year":"2005"},{"key":"S1351324920000121_ref19","first-page":"19","article-title":"\u4ee5\u4e2d\u6587\u5341\u5104\u8a5e\u8a9e\u6599\u5eab\u70ba\u57fa\u790e\u4e4b\u5169\u5cb8\u8a5e\u5f59\u5c0d\u6bd4\u7814\u7a76 (Cross-strait lexical differences: A comparative study based on Chinese Gigaword Corpus)","volume":"18","author":"Hong","year":"2013","journal-title":"Computational Linguistics and Chinese Language Processing"},{"key":"S1351324920000121_ref33","doi-asserted-by":"publisher","DOI":"10.1002\/cplx.10030"},{"key":"S1351324920000121_ref52","doi-asserted-by":"publisher","DOI":"10.1162\/089120100750105920"},{"key":"S1351324920000121_ref14","first-page":"256","article-title":"\u7528\u8ba1\u91cf\u65b9\u6cd5\u7814\u7a76\u8bed\u8a00. Foreign","volume":"44","author":"Feng","year":"2012","journal-title":"Language Teaching and Research"},{"key":"S1351324920000121_ref39","volume-title":"Routledge Handbook on Chinese Applied Linguistics","author":"Lin","year":"2018"},{"key":"S1351324920000121_ref68","unstructured":"Zipf, G.K. (1935). The Psycho-Biology of Language: An Introduction to Dynamic Philology. Oxford, England: Houghton, Mifflin."},{"key":"S1351324920000121_ref55","doi-asserted-by":"publisher","DOI":"10.1515\/cllt-2013-0020"},{"key":"S1351324920000121_ref4","first-page":"136","article-title":"The distribution of rhythmic units in German short prose","volume":"3","author":"Best","year":"2002","journal-title":"Glottometrics"},{"key":"S1351324920000121_ref61","unstructured":"Xu, D. (1995). \u5169\u5cb8\u8a5e\u8a9e\u5dee\u7570\u4e4b\u6bd4\u8f03 (Lexical difference between Mainland and Taiwan Chinese). 
1st symposium on Cross-Strait Lexical and Character differences (\u7b2c\u4e00\u5c46\u5169\u5cb8\u6f22\u8a9e\u8a9e\u5f59\u6587\u5b57\u5b78\u8853\u7814\u8a0e \u6703\u8ad6\u6587\u96c6"},{"key":"S1351324920000121_ref46","doi-asserted-by":"publisher","DOI":"10.1080\/09296174.2015.1106269"},{"key":"S1351324920000121_ref67","volume-title":"Lectures on Grammar","author":"Zhu","year":"1982"},{"key":"S1351324920000121_ref62","volume-title":"Perspective","author":"Xu","year":"2019"},{"key":"S1351324920000121_ref45","first-page":"3","article-title":"The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study","volume":"17","author":"McEnery","year":"2004","journal-title":"Religion"},{"key":"S1351324920000121_ref49","first-page":"124","article-title":"Unified modeling of length in language","volume":"2","author":"Popescu","year":"2014","journal-title":"Language"},{"key":"S1351324920000121_ref6","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511519871"},{"key":"S1351324920000121_ref29","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-5301"},{"key":"S1351324920000121_ref15","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1002\/cplx.20296","article-title":"The self-organization of genomes","volume":"15","author":"Ferrer-I-Cancho","year":"2010","journal-title":"Complexity"},{"key":"S1351324920000121_ref51","doi-asserted-by":"publisher","DOI":"10.1111\/j.0039-3193.2004.00109.x"},{"key":"S1351324920000121_ref64","unstructured":"Zampieri, M. , Malmasi, S. , Scherrer, Y. , Samard\u017eic, T. , Tyers, F. , Silfverberg, M.P. , Klyueva, N , Pan, T.L. , Huang, C.R. , Ionescu, R.T. , Butnaru, A. (2019). A Report on the Third VarDial Evaluation Campaign. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019). 
Association for Computational Linguistics, pp.1\u201316."},{"key":"S1351324920000121_ref50","volume-title":"R: A Language and Environment for Statistical Computing","year":"2016"},{"key":"S1351324920000121_ref7","volume-title":"A Grammar of Spoken Chinese","author":"Chao","year":"1968"},{"key":"S1351324920000121_ref1","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-011-1769-2_1"},{"key":"S1351324920000121_ref40","first-page":"97","article-title":"A comparative study of stylistics between \u201cReading News\u201d and \u201cTalking News\u201d","volume":"1","author":"Liu","year":"2011","journal-title":"Language Teaching and Linguistic Studies"},{"key":"S1351324920000121_ref53","unstructured":"\u0160tajner, S. and Mitkov, R. (2012). Style of religious texts in 20th century. In proceedings of the Workshop on Language Resource and Evaluation for Religious Texts (LRE-Rel), held in conjunction with LREC 2012, pp. 81\u20137. 23 May. Istanbul, Turkey."},{"key":"S1351324920000121_ref23","doi-asserted-by":"crossref","unstructured":"Hou, R. , Huang, C.-R. , Ahrens, K. and Lee, Y.S. (2019). Linguistic characteristics of Chinese register based on the Menzerath-Altmann law and text clustering. Digital Scholarship in the Humanities. 
https:\/\/doi.org\/10.1093\/llc\/fqz005","DOI":"10.1093\/llc\/fqz005"},{"key":"S1351324920000121_ref60","first-page":"329","volume-title":"Contributions to the Science of Text and Language","author":"Wimmer","year":"2007"},{"key":"S1351324920000121_ref20","doi-asserted-by":"publisher","DOI":"10.1080\/09296174.2017.1314411"},{"key":"S1351324920000121_ref65","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-53"},{"key":"S1351324920000121_ref8","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/9.4.281"},{"key":"S1351324920000121_ref47","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1515\/9783110362879-011","volume-title":"Sequences in Language and Text","author":"Paw\u0142owski","year":"2015"},{"key":"S1351324920000121_ref66","unstructured":"Zeng, R. (1995). \u5169\u5cb8\u8a9e\u8a00\u8a5e\u5f59\u6574\u7406\u4e4b\u6211\u898b (Opinion on cross-Strait language differences). 1st symposium on Cross-Strait Lexical and Character differences \u7b2c\u4e00\u5c46\u5169\u5cb8\u6f22\u8a9e\u8a9e\u5f59\u6587\u5b57\u5b78\u8853\u7814\u8a0e\u6703 \u8ad6\u6587\u96c6"},{"key":"S1351324920000121_ref44","doi-asserted-by":"publisher","DOI":"10.1155\/2019\/6979830"},{"key":"S1351324920000121_ref17","first-page":"15","volume-title":"Contributions to the Science of Text and Language","author":"Grzybek","year":"2007"},{"key":"S1351324920000121_ref16","doi-asserted-by":"publisher","DOI":"10.1002\/cplx.21429"},{"key":"S1351324920000121_ref5","unstructured":"Best, K.H. (2005). Quantitative Linguistics-An International Handbook, chapter Satzl\u00e4nge (Sentence length), pages 298\u2013304, de Gruyter."},{"key":"S1351324920000121_ref69","doi-asserted-by":"publisher","DOI":"10.1515\/cllt-2019-0049"},{"key":"S1351324920000121_ref56","unstructured":"Wang, T. and Li, X. (1996). 
\u5169\u5cb8\u8a5e\u5f59\u6bd4\u8f03\u7814\u7a76\u7ba1\u898b (Research on lexical differences between Mainland and Taiwan Mandarin), World Chinese (\u3008\u83ef\u6587\u4e16\u754c\u3009), volume 81."},{"key":"S1351324920000121_ref63","volume-title":"Studies in Natural Language Processing book series","author":"Zampieri","year":"2019"},{"key":"S1351324920000121_ref3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/j.1467-9922.2009.00533.x","article-title":"Language is a complex adaptive system: Position paper","volume":"59","author":"Beckner","year":"2009","journal-title":"Language learning"},{"key":"S1351324920000121_ref32","unstructured":"Jauhiainen, T. , Lind\u00e9n, K. and Jauhiainen, H. (2019). Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 178\u2013187."},{"key":"S1351324920000121_ref9","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-010-0201-1_13"},{"key":"S1351324920000121_ref11","unstructured":"Christensen, M. (1994). Variation in Spoken and Written Mandarin Narrative Discourse. Ph.D. thesis. Ohio State University, Columbus."},{"key":"S1351324920000121_ref10","first-page":"167","volume-title":"Proceedings of the 11th Pacific Asia Conference on Language, Information and Computation","author":"Chen","year":"1996"},{"key":"S1351324920000121_ref34","doi-asserted-by":"publisher","DOI":"10.1515\/9783110272925"},{"key":"S1351324920000121_ref12","first-page":"29","article-title":"A quantitative linguistic study on the relationship between word length and word frequency","volume":"36","author":"Deng","year":"2013","journal-title":"Journal of Foreign Language"},{"key":"S1351324920000121_ref27","unstructured":"Huang, C.R. , Chen, K.J. , and Gao, Z.M. (1998). Noun class extraction from a corpus-based collocation dictionary: An integration of computational and qualitative approaches. In B. Tsou et al. 
(eds.), Quantitative and Computational Studies of Chinese Linguistics 339\u2013352. Hong Kong: City University of Hong Kong."},{"key":"S1351324920000121_ref28","unstructured":"Huang, C.-R. and Lee, L.H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation (pp. 404\u2013410)."},{"key":"S1351324920000121_ref43","unstructured":"Malmasi, S. , Zampieri, M. , Ljube\u0161i\u0107, N. , Nakov, P. , Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and arabic dialect identification: A report on the third dsl shared task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 1\u201314."},{"key":"S1351324920000121_ref18","first-page":"221","article-title":"\u8a9e\u6599\u5eab\u70ba\u672c\u7684\u5169\u5cb8\u5c0d\u61c9\u8a5e\u5f59\u767c\u6398. (A corpus-based approach to the discovery of cross-strait lexical contrasts)","volume":"9","author":"Hong","year":"2008","journal-title":"Language and Linguistics"},{"key":"S1351324920000121_ref24","doi-asserted-by":"publisher","DOI":"10.1080\/09296174.2014.911508"},{"key":"S1351324920000121_ref22","doi-asserted-by":"publisher","DOI":"10.1017\/S135132491900010X"},{"key":"S1351324920000121_ref31","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139028462"},{"key":"S1351324920000121_ref54","volume-title":"The Fields of Linguistics","author":"Thomason","year":"1997"},{"key":"S1351324920000121_ref42","unstructured":"Lu, J. (1993). The features of Chinese sentences. 
Chinese Language Learning No.1, 1\u20136."},{"key":"S1351324920000121_ref48","doi-asserted-by":"crossref","first-page":"469","DOI":"10.1017\/S1351324910000161","article-title":"The automatic identification of lexical variation between language varieties","volume":"16","author":"Peirsman","year":"2010","journal-title":"Natural Language Engineering"},{"key":"S1351324920000121_ref30","unstructured":"Huang, C.-R. and Lin, J. (2013). The ordering of Mandarin Chinese light verbs. In Proceedings of the 13th Chinese Lexical Semantics Workshop. D. Ji and G. Xiao (Eds.) CLSW 2012, LNAI 7717, pages 728\u2013735. Heidelberg: Springer."},{"key":"S1351324920000121_ref35","unstructured":"Kroch, T. (1994). Morphosyntactic variation. In Beals K., Denton J., Knippen R., Melnar L., Suzuki H. and Zeinfeld E. (eds.), Papers from the 30th regional meeting of the Chicago Linguistics Society: Parasession on variation and linguistic theory, vol. 2. Chicago: Chicago Linguistics Society, pp. 180\u2013201."},{"key":"S1351324920000121_ref57","first-page":"5","article-title":"Language is a complex adaptive system\u8bed\u8a00\u662f\u4e00\u4e2a\u590d\u6742\u9002\u5e94\u7cfb\u7edf","volume":"21","author":"Wang","year":"2006","journal-title":"Journal of Tsinghua University (Philosophy and Social Science)"},{"key":"S1351324920000121_ref36","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511792519"},{"key":"S1351324920000121_ref41","first-page":"178","article-title":"Quantitative linguistics: State of the art, theories and methods","volume":"42","author":"Liu","year":"2012","journal-title":"Journal of Zhejiang University (Humanities and Social 
Sciences)"},{"key":"S1351324920000121_ref38","doi-asserted-by":"publisher","DOI":"10.1002\/cplx.20398"},{"key":"S1351324920000121_ref21","doi-asserted-by":"publisher","DOI":"10.1515\/cllt-2016-0062"},{"key":"S1351324920000121_ref37","doi-asserted-by":"publisher","DOI":"10.2307\/412333"},{"key":"S1351324920000121_ref13","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1002\/cplx.21498","article-title":"Language-like behavior of protein length distribution in proteomes","volume":"20","author":"Eroglu","year":"2014","journal-title":"Complexity"},{"key":"S1351324920000121_ref58","doi-asserted-by":"publisher","DOI":"10.2307\/411748"},{"key":"S1351324920000121_ref26","first-page":"47","article-title":"On the mathematical properties of Mandarin Chinese \u8a66\u8ad6\u6f22\u8a9e\u7684\u6578\u5b78\u898f\u7bc4\u6027\u8cea","volume":"60","author":"Huang","year":"1989","journal-title":"Bulletin of the Institute of History and Philology"},{"key":"S1351324920000121_ref25","unstructured":"Hu, H. , Li, W. , Zhou, H. , Tian, Z. , Zhang, Y. , and Zou, L. (2019). Ensemble Methods to Distinguish Mainland and Taiwan Chinese. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 
165\u2013171."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324920000121","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,1]],"date-time":"2024-08-01T21:07:45Z","timestamp":1722546465000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324920000121\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,9]]},"references-count":69,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["S1351324920000121"],"URL":"https:\/\/doi.org\/10.1017\/s1351324920000121","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,9]]},"assertion":[{"value":"\u00a9 Cambridge University Press 2020","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}