{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T17:53:19Z","timestamp":1754157199023,"version":"3.41.2"},"reference-count":26,"publisher":"Emerald","issue":"3","license":[{"start":{"date-parts":[[2007,5,1]],"date-time":"2007-05-01T00:00:00Z","timestamp":1177977600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2007,5,1]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-heading\">Purpose<\/jats:title><jats:p>The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Design\/methodology\/approach<\/jats:title><jats:p>Na\u00efve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. 
A segmentation\u2010based approach was compared with the non\u2010segmentation\u2010based approach.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Findings<\/jats:title><jats:p>There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Practical implications<\/jats:title><jats:p>Applying the findings to real web text classification is ongoing work.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Originality\/value<\/jats:title><jats:p>The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.<\/jats:p><\/jats:sec>","DOI":"10.1108\/00220410710743306","type":"journal-article","created":{"date-parts":[[2007,4,20]],"date-time":"2007-04-20T11:01:21Z","timestamp":1177066881000},"page":"378-397","source":"Crossref","is-referenced-by-count":11,"title":["Machine learning for Asian language text classification"],"prefix":"10.1108","volume":"63","author":[{"given":"Fuchun","family":"Peng","sequence":"first","affiliation":[]},{"given":"Xiangji","family":"Huang","sequence":"additional","affiliation":[]}],"member":"140","reference":[{"key":"key2022032020354366100_b1","unstructured":"Aizawa, A. 
(2001), \u201cLinguistic techniques to improve the performance of automatic text categorization\u201d, Proceedings of the 6th Natural Language Processing Pacific Rim Symposium."},{"key":"key2022032020354366100_b2","unstructured":"Beaulieu, M., Gatford, M., Huang, X., Robertson, S., Walker, S. and Williams, P. (1997), \u201cOkapi at TREC\u20105\u201d, in Harman, D.K. (Ed.), Proceedings of TREC\u20105."},{"key":"key2022032020354366100_b3","unstructured":"Burges, C. (1998), \u201cA tutorial on support vector machines for pattern recognition\u201d, Data Mining and Knowledge Discovery, Vol. 2 No. 2, pp. 955\u201074."},{"key":"key2022032020354366100_b4","doi-asserted-by":"crossref","unstructured":"Byrd, R.H., Lu, P. and Nocedal, J. (1995), \u201cA limited memory algorithm for bound constrained optimization\u201d, SIAM Journal on Scientific and Statistical Computing, Vol. 16 No. 5, pp. 1190\u2010208.","DOI":"10.1137\/0916069"},{"key":"key2022032020354366100_b5","unstructured":"Cavnar, W. and Trenkle, J. (1994), \u201cN\u2010gram\u2010based text categorization\u201d, Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR)."},{"key":"key2022032020354366100_b6","unstructured":"Chen, S. and Goodman, J. (1998), An Empirical Study of Smoothing Techniques for Language Modeling, Technical Report TR\u201010\u201098, Harvard University, Boston, MA."},{"key":"key2022032020354366100_b7","doi-asserted-by":"crossref","unstructured":"Cleary, J. and Witten, I. (1984), \u201cData compression using adaptive coding and partial string matching\u201d, IEEE Transactions on Communications, Vol. 32 No. 4, pp. 396\u2010402.","DOI":"10.1109\/TCOM.1984.1096090"},{"key":"key2022032020354366100_b8","doi-asserted-by":"crossref","unstructured":"Damashek, M. (1995), \u201cGauging similarity with n\u2010grams: language\u2010independent categorization of text?\u201d, Science, Vol. 267 No. 10, pp. 
843\u20108.","DOI":"10.1126\/science.267.5199.843"},{"key":"key2022032020354366100_b9","doi-asserted-by":"crossref","unstructured":"Dumais, S., Platt, J., Heckerman, D. and Sahami, M. (1998), \u201cInductive learning algorithms and representations for text categorization\u201d, Proceedings of the 7th International Conference on Information and Knowledge Management.","DOI":"10.1145\/288627.288651"},{"key":"key2022032020354366100_b10","unstructured":"Eyheramendy, S., Lewis, D. and Madigan, D. (2003), \u201cOn the naive Bayes model for text categorization\u201d, Proceedings of the 9th International Conference on Artificial Intelligence and Statistics (AISTATS)."},{"key":"key2022032020354366100_b11","unstructured":"He, J., Tan, A. and Tan, C. (2003), \u201cOn machine learning methods for Chinese documents classification\u201d, Applied Intelligence, Special Issue on Text and Web Mining, Vol. 18, pp. 311\u201022."},{"key":"key2022032020354366100_b12","doi-asserted-by":"crossref","unstructured":"Huang, X., Peng, F., Schuurmans, D., Cercone, N. and Robertson, S. (2003), \u201cApplying machine learning for text segmentation in information retrieval\u201d, Information Retrieval, Vol. 6 Nos 3\u20104, pp. 333\u201062.","DOI":"10.1023\/A:1026028229881"},{"key":"key2022032020354366100_b13","doi-asserted-by":"crossref","unstructured":"Joachims, T. (1998), \u201cText categorization with support vector machines: learning with many relevant features\u201d, Proceedings of the 10th European Conference on Machine Learning (ECML).","DOI":"10.1007\/BFb0026683"},{"key":"key2022032020354366100_b15","unstructured":"McCallum, A. and Nigam, K. (1998), \u201cA comparison of event models for naive Bayes text classification\u201d, Proceedings of AAAI\u201098 Workshop on Learning for Text Categorization."},{"key":"key2022032020354366100_b14","doi-asserted-by":"crossref","unstructured":"Malouf, R. 
(2002), \u201cA comparison of algorithms for maximum entropy parameter estimation\u201d, Proceedings of the 6th Conference on Natural Language Learning.","DOI":"10.3115\/1118853.1118871"},{"key":"key2022032020354366100_b17","doi-asserted-by":"crossref","unstructured":"Peng, F. and Schuurmans, D. (2001), \u201cSelf\u2010supervised Chinese word segmentation\u201d, Advances in Intelligent Data Analysis: Proceedings of the 4th International Conference, pp. 238\u201047.","DOI":"10.1007\/3-540-44816-0_24"},{"key":"key2022032020354366100_b16","doi-asserted-by":"crossref","unstructured":"Peng, F., Huang, X., Schuurmans, D. and Cercone, N. (2002), \u201cInvestigating the relationship of word segmentation performance and retrieval performance in Chinese IR\u201d, Proceedings of the 19th International Conference on Computational Linguistics (COLING), pp. 793\u20109.","DOI":"10.3115\/1072228.1072376"},{"key":"key2022032020354366100_b18","unstructured":"Pietra, S., Pietra, V. and Lafferty, J. (1995), \u201cInducing features of random fields\u201d, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19 No. 4."},{"key":"key2022032020354366100_b19","doi-asserted-by":"crossref","unstructured":"Rabiner, L. (1989), \u201cA tutorial on hidden Markov models and selected applications in speech recognition\u201d, Proceedings of IEEE, Vol. 77 No. 2, pp. 257\u201086.","DOI":"10.1109\/5.18626"},{"key":"key2022032020354366100_b20","unstructured":"Scott, S. and Matwin, S. (1999), \u201cFeature engineering for text classification\u201d, Proceedings of 16th International Conference on Machine Learning (ICML)."},{"key":"key2022032020354366100_b21","doi-asserted-by":"crossref","unstructured":"Sebastiani, F. (2002), \u201cMachine learning in automated text categorization\u201d, ACM Computing Surveys, Vol. 34 No. 1.","DOI":"10.1145\/505282.505283"},{"key":"key2022032020354366100_b22","unstructured":"Teahan, W. and Harper, D. 
(2001), \u201cUsing compression\u2010based language models for text categorization\u201d, Proceedings of the Workshop on Language Models for Information Retrieval (LMIR)."},{"key":"key2022032020354366100_b23","doi-asserted-by":"crossref","unstructured":"Teahan, W., Wen, Y., McNab, R. and Witten, I.H. (2001), \u201cA compression\u2010based algorithm for Chinese word segmentation\u201d, Computational Linguistics, Vol. 26 No. 3, p. 2001.","DOI":"10.1162\/089120100561746"},{"key":"key2022032020354366100_b24","doi-asserted-by":"crossref","unstructured":"Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer\u2010Verlag, Berlin.","DOI":"10.1007\/978-1-4757-2440-0"},{"key":"key2022032020354366100_b25","unstructured":"Yang, Y. (1999), \u201cAn evaluation of statistical approaches to text categorization\u201d, Information Retrieval, Vol. 1 Nos 1\/2."},{"key":"key2022032020354366100_b26","doi-asserted-by":"crossref","unstructured":"Yang, Y. and Liu, X. (1999), \u201cA re\u2010examination of text categorization methods\u201d, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).","DOI":"10.1145\/312624.312647"}],"container-title":["Journal of 
Documentation"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/www.emeraldinsight.com\/doi\/full-xml\/10.1108\/00220410710743306","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/00220410710743306\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/00220410710743306\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T23:37:48Z","timestamp":1753400268000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/jd\/article\/63\/3\/378-397\/203906"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,5,1]]},"references-count":26,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2007,5,1]]}},"alternative-id":["10.1108\/00220410710743306"],"URL":"https:\/\/doi.org\/10.1108\/00220410710743306","relation":{},"ISSN":["0022-0418"],"issn-type":[{"type":"print","value":"0022-0418"}],"subject":[],"published":{"date-parts":[[2007,5,1]]}}}