{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,22]],"date-time":"2025-07-22T11:09:11Z","timestamp":1753182551882,"version":"3.41.0"},"reference-count":23,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2008,5,1]],"date-time":"2008-05-01T00:00:00Z","timestamp":1209600000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Speech Lang. Process."],"published-print":{"date-parts":[[2008,5]]},"abstract":"<jats:p>Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translation (SMT) and its performance has an impact on the results of SMT. However, there are many choices involved in creating a CWS system such as various specifications and CWS methods. The choices made will create a new CWS scheme, but whether it will produce a superior or inferior translation has remained unknown to date. This article examines the relationship between CWS and SMT. The effects of CWS on SMT were investigated using different specifications and CWS methods. Four specifications were selected for investigation: Beijing University (PKU), Hong Kong City University (CITYU), Microsoft Research (MSR), and Academia SINICA (AS). We created 16 CWS schemes under different settings to examine the relationship between CWS and SMT. Our experimental results showed that the MSR's specifications produced the lowest quality translations. In examining the effects of CWS methods, we tested dictionary-based and CRF-based approaches and found there was no significant difference between the two in the quality of the resulting translations. We also found the correlation between the CWS F-score and SMT BLEU score was very weak. 
We analyzed CWS errors and their effect on SMT by evaluating systems trained with and without these errors. This article also proposes two methods for combining advantages of different specifications: a simple concatenation of training data and a feature interpolation approach in which the same types of features of translation models from various CWS schemes are linearly interpolated. We found these approaches were very effective in improving the quality of translations.<\/jats:p>","DOI":"10.1145\/1363108.1363109","type":"journal-article","created":{"date-parts":[[2008,6,3]],"date-time":"2008-06-03T15:11:43Z","timestamp":1212505903000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Chinese word segmentation and statistical machine translation"],"prefix":"10.1145","volume":"5","author":[{"given":"Ruiqiang","family":"Zhang","sequence":"first","affiliation":[{"name":"National Institute of Information and Communications Technology, Kyoto, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Keiji","family":"Yasuda","sequence":"additional","affiliation":[{"name":"National Institute of Information and Communications Technology, Kyoto, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eiichiro","family":"Sumita","sequence":"additional","affiliation":[{"name":"National Institute of Information and Communications Technology, Kyoto, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2008,5,29]]},"reference":[{"volume-title":"Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing","year":"2005","author":"Emerson T.","key":"e_1_2_1_1_1"},{"volume-title":"Proceedings of the 2nd Workshop on Statistical Machine Translation. 
Prague, Czech Republic, Association for Computational Linguistics, 128--135","author":"Foster G.","key":"e_1_2_1_2_1"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1162\/coli.2007.33.3.293"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.3115\/1218955.1219014"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.3115\/1220575.1220660"},{"volume-title":"Proceedings of Machine Translation Evaluation Workshop.","author":"Koehn P.","key":"e_1_2_1_6_1"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073445.1073462"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073336.1073361"},{"volume-title":"Proceedings of the ICML. 591--598","author":"Lafferty J.","key":"e_1_2_1_9_1"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/1613984.1613999"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.3115\/1075096.1075117"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"volume-title":"Proceedings of the IWSLT","year":"2006","author":"Paul M.","key":"e_1_2_1_14_1"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.3115\/1220355.1220436"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220176"},{"volume-title":"Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing","author":"Tseng H.","key":"e_1_2_1_17_1"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.3115\/1067807.1067853"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.3115\/1119250.1119278"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.3115\/1119250.1119280"},{"volume-title":"Proceedings of the HLT-NAACL (Companion Volume: Short Papers)","author":"Zhang R.","key":"e_1_2_1_21_1"},{"volume-title":"Proceedings of the 3rd International Joint Conference on Natural Language Processing 
(IJCNLP)","author":"Zhang R.","key":"e_1_2_1_22_1"},{"key":"e_1_2_1_23_1","unstructured":"Zhang Y. Vogel S. and Waibel A. 2004. Interpreting BLEU\/NIST scores: How much improvement do we need to have a better system? In Proceedings of the International Conference on Language Resources and Evaluation (LREC)."}],"container-title":["ACM Transactions on Speech and Language Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1363108.1363109","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1363108.1363109","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T15:13:48Z","timestamp":1750259628000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1363108.1363109"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,5]]},"references-count":23,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2008,5]]}},"alternative-id":["10.1145\/1363108.1363109"],"URL":"https:\/\/doi.org\/10.1145\/1363108.1363109","relation":{},"ISSN":["1550-4875","1550-4883"],"issn-type":[{"type":"print","value":"1550-4875"},{"type":"electronic","value":"1550-4883"}],"subject":[],"published":{"date-parts":[[2008,5]]},"assertion":[{"value":"2007-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2008-05-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}