{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T22:32:40Z","timestamp":1757543560223},"reference-count":37,"publisher":"Cambridge University Press (CUP)","issue":"2","license":[{"start":{"date-parts":[[2019,2,11]],"date-time":"2019-02-11T00:00:00Z","timestamp":1549843200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2019,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words by space. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be considered as a part-of-speech (POS)-tagging problem. We can segment corpus by assigning a label for each character which indicates the position of the character in a word (e.g., \u201cB\u201d for word beginning, and \u201cE\u201d for the end of the word, etc.). Chinese word segmentation seems to be well studied. Machine learning models such as conditional random field (CRF) and bi-directional long short-term memory (LSTM) have shown outstanding performances on this task. However, the segmentation accuracies drop significantly when applying the same approaches to out-domain cases, in which high-quality in-domain training data are not available. An example of out-domain applications is the new word detection in Chinese microblogs for which the availability of high-quality corpus is limited. In this paper, we focus on out-domain Chinese new word detection. We first design a new method <jats:italic>Edge Likelihood (EL)<\/jats:italic> for Chinese word boundary detection. Then we propose a domain-independent Chinese new word detector (DICND); each Chinese character is represented as a low-dimensional vector in the proposed framework, and segmentation-related features of the character are used as the values in the vector.<\/jats:p>","DOI":"10.1017\/s1351324918000463","type":"journal-article","created":{"date-parts":[[2019,2,10]],"date-time":"2019-02-10T23:52:28Z","timestamp":1549842748000},"page":"239-255","source":"Crossref","is-referenced-by-count":9,"title":["Out-domain Chinese new word detection with statistics-based character embedding"],"prefix":"10.1017","volume":"25","author":[{"given":"Yuzhi","family":"Liang","sequence":"first","affiliation":[]},{"given":"Min","family":"Yang","sequence":"additional","affiliation":[]},{"given":"Jia","family":"Zhu","sequence":"additional","affiliation":[]},{"given":"S. M.","family":"Yiu","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2019,2,11]]},"reference":[{"key":"S1351324918000463_ref23","first-page":"901","volume-title":"International Conference on Computer Processing of Oriental Languages","author":"Qiu","year":"2016"},{"key":"S1351324918000463_ref29","first-page":"29","article-title":"Chinese word segmentation as character tagging","volume":"8","author":"Xue","year":"2003","journal-title":"Computational Linguistics and Chinese Language Processing"},{"key":"S1351324918000463_ref19","first-page":"3111","article-title":"Distributed representations of words and phrases and their compositionality","author":"Mikolov","year":"2013","journal-title":"Advances in Neural Information Processing Systems"},{"key":"S1351324918000463_ref3","doi-asserted-by":"publisher","DOI":"10.3115\/1626394.1626430"},{"key":"S1351324918000463_ref17","first-page":"3","volume-title":"The Interpretation of Modern Chinese Verbs","author":"Miao","year":"2011"},{"key":"S1351324918000463_ref16","first-page":"591","volume-title":"ICML","volume":"17","author":"McCallum","year":"2000"},{"key":"S1351324918000463_ref13","unstructured":"Li, Y. , Li, W. , Sun, F. and Li, S. (2015). Component-enhanced Chinese character embeddings. arXiv preprint arXiv:1508.06669."},{"key":"S1351324918000463_ref32","first-page":"41","volume-title":"Second CIPS-SIGHAN Joint Conference on Chinese Language Processing","author":"Zhang","year":"2012a"},{"key":"S1351324918000463_ref10","first-page":"1","volume-title":"Proceedings of the CoNLL99 ACL Workshop","author":"Kityz","year":"1999"},{"key":"S1351324918000463_ref25","first-page":"279","volume-title":"International Conference on Neural Information Processing","author":"Sun","year":"2014"},{"key":"S1351324918000463_ref8","first-page":"531","volume-title":"ACL (1)","author":"Huang","year":"2014"},{"key":"S1351324918000463_ref14","first-page":"864","volume-title":"EMNLP","author":"Liu","year":"2014"},{"key":"S1351324918000463_ref2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-2096"},{"key":"S1351324918000463_ref21","first-page":"562","volume-title":"Proceedings of the 20th international conference on Computational Linguistics","author":"Peng","year":"2004"},{"key":"S1351324918000463_ref18","unstructured":"Mikolov, T. , Chen, K. , Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781."},{"key":"S1351324918000463_ref26","first-page":"51","volume-title":"Association for Computational Linguistics","author":"Wang","year":"2012"},{"key":"S1351324918000463_ref1","first-page":"409","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics","author":"Cai","year":"2016"},{"key":"S1351324918000463_ref12","first-page":"854","volume-title":"International Conference on Computer Processing of Oriental Languages","author":"Leng","year":"2016"},{"key":"S1351324918000463_ref31","doi-asserted-by":"publisher","DOI":"10.3115\/1119250.1119280"},{"key":"S1351324918000463_ref33","first-page":"8","article-title":"Combining statistical model and dictionary for domain adaption of Chinese word segmentation","volume":"26","author":"Zhang","year":"2012b","journal-title":"Journal of Chinese Information Processing"},{"key":"S1351324918000463_ref22","first-page":"2185","volume-title":"Proceedings of the 54th international conference on Computational Linguistics","volume":"1","author":"Qian","year":"2016"},{"key":"S1351324918000463_ref20","first-page":"293","volume-title":"ACL (1)","author":"Pei","year":"2014"},{"key":"S1351324918000463_ref7","first-page":"210","volume-title":"Proceedings of CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010)","author":"Gao","year":"2010"},{"key":"S1351324918000463_ref6","first-page":"694","volume-title":"International Conference on Natural Language Processing","author":"Feng","year":"2004"},{"key":"S1351324918000463_ref24","doi-asserted-by":"publisher","DOI":"10.3115\/1119250.1119269"},{"key":"S1351324918000463_ref4","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1141"},{"key":"S1351324918000463_ref36","first-page":"162","volume-title":"IWPT \u201909 Proceedings of the 11th International Conference on Parsing Technologies","author":"Zhang","year":"2007"},{"key":"S1351324918000463_ref15","doi-asserted-by":"publisher","DOI":"10.3115\/1119250.1119254"},{"key":"S1351324918000463_ref11","first-page":"282","volume-title":"Proceedings of the Eighteenth International Conference on Machine Learning","author":"Lafferty","year":"2001"},{"key":"S1351324918000463_ref30","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46681-1_42"},{"key":"S1351324918000463_ref27","first-page":"309","volume-title":"IJCNLP","author":"Wang","year":"2011"},{"key":"S1351324918000463_ref34","first-page":"421","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics","author":"Zhang","year":"2016"},{"key":"S1351324918000463_ref37","first-page":"647","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Zheng","year":"2013"},{"key":"S1351324918000463_ref28","first-page":"711","volume-title":"International Conference on Computer Processing of Oriental Languages","author":"Xia","year":"2016"},{"key":"S1351324918000463_ref5","doi-asserted-by":"publisher","DOI":"10.1016\/S0959-440X(96)80056-X"},{"key":"S1351324918000463_ref35","first-page":"4","article-title":"Chinese word segmentation and statistical machine translation","volume":"5","author":"Zhang","year":"2008","journal-title":"ACM Transactions on Speech and Language Processing (TSLP)"},{"key":"S1351324918000463_ref9","doi-asserted-by":"publisher","DOI":"10.3115\/1273073.1273129"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324918000463","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,4,11]],"date-time":"2019-04-11T20:36:49Z","timestamp":1555015009000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324918000463\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,2,11]]},"references-count":37,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2019,3]]}},"alternative-id":["S1351324918000463"],"URL":"https:\/\/doi.org\/10.1017\/s1351324918000463","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,2,11]]}}}