{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T00:27:35Z","timestamp":1777854455760,"version":"3.51.4"},"reference-count":39,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2012,6,6]],"date-time":"2012-06-06T00:00:00Z","timestamp":1338940800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Journal of Information Science"],"published-print":{"date-parts":[[2012,8]]},"abstract":"<jats:p>There is now a huge amount of electronic documents stored on the internet. In order to retrieve information from this data, each document is commonly represented as a set of keywords, and then all documents are analysed based on the set of discriminative words. In information retrieval the recognition of words in articles is an essential step; however, unlike English, Chinese words are not distinguished by spaces. Therefore, many approaches have been devised to parse Chinese words. The dictionary-based approach is commonly used in most current systems for text segmentation. However, general purpose dictionaries are not always able to provide proper references to accurately parse the domain-specific words, especially with unknown words. This paper aims to propose a new method for classifying longer keywords from Chinese documents by incorporating previously unknown keywords into a keyword list without the effort of building domain-specific dictionaries. Our method first utilizes the parsed words from existing parsers and filters the keywords utilizing term frequency\u2013inverse document frequency (TF-IDF) values; further, based on the parsed words and keywords, a T tree is used to store the candidates for composing unknown words. 
The candidates are evaluated by an unknown word (UW) coefficient threshold, i.e. newly composed words are deemed as newly discovered unknown words if their UW coefficient is higher than a pre-defined threshold. Finally, the parsed words and newly composed words are re-filtered to form long keywords. The results of several experiments comparing the results with Google and Yahoo show that, regardless of recall rates, precision rates and F-measures, our proposed method significantly outperforms other methods.<\/jats:p>","DOI":"10.1177\/0165551512442481","type":"journal-article","created":{"date-parts":[[2012,6,6]],"date-time":"2012-06-06T22:21:32Z","timestamp":1339021292000},"page":"366-382","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":5,"title":["A new method to compose long unknown Chinese keywords"],"prefix":"10.1177","volume":"38","author":[{"given":"Yu-Chin","family":"Liu","sequence":"first","affiliation":[{"name":"Shih Hsin University, Taiwan, R.O.C."}]},{"given":"Chun-Wei","family":"Lin","sequence":"additional","affiliation":[{"name":"Wistron Corporation, Taiwan, R.O.C."}]}],"member":"179","published-online":{"date-parts":[[2012,6,6]]},"reference":[{"key":"bibr1-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1145\/361219.361220"},{"key":"bibr2-0165551512442481","volume-title":"Introduction to modern information retrieval","author":"Salton G","year":"1983"},{"key":"bibr3-0165551512442481","first-page":"627","volume-title":"Proceedings of the 17th national conference on artificial intelligence (AAAI)","author":"Nahm UY"},{"key":"bibr4-0165551512442481","first-page":"547","volume":"37","author":"Jaccard P","year":"1901","journal-title":"Bulletin de la Soci\u00e9t\u00e9 Vaudoise des Sciences Naturelles"},{"key":"bibr5-0165551512442481","unstructured":"Internet World Stats, \u2018Internet World Users By Language: Top 10 Languages\u2019, 
http:\/\/www.internetworldstats.com\/stats7.htm (2011, accessed May 2011)."},{"key":"bibr6-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1145\/243199.243270"},{"key":"bibr7-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(199310)44:9<532::AID-ASI3>3.0.CO;2-M"},{"key":"bibr8-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(199503)46:2<83::AID-ASI2>3.0.CO;2-0"},{"key":"bibr9-0165551512442481","unstructured":"Chinese Knowledge Information processing (CKIP): the categorical analysis of Chinese. Technical Report; 1993; 05; Taipei: Institute of Information Science Academia Sinica."},{"issue":"3","key":"bibr10-0165551512442481","first-page":"377","volume":"22","author":"Sproat R","year":"1996","journal-title":"Computer Linguistics"},{"key":"bibr11-0165551512442481","first-page":"180","volume-title":"Proceedings of the fourth conference on applied natural language processing","author":"Wu D"},{"key":"bibr12-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1162\/089120100561746"},{"key":"bibr13-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1145\/355214.355235"},{"key":"bibr14-0165551512442481","first-page":"112","volume-title":"Proceedings 1st international conference on knowledge discovery and data mining","author":"Feldman R"},{"key":"bibr15-0165551512442481","volume-title":"Document warehousing and text mining: techniques for improving business operations, marketing, and sales","author":"Sullivan D","year":"2001"},{"key":"bibr16-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(02)00079-1"},{"key":"bibr17-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1142\/S0219427905001286"},{"issue":"4","key":"bibr18-0165551512442481","first-page":"336","volume":"4","author":"Sproat R","year":"1990","journal-title":"Computer Processing of Chinese and Oriental 
Languages"},{"issue":"2","key":"bibr19-0165551512442481","first-page":"97","volume":"5","author":"Yeh CL","year":"1991","journal-title":"Computer Processing of Chinese and Oriental Languages"},{"key":"bibr20-0165551512442481","doi-asserted-by":"publisher","DOI":"10.20965\/jaciii.2007.p0416"},{"key":"bibr21-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1016\/j.datak.2006.04.001"},{"key":"bibr22-0165551512442481","first-page":"2612","volume-title":"Proceedings of international conference on machine learning and cybernetics","author":"Fu GH"},{"key":"bibr23-0165551512442481","first-page":"131","volume":"8","author":"Tung CH","year":"1994","journal-title":"Computer Processing of Chinese & Oriental Languages"},{"key":"bibr24-0165551512442481","first-page":"113","volume-title":"Festschrift for Professor Akira Ikeya","author":"Wang MC","year":"1995"},{"key":"bibr25-0165551512442481","first-page":"197","volume-title":"Proceedings of the 41st annual meeting of the Association for Computational Linguistics (ACL)","volume":"2","author":"Ling GC"},{"key":"bibr26-0165551512442481","volume-title":"Proceedings of the 19th international conference on Computational linguistics (COLING I)","author":"Chen KJ"},{"issue":"1","key":"bibr27-0165551512442481","first-page":"76","volume":"16","author":"Church K","year":"1990","journal-title":"Computational Linguistics"},{"issue":"1","key":"bibr28-0165551512442481","first-page":"143","volume":"19","author":"Smadja F","year":"1993","journal-title":"Computational Linguistics"},{"issue":"4","key":"bibr29-0165551512442481","first-page":"1","volume":"11","author":"Chen HH","year":"1998","journal-title":"Computer Processing of Oriental Languages"},{"key":"bibr30-0165551512442481","first-page":"119","volume-title":"Proceedings of ROCLING VI","author":"Lin 
MY","year":"1993"},{"key":"bibr31-0165551512442481","doi-asserted-by":"publisher","DOI":"10.3115\/990820.990846"},{"key":"bibr32-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-011-9411-z"},{"key":"bibr33-0165551512442481","first-page":"412","volume-title":"The fourteenth international conference on machine learning","author":"Yang Y","year":"1997"},{"key":"bibr34-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1145\/215206.215365"},{"key":"bibr35-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(199605)47:5<357::AID-ASI3>3.0.CO;2-V"},{"key":"bibr36-0165551512442481","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2008.50"},{"key":"bibr37-0165551512442481","volume-title":"Introduction to modern information retrieval","author":"Salton G","year":"2008"},{"key":"bibr38-0165551512442481","volume-title":"Information retrieval","author":"Van Rijsbergen C J","year":"1979"},{"key":"bibr39-0165551512442481","volume-title":"In Proceedings of the semantic web workshop of the 26th annual international ACM SIGIR conference","author":"Hotho A","year":"2003"}],"container-title":["Journal of Information 
Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0165551512442481","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/0165551512442481","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0165551512442481","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T23:08:20Z","timestamp":1777504100000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/0165551512442481"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,6,6]]},"references-count":39,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2012,8]]}},"alternative-id":["10.1177\/0165551512442481"],"URL":"https:\/\/doi.org\/10.1177\/0165551512442481","relation":{},"ISSN":["0165-5515","1741-6485"],"issn-type":[{"value":"0165-5515","type":"print"},{"value":"1741-6485","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,6,6]]}}}