{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T19:54:55Z","timestamp":1761162895548,"version":"build-2065373602"},"reference-count":46,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2016,12,27]],"date-time":"2016-12-27T00:00:00Z","timestamp":1482796800000},"content-version":"vor","delay-in-days":361,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":["asistdl.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Proc. Assoc. Info. Sci. Tech."],"published-print":{"date-parts":[[2016,1]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:p>Globalization places people in a multilingual environment. There is a growing number of users to access and share information in several languages for public or private purpose. In order to deliver relevant information in different languages, efficient multilingual documents management is worthy of study. Generally, classification and clustering are two typical methods for documents management. However, lack of training data and high efforts for corpus annotation will increase the cost for classifying multilingual documents which needs to bridge language gaps as well. Clustering is more suitable to implement in such practical applications. There are two main factors involved in documents clustering, document representation method and clustering algorithm. In this paper, we focus on document representation method and demonstrate that the choice of representation methods has impacts on quality of clustering results. In our experiment, we use parallel corpora (English\u2010Chinese documents on topic of technology information) and comparable corpora (English and Chinese documents on topics of mobile technology and wind energy) as dataset. We compare four different types of document representation methods: Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation and Doc2Vec. Experimental results show that, accuracy of Vector Space Model were not competitive with other methods in all clustering tasks. Latent Semantic Indexing is overly sensitive to corpora itself, for it behaved differently when clustering two different topics of comparable corpora. Latent Dirichlet Allocation behaves best when clustering documents in small size of comparable corpora while Doc2Vec behaves best for large documents set of parallel corpora. Accordingly, characteristics of corpora should be under considerations for rational utilization of document representation methods to have better performance.<\/jats:p>","DOI":"10.1002\/pra2.2016.14505301065","type":"journal-article","created":{"date-parts":[[2016,12,27]],"date-time":"2016-12-27T06:08:13Z","timestamp":1482818893000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Document representation methods for clustering bilingual documents"],"prefix":"10.1002","volume":"53","author":[{"given":"Shutian","family":"Ma","sequence":"first","affiliation":[{"name":"Department of Information Management Nanjing University of Science and Technology No. 200 Xiaolingwei Street Nanjing China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chengzhi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Department of Information Management Nanjing University of Science and Technology No. 200 Xiaolingwei Street Nanjing China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daqing","family":"He","sequence":"additional","affiliation":[{"name":"School of Information Science and Intelligent System Program University of Pittsburgh 135 North Bellefield Avenue Pittsburgh PA 15260"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"311","published-online":{"date-parts":[[2016,12,27]]},"reference":[{"key":"e_1_2_8_2_1","unstructured":"Anandkumar A. Liu Y.\u2010k. Hsu D. J. Foster D. P. &Kakade S. M.(2012).A spectral algorithm for latent dirichlet allocation.Proceedings of the Advances in Neural Information Processing Systems(pp.917\u2013925)."},{"key":"e_1_2_8_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24770-0_42"},{"volume-title":"Modern information retrieval","year":"1999","author":"Baeza\u2010Yates R.","key":"e_1_2_8_4_1"},{"key":"e_1_2_8_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.50"},{"key":"e_1_2_8_6_1","doi-asserted-by":"publisher","DOI":"10.1162\/jmlr.2003.3.4-5.993"},{"key":"e_1_2_8_7_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0018029"},{"key":"e_1_2_8_8_1","unstructured":"Boyd\u2010Graber J. &Blei D. M.(2009).Multilingual topic models for unaligned text.Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence(pp.75\u201382)."},{"key":"e_1_2_8_9_1","doi-asserted-by":"publisher","DOI":"10.1108\/02640471211221340"},{"issue":"8","key":"e_1_2_8_10_1","first-page":"1","article-title":"Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications","volume":"4","author":"Chen H.","year":"2013","journal-title":"Front Physiol"},{"key":"e_1_2_8_11_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btn534"},{"key":"e_1_2_8_12_1","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9"},{"key":"e_1_2_8_13_1","doi-asserted-by":"publisher","DOI":"10.1108\/02640471211221313"},{"key":"e_1_2_8_14_1","doi-asserted-by":"crossref","unstructured":"Evans D. K. Klavans J. L. &McKeown K. R.(2004).Columbia newsblaster: multilingual news summarization on the Web.Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT\u2010NAACL) 2004(pp.1\u20134).","DOI":"10.3115\/1614025.1614026"},{"key":"e_1_2_8_15_1","doi-asserted-by":"publisher","DOI":"10.1197\/jamia.M2544"},{"key":"e_1_2_8_16_1","doi-asserted-by":"publisher","DOI":"10.1126\/science.1136800"},{"key":"e_1_2_8_17_1","doi-asserted-by":"crossref","unstructured":"Griffiths T. L. &Steyvers M.(2004).Finding scientific topics.Proceedings of the National Academy of Sciences(pp.5228\u20135235).","DOI":"10.1073\/pnas.0307752101"},{"key":"e_1_2_8_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2012.07.032"},{"key":"e_1_2_8_19_1","doi-asserted-by":"crossref","unstructured":"Hofmann T.(1999).Probabilistic latent semantic indexing.Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval(pp.50\u201357).","DOI":"10.1145\/312624.312649"},{"key":"e_1_2_8_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TFUZZ.2010.2065811"},{"key":"e_1_2_8_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2009.11.010"},{"key":"e_1_2_8_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2008.06.002"},{"key":"e_1_2_8_23_1","doi-asserted-by":"crossref","unstructured":"Kim H. Ren X. Sun Y. Wang C. &Han J.(2013).Semantic frame\u2010based document representation for comparable corpora.Proceedings of the Data Mining (ICDM) 2013 IEEE 13th International Conference on(pp.350\u2013359).","DOI":"10.1109\/ICDM.2013.99"},{"key":"e_1_2_8_24_1","unstructured":"La Fleur M. &Renstr\u00f6m F.(2015).Conceptual Indexing using Latent Semantic Indexing: A Case Study."},{"key":"e_1_2_8_25_1","unstructured":"Le Q. V. &Mikolov T.(2014).Distributed representations of sentences and documents.arXiv preprint arXiv:1405.4053."},{"volume-title":"Understanding Digital Libraries, Second Edition (The Morgan Kaufmann Series in Multimedia and Information Systems)","year":"2004","author":"Lesk M.","key":"e_1_2_8_26_1"},{"key":"e_1_2_8_27_1","unstructured":"Mikolov T. Sutskever I. Chen K. Corrado G. S. &Dean J.(2013).Distributed representations of words and phrases and their compositionality.Proceedings of the Advances in neural information processing systems(pp.3111\u20133119)."},{"key":"e_1_2_8_28_1","doi-asserted-by":"crossref","unstructured":"Mimno D. Wallach H. M. Naradowsky J. Smith D. A. &McCallum A.(2009).Polylingual topic models.Proceedings of the 2009 Empirical Methods in Natural Language(pp.880\u2013889).","DOI":"10.3115\/1699571.1699627"},{"key":"e_1_2_8_29_1","first-page":"81","article-title":"NESM: A named entity based proximity measure for multilingual news clustering","volume":"48","author":"Montalvo S.","year":"2012","journal-title":"Procesamiento del lenguaje natural"},{"key":"e_1_2_8_30_1","unstructured":"Niu L.\u2010Q. &Dai X.\u2010Y.(2015).Topic2Vec: Learning Distributed Representations of Topics.arXiv preprint arXiv:1506.08422."},{"key":"e_1_2_8_31_1","doi-asserted-by":"publisher","DOI":"10.1108\/02640471211221331"},{"key":"e_1_2_8_32_1","unstructured":"Rosenberg A. &Hirschberg J.(2007).V\u2010Measure: A Conditional Entropy\u2010Based External Cluster Evaluation Measure.Proceedings of the EMNLP\u2010CoNLL(pp.410\u2013420)."},{"key":"e_1_2_8_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/0377-0427(87)90125-7"},{"key":"e_1_2_8_34_1","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(88)90021-0"},{"key":"e_1_2_8_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/361219.361220"},{"key":"e_1_2_8_36_1","doi-asserted-by":"crossref","unstructured":"Shafiei M. Wang S. Zhang R. Milios E. Tang B. Tougas J. et al. (2007).Document representation and dimension reduction for text clustering.Proceedings of the Data Engineering Workshop 2007 IEEE 23rd International Conference on(pp.770\u2013779).","DOI":"10.1109\/ICDEW.2007.4401066"},{"key":"e_1_2_8_37_1","doi-asserted-by":"publisher","DOI":"10.13053\/cys-18-3-2043"},{"key":"e_1_2_8_38_1","unstructured":"Taddy M.(2015).Document Classification by Inversion of Distributed Language Representations.arXiv preprint arXiv:1504.07295."},{"key":"e_1_2_8_39_1","unstructured":"Tang G. Xia Y. Zhang M. Li H. &Zheng F.(2011).CLGVSM: Adapting Generalized Vector Space Model to Cross\u2010lingual Document Clustering.Proceedings of the IJCNLP(pp.580\u2013588)."},{"key":"e_1_2_8_40_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2014.08.003"},{"key":"e_1_2_8_41_1","unstructured":"Wang K. Zhang J. Li D. Zhang X. &Guo T.(2008).Adaptive affinity propagation clustering.arXiv preprint arXiv:0805.1096."},{"key":"e_1_2_8_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2007.07.008"},{"key":"e_1_2_8_43_1","doi-asserted-by":"publisher","DOI":"10.1108\/02640471211221322"},{"key":"e_1_2_8_44_1","doi-asserted-by":"crossref","unstructured":"Yang C. &Li K.(2004).Cross\u2010lingual information retrieval: The challenge in multilingual digital libraries.Design and Usability of Digital Libraries: Case Studies in the Asia Pacific Idea Group Inc.","DOI":"10.4018\/978-1-59140-441-5.ch009"},{"key":"e_1_2_8_45_1","unstructured":"Yang Y. &Pedersen J. O.(1997).A comparative study on feature selection in text categorization.Proceedings of the ICML(pp.412\u2013420)."},{"key":"e_1_2_8_46_1","unstructured":"Yetisgen\u2010Yildiz M. &Pratt W.(2005).The effect of feature representation on MEDLINE document classification.Proceedings of the AMIA."},{"key":"e_1_2_8_47_1","doi-asserted-by":"publisher","DOI":"10.1108\/02640471211221359"}],"container-title":["Proceedings of the Association for Information Science and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fpra2.2016.14505301065","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fpra2.2016.14505301065","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/asistdl.onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/pra2.2016.14505301065","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T17:10:02Z","timestamp":1761066602000},"score":1,"resource":{"primary":{"URL":"https:\/\/asistdl.onlinelibrary.wiley.com\/doi\/10.1002\/pra2.2016.14505301065"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,1]]},"references-count":46,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,1]]}},"alternative-id":["10.1002\/pra2.2016.14505301065"],"URL":"https:\/\/doi.org\/10.1002\/pra2.2016.14505301065","archive":["Portico"],"relation":{},"ISSN":["2373-9231","2373-9231"],"issn-type":[{"type":"print","value":"2373-9231"},{"type":"electronic","value":"2373-9231"}],"subject":[],"published":{"date-parts":[[2016,1]]},"assertion":[{"value":"2016-12-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}