{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:54:21Z","timestamp":1760241261412,"version":"build-2065373602"},"reference-count":34,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2019,12,29]],"date-time":"2019-12-29T00:00:00Z","timestamp":1577577600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["U1703133"],"award-info":[{"award-number":["U1703133"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100013494","name":"West Light Foundation of Chinese Academy of Sciences","doi-asserted-by":"publisher","award":["2017-XBQNXZ-A-005"],"award-info":[{"award-number":["2017-XBQNXZ-A-005"]}],"id":[{"id":"10.13039\/501100013494","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004739","name":"Youth Innovation Promotion Association CAS","doi-asserted-by":"publisher","award":["2017472"],"award-info":[{"award-number":["2017472"]}],"id":[{"id":"10.13039\/501100004739","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Xinjiang Science and Technology Major Project under","award":["2016A03007-3"],"award-info":[{"award-number":["2016A03007-3"]}]},{"name":"High-level talent introduction project in Xinjiang Uyghur Autonomous Region","award":["Y839031201"],"award-info":[{"award-number":["Y839031201"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>To overcome the data sparseness in word embedding trained in low-resource languages, we propose a punctuation and parallel corpus based word embedding model. 
In particular, we generate the global word-pair co-occurrence matrix with the punctuation-based distance attenuation function, and integrate it with the intermediate word vectors generated from the small-scale bilingual parallel corpus to train word embedding. Experimental results show that compared with several widely used baseline models such as GloVe and Word2vec, our model improves the performance of word embedding for low-resource language significantly. Trained on the restricted-scale English-Chinese corpus, our model has improved by 0.71 percentage points in the word analogy task, and achieved the best results in all of the word similarity tasks.<\/jats:p>","DOI":"10.3390\/info11010024","type":"journal-article","created":{"date-parts":[[2019,12,30]],"date-time":"2019-12-30T05:49:41Z","timestamp":1577684981000},"page":"24","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0849-1137","authenticated-orcid":false,"given":"Yang","family":"Yuan","sequence":"first","affiliation":[{"name":"Xinjiang Technical Institute of Physics &amp; Chemistry, Chinese Academy of Sciences, Urumqi 830011, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"},{"name":"Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiao","family":"Li","sequence":"additional","affiliation":[{"name":"Xinjiang Technical Institute of Physics &amp; Chemistry, Chinese Academy of Sciences, Urumqi 830011, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"},{"name":"Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ya-Ting","family":"Yang","sequence":"additional","affiliation":[{"name":"Xinjiang Technical Institute of Physics &amp; Chemistry, Chinese Academy of Sciences, Urumqi 830011, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"},{"name":"Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2019,12,29]]},"reference":[{"key":"ref_1","first-page":"2493","article-title":"Natural language processing (almost) from scratch","volume":"12","author":"Collobert","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_2","unstructured":"Sahlgren, M. (2006). The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces. [Ph.D. Thesis, Stockholm University]."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1613\/jair.2934","article-title":"From frequency to meaning: Vector space models of semantics","volume":"37","author":"Turney","year":"2010","journal-title":"J. Artif. Intell. Res."},{"key":"ref_4","unstructured":"Mnih, A., and Hinton, G.E. (2008, January 8\u201310). A scalable hierarchical distributed language model. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"391","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9","article-title":"Indexing by latent semantic analysis","volume":"41","author":"Deerwester","year":"1990","journal-title":"J. Am. Soc. Inf. 
Sci."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"510","DOI":"10.3758\/BF03193020","article-title":"Extracting semantic representations from word-pair co-occurrence statistics: A computational study","volume":"39","author":"Bullinaria","year":"2007","journal-title":"Behav. Res. Methods"},{"key":"ref_7","unstructured":"Ritter, A., and Etzioni, O. (2010, January 11\u201316). A latent dirichlet allocation method for selectional preferences. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_9","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","author":"Bengio","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_10","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013, January 5\u201310). Distributed representations of words and phrases and their compositionality. Proceedings of the Advances in Neural Information Processing Systems, Stateline, NV, USA."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Cotterell, R., and Sch\u00fctze, H. (June, January 31). Morphological word-embeddings. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA.","DOI":"10.3115\/v1\/N15-1140"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Levy, O., and Goldberg, Y. (2014, January 22\u201327). Dependency-based word embeddings. 
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA.","DOI":"10.3115\/v1\/P14-2050"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., and Liu, T.-Y. (2014, January 3\u20137). Rc-net: A general framework for incorporating knowledge into word representations. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, Shanghai, China.","DOI":"10.1145\/2661829.2662038"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liu, Q., Jiang, H., Wei, S., Ling, Z.-H., and Hu, Y. (2015, January 26\u201331). Learning semantic word embeddings based on ordinal knowledge constraints. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China.","DOI":"10.3115\/v1\/P15-1145"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018, January 1\u20136). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA.","DOI":"10.18653\/v1\/N18-1202"},{"key":"ref_16","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2019, December 28). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/www.cs.ubc.ca\/~amuham01\/LING530\/papers\/radford2018improving.pdf."},{"key":"ref_17","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). Bert: Pre-training of deep bidirectional transformers for language understanding. 
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA."},{"key":"ref_18","unstructured":"Lample, G., and Conneau, A. (2019, December 28). Cross-lingual Language Model Pretraining. Available online: https:\/\/arxiv.org\/pdf\/1901.07291.pdf."},{"key":"ref_19","unstructured":"Jiang, C., Yu, H.F., Hsieh, C.J., and Chang, K.W. (2019, December 28). Learning Word Embeddings for Low-Resource Languages by PU Learning. Available online: https:\/\/arxiv.org\/pdf\/1805.03366.pdf."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"991","DOI":"10.3233\/SW-190349","article-title":"Wan2vec: Embeddings learned on word association norms","volume":"10","author":"Sierra","year":"2019","journal-title":"Semant. Web."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Artetxe, M., Labaka, G., and Agirre, E. (2019, December 28). A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings. Available online: https:\/\/arxiv.org\/pdf\/1805.06297.pdf.","DOI":"10.18653\/v1\/P18-1073"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Tilk, O., and Alum\u00e4e, T. (2016, January 8\u201312). Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration. Proceedings of the Interspeech, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-1517"},{"key":"ref_23","unstructured":"Spitkovsky, V.I., Alshawi, H., and Jurafsky, D. (2011, January 23\u201324). Punctuation: Making a point in unsupervised dependency parsing. 
Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, Portland, OR, USA."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"470","DOI":"10.17706\/jcp.12.5.470-478","article-title":"Negation Handling in Sentiment Analysis at Sentence Level","volume":"12","author":"Farooq","year":"2017","journal-title":"J. Comput."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Koto, F., and Adriani, M. (2015, January 17\u201319). A comparative study on twitter sentiment analysis: Which features are good?. Proceedings of the 20th International Conference on Applications of Natural Language to Information Systems, NLDB 2015, Passau, Germany.","DOI":"10.1007\/978-3-319-19581-0_46"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Gao, Q., and Vogel, S. (2008, January 19\u201320). Parallel implementations of word alignment tool. Proceedings of the Software Engineering, Testing, and Quality Assurance for Natural Language Processing, Columbus, OH, USA.","DOI":"10.3115\/1622110.1622119"},{"key":"ref_27","first-page":"307","article-title":"Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics","volume":"13","author":"Gutmann","year":"2012","journal-title":"J. Mach. Learn. Res."},{"key":"ref_28","unstructured":"Che, W., Li, Z., and Liu, T. (2010, January 23\u201327). Ltp: A Chinese language technology platform. Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations, Beijing, China."},{"key":"ref_29","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2019, December 28). Efficient Estimation of Word Representations in Vector Space. 
Available online: https:\/\/arxiv.org\/pdf\/1301.3781.pdf."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"627","DOI":"10.1145\/365628.365657","article-title":"Contextual correlates of synonymy","volume":"8","author":"Rubenstein","year":"1965","journal-title":"Commun. Acm"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1080\/01690969108406936","article-title":"Contextual correlates of semantic similarity","volume":"6","author":"Miller","year":"1991","journal-title":"Lang. Cogn. Processes"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1145\/503104.503110","article-title":"Placing search in context: The concept revisited","volume":"20","author":"Finkelstein","year":"2002","journal-title":"Acm Trans. Inf. Syst."},{"key":"ref_33","unstructured":"Huang, E.H., Socher, R., Manning, C.D., and Ng, A.Y. (2012, January 8\u201314). Improving word representations via global context and multiple word prototypes. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea."},{"key":"ref_34","unstructured":"Luong, T., Socher, R., and Manning, C. (2013, January 8\u20139). Better word representations with recursive neural networks for morphology. 
Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/11\/1\/24\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:46:36Z","timestamp":1760190396000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/11\/1\/24"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,12,29]]},"references-count":34,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2020,1]]}},"alternative-id":["info11010024"],"URL":"https:\/\/doi.org\/10.3390\/info11010024","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2019,12,29]]}}}