{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:52:09Z","timestamp":1760241129282,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2019,12,11]],"date-time":"2019-12-11T00:00:00Z","timestamp":1576022400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100010418","name":"Institute for Information and Communications Technology Promotion","doi-asserted-by":"publisher","award":["2017-0-00255"],"award-info":[{"award-number":["2017-0-00255"]}],"id":[{"id":"10.13039\/501100010418","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise\u2014weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean\u2192English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 Bleu points for tst2016 and tst2017, respectively.<\/jats:p>","DOI":"10.3390\/e21121213","type":"journal-article","created":{"date-parts":[[2019,12,12]],"date-time":"2019-12-12T03:20:16Z","timestamp":1576120816000},"page":"1213","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Improving Neural Machine Translation by Filtering Synthetic Parallel Data"],"prefix":"10.3390","volume":"21","author":[{"given":"Guanghao","family":"Xu","sequence":"first","affiliation":[{"name":"Department of Engineering, Computer Science, Sogang University, Seoul 04107, Korea"}]},{"given":"Youngjoong","family":"Ko","sequence":"additional","affiliation":[{"name":"Applied Data Science, Sungkyunkwan University, Suwon 16419, Korea"}]},{"given":"Jungyun","family":"Seo","sequence":"additional","affiliation":[{"name":"Department of Engineering, Computer Science, Sogang University, Seoul 04107, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2019,12,11]]},"reference":[{"key":"ref_1","unstructured":"Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., and Macherey, K. (2016). Google\u2019s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv, Available online: https:\/\/arxiv.org\/abs\/1609.08144."},{"key":"ref_2","unstructured":"Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., and Li, M. (2018). Achieving Human Parity on Automatic Chinese to English News Translation. arXiv, Available online: https:\/\/arxiv.org\/abs\/1803.05567."},{"key":"ref_3","unstructured":"Koehn, P., and Knowles, R. (August, January 30). Six Challenges for Neural Machine Translation. Proceedings of the First Workshop on Neural Machine Translation, Vancouver, BC, Canada."},{"key":"ref_4","unstructured":"Lambert, P., Schwenk, H., Servan, C., and Abdul-Rauf, S. (2011, January 30\u201331). Investigations on Translation Model Adaptation Using Monolingual Data. Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1016\/j.csl.2017.01.014","article-title":"On integrating a language model into neural machine translation","volume":"45","author":"Gulcehre","year":"2017","journal-title":"Comput. Speech Lang."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2016, January 7\u201312). Improving Neural Machine Translation Models with Monolingual Data. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1009"},{"key":"ref_7","unstructured":"Imankulova, A., Sato, T., and Komachi, M. (December, January 27). Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus. Proceedings of the 4th Workshop on Asian Translation (WAT2017), Taipei, Taiwan."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 7\u201312). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Koehn, P., Khayrallah, H., Heafield, K., and Forcada, M.L. (November, January 31). Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering. Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels.","DOI":"10.18653\/v1\/W18-6453"},{"key":"ref_10","unstructured":"Mikolov, T., Le, Q.V., and Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv, Available online: https:\/\/arxiv.org\/abs\/1309.4168."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Xing, C., Wang, D., Liu, C., and Lin, Y. (June, January 31). Normalized word embedding and orthogonal transform for bilingual word translation. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA.","DOI":"10.3115\/v1\/N15-1104"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Luong, T., Pham, H., and Manning, C.D. (2015, January 5). Bilingual word representations with monolingual quality in mind. Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA.","DOI":"10.3115\/v1\/W15-1521"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Artetxe, M., Labaka, G., and Agirre, E. (2018, January 2\u20137). Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11992"},{"key":"ref_14","unstructured":"Artetxe, M., Labaka, G., and Agirre, E. (August, January 30). Learning bilingual word embeddings with (almost) no bilingual data. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada."},{"key":"ref_15","unstructured":"Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and J\u00e9gou, H. (2017). Word translation without parallel data. arXiv, Available online: https:\/\/arxiv.org\/abs\/1710.04087."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Artetxe, M., Labaka, G., and Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. arXiv, Available online: https:\/\/arxiv.org\/abs\/1805.06297.","DOI":"10.18653\/v1\/P18-1073"},{"key":"ref_17","unstructured":"Taghipour, K., Khadivi, S., and Xu, J. (2011, January 19\u201323). Parallel corpus refinement as an outlier detection algorithm. Proceedings of the 13th Machine Translation Summit (MT Summit XIII), Xiamen, China."},{"key":"ref_18","unstructured":"Cui, L., Zhang, D., Liu, S., Li, M., and Zhou, M. (2013, January 4\u20139). Bilingual data cleaning for smt using graph-based random walk. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Junczys-Dowmunt, M. (2018). Dual conditional cross-entropy filtering of noisy parallel corpora. arXiv, Available online: https:\/\/arxiv.org\/abs\/1809.00197.","DOI":"10.18653\/v1\/W18-6478"},{"key":"ref_20","unstructured":"Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems 27, Curran Associates, Inc."},{"key":"ref_21","unstructured":"Cettolo, M., Federico, M., Bentivogli, L., Niehues, J., St\u00fcker, S., Sudoh, K., Yoshino, K., and Federmann, C. (2018, January 14\u201315). Overview of the IWSLT 2017 Evaluation Campaign. Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT), Tokyo, Japan."},{"key":"ref_22","unstructured":"Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., and Zens, R. (2007, January 25\u201327). Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL \u201907), Stroudsburg, PA, USA."},{"key":"ref_23","unstructured":"Park, E.L., and Cho, S. (2014, January 10\u201311). KoNLPy: Korean natural language processing in Python. Proceedings of the 26th Annual Conference on Human & Cognitive Language Technology, Chuncheon, Korea."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2016, January 7\u201312). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1162"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A.M. (2017). OpenNMT: Open-Source Toolkit for Neural Machine Translation. arXiv, Available online: https:\/\/arxiv.org\/abs\/1701.02810.","DOI":"10.18653\/v1\/P17-4012"},{"key":"ref_26","unstructured":"Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017). Attention is All you Need. Advances in Neural Information Processing Systems 30, Curran Associates, Inc."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Luong, T., Pham, H., and Manning, C.D. (2015, January 17\u201321). Effective Approaches to Attention-based Neural Machine Translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.","DOI":"10.18653\/v1\/D15-1166"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching Word Vectors with Subword Information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"TACL"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Fadaee, M., and Monz, C. (November, January 31). Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1040"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/12\/1213\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:41:13Z","timestamp":1760190073000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/12\/1213"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,12,11]]},"references-count":30,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2019,12]]}},"alternative-id":["e21121213"],"URL":"https:\/\/doi.org\/10.3390\/e21121213","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2019,12,11]]}}}