{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:30:09Z","timestamp":1750221009825,"version":"3.41.0"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,10,9]],"date-time":"2019-10-09T00:00:00Z","timestamp":1570579200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2020,1,31]]},"abstract":"<jats:p>Two different methods of corpus cleaning are presented in this article. One is a machine-assisted technique, which is good to clean small-sized parallel corpus, and the other is an automatic method, which is suitable for cleaning large-sized parallel corpus. A baseline SMT (MOSES) system is used to evaluate these methods. The machine-assisted technique used two features: word alignment and length of the source and target language sentence. These features are used to detect mistranslations in the corpus, which are then handled by a human translator. Experiments of this method are conducted on the English-to-Indian Language Machine Translation (EILMT) corpus (English-Hindi). The Bilingual Evaluation Understudy (BLEU) score is improved by 0.47% for the clean corpus. Automatic method of corpus cleaning uses a combination of two features. One feature is length of source and target language sentence and the second feature is Viterbi alignment score generated by Hidden Markov Model for each sentence pair. Two different threshold values are used for these two features. These values are decided by using a small-sized manually annotated parallel corpus of 206 sentence pairs. 
Experiments of this method are conducted on the HindEnCorp corpus, released in the workshop of the Association of Computational Linguistics (ACL 2014). The BLEU score is improved by 0.6% on clean corpus. A comparison of the two methods is also presented on EILMT corpus.<\/jats:p>","DOI":"10.1145\/3342351","type":"journal-article","created":{"date-parts":[[2019,10,10]],"date-time":"2019-10-10T13:13:05Z","timestamp":1570713185000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["An Automatic and a Machine-assisted Method to Clean Bilingual Corpus"],"prefix":"10.1145","volume":"19","author":[{"given":"Jyoti","family":"Srivastava","sequence":"first","affiliation":[{"name":"Madanapalle Institute of Technology & Science, Madanapalle, Andhra Pradesh, India"}]},{"given":"Sudip","family":"Sanyal","sequence":"additional","affiliation":[{"name":"BML Munjal University, Gurugram, Haryana, India"}]},{"given":"Ashish Kumar","family":"Srivastava","sequence":"additional","affiliation":[{"name":"Madanapalle Institute of Technology & Science, Madanapalle, Andhra Pradesh, India"}]}],"member":"320","published-online":{"date-parts":[[2019,10,9]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.5555\/234285.234287"},{"volume-title":"Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.","author":"Diab M.","key":"e_1_2_1_2_1","unstructured":"M. Diab and P. Resnik . 2002. An unsupervised method for word sense tagging using parallel corpora . In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. M. Diab and P. Resnik. 2002. An unsupervised method for word sense tagging using parallel corpora. 
In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-009-9097-9"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/92858.92860"},{"volume-title":"Proceedings of the 5th International Symposium on Telecommunications (IST\u201910)","author":"Taghipour K.","key":"e_1_2_1_5_1","unstructured":"K. Taghipour , N. Afhami , S. Khadivi , and S. Shiry . 2010. A discriminative approach to filter out noisy sentence pairs from bilingual corpora . In Proceedings of the 5th International Symposium on Telecommunications (IST\u201910) . K. Taghipour, N. Afhami, S. Khadivi, and S. Shiry. 2010. A discriminative approach to filter out noisy sentence pairs from bilingual corpora. In Proceedings of the 5th International Symposium on Telecommunications (IST\u201910)."},{"key":"e_1_2_1_6_1","volume-title":"Europarl: A Parallel Corpus for Statistical Machine Translation, in MT Summit.","author":"Koehn P.","year":"2005","unstructured":"P. Koehn . 2005 . Europarl: A Parallel Corpus for Statistical Machine Translation, in MT Summit. P. Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation, in MT Summit."},{"volume-title":"Proceedings of the International Conference on Asian Language Processing (IALP\u201910)","author":"Liu X.","key":"e_1_2_1_7_1","unstructured":"X. Liu and M. Zhou . 2010. Evaluating the quality of web-mined bilingual sentences using multiple linguistic features . In Proceedings of the International Conference on Asian Language Processing (IALP\u201910) . X. Liu and M. Zhou. 2010. Evaluating the quality of web-mined bilingual sentences using multiple linguistic features. 
In Proceedings of the International Conference on Asian Language Processing (IALP\u201910)."},{"volume-title":"Proceedings of the International Conference on Application of Natural Language to Information Systems.","author":"Khadivi S.","key":"e_1_2_1_8_1","unstructured":"S. Khadivi and H. Ney . 2005. Automatic filtering of bilingual corpora for statistical machine translation . In Proceedings of the International Conference on Application of Natural Language to Information Systems. S. Khadivi and H. Ney. 2005. Automatic filtering of bilingual corpora for statistical machine translation. In Proceedings of the International Conference on Application of Natural Language to Information Systems."},{"volume-title":"Proceedings of the Workshop on Machine Translation (WMT\u201913)","author":"Stymne S.","key":"e_1_2_1_9_1","unstructured":"S. Stymne , C. Hardmeier , J. Tiedemann , and J. Nivre . 2013. Tunable distortion limits and corpus cleaning for SMT . In Proceedings of the Workshop on Machine Translation (WMT\u201913) . S. Stymne, C. Hardmeier, J. Tiedemann, and J. Nivre. 2013. Tunable distortion limits and corpus cleaning for SMT. In Proceedings of the Workshop on Machine Translation (WMT\u201913)."},{"volume-title":"Proceedings of the International Conference on Computer Linguistics (COLING\u201912)","author":"Formiga Fanals L.","key":"e_1_2_1_10_1","unstructured":"L. Formiga Fanals and J. A. Rodr\u0131\u0301guez Fonollosa . 2012. Dealing with input noise in statistical machine translation . In Proceedings of the International Conference on Computer Linguistics (COLING\u201912) . L. Formiga Fanals and J. A. Rodr\u0131\u0301guez Fonollosa. 2012. Dealing with input noise in statistical machine translation. In Proceedings of the International Conference on Computer Linguistics (COLING\u201912)."},{"volume-title":"Proceedings of the Annual Meeting of the Association of Computational Linguistics (ACL\u201913)","author":"Cui L.","key":"e_1_2_1_11_1","unstructured":"L. 
Cui , D. Zhang , S. Liu , M. Li , and M. Zhou . 2013. Bilingual data cleaning for SMT using graph-based random walk . In Proceedings of the Annual Meeting of the Association of Computational Linguistics (ACL\u201913) . 2. L. Cui, D. Zhang, S. Liu, M. Li, and M. Zhou. 2013. Bilingual data cleaning for SMT using graph-based random walk. In Proceedings of the Annual Meeting of the Association of Computational Linguistics (ACL\u201913). 2."},{"volume-title":"Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.","author":"Callison-Burch C.","key":"e_1_2_1_12_1","unstructured":"C. Callison-Burch , D. Talbot , and M. Osborne . 2004. Statistical machine translation with word-and sentence-aligned parallel corpora . In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. C. Callison-Burch, D. Talbot, and M. Osborne. 2004. Statistical machine translation with word-and sentence-aligned parallel corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120103322711578"},{"volume-title":"Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201909)","author":"Sarikaya R.","key":"e_1_2_1_14_1","unstructured":"R. Sarikaya , S. Maskey , R. Zhang , E.-E. Jan , D. Wang , B. Ramabhadran , and S. Roukos . 2009. Iterative sentence-pair extraction from quasi-parallel corpora for machine translation . In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201909) . R. Sarikaya, S. Maskey, R. Zhang, E.-E. Jan, D. Wang, B. Ramabhadran, and S. Roukos. 2009. Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. 
In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH\u201909)."},{"volume-title":"Proceedings of the IEEE\/WIC\/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies (WI-IAT\u201909)","author":"Turchi M.","key":"e_1_2_1_15_1","unstructured":"M. Turchi , T. De Bie , and N. Cristianini . 2009. An intelligent agent that autonomously learns how to translate . In Proceedings of the IEEE\/WIC\/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies (WI-IAT\u201909) . M. Turchi, T. De Bie, and N. Cristianini. 2009. An intelligent agent that autonomously learns how to translate. In Proceedings of the IEEE\/WIC\/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies (WI-IAT\u201909)."},{"volume-title":"Proceedings of the 13th Machine Translation Summit (MT\u201911)","author":"Taghipour K.","key":"e_1_2_1_16_1","unstructured":"K. Taghipour , S. Khadivi , and J. Xu . 2011. Parallel corpus refinement as an outlier detection algorithm . Proceedings of the 13th Machine Translation Summit (MT\u201911) , 414--421. K. Taghipour, S. Khadivi, and J. Xu. 2011. Parallel corpus refinement as an outlier detection algorithm. Proceedings of the 13th Machine Translation Summit (MT\u201911), 414--421."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120105775299168"},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"L. Cui D. Zhang S. Liu M. Li and M. Zhou. 2013. Collective corpus weighting and phrase scoring for SMT using graph-based random walk. In Natural Language Processing and Chinese Computing Springer 176--187.  L. Cui D. Zhang S. Liu M. Li and M. Zhou. 2013. Collective corpus weighting and phrase scoring for SMT using graph-based random walk. 
In Natural Language Processing and Chinese Computing Springer 176--187.","DOI":"10.1007\/978-3-642-41644-6_17"},{"volume-title":"Proceedings of the 6th International Conference on Web Services and Semantic Technology (WeST\u201914)","author":"Y\u0131ld\u0131z E.","key":"e_1_2_1_19_1","unstructured":"E. Y\u0131ld\u0131z , A. C. Tantu\u011f , and B. Diri . 2014. The effect of parallel corpus quality vs. size in English-to-Turkish SMT . In Proceedings of the 6th International Conference on Web Services and Semantic Technology (WeST\u201914) . E. Y\u0131ld\u0131z, A. C. Tantu\u011f, and B. Diri. 2014. The effect of parallel corpus quality vs. size in English-to-Turkish SMT. In Proceedings of the 6th International Conference on Web Services and Semantic Technology (WeST\u201914)."},{"key":"e_1_2_1_20_1","unstructured":"P. F. Brown V. J. D. Pietra S. A. D. Pietra and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19 263--311.  P. F. Brown V. J. D. Pietra S. A. D. Pietra and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19 263--311."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"volume-title":"Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics.","author":"Liang P.","key":"e_1_2_1_22_1","unstructured":"P. Liang , B. Taskar , and D. Klein . 2006. Alignment by agreement . In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. 
In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics."},{"volume-title":"Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP\u201908)","author":"Ramanathan A.","key":"e_1_2_1_23_1","unstructured":"A. Ramanathan , J. Hegde , R. M. Shah , P. Bhattacharyya , and M. Sasikumar . 2008. Simple syntactic and morphological processing can help English-Hindi statistical machine translation . In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP\u201908) . A. Ramanathan, J. Hegde, R. M. Shah, P. Bhattacharyya, and M. Sasikumar. 2008. Simple syntactic and morphological processing can help English-Hindi statistical machine translation. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP\u201908)."},{"volume-title":"Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST\u201914)","author":"Singla K.","key":"e_1_2_1_24_1","unstructured":"K. Singla , K. Sachdeva , S. Bangalore , D. M. Sharma , and D. Yadav . 2014. Reducing the impact of data sparsity in statistical machine translation . In Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST\u201914) . K. Singla, K. Sachdeva, S. Bangalore, D. M. Sharma, and D. Yadav. 2014. Reducing the impact of data sparsity in statistical machine translation. In Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST\u201914)."},{"volume-title":"Proceedings of the Language Resources and Evaluation Conference (LREC\u201914)","author":"Sachdeva K.","key":"e_1_2_1_25_1","unstructured":"K. Sachdeva , R. Srivastava , S. Jain , and D. M. Sharma , 2014. Hindi-to-English machine translation: Using effective selection in multi-model SMT . 
In Proceedings of the Language Resources and Evaluation Conference (LREC\u201914) . K. Sachdeva, R. Srivastava, S. Jain, and D. M. Sharma, 2014. Hindi-to-English machine translation: Using effective selection in multi-model SMT. In Proceedings of the Language Resources and Evaluation Conference (LREC\u201914)."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-014-9282-3"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/972450.972455"},{"volume-title":"Proceedings of the 29th Annual Meeting on Association for Computational Linguistics.","author":"Brown P. F.","key":"e_1_2_1_28_1","unstructured":"P. F. Brown , J. C. Lai , and R. L. Mercer . 1991. Aligning sentences in parallel corpora . In Proceedings of the 29th Annual Meeting on Association for Computational Linguistics. P. F. Brown, J. C. Lai, and R. L. Mercer. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting on Association for Computational Linguistics."},{"volume-title":"Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions.","author":"Koehn P.","key":"e_1_2_1_29_1","unstructured":"P. Koehn , H. Hoang , A. Birch , C. Callison-Burch , M. Federico , N. Bertoldi , B. Cowan , W. Shen , C. Moran , R. Zens , C. Dyer , O. Bojar , A. Constantin , and E. Herbst . 2007. Moses: Open source toolkit for statistical machine translation . In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. 
In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions."},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.","author":"Papineni K.","year":"2002","unstructured":"K. Papineni , S. Roukos , T. Ward , and W.-J. Zhu . 2002 . BLEU: A method for automatic evaluation of machine translation . In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.3115\/1289189.1289273"},{"volume-title":"Proceedings of the Language Resources and Evaluation Conference (LREC\u201914)","author":"Bojar O.","key":"e_1_2_1_32_1","unstructured":"O. Bojar , V. Diatka , P. Rychl\u1ef3 , P. Stran\u00e1k , V. Suchomel , A. Tamchyna , and D. Zeman . 2014. Hindencorp-Hindi-English and Hindi-only corpus for machine translation . In Proceedings of the Language Resources and Evaluation Conference (LREC\u201914) . O. Bojar, V. Diatka, P. Rychl\u1ef3, P. Stran\u00e1k, V. Suchomel, A. Tamchyna, and D. Zeman. 2014. Hindencorp-Hindi-English and Hindi-only corpus for machine translation. 
In Proceedings of the Language Resources and Evaluation Conference (LREC\u201914)."}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3342351","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3342351","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:26:02Z","timestamp":1750206362000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3342351"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,9]]},"references-count":32,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,1,31]]}},"alternative-id":["10.1145\/3342351"],"URL":"https:\/\/doi.org\/10.1145\/3342351","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2019,10,9]]},"assertion":[{"value":"2017-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}