{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T04:19:47Z","timestamp":1771647587486,"version":"3.50.1"},"reference-count":42,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,9,6]],"date-time":"2021-09-06T00:00:00Z","timestamp":1630886400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,9,6]],"date-time":"2021-09-06T00:00:00Z","timestamp":1630886400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Med Inform Decis Mak"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Biomedical language translation requires multi-lingual fluency as well as relevant domain knowledge. Such requirements make it challenging to train qualified translators and costly to generate high-quality translations. Machine translation represents an effective alternative, but accurate machine translation requires large amounts of in-domain data. While such datasets are abundant in general domains, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge, a parallel corpus does not exist for this language pair in the biomedical domain.<\/jats:p><\/jats:sec><jats:sec><jats:title>Description<\/jats:title><jats:p>We developed an effective pipeline to acquire and process an English-Chinese parallel corpus from the New England Journal of Medicine (NEJM). This corpus consists of about 100,000 sentence pairs and 3,000,000 tokens on each side. We showed that training on out-of-domain data and fine-tuning with as few as 4000 NEJM sentence pairs improve translation quality by 25.3 (13.4) BLEU for en<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\rightarrow$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>\u2192<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>zh (zh<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\rightarrow$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>\u2192<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>en) directions. Translation quality continues to improve at a slower pace on larger in-domain data subsets, with a total increase of 33.0 (24.3) BLEU for en<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\rightarrow$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>\u2192<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>zh (zh<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\rightarrow$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>\u2192<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>en) directions on the full dataset.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusions<\/jats:title><jats:p>The code and data are available at<jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/boxiangliu\/ParaMed\">https:\/\/github.com\/boxiangliu\/ParaMed<\/jats:ext-link>.<\/jats:p><\/jats:sec>","DOI":"10.1186\/s12911-021-01621-8","type":"journal-article","created":{"date-parts":[[2021,9,6]],"date-time":"2021-09-06T12:03:19Z","timestamp":1630929799000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["ParaMed: a parallel corpus for English\u2013Chinese translation in the biomedical domain"],"prefix":"10.1186","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2595-4463","authenticated-orcid":false,"given":"Boxiang","family":"Liu","sequence":"first","affiliation":[]},{"given":"Liang","family":"Huang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,9,6]]},"reference":[{"issue":"7124","key":"1621_CR1","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1136\/bmj.316.7124.2a","volume":"316","author":"I Bamforth","year":"1998","unstructured":"Bamforth I. Biomedical translation. BMJ. 1998;316(7124):2\u20137124.","journal-title":"BMJ"},{"key":"1621_CR2","doi-asserted-by":"publisher","first-page":"2354","DOI":"10.1136\/bmj.b2354","volume":"338","author":"A Das","year":"2009","unstructured":"Das A. Medical interpreters. BMJ. 2009;338:2354.","journal-title":"BMJ"},{"key":"1621_CR3","unstructured":"Hassan H, Aue A, Chen C, Chowdhary V, Clark J, Federmann C, Huang X, Junczys-Dowmunt M, Lewis W, Li M, et al. Achieving human parity on automatic chinese to english news translation; 2018. arXiv preprint arXiv:1803.05567"},{"issue":"suppl\u20131","key":"1621_CR4","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1093\/nar\/gkh061","volume":"32","author":"O Bodenreider","year":"2004","unstructured":"Bodenreider O. The unified medical language system (UMLs): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl\u20131):267\u201370.","journal-title":"Nucleic Acids Res"},{"key":"1621_CR5","doi-asserted-by":"crossref","unstructured":"Sennrich R, Haddow B, Birch A. Improving neural machine translation models with monolingual data; 2015. arXiv preprint arXiv:1511.06709","DOI":"10.18653\/v1\/P16-1009"},{"key":"1621_CR6","unstructured":"Duh K, Neubig G, Sudoh K, Tsukada H. Adaptation data selection using neural language models: experiments in machine translation. In: Proceedings of the 51st annual meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 2013. p. 678\u2013683."},{"key":"1621_CR7","unstructured":"Luong M-T, Manning CD. Stanford neural machine translation systems for spoken language domains. In: Proceedings of the international workshop on spoken language translation; 2015. p. 76\u201379."},{"key":"1621_CR8","doi-asserted-by":"crossref","unstructured":"Neves M. A parallel collection of clinical trials in Portuguese and English. In: Proceedings of the 10th workshop on building and using comparable corpora; 2017. p. 36\u201340","DOI":"10.18653\/v1\/W17-2507"},{"key":"1621_CR9","unstructured":"Bawden R, Di\u00a0Nunzio G, Grozea C, Unanue I, Yepes A, Mah N, Martinez D, N\u00e9v\u00e9ol A, Neves M, Oronoz M, et\u00a0al. Findings of the WMT 2020 biomedical translation shared task: Basque, Italian and Russian as new additional languages. In: 5th conference on machine translation; 2020."},{"key":"1621_CR10","unstructured":"Du\u0161ek O, Haji\u010d J, Hlav\u00e1\u010dov\u00e1 J, Libovick\u00fd J, Pecina P, Tamchyna A, Ure\u0161ov\u00e1 Z. Khresmoi Summary Translation Test Data 2.0. LINDAT\/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (\u00daFAL), Faculty of Mathematics and Physics, Charles University; 2017. http:\/\/hdl.handle.net\/11234\/1-2122"},{"key":"1621_CR11","unstructured":"Villegas M, Intxaurrondo A, Gonzalez-Agirre A, Marimon M, Krallinger M. The mespen resource for english-spanish medical machine translation and terminologies: census of parallel corpora, glossaries and term translations. In: Malero M, Krallinger M, Gonzalez-Agirre A, editors. LREC MultilingualBIO: Multilingual Biomedical Text Processing. 2018."},{"key":"1621_CR12","first-page":"2214","volume":"2012","author":"J Tiedemann","year":"2012","unstructured":"Tiedemann J. Parallel data, tools and interfaces in opus. LREC. 2012;2012:2214\u20138.","journal-title":"LREC"},{"key":"1621_CR13","unstructured":"Tian L, Wong DF, Chao LS, Quaresma P, Oliveira F, Yi L. Um-corpus: A large english-chinese parallel corpus for statistical machine translation. In: LREC; 2014. p. 1837\u20131842"},{"key":"1621_CR14","doi-asserted-by":"crossref","unstructured":"Barrault L, Bojar O, Costa-juss\u00e0 MR, Federmann C, Fishel M, Graham Y, Haddow B, Huck M, Koehn P, Malmasi S, et\u00a0al. Findings of the 2019 conference on machine translation (wmt19). In: Proceedings of the fourth conference on machine translation (Volume 2: Shared Task Papers, Day 1); 2019. p. 1\u201361.","DOI":"10.18653\/v1\/W19-5301"},{"key":"1621_CR15","doi-asserted-by":"crossref","unstructured":"Buck C, Koehn P. Findings of the wmt 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: Volume 2, Shared Task Papers; 2016. p. 554\u2013563.","DOI":"10.18653\/v1\/W16-2347"},{"key":"1621_CR16","doi-asserted-by":"crossref","unstructured":"Gomes L, Lopes GP. First steps towards coverage-based document alignment. In: Proceedings of the first conference on machine translation: volume 2, Shared Task Papers; 2016. p. 697\u2013702.","DOI":"10.18653\/v1\/W16-2369"},{"key":"1621_CR17","unstructured":"Read J, Dridan R, Oepen S, Solberg LJ. Sentence boundary detection: a long solved problem? In: Proceedings of COLING 2012: Posters; 2012. p. 985\u2013994."},{"key":"1621_CR18","unstructured":"Alias-i: Alias-i. http:\/\/alias-i.com\/lingpipe. Accessed:2019-12-10 (2008)"},{"key":"1621_CR19","volume-title":"Natural language processing with Python","author":"S Bird","year":"2009","unstructured":"Bird S, Loper E, Klein E. Natural language processing with Python. Newton: O\u2019Reilly Media Inc.; 2009."},{"issue":"1","key":"1621_CR20","first-page":"75","volume":"19","author":"WA Gale","year":"1993","unstructured":"Gale WA, Church KW. A program for aligning sentences in bilingual corpora. Comput Linguist. 1993;19(1):75\u2013102.","journal-title":"Comput Linguist"},{"key":"1621_CR21","doi-asserted-by":"crossref","unstructured":"Moore RC. Fast and accurate sentence alignment of bilingual corpora. In: Conference of the association for machine translation in the Americas, Springer; 2002 p. 135\u2013144.","DOI":"10.1007\/3-540-45820-4_14"},{"key":"1621_CR22","doi-asserted-by":"crossref","unstructured":"Varga D, Hal\u00e1csy P, Kornai A, Nagy V, N\u00e9meth L, Tr\u00f3n V. Parallel corpora for medium density languages. Amsterdam studies in the theory and history of linguistic science series 4. 2007;292:247.","DOI":"10.1075\/cilt.292.32var"},{"key":"1621_CR23","unstructured":"Ma X. Champollion: a robust parallel text sentence aligner. In: LREC; 2006. p. 489\u2013492."},{"key":"1621_CR24","unstructured":"Sennrich R, Volk M. Iterative, MT-based sentence alignment of parallel texts. In: Proceedings of the 18th Nordic conference of computational linguistics (NODALIDA 2011); 2011. p. 175\u2013182."},{"key":"1621_CR25","unstructured":"Sennrich R, Volk M. Mt-based sentence alignment for ocr-generated parallel texts. In: The Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010); 2010."},{"key":"1621_CR26","doi-asserted-by":"crossref","unstructured":"Koehn P, Khayrallah H, Heafield K, Forcada ML. Findings of the WMT 2018 shared task on parallel corpus filtering. In: Proceedings of the third conference on machine translation: shared task papers; 2018. p. 726\u2013739.","DOI":"10.18653\/v1\/W18-6453"},{"key":"1621_CR27","doi-asserted-by":"crossref","unstructured":"Junczys-Dowmunt M. Dual conditional cross-entropy filtering of noisy parallel corpora; 2018. arXiv preprint arXiv:1809.00197","DOI":"10.18653\/v1\/W18-6478"},{"key":"1621_CR28","unstructured":"Muthukadan B. Selenium with Python. https:\/\/selenium-python.readthedocs.io\/. Accessed 10 Dec 2019"},{"key":"1621_CR29","doi-asserted-by":"crossref","unstructured":"Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, et\u00a0al. Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the Association for Computational Linguistics Companion Volume proceedings of the demo and poster sessions; 2007. p. 177\u2013180.","DOI":"10.3115\/1557769.1557821"},{"key":"1621_CR30","unstructured":"Junczys-Dowmunt M. eserix. https:\/\/github.com\/emjotde\/eserix. Accessed 10 Dec 2019"},{"key":"1621_CR31","unstructured":"Ziemski M, Junczys-Dowmunt M, Pouliquen B. The united nations parallel corpus v1. 0. In: Proceedings of the tenth international conference on language resources and evaluation (LREC\u201916); 2016. p. 3530\u20133534."},{"key":"1621_CR32","unstructured":"Ram\u00edrez-S\u00e1nchez G, Zaragoza-Bernabeu J, Ba\u00f1\u00f3n M, Ortiz-Rojas S. Bifixer and bicleaner: two open-source tools to clean your parallel data. In: Proceedings of the 22nd annual conference of the European Association for Machine Translation; 2020. p. 291\u2013298. European Association for Machine Translation, Lisboa, Portugal."},{"key":"1621_CR33","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser, \u0141, Polosukhin I. Attention is all you need. In: Advances in neural information processing systems; 2017. p. 5998\u20136008."},{"key":"1621_CR34","doi-asserted-by":"crossref","unstructured":"Klein G, Kim Y, Deng Y, Senellart J, Rush A. OpenNMT: Open-source toolkit for neural machine translation. In: Proceedings of ACL 2017, system demonstrations; 2017. p. 67\u201372. Association for Computational Linguistics, Vancouver, Canada. https:\/\/www.aclweb.org\/anthology\/P17-4012","DOI":"10.18653\/v1\/P17-4012"},{"issue":"8","key":"1621_CR35","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735\u201380.","journal-title":"Neural Comput"},{"key":"1621_CR36","doi-asserted-by":"crossref","unstructured":"Bojar O, Federmann C, Fishel M, Graham Y, Haddow B, Huck M, Koehn P, Monz C. Findings of the 2018 conference on machine translation (wmt18). In: Proceedings of the third conference on machine translation, Volume 2: Shared Task Papers; 2018. p. 272\u2013307. Association for Computational Linguistics, Belgium, Brussels. http:\/\/www.aclweb.org\/anthology\/W18-6401","DOI":"10.18653\/v1\/W18-6401"},{"key":"1621_CR37","doi-asserted-by":"crossref","unstructured":"Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units; 2015. arXiv preprint arXiv:1508.07909","DOI":"10.18653\/v1\/P16-1162"},{"key":"1621_CR38","unstructured":"Kingma DP, Ba J. Adam: a method for stochastic optimization; 2014. arXiv preprint arXiv:1412.6980."},{"key":"1621_CR39","doi-asserted-by":"crossref","unstructured":"S\u00e1nchez-Cartagena VM, Ba\u00f1\u00f3n M, Rojas SO, Ram\u00edrez-S\u00e1nchez G. Prompsit\u2019s submission to WMT 2018 parallel corpus filtering shared task. In: Proceedings of the third conference on machine translation: shared task papers; 2018. p. 955\u2013962.","DOI":"10.18653\/v1\/W18-6488"},{"key":"1621_CR40","doi-asserted-by":"crossref","unstructured":"Graham Y, Haddow B, Koehn P. Translationese in machine translation evaluation; 2019. arXiv preprint arXiv:1906.09833","DOI":"10.18653\/v1\/2020.emnlp-main.6"},{"key":"1621_CR41","doi-asserted-by":"crossref","unstructured":"Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Yepes AJ, Koehn P, Logacheva V, Monz C, et\u00a0al. Findings of the 2016 conference on machine translation. In: Proceedings of the first conference on machine translation: Volume 2, Shared Task Papers; 2016. p. 131\u2013198.","DOI":"10.18653\/v1\/W16-2301"},{"key":"1621_CR42","doi-asserted-by":"crossref","unstructured":"Bojar O, Chatterjee R, Christian F, Yvette G, Barry H, Matthias H, Philipp K, Qun L, Varvara L, Christof M, et\u00a0al. Findings of the 2017 conference on machine translation (wmt17). In: Second Conference onMachine Translation; 2017. p. 169\u2013214. The Association for Computational Linguistics","DOI":"10.18653\/v1\/W17-4717"}],"container-title":["BMC Medical Informatics and Decision Making"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-021-01621-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12911-021-01621-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-021-01621-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,7]],"date-time":"2024-09-07T21:02:47Z","timestamp":1725742967000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcmedinformdecismak.biomedcentral.com\/articles\/10.1186\/s12911-021-01621-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,6]]},"references-count":42,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["1621"],"URL":"https:\/\/doi.org\/10.1186\/s12911-021-01621-8","relation":{},"ISSN":["1472-6947"],"issn-type":[{"value":"1472-6947","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,6]]},"assertion":[{"value":"22 October 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 August 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 September 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"258"}}