{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T17:47:44Z","timestamp":1754156864927,"version":"3.41.2"},"reference-count":43,"publisher":"Emerald","issue":"2","license":[{"start":{"date-parts":[[2021,10,22]],"date-time":"2021-10-22T00:00:00Z","timestamp":1634860800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["DTA"],"published-print":{"date-parts":[[2022,3,15]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title><jats:p>In computational chemistry, the chemical bond energy (pKa) is essential, but most pKa-related data are submerged in scientific papers, with only a few data that have been extracted by domain experts manually. The loss of scientific data does not contribute to in-depth and innovative scientific data analysis. To address this problem, this study aims to utilize natural language processing methods to extract pKa-related scientific data in chemical papers.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title><jats:p>Based on the previous Bert-CRF model combined with dictionaries and rules to resolve the problem of a large number of unknown words of professional vocabulary, in this paper, the authors proposed an end-to-end Bert-CRF model with inputting constructed domain wordpiece tokens using text mining methods. The authors use standard high-frequency string extraction techniques to construct domain wordpiece tokens for specific domains. And in the subsequent deep learning work, domain features are added to the input.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Findings<\/jats:title><jats:p>The experiments show that the end-to-end Bert-CRF model could have a relatively good result and can be easily transferred to other domains because it reduces the requirements for experts by using automatic high-frequency wordpiece tokens extraction techniques to construct the domain wordpiece tokenization rules and then input domain features to the Bert model.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title><jats:p>By decomposing lots of unknown words with domain feature-based wordpiece tokens, the authors manage to resolve the problem of a large amount of professional vocabulary and achieve a relatively ideal extraction result compared to the baseline model. The end-to-end model explores low-cost migration for entity and relation extraction in professional fields, reducing the requirements for experts.<\/jats:p><\/jats:sec>","DOI":"10.1108\/dta-11-2020-0284","type":"journal-article","created":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T20:44:34Z","timestamp":1634762674000},"page":"205-222","source":"Crossref","is-referenced-by-count":0,"title":["Using pretraining and text mining methods to automatically extract the chemical scientific data"],"prefix":"10.1108","volume":"56","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5192-4201","authenticated-orcid":false,"given":"Na","family":"Pang","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0931-2882","authenticated-orcid":false,"given":"Li","family":"Qian","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9889-6092","authenticated-orcid":false,"given":"Weimin","family":"Lyu","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7351-2152","authenticated-orcid":false,"given":"Jin-Dong","family":"Yang","sequence":"additional","affiliation":[]}],"member":"140","published-online":{"date-parts":[[2021,10,22]]},"reference":[{"year":"1998","key":"key2022031408455812500_ref001","article-title":"SRA: description of the IE2 system used for MUC-7"},{"key":"key2022031408455812500_ref002","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","year":"2003","journal-title":"Journal of Machine Learning Research"},{"year":"1998","key":"key2022031408455812500_ref003","article-title":"Facile: description of the NE system used for MUC-7"},{"first-page":"54","article-title":"Making sense of microposts: (# microposts2014) named entity extraction and linking challenge[C]\/\/Ceur workshop","year":"2014","key":"key2022031408455812500_ref004"},{"year":"1998","key":"key2022031408455812500_ref005","article-title":"Description of the NTU system used for MET2"},{"issue":"1","key":"key2022031408455812500_ref006","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1177\/0165551516673485","article-title":"Extraction of protein\u2013protein interactions (PPIs) from the literature by deep convolutional neural networks with various feature embeddings","volume":"44","year":"2018","journal-title":"Journal of Information Science"},{"year":"2020","key":"key2022031408455812500_ref007","article-title":"Electra: pre-training text encoders as discriminators rather than generators"},{"year":"2019","key":"key2022031408455812500_ref008","article-title":"Pre-training with whole word masking for Chinese bert"},{"year":"2018","key":"key2022031408455812500_ref009","article-title":"Bert: pre-training of deep bidirectional transformers for language understanding"},{"key":"key2022031408455812500_ref010","first-page":"3","article-title":"Using deep neural networks for extracting sentiment targets in Arabic Tweet","volume-title":"Intelligent Natural Language Processing: Trends and Applications","year":"2018"},{"issue":"7","key":"key2022031408455812500_ref011","first-page":"315","article-title":"Status of text-mining techniques applied to biomedical text","volume":"11","year":"2006","journal-title":"Drug Discovery Today"},{"year":"1998","key":"key2022031408455812500_ref012","article-title":"Oki electric industry: description of the oki system as used for MET-2"},{"first-page":"4805","article-title":"Spottune: transfer learning through adaptive fine-tuning","year":"2019","key":"key2022031408455812500_ref013"},{"issue":"22","key":"key2022031408455812500_ref014","doi-asserted-by":"crossref","first-page":"2983","DOI":"10.1093\/bioinformatics\/btp535","article-title":"A dictionary to identify small molecules and drugs in free text","volume":"25","year":"2009","journal-title":"Bioinformatics"},{"key":"key2022031408455812500_ref015","unstructured":"iBond (2014), iBonD 2.0 Version was Enriched!, available at: http:\/\/ibond.nankai.edu.cn\/ (accessed 30 January 2021)."},{"issue":"1","key":"key2022031408455812500_ref016","first-page":"1","article-title":"OSCAR4: a flexible architecture for chemical text-mining","volume":"3","year":"2011","journal-title":"Journal of Cheminformatics"},{"first-page":"22","article-title":"Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations","year":"2004","key":"key2022031408455812500_ref017"},{"issue":"4","key":"key2022031408455812500_ref018","doi-asserted-by":"crossref","first-page":"544","DOI":"10.1021\/ci980324v","article-title":"Extraction of information from the text of chemical patents. 1. identification of specific chemical names","volume":"38","year":"1998","journal-title":"Journal of Chemical Information and Computer Sciences"},{"issue":"S1","key":"key2022031408455812500_ref019","doi-asserted-by":"crossref","first-page":"S12","DOI":"10.1186\/1758-2946-7-S1-S12","article-title":"Chemical entity extraction using CRF and an ensemble of extractors","volume":"7","year":"2015","journal-title":"Journal of Cheminformatics"},{"year":"2008","key":"key2022031408455812500_ref020","article-title":"Chemical names: terminological resources and corpora annotation"},{"issue":"S1","key":"key2022031408455812500_ref021","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1758-2946-7-S1-S1","article-title":"CHEMDNER: the drugs and chemical names extraction challenge","volume":"7","year":"2015","journal-title":"Journal of Cheminformatics"},{"year":"2001","key":"key2022031408455812500_ref022","article-title":"Conditional random fields: probabilistic models for segmenting and labeling sequence data"},{"year":"2019","key":"key2022031408455812500_ref023","article-title":"Albert: a lite bert for self-supervised learning of language representations"},{"year":"2019","key":"key2022031408455812500_ref024","article-title":"An analysis of pre-training on object detection"},{"year":"2019","key":"key2022031408455812500_ref025","article-title":"Roberta: a robustly optimized bert pretraining approach"},{"year":"2019","key":"key2022031408455812500_ref026","article-title":"Evolution of transfer learning in natural language processing"},{"year":"2013","key":"key2022031408455812500_ref027","article-title":"Efficient estimation of word representations in vector space"},{"first-page":"28","volume-title":"Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with Joint BERT-CRF Model","year":"2019","key":"key2022031408455812500_ref028"},{"first-page":"1532","article-title":"Glove: global vectors for word representation","year":"2014","key":"key2022031408455812500_ref029"},{"year":"2018","key":"key2022031408455812500_ref030","article-title":"Deep contextualized word representations"},{"issue":"3","key":"key2022031408455812500_ref031","doi-asserted-by":"crossref","first-page":"392","DOI":"10.1007\/s12204-018-1954-5","article-title":"Research of clinical named entity recognition based on bi-lstm-crf","volume":"23","year":"2018","journal-title":"Journal of Shanghai Jiaotong University (Science)"},{"key":"key2022031408455812500_ref032","unstructured":"Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018), \u201cImproving language understanding by generative pre-training\u201d, available at: https:\/\/s3-us-west-2.amazonaws.com\/openai-assets\/researchcovers\/languageunsupervised\/languageunderstandingpaper.pdf."},{"issue":"8","key":"key2022031408455812500_ref033","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","year":"2019","journal-title":"OpenAI Blog"},{"issue":"12","key":"key2022031408455812500_ref034","doi-asserted-by":"crossref","first-page":"1633","DOI":"10.1093\/bioinformatics\/bts183","article-title":"ChemSpot: a hybrid system for chemical named entity recognition","volume":"28","year":"2012","journal-title":"Bioinformatics"},{"issue":"11","key":"key2022031408455812500_ref035","first-page":"67","article-title":"A hybrid approach to Arabic named entity recognition","volume":"40","year":"2014","journal-title":"Journal of Information Science"},{"year":"2019","key":"key2022031408455812500_ref036","article-title":"Ernie: enhanced representation through knowledge integration"},{"year":"2019","key":"key2022031408455812500_ref037","article-title":"Ernie 2.0: a continual pre-training framework for language understanding"},{"volume-title":"The Fourth Paradigm: Data-Intensive Scientific Discovery","year":"2009","key":"key2022031408455812500_ref038"},{"issue":"8","key":"key2022031408455812500_ref039","first-page":"18","article-title":"A summary of technical methods for entity and relation extraction","volume":"24","year":"2008","journal-title":"Modern Library and Information Technology"},{"first-page":"1785","article-title":"Classifying relations via long short term memory networks along shortest dependency paths","year":"2015","key":"key2022031408455812500_ref040"},{"issue":"11","key":"key2022031408455812500_ref041","first-page":"1285","article-title":"Organic Bond Energy Database (iBonD) is freely open to the academic community","volume":"52","year":"2016","journal-title":"Physical Testing and Chemical Analysis Part B: Chemical Analysis"},{"key":"key2022031408455812500_ref042","first-page":"5754","article-title":"Xlnet: generalized autoregressive pretraining for language understanding","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"year":"2019","key":"key2022031408455812500_ref043","article-title":"ERNIE: enhanced language representation with informative entities"}],"container-title":["Data Technologies and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/DTA-11-2020-0284\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/DTA-11-2020-0284\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T23:15:28Z","timestamp":1753398928000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/dta\/article\/56\/2\/205-222\/101798"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,22]]},"references-count":43,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2021,10,22]]},"published-print":{"date-parts":[[2022,3,15]]}},"alternative-id":["10.1108\/DTA-11-2020-0284"],"URL":"https:\/\/doi.org\/10.1108\/dta-11-2020-0284","relation":{},"ISSN":["2514-9288"],"issn-type":[{"type":"print","value":"2514-9288"}],"subject":[],"published":{"date-parts":[[2021,10,22]]}}}