{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T18:15:31Z","timestamp":1754158531779,"version":"3.41.2"},"reference-count":43,"publisher":"Emerald","issue":"4","license":[{"start":{"date-parts":[[2022,7,27]],"date-time":"2022-07-27T00:00:00Z","timestamp":1658880000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["EL"],"published-print":{"date-parts":[[2022,8,8]]},"abstract":"<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title>\n<jats:p>The purpose of this paper is to propose an extensible framework for extracting data set usage from research articles.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title>\n<jats:p>The framework uses a training set of manually labeled examples to identify word features surrounding data set usage references. Using the word features and general entity identifiers, candidate data sets are extracted and scored separately at the sentence and document levels. Finally, the extracted data set references can be verified by the authors using a web-based verification module.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Findings<\/jats:title>\n<jats:p>This paper successfully addresses a significant gap in entity extraction literature by focusing on data set extraction. In the process, this paper: identified an entity-extraction scenario with specific characteristics that enable a multiphase approach, including a feasible author-verification step; defined the search space for word feature identification; defined scoring functions for sentences and documents; and designed a simple web-based author verification step. The framework is successfully tested on 178 articles authored by researchers from a large research organization.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title>\n<jats:p>Whereas previous approaches focused on completely automated large-scale entity recognition from text snippets, the proposed framework is designed for a longer, high-quality text, such as a research publication. The framework includes a verification module that enables the request validation of the discovered entities by the authors of the research publications. This module shares some similarities with general crowdsourcing approaches, but the target scenario increases the likelihood of meaningful author participation.<\/jats:p>\n<\/jats:sec>","DOI":"10.1108\/el-03-2022-0071","type":"journal-article","created":{"date-parts":[[2022,7,26]],"date-time":"2022-07-26T07:29:49Z","timestamp":1658820589000},"page":"453-471","source":"Crossref","is-referenced-by-count":0,"title":["Framework for entity extraction with verification: application to inference of data set usage in research publications"],"prefix":"10.1108","volume":"40","author":[{"given":"Svetlozar","family":"Nestorov","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dinko","family":"Ba\u010di\u0107","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nenad","family":"Juki\u0107","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mary","family":"Malliaris","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"140","published-online":{"date-parts":[[2022,7,27]]},"reference":[{"issue":"8","key":"key2022080411052808600_ref001","doi-asserted-by":"publisher","first-page":"888","DOI":"10.1002\/asi.24172","article-title":"Digital data archives as knowledge infrastructures: mediating data sharing and reuse","volume":"70","year":"2019","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"1","key":"key2022080411052808600_ref002","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1108\/PROG-06-2016-0048","article-title":"The data-literature interlinking service","volume":"51","year":"2017","journal-title":"Program"},{"issue":"6","key":"key2022080411052808600_ref003","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1002\/asi.24583","article-title":"Discovering emerging topics in textual corpora of galleries, libraries, archives, and museums institutions","volume":"73","year":"2021","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"1","key":"key2022080411052808600_ref004","doi-asserted-by":"publisher","first-page":"176","DOI":"10.2218\/ijdc.v9i1.311","article-title":"Building a bridge between journal articles and research data: the PKP-Dataverse Integration Project","volume":"9","year":"2014","journal-title":"International Journal of Digital Curation"},{"issue":"5","key":"key2022080411052808600_ref005","doi-asserted-by":"publisher","first-page":"936","DOI":"10.1108\/JD-09-2017-0133","article-title":"Mining user queries with information extraction methods and linked data","volume":"74","year":"2018","journal-title":"Journal of Documentation"},{"key":"key2022080411052808600_ref006","unstructured":"Clark, C. and Diwala, S. (2015), \u201cLooking beyond text: extracting figures, tables, and captions from computer science papers\u201d, AAAI Workshop \u2013 Technical Report, WS-15-13, pp. 2-8."},{"year":"1999","key":"key2022080411052808600_ref007","article-title":"Unsupervised models for named entity classification"},{"key":"key2022080411052808600_ref008","first-page":"2493","article-title":"Natural language processing (almost) from scratch","volume":"12","year":"2011","journal-title":"Journal of Machine Learning Research, Article"},{"issue":"1\/3","key":"key2022080411052808600_ref009","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1023\/A:1007537716579","article-title":"Similarity-based models of word cooccurrence probabilities","volume":"34","year":"1999","journal-title":"Machine Learning"},{"issue":"1","key":"key2022080411052808600_ref010","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1108\/EL-01-2018-0012","article-title":"Method for automatic key concepts extraction: application to documents in the domain of nuclear reactors","volume":"37","year":"2019","journal-title":"The Electronic Library"},{"issue":"2","key":"key2022080411052808600_ref011","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1162\/99608f92.e38165eb","article-title":"Lost or found? Discovering data needed for research","volume":"2","year":"2020","journal-title":"Harvard Data Science Review"},{"issue":"5","key":"key2022080411052808600_ref012","doi-asserted-by":"publisher","first-page":"419","DOI":"10.1002\/asi.24165","article-title":"Searching data: a review of observational data retrieval practices in selected disciplines","volume":"70","year":"2019","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"key2022080411052808600_ref013","doi-asserted-by":"publisher","first-page":"281","DOI":"10.1145\/2756406.2756952","article-title":"The RMap project: capturing and preserving associations amongst multi-part distributed publications","year":"2015"},{"key":"key2022080411052808600_ref014","unstructured":"Hanson, R.H. (1978), \u201cThe current population survey: design and methodology\u201d, Technical Paper 40, Department of Commerce, Bureau of the Census, Washington, DC."},{"issue":"6","key":"key2022080411052808600_ref015","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.ipm.2020.102305","article-title":"An end-to-end joint model for evidence information extraction from court record document","volume":"57","year":"2020","journal-title":"Information Processing and Management"},{"issue":"1","key":"key2022080411052808600_ref016","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1108\/JD-03-2020-0036","article-title":"From data to knowledge: the relationships between vocabularies, linked data and knowledge graphs","volume":"77","year":"2021","journal-title":"Journal of Documentation"},{"issue":"6","key":"key2022080411052808600_ref017","first-page":"1930","article-title":"A comparative study of stemming algorithms","volume":"2","year":"2011","journal-title":"International Journal of Computer Technology and Applications"},{"issue":"2","key":"key2022080411052808600_ref018","doi-asserted-by":"publisher","first-page":"473","DOI":"10.1093\/logcom\/exs079","article-title":"A system for named entity recognition based on local grammars","volume":"24","year":"2014","journal-title":"Journal of Logic and Computation"},{"first-page":"327","article-title":"NERosetta for the named entity multi-lingual space","year":"2013","key":"key2022080411052808600_ref019"},{"first-page":"1085","article-title":"Automatic evaluation of text coherence: models and representations","year":"2005","key":"key2022080411052808600_ref020"},{"key":"key2022080411052808600_ref021","doi-asserted-by":"publisher","first-page":"721","DOI":"10.1145\/2348283.2348380","article-title":"TwiNER: named entity recognition in targeted twitter stream","year":"2012"},{"issue":"1","key":"key2022080411052808600_ref022","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1109\/TKDE.2020.2981314","article-title":"A survey on deep learning for named entity recognition","volume":"34","year":"2020","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"issue":"2","key":"key2022080411052808600_ref023","doi-asserted-by":"publisher","first-page":"215","DOI":"10.1108\/02640470810864109","article-title":"Event\u2010based knowledge extraction from free\u2010text descriptions for art images by using semantic role labeling approaches","volume":"26","year":"2008","journal-title":"The Electronic Library"},{"issue":"1\/2","key":"key2022080411052808600_ref024","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1007\/s10791-015-9262-2","article-title":"Biomedical term extraction: overview and a new methodology","volume":"19","year":"2016","journal-title":"Information Retrieval Journal"},{"key":"key2022080411052808600_ref025","doi-asserted-by":"publisher","first-page":"1990","DOI":"10.18653\/v1\/P18-1185","article-title":"Visual attention model for name tagging in multimodal social media","year":"2018"},{"issue":"2","key":"key2022080411052808600_ref026","doi-asserted-by":"publisher","first-page":"365","DOI":"10.1108\/EL-10-2018-0196","article-title":"KEFST: a knowledge extraction framework using finite-state transducers","volume":"37","year":"2019","journal-title":"The Electronic Library"},{"key":"key2022080411052808600_ref027","doi-asserted-by":"publisher","first-page":"55","DOI":"10.3115\/v1\/P14-5010","article-title":"The Stanford CoreNLP natural language processing toolkit","year":"2014"},{"issue":"1","key":"key2022080411052808600_ref028","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1075\/li.30.1.03nad","article-title":"A survey of named entity recognition and classification","volume":"30","year":"2007","journal-title":"Lingvisticae Investigationes"},{"key":"key2022080411052808600_ref029","doi-asserted-by":"publisher","first-page":"228","DOI":"10.18653\/v1\/S15-1028","article-title":"Implicit entity recognition in clinical documents","year":"2015"},{"first-page":"23","article-title":"Random walks for text semantic similarity","year":"2009","key":"key2022080411052808600_ref030"},{"first-page":"147","article-title":"Design challenges and misconceptions in named entity recognition","year":"2009","key":"key2022080411052808600_ref031"},{"volume-title":"Meta-NET White Paper Series","year":"2012","key":"key2022080411052808600_ref032"},{"issue":"2","key":"key2022080411052808600_ref033","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1145\/276305.276335","article-title":"Integrating association rule mining with relational database systems: alternatives and implications","volume":"27","year":"1998","journal-title":"ACM SIGMOD Record"},{"issue":"1","key":"key2022080411052808600_ref034","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/2041-1480-4-27","article-title":"FlexiTerm: a flexible term recognition method","volume":"4","year":"2013","journal-title":"Journal of Biomedical Semantics"},{"issue":"5\/6","key":"key2022080411052808600_ref035","doi-asserted-by":"publisher","first-page":"905","DOI":"10.1108\/EL-03-2020-0052","article-title":"HerCulB: content-based information extraction and retrieval for cultural heritage of the Balkans","volume":"38","year":"2020","journal-title":"The Electronic Library"},{"issue":"3","key":"key2022080411052808600_ref036","doi-asserted-by":"publisher","first-page":"e3001129","DOI":"10.1371\/journal.pbio.3001129","article-title":"From reductionism to reintegration: solving society\u2019s most pressing problems requires building bridges between data types across the life sciences","volume":"19","year":"2021","journal-title":"PLOS Biology"},{"issue":"6","key":"key2022080411052808600_ref037","doi-asserted-by":"publisher","first-page":"993","DOI":"10.1108\/EL-11-2017-0239","article-title":"Managing mining project documentation using human language technology","volume":"36","year":"2018","journal-title":"The Electronic Library"},{"key":"key2022080411052808600_ref038","first-page":"45","article-title":"Support vector machine active learning with applications to text classification","volume":"2","year":"2002","journal-title":"Journal of Machine Learning Research"},{"issue":"5","key":"key2022080411052808600_ref039","doi-asserted-by":"publisher","first-page":"589","DOI":"10.1007\/s10791-006-9005-5","article-title":"Table extraction for answer retrieval","volume":"9","year":"2006","journal-title":"Information Retrieval"},{"issue":"3","key":"key2022080411052808600_ref040","doi-asserted-by":"publisher","first-page":"411","DOI":"10.1108\/EL-10-2020-0304","article-title":"Towards an entity relation extraction framework in the cross-lingual context","volume":"39","year":"2021","journal-title":"The Electronic Library"},{"issue":"4","key":"key2022080411052808600_ref041","doi-asserted-by":"publisher","first-page":"724","DOI":"10.1108\/EL-09-2016-0198","article-title":"Semantically linking events for massive scientific literature research","volume":"35","year":"2017","journal-title":"The Electronic Library"},{"issue":"3","key":"key2022080411052808600_ref042","doi-asserted-by":"publisher","first-page":"469","DOI":"10.1108\/EL-11-2020-0320","article-title":"An exploratory analysis: extracting materials science knowledge from unstructured scholarly data","volume":"39","year":"2021","journal-title":"The Electronic Library"},{"issue":"2","key":"key2022080411052808600_ref043","doi-asserted-by":"publisher","first-page":"480","DOI":"10.1002\/asi.23677","article-title":"The use of a graph-based system to improve bibliographic information retrieval: system design, implementation, and evaluation","volume":"68","year":"2017","journal-title":"Journal of the Association for Information Science and Technology"}],"container-title":["The Electronic Library"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/EL-03-2022-0071\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/EL-03-2022-0071\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T01:06:59Z","timestamp":1753405619000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/el\/article\/40\/4\/453-471\/42284"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7,27]]},"references-count":43,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2022,7,27]]},"published-print":{"date-parts":[[2022,8,8]]}},"alternative-id":["10.1108\/EL-03-2022-0071"],"URL":"https:\/\/doi.org\/10.1108\/el-03-2022-0071","relation":{},"ISSN":["0264-0473","0264-0473"],"issn-type":[{"type":"print","value":"0264-0473"},{"type":"electronic","value":"0264-0473"}],"subject":[],"published":{"date-parts":[[2022,7,27]]}}}