{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T20:20:36Z","timestamp":1777753236086,"version":"3.51.4"},"reference-count":29,"publisher":"SAGE Publications","issue":"3-4","license":[{"start":{"date-parts":[[2016,8,1]],"date-time":"2016-08-01T00:00:00Z","timestamp":1470009600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Information Services and Use"],"published-print":{"date-parts":[[2016,8]]},"abstract":"<jats:p>\n                    Today, full-texts of scientific articles are often stored in different locations than the used datasets. Dataset registries aim at a closer integration by making datasets citable but authors typically refer to datasets using inconsistent abbreviations and heterogeneous metadata (e.g. title, publication year). It is thus hard to reproduce research results, to access datasets for further analysis, and to determine the impact of a dataset. Manually detecting references to datasets in scientific articles is time-consuming and requires expert knowledge in the underlying research domain. We propose and evaluate a semi-automatic three-step approach for finding explicit references to datasets in social sciences articles. We first extract pre-defined special features from dataset titles in the\n                    <jats:sans-serif>da|ra<\/jats:sans-serif>\n                    registry, then detect references to datasets using the extracted features, and finally match the references found with corresponding dataset titles. The approach does not require a corpus of articles (avoiding the cold start problem) and performs well on a test corpus. 
We achieved an F-measure of 0.84 for detecting references in full-texts and an F-measure of 0.83 for finding correct matches of detected references in the\n                    <jats:sans-serif>da|ra<\/jats:sans-serif>\n                    dataset registry.\n                  <\/jats:p>","DOI":"10.3233\/isu-160816","type":"journal-article","created":{"date-parts":[[2016,12,23]],"date-time":"2016-12-23T19:46:24Z","timestamp":1482522384000},"page":"171-187","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":0,"title":["A semi-automatic approach for detecting dataset references in social science texts"],"prefix":"10.1177","volume":"36","author":[{"given":"Behnam","family":"Ghavimi","sequence":"first","affiliation":[{"name":"GESIS \u2013 Leibniz Institute for the Social Sciences, Germany"},{"name":"Enterprise Information Systems (EIS), University of Bonn, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Philipp","family":"Mayr","sequence":"additional","affiliation":[{"name":"GESIS \u2013 Leibniz Institute for the Social Sciences, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christoph","family":"Lange","sequence":"additional","affiliation":[{"name":"Enterprise Information Systems (EIS), University of Bonn, Germany"},{"name":"Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sahar","family":"Vahdati","sequence":"additional","affiliation":[{"name":"Enterprise Information Systems (EIS), University of Bonn, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"S\u00f6ren","family":"Auer","sequence":"additional","affiliation":[{"name":"Enterprise Information Systems (EIS), University of Bonn, Germany"},{"name":"Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],
"member":"179","published-online":{"date-parts":[[2016,8]]},"reference":[{"issue":"3","key":"e_1_3_3_2_2","first-page":"196","article-title":"Rule based autonomous citation mining with tierl","volume":"8","author":"Afzal M.T.","year":"2010","unstructured":"AfzalM.T.MaurerH.BalkeW.-T. and KulathuramaiyerN., Rule based autonomous citation mining with tierl, Journal of Digital Information Management8(3) (2010), 196\u2013204.","journal-title":"Journal of Digital Information Management"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","unstructured":"AltmanM. and KingG. A proposed standard for the scholarly citation of quantitative data D-Lib Magazine13(3\/4) (2007). doi:10.1045\/march2007-altman.","DOI":"10.1045\/march2007-altman"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","unstructured":"BolandK.RitzeD.EckertK. and MathiakB. Identifying references to datasets in publications in: Proceedings of the Second International Conference on Theory and Practice of Digital Libraries (TDPL 2012) Springer 2012 pp. 150\u2013161. doi:10.1007\/978-3-642-33290-6_17.","DOI":"10.1007\/978-3-642-33290-6_17"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","unstructured":"CuiB.-G. and ChenX. An improved hidden Markov model for literature metadata extraction in: 6th International Conference on Intelligent Computing ICIC 2010 Changsha China 2010. doi:10.1007\/978-3-642-14922-1_26.","DOI":"10.1007\/978-3-642-14922-1_26"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","unstructured":"GhavimiB.MayrP.VahdatiS. and LangeC. Identifying and improving dataset references in social sciences full-texts in: Positioning and Power in Academic Publishing: Players Agents and Agendas IOS Press 2016 pp. 105\u2013114. 
doi:10.3233\/978-1-61499-649-1-105.","DOI":"10.3233\/978-1-61499-649-1-105"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","unstructured":"HanH.Lee GilesC.ManavogluE.ZhaH.ZhangZ. and FoxE.A. Automatic document metadata extraction using support vector machines in: Digital Libraries 2003. Proceedings. 2003 Joint Conference on ACM\/IEEE 2003 Joint Conference on Digital Libraries 2003 pp. 37\u201348. doi:10.1109\/JCDL.2003.1204842.","DOI":"10.1109\/JCDL.2003.1204842"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","unstructured":"HienertD.SawitzkiF. and MayrP. Digital library research in action: Supporting information retrieval in sowiport D-Lib Magazine21(3\/4) (2015). doi:10.1045\/march2015-hienert.","DOI":"10.1045\/march2015-hienert"},{"key":"e_1_3_3_9_2","unstructured":"JoachimsT. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization in: ICML 97 Proceedings of the Fourteenth International Conference on Machine Learning Morgan Kaufmann 1997 pp. 143\u2013151."},{"issue":"6","key":"e_1_3_3_10_2","first-page":"144","article-title":"Effective approaches for extraction of keywords","volume":"7","author":"Kaur J.","year":"2010","unstructured":"KaurJ. and GuptaV., Effective approaches for extraction of keywords, IJCSI International Journal of Computer Science7(6) (2010), 144\u2013148.","journal-title":"IJCSI International Journal of Computer Science"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","unstructured":"KernR.JackK.HristakevaM. and GranitzerM. Teambeam \u2013 meta-data extraction from scientific literature D-Lib Magazine18(7\/8) (2012). doi:10.1045\/july2012-kern.","DOI":"10.1045\/july2012-kern"},{"key":"e_1_3_3_12_2","unstructured":"KubalaF.SchwartzR.StoneR. and WeischedelR. Named entity extraction from speech in: Proceedings of DARPA Broadcast News Transcription and Understanding Workshop 1998 pp. 287\u2013292."},{"key":"e_1_3_3_13_2","doi-asserted-by":"crossref","unstructured":"LeeS. and KimH. 
News keyword extraction for topic tracking in: Networked Computing and Advanced Information Management 2008 (NCM 08) Vol. 2 IEEE 2008 pp. 554\u2013559.","DOI":"10.1109\/NCM.2008.199"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","unstructured":"LuM.BangaloreS.CormodeG.HadjieleftheriouM. and SrivastavaD. A dataset search engine for the research document corpus in: Data Engineering (ICDE) 2012 IEEE 28th International Conference on IEEE 2012 pp. 1237\u20131240. doi:10.1109\/ICDE.2012.80.","DOI":"10.1109\/ICDE.2012.80"},{"key":"e_1_3_3_15_2","unstructured":"ManningC.D. and Sch\u00fctzeH. Foundations of Statistical Natural Language Processing MIT Press Cambridge MA USA 1999. ISBN 0-262-13360-1."},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","unstructured":"MarinaiS. Metadata extraction from PDF papers for digital library ingest in: Document Analysis and Recognition 2009. ICDAR 09. 10th International Conference on IEEE 2009 pp. 251\u2013255. doi:10.1109\/ICDAR.2009.232.","DOI":"10.1109\/ICDAR.2009.232"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","unstructured":"MathiakB. and BolandK. Challenges in matching dataset citation strings to datasets in social science D-Lib Magazine21(1\/2) (2015). doi:10.1045\/january2015-mathiak.","DOI":"10.1045\/january2015-mathiak"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1075\/li.30.1.03nad"},{"key":"e_1_3_3_19_2","unstructured":"O\u2019NeilK. and PeplerS. 
Preservation intent and collection identifiers: Claddier project report ii 2008 http:\/\/purl.org\/net\/epubs\/work\/43640."},{"key":"e_1_3_3_20_2","first-page":"37","article-title":"Evaluation: From precision, recall and f-measure to roc, informedness, markedness and correlation","volume":"2","author":"Powers D.M.W.","year":"2011","unstructured":"PowersD.M.W., Evaluation: From precision, recall and f-measure to roc, informedness, markedness and correlation, International Journal of Machine Learning Technology2 (2011), 37\u201363.","journal-title":"International Journal of Machine Learning Technology"},{"key":"e_1_3_3_21_2","doi-asserted-by":"crossref","unstructured":"RenearA.H.SacchiS. and WickettK.M. Definitions of dataset in the scientific and technical literature in: In Proceedings of the American Society for Information Science and Technology Vol. 47 Wiley Online Library 2010 pp. 1\u20134.","DOI":"10.1002\/meet.14504701240"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","unstructured":"SahamiM. and HeilmanT.D. A web-based kernel function for measuring the similarity of short text snippets in: Proceedings of the 15th International Conference on World Wide Web WWW 06 ACM 2006 pp. 377\u2013386. doi:10.1145\/1135777.1135834.","DOI":"10.1145\/1135777.1135834"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/0306-4573(88)90021-0"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","unstructured":"SarawagiS. Information extraction in: Foundations and Trends in Information Retrieval in Databases Vol. 1 2007 pp. 261\u2013377. doi:10.1561\/1900000003.","DOI":"10.1561\/1900000003"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","unstructured":"SchaeferC.HienertD. and GottronT. Normalized relevance distance \u2013 a stable metric for computing semantic relatedness over reference corpora in: European Conference on Artificial Intelligence (ECAI) Vol. 263 2014 pp. 789\u2013794. 
doi:10.3233\/978-1-61499-419-0-789.","DOI":"10.3233\/978-1-61499-419-0-789"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","unstructured":"SinghalA. and SrivastavaJ. Data extract: Mining context from the web for dataset extraction International Journal of Machine Learning and Computing3(2) (2013). doi:10.7763\/IJMLC.2013.V3.306.","DOI":"10.7763\/IJMLC.2013.V3.306"},{"key":"e_1_3_3_27_2","doi-asserted-by":"crossref","unstructured":"TurneyP. Mining the web for synonyms: PMI-IR versus LSA on TOEFL in: Proceedings of the 12th European Conference on Machine Learning EMCL 01 Springer-Verlag 2001 pp. 491\u2013502 http:\/\/dl.acm.org\/citation.cfm?id=645328.650004.","DOI":"10.1007\/3-540-44795-4_42"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","unstructured":"VahdatiS.KarimF.HuangJ.-Y. and LangeC. Mapping large scale research metadata to linked data: A performance comparison of HBase CSV and XML in: Metadata and Semantics Research Communications in Computer and Information Science Springer 2015. doi:10.1007\/978-3-319-24129-6_23.","DOI":"10.1007\/978-3-319-24129-6_23"},{"issue":"3","key":"e_1_3_3_29_2","first-page":"1169","article-title":"Automatic keyword extraction from documents using conditional random fields","volume":"4","author":"Zhang C.","year":"2008","unstructured":"ZhangC.WangH.LiuY.WuD.LiaoY. and WangB., Automatic keyword extraction from documents using conditional random fields, Computational and Information Systems4(3) (2008), 1169\u20131180.","journal-title":"Computational and Information Systems"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","unstructured":"ZhangK.XuH.TangJ. and LiJ. Keyword extraction using support vector machine in: Proceedings of the 7th International Conference on Advances in Web-Age Information Management WAIM 06 Springer-Verlag 2006 pp. 85\u201396. 
doi:10.1007\/11775300_8.","DOI":"10.1007\/11775300_8"}],"container-title":["Information Services and Use"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/ISU-160816","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.3233\/ISU-160816","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/ISU-160816","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T12:02:22Z","timestamp":1777464142000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.3233\/ISU-160816"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,8]]},"references-count":29,"journal-issue":{"issue":"3-4","published-print":{"date-parts":[[2016,8]]}},"alternative-id":["10.3233\/ISU-160816"],"URL":"https:\/\/doi.org\/10.3233\/isu-160816","relation":{},"ISSN":["0167-5265","1875-8789"],"issn-type":[{"value":"0167-5265","type":"print"},{"value":"1875-8789","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,8]]}}}