{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T06:48:04Z","timestamp":1776235684330,"version":"3.50.1"},"reference-count":57,"publisher":"MIT Press","issue":"1","license":[{"start":{"date-parts":[[2021,3,8]],"date-time":"2021-03-08T00:00:00Z","timestamp":1615161600000},"content-version":"vor","delay-in-days":66,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,4,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Wikipedia\u2019s content is based on reliable and published sources. To this date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive data set of citations extracted from Wikipedia. We extracted 29.3 million citations from 6.1 million English Wikipedia articles as of May 2020, and classified them as books, journal articles, or Web content. We were thus able to extract 4.0 million citations to scholarly publications with known identifiers\u2014including DOI, PMC, PMID, and ISBN\u2014and further equip an extra 261 thousand citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. 
We release our code to allow the community to extend upon our work and update the data set in the future.<\/jats:p>","DOI":"10.1162\/qss_a_00105","type":"journal-article","created":{"date-parts":[[2021,2,2]],"date-time":"2021-02-02T15:42:55Z","timestamp":1612280575000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":22,"title":["Wikipedia citations: A comprehensive data set of citations with identifiers extracted from English Wikipedia"],"prefix":"10.1162","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0517-1576","authenticated-orcid":true,"given":"Harshdeep","family":"Singh","sequence":"first","affiliation":[{"name":"Data Science Laboratory, EPFL"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3984-1232","authenticated-orcid":true,"given":"Robert","family":"West","sequence":"additional","affiliation":[{"name":"Data Science Laboratory, EPFL"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9806-084X","authenticated-orcid":true,"given":"Giovanni","family":"Colavizza","sequence":"additional","affiliation":[{"name":"Institute for Logic, Language and Computation, University of Amsterdam"}]}],"member":"281","published-online":{"date-parts":[[2021,4,8]]},"reference":[{"issue":"2","key":"2021041320050772900_bib1","doi-asserted-by":"crossref","first-page":"e0228713","DOI":"10.1371\/journal.pone.0228713","article-title":"Science through Wikipedia: A novel representation of open knowledge through co-citation networks","volume":"15","author":"Arroyo-Machado","year":"2020","journal-title":"PLOS ONE"},{"key":"2021041320050772900_bib2","doi-asserted-by":"crossref","first-page":"1188","DOI":"10.1145\/3308560.3316757","article-title":"A graph-structured dataset for Wikipedia Research","volume-title":"Companion Proceedings of the 2019 World Wide Web 
Conference","author":"Aspert","year":"2019"},{"issue":"1","key":"2021041320050772900_bib3","doi-asserted-by":"crossref","first-page":"363","DOI":"10.1162\/qss_a_00018","article-title":"Web of Science as a data source for research on scientific and scholarly activity","volume":"1","author":"Birkle","year":"2020","journal-title":"Quantitative Science Studies"},{"key":"2021041320050772900_bib4","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2021041320050772900_bib5","volume-title":"Atlas of science: Visualizing what we know","author":"B\u00f6rner","year":"2010"},{"issue":"2","key":"2021041320050772900_bib6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1515\/jdis-2017-0006","article-title":"Science mapping: A systematic review of the literature","volume":"2","author":"Chen","year":"2017","journal-title":"Journal of Data and Information Science"},{"key":"2021041320050772900_bib7","doi-asserted-by":"crossref","DOI":"10.1145\/2462932.2462943","article-title":"{{citation needed}}: The dynamics of referencing in Wikipedia","volume-title":"Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration","author":"Chen","year":"2012"},{"issue":"4","key":"2021041320050772900_bib8","doi-asserted-by":"crossref","first-page":"1349","DOI":"10.1162\/qss_a_00080","article-title":"COVID-19 research in Wikipedia","volume":"1","author":"Colavizza","year":"2020","journal-title":"Quantitative Science Studies"},{"key":"2021041320050772900_bib9","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1007\/1-4020-4102-0_19","article-title":"Using hedges to classify citations in scientific articles","volume-title":"Computing attitude and affect in text: Theory and applications","author":"Di 
Marco","year":"2006"},{"key":"2021041320050772900_bib10","first-page":"623","article-title":"Ensemble-style self-training on citation classification","volume-title":"Proceedings of 5th International Joint Conference on Natural Language Processing","author":"Dong","year":"2011"},{"key":"2021041320050772900_bib11","article-title":"Wikidata from a research perspective\u2014A systematic mapping study of Wikidata","author":"Farda-Sarbas","year":"2019","journal-title":"arXiv:1908.11153"},{"key":"2021041320050772900_bib12","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1145\/2983323.2983808","article-title":"Finding news citations for Wikipedia","volume-title":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","author":"Fetahu","year":"2016"},{"key":"2021041320050772900_bib13","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1145\/3148330.3148347","article-title":"Information fortification: An on-line citation behavior","volume-title":"Proceedings of the 2018 ACM Conference on Supporting Groupwork\u2014GROUP \u201918","author":"Forte","year":"2018"},{"key":"2021041320050772900_bib14","doi-asserted-by":"crossref","DOI":"10.1145\/2491055.2491061","article-title":"When the levee breaks: Without bots, what happens to Wikipedia\u2019s quality control processes?","volume-title":"Proceedings of the 9th International Symposium on Open Collaboration","author":"Geiger","year":"2013"},{"key":"2021041320050772900_bib15","article-title":"Citations with identifiers in Wikipedia","author":"Halfaker","year":"2018","journal-title":"Figshare"},{"issue":"1","key":"2021041320050772900_bib16","doi-asserted-by":"crossref","first-page":"e14","DOI":"10.2196\/jmir.1589","article-title":"Wikipedia: A key tool for global public health promotion","volume":"13","author":"Heilman","year":"2011","journal-title":"Journal of Medical Internet 
Research"},{"key":"2021041320050772900_bib17","doi-asserted-by":"crossref","first-page":"717","DOI":"10.1145\/3041021.3053375","article-title":"Bias in Wikipedia","volume-title":"Proceedings of the 26th International Conference on World Wide Web Companion\u2014WWW \u201917 Companion","author":"Hube","year":"2017"},{"issue":"7","key":"2021041320050772900_bib18","doi-asserted-by":"crossref","first-page":"1773","DOI":"10.1002\/asi.23691","article-title":"Bridging the gap between Wikipedia and academia","volume":"67","author":"Jemielniak","year":"2016","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"2021041320050772900_bib19","doi-asserted-by":"crossref","DOI":"10.1145\/2038558.2038577","article-title":"Hot off the Wiki: Dynamics, practices, and structures in Wikipedia\u2019s coverage of the T\u014dhoku catastrophes","volume-title":"Proceedings of the 7th International Symposium on Wikis and Open Collaboration\u2014WikiSym \u201911","author":"Keegan","year":"2011"},{"key":"2021041320050772900_bib20","volume-title":"Adam: A method for stochastic optimization","author":"Kingma","year":"2014"},{"issue":"3","key":"2021041320050772900_bib21","doi-asserted-by":"crossref","first-page":"762","DOI":"10.1002\/asi.23694","article-title":"Are Wikipedia citations important evidence of the impact of scholarly articles and books?","volume":"68","author":"Kousha","year":"2017","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"2021041320050772900_bib22","doi-asserted-by":"crossref","first-page":"591","DOI":"10.1145\/2872427.2883085","article-title":"Disinformation on the web: Impact, characteristics, and detection of Wikipedia hoaxes","volume-title":"Proceedings of the 25th International Conference on World Wide Web","author":"Kumar","year":"2016"},{"issue":"4","key":"2021041320050772900_bib23","doi-asserted-by":"crossref","first-page":"471","DOI":"10.1197\/jamia.M3059","article-title":"Seeking 
health information online: Does Wikipedia matter?","volume":"16","author":"Laurent","year":"2009","journal-title":"Journal of the American Medical Informatics Association"},{"issue":"2","key":"2021041320050772900_bib24","doi-asserted-by":"crossref","first-page":"167","DOI":"10.3233\/SW-140134","article-title":"DBpedia\u2014A large-scale, multilingual knowledge base extracted from Wikipedia","volume":"6","author":"Lehmann","year":"2015","journal-title":"Semantic Web"},{"key":"2021041320050772900_bib25","doi-asserted-by":"crossref","first-page":"561","DOI":"10.1007\/978-3-319-67642-5_47","article-title":"Analysis of references across Wikipedia languages","volume-title":"Information and software technologies","author":"Lewoniewski","year":"2017"},{"key":"2021041320050772900_bib26","doi-asserted-by":"crossref","first-page":"e52426","DOI":"10.7554\/eLife.52426","article-title":"Reader engagement with medical content on Wikipedia","volume":"9","author":"Maggio","year":"2020","journal-title":"eLife"},{"issue":"12","key":"2021041320050772900_bib27","doi-asserted-by":"crossref","first-page":"e0190046","DOI":"10.1371\/journal.pone.0190046","article-title":"Wikipedia as a gateway to biomedical research: The relative distribution and use of citations in the English Wikipedia","volume":"12","author":"Maggio","year":"2017","journal-title":"PLOS ONE"},{"key":"2021041320050772900_bib28","article-title":"Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations\u2019 COCI: A multidisciplinary comparison of coverage via citations","author":"Mart\u00edn-Mart\u00edn","year":"2020","journal-title":"Scientometrics"},{"key":"2021041320050772900_bib29","doi-asserted-by":"crossref","DOI":"10.1609\/icwsm.v11i1.14883","article-title":"The substantial interdependence of Wikipedia and Google: A case study on the relationship between peer production communities and information technologies","volume-title":"Proceedings of the Eleventh International AAAI Conference 
on Web and Social Media","author":"McMahon","year":"2017"},{"issue":"2","key":"2021041320050772900_bib30","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1002\/asi.23172","article-title":"\u201cThe sum of all human knowledge\u201d: A systematic review of scholarly research on the content of Wikipedia","volume":"66","author":"Mesgari","year":"2015","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"8","key":"2021041320050772900_bib31","article-title":"Scientific citations in Wikipedia","volume":"12","author":"Nielsen","year":"2007","journal-title":"First Monday"},{"key":"2021041320050772900_bib32","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1007\/978-3-319-70407-4_36","article-title":"Scholia, Scientometrics and Wikidata","volume-title":"The Semantic Web: ESWC 2017 Satellite Events","author":"Nielsen","year":"2017"},{"key":"2021041320050772900_bib33","doi-asserted-by":"crossref","DOI":"10.2139\/ssrn.2021326","article-title":"The people\u2019s encyclopedia under the gaze of the sages: A systematic review of scholarly research on Wikipedia","author":"Okoli","year":"2012","journal-title":"SSRN Electronic Journal"},{"key":"2021041320050772900_bib34","doi-asserted-by":"crossref","first-page":"615","DOI":"10.1145\/2835776.2835832","article-title":"Improving website hyperlink structure using server logs","volume-title":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","author":"Paranjape","year":"2016"},{"key":"2021041320050772900_bib35","doi-asserted-by":"crossref","first-page":"2365","DOI":"10.1145\/3366423.3380300","article-title":"Quantifying engagement with citations on Wikipedia","volume-title":"Proceedings of The Web Conference 2020","author":"Piccardi","year":"2020"},{"key":"2021041320050772900_bib36","doi-asserted-by":"crossref","first-page":"542","DOI":"10.1007\/978-3-319-68288-4_32","article-title":"Provenance information in a collaborative knowledge graph: An 
evaluation of Wikidata external references","volume-title":"The Semantic Web\u2014ISWC 2017","author":"Piscopo","year":"2017"},{"key":"2021041320050772900_bib37","doi-asserted-by":"crossref","DOI":"10.1145\/3306446.3340822","article-title":"What we talk about when we talk about Wikidata quality: A literature survey","volume-title":"Proceedings of the 15th International Symposium on Open Collaboration","author":"Piscopo","year":"2019"},{"issue":"1","key":"2021041320050772900_bib38","doi-asserted-by":"crossref","first-page":"455","DOI":"10.1007\/s11192-017-2474-z","article-title":"Methodological issues in measuring citations in Wikipedia: A case study in Library and Information Science","volume":"113","author":"Pooladian","year":"2017","journal-title":"Scientometrics"},{"key":"2021041320050772900_bib39","doi-asserted-by":"crossref","DOI":"10.1145\/1316624.1316663","article-title":"Creating, destroying, and restoring value in Wikipedia","volume-title":"Proceedings of the 2007 International ACM Conference on Conference on Supporting Group Work","author":"Priedhorsky","year":"2007"},{"key":"2021041320050772900_bib40","article-title":"Altmetrics in the wild: Using social media to explore scholarly impact","author":"Priem","year":"2012"},{"key":"2021041320050772900_bib41","doi-asserted-by":"crossref","first-page":"1567","DOI":"10.1145\/3308558.3313618","article-title":"Citation needed: A taxonomy and algorithmic assessment of Wikipedia\u2019s verifiability","volume-title":"Proceedings of the World Wide Web Conference","author":"Redi","year":"2019"},{"issue":"11","key":"2021041320050772900_bib42","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Transactions on Signal Processing"},{"key":"2021041320050772900_bib43","first-page":"1122","article-title":"Evolution of Wikipedia\u2019s medical content: Past, present and 
future","volume":"71","author":"Shafee","year":"2017","journal-title":"Journal of Epidemiology and Community Health"},{"issue":"Supplement 1","key":"2021041320050772900_bib44","doi-asserted-by":"crossref","first-page":"5183","DOI":"10.1073\/pnas.0307852100","article-title":"Mapping knowledge domains","volume":"101","author":"Shiffrin","year":"2004","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"2021041320050772900_bib45","doi-asserted-by":"crossref","DOI":"10.1145\/2467696.2467746","article-title":"A comparative study of academic and Wikipedia ranking","volume-title":"Proceedings of the 13th ACM\/IEEE-CS joint conference on Digital libraries\u2014JCDL \u201913","author":"Shuai","year":"2013"},{"key":"2021041320050772900_bib46","volume-title":"Wikipedia citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia","author":"Singh","year":"2020"},{"issue":"2","key":"2021041320050772900_bib47","doi-asserted-by":"crossref","first-page":"e0228786","DOI":"10.1371\/journal.pone.0228786","article-title":"Situating Wikipedia as a health information resource in various contexts: A scoping review","volume":"15","author":"Smith","year":"2020","journal-title":"PLOS ONE"},{"issue":"9","key":"2021041320050772900_bib48","doi-asserted-by":"crossref","first-page":"2037","DOI":"10.1002\/asi.23833","article-title":"Scholarly use of social media and altmetrics: A review of the literature","volume":"68","author":"Sugimoto","year":"2017","journal-title":"Journal of the Association for Information Science and Technology"},{"issue":"9","key":"2021041320050772900_bib49","doi-asserted-by":"crossref","first-page":"2116","DOI":"10.1002\/asi.23687","article-title":"Amplifying the impact of open access: Wikipedia and the diffusion of science","volume":"68","author":"Teplitskiy","year":"2017","journal-title":"Journal of the Association for Information Science and 
Technology"},{"key":"2021041320050772900_bib50","article-title":"Science is shaped by Wikipedia: Evidence from a randomized control trial","author":"Thompson","year":"2018","journal-title":"MIT Sloan Research Paper 5238-17"},{"issue":"3","key":"2021041320050772900_bib51","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1080\/0194262X.2016.1206052","article-title":"A study of citations to Wikipedia in scholarly publications","volume":"35","author":"Tomaszewski","year":"2016","journal-title":"Science & Technology Libraries"},{"issue":"3","key":"2021041320050772900_bib52","doi-asserted-by":"crossref","first-page":"793","DOI":"10.1016\/j.joi.2019.07.002","article-title":"Mapping the backbone of the humanities through the eyes of Wikipedia","volume":"13","author":"Torres-Salinas","year":"2019","journal-title":"Journal of Informetrics"},{"issue":"1","key":"2021041320050772900_bib53","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1002\/asi.24210","article-title":"Assessing the quality of information on Wikipedia: A deep-learning approach","volume":"71","author":"Wang","year":"2020","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"2021041320050772900_bib54","doi-asserted-by":"crossref","first-page":"975","DOI":"10.1145\/2872427.2883077","article-title":"Growing Wikipedia across languages via recommendation","volume-title":"Proceedings of the 25th International Conference on World Wide Web","author":"Wulczyn","year":"2016"},{"key":"2021041320050772900_bib55","article-title":"Using heterogeneous features for scientific citation classification","volume-title":"Proceedings of the 13th Conference of the Pacific Association for Computational Linguistics","author":"Xu","year":"2013"},{"key":"2021041320050772900_bib56","article-title":"\u2018I Updated the &lt;ref&gt;\u2019: The evolution of references in the English Wikipedia and the implications for 
altmetrics","author":"Zagovora","year":"2020","journal-title":"arXiv:2010.03083"},{"issue":"2","key":"2021041320050772900_bib57","doi-asserted-by":"crossref","first-page":"1491","DOI":"10.1007\/s11192-014-1264-0","article-title":"How well developed are altmetrics? A cross-disciplinary analysis of the presence of \u2018alternative metrics\u2019 in scientific publications","volume":"101","author":"Zahedi","year":"2014","journal-title":"Scientometrics"}],"container-title":["Quantitative Science Studies"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/direct.mit.edu\/qss\/article-pdf\/2\/1\/1\/1906624\/qss_a_00105.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/direct.mit.edu\/qss\/article-pdf\/2\/1\/1\/1906624\/qss_a_00105.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,14]],"date-time":"2022-12-14T10:32:54Z","timestamp":1671013974000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/qss\/article\/2\/1\/1\/97565\/Wikipedia-citations-A-comprehensive-data-set-of"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"references-count":57,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,4,8]]}},"URL":"https:\/\/doi.org\/10.1162\/qss_a_00105","relation":{},"ISSN":["2641-3337"],"issn-type":[{"value":"2641-3337","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021]]},"published":{"date-parts":[[2021]]}}}