{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T17:25:46Z","timestamp":1777397146680,"version":"3.51.4"},"reference-count":45,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2020,3,2]],"date-time":"2020-03-02T00:00:00Z","timestamp":1583107200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,3,2]],"date-time":"2020-03-02T00:00:00Z","timestamp":1583107200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Scientometrics"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios heavily depend on the used data set. However, existing scholarly data sets are limited in several regards. In this paper, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. Apart from providing the papers\u2019 plain text, in-text citations were annotated via global identifiers. Furthermore, citing and cited publications were linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, not only can enhance the future evaluation of research paper-based and citation context-based approaches, but also serve as a basis for new ways to analyze in-text citations, as we show prototypically in this article.<\/jats:p>","DOI":"10.1007\/s11192-020-03382-z","type":"journal-article","created":{"date-parts":[[2020,3,2]],"date-time":"2020-03-02T10:02:47Z","timestamp":1583143367000},"page":"3085-3108","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":31,"title":["unarXive: a large scholarly data set with publications\u2019 full-text, annotated in-text citations, and links to metadata"],"prefix":"10.1007","volume":"125","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5028-0109","authenticated-orcid":false,"given":"Tarek","family":"Saier","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5458-8645","authenticated-orcid":false,"given":"Michael","family":"F\u00e4rber","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,3,2]]},"reference":[{"key":"3382_CR1","unstructured":"Abu-Jbara, A., & Radev, D. (2012). Reference scope identification in citing sentences. In Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, association for computational linguistics, Stroudsburg, PA, USA (pp. 80\u201390)."},{"key":"3382_CR2","unstructured":"Abu-Jbara, A., Ezra, J., & Radev, D. (2013). Purpose and polarity of citation: Towards NLP-based bibliometrics. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies, association for computational linguistics, Atlanta, Georgia (pp. 596\u2013606)."},{"key":"3382_CR3","doi-asserted-by":"crossref","unstructured":"Bast, H., & Korzen, C. (2017). A benchmark and evaluation for text extraction from PDF. In Proceedings of the 2017 ACM\/IEEE joint conference on digital libraries, JCDL\u201917 (pp. 99\u2013108).","DOI":"10.1109\/JCDL.2017.7991564"},{"issue":"4","key":"3382_CR4","doi-asserted-by":"publisher","first-page":"305","DOI":"10.1007\/s00799-015-0156-0","volume":"17","author":"J Beel","year":"2016","unstructured":"Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Research-paper recommender systems: A literature survey. International Journal on Digital Libraries, 17(4), 305\u2013338. https:\/\/doi.org\/10.1007\/s00799-015-0156-0.","journal-title":"International Journal on Digital Libraries"},{"key":"3382_CR5","unstructured":"Bird, S., Dale, R., Dorr, B.J., Gibson, B.R., Joseph, M.T., Kan, M., Lee, D., Powley, B., Radev, D.R., & Tan, Y.F. (2008). The ACL anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In Proceedings of the sixth international conference on language resources and evaluation, LREC\u201908."},{"issue":"2","key":"3382_CR6","doi-asserted-by":"publisher","first-page":"101","DOI":"10.1214\/ss\/1009213286","volume":"16","author":"LD Brown","year":"2001","unstructured":"Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101\u2013133.","journal-title":"Statistical Science"},{"key":"3382_CR7","doi-asserted-by":"crossref","unstructured":"Caragea, C., Wu, J., Ciobanu, A.M., Williams, K., Ram\u00edrez, J.P.F., Chen, H., Wu, Z., & Giles, C.L. (2014). CiteSeer x : A scholarly big dataset. In Proceedings of the 36th European conference on IR research, ECIR\u201914 (pp. 311\u2013322).","DOI":"10.1007\/978-3-319-06028-6_26"},{"key":"3382_CR8","doi-asserted-by":"crossref","unstructured":"Chakraborty, T., & Narayanam, R. (2016). All fingers are not equal: Intensity of references in scientific articles. In Proceedings of the 2016 conference on empirical methods in natural language processing, EMNLP\u201916 (pp. 1348\u20131358).","DOI":"10.18653\/v1\/D16-1142"},{"key":"3382_CR9","doi-asserted-by":"crossref","unstructured":"Chandrasekaran, M.K., Yasunaga, M., Radev, D.R., Freitag, D., & Kan, M. (2019). Overview and results: CL-SciSumm shared task 2019. In Proceedings of the 4th joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries, BIRNDL\u201919, (pp. 153\u2013166).","DOI":"10.1145\/3331184.3331650"},{"issue":"3","key":"3382_CR10","doi-asserted-by":"publisher","first-page":"e4261","DOI":"10.1002\/cpe.4261","volume":"31","author":"J Chen","year":"2019","unstructured":"Chen, J., & Zhuge, H. (2019). Automatic generation of related work through summarizing citations. Concurrency and Computation: Practice and Experience, 31(3), e4261.","journal-title":"Concurrency and Computation: Practice and Experience"},{"key":"3382_CR11","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1045\/september2016-duma","volume":"22","author":"D Duma","year":"2016","unstructured":"Duma, D., Klein, E., Liakata, M., Ravenscroft, J., & Clare, A. (2016). Rhetorical classification of anchor text for citation recommendation. D-Lib Magazine, 22, 1.","journal-title":"D-Lib Magazine"},{"key":"3382_CR12","doi-asserted-by":"crossref","unstructured":"Ebesu, T., & Fang, Y. (2017). Neural citation network for context-aware citation recommendation. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, SIGIR\u201917, (pp. 1093\u20131096).","DOI":"10.1145\/3077136.3080730"},{"issue":"1","key":"3382_CR13","doi-asserted-by":"publisher","first-page":"51","DOI":"10.1002\/asi.20707","volume":"59","author":"A Elkiss","year":"2008","unstructured":"Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D. J., & Radev, D. R. (2008). Blind men and elephants: What do citation summaries tell us about a research article? Journal of the Association for Information Science and Technology, 59(1), 51\u201362. https:\/\/doi.org\/10.1002\/asi.20707.","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"3382_CR14","doi-asserted-by":"crossref","unstructured":"F\u00e4rber, M., & Sampath, A. (2019). Determining how citations are used in citation contexts. In Proceedings of the 23th international conference on theory and practice of digital libraries, TPDL\u201919.","DOI":"10.1007\/978-3-030-30760-8_38"},{"key":"3382_CR15","unstructured":"F\u00e4rber, M., Thiemann, A., & Jatowt, A. (2018). A high-quality gold standard for citation-based tasks. In Proceedings of the 11th international conference on language resources and evaluation, LREC\u201918."},{"key":"3382_CR16","doi-asserted-by":"publisher","unstructured":"Galke, L., Mai, F., Vagliano, I., & Scherp, A. (2018). Multi-modal adversarial autoencoders for recommendations of citations and subject labels. In Proceedings of the 26th conference on user modeling, adaptation and personalization, ACM, New York, NY, USA, UMAP \u201918 (pp. 197\u2013205). https:\/\/doi.org\/10.1145\/3209219.3209236.","DOI":"10.1145\/3209219.3209236"},{"key":"3382_CR17","doi-asserted-by":"crossref","unstructured":"Ghosh, S., Das, D., & Chakraborty, T. (2016). Determining sentiment in citation text and analyzing its impact on the proposed ranking index. In Proceedings of the 17th international conference on computational linguistics and intelligent text processing, CICLing\u201916 (pp. 292\u2013306).","DOI":"10.1007\/978-3-319-75487-1_23"},{"key":"3382_CR18","unstructured":"Gipp, B., Meuschke, N., & Lipinski, M. (2015). CITREC: An evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central. In Proceedings of the iConference 2015."},{"key":"3382_CR19","doi-asserted-by":"crossref","unstructured":"He, Q., Pei, J., Kifer, D., Mitra, P., & Giles, C.L. (2010). Context-aware citation recommendation. In Proceedings of the 19th international conference on world wide web, WWW\u201910, (pp. 421\u2013430).","DOI":"10.1145\/1772690.1772734"},{"key":"3382_CR20","doi-asserted-by":"crossref","unstructured":"Huang, W., Wu, Z., Liang, C., Mitra, P., & Giles, C.L. (2015). A neural probabilistic model for context based citation recommendation. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence, AAAI Press, AAAI\u201915 (pp. 2404\u20132410).","DOI":"10.1609\/aaai.v29i1.9528"},{"issue":"2","key":"3382_CR21","doi-asserted-by":"publisher","first-page":"99","DOI":"10.6087\/kcse.2014.1.99","volume":"1","author":"S Huh","year":"2014","unstructured":"Huh, S. (2014). Journal article tag suite 1.0: National information standards organization standard of journal extensible markup language. Science Editing, 1(2), 99\u2013104. https:\/\/doi.org\/10.6087\/kcse.2014.1.99.","journal-title":"Science Editing"},{"issue":"3","key":"3382_CR22","doi-asserted-by":"publisher","first-page":"341","DOI":"10.1093\/applin\/20.3.341","volume":"20","author":"K Hyland","year":"1999","unstructured":"Hyland, K. (1999). Academic attribution: Citation and the construction of disciplinary knowledge. Applied Linguistics, 20(3), 341\u2013367. https:\/\/doi.org\/10.1093\/applin\/20.3.341.","journal-title":"Applied Linguistics"},{"key":"3382_CR23","unstructured":"Lamers, W., Eck, N.J.v., Waltman, L., & Hoos, H. (2018). Patterns in citation context: the case of the field of scientometrics. In STI 2018 conference proceedings, centre for science and technology studies (CWTS) (pp 1114\u20131122)."},{"issue":"1","key":"3382_CR24","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1007\/s11192-012-0828-0","volume":"95","author":"L Liang","year":"2013","unstructured":"Liang, L., Rousseau, R., & Zhong, Z. (2013). Non-english journals and papers in physics and chemistry: Bias in citations? Scientometrics, 95(1), 333\u2013350. https:\/\/doi.org\/10.1007\/s11192-012-0828-0.","journal-title":"Scientometrics"},{"issue":"1","key":"3382_CR25","doi-asserted-by":"publisher","first-page":"359","DOI":"10.1007\/s11192-017-2577-6","volume":"114","author":"F Liu","year":"2018","unstructured":"Liu, F., Hu, G., Tang, L., & Liu, W. (2018). The penalty of containing more non-english articles. Scientometrics, 114(1), 359\u2013366. https:\/\/doi.org\/10.1007\/s11192-017-2577-6.","journal-title":"Scientometrics"},{"key":"3382_CR26","doi-asserted-by":"crossref","unstructured":"Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and advanced technology for digital libraries (pp. 473\u2013474). Berlin: Springer.","DOI":"10.1007\/978-3-642-04346-8_62"},{"key":"3382_CR27","doi-asserted-by":"crossref","unstructured":"Mohammad, S., Dorr, B.J., Egan, M., Awadallah, A.H., Muthukrishnan, P., Qazvinian, V., Radev, D.R., Zajic, D.M. (2009). Using citations to generate surveys of scientific paradigms. In Proceedings of the 2009 annual conference of the North American chapter of the association for computational linguistics, NAACL-HLT\u201909, (pp. 584\u2013592).","DOI":"10.3115\/1620754.1620839"},{"key":"3382_CR28","doi-asserted-by":"crossref","unstructured":"Mohapatra, D., Maiti, A., Bhatia, S., & Chakraborty, T. (2019). Go wide, go deep: Quantifying the impact of scientific papers through influence dispersion trees. In Proceedings of the 19th ACM\/IEEE joint conference on digital libraries, JCDL\u201919 (pp. 305\u2013314).","DOI":"10.1109\/JCDL.2019.00051"},{"issue":"1","key":"3382_CR29","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1177\/030631277500500106","volume":"5","author":"MJ Moravcsik","year":"1975","unstructured":"Moravcsik, M. J., & Murugesan, P. (1975). Some results on the function and quality of citations. Social Studies of Science, 5(1), 86\u201392.","journal-title":"Social Studies of Science"},{"issue":"3","key":"3382_CR30","doi-asserted-by":"publisher","first-page":"1931","DOI":"10.1007\/s11192-018-2921-5","volume":"117","author":"Z Nasar","year":"2018","unstructured":"Nasar, Z., Jaffry, S. W., & Malik, M. K. (2018). Information extraction from scientific articles: A survey. Scientometrics, 117(3), 1931\u20131990. https:\/\/doi.org\/10.1007\/s11192-018-2921-5.","journal-title":"Scientometrics"},{"key":"3382_CR31","doi-asserted-by":"publisher","first-page":"323","DOI":"10.1007\/s00799-018-0242-1","volume":"19","author":"A Prasad","year":"2018","unstructured":"Prasad, A., Kaur, M., & Kan, M. Y. (2018). Neural ParsCit: A deep learning based reference string parser. International Journal on Digital Libraries, 19, 323\u2013337.","journal-title":"International Journal on Digital Libraries"},{"issue":"4","key":"3382_CR32","doi-asserted-by":"publisher","first-page":"919","DOI":"10.1007\/s10579-012-9211-2","volume":"47","author":"DR Radev","year":"2013","unstructured":"Radev, D. R., Muthukrishnan, P., Qazvinian, V., & Abu-Jbara, A. (2013). The ACL anthology network corpus. Language Resources and Evaluation, 47(4), 919\u2013944.","journal-title":"Language Resources and Evaluation"},{"issue":"1","key":"3382_CR33","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1016\/j.joi.2017.11.006","volume":"12","author":"Y Reingewertz","year":"2018","unstructured":"Reingewertz, Y., & Lutmar, C. (2018). Academic in-group bias: An empirical examination of the link between author and journal affiliation. Journal of Informetrics, 12(1), 74\u201386. https:\/\/doi.org\/10.1016\/j.joi.2017.11.006.","journal-title":"Journal of Informetrics"},{"key":"3382_CR34","unstructured":"Roy, D., Ray, K., & Mitra, M. (2016). From a scholarly big dataset to a test collection for bibliographic citation recommendation. AAAI Workshops. https:\/\/www.aaai.org\/ocs\/index.php\/WS\/AAAIW16\/paper\/view\/12635."},{"key":"3382_CR35","unstructured":"Saier, T., & F\u00e4rber, M. (2019). Bibliometric-enhanced arXiv: A data set for paper-based and citation-based tasks. In Proceedings of the 8th international workshop on bibliometric-enhanced information retrieval (BIR 2019) co-located with the 41st European conference on information retrieval (ECIR 2019), Cologne, Germany, April 14, 2019, (pp. 14\u201326)."},{"key":"3382_CR36","doi-asserted-by":"crossref","unstructured":"Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.P., & Wang, K. (2015). An overview of microsoft academic service (MAS) and applications. In Proceedings of the 24th international conference on world wide web, WWW\u201915, (pp. 243\u2013246).","DOI":"10.1145\/2740908.2742839"},{"issue":"2","key":"3382_CR37","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1007\/s00799-014-0122-2","volume":"16","author":"K Sugiyama","year":"2015","unstructured":"Sugiyama, K., & Kan, M. (2015). A comprehensive evaluation of scholarly paper recommendation using potential citation papers. International Journal on Digital Libraries, 16(2), 91\u2013109.","journal-title":"International Journal on Digital Libraries"},{"key":"3382_CR38","volume-title":"Genre analysis: English in academic and research settings","author":"J Swales","year":"1990","unstructured":"Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge: Cambridge University Press."},{"key":"3382_CR39","doi-asserted-by":"publisher","unstructured":"Tang, X., Wan, X., & Zhang, X. (2014). Cross-language context-aware citation recommendation in scientific articles. In Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, ACM, SIGIR \u201914 (pp. 817\u2013826). https:\/\/doi.org\/10.1145\/2600428.2609564.","DOI":"10.1145\/2600428.2609564"},{"key":"3382_CR40","doi-asserted-by":"crossref","unstructured":"Teufel, S., Siddharthan, A., & Tidhar, D. (2006a) An annotation scheme for citation function. In Proceedings of the 7th SIGdial workshop on discourse and dialogue, association for computational linguistics, SigDIAL \u201906 (pp. 80\u201387).","DOI":"10.3115\/1654595.1654612"},{"key":"3382_CR41","doi-asserted-by":"crossref","unstructured":"Teufel, S., Siddharthan, A., & Tidhar, D. (2006b) Automatic classification of citation function. In Proceedings of the 2006 conference on empirical methods in natural language processing, EMNLP\u201906, (pp. 103\u2013110).","DOI":"10.3115\/1610075.1610091"},{"issue":"4","key":"3382_CR42","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1007\/s10032-015-0249-8","volume":"18","author":"D Tkaczyk","year":"2015","unstructured":"Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P. J., & Bolikowski, L. (2015). CERMINE: Automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 18(4), 317\u2013335.","journal-title":"International Journal on Document Analysis and Recognition (IJDAR)"},{"key":"3382_CR43","doi-asserted-by":"publisher","unstructured":"Tkaczyk, D., Collins, A., Sheridan, P., & Beel, J. (2018). Machine learning vs. rules and out-of-the-box vs. retrained: An evaluation of open-source bibliographic reference and citation parsers. In Proceedings of the 18th ACM\/IEEE on joint conference on digital libraries, ACM, New York, NY, USA, JCDL \u201918 (pp. 99\u2013108). https:\/\/doi.org\/10.1145\/3197026.3197048.","DOI":"10.1145\/3197026.3197048"},{"key":"3382_CR44","unstructured":"Valenzuela, M., Ha, V., & Etzioni, O. (2015). Identifying meaningful citations. AAAI Workshops. https:\/\/www.aaai.org\/ocs\/index.php\/WS\/AAAIW15\/paper\/view\/10185."},{"key":"3382_CR45","unstructured":"Whidby, M., Zajic, D., & Dorr, B. (2011). Citation handling for improved summarization of scientific documents. Tech. rep."}],"container-title":["Scientometrics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11192-020-03382-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11192-020-03382-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11192-020-03382-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,10,17]],"date-time":"2022-10-17T03:21:56Z","timestamp":1665976916000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11192-020-03382-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,2]]},"references-count":45,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["3382"],"URL":"https:\/\/doi.org\/10.1007\/s11192-020-03382-z","relation":{},"ISSN":["0138-9130","1588-2861"],"issn-type":[{"value":"0138-9130","type":"print"},{"value":"1588-2861","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,2]]},"assertion":[{"value":"30 September 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 March 2020","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}