{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T02:04:53Z","timestamp":1774922693537,"version":"3.50.1"},"reference-count":37,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2018,12,19]],"date-time":"2018-12-19T00:00:00Z","timestamp":1545177600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Data science is a new academic field that has received much attention in recent years. One reason for this is that our increasingly digitalized society generates more and more data in all areas of our lives and science and we are desperately seeking for solutions to deal with this problem. In this paper, we investigate the academic roots of data science. We are using data of scientists and their citations from Google Scholar, who have an interest in data science, to perform a quantitative analysis of the data science community. Furthermore, for decomposing the data science community into its major defining factors corresponding to the most important research fields, we introduce a statistical regression model that is fully automatic and robust with respect to a subsampling of the data. This statistical model allows us to define the \u2018importance\u2019 of a field as its predictive abilities. 
Overall, our method provides an objective answer to the question \u2018What is data science?\u2019.<\/jats:p>","DOI":"10.3390\/make1010015","type":"journal-article","created":{"date-parts":[[2018,12,19]],"date-time":"2018-12-19T12:12:44Z","timestamp":1545221564000},"page":"235-251","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":21,"title":["Defining Data Science by a Data-Driven Quantification of the Community"],"prefix":"10.3390","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0745-5641","authenticated-orcid":false,"given":"Frank","family":"Emmert-Streib","sequence":"first","affiliation":[{"name":"Predictive Medicine and Data Analytics Lab, Department of Signal Processing, Tampere University of Technology, FI-33101 Tampere, Finland"},{"name":"Institute of Biosciences and Medical Technology, FI-33101 Tampere, Finland"}]},{"given":"Matthias","family":"Dehmer","sequence":"additional","affiliation":[{"name":"Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, A-4400 Steyr, Austria"},{"name":"Department of Mechatronics and Biomedical Computer Science, UMIT, A-6060 Hall in Tyrol, Austria"},{"name":"College of Computer and Control Engineering, Nankai University, Tianjin 300071, China"}]}],"member":"1968","published-online":{"date-parts":[[2018,12,19]]},"reference":[{"key":"ref_1","unstructured":"Marshall, A. (1890). Principles of Economics, Macmillan."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1145\/368424.368427","article-title":"The Role of the University in Computers, Data Processing, and Related Fields","volume":"2","author":"Fein","year":"1959","journal-title":"Commun. 
ACM"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"319","DOI":"10.1016\/0010-4825(78)90032-X","article-title":"Interactive instruction on population interactions","volume":"8","author":"Hogeweg","year":"1978","journal-title":"Comput. Biol. Med."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Dehmer, M., and Emmert-Streib, F. (2017). Frontiers in Data Science, CRC Press.","DOI":"10.1201\/9781315156408"},{"key":"ref_5","unstructured":"Loukides, M. (2011). What Is Data Science?, O\u2019Reilly Media."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1089\/big.2013.1508","article-title":"Data science and its relationship to big data and data-driven decision making","volume":"1","author":"Provost","year":"2013","journal-title":"Big Data"},{"key":"ref_7","unstructured":"Naur, P. (1974). Concise Survey of Computer Methods, Studentlitteratur."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1111\/j.1751-5823.2001.tb00477.x","article-title":"Data science: An action plan for expanding the technical areas of the field of statistics","volume":"69","author":"Cleveland","year":"2001","journal-title":"Int. Stat. Rev."},{"key":"ref_9","first-page":"70","article-title":"Data scientist: The sexiest job of the 21st century","volume":"90","author":"Patil","year":"2012","journal-title":"Harv. Bus. Rev."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Hayashi, C. (1998). What is data science? Fundamental concepts and a heuristic example. Data Science, Classification, and Related Methods, Springer.","DOI":"10.1007\/978-4-431-65950-1_3"},{"key":"ref_11","first-page":"12","article-title":"The process of analyzing data is the emergent feature of data science","volume":"7","author":"Moutari","year":"2016","journal-title":"Front. 
Genet."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"163","DOI":"10.2481\/dsj.5.163","article-title":"Data science as an academic discipline","volume":"5","author":"Smith","year":"2006","journal-title":"Data Sci. J."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Zhong, N., and Xiong, Y. (2009). Data explosion, data nature and dataology. Proceedings of the International Conference on Brain Informatics, Beijing, China, 22\u201324 October 2009, Springer.","DOI":"10.1007\/978-3-642-04954-5_25"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"8","DOI":"10.5334\/dsj-2015-008","article-title":"Towards data science","volume":"14","author":"Zhu","year":"2015","journal-title":"Data Sci. J."},{"key":"ref_15","unstructured":"Zhu, Y., and Xiong, Y. (arXiv, 2015). Defining data science, arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"931","DOI":"10.1007\/s11192-015-1614-6","article-title":"Methods for estimating the size of Google Scholar","volume":"104","year":"2015","journal-title":"Scientometrics"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Khabsa, M., and Giles, C.L. (2014). The number of scholarly documents on the public web. PLoS ONE, 9.","DOI":"10.1371\/journal.pone.0093949"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1214\/aos\/1176344136","article-title":"Estimating the dimension of a model","volume":"6","author":"Schwarz","year":"1978","journal-title":"Ann. Stat."},{"key":"ref_19","unstructured":"Lindeman, R., Merenda, P., and Gold, R. (1980). 
Introduction to Bivariate and Multivariate Analysis Scott, Scott Foresman."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1023\/A:1017919924342","article-title":"The literature of bibliometrics, scientometrics, and informetrics","volume":"52","author":"Hood","year":"2001","journal-title":"Scientometrics"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"719","DOI":"10.1007\/s11192-008-2197-2","article-title":"Is science becoming more interdisciplinary? Measuring and mapping six research fields over time","volume":"81","author":"Porter","year":"2009","journal-title":"Scientometrics"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Emmert-Streib, F., and Glazko, G. (2011). Pathway analysis of expression data: Deciphering functional building blocks of complex diseases. PLoS Comput. Biol., 7.","DOI":"10.1371\/journal.pcbi.1002053"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1093\/bioinformatics\/btl633","article-title":"Enrichment or depletion of a GO category within a class of genes: which test?","volume":"23","author":"Rivals","year":"2006","journal-title":"Bioinformatics"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1198\/tast.2009.08199","article-title":"Variable importance assessment in regression: Linear regression versus random forest","volume":"63","year":"2009","journal-title":"Am. Stat."},{"key":"ref_25","first-page":"1","article-title":"Relative importance for linear regression in R: The package relaimpo","volume":"17","year":"2006","journal-title":"J. Stat. Softw."},{"key":"ref_26","unstructured":"R Development Core Team (2008). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"de Matos Simoes, R., and Emmert-Streib, F. (2012). Bagging statistical network inference from large-scale gene expression data. 
PLoS ONE, 7.","DOI":"10.1371\/journal.pone.0033624"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Altay, G., and Emmert-Streib, F. (2010). Inferring the conservative causal core of gene regulatory networks. BMC Syst. Biol., 4.","DOI":"10.1186\/1752-0509-4-132"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"de Matos Simoes, R., Dehmer, M., and Emmert-Streib, F. (2013). Interfacing cellular networks of S. cerevisiae and E. coli: Connecting dynamic and genetic information. BMC Genom., 14.","DOI":"10.1186\/1471-2164-14-324"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Emmert-Streib, F., de Matos Simoes, R., Glazko, G., McDade, S., Haibe-Kains, B., Holzinger, A., Dehmer, M., and Campbell, F. (2014). Functional and genetic analysis of the colon cancer network. BMC Bioinformat., 15.","DOI":"10.1186\/1471-2105-15-S6-S6"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"8198","DOI":"10.1038\/s41598-018-26575-2","article-title":"Multilayer Aggregation of Investor Trading Networks","volume":"1","author":"Baltakys","year":"2018","journal-title":"Sci. Rep."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"293","DOI":"10.7155\/jgaa.00168","article-title":"Using a Significant Spanning Tree to Draw a Directed Graph","volume":"12","author":"Harrigan","year":"2008","journal-title":"J. Graphs Algorithms Appl."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1214\/07-EJS004","article-title":"Forward stagewise regression and the monotone lasso","volume":"1","author":"Hastie","year":"2007","journal-title":"Electron. J. 
Stat."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"49","DOI":"10.2307\/2348411","article-title":"The interpretation of Mallows\u2019s C_p-statistic","volume":"45","author":"Gilmour","year":"1996","journal-title":"Statistician"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1016\/j.eswa.2014.07.056","article-title":"Subset selection by Mallows? Cp: A mixed integer programming approach","volume":"42","author":"Miyashiro","year":"2015","journal-title":"Expert Syst. Appl."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"488","DOI":"10.1038\/464488a","article-title":"Let\u2019s make science metrics more scientific","volume":"464","author":"Lane","year":"2010","journal-title":"Nature"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"678","DOI":"10.1126\/science.1201865","article-title":"Measuring the results of science investments","volume":"331","author":"Lane","year":"2011","journal-title":"Science"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/1\/1\/15\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:34:56Z","timestamp":1760196896000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/1\/1\/15"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,12,19]]},"references-count":37,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2019,3]]}},"alternative-id":["make1010015"],"URL":"https:\/\/doi.org\/10.3390\/make1010015","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,12,19]]}}}