{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T16:01:01Z","timestamp":1776182461893,"version":"3.50.1"},"reference-count":75,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2021,7,1]],"date-time":"2021-07-01T00:00:00Z","timestamp":1625097600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nd\/4.0\/"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Big Data & Society"],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:p> In response to growing concerns of bias, discrimination, and unfairness perpetuated by algorithmic systems, the datasets used to train and evaluate machine learning models have come under increased scrutiny. Many of these examinations have focused on the contents of machine learning datasets, finding glaring underrepresentation of minoritized groups. In contrast, relatively little work has been done to examine the norms, values, and assumptions embedded in these datasets. In this work, we conceptualize machine learning datasets as a type of informational infrastructure, and motivate a genealogy as method in examining the histories and modes of constitution at play in their creation. We present a critical history of ImageNet as an exemplar, utilizing critical discourse analysis of major texts around ImageNet\u2019s creation and impact. We find that assumptions around ImageNet and other large computer vision datasets more generally rely on three themes: the aggregation and accumulation of more data, the computational construction of meaning, and making certain types of data labor invisible. 
By tracing the discourses that surround this influential benchmark, we contribute to the ongoing development of the standards and norms around data development in machine learning and artificial intelligence research. <\/jats:p>","DOI":"10.1177\/20539517211035955","type":"journal-article","created":{"date-parts":[[2021,9,24]],"date-time":"2021-09-24T14:19:29Z","timestamp":1632493169000},"update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":155,"title":["On the genealogy of machine learning datasets: A critical history of ImageNet"],"prefix":"10.1177","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4915-0512","authenticated-orcid":false,"given":"Emily","family":"Denton","sequence":"first","affiliation":[{"name":"Google Research, NY, USA"}]},{"given":"Alex","family":"Hanna","sequence":"additional","affiliation":[{"name":"Google Research, NY, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3497-0641","authenticated-orcid":false,"given":"Razvan","family":"Amironesei","sequence":"additional","affiliation":[{"name":"Center for Applied Data Ethics, University of San Francisco, CA, USA"}]},{"given":"Andrew","family":"Smart","sequence":"additional","affiliation":[{"name":"Google Research, NY, USA"}]},{"given":"Hilary","family":"Nicole","sequence":"additional","affiliation":[{"name":"Google Research, NY, USA"}]}],"member":"179","published-online":{"date-parts":[[2021,9,24]]},"reference":[{"key":"bibr1-20539517211035955","unstructured":"Anderson C (2008) The end of theory: The data deluge makes the scientific method obsolete. https:\/\/www.wired.com\/2008\/06\/pb-theory\/"},{"key":"bibr2-20539517211035955","unstructured":"Bambach S, Crandall D, Smith L, et al. 
(2018) Toddler-inspired visual object learning."},{"key":"bibr3-20539517211035955","volume-title":"Race After Technology: Abolitionist Tools for the New Jim Code","author":"Benjamin R","year":"2019"},{"key":"bibr4-20539517211035955","unstructured":"Bolukbasi T, Chang KW, Zou J, et al. (2016) Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Proceedings of the 30th international conference on neural information processing systems, Barcelona, Spain."},{"key":"bibr5-20539517211035955","doi-asserted-by":"crossref","unstructured":"Bowker G, Baker K, Millerand F, et al. (2010) Toward information infrastructure studies: Ways of knowing in a networked environment. In: Hunsinger J, Klastrup L and Allen M (eds) International Handbook of Internet Research. Springer, pp. 97\u2013117.","DOI":"10.1007\/978-1-4020-9789-8_5"},{"key":"bibr6-20539517211035955","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/6352.001.0001"},{"key":"bibr7-20539517211035955","unstructured":"Chang AX, Funkhouser T, Guibas L, et al. (2015) ShapeNet: an information-rich 3D model repository. Technical Report, arXiv:1512.03012 [cs.GR]."},{"key":"bibr8-20539517211035955","unstructured":"Chutel L (2018) China is exporting facial recognition software to africa, expanding its vast database. https:\/\/qz.com\/africa\/1287675\/china-is-exporting-facial-recognition-to-africa-ensuring-ai-dominance-through-diversity\/"},{"key":"bibr9-20539517211035955","unstructured":"Crawford K, Paglen T (2019) Excavating AI: The politics of images in machine learning training sets. Excavating AI https:\/\/excavating.ai"},{"key":"bibr10-20539517211035955","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, et al. (2009) ImageNet: A large-scale hierarchical image database. In: CVPR09, Miami, FL.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"bibr11-20539517211035955","unstructured":"Denton E, Hanna A, Amironesei R, et al. 
(2020) Bringing the people back in: Contesting benchmark machine learning datasets. In: ICML workshop on participatory approaches to machine learning."},{"key":"bibr12-20539517211035955","unstructured":"DeVries T, Misra I, Wang C, et al. (2019) Does object recognition work for everyone? In: IEEE conference on computer vision and pattern recognition workshops, Long Beach, CA."},{"key":"bibr13-20539517211035955","doi-asserted-by":"crossref","unstructured":"Dotan R, Milli S (2020) Value-laden disciplinary shifts in machine learning. In: Proceedings of the 2020 conference on fairness, accountability, and transparency, Barcelona, Spain.","DOI":"10.1145\/3351095.3373157"},{"key":"bibr14-20539517211035955","volume-title":"Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor","author":"Eubanks V","year":"2018"},{"key":"bibr15-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-009-0275-4"},{"key":"bibr16-20539517211035955","doi-asserted-by":"publisher","DOI":"10.4324\/9781315834368"},{"key":"bibr17-20539517211035955","unstructured":"Fei-Fei L (2010) Imagenet: Crowdsourcing, benchmarking, & other cool things. http:\/\/www.image-net.org\/papers\/ImageNet_2010.pdf"},{"key":"bibr18-20539517211035955","unstructured":"Fei-Fei L (2012) Computers that see. https:\/\/www.youtube.com\/watch?v=viwpTTvSQKM"},{"key":"bibr19-20539517211035955","unstructured":"Fei-Fei L (2017) Imagenet: Where have we gone? Where are we going? https:\/\/learning.acm.org\/techtalks\/ImageNet"},{"key":"bibr20-20539517211035955","unstructured":"Fei-Fei L (2019) Where did imagenet come from? https:\/\/www.youtube.com\/watch?v=Z7naK1uq1F8"},{"key":"bibr21-20539517211035955","unstructured":"Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. 
In: Conference on computer vision and pattern recognition workshop, Washington, DC."},{"key":"bibr22-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1167\/7.1.10"},{"key":"bibr23-20539517211035955","volume-title":"The Archaeology of Knowledge","author":"Foucault M","year":"1972"},{"key":"bibr24-20539517211035955","volume-title":"Discipline and Punish: The Birth of the Prison","author":"Foucault M","year":"1977"},{"key":"bibr25-20539517211035955","unstructured":"Gebru T, Morgenstern J, Vecchione B, et al. (2018) Datasheets for datasets. https:\/\/arxiv.org\/abs\/1803.09010"},{"key":"bibr26-20539517211035955","doi-asserted-by":"crossref","unstructured":"Geiger RS, Yu K, Yang Y, et al. (2020) Garbage in, garbage out? Do machine learning application papers in social computing report where human-labeled training data comes from? In: Proceedings of the 2020 conference on fairness, accountability, and transparency, Barcelona, Spain.","DOI":"10.1145\/3351095.3372862"},{"key":"bibr27-20539517211035955","unstructured":"Gershgorn D (2017) The data that transformed ai research\u2014and possibly the world. https:\/\/qz.com\/1034972\/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world\/"},{"key":"bibr28-20539517211035955","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/9302.001.0001"},{"key":"bibr29-20539517211035955","volume-title":"Deep Learning","author":"Goodfellow I","year":"2016"},{"key":"bibr30-20539517211035955","volume-title":"Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass","author":"Gray ML","year":"2019"},{"key":"bibr31-20539517211035955","doi-asserted-by":"publisher","DOI":"10.2307\/3178066"},{"key":"bibr32-20539517211035955","unstructured":"Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. 
In: Proceedings of the 30th international conference on neural information processing systems, Barcelona, Spain."},{"key":"bibr33-20539517211035955","volume-title":"NLP\u2019s Clever Hans Moment has Arrived","author":"Heinzerling B","year":"2019"},{"key":"bibr34-20539517211035955","doi-asserted-by":"crossref","unstructured":"Hutchinson B, Prabhakaran V, Denton E, et al. (2020) Social biases in NLP models as barriers for persons with disabilities. In: Proceedings of ACL 2020.","DOI":"10.18653\/v1\/2020.acl-main.487"},{"key":"bibr35-20539517211035955","doi-asserted-by":"crossref","unstructured":"Hutchinson B, Smart A, Hanna A, et al. (2021) Towards accountability for machine learning datasets. In: Proceedings of the 2021 conference on fairness, accountability, and transparency.","DOI":"10.1145\/3442188.3445918"},{"key":"bibr36-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1177\/1461444813511926"},{"key":"bibr37-20539517211035955","doi-asserted-by":"crossref","unstructured":"Irani LC, Silberman MS (2013) Turkopticon: Interrupting worker invisibility in amazon mechanical turk. In: Proceedings of the SIGCHI conference on human factors in computing systems, Paris, France, pp. 611\u2013620.","DOI":"10.1145\/2470654.2470742"},{"key":"bibr38-20539517211035955","doi-asserted-by":"crossref","unstructured":"Jo ES, Gebru T (2020) Lessons from archives: Strategies for collecting sociocultural data in machine learning. In: Proceedings of the 2020 conference on fairness, accountability, and transparency, FAT* \u201920, Barcelona, Spain.","DOI":"10.1145\/3351095.3372829"},{"key":"bibr39-20539517211035955","unstructured":"Johnson K (2020) AI weekly: A deep learning pioneer\u2019s teachable moment on AI bias. 
https:\/\/venturebeat.com\/2020\/06\/26\/ai-weekly-a-deep-learning-pioneers-teachable-moment-on-ai-bias\/"},{"key":"bibr40-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1177\/2053951714528481"},{"key":"bibr41-20539517211035955","doi-asserted-by":"publisher","DOI":"10.7208\/chicago\/9780226626611.001.0001"},{"key":"bibr42-20539517211035955","doi-asserted-by":"crossref","unstructured":"Krause J, Sapp B, Howard A, et al. (2016) The unreasonable effectiveness of noisy data for fine-grained recognition. In: ECCV, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46487-9_19"},{"key":"bibr43-20539517211035955","unstructured":"Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, Lake Tahoe, Nevada."},{"key":"bibr44-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1146\/annurev-anthro-092412-155522"},{"key":"bibr45-20539517211035955","volume-title":"Science in Action: How to Follow Scientists and Engineers Through Society","author":"Latour B","year":"1987"},{"key":"bibr46-20539517211035955","unstructured":"Malev\u00e9 N (2019) An introduction to image datasets. Unthinking photography. https:\/\/unthinking.photography\/articles\/an-introduction-to-image-datasets"},{"key":"bibr47-20539517211035955","unstructured":"Merler M, Ratha N, Feris R, et al. (2019) Diversity in faces. ArXiv abs\/1901.10436."},{"key":"bibr48-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1145\/3415186"},{"key":"bibr49-20539517211035955","doi-asserted-by":"crossref","unstructured":"Mitchell M, Wu S, Zaldivar A, et al. (2019) Model cards for model reporting. 
In: Proceedings of the 2019 conference on fairness, accountability, and transparency, Atlanta, GA.","DOI":"10.1145\/3287560.3287596"},{"key":"bibr50-20539517211035955","doi-asserted-by":"publisher","DOI":"10.18574\/nyu\/9781479833641.001.0001"},{"key":"bibr51-20539517211035955","unstructured":"\u1eccn\u1ee5\u1ecdha M (2016) The point of collection. https:\/\/points.datasociety.net\/the-point-of-collection-8ee44ad7c2fa"},{"key":"bibr52-20539517211035955","unstructured":"\u1eccn\u1ee5\u1ecdha M (2019) The future is here! http:\/\/mimionuoha.com\/the-future-is-here"},{"key":"bibr53-20539517211035955","doi-asserted-by":"crossref","unstructured":"Paullada A, Raji ID, Bender EM, et al. (2020) Data and its (dis)contents: A survey of dataset development and use in machine learning research. In: ML-retrospectives, surveys & meta-analyses @ NeurIPS.","DOI":"10.1016\/j.patter.2021.100336"},{"issue":"4","key":"bibr54-20539517211035955","first-page":"6","volume":"38","author":"Plasek A","year":"2016","journal-title":"IEEE Annals of the History of Computing"},{"key":"bibr55-20539517211035955","doi-asserted-by":"crossref","unstructured":"Prabhu VU, Birhane A (2020) Large image datasets: A pyrrhic win for computer vision? ArXiv abs\/2006.16923.","DOI":"10.1109\/WACV48630.2021.00158"},{"key":"bibr56-20539517211035955","volume":"8","author":"Puschmann C","year":"2014","journal-title":"International Journal of Communication"},{"key":"bibr57-20539517211035955","unstructured":"Ruder S (2018) NLP\u2019s imagenet moment has arrived. https:\/\/thegradient.pub\/nlp-imagenet\/"},{"key":"bibr58-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"bibr59-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1177\/2053951718820549"},{"key":"bibr60-20539517211035955","doi-asserted-by":"crossref","unstructured":"Sambasivan N, Kapania S, Highfill H, et al. 
(2021) \u201cEveryone wants to do the model work, not the data work\u201d: Data cascades in high-stakes ai. In: Proceedings of the ACM SIGCHI conference on human factors in computing systems.","DOI":"10.1145\/3411764.3445518"},{"key":"bibr61-20539517211035955","doi-asserted-by":"crossref","unstructured":"Scheuerman MK, Paul JM, Brubaker JR (2019) How computers see gender: An evaluation of gender classification in commercial facial analysis services. In: Proc. ACM hum.comput. interact.","DOI":"10.1145\/3359246"},{"key":"bibr62-20539517211035955","doi-asserted-by":"crossref","unstructured":"Scheuerman MK, Wade K, Lustig C, et al. (2020) How we\u2019ve taught algorithms to see identity: Constructing race and gender in image databases for facial analysis. In: Proc. ACM hum.comput. interact.","DOI":"10.1145\/3392866"},{"key":"bibr63-20539517211035955","doi-asserted-by":"crossref","unstructured":"Selbst AD, Boyd D, Friedler SA, et al. (2019) Fairness and abstraction in sociotechnical systems. In: Proceedings of the conference on fairness, accountability, and transparency, Atlanta, GA.","DOI":"10.1145\/3287560.3287598"},{"key":"bibr64-20539517211035955","unstructured":"Shankar S, Halpern Y, Breck E, et al. (2017) No classification without representation: Assessing geodiversity issues in open data sets for the developing world. In: NIPS 2017 workshop: machine learning for the developing world, Long Beach, CA."},{"key":"bibr65-20539517211035955","unstructured":"Solon O (2019) Facial recognition\u2019s \u2018dirty little secret\u2019: Millions of online photos scraped without consent. NBC News. 
https:\/\/www.nbcnews.com\/tech\/internet\/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921"},{"key":"bibr66-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1177\/00027649921955326"},{"key":"bibr67-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1023\/A:1008651105359"},{"key":"bibr68-20539517211035955","doi-asserted-by":"publisher","DOI":"10.22148\/16.036"},{"key":"bibr69-20539517211035955","doi-asserted-by":"crossref","unstructured":"Sun C, Shrivastava A, Singh S, et al. (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: 2017 IEEE international conference on computer vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.97"},{"key":"bibr70-20539517211035955","unstructured":"Thickstun J, Harchaoui Z, Kakade SM (2017) Learning features of music from scratch. ArXiv abs\/1611.09827."},{"key":"bibr71-20539517211035955","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2008.128"},{"key":"bibr72-20539517211035955","doi-asserted-by":"publisher","DOI":"10.24908\/ss.v12i2.4776"},{"key":"bibr73-20539517211035955","unstructured":"Trewin S (2018) AI fairness for people with disabilities: Point of view. ArXiv abs\/1811.10670."},{"key":"bibr74-20539517211035955","doi-asserted-by":"publisher","DOI":"10.24908\/ss.v12i2.4776"},{"key":"bibr75-20539517211035955","doi-asserted-by":"crossref","unstructured":"Yang K, Qinami K, Fei-Fei L, et al. (2020) Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. 
In: Proceedings of the 2020 conference on fairness, accountability, and transparency, Barcelona, Spain.","DOI":"10.1145\/3351095.3375709"}],"container-title":["Big Data & Society"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/20539517211035955","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/20539517211035955","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/20539517211035955","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T09:45:08Z","timestamp":1740822308000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/20539517211035955"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7]]},"references-count":75,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["10.1177\/20539517211035955"],"URL":"https:\/\/doi.org\/10.1177\/20539517211035955","relation":{},"ISSN":["2053-9517","2053-9517"],"issn-type":[{"value":"2053-9517","type":"print"},{"value":"2053-9517","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7]]},"article-number":"20539517211035955"}}