{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T02:07:01Z","timestamp":1776305221188,"version":"3.50.1"},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"OOPSLA","license":[{"start":{"date-parts":[[2017,10,12]],"date-time":"2017-10-12T00:00:00Z","timestamp":1507766400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2017,10,12]]},"abstract":"<jats:p>Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication, only 6% of the files are distinct. Java, on the other hand, has the least duplication, 60% of files are distinct. Lastly, a project-level analysis shows that between 9% and 31% of the projects contain at least 80% of files that can be found elsewhere. These rates of duplication have implications for systems built on open source software as well as for researchers interested in analyzing large code bases. 
As a concrete artifact of this study, we have created D\u00e9j\u00e0Vu, a publicly available map of code duplicates in GitHub repositories.<\/jats:p>","DOI":"10.1145\/3133908","type":"journal-article","created":{"date-parts":[[2017,10,13]],"date-time":"2017-10-13T15:15:45Z","timestamp":1507907745000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":143,"title":["D\u00e9j\u00e0Vu: a map of code duplicates on GitHub"],"prefix":"10.1145","volume":"1","author":[{"given":"Cristina V.","family":"Lopes","sequence":"first","affiliation":[{"name":"University of California at Irvine, USA"}]},{"given":"Petr","family":"Maj","sequence":"additional","affiliation":[{"name":"Czech Technical University, Czechia"}]},{"given":"Pedro","family":"Martins","sequence":"additional","affiliation":[{"name":"University of California at Irvine, USA"}]},{"given":"Vaibhav","family":"Saini","sequence":"additional","affiliation":[{"name":"University of California at Irvine, USA"}]},{"given":"Di","family":"Yang","sequence":"additional","affiliation":[{"name":"University of California at Irvine, USA"}]},{"given":"Jakub","family":"Zitny","sequence":"additional","affiliation":[{"name":"Czech Technical University, Czechia"}]},{"given":"Hitesh","family":"Sajnani","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}]},{"given":"Jan","family":"Vitek","sequence":"additional","affiliation":[{"name":"Northeastern University, USA"}]}],"member":"320","published-online":{"date-parts":[[2017,10,12]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICECCS.2013.42"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1167473.1167488"},{"key":"e_1_2_2_3_1","doi-asserted-by":"crossref","unstructured":"Hudson Borges Andr\u00e9 C. Hora and Marco Tulio Valente. 2016. Understanding the Factors that Impact the Popularity of GitHub Repositories. (2016). 
http:\/\/arxiv.org\/abs\/1606.04984","DOI":"10.1109\/ICSME.2016.31"},{"key":"e_1_2_2_4_1","volume-title":"Assert Use in GitHub Projects. In International Conference on Software Engineering (ICSE). http:\/\/dl.acm.org\/citation.cfm?id=2818754","author":"Casalnuovo Casey","year":"2015","unstructured":"Casey Casalnuovo, Prem Devanbu, Abilio Oliveira, Vladimir Filkov, and Baishakhi Ray. 2015. Assert Use in GitHub Projects. In International Conference on Software Engineering (ICSE). http:\/\/dl.acm.org\/citation.cfm?id=2818754.2818846"},{"key":"e_1_2_2_5_1","volume-title":"Practical Language-independent Detection of Near-miss Clones. In Conference of the Centre for Advanced Studies on Collaborative Research (CASCON). http:\/\/dl.acm.org\/citation. cfm?id=1034914","author":"Cordy James R.","year":"2004","unstructured":"James R. Cordy, Thomas R. Dean, and Nikita Synytskyy. 2004. Practical Language-independent Detection of Near-miss Clones. In Conference of the Centre for Advanced Studies on Collaborative Research (CASCON). http:\/\/dl.acm.org\/citation. cfm?id=1034914.1034915"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901739.2901776"},{"key":"e_1_2_2_7_1","unstructured":"John W. Creswell. 2014. Research Design: Qualitative Quantitative and Mixed Methods Approaches. 
SAGE."},{"key":"e_1_2_2_8_1","volume-title":"Boa: A Language and Infrastructure for Analyzing Ultra-large-scale Software Repositories. In International Conference on Software Engineering (ICSE). http: \/\/dl.acm.org\/citation.cfm?id=2486788","author":"Dyer Robert","unstructured":"Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen. 2013. Boa: A Language and Infrastructure for Analyzing Ultra-large-scale Software Repositories. In International Conference on Software Engineering (ICSE). http: \/\/dl.acm.org\/citation.cfm?id=2486788.2486844"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1833272.1833278"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2013.6624034"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-21347-2_16"},{"key":"e_1_2_2_12_1","volume-title":"1 billion files, 14 terabytes of code: Spaces or Tabs?","author":"Hoffa Felipe","year":"2016","unstructured":"Felipe Hoffa. 2016. 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs? (2016). 
https: \/\/medium.com\/@hofa\/400-000-github-repositories-1-billion-iles-14-terabytes-of-code-spaces-or-tabs-7cfe0b5dd7fd"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2597073.2597074"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2002.1019480"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CSMR.2013.48"},{"key":"e_1_2_2_16_1","series-title":"Dagstuhl Seminar Proceedings 06301","volume-title":"Duplication, Redundancy, and Similarity in Software","author":"Koschke R.","unstructured":"R. Koschke. 2007. Survey of research on software clones. In Duplication, Redundancy, and Similarity in Software (Dagstuhl Seminar Proceedings 06301)."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/FLOSS.2007.10"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2009.5069476"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491411.2491415"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2009.5069501"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSM.2011.6080795"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2635868.2635922"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2048066.2048119"},{"key":"e_1_2_2_24_1","volume-title":"Technical Report 541. Queens University.","author":"Roy C. K.","year":"2007","unstructured":"C. K. Roy and J. R. Cordy. 2007. A survey on software clone detection research. Technical Report 541. 
Queens University."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSTW.2009.18"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1002\/smr.v22:3"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2884781.2884877"},{"key":"e_1_2_2_29_1","unstructured":"Johnny Salda\u00f1a. 2009. The Coding Manual for Qualitative Researchers. SAGE."},{"key":"e_1_2_2_30_1","unstructured":"SPEC. 1998. SPECjvm98 benchmarks. (1998)."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSM.2015.7332459"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-016-9438-4"}],"container-title":["Proceedings of the ACM on Programming Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3133908","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3133908","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:13:25Z","timestamp":1750212805000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3133908"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,10,12]]},"references-count":31,"journal-issue":{"issue":"OOPSLA","published-print":{"date-parts":[[2017,10,12]]}},"alternative-id":["10.1145\/3133908"],"URL":"https:\/\/doi.org\/10.1145\/3133908","relation":{},"ISSN":["2475-1421"],"issn-type":[{"value":"2475-1421","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,10,12]]},"assertion":[{"value":"2017-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}