{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,5]],"date-time":"2026-02-05T10:58:57Z","timestamp":1770289137586,"version":"3.49.0"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2018,3,21]],"date-time":"2018-03-21T00:00:00Z","timestamp":1521590400000},"content-version":"vor","delay-in-days":365,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"IBM CAS Faculty Fellowship"},{"DOI":"10.13039\/501100004543","name":"Chinese Scholarship Council","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100004543","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-1629376, CNS-1319617 and CCF-1116104"],"award-info":[{"award-number":["CCF-1629376, CNS-1319617 and CCF-1116104"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2017,3,31]]},"abstract":"<jats:p>Data race detection has become an important problem in GPU programming. Previous designs of CPU race-checking tools are mainly task parallel and incur high overhead on GPUs due to access instrumentation, especially when monitoring many thousands of threads routinely used by GPU programs.<\/jats:p>\n                  <jats:p>This article presents a novel data-parallel solution designed and optimized for the GPU architecture. It includes compiler support and a set of runtime techniques. It uses value-based checking, which detects the races reported in previous work, finds new races, and supports race-free deterministic GPU execution. 
More important, race checking is massively data parallel and does not introduce divergent branching or atomic synchronization. Its slowdown is less than 5 \u00d7 for over half of the tests and 10 \u00d7 on average, which is orders of magnitude more efficient than the cuda-memcheck tool by Nvidia and the methods that use fine-grained access instrumentation.<\/jats:p>","DOI":"10.1145\/3046678","type":"journal-article","created":{"date-parts":[[2017,3,23]],"date-time":"2017-03-23T12:19:44Z","timestamp":1490271584000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["LD"],"prefix":"10.1145","volume":"14","author":[{"given":"Pengcheng","family":"Li","sequence":"first","affiliation":[{"name":"University of Rochester, Rochester, NY"}]},{"given":"Xiaoyu","family":"Hu","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester, NY"}]},{"given":"Dong","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester, NY"}]},{"given":"Jacob","family":"Brock","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester, NY"}]},{"given":"Hao","family":"Luo","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester, NY"}]},{"given":"Eddy Z.","family":"Zhang","sequence":"additional","affiliation":[{"name":"Rutgers University, Piscataway, NJ"}]},{"given":"Chen","family":"Ding","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester,NY"}]}],"member":"320","published-online":{"date-parts":[[2017,3,21]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Allen and Ken Kennedy","author":"John","year":"2001","unstructured":"John R. Allen and Ken Kennedy. 2001. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. 
Morgan Kaufmann Publishers."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751238"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.485843"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2015.159"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-06200-6_18"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736029"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1640089.1640096"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2384616.2384625"},{"key":"e_1_2_2_9_1","volume-title":"Proceedings of the 3rd Workshop on Software Tools for MultiCore Systems.","author":"Boyer Michael","year":"2008","unstructured":"Michael Boyer, Kevin Skadron, and Westley Weimer. 2008. Automated dynamic analysis of CUDA programs. In Proceedings of the 3rd Workshop on Software Tools for MultiCore Systems."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1869459.1869515"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-38088-4_15"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2535838.2535882"},{"key":"e_1_2_2_13_1","volume-title":"Engineering a Compiler","author":"Cooper Keith","unstructured":"Keith Cooper and Linda Torczon. 2010. Engineering a Compiler (2nd ed.). Morgan Kaufmann.","edition":"2"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337182"},{"key":"e_1_2_2_15_1","unstructured":"Chen Ding Brian Gernhart Pengcheng Li and Matthew Hertz. 2014. Safe Parallel Programming in An Interpreted Language. Technical Report URCS #991. 
Department of Computer Science University of Rochester."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250760"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2384616.2384650"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542476.1542490"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/263764.263785"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","unstructured":"Anup Holey Vineeth Mekkat and Antonia Zhai. 2013. HAccRG: Hardware-accelerated data race detection in GPUs. In ICPP. 10.1109\/ICPP.2013.15","DOI":"10.1109\/ICPP.2013.15"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","unstructured":"Qiming Hou Kun Zhou and Baining Guo. 2009. Debugging GPU stream programs through automatic dataflow recording and visualization. In ACM SIGGRAPH Asia 2009 Papers. 10.1145\/1661412.1618499","DOI":"10.1145\/1661412.1618499"},{"key":"e_1_2_2_22_1","volume-title":"Proceedings of the Workshop on Determinism and Correctness in Parallel Programming.","author":"Ji Weixing","unstructured":"Weixing Ji, Li Lu, and Michael L. Scott. 2013. TARDIS: Task-level access race detection by intersecting sets. In Proceedings of the Workshop on Determinism and Correctness in Parallel Programming."},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451118"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2048066.2048087"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2009.18"},{"key":"e_1_2_2_26_1","volume-title":"Proceedings of the https:\/\/tu-dresden.de\/zih\/forschung\/projekte\/scout\/.","author":"Krzikalla Olaf","year":"2011","unstructured":"Olaf Krzikalla. 2011. Scout: A Source-to-Source Translator for SIMD-Optimizations. 
Proceedings of the https:\/\/tu-dresden.de\/zih\/forschung\/projekte\/scout\/."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919639"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254110"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1882291.1882320"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145844"},{"key":"e_1_2_2_31_1","volume-title":"Proceedings of the 5th Workshop on Determinism and Correctness in Parallel Programming.","author":"Li Pengcheng","year":"2014","unstructured":"Pengcheng Li, Chen Ding, Xiaoyu Hu, and Tolga Soyata. 2014. LDetector: A low overhead race detector for GPU programs. In Proceedings of the 5th Workshop on Determinism and Correctness in Parallel Programming."},{"key":"e_1_2_2_32_1","unstructured":"Pengcheng Li Ziang Hu and Handong Ye. 2015. Compiler and Method for Global-Scope Basic-Block Reordering. https:\/\/www.google.com\/patents\/US20150040106 US Patent App. 14\/445 983."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389036"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.20"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2014.24"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/143369.143426"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2594291.2594300"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","unstructured":"Wenjing Ma and Gagan Agrawal. 2010. An integer programming framework for optimizing shared memory use on GPUs. In PACT. 10.1145\/1854273.1854348","DOI":"10.1145\/1854273.1854348"},{"key":"e_1_2_2_39_1","unstructured":"NVIDIA. 2014. Cuda Memcheck Tool. Retrieved from https:\/\/developer.nvidia.com\/CUDA-MEMCHECK."},{"key":"e_1_2_2_40_1","unstructured":"NVIDIA. 2016. CUDA C Programming Guide. 
Retrieved from http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/238721.238760"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.888645"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736030"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254127"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/268998.266641"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","unstructured":"Michael L. Scott. 2013. Shared-Memory Synchronization. Morgan & Claypool Publishers.","DOI":"10.5555\/2534458"},{"key":"e_1_2_2_47_1","first-page":"66","article-title":"OpenCL: A parallel programming standard for heterogeneous computing systems","volume":"12","author":"Stone John E.","year":"2010","unstructured":"John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Design Test 12, 3 (2010), 66--72.","journal-title":"IEEE Design Test"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1806596.1806604"},{"key":"e_1_2_2_49_1","unstructured":"UIUC. 2012. The Parboil Benchmark Suite. 
Retrieved from http:\/\/impact.crhc.illinois.edu\/parboil\/parboil.aspx."},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/79173.79181"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1950365.1950370"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462182"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/1950365.1950408"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/1941553.1941574"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.44"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346191"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3046678","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3046678","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3046678","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:18:25Z","timestamp":1763457505000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3046678"}},"subtitle":["Low-Overhead GPU Race Detection Without Access 
Monitoring"],"short-title":[],"issued":{"date-parts":[[2017,3,21]]},"references-count":56,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2017,3,31]]}},"alternative-id":["10.1145\/3046678"],"URL":"https:\/\/doi.org\/10.1145\/3046678","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,3,21]]},"assertion":[{"value":"2016-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-01-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-03-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}