{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:28:05Z","timestamp":1750220885167,"version":"3.41.0"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,12,26]],"date-time":"2019-12-26T00:00:00Z","timestamp":1577318400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["CCF-1150013 (CAREER), CCF-1439126, CCF-1150036 (CAREER), and CCF-1439062"],"award-info":[{"award-number":["CCF-1150013 (CAREER), CCF-1439126, CCF-1150036 (CAREER), and CCF-1439062"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Battelle for DOE","award":["DE-AC05-76RL01830"],"award-info":[{"award-number":["DE-AC05-76RL01830"]}]},{"name":"U.S. Department of Energy's (DOE) Office of Science, Office of Advanced Scientific Computing Research, under DOE Early Career","award":["63823 and DE-SC0010295"],"award-info":[{"award-number":["63823 and DE-SC0010295"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>\n            The pursuit of computational efficiency has led to the proliferation of\n            <jats:italic>throughput-oriented<\/jats:italic>\n            hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to execute data-parallel computations in a vectorized manner efficiently. However, many algorithms are more naturally expressed as divide-and-conquer, recursive,\n            <jats:italic>task-parallel<\/jats:italic>\n            computations. 
In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This article presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel\u2019s SSE4.2 vector units, as well as accelerators using Intel\u2019s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.\n          <\/jats:p>","DOI":"10.1145\/3365663","type":"journal-article","created":{"date-parts":[[2019,12,26]],"date-time":"2019-12-26T21:05:46Z","timestamp":1577394346000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Extracting SIMD Parallelism from Recursive Task-Parallel Programs"],"prefix":"10.1145","volume":"6","author":[{"given":"Bin","family":"Ren","sequence":"first","affiliation":[{"name":"William & Mary, Pacific Northwest National Laboratory"}]},{"given":"Shruthi","family":"Balakrishna","sequence":"additional","affiliation":[{"name":"Purdue University"}]},{"given":"Youngjoon","family":"Jo","sequence":"additional","affiliation":[{"name":"Purdue University"}]},{"given":"Sriram","family":"Krishnamoorthy","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory"}]},{"given":"Kunal","family":"Agrawal","sequence":"additional","affiliation":[{"name":"Washington University in St. 
Louis"}]},{"given":"Milind","family":"Kulkarni","sequence":"additional","affiliation":[{"name":"Purdue University"}]}],"member":"320","published-online":{"date-parts":[[2019,12,26]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Timo Aila and Samuli Laine. 2009. Understanding the efficiency of ray traversal on GPUs. In HPG\u201909. 145--149.  Timo Aila and Samuli Laine. 2009. Understanding the efficiency of ray traversal on GPUs. In HPG\u201909. 145--149.","DOI":"10.1145\/1572769.1572792"},{"key":"e_1_2_1_2_1","unstructured":"Barcelona OpenMP Task Suite (BOTS) 2012. Barcelona OpenMP Task Suite (BOTS). https:\/\/pm.bsc.es\/projects\/bots.  Barcelona OpenMP Task Suite (BOTS) 2012. Barcelona OpenMP Task Suite (BOTS). https:\/\/pm.bsc.es\/projects\/bots."},{"key":"e_1_2_1_3_1","volume-title":"Sumit Gulwani, Cesar Kunz, and Mark Marron.","author":"Barthe Gilles","year":"2013","unstructured":"Gilles Barthe , Juan Manuel Crespo , Sumit Gulwani, Cesar Kunz, and Mark Marron. 2013 . From relational verification to SIMD loop synthesis. In PPoPP\u2019 13. 123--134. Gilles Barthe, Juan Manuel Crespo, Sumit Gulwani, Cesar Kunz, and Mark Marron. 2013. From relational verification to SIMD loop synthesis. In PPoPP\u201913. 123--134."},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Lars Bergstrom Matthew Fluet Mike Rainey John Reppy Stephen Rosen and Adam Shaw. 2013. Data-only flattening for nested data parallelism. ACM SIGPLAN Notices 48. ACM 81--92.  Lars Bergstrom Matthew Fluet Mike Rainey John Reppy Stephen Rosen and Adam Shaw. 2013. Data-only flattening for nested data parallelism. ACM SIGPLAN Notices 48. ACM 81--92.","DOI":"10.1145\/2517327.2442525"},{"volume-title":"SPAA\u201904: Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM","author":"Guy","key":"e_1_2_1_5_1","unstructured":"Guy E. Blelloch and Phillip B. Gibbons. 2004. Effectively sharing a cache among threads . 
In SPAA\u201904: Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM , New York,, 235--244. DOI:https:\/\/doi.org\/10.1145\/1007912.1007948 10.1145\/1007912.1007948 Guy E. Blelloch and Phillip B. Gibbons. 2004. Effectively sharing a cache among threads. In SPAA\u201904: Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, New York,, 235--244. DOI:https:\/\/doi.org\/10.1145\/1007912.1007948"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/0743-7315(90)90087-6"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/209936.209958"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4374"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/1413957.1413967"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-44681-8_76"},{"key":"e_1_2_1_11_1","volume-title":"Simon","author":"Chhugani Jatin","year":"2012","unstructured":"Jatin Chhugani , Changkyu Kim , Hemant Shukla , Jongsoo Park , Pradeep Dubey , John Shalf , and Horst D . Simon . 2012 . Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems. In SC\u201912. Article 1, 11 pages. Jatin Chhugani, Changkyu Kim, Hemant Shukla, Jongsoo Park, Pradeep Dubey, John Shalf, and Horst D. Simon. 2012. Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems. In SC\u201912. Article 1, 11 pages."},{"key":"e_1_2_1_12_1","unstructured":"Cilk 2010. Cilk. http:\/\/supertech.csail.mit.edu\/cilk\/.  Cilk 2010. Cilk. http:\/\/supertech.csail.mit.edu\/cilk\/."},{"volume-title":"Shallow bounding","author":"Dammertz Holger","key":"e_1_2_1_13_1","unstructured":"Holger Dammertz , Johannes Hanika , and Alexander Keller . 2008. Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays. In EGSR\u2019 08 . 1225--1233. Holger Dammertz, Johannes Hanika, and Alexander Keller. 2008. 
Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays. In EGSR\u201908. 1225--1233."},{"key":"e_1_2_1_14_1","first-page":"2","article-title":"Programming with exceptions in JCilk. Sci","volume":"63","author":"Danaher John S.","year":"2006","unstructured":"John S. Danaher , I.- Ting Angelina Lee , and Charles E. Leiserson . 2006 . Programming with exceptions in JCilk. Sci . Comput. Program. 63 , 2 (Dec. 2006), 147--171. John S. Danaher, I.-Ting Angelina Lee, and Charles E. Leiserson. 2006. Programming with exceptions in JCilk. Sci. Comput. Program. 63, 2 (Dec. 2006), 147--171.","journal-title":"Comput. Program."},{"key":"e_1_2_1_15_1","first-page":"7","article-title":"A fast computer method for matrix transposing","volume":"21","author":"Eklundh J. O.","year":"1972","unstructured":"J. O. Eklundh . 1972 . A fast computer method for matrix transposing . IEEE Trans. Comput. 21 , 7 (July 1972), 801--803. J. O. Eklundh. 1972. A fast computer method for matrix transposing. IEEE Trans. Comput. 21, 7 (July 1972), 801--803.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783716"},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Matteo Frigo Pablo Halpern Charles E. Leiserson and Stephen Lewin-Berlin. 2009. Reducers and other Cilk++ hyperobjects. In SPAA\u201909. 79--90.  Matteo Frigo Pablo Halpern Charles E. Leiserson and Stephen Lewin-Berlin. 2009. Reducers and other Cilk++ hyperobjects. In SPAA\u201909. 79--90.","DOI":"10.1145\/1583991.1584017"},{"key":"e_1_2_1_18_1","volume-title":"Randall","author":"Frigo Matteo","year":"1998","unstructured":"Matteo Frigo , Charles E. Leiserson , and Keith H . Randall . 1998 . The implementation of the Cilk-5 multithreaded language. In PLDI\u2019 98. 212--223. Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the Cilk-5 multithreaded language. In PLDI\u201998. 
212--223."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2012.257"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Yi Guo R. Barik R. Raman and V. Sarkar. 2009. Work-first and help-first scheduling policies for async-finish task parallelism. In IPDPS\u201909. 1--12.  Yi Guo R. Barik R. Raman and V. Sarkar. 2009. Work-first and help-first scheduling policies for async-finish task parallelism. In IPDPS\u201909. 1--12.","DOI":"10.1109\/IPDPS.2009.5161079"},{"key":"e_1_2_1_21_1","volume-title":"Owens","author":"Gupta Kshitij","year":"2012","unstructured":"Kshitij Gupta , Jeff A. Stuart , and John D . Owens . 2012 . A study of persistent threads style GPU programming for GPGPU workloads. In 2012 Innovative Parallel Computing (InPar). IEEE , 1--14. Kshitij Gupta, Jeff A. Stuart, and John D. Owens. 2012. A study of persistent threads style GPU programming for GPGPU workloads. In 2012 Innovative Parallel Computing (InPar). IEEE, 1--14."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2009.73"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/0021-9991(90)90230-X"},{"key":"e_1_2_1_24_1","doi-asserted-by":"crossref","unstructured":"R. D. Hornung and J. A. Keasler. 2013. A Case for Improved C++ Compiler Support to Enable Performance Portability in Large Physics Simulation Codes. Technical Report. Tech. rep. Lawrence Livermore National Laboratory (LLNL) Livermore CA.  R. D. Hornung and J. A. Keasler. 2013. A Case for Improved C++ Compiler Support to Enable Performance Portability in Large Physics Simulation Codes. Technical Report. Tech. rep. Lawrence Livermore National Laboratory (LLNL) Livermore CA.","DOI":"10.2172\/1078540"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2010.88"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Paul Hudak and Eric Mohr. 1988. Graphinators and the duality of SIMD and MIMD. In LFP\u201988. 224--234.  
Paul Hudak and Eric Mohr. 1988. Graphinators and the duality of SIMD and MIMD. In LFP\u201988. 224--234.","DOI":"10.1145\/62678.62714"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Xin Huo Sriram Krishnamoorthy and Gagan Agrawal. 2013. Efficient scheduling of recursive control flow on GPUs. In ICS\u201913. 409--420.  Xin Huo Sriram Krishnamoorthy and Gagan Agrawal. 2013. Efficient scheduling of recursive control flow on GPUs. In ICS\u201913. 409--420.","DOI":"10.1145\/2464996.2479870"},{"key":"e_1_2_1_28_1","unstructured":"Youngjoon Jo Michael Goldfarb and Milind Kulkarni. 2013. Automatic vectorization of tree traversals. In PACT\u201913. 363--374.  Youngjoon Jo Michael Goldfarb and Milind Kulkarni. 2013. Automatic vectorization of tree traversals. In PACT\u201913. 363--374."},{"key":"e_1_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Youngjoon Jo and Milind Kulkarni. 2011. Enhancing locality for recursive traversals of recursive structures. In OOPSLA\u201911. 463--482.  Youngjoon Jo and Milind Kulkarni. 2011. Enhancing locality for recursive traversals of recursive structures. In OOPSLA\u201911. 463--482.","DOI":"10.1145\/2076021.2048104"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807206"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Seonggun Kim and Hwansoo Han. 2012. Efficient SIMD code generation for irregular kernels. In PPoPP\u201912. 55--64.  Seonggun Kim and Hwansoo Han. 2012. Efficient SIMD code generation for irregular kernels. In PPoPP\u201912. 
55--64.","DOI":"10.1145\/2370036.2145824"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1504\/IJHPCN.2004.008897"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/0196-6774(90)90002-V"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2015.107"},{"key":"e_1_2_1_35_1","volume-title":"Russell Power, and Jinyang Li","author":"Liao Yisheng","year":"2013","unstructured":"Yisheng Liao , Alex Rubinsteyn , Russell Power, and Jinyang Li . 2013 . Learning random forests on the GPU. New York University, Department of Computer Science ( 2013). Yisheng Liao, Alex Rubinsteyn, Russell Power, and Jinyang Li. 2013. Learning random forests on the GPU. New York University, Department of Computer Science (2013)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0020-0255(98)10080-4"},{"key":"e_1_2_1_37_1","volume-title":"Padua","author":"Maleki Saeed","year":"2011","unstructured":"Saeed Maleki , Yaoqing Gao , Maria J. Garzar\u00e1n , Tommy Wong , and David A . Padua . 2011 . An evaluation of vectorizing compilers. In PACT\u2019 11. 372--382. Saeed Maleki, Yaoqing Gao, Maria J. Garzar\u00e1n, Tommy Wong, and David A. Padua. 2011. An evaluation of vectorizing compilers. In PACT\u201911. 372--382."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3611"},{"volume-title":"Proceedings of the 17th Conference on ACM Annual Computer Science Conference (CSC\u201989)","author":"Martin H. W.","key":"e_1_2_1_39_1","unstructured":"H. W. Martin and B. J. Orr . 1989. A random binary tree generator . In Proceedings of the 17th Conference on ACM Annual Computer Science Conference (CSC\u201989) . ACM, New York, 33--38. DOI:https:\/\/doi.org\/10.1145\/75427.75429 10.1145\/75427.75429 H. W. Martin and B. J. Orr. 1989. A random binary tree generator. In Proceedings of the 17th Conference on ACM Annual Computer Science Conference (CSC\u201989). ACM, New York, 33--38. 
DOI:https:\/\/doi.org\/10.1145\/75427.75429"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Todd Mytkowicz Madanlal Musuvathi and Wolfram Schulte. 2014. Data-parallel finite-state machines. In ASPLOS\u201914. 529--542.  Todd Mytkowicz Madanlal Musuvathi and Wolfram Schulte. 2014. Data-parallel finite-state machines. In ASPLOS\u201914. 529--542.","DOI":"10.1145\/2644865.2541988"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3865"},{"key":"e_1_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Dorit Nuzman and Ayal Zaks. 2008. Outer-loop vectorization: Revisited for short SIMD architectures. In PACT\u201908. 2--11.  Dorit Nuzman and Ayal Zaks. 2008. Outer-loop vectorization: Revisited for short SIMD architectures. In PACT\u201908. 2--11.","DOI":"10.1145\/1454115.1454119"},{"key":"e_1_2_1_43_1","unstructured":"NVIDIA. 2015. CUDA. http:\/\/www.nvidia.com\/object\/cuda_home_new.html.  NVIDIA. 2015. CUDA. http:\/\/www.nvidia.com\/object\/cuda_home_new.html."},{"key":"e_1_2_1_44_1","volume-title":"UTS: An unbalanced tree search benchmark. In LCPC\u201906. 235--250.","author":"Olivier Stephen","year":"2007","unstructured":"Stephen Olivier , Jun Huan , Jinze Liu , Jan Prins , James Dinan , P. Sadayappan , and Chau-Wen Tseng . 2007 . UTS: An unbalanced tree search benchmark. In LCPC\u201906. 235--250. Stephen Olivier, Jun Huan, Jinze Liu, Jan Prins, James Dinan, P. Sadayappan, and Chau-Wen Tseng. 2007. UTS: An unbalanced tree search benchmark. In LCPC\u201906. 235--250."},{"key":"e_1_2_1_45_1","unstructured":"OpenMP Architecture Review Board. 2008. OpenMP Specification and Features. http:\/\/openmp.org\/wp\/.  OpenMP Architecture Review Board. 2008. OpenMP Specification and Features. http:\/\/openmp.org\/wp\/."},{"key":"e_1_2_1_46_1","volume-title":"Wood","author":"Orr Marc S.","year":"2014","unstructured":"Marc S. Orr , Bradford M. Beckmann , Steven K. Reinhardt , and David A . Wood . 2014 . 
Fine-grain task aggregation and coordination on GPUs. In ISCA\u2019 14. 181--192. Marc S. Orr, Bradford M. Beckmann, Steven K. Reinhardt, and David A. Wood. 2014. Fine-grain task aggregation and coordination on GPUs. In ISCA\u201914. 181--192."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/1409060.1409096"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840306"},{"key":"e_1_2_1_49_1","unstructured":"James Reinders. 2007. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O\u2019Reilly.  James Reinders. 2007. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O\u2019Reilly."},{"key":"e_1_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Bin Ren Gagan Agrawal James R. Larus Todd Mytkowicz Tomi Poutanen and Wolfram Schulte. 2013. SIMD parallelization of applications that traverse irregular data structures. In CGO\u201913. 1--10.  Bin Ren Gagan Agrawal James R. Larus Todd Mytkowicz Tomi Poutanen and Wolfram Schulte. 2013. SIMD parallelization of applications that traverse irregular data structures. In CGO\u201913. 1--10.","DOI":"10.1109\/CGO.2013.6494989"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2737924.2738004"},{"key":"e_1_2_1_52_1","doi-asserted-by":"crossref","unstructured":"Bin Ren Sriram Krishnamoorthy Kunal Agrawal and Milind Kulkarni. 2017. Exploiting vector and multicore parallelism for recursive data-and task-parallel programs. ACM SIGPLAN Notices 52. ACM 117--130.  Bin Ren Sriram Krishnamoorthy Kunal Agrawal and Milind Kulkarni. 2017. Exploiting vector and multicore parallelism for recursive data-and task-parallel programs. ACM SIGPLAN Notices 52. ACM 117--130.","DOI":"10.1145\/3155284.3018763"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/45.6.653"},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","unstructured":"Michael Steffen and Joseph Zambreno. 2010. 
Improving SIMT efficiency of global rendering algorithms with architectural support for dynamic micro-kernels. In MICRO\u201943. 237--248.  Michael Steffen and Joseph Zambreno. 2010. Improving SIMT efficiency of global rendering algorithms with architectural support for dynamic micro-kernels. In MICRO\u201943. 237--248.","DOI":"10.1109\/MICRO.2010.45"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2661229.2661250"},{"key":"e_1_2_1_56_1","first-page":"3","article-title":"OpenCL: A parallel programming standard for heterogeneous computing systems","volume":"12","author":"Stone John E.","year":"2010","unstructured":"John E. Stone , David Gohara , and Guochun Shi . 2010 . OpenCL: A parallel programming standard for heterogeneous computing systems . IEEE Des. Test 12 , 3 (May 2010), 66--73. John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Des. Test 12, 3 (May 2010), 66--73.","journal-title":"IEEE Des. Test"},{"key":"e_1_2_1_57_1","unstructured":"TPL 2007. The Task Parallel Library. http:\/\/msdn.microsoft.com\/en-us\/magazine\/cc163340.aspx.  TPL 2007. The Task Parallel Library. http:\/\/msdn.microsoft.com\/en-us\/magazine\/cc163340.aspx."},{"key":"e_1_2_1_58_1","volume-title":"Owens","author":"Tzeng Stanley","year":"2010","unstructured":"Stanley Tzeng , Anjul Patney , and John D . Owens . 2010 . Task management for irregular-parallel workloads on the GPU. In HPG\u2019 10. 29--37. Stanley Tzeng, Anjul Patney, and John D. Owens. 2010. Task management for irregular-parallel workloads on the GPU. In HPG\u201910. 29--37."},{"key":"e_1_2_1_59_1","volume-title":"BrainSlug: Transparent acceleration of deep learning through depth-first parallelism. arXiv preprint arXiv:1804.08378","author":"Weber Nicolas","year":"2018","unstructured":"Nicolas Weber , Florian Schmidt , Mathias Niepert , and Felipe Huici . 2018. 
BrainSlug: Transparent acceleration of deep learning through depth-first parallelism. arXiv preprint arXiv:1804.08378 ( 2018 ). Nicolas Weber, Florian Schmidt, Mathias Niepert, and Felipe Huici. 2018. BrainSlug: Transparent acceleration of deep learning through depth-first parallelism. arXiv preprint arXiv:1804.08378 (2018)."},{"volume-title":"Proceedings of the 19th Symposium on Interactive 3D Graphics and Games. ACM, 39--45","author":"Weber Thomas","key":"e_1_2_1_60_1","unstructured":"Thomas Weber , Michael Wimmer , and John D. Owens . 2015. Parallel Reyes-style adaptive subdivision with bounded memory usage . In Proceedings of the 19th Symposium on Interactive 3D Graphics and Games. ACM, 39--45 . Thomas Weber, Michael Wimmer, and John D. Owens. 2015. Parallel Reyes-style adaptive subdivision with bounded memory usage. In Proceedings of the 19th Symposium on Interactive 3D Graphics and Games. ACM, 39--45."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICECCS.2015.21"},{"key":"e_1_2_1_62_1","unstructured":"X10 2006. The X10 Programming Language. www.research.ibm.com\/x10\/.  X10 2006. The X10 Programming Language. www.research.ibm.com\/x10\/."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2016.71"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3203217.3203243"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2014.07.004"},{"key":"e_1_2_1_66_1","doi-asserted-by":"crossref","unstructured":"Kun Zhou Qiming Hou Rui Wang and Baining Guo. 2008. Real-time KD-tree construction on graphics hardware. ACM Transactions on Graphics (TOG) 27. ACM 126.  Kun Zhou Qiming Hou Rui Wang and Baining Guo. 2008. Real-time KD-tree construction on graphics hardware. ACM Transactions on Graphics (TOG) 27. 
ACM 126.","DOI":"10.1145\/1457515.1409079"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3365663","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3365663","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3365663","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:44:21Z","timestamp":1750203861000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3365663"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,12,26]]},"references-count":66,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3365663"],"URL":"https:\/\/doi.org\/10.1145\/3365663","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2019,12,26]]},"assertion":[{"value":"2015-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-12-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}