{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T11:48:28Z","timestamp":1763466508310,"version":"3.41.0"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2015,3,9]],"date-time":"2015-03-09T00:00:00Z","timestamp":1425859200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF Graduate Research Fellowship Grant # DGE-0707424"},{"name":"NSF Expedition in Computing Award # CCF-0926127"},{"name":"C-FAR (one of six centers of STARnet, an SRC program sponsored by MARCO and DARPA)"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2015,4,16]]},"abstract":"<jats:p>The purpose of this research is to find a neural-network-based solution to the well-known problem of branch divergence in Single Instruction Multiple Data (SIMD) architectures. Our approach differs from existing techniques that handle branch (or control-flow) divergence, which use costly hardware modifications, low-utilization masking techniques, or static prediction methods. As we examine divergent applications, we characterize the degree of data-dependent control flow seen in each and isolate the code regions (or \u201ckernels\u201d) that cause the most performance degradation due to branch divergence. We then train neural networks (NNs) offline to approximate these kernels and inject the NN computations directly into the applications as substitutes for the kernels they approximate. This essentially translates control flow into nondivergent computation, trading off precision for performance. 
As our methodology manipulates application source code directly, it is inherently platform agnostic and can be adopted as a general means for accelerating divergent applications on data-parallel architectures. In this article, we present the Neuralizer, an automated software flow for kernel identification, NN training, and NN integration, as well as supplementary user-controlled optimization techniques. Evaluating our approach on a variety of divergent applications run on a Graphics Processing Unit (GPU), we on average achieve performance gains of 13.6 \u00d7 and energy savings of 14.8 \u00d7 with 96% accuracy.<\/jats:p>","DOI":"10.1145\/2717311","type":"journal-article","created":{"date-parts":[[2015,3,9]],"date-time":"2015-03-09T19:03:01Z","timestamp":1425927781000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["Accelerating Divergent Applications on SIMD Architectures Using Neural Networks"],"prefix":"10.1145","volume":"12","author":[{"given":"Beayna","family":"Grigorian","sequence":"first","affiliation":[{"name":"University of California, Los Angeles, CA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Glenn","family":"Reinman","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, 
CA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2015,3,9]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1806596.1806620"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073012.1073017"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454128"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2509136.2509546"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2025113.2025131"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.05.014"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2012.6402898"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2011.2179038"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2228360.2228512"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2333660.2333747"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155676"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2150976.2151008"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.48"},{"volume-title":"Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture. 25--36","author":"Fung W. W. L.","key":"e_1_2_1_14_1","unstructured":"W. W. L. Fung and T. M. Aamodt . 2011. Thread block compaction for efficient SIMT control flow . In Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture. 25--36 . W. W. L. Fung and T. M. Aamodt. 2011. Thread block compaction for efficient SIMT control flow. In Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture. 
25--36."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.12"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/AHS.2014.6880184"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1128022.1128023"},{"key":"e_1_2_1_18_1","volume-title":"Neural Networks: A Comprehensive Foundation","author":"Haykin Simon","year":"1998","unstructured":"Simon Haykin . 1998 . Neural Networks: A Comprehensive Foundation ( 2 nd ed.). Prentice Hall PTR. Simon Haykin. 1998. Neural Networks: A Comprehensive Foundation (2nd ed.). Prentice Hall PTR.","edition":"2"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/0893-6080(91)90009-T"},{"key":"e_1_2_1_20_1","unstructured":"P3 International. 2014. Kill A Watt. http:\/\/www.p3international.com\/products\/p4400.html. (2014).  P3 International. 2014. Kill A Watt. http:\/\/www.p3international.com\/products\/p4400.html. (2014)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/360128.360145"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000080"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 2nd Annual ASCI Conference. 132--137","author":"Lomont Chris","year":"2011","unstructured":"Chris Lomont . 2011 . Introduction to Intel advanced vector extensions . In Proceedings of the 2nd Annual ASCI Conference. 132--137 . Chris Lomont. 2011. Introduction to Intel advanced vector extensions. In Proceedings of the 2nd Annual ASCI Conference. 132--137."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the 18th IEEE\/IFIP VLSI System on Chip Conference. 91--95","author":"Meher P. K.","year":"2010","unstructured":"P. K. Meher . 2010 . An optimized lookup-table for the evaluation of sigmoid function for artificial neural networks . In Proceedings of the 18th IEEE\/IFIP VLSI System on Chip Conference. 91--95 . P. K. Meher. 2010. 
An optimized lookup-table for the evaluation of sigmoid function for artificial neural networks. In Proceedings of the 18th IEEE\/IFIP VLSI System on Chip Conference. 91--95."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815992"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"volume-title":"Implementation of a Fast Artificial Neural Network Library (FANN). Report","author":"Nissen Steffen","key":"e_1_2_1_29_1","unstructured":"Steffen Nissen . 2003. Implementation of a Fast Artificial Neural Network Library (FANN). Report . Department of Computer Science University of Copenhagen (DIKU) . Steffen Nissen. 2003. Implementation of a Fast Artificial Neural Network Library (FANN). Report. Department of Computer Science University of Copenhagen (DIKU)."},{"key":"e_1_2_1_30_1","unstructured":"Nvidia. 2013. GeForce GTX 480. Retrieved from http:\/\/www.geforce.com\/hardware\/desktop-gpus\/geforce-gtx-480.  Nvidia. 2013. GeForce GTX 480. Retrieved from http:\/\/www.geforce.com\/hardware\/desktop-gpus\/geforce-gtx-480."},{"key":"e_1_2_1_31_1","unstructured":"Nvidia. 2014a. CUDA 5.5 Production Release. Retrieved from http:\/\/developer.nvidia.com\/cuda-downloads.  Nvidia. 2014a. CUDA 5.5 Production Release. Retrieved from http:\/\/developer.nvidia.com\/cuda-downloads."},{"key":"e_1_2_1_32_1","unstructured":"Nvidia. 2014b. CUDA Math Library. Retrieved from http:\/\/developer.nvidia.com\/cuda-math-library.  Nvidia. 2014b. CUDA Math Library. Retrieved from http:\/\/developer.nvidia.com\/cuda-math-library."},{"key":"e_1_2_1_33_1","unstructured":"J. M. Ortega and W. C. Rheinboldt. 1970. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press.   J. M. Ortega and W. C. Rheinboldt. 1970. Iterative Solution of Nonlinear Equations in Several Variables. 
Academic Press."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626400000214"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1038\/323533a0"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541948"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540711"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993498.1993518"},{"volume-title":"CUDA by Example: An Introduction to General-Purpose GPU Programming","author":"Sanders Jason","key":"e_1_2_1_39_1","unstructured":"Jason Sanders and Edward Kandrot . 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming . Addison-Wesley . Jason Sanders and Edward Kandrot. 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2012.2232647"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1360612.1360617"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2025113.2025133"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339693"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485954"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155640"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018331"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342011434814"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/5326.897072"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2717311","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2717311","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T07:00:44Z","timestamp":1750230044000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2717311"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,3,9]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2015,4,16]]}},"alternative-id":["10.1145\/2717311"],"URL":"https:\/\/doi.org\/10.1145\/2717311","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2015,3,9]]},"assertion":[{"value":"2014-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-03-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}