{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,28]],"date-time":"2025-06-28T07:03:35Z","timestamp":1751094215456,"version":"3.41.0"},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2015,5,11]],"date-time":"2015-05-11T00:00:00Z","timestamp":1431302400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Program of China National 1000 Young Talent Plan"},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"crossref","award":["2013M540362 and 2014T70418"],"award-info":[{"award-number":["2013M540362 and 2014T70418"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61202026, 61332001, 61402285, and 61373155"],"award-info":[{"award-number":["61202026, 61332001, 61402285, and 61373155"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2015,7,8]]},"abstract":"<jats:p>\n            A modern general-purpose graphics processing unit (GPGPU) usually consists of multiple streaming multiprocessors (SMs), each having a pipeline that incorporates a group of threads executing a common instruction flow. Although SMs are designed to work independently, we observe that they tend to exhibit very similar behavior for many workloads. If multiple SMs can be grouped and work in the lock-step manner, it is possible to save energy by sharing the front-end units among multiple SMs, including the instruction fetch, decode, and schedule components. 
However, such sharing brings architectural challenges and sometimes causes performance degradation. In this article, we show our design, implementation, and evaluation for such an architecture, which we call\n            <jats:italic>Buddy SM<\/jats:italic>\n            . Specifically, multiple SMs can be opportunistically grouped into a buddy cluster. One SM becomes the master, and the rest become the slaves. The front-end unit of the master works actively for itself as well as for the slaves, whereas the front-end logic of the slaves is power gated. For efficient flow control and program correctness, the proposed architecture can identify unfavorable conditions and ungroup the buddy cluster when necessary. We analyze various techniques to improve the performance and energy efficiency of Buddy SM. Detailed experiments show that 37.2% front-end and 7.5% total GPU energy reduction can be achieved.\n          <\/jats:p>","DOI":"10.1145\/2744202","type":"journal-article","created":{"date-parts":[[2015,5,11]],"date-time":"2015-05-11T16:30:57Z","timestamp":1431361857000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Buddy SM"],"prefix":"10.1145","volume":"12","author":[{"given":"Tao","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Naifeng","family":"Jing","sequence":"additional","affiliation":[{"name":"Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kaiming","family":"Jiang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wei","family":"Shu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Min-You","family":"Wu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaoyao","family":"Liang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2015,5,11]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Fung","author":"Aamodt Tor M.","year":"2013","unstructured":"Tor M. Aamodt and Wilson W. L . Fung . 2013 . GPGPU-Sim 3.x Manual. Retrieved February 1, 2015, from http:\/\/gpgpu-sim.org\/manual\/index.php\/GPGPU-Sim&lowbar;3.x&lowbar;Manual. Tor M. Aamodt and Wilson W. L. Fung. 2013. GPGPU-Sim 3.x Manual. Retrieved February 1, 2015, from http:\/\/gpgpu-sim.org\/manual\/index.php\/GPGPU-Sim&lowbar;3.x&lowbar;Manual."},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909)","author":"Bakhoda Ali","year":"2009","unstructured":"Ali Bakhoda , George L. Yuan , Wilson W. L. Fung , Henry Wong , and Tor M. Aamodt . 2009. Analyzing CUDA workloads using a detailed GPU simulator . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909) . IEEE, Los Alamitos, CA, 163--174. DOI:http:\/\/dx.doi.org\/10.1109\/ISPASS. 2009 .4919648 10.1109\/ISPASS.2009.4919648 Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. 
Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909). IEEE, Los Alamitos, CA, 163--174. DOI:http:\/\/dx.doi.org\/10.1109\/ISPASS.2009.4919648"},{"key":"e_1_2_1_3_1","volume-title":"Immersive PC Experience. Retrieved","author":"Brookwood Nathan","year":"2015","unstructured":"Nathan Brookwood . 2010. AMD Fusion Family of APUs: Enabling a Superior , Immersive PC Experience. Retrieved February 1, 2015 , from http:\/\/www.amd.com\/Documents\/48423_fusion_whitepaper_WEB.pdf. Nathan Brookwood. 2010. AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience. Retrieved February 1, 2015, from http:\/\/www.amd.com\/Documents\/48423_fusion_whitepaper_WEB.pdf."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2012.6402918"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2005.20"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the IEEE 17th International Symposium on High-Performance Computer Architecture (HPCA\u201911)","author":"Wilson W.","year":"2011","unstructured":"Wilson W. L. Fung and Tor M. Aamodt. 2011. Thread block compaction for efficient SIMT control flow . In Proceedings of the IEEE 17th International Symposium on High-Performance Computer Architecture (HPCA\u201911) . IEEE, Los Alamitos, CA, 25--36. DOI:http:\/\/dx.doi.org\/10.1109\/HPCA. 2011 .5749714 10.1109\/HPCA.2011.5749714 Wilson W. L. Fung and Tor M. Aamodt. 2011. Thread block compaction for efficient SIMT control flow. In Proceedings of the IEEE 17th International Symposium on High-Performance Computer Architecture (HPCA\u201911). IEEE, Los Alamitos, CA, 25--36. 
DOI:http:\/\/dx.doi.org\/10.1109\/HPCA.2011.5749714"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.12"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1543753.1543756"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000093"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522331"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815998"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250662.1250686"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485952"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5555\/1331699.1331733"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2012.6378694"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-36424-2_12"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.20"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815992"},{"key":"e_1_2_1_21_1","volume-title":"Jouppi","author":"Muralimanohar Naveen","year":"2009","unstructured":"Naveen Muralimanohar , Rajeev Balasubramonian , and Norman P . Jouppi . 2009 . CACTI 6.0: A Tool to Model Large Caches. Retrieved February 1, 2015, from http:\/\/www.hpl.hp.com\/techreports\/2009\/HPL-2009-85.pdf Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2009. CACTI 6.0: A Tool to Model Large Caches. Retrieved February 1, 2015, from http:\/\/www.hpl.hp.com\/techreports\/2009\/HPL-2009-85.pdf"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_2_1_23_1","volume-title":"NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. Retrieved","author":"NVIDIA Corporation","year":"2015","unstructured":"NVIDIA Corporation . 2009. 
NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. Retrieved February 1, 2015 , from http:\/\/www.nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. NVIDIA Corporation. 2009. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. Retrieved February 1, 2015, from http:\/\/www.nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf."},{"key":"e_1_2_1_24_1","volume-title":"NVIDIA CUDA Toolkit 4.1\u2014Archive. Retrieved","author":"NVIDIA Corporation","year":"2015","unstructured":"NVIDIA Corporation . 2012a. NVIDIA CUDA Toolkit 4.1\u2014Archive. Retrieved February 1, 2015 , from https:\/\/developer.nvidia.com\/cuda-toolkit-41-archive NVIDIA Corporation. 2012a. NVIDIA CUDA Toolkit 4.1\u2014Archive. Retrieved February 1, 2015, from https:\/\/developer.nvidia.com\/cuda-toolkit-41-archive"},{"key":"e_1_2_1_25_1","volume-title":"NVIDIA\u2019s Next Generation CUDA Compute Architecture: Kepler GK110. Retrieved","author":"NVIDIA Corporation","year":"2015","unstructured":"NVIDIA Corporation . 2012b. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Kepler GK110. Retrieved February 1, 2015 , from http:\/\/www.nvidia.com\/content\/PDF\/kepler\/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. NVIDIA Corporation. 2012b. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Kepler GK110. 
Retrieved February 1, 2015, from http:\/\/www.nvidia.com\/content\/PDF\/kepler\/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337167"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522352"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485953"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.16"},{"key":"e_1_2_1_30_1","article-title":"NVIDIA\u2019s GeForce GTX 480 Finally Unleashed","author":"Sandhu Tarinder","year":"2010","unstructured":"Tarinder Sandhu . 2010 . NVIDIA\u2019s GeForce GTX 480 Finally Unleashed . Reviewed and Rated. Retrieved February 1, 2015 from http:\/\/hexus.net\/tech\/reviews\/graphics\/24000-nvidias-geforce-gtx-480-finally-unleashed-reviewed-rated\/&quest;page=2. Tarinder Sandhu. 2010. NVIDIA\u2019s GeForce GTX 480 Finally Unleashed. Reviewed and Rated. Retrieved February 1, 2015 from http:\/\/hexus.net\/tech\/reviews\/graphics\/24000-nvidias-geforce-gtx-480-finally-unleashed-reviewed-rated\/&quest;page=2.","journal-title":"Reviewed and Rated. Retrieved"},{"key":"e_1_2_1_31_1","volume-title":"Geng Daniel Liu, and Wen Mei W. Hwu","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton , Christopher Rodrigues , I- Jui Sung , Nady Obeid , Li-Wen Chang , Nasser Anssari , Geng Daniel Liu, and Wen Mei W. Hwu . 2012 . Parboil : A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. IMPACT Technical Report, IMPACT-12-01, University of Illinois Urbana-Champaign , Champaign, IL. John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen Mei W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. 
IMPACT Technical Report, IMPACT-12-01, University of Illinois Urbana-Champaign, Champaign, IL."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1391469.1391666"},{"key":"e_1_2_1_33_1","volume-title":"Retrieved","author":"Valich Theo","year":"2015","unstructured":"Theo Valich . 2010. NVIDIA \u201cFermi \u201d GeForce Die Sizes Exposed. Retrieved February 1, 2015 , from http:\/\/www.brightsideofnews.com\/news\/2010\/8\/9\/nvidia-fermi-geforce-die-sizes-exposed.aspx. Theo Valich. 2010. NVIDIA \u201cFermi\u201d GeForce Die Sizes Exposed. Retrieved February 1, 2015, from http:\/\/www.brightsideofnews.com\/news\/2010\/8\/9\/nvidia-fermi-geforce-die-sizes-exposed.aspx."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000094"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2014.6974695"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/NAS.2011.51"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346182"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2744202","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2744202","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T05:07:16Z","timestamp":1750223236000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2744202"}},"subtitle":["Sharing Pipeline Front-End for Improved Energy Efficiency in 
GPGPUs"],"short-title":[],"issued":{"date-parts":[[2015,5,11]]},"references-count":37,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2015,7,8]]}},"alternative-id":["10.1145\/2744202"],"URL":"https:\/\/doi.org\/10.1145\/2744202","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2015,5,11]]},"assertion":[{"value":"2014-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-05-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}