{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:41:42Z","timestamp":1750308102463,"version":"3.41.0"},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2005,11,1]],"date-time":"2005-11-01T00:00:00Z","timestamp":1130803200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGARCH Comput. Archit. News"],"published-print":{"date-parts":[[2005,11]]},"abstract":"<jats:p>Chip multi-processors (CMPs) already have widespread commercial availability, and technology roadmaps project enough on-chip transistors to replicate tens or hundreds of current processor cores. How will we express parallelism, partition applications, and schedule\/place\/migrate threads on these highly-parallel CMPs? This paper presents and evaluates a new approach to highly-parallel CMPs, advocating a new hardware-software contract. The software layer is encouraged to expose large amounts of multi-granular, heterogeneous parallelism. The hardware, meanwhile, is designed to offer low-overhead, low-area support for orchestrating and modulating this parallelism on CMPs at runtime. Specifically, our proposed CMP architecture consists of architectural and ISA support targeting thread creation, scheduling and context-switching, designed to facilitate effective hardware run-time mapping of threads to cores at low overheads. Dynamic modulation of parallelism provides the ability to respond to run-time variability that arises from dataset changes, memory system effects and power spikes and lulls, to name a few. 
It also naturally provides a long-term CMP platform with performance portability and tolerance to frequency and reliability variations across multiple CMP generations. Our simulations of a range of applications possessing do-all, streaming and recursive parallelism show speedups of 4-11.5X and energy-delay-product savings of 3.8X, on average, on a 16-core vs. a 1-core system. This is achieved with modest amounts of hardware support that allows for low overheads in thread creation, scheduling and context-switching. In particular, our simulations motivated the need for hardware support, showing that the large thread management overheads of current run-time software systems can lead to up to 6.5X slowdown. The difficulties faced in static scheduling were shown in our simulations with a static scheduling algorithm, fed with oracle profiled inputs suffering up to 107% slowdown compared to NDP's hardware scheduler, due to its inability to handle memory system variabilities. More broadly, we feel that the ideas presented here show promise for scaling to the systems expected in ten years, where the advantages of high transistor counts may be dampened by difficulties in circuit variations and reliability. 
These issues will make dynamic scheduling and adaptation mandatory; our proposals represent a first step towards that direction.<\/jats:p>","DOI":"10.1145\/1105734.1105742","type":"journal-article","created":{"date-parts":[[2006,2,6]],"date-time":"2006-02-06T18:14:10Z","timestamp":1139249650000},"page":"54-63","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Hardware-modulated parallelism in chip multiprocessors"],"prefix":"10.1145","volume":"33","author":[{"given":"Julia","family":"Chen","sequence":"first","affiliation":[{"name":"Princeton University"}]},{"given":"Philo","family":"Juang","sequence":"additional","affiliation":[{"name":"Princeton University"}]},{"given":"Kevin","family":"Ko","sequence":"additional","affiliation":[{"name":"Princeton University"}]},{"given":"Gilberto","family":"Contreras","sequence":"additional","affiliation":[{"name":"Princeton University"}]},{"given":"David","family":"Penry","sequence":"additional","affiliation":[{"name":"Princeton University"}]},{"given":"Ram","family":"Rangan","sequence":"additional","affiliation":[{"name":"Princeton University"}]},{"given":"Adam","family":"Stoler","sequence":"additional","affiliation":[{"name":"Princeton University"}]},{"given":"Li-Shiuan","family":"Peh","sequence":"additional","affiliation":[{"name":"Princeton University"}]},{"given":"Margaret","family":"Martonosi","sequence":"additional","affiliation":[{"name":"Princeton University"}]}],"member":"320","published-online":{"date-parts":[[2005,11]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/277651.277678"},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.21236\/ADA343398","volume-title":"Pthreads for dynamic and irregular parallelism,\" in Supercomputing '98: Proceedings of the 1998 ACM\/IEEE conference on Supercomputing (CDROM)","author":"Narlikar G. J.","year":"1998","unstructured":"G. J. Narlikar and G. E. 
Blelloch , \" Pthreads for dynamic and irregular parallelism,\" in Supercomputing '98: Proceedings of the 1998 ACM\/IEEE conference on Supercomputing (CDROM) ,. 1998 , pp. 1 -- 16 . G. J. Narlikar and G. E. Blelloch, \"Pthreads for dynamic and irregular parallelism,\" in Supercomputing '98: Proceedings of the 1998 ACM\/IEEE conference on Supercomputing (CDROM),. 1998, pp. 1--16."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/1025127.1026007"},{"key":"e_1_2_1_4_1","first-page":"3","volume-title":"A bandwidth-efficient architecture for media processing,\" in Proceedings of the 31st annual ACM\/IEEE international symposium on Microarchitecture","author":"Rixner S.","year":"1998","unstructured":"S. Rixner , W. J. Dally , U. J. Kapasi , B. Khailany , A. Lopez-Lagunas , P. R. Mattson , and J. D. Owens , \" A bandwidth-efficient architecture for media processing,\" in Proceedings of the 31st annual ACM\/IEEE international symposium on Microarchitecture . IEEE Computer Society Press , 1998 , pp. 3 -- 13 . S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens, \"A bandwidth-efficient architecture for media processing,\" in Proceedings of the 31st annual ACM\/IEEE international symposium on Microarchitecture. IEEE Computer Society Press, 1998, pp. 3--13."},{"key":"e_1_2_1_5_1","volume-title":"International Symposium on Computer Architecture","author":"Taylor M. B.","year":"2004","unstructured":"M. B. Taylor RAW microprocessor : An exposed-wire-delay architecture for ILP and streams,\" in Proc . International Symposium on Computer Architecture , June 2004 . M. B. Taylor et al., \"Evaulation of the RAW microprocessor: An exposed-wire-delay architecture for ILP and streams,\" in Proc. 
International Symposium on Computer Architecture, June 2004."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1024393.1024395"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.127260"},{"volume-title":"A, \"The mit alewife machine: A large-scale distributed-memory multiprocessor,\" Kluwer academic","year":"1991","key":"e_1_2_1_8_1","unstructured":"e. a. Agrawal , A, \"The mit alewife machine: A large-scale distributed-memory multiprocessor,\" Kluwer academic Publishers , 1991 . e. a. Agrawal, A, \"The mit alewife machine: A large-scale distributed-memory multiprocessor,\" Kluwer academic Publishers, 1991."},{"key":"e_1_2_1_9_1","first-page":"271","volume-title":"Proceedings of the 35th International Symposium on Microarchitecture (MICRO)","author":"Vachharajani M.","year":"2002","unstructured":"M. Vachharajani in Proceedings of the 35th International Symposium on Microarchitecture (MICRO) , November 2002 , pp. 271 -- 282 . M. Vachharajani et al., \"Microarchitectural exploration with Liberty,\" in Proceedings of the 35th International Symposium on Microarchitecture (MICRO), November 2002, pp. 271--282."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339657"},{"key":"e_1_2_1_11_1","unstructured":"\"International technology roadmap for semiconductors \" http:\/\/public.itrs.net.  \"International technology roadmap for semiconductors \" http:\/\/public.itrs.net."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.80160"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1999.1578"},{"key":"e_1_2_1_14_1","volume-title":"P. Brucker, \"Scheduling algorithms","year":"2004","unstructured":"P. Brucker, \"Scheduling algorithms , 4 th edition.\" Springer , 2004 , ISBN 3540205241. P. 
Brucker, \"Scheduling algorithms, 4th edition.\" Springer, 2004, ISBN 3540205241.","edition":"4"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859667"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339673"},{"key":"e_1_2_1_17_1","volume-title":"Proc. MICRO","author":"Swanson S.","year":"2003","unstructured":"S. Swanson in Proc. MICRO , November 2003 . S. Swanson et al., \"Wavescalar,\" in Proc. MICRO, November 2003."},{"key":"e_1_2_1_18_1","volume-title":"Synchroscalar: A multiple clock domain, power-aware, tile-based embedded processor,\" in Proceedings of the International Symposium on Computer Architecture","author":"Oliver J.","year":"2004","unstructured":"J. Oliver , \" Synchroscalar: A multiple clock domain, power-aware, tile-based embedded processor,\" in Proceedings of the International Symposium on Computer Architecture , 2004 . J. Oliver et al., \"Synchroscalar: A multiple clock domain, power-aware, tile-based embedded processor,\" in Proceedings of the International Symposium on Computer Architecture, 2004."},{"key":"e_1_2_1_19_1","volume-title":"Overview of the monsoon project,\" in Proceedings of ICCD","author":"Traub K.","year":"1991","unstructured":"K. Traub , M. Beckerle , G. Padadopoulous , J. Hicks , and J. Young , \" Overview of the monsoon project,\" in Proceedings of ICCD , 1991 . K. Traub, M. Beckerle, G. Padadopoulous, J. Hicks, and J. Young, \"Overview of the monsoon project,\" in Proceedings of ICCD, 1991."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/74925.74930"},{"key":"e_1_2_1_21_1","volume-title":"An architectural design of a highly parallel dataflow machine,\" in IFIP","author":"Yamaguchi Y.","year":"1989","unstructured":"Y. Yamaguchi and S. Sakai , \" An architectural design of a highly parallel dataflow machine,\" in IFIP , 1989 . Y. Yamaguchi and S. 
Sakai, \"An architectural design of a highly parallel dataflow machine,\" in IFIP, 1989."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2465.2468"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.48862"},{"key":"e_1_2_1_24_1","unstructured":"D. E. Culler K. E. Schauser and T. von Eicken \"Two fundamental limits on dataflow multiprocessing \" in IFIP 1993.   D. E. Culler K. E. Schauser and T. von Eicken \"Two fundamental limits on dataflow multiprocessing \" in IFIP 1993."},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"D. E. Culler S. C. Goldstein K. E. Schauser and T. von Eicken \"Tam - a compiler controlled threaded abstract machine \" 1993.  D. E. Culler S. C. Goldstein K. E. Schauser and T. von Eicken \"Tam - a compiler controlled threaded abstract machine \" 1993.","DOI":"10.1006\/jpdc.1993.1070"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/223982.224449"},{"key":"e_1_2_1_27_1","first-page":"71","volume-title":"IEEE Computer Society","author":"Kim H.-S.","year":"2002","unstructured":"H.-S. Kim and J. E. Smith , \" An instruction set and microarchitecture for instruction level distributed processing,\" in Proceedings of the 29th annual international symposium on Computer architecture . IEEE Computer Society , 2002 , pp. 71 -- 81 . H.-S. Kim and J. E. Smith, \"An instruction set and microarchitecture for instruction level distributed processing,\" in Proceedings of the 29th annual international symposium on Computer architecture. IEEE Computer Society, 2002, pp. 71--81."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/264107.264201"},{"key":"e_1_2_1_29_1","volume-title":"PACT","author":"Parcerisa J.-M.","year":"2002","unstructured":"J.-M. Parcerisa interconnects for clustered microarchitectures,\" in Proc . PACT , 2002 . J.-M. Parcerisa et al., \"Efficient interconnects for clustered microarchitectures,\" in Proc. 
PACT, 2002."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/223982.224451"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/165123.165158"},{"key":"e_1_2_1_32_1","first-page":"146","volume-title":"The m-machine multicomputer,\" in Proceedings of the 28th annual international symposium on Microarchitecture","author":"Fillo M.","year":"1995","unstructured":"M. Fillo , S. W. Keckler , W. J. Dally , N. P. Carter , A. Chang , Y. Gurevich , and W. S. Lee , \" The m-machine multicomputer,\" in Proceedings of the 28th annual international symposium on Microarchitecture . IEEE Computer Society Press , 1995 , pp. 146 -- 156 . M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee, \"The m-machine multicomputer,\" in Proceedings of the 28th annual international symposium on Microarchitecture. IEEE Computer Society Press, 1995, pp. 146--156."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/191995.192056"},{"key":"e_1_2_1_34_1","volume-title":"Principles and Practices of Interconnection Networks","author":"Dally W.","year":"2004","unstructured":"W. Dally and B. Towles , Principles and Practices of Interconnection Networks . Morgan Kaufman Publishers , 2004 . W. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufman Publishers, 2004."},{"key":"e_1_2_1_35_1","volume-title":"HPCA","author":"Taylor M. B.","year":"2003","unstructured":"M. B. Taylor : On-chip interconnect for ilp in partitioned architectures,\" in Proc . HPCA , February 2003 . M. B. Taylor et al., \"Scalar operand networks: On-chip interconnect for ilp in partitioned architectures,\" in Proc. 
HPCA, February 2003."}],"container-title":["ACM SIGARCH Computer Architecture News"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1105734.1105742","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1105734.1105742","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T16:08:03Z","timestamp":1750262883000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1105734.1105742"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,11]]},"references-count":35,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2005,11]]}},"alternative-id":["10.1145\/1105734.1105742"],"URL":"https:\/\/doi.org\/10.1145\/1105734.1105742","relation":{},"ISSN":["0163-5964"],"issn-type":[{"type":"print","value":"0163-5964"}],"subject":[],"published":{"date-parts":[[2005,11]]},"assertion":[{"value":"2005-11-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}