{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T12:19:07Z","timestamp":1763468347902,"version":"3.45.0"},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2017,6,28]],"date-time":"2017-06-28T00:00:00Z","timestamp":1498608000000},"content-version":"vor","delay-in-days":365,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Facebook","award":["Graduate Fellowship"],"award-info":[{"award-number":["Graduate Fellowship"]}]},{"DOI":"10.13039\/100006435","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-1018188, CCF-1314633, CCF-1314590"],"award-info":[{"award-number":["CCF-1018188, CCF-1314633, CCF-1314590"]}],"id":[{"id":"10.13039\/100006435","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"publisher","award":["DE-AC02-05CH11231"],"award-info":[{"award-number":["DE-AC02-05CH11231"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]},{"name":"e Intel Science and Technology Center for Cloud Computing"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2016,6,28]]},"abstract":"<jats:p>The running time of nested parallel programs on shared-memory machines depends in significant part on how well the scheduler mapping the program to the machine is optimized for the organization of caches and processor cores on the machine. Recent work proposed \u201cspace-bounded schedulers\u201d for scheduling such programs on the multilevel cache hierarchies of current machines. 
The main benefit of this class of schedulers is that they provably preserve locality of the program at every level in the hierarchy, which can result in fewer cache misses and better use of bandwidth than the popular work-stealing scheduler. On the other hand, compared to work stealing, space-bounded schedulers are inferior at load balancing and may have greater scheduling overheads, raising the question as to the relative effectiveness of the two schedulers in practice.<\/jats:p>\n                  <jats:p>In this article, we provide the first experimental study aimed at addressing this question. To facilitate this study, we built a flexible experimental framework with separate interfaces for programs and schedulers. This enables a head-to-head comparison of the relative strengths of schedulers in terms of running times and cache miss counts across a range of benchmarks. (The framework is validated by comparisons with the Intel\u00ae Cilk\u2122 Plus work-stealing scheduler.) We present experimental results on a 32-core Xeon\u00ae 7560 comparing work stealing, hierarchy-minded work stealing, and two variants of space-bounded schedulers on both divide-and-conquer microbenchmarks and some popular algorithmic kernels. Our results indicate that space-bounded schedulers reduce the number of L3 cache misses compared to work-stealing schedulers by 25% to 65% for most of the benchmarks, but incur up to 27% additional scheduler and load-imbalance overhead. Only for memory-intensive benchmarks can the reduction in cache misses overcome the added overhead, resulting in up to a 25% improvement in running time for synthetic benchmarks and about 20% improvement for algorithmic kernels. We also quantify runtime improvements varying the available bandwidth per core (the \u201cbandwidth gap\u201d) and show up to 50% improvements in the running times of kernels as this gap increases fourfold. 
As part of our study, we generalize prior definitions of space-bounded schedulers to allow for more practical variants (while still preserving their guarantees) and explore implementation tradeoffs.<\/jats:p>","DOI":"10.1145\/2938389","type":"journal-article","created":{"date-parts":[[2016,7,5]],"date-time":"2016-07-05T10:08:13Z","timestamp":1467713293000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Experimental Analysis of Space-Bounded Schedulers"],"prefix":"10.1145","volume":"3","author":[{"given":"Harsha Vardhan","family":"Simhadri","sequence":"first","affiliation":[{"name":"Lawrence Berkeley National Lab, Berkeley CA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guy E.","family":"Blelloch","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh PA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jeremy T.","family":"Fineman","sequence":"additional","affiliation":[{"name":"Georgetown University, NW, Washington D.C."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Phillip B.","family":"Gibbons","sequence":"additional","affiliation":[{"name":"Intel Labs Pittsburgh, Pittsburgh PA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aapo","family":"Kyrola","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh 
PA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2016,6,28]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00224-002-1057-3"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/PMMP.1993.315548"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/1347082.1347137"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145840"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989493.1989553"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2492408.2492417"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/301970.301974"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1810479.1810519"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/237502.237574"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/324133.324234"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1248377.1248392"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1378533.1378574"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2013.04.008"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2010.5470354"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1188455.1188543"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/277650.277725"},{"key":"e_1_2_1_17_1","volume-title":"White Paper: Fujitsu Primergy servers memory performance of Xeon 7500 (Nehalem-EX) based systems","author":"Solutions Fujitsu Technology","year":"2010","unstructured":"Fujitsu Technology Solutions. 2010. White Paper: Fujitsu Primergy servers memory performance of Xeon 7500 (Nehalem-EX) based systems. http:\/\/globalsp.ts.fujitsu.com\/dmsp\/Publications\/public\/wp-nehalem-ex-memory-performance-ww-en.pdf. 
(2010)."},{"key":"e_1_2_1_18_1","unstructured":"Intel. 2013a. Intel Thread Building Blocks Reference Manual. http:\/\/software.intel.com\/sites\/products\/documentation\/doclib\/tbb_sa\/help\/index.htm#reference\/reference.htm. (2013). Version 4.1."},{"key":"e_1_2_1_19_1","unstructured":"Intel. 2013b. Performance Counter Monitor (PCM). http:\/\/www.intel.com\/software\/pcm. (2013). Version 2.4."},{"key":"e_1_2_1_20_1","unstructured":"Andi Kleen. 2004. An NUMA API for Linux. http:\/\/halobates.de\/numaapi3.pdf. (August 2004)."},{"key":"e_1_2_1_21_1","unstructured":"Ronald Kriemann. 2004. Implementation and Usage of a Thread Pool based on POSIX Threads. http:\/\/www.hlnum.org\/english\/projects\/tools\/threadpool\/doc.html. (2004)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/337449.337465"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-010-0405-3"},{"key":"e_1_2_1_24_1","unstructured":"Adam Litke Eric Mundon and Nishanth Aravamudan. 2006. libhugetlbfs. http:\/\/libhugetlbfs.sourceforge.net. (2006)."},{"key":"e_1_2_1_25_1","unstructured":"Microsoft. 2013. Task Parallel Library. http:\/\/msdn.microsoft.com\/en-us\/library\/dd460717.aspx. (2013). NET version 4.5."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2009.22"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00224-001-1030-6"},{"key":"e_1_2_1_28_1","volume-title":"http:\/\/www.openmp.org\/mp-documents\/spec30.pdf. (May","author":"Architecture Review Board MP","year":"2008","unstructured":"OpenMP Architecture Review Board. 2008. OpenMP API. http:\/\/www.openmp.org\/mp-documents\/spec30.pdf. (May 2008). v 3.0."},{"key":"e_1_2_1_29_1","unstructured":"Perfmon2. 2012. libpfm. http:\/\/perfmon2.sourceforge.net\/. 
(2012)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15277-1_21"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2312005.2312018"},{"key":"e_1_2_1_32_1","unstructured":"Harsha Vardhan Simhadri. 2013. Program-Centric Cost Models for Locality and Parallelism. Ph.D. Dissertation. CMU. http:\/\/reports-archive.adm.cs.cmu.edu\/anon\/2013\/CMU-CS-13-124.pdf."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1583991.1584019"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2688500.2688514"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2010.06.012"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2938389","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2938389","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2938389","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:26:17Z","timestamp":1763457977000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2938389"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,6,28]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,6,28]]}},"alternative-id":["10.1145\/2938389"],"URL":"https:\/\/doi.org\/10.1145\/2938389","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2016,6,28]]},"assertion":[{"value":"2014-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-02-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-06-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}