{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,10,6]],"date-time":"2023-10-06T05:01:20Z","timestamp":1696568480984},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2006,12]]},"abstract":"<jats:p>Application performance on high-performance shared-memory systems is often limited by sharing patterns resulting in cache-coherence bottlenecks. Current approaches to identify coherence bottlenecks incur considerable run-time overhead and do not scale. We present two novel hardware-assisted coherence-analysis techniques that reduce trace sizes by two orders of magnitude over full traces. First, hardware performance monitoring is combined with capturing stores in software to provide a lossy-trace mechanism, which is an order of magnitude faster than software-instrumentation-based full-tracing and retains accuracy. 
Second, selected long-latency loads are instrumented via binary rewriting, which provides even higher accuracy and control over tracing, but requires additional overhead.<\/jats:p>","DOI":"10.1145\/1187976.1187978","type":"journal-article","created":{"date-parts":[[2007,1,16]],"date-time":"2007-01-16T19:38:29Z","timestamp":1168976309000},"page":"390-423","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Analysis of cache-coherence bottlenecks with hybrid hardware\/software techniques"],"prefix":"10.1145","volume":"3","author":[{"given":"Jaydeep","family":"Marathe","sequence":"first","affiliation":[{"name":"North Carolina State University, Raleigh, NC"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Frank","family":"Mueller","sequence":"additional","affiliation":[{"name":"North Carolina State University, Raleigh, NC"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bronis R.","family":"de Supinski","sequence":"additional","affiliation":[{"name":"Lawrence Livermore National Laboratory, Livermore, CA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2006,12]]},"reference":[{"key":"e_1_2_1_1_1","first-page":"3","article-title":"The NAS Parallel Benchmarks","volume":"5","author":"Bailey D. H.","year":"1991","unstructured":"Bailey , D. H. , Barszcz , E. , Barton , J. T. , Browning , D. S. , Carter , R. L. , Dagum , D. , Fatoohi , R. A. , Frederickson , P. O. , Lasinski , T. A. , Schreiber , R. S. , Simon , H. D. , Venkatakrishnan , V. , and Weeratunga , S. K. 1991 . The NAS Parallel Benchmarks . The International Journal of Supercomputer Applications 5 , 3 (Fall), 63--73. Bailey, D. H., Barszcz, E., Barton, J. T., Browning, D. S., Carter, R. L., Dagum, D., Fatoohi, R. A., Frederickson, P. O., Lasinski, T. A., Schreiber, R. S., Simon, H. D., Venkatakrishnan, V., and Weeratunga, S. K. 1991. 
The NAS Parallel Benchmarks. The International Journal of Supercomputer Applications 5, 3 (Fall), 63--73.","journal-title":"The International Journal of Supercomputer Applications"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the SIGMETRICS and PERFORMANCE '92 International Conference on Measurement and Modeling of Computer Systems. ACM Press","author":"Brewer E. A.","unstructured":"Brewer , E. A. , Dellarocas , C. N. , Colbrook , A. , and Weihl , W. E . 1992. Proteus: A high-performance parallel-architecture simulator . In Proceedings of the SIGMETRICS and PERFORMANCE '92 International Conference on Measurement and Modeling of Computer Systems. ACM Press , New York. 247--248. 10.1145\/133057.133146 Brewer, E. A., Dellarocas, C. N., Colbrook, A., and Weihl, W. E. 1992. Proteus: A high-performance parallel-architecture simulator. In Proceedings of the SIGMETRICS and PERFORMANCE '92 International Conference on Measurement and Modeling of Computer Systems. ACM Press, New York. 247--248. 10.1145\/133057.133146"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1177\/109434200001400404"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Buck B. R. and Hollingsworth J. K. 2000b. Using hardware performance monitors to isolate memory bottlenecks. In Supercomputing. ACM New York. 64--65.   Buck B. R. and Hollingsworth J. K. 2000b. Using hardware performance monitors to isolate memory bottlenecks. In Supercomputing. ACM New York. 64--65.","DOI":"10.1109\/SC.2000.10034"},{"key":"e_1_2_1_5_1","unstructured":"Buck B. R. and Hollingsworth J. K. 2004. Data centric cache measurement on the intel itanium 2 processor. In Supercomputing ACM New York. 10.1109\/SC.2004.21   Buck B. R. and Hollingsworth J. K. 2004. Data centric cache measurement on the intel itanium 2 processor. In Supercomputing ACM New York. 10.1109\/SC.2004.21"},{"key":"e_1_2_1_6_1","volume-title":"T. M.","author":"Burger D.","year":"1996","unstructured":"Burger , D. , Austin , T. 
M. , and Bennett, S. 1996 . Evaluating future microprocessors: The simplescalar tool set. Technical Report CS-TR-1996-1308, University of Wisconsin, Madison . Burger, D., Austin, T. M., and Bennett, S. 1996. Evaluating future microprocessors: The simplescalar tool set. Technical Report CS-TR-1996-1308, University of Wisconsin, Madison."},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the 1991 International Conference on Parallel Processing.","author":"Davis H.","unstructured":"Davis , H. , Goldschmidt , S. R. , and Hennessy , J . 1991. Multiprocessor simulation and tracing using tango . In Proceedings of the 1991 International Conference on Parallel Processing. Vol. II , Software. CRC Press, Boca Raton, FL. II--99--II--107. Davis, H., Goldschmidt, S. R., and Hennessy, J. 1991. Multiprocessor simulation and tracing using tango. In Proceedings of the 1991 International Conference on Parallel Processing. Vol. II, Software. CRC Press, Boca Raton, FL. II--99--II--107."},{"key":"e_1_2_1_8_1","volume-title":"SIGMA: A simulator infrastructure to guide memory analysis. In Supercomputing.","author":"DeRose L.","year":"2002","unstructured":"DeRose , L. , Ekanadham , K. , Hollingsworth , J. K. , and Sbaraglia , S . 2002 . SIGMA: A simulator infrastructure to guide memory analysis. In Supercomputing. DeRose, L., Ekanadham, K., Hollingsworth, J. K., and Sbaraglia, S. 2002. SIGMA: A simulator infrastructure to guide memory analysis. In Supercomputing."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.982915"},{"key":"e_1_2_1_11_1","volume-title":"Intel Itanium2 Processor Reference Manual for Software Development and Optimization","author":"Intel","unstructured":"Intel . 2004. Intel Itanium2 Processor Reference Manual for Software Development and Optimization . Vol. 1 . Intel . Intel. 2004. Intel Itanium2 Processor Reference Manual for Software Development and Optimization. Vol. 1. 
Intel."},{"key":"e_1_2_1_12_1","volume-title":"Intel Itanium2 Processor---Reference Manual","author":"Intel Corp. 2004.","unstructured":"Intel Corp. 2004. Intel Itanium2 Processor---Reference Manual . Intel Corp . Intel Corp. 2004. Intel Itanium2 Processor---Reference Manual. Intel Corp."},{"key":"e_1_2_1_13_1","volume-title":"ACM SIGPLAN Conference on Programming Language Design and Implementation. 196--204","author":"Krishnamurthy A.","year":"1995","unstructured":"Krishnamurthy , A. and Yelick , K . 1995. Optimizing parallel programs with explicit synchronization . In ACM SIGPLAN Conference on Programming Language Design and Implementation. 196--204 . 10.1145\/207110.207142 Krishnamurthy, A. and Yelick, K. 1995. Optimizing parallel programs with explicit synchronization. In ACM SIGPLAN Conference on Programming Language Design and Implementation. 196--204. 10.1145\/207110.207142"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.318580"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/244804.244806"},{"key":"e_1_2_1_16_1","unstructured":"LLNL. 2002. Asci purple codes. http:\/\/www.llnl.gov\/asci\/purple.  LLNL. 2002. Asci purple codes. http:\/\/www.llnl.gov\/asci\/purple."},{"key":"e_1_2_1_17_1","volume-title":"ACM SIGPLAN Conference on Programming Language Design and Implementation. 10","author":"Luk C.-K.","unstructured":"Luk , C.-K. , Cohn , R. , Muth , R. , Patil , H. , Klauser , A. , Lowney , G. , Wallace , S. , Reddi , V. , and Hazelwood , K . 2005. Pin: Building customized program analysis tools with dynamic instrumentation . In ACM SIGPLAN Conference on Programming Language Design and Implementation. 10 .1145\/1065010.1065034 Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V., and Hazelwood, K. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation. 
10.1145\/1065010.1065034"},{"key":"e_1_2_1_18_1","volume-title":"International Symposium on Code Generation and Optimization. 289--300","author":"Marathe J.","unstructured":"Marathe , J. , Mueller , F. , Mohan , T. , de Supinski , B. R. , McKee , S. A. , and Yoo , A . 2003. Metric: Tracking down inefficiencies in the memory hierarchy via binary rewriting . In International Symposium on Code Generation and Optimization. 289--300 . Marathe, J., Mueller, F., Mohan, T., de Supinski, B. R., McKee, S. A., and Yoo, A. 2003. Metric: Tracking down inefficiencies in the memory hierarchy via binary rewriting. In International Symposium on Code Generation and Optimization. 289--300."},{"key":"e_1_2_1_19_1","volume-title":"International Conference on Supercomputing. 287--297","author":"Marathe J.","unstructured":"Marathe , J. , Nagarajan , A. , and Mueller , F . 2004. Detailed cache coherence characterization for openmp benchmarks . In International Conference on Supercomputing. 287--297 . 10.1145\/1006209.1006250 Marathe, J., Nagarajan, A., and Mueller, F. 2004. Detailed cache coherence characterization for openmp benchmarks. In International Conference on Supercomputing. 287--297. 10.1145\/1006209.1006250"},{"key":"e_1_2_1_20_1","volume-title":"International Conference on Supercomputing. 10","author":"Marathe J.","year":"2005","unstructured":"Marathe , J. , Mueller , F. , and de Supinski , B. R. 2005 . A hybrid hardware\/software approach to efficiently determine cache coherence bottlenecks . In International Conference on Supercomputing. 10 .1145\/1088149.1088153 Marathe, J., Mueller, F., and de Supinski, B. R. 2005. A hybrid hardware\/software approach to efficiently determine cache coherence bottlenecks. In International Conference on Supercomputing. 10.1145\/1088149.1088153"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 1992 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. 
1--12","author":"Martonosi M.","unstructured":"Martonosi , M. , Gupta , A. , and Anderson , T . 1992. Memspy: Analyzing memory system bottlenecks in programs . In Proceedings of the 1992 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. 1--12 . 10.1145\/133057.133079 Martonosi, M., Gupta, A., and Anderson, T. 1992. Memspy: Analyzing memory system bottlenecks in programs. In Proceedings of the 1992 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. 1--12. 10.1145\/133057.133079"},{"key":"e_1_2_1_22_1","volume-title":"International Conference on Supercomputing. 154--165","author":"Mellor-Crummey J.","unstructured":"Mellor-Crummey , J. , Fowler , R. , and Whalley , D . 2001. Tools for application-oriented performance tuning . In International Conference on Supercomputing. 154--165 . 10.1145\/377792.377826 Mellor-Crummey, J., Fowler, R., and Whalley, D. 2001. Tools for application-oriented performance tuning. In International Conference on Supercomputing. 154--165. 10.1145\/377792.377826"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Mohan T. de Supinski B. R. McKee S. A. Mueller F. Yoo A. and Schulz M. 2003. Identifying and exploiting spatial regularity in data memory references. In Supercomputing. 10.1145\/1048935.1050199   Mohan T. de Supinski B. R. McKee S. A. Mueller F. Yoo A. and Schulz M. 2003. Identifying and exploiting spatial regularity in data memory references. In Supercomputing. 10.1145\/1048935.1050199","DOI":"10.1145\/1048935.1050199"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the of the 3rd Workshop on Runtime Verification (Boulder).","author":"Nethercote N.","unstructured":"Nethercote , N. and Seward , J . 2003. Valgrind: A program supervision framework . In Proceedings of the of the 3rd Workshop on Runtime Verification (Boulder). Nethercote, N. and Seward, J. 2003. Valgrind: A program supervision framework. 
In Proceedings of the of the 3rd Workshop on Runtime Verification (Boulder)."},{"key":"e_1_2_1_25_1","volume-title":"IEEE International Conference on Computer Design: VLSI in Computers and Processors. IEEE Computer Society","author":"Nguyen A.-T.","unstructured":"Nguyen , A.-T. , Michael , M. , Sharma , A. , and Torrellas , J . 1996. The augmint multiprocessor simulation toolkit: Implementation, experimentation and tracing facilities . In IEEE International Conference on Computer Design: VLSI in Computers and Processors. IEEE Computer Society , Washington, DC. 486--491. Nguyen, A.-T., Michael, M., Sharma, A., and Torrellas, J. 1996. The augmint multiprocessor simulation toolkit: Implementation, experimentation and tracing facilities. In IEEE International Conference on Computer Design: VLSI in Computers and Processors. IEEE Computer Society, Washington, DC. 486--491."},{"key":"e_1_2_1_26_1","unstructured":"Omni. 2003. C versions of nas-2.3 serial programs. http:\/\/phase.hpcc.jp\/Omni\/benchmarks\/NPB.  Omni. 2003. C versions of nas-2.3 serial programs. http:\/\/phase.hpcc.jp\/Omni\/benchmarks\/NPB."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2004.28"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/88.473612"},{"key":"e_1_2_1_29_1","volume-title":"EWOMP '99 (Lund). 32--39","author":"Sato M.","unstructured":"Sato , M. , Satoh , S. , Kusano , K. , and Tanaka , Y . 1999. Design of OpenMP compiler for an SMP cluster . In EWOMP '99 (Lund). 32--39 . Sato, M., Satoh, S., Kusano, K., and Tanaka, Y. 1999. Design of OpenMP compiler for an SMP cluster. In EWOMP '99 (Lund). 32--39."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/1239928.1239935"},{"key":"e_1_2_1_31_1","volume-title":"Proc. of the 37th Annual Simulation Symposium","author":"Tao J.","unstructured":"Tao , J. and Weidendorfer , J . 2004. Cache simulation based on runtime instrumentation for OpenMP applications . In Proc. 
of the 37th Annual Simulation Symposium , Arlington, VA. 97--103. Tao, J. and Weidendorfer, J. 2004. Cache simulation based on runtime instrumentation for OpenMP applications. In Proc. of the 37th Annual Simulation Symposium, Arlington, VA. 97--103."},{"key":"e_1_2_1_32_1","volume-title":"Proc. of the 36th Annual Simulation Symposium","author":"Tao J.","unstructured":"Tao , J. , Schulz , M. , and Karl , W . 2003. A simulation tool for evaluating shared memory performance . In Proc. of the 36th Annual Simulation Symposium , Orlando, FL. Tao, J., Schulz, M., and Karl, W. 2003. A simulation tool for evaluating shared memory performance. In Proc. of the 36th Annual Simulation Symposium, Orlando, FL."},{"key":"e_1_2_1_33_1","volume-title":"International Parallel and Distributed Processing Symposium.","author":"Thiffault C.","unstructured":"Thiffault , C. , Voss , M. , Healey , S. T. , and Kim , S. W . 2003. Dynamic instrumentation of large-scale mpi\/openmp applications . In International Parallel and Distributed Processing Symposium. Thiffault, C., Voss, M., Healey, S. T., and Kim, S. W. 2003. Dynamic instrumentation of large-scale mpi\/openmp applications. In International Parallel and Distributed Processing Symposium."},{"key":"e_1_2_1_34_1","volume-title":"SC '04: Proceedings of the 2004 ACM\/IEEE conference on Supercomputing. IEEE Computer Society","author":"Tikir M. M.","year":"2004","unstructured":"Tikir , M. M. and Hollingsworth , J. K . 2004. Using hardware counters to automatically improve memory performance . In SC '04: Proceedings of the 2004 ACM\/IEEE conference on Supercomputing. IEEE Computer Society , Washington, DC. 46. 10.1109\/SC. 2004 .64 Tikir, M. M. and Hollingsworth, J. K. 2004. Using hardware counters to automatically improve memory performance. In SC '04: Proceedings of the 2004 ACM\/IEEE conference on Supercomputing. IEEE Computer Society, Washington, DC. 46. 
10.1109\/SC.2004.64"},{"key":"e_1_2_1_35_1","volume-title":"ACM SIGPLAN Conference on Programming Language Design and Implementation. 30--44","author":"Wolf M. E.","unstructured":"Wolf , M. E. and Lam , M. S . 1991. A data locality optimizating algorithm . In ACM SIGPLAN Conference on Programming Language Design and Implementation. 30--44 . 10.1145\/113445.113449 Wolf, M. E. and Lam, M. S. 1991. A data locality optimizating algorithm. In ACM SIGPLAN Conference on Programming Language Design and Implementation. 30--44. 10.1145\/113445.113449"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1187976.1187978","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T17:55:23Z","timestamp":1672250123000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1187976.1187978"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2006,12]]},"references-count":34,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2006,12]]}},"alternative-id":["10.1145\/1187976.1187978"],"URL":"https:\/\/doi.org\/10.1145\/1187976.1187978","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2006,12]]},"assertion":[{"value":"2006-12-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}