{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,24]],"date-time":"2025-08-24T01:22:28Z","timestamp":1755998548345,"version":"3.41.0"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2019,7,16]],"date-time":"2019-07-16T00:00:00Z","timestamp":1563235200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100004682","name":"Oracle","doi-asserted-by":"publisher","award":["ERO project 1332"],"award-info":[{"award-number":["ERO project 1332"]}],"id":[{"id":"10.13039\/100004682","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Program. Lang. Syst."],"published-print":{"date-parts":[[2019,9,30]]},"abstract":"<jats:p>Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, leading to missed parallelization opportunities. In this article, we provide a better understanding of task granularity for task-parallel applications running on a single Java Virtual Machine in a shared-memory multicore. We present a new methodology to accurately and efficiently collect the granularity of each executed task, implemented in a novel profiler (available open-source) that collects carefully selected metrics from the whole system stack with low overhead, and helps developers locate performance and scalability problems. 
We analyze task granularity in the DaCapo, ScalaBench, and Spark Perf benchmark suites, revealing inefficiencies related to fine-grained and coarse-grained tasks in several applications. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in several applications, achieving speedups up to a factor of 5.90\u00d7. Our results highlight the importance of analyzing and optimizing task granularity on the Java Virtual Machine.<\/jats:p>","DOI":"10.1145\/3338497","type":"journal-article","created":{"date-parts":[[2019,7,16]],"date-time":"2019-07-16T12:39:01Z","timestamp":1563280741000},"page":"1-47","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Analysis and Optimization of Task Granularity on the Java Virtual Machine"],"prefix":"10.1145","volume":"41","author":[{"given":"Andrea","family":"Ros\u00e0","sequence":"first","affiliation":[{"name":"Universit\u00e0 della Svizzera italiana, Lugano, Switzerland"}]},{"given":"Eduardo","family":"Rosales","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera italiana, Lugano, Switzerland"}]},{"given":"Walter","family":"Binder","sequence":"additional","affiliation":[{"name":"Universit\u00e0 della Svizzera italiana, Lugano, Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2019,7,16]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","unstructured":"Umut A. Acar Arthur Chargu\u00e9raud and Mike Rainey. 2011. Oracle scheduling: Controlling granularity in implicitly parallel languages. In OOPSLA. 499--518. 
10.1145\/2048066.2048106","DOI":"10.1145\/2048066.2048106"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/7929"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/258915.258924"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCC.2014.32"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","unstructured":"Walter Binder Jarle Hulaas and Philippe Moret. 2007. Advanced Java bytecode instrumentation. In PPPJ. 135--144. 10.1145\/1294325.1294344","DOI":"10.1145\/1294325.1294344"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","unstructured":"Stephen M. Blackburn Robin Garner Chris Hoffmann Asjad M. Khang Kathryn S. McKinley Rotem Bentzur Amer Diwan Daniel Feinberg Daniel Frampton Samuel Z. Guyer Martin Hirzel Antony Hosking Maria Jump Han Lee J. Eliot B. Moss Aashish Phansalkar Darko Stefanovi\u0107 Thomas VanDrunen Daniel von Dincklage and Ben Wiedermann. 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA. 169--190. 10.1145\/1167473.1167488","DOI":"10.1145\/1167473.1167488"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1368088.1368119"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2010.232"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","unstructured":"Guojing Cong Sreedhar Kodali Sriram Krishnamoorthy Doug Lea Vijay Saraswat and Tong Wen. 2008. Solving large irregular graph problems using adaptive work-stealing. In ICPP. 536--545. 10.1109\/ICPP.2008.88","DOI":"10.1109\/ICPP.2008.88"},{"key":"e_1_2_1_10_1","unstructured":"Databricks. 2015. Spark Performance Tests. Retrieved from https:\/\/github.com\/databricks\/spark-perf."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","unstructured":"Florian David Gael Thomas Julia Lawall and Gilles Muller. 2014. Continuously measuring critical section pressure with the free-lunch profiler. In OOPSLA. 291--307. 
10.1145\/2660193.2660210","DOI":"10.1145\/2660193.2660210"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","unstructured":"Bruno Dufour Karel Driesen Laurie Hendren and Clark Verbrugge. 2003. Dynamic metrics for Java. In OOPSLA. 149--168. 10.1145\/949305.949320","DOI":"10.1145\/949305.949320"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","unstructured":"Alejandro Duran Julita Corbal\u00e1n and Eduard Ayguad\u00e9. 2008. An adaptive cut-off for task parallelism. In SC. 1--11.","DOI":"10.5555\/1413370.1413407"},{"key":"e_1_2_1_14_1","unstructured":"H2. 2018. H2 Database Engine. Retrieved from http:\/\/www.h2database.com."},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Kevin Hammond Hans-Wolfgang Loidl and Andrew S Partridge. 1995. Visualising granularity in parallel programs: A graphical winnowing system for Haskell. In HPFC. 208--221.","DOI":"10.1007\/978-1-4471-3573-9_8"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","unstructured":"Matthias Hauswirth Peter F. Sweeney Amer Diwan and Michael Hind. 2004. Vertical profiling: Understanding the behavior of object-oriented applications. In OOPSLA. 251--269. 10.1145\/1028976.1028998","DOI":"10.1145\/1028976.1028998"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1810479.1810509"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","unstructured":"Carl Hewitt Peter Bishop and Richard Steiger. 1973. A universal modular ACTOR formalism for artificial intelligence. In IJCAI. 235--245.","DOI":"10.5555\/1624775.1624804"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","unstructured":"Lorenz Huelsbergen James R. Larus and Alexander Aiken. 1994. Using the run-time sizes of data structures to guide parallel-thread creation. In LSP. 79--90. 10.1145\/182590.182442","DOI":"10.1145\/182590.182442"},{"key":"e_1_2_1_20_1","unstructured":"IBM. 2007. DayTrader. 
Retrieved from https:\/\/www.ibm.com\/support\/knowledgecenter\/en\/linuxonibm\/liaag\/wascrypt\/l0wscry00_daytrader.htm."},{"key":"e_1_2_1_21_1","unstructured":"ICL. 2017. PAPI. Retrieved from http:\/\/icl.utk.edu\/papi\/."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","unstructured":"Hiroshi Inoue and Toshio Nakatani. 2009. How a Java VM can get more from a hardware performance monitor. In OOPSLA. 137--154. 10.1145\/1640089.1640100","DOI":"10.1145\/1640089.1640100"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Shintaro Iwasaki and Kenjiro Taura. 2016. Autotuning of a cut-off for task parallel programs. In MCSoC. 353--360.","DOI":"10.1145\/2967938.2967968"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/133889"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","unstructured":"Stephen Kell Danilo Ansaloni Walter Binder and Luk\u00e1\u0161 Marek. 2012. The JVM is not observable enough (and what to do about it). In VMIL. 33--38. 10.1145\/2414740.2414747","DOI":"10.1145\/2414740.2414747"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Gregor Kiczales John Lamping Anurag Mendhekar Chris Maeda Cristina Lopes Jean-Marc Loingtier and John Irwin. 1997. Aspect-oriented programming. In ECOOP. 220--242.","DOI":"10.1007\/BFb0053381"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00128489"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","unstructured":"Vivek Kumar Daniel Frampton Stephen M. Blackburn David Grove and Olivier Tardieu. 2012. Work-stealing without the baggage. In OOPSLA. 297--314. 10.1145\/2384616.2384639","DOI":"10.1145\/2384616.2384639"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","unstructured":"Philipp Lengauer Verena Bitto Hanspeter M\u00f6ssenb\u00f6ck and Markus Weninger. 2017. A comprehensive Java benchmark study on memory and garbage collection behavior of DaCapo DaCapo Scala and SPECjvm2008. In ICPE. 3--14. 
10.1145\/3030207.3030211","DOI":"10.1145\/3030207.3030211"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.75"},{"key":"e_1_2_1_31_1","unstructured":"Linux man. 2013. top(1). Retrieved from https:\/\/linux.die.net\/man\/1\/top."},{"key":"e_1_2_1_32_1","unstructured":"Linux man. 2018. Documentation of CLOCK_MONOTONIC in clock_gettime(). Retrieved from https:\/\/linux.die.net\/man\/3\/clock_gettime."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1006\/jsco.1996.0038"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","unstructured":"Luk\u00e1\u0161 Marek Stephen Kell Yudi Zheng Lubom\u00edr Bulej Walter Binder Petr T\u016fma Danilo Ansaloni Aibek Sarimbekov and Andreas Sewe. 2013. ShadowVM: Robust and comprehensive dynamic program analysis for the Java platform. In GPCE. 105--114. 10.1145\/2517208.2517219","DOI":"10.1145\/2517208.2517219"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","unstructured":"Luk\u00e1\u0161 Marek Alex Villaz\u00f3n Yudi Zheng Danilo Ansaloni Walter Binder and Zhengwei Qi. 2012. DiSL: A domain-specific language for bytecode instrumentation. In AOSD. 239--250. 10.1145\/2162049.2162077","DOI":"10.1145\/2162049.2162077"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.86103"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1480945.1480967"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851156"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1806596.1806618"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","unstructured":"Albert Noll and Thomas Gross. 2013. Online feedback-directed optimizations for parallel Java code. In OOPSLA. 713--728. 10.1145\/2509136.2509518","DOI":"10.1145\/2509136.2509518"},{"key":"e_1_2_1_41_1","unstructured":"Oracle. 2017. Documentation of System.nanotime(). 
Retrieved from https:\/\/docs.oracle.com\/javase\/9\/docs\/api\/java\/lang\/System.html."},{"key":"e_1_2_1_42_1","unstructured":"Oracle. 2017. Java Native Interface. Retrieved from https:\/\/docs.oracle.com\/javase\/9\/docs\/specs\/jni\/index.html."},{"key":"e_1_2_1_43_1","unstructured":"Oracle. 2017. Java Platform Standard Edition 8 Java Development Kit Version 9 API Specification. Retrieved from https:\/\/docs.oracle.com\/javase\/9\/docs\/api\/."},{"key":"e_1_2_1_44_1","unstructured":"Oracle. 2017. Java Virtual Machine Tool Interface (JVM TI). Retrieved from https:\/\/docs.oracle.com\/javase\/9\/docs\/specs\/jvmti.html."},{"key":"e_1_2_1_45_1","unstructured":"Oracle. 2017. The Parallel Collector. Retrieved from https:\/\/docs.oracle.com\/javase\/9\/gctuning\/parallel-collector1.htm."},{"key":"e_1_2_1_46_1","unstructured":"Oracle. 2017. ExecutorService. Retrieved from https:\/\/docs.oracle.com\/javase\/9\/docs\/api\/java\/util\/concurrent\/ExecutorService.html."},{"key":"e_1_2_1_47_1","unstructured":"Oracle. 2017. ForkJoinPool. Retrieved from https:\/\/docs.oracle.com\/javase\/9\/docs\/api\/java\/util\/concurrent\/ForkJoinPool.html."},{"key":"e_1_2_1_48_1","unstructured":"Oracle. 2017. ThreadPoolExecutor. Retrieved from https:\/\/docs.oracle.com\/javase\/9\/docs\/api\/java\/util\/concurrent\/ThreadPoolExecutor.html."},{"key":"e_1_2_1_49_1","unstructured":"perf. 2015. Linux profiling with performance counters. Retrieved from https:\/\/perf.wiki.kernel.org."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvlc.2018.10.007"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2993236.2993241"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3136040.3136061"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","unstructured":"Andrea Ros\u00e0 Eduardo Rosales and Walter Binder. 2018. Analyzing and optimizing task granularity on the JVM. In CGO. 27--37. 
10.1145\/3168828","DOI":"10.1145\/3168828"},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","unstructured":"Eduardo Rosales Andrea Ros\u00e0 and Walter Binder. 2017. tgp: A task-granularity profiler for the Java virtual machine. In APSEC. 570--575.","DOI":"10.1109\/APSEC.2017.67"},{"key":"e_1_2_1_55_1","volume-title":"How-to: Tune Your Apache Spark Jobs (Part 1).","author":"Ryza Sandy","year":"2015","unstructured":"Sandy Ryza. 2015. How-to: Tune Your Apache Spark Jobs (Part 1). Retrieved from http:\/\/blog.cloudera.com\/blog\/2015\/03\/how-to-tune-your-apache-spark-jobs-part-1\/."},{"key":"e_1_2_1_56_1","volume-title":"How-to: Tune Your Apache Spark Jobs (Part 2).","author":"Ryza Sandy","year":"2015","unstructured":"Sandy Ryza. 2015. How-to: Tune Your Apache Spark Jobs (Part 2). Retrieved from http:\/\/blog.cloudera.com\/blog\/2015\/03\/how-to-tune-your-apache-spark-jobs-part-2\/."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.scico.2011.11.003"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2755573.2755603"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","unstructured":"Andreas Sewe Mira Mezini Aibek Sarimbekov and Walter Binder. 2011. Da Capo Con Scala: Design and analysis of a scala benchmark suite for the Java virtual machine. In OOPSLA. 657--676. 10.1145\/2048066.2048118","DOI":"10.1145\/2048066.2048118"},{"key":"e_1_2_1_60_1","unstructured":"The Apache Software Foundation. 2018. Apache Spark\u2014RDD Programming Guide. Retrieved from https:\/\/spark.apache.org\/docs\/latest\/rdd-programming-guide.html."},{"key":"e_1_2_1_61_1","unstructured":"The Apache Software Foundation. 2018. Apache Spark MLlib. Retrieved from https:\/\/spark.apache.org\/mllib\/."},{"key":"e_1_2_1_62_1","unstructured":"The Apache Software Foundation. 2018. Apache Tomcat. Retrieved from http:\/\/tomcat.apache.org."},{"key":"e_1_2_1_63_1","unstructured":"The Apache Software Foundation. 2018. Lucene. 
Retrieved from https:\/\/lucene.apache.org."},{"key":"e_1_2_1_64_1","unstructured":"The Apache Software Foundation. 2018. Spark Configuration. Retrieved from https:\/\/spark.apache.org\/docs\/latest\/configuration.html."},{"key":"e_1_2_1_65_1","unstructured":"The Apache Software Foundation. 2018. Spark Streaming. Retrieved from https:\/\/spark.apache.org\/streaming\/."},{"key":"e_1_2_1_66_1","unstructured":"The Apache Software Foundation. 2018. SparkContext API. Retrieved from https:\/\/spark.apache.org\/docs\/2.3.0\/api\/java\/org\/apache\/spark\/SparkContext.html."},{"key":"e_1_2_1_67_1","unstructured":"The Eclipse Foundation. 2016. Jetty. Retrieved from http:\/\/www.eclipse.org\/jetty\/."},{"key":"e_1_2_1_68_1","unstructured":"The Eclipse Foundation. 2018. Eclipse. Retrieved from https:\/\/www.eclipse.org."},{"key":"e_1_2_1_69_1","unstructured":"The Stanford Natural Language Processing Group. 2010. Stanford Topic Modeling Toolbox. Retrieved from https:\/\/nlp.stanford.edu\/software\/tmt\/tmt-0.4\/."},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","unstructured":"Peter Thoman Herbert Jordan and Thomas Fahringer. 2013. Adaptive granularity control in task parallel programs using multiversioning. In Euro-Par. 164--177. 10.1007\/978-3-642-40047-6_19","DOI":"10.1007\/978-3-642-40047-6_19"},{"key":"e_1_2_1_71_1","unstructured":"TPC. 2010. TPC-C. Retrieved from http:\/\/www.tpc.org\/tpcc\/."},{"key":"e_1_2_1_72_1","doi-asserted-by":"crossref","unstructured":"Alex Villaz\u00f3n Haiyang Sun Andrea Ros\u00e0 Eduardo Rosales Daniele Bonetta Isabella Defilippis Sergio Oporto and Walter Binder. 2019. Automated large-scale multi-language dynamic program analysis in the wild. In ECOOP. 1--26.","DOI":"10.1145\/3359061.3362777"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","unstructured":"Adarsh Yoga and Santosh Nagarakatte. 2017. A fast causal profiler for task parallel programs. In ESEC\/FSE. 15--26. 
10.1145\/3106237.3106254","DOI":"10.1145\/3106237.3106254"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","unstructured":"Matei Zaharia Mosharaf Chowdhury Tathagata Das Ankur Dave Justin Ma Murphy McCauley Michael J. Franklin Scott Shenker and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI. 1--14.","DOI":"10.5555\/2228298.2228301"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","unstructured":"Jisheng Zhao Jun Shirako V. Krishna Nandivada and Vivek Sarkar. 2010. Reducing task creation and termination overhead in explicitly parallel programs. In PACT. 169--180. 10.1145\/1854273.1854298","DOI":"10.1145\/1854273.1854298"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","unstructured":"Yudi Zheng Lubom\u00edr Bulej and Walter Binder. 2015. Accurate profiling in the presence of dynamic compilation. In OOPSLA. 433--450. 10.1145\/2814270.2814281","DOI":"10.1145\/2814270.2814281"},{"key":"e_1_2_1_77_1","doi-asserted-by":"crossref","unstructured":"Yudi Zheng Andrea Ros\u00e0 Luca Salucci Yao Li Haiyang Sun Omar Javed Lubom\u00edr Bulej Lydia Y. Chen Zhengwei Qi and Walter Binder. 2016. AutoBench: Finding workloads that you need using pluggable hybrid analyses. In SANER. 
639--643.","DOI":"10.1109\/SANER.2016.70"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/335231.335261"}],"container-title":["ACM Transactions on Programming Languages and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3338497","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3338497","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:12:48Z","timestamp":1750201968000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3338497"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7,16]]},"references-count":78,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2019,9,30]]}},"alternative-id":["10.1145\/3338497"],"URL":"https:\/\/doi.org\/10.1145\/3338497","relation":{},"ISSN":["0164-0925","1558-4593"],"issn-type":[{"type":"print","value":"0164-0925"},{"type":"electronic","value":"1558-4593"}],"subject":[],"published":{"date-parts":[[2019,7,16]]},"assertion":[{"value":"2018-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}