{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,19]],"date-time":"2025-09-19T10:58:12Z","timestamp":1758279492004,"version":"3.41.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,6,20]],"date-time":"2023-06-20T00:00:00Z","timestamp":1687219200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"U.S. National Science Foundation Principles and Practice of Scalable Systems (PPoSS) program and by U.S. Department of Energy and Pacific Northwest National Laboratory","award":["532181"],"award-info":[{"award-number":["532181"]}]},{"DOI":"10.13039\/100000001","name":"U.S. National Science Foundation","doi-asserted-by":"crossref","award":["SHF-1910197, SHF-1943114, CCF-155151, and OAC-2204011"],"award-info":[{"award-number":["SHF-1910197, SHF-1943114, CCF-155151, and OAC-2204011"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>\n            Tensors are used by a wide variety of applications to represent multi-dimensional data; tensor decompositions are a class of methods for latent data analytics, data compression, and so on. Many of these applications generate large tensors with irregular dimension sizes and nonzero distribution. CANDECOMP\/PARAFAC decomposition (\n            <jats:sc>Cpd<\/jats:sc>\n            ) is a popular low-rank tensor decomposition for discovering latent features. The increasing overhead on memory and execution time of\n            <jats:sc>Cpd<\/jats:sc>\n            for large tensors requires distributed memory implementations as the only feasible solution. The sparsity and irregularity of tensors hinder the improvement of performance and scalability of distributed memory implementations. While previous works have been proved successful in\n            <jats:sc>Cpd<\/jats:sc>\n            for tensors with relatively regular dimension sizes and nonzero distribution, they either deliver unsatisfactory performance and scalability for irregular tensors or require significant time overhead in preprocessing. In this work, we focus on medium-grained tensor distribution to address their limitation for irregular tensors. We first thoroughly investigate through theoretical and experimental analysis. We disclose that the main cause of poor\n            <jats:sc>Cpd<\/jats:sc>\n            performance and scalability is the imbalance of multiple types of computations and communications and their tradeoffs; and sparsity and irregularity make it challenging to achieve their balances and tradeoffs. Irregularity of a sparse tensor is categorized based on two aspects: very different dimension sizes and a non-uniform nonzero distribution. Typically, focusing on optimizing one type of load imbalance causes other ones more severe for irregular tensors. To address such challenges, we propose irregularity-aware distributed\n            <jats:sc>Cpd<\/jats:sc>\n            that leverages the sparsity and irregularity information to identify the best tradeoff between different imbalances with low time overhead. We materialize the idea with two optimization methods: the prediction-based grid configuration and matrix-oriented distribution policy, where the former forms the global balance among computations and communications, and the latter further adjusts the balances among computations. The experimental results show that our proposed irregularity-aware distributed\n            <jats:sc>Cpd<\/jats:sc>\n            is more scalable and outperforms the medium- and fine-grained distributed implementations by up to 4.4 \u00d7 and 11.4 \u00d7 on 1,536 processors, respectively. Our optimizations support different sparse tensor formats, such as compressed sparse fiber (CSF), coordinate (COO), and Hierarchical Coordinate (HiCOO), and gain good scalability for all of them.\n          <\/jats:p>","DOI":"10.1145\/3580315","type":"journal-article","created":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T13:20:26Z","timestamp":1675776026000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Performance Implication of Tensor Irregularity and Optimization for Distributed Tensor Decomposition"],"prefix":"10.1145","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6084-2793","authenticated-orcid":false,"given":"Zheng","family":"Miao","sequence":"first","affiliation":[{"name":"Hangzhou Dianzi University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7191-4422","authenticated-orcid":false,"given":"Jon C.","family":"Calhoun","sequence":"additional","affiliation":[{"name":"Clemson University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2218-3675","authenticated-orcid":false,"given":"Rong","family":"Ge","sequence":"additional","affiliation":[{"name":"Clemson University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1270-4147","authenticated-orcid":false,"given":"Jiajia","family":"Li","sequence":"additional","affiliation":[{"name":"North Carolina State University, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,6,20]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Mart\u00edn Abadi et\u00a0al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015) arXiv preprint arXiv:1603.04467."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2841843"},{"key":"e_1_3_2_4_2","first-page":"29","volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing","author":"Alpatov Phillip","year":"1997","unstructured":"Phillip Alpatov, Greg Baker, H. Carter Edwards, John Gunnels, Greg Morrow, James Overfelt, and Robert van de Geijn. 1997. PLAPACK Parallel linear algebra package design overview. In Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE, 29\u201329."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2697055"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611976137.1"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2019.8916319"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225133"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1155\/2012\/713587"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3205289.3205315"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2020.3012624"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330355"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1016\/0010-4655(96)00017-3"},{"key":"e_1_3_2_14_2","first-page":"1296","volume-title":"Advances in Neural Information Processing Systems 27","author":"Choi Joon Hee","year":"2014","unstructured":"Joon Hee Choi and S. Vishwanathan. 2014. DFacTo: Distributed factorization of tensors. In Advances in Neural Information Processing Systems 27. Curran Associates, Inc., 1296\u20131304."},{"key":"e_1_3_2_15_2","article-title":"Era of big data processing: A new approach via tensor networks and tensor decompositions","volume":"1403","author":"Cichocki Andrzej","year":"2014","unstructured":"Andrzej Cichocki. 2014. Era of big data processing: A new approach via tensor networks and tensor decompositions. CoRR abs\/1403.2048 (2014).","journal-title":"CoRR"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447818.3461703"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/2623330.2623658"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2015.7113355"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/2339530.2339583"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225127"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807624"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1137\/16M1102744"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1137\/07070111X"},{"key":"e_1_3_2_24_2","unstructured":"Reservoir Labs. 2016. ENSIGN: Multi-Domain Analytics. (2016). Retrieved from https:\/\/reservoir-ensign.github.io\/usecases\/ENSIGN.html."},{"key":"e_1_3_2_25_2","unstructured":"Jiajia Li Yuchen Ma and Richard Vuduc. 2018. ParTI!: A Parallel Tensor Infrastructure for multicore CPUs and GPUs (Version 1.0.0). (Oct.Retrieved from: https:\/\/github.com\/hpcgarage\/ParTI."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00022"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2017.75"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356216"},{"key":"e_1_3_2_29_2","article-title":"Tensorizing neural networks","volume":"1509","author":"Novikov Alexander","year":"2015","unstructured":"Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. 2015. Tensorizing neural networks. CoRR abs\/1509.06569 (2015).","journal-title":"CoRR"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33460-3_39"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2015.29"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098014"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2017.10.013"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1137\/140993478"},{"key":"e_1_3_2_35_2","unstructured":"Shaden Smith Jee W. Choi Jiajia Li Richard Vuduc Jongsoo Park Xing Liu and George Karypis. 2017. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools. Retrieved from: http:\/\/frostt.io\/."},{"key":"e_1_3_2_36_2","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)","author":"Smith Shaden","year":"2016","unstructured":"Shaden Smith and George Karypis. 2016. A medium-grained algorithm for distributed sparse tensor factorization. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE."},{"key":"e_1_3_2_37_2","unstructured":"Shaden Smith Niranjay Ravindran Nicholas Sidiropoulos and George Karypis. 2016. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit (Version 1.1.1). Retrieved from: https:\/\/github.com\/ShadenSmith\/splatt."},{"key":"e_1_3_2_38_2","first-page":"90","volume-title":"Proceedings of the European Conference on Parallel Processing","author":"Solomonik Edgar","year":"2011","unstructured":"Edgar Solomonik and James Demmel. 2011. Communication-optimal parallel 2.5 D matrix multiplication and LU factorization algorithms. In Proceedings of the European Conference on Parallel Processing. Springer, 90\u2013109."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005051521"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF02289464"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1096-9128(199704)9:4<255::AID-CPE250>3.0.CO;2-2"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580315","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3580315","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:42Z","timestamp":1750178262000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580315"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,20]]},"references-count":40,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3580315"],"URL":"https:\/\/doi.org\/10.1145\/3580315","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2023,6,20]]},"assertion":[{"value":"2021-09-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-11","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}