{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T22:43:15Z","timestamp":1772491395456,"version":"3.50.1"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"7","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,3]]},"abstract":"<jats:p>\n            We consider the problem of automatic parallelism in high-performance, tensor-based systems. Our focus is on\n            <jats:italic toggle=\"yes\">intra-operator parallelism<\/jats:italic>\n            for inference tasks on a single GPU server or CPU cluster, where each operator is automatically broken up so that it runs on multiple devices. We assert that tensor-based systems should offer a programming abstraction based on an\n            <jats:italic toggle=\"yes\">extended Einstein summation notation<\/jats:italic>\n            , which is a fully declarative, mathematical specification for tensor computations. 
We show that any computation specified in the Einstein summation notation can be re-written into an equivalent\n            <jats:italic toggle=\"yes\">tensor-relational<\/jats:italic>\n            computation that facilitates intra-operator parallelism, and this re-write generalizes existing notions of tensor parallelism such as \"data parallel\" and \"model parallel.\" We consider the algorithmic problem of optimally computing a tensor-relational decomposition of a graph of operations specified in our extended Einstein summation notation.\n          <\/jats:p>","DOI":"10.14778\/3734839.3734858","type":"journal-article","created":{"date-parts":[[2025,8,29]],"date-time":"2025-08-29T16:01:06Z","timestamp":1756483266000},"page":"2240-2253","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["EinDecomp: Decomposition of Declaratively-Specified Machine Learning and Numerical Computations for Parallel Execution"],"prefix":"10.14778","volume":"18","author":[{"given":"Daniel","family":"Bourgeois","sequence":"first","affiliation":[{"name":"Rice University"}]},{"given":"Zhimin","family":"Ding","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Dimitrije","family":"Jankov","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Jiehui","family":"Li","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Mahmoud","family":"Sleem","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Yuxin","family":"Tang","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Jiawen","family":"Yao","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Xinyu","family":"Yao","sequence":"additional","affiliation":[{"name":"Rice University"}]},{"given":"Chris","family":"Jermaine","sequence":"additional","affiliation":[{"name":"Rice 
University"}]}],"member":"320","published-online":{"date-parts":[[2025,8,29]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16). 265\u2013283.","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16). 265\u2013283."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2020.07.001"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.395.0575"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00051"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589266"},{"key":"e_1_2_1_6_1","unstructured":"Matthias Boehm Iulian Antonov Sebastian Baunsgaard Mark Dokter Robert Ginth\u00f6r Kevin Innerebner Florijan Klezin Stefanie N. Lindstaedt Arnab Phani Benjamin Rath Berthold Reinwald Shafaq Siddiqui and Sebastian Benjamin Wrede. 2020. SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle. In CIDR."},{"key":"e_1_2_1_7_1","volume-title":"High performance code generation in MLIR: An early case study with GEMM. arXiv preprint arXiv:2003.00532","author":"Bondhugula Uday","year":"2020","unstructured":"Uday Bondhugula. 2020. High performance code generation in MLIR: An early case study with GEMM. arXiv preprint arXiv:2003.00532 (2020)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3132413"},{"key":"e_1_2_1_9_1","unstructured":"Tianqi Chen Thierry Moreau Ziheng Jiang Lianmin Zheng Eddie Yan Haichen Shen Meghan Cowan Leyuan Wang Yuwei Hu Luis Ceze et al. 2018. 
{TVM}: An automated end-to-end optimizing compiler for deep learning. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 578\u2013594."},{"key":"e_1_2_1_10_1","unstructured":"Nvidia Corporation. 2023. cuTENSOR: A CUDA Library for Tensor Algebra. https:\/\/docs.nvidia.com\/cuda\/cutensor\/latest\/index.html Accessed: 2024-10-01."},{"key":"e_1_2_1_11_1","volume-title":"DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines. In CIDR. https:\/\/www.cidrdb.org\/cidr2022\/papers\/p4-damme.pdf","author":"Patrick Damme","year":"2022","unstructured":"Patrick Damme et al. 2022. DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines. In CIDR. https:\/\/www.cidrdb.org\/cidr2022\/papers\/p4-damme.pdf"},{"key":"e_1_2_1_12_1","unstructured":"Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Marc'aurelio Ranzato Andrew Senior Paul Tucker Ke Yang et al. 2012. Large scale distributed deep networks. In Advances in neural information processing systems. 1223\u20131231."},{"key":"e_1_2_1_13_1","volume-title":"Christopher Jermaine, Yuxin Tang, Jiehui Li, Xinyu Yao, Sleem Mahmoud Abdelghafar, and Daniel Bourgeois.","author":"Ding Zhimin","year":"2024","unstructured":"Zhimin Ding, Jiawen Yao, Brianna Barrow, Tania Lorido Botran, Christopher Jermaine, Yuxin Tang, Jiehui Li, Xinyu Yao, Sleem Mahmoud Abdelghafar, and Daniel Bourgeois. 2024. TURNIP: A \"Nondeterministic\" GPU Runtime with CPU RAM Offload. arXiv preprint arXiv:2405.16283 (2024)."},{"key":"e_1_2_1_14_1","volume-title":"The gravitational equations and the problem of motion. Annals of mathematics","author":"Einstein Albert","year":"1938","unstructured":"Albert Einstein, Leopold Infeld, and Banesh Hoffmann. 1938. The gravitational equations and the problem of motion. 
Annals of mathematics (1938), 65\u2013100."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICAPP.1997.651531"},{"key":"e_1_2_1_16_1","first-page":"1","article-title":"Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity","volume":"23","author":"Fedus William","year":"2022","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1\u201339.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2211.02753"},{"key":"e_1_2_1_18_1","volume-title":"Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487","author":"Hadjis Stefan","year":"2016","unstructured":"Stefan Hadjis, Ce Zhang, Ioannis Mitliagkas, Dan Iter, and Christopher R\u00e9. 2016. Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487 (2016)."},{"key":"e_1_2_1_19_1","volume-title":"VLDB","volume":"94","author":"Hasan Waqar","year":"1994","unstructured":"Waqar Hasan and Rajeev Motwani. 1994. Optimization algorithms for exploiting the parallelism-communication tradeoff in pipelined parallelism. In VLDB, Vol. 94. Citeseer, 12\u201315."},{"key":"e_1_2_1_20_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems. 103\u2013112.","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems. 
103\u2013112."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359630"},{"key":"e_1_2_1_22_1","first-page":"1","article-title":"Beyond Data and Model Parallelism for Deep Neural Networks","volume":"1","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. Proceedings of Machine Learning and Systems 1 (2019), 1\u201313.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_23_1","volume-title":"Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training. arXiv preprint arXiv:2111.05972","author":"Karakus Can","year":"2021","unstructured":"Can Karakus, Rahul Huilgol, Fei Wu, Anirudh Subramanian, Cade Daniel, Derya Cavdar, Teng Xu, Haohan Chen, Arash Rahnama, and Luis Quintela. 2021. Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training. arXiv preprint arXiv:2111.05972 (2021)."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2661864"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2661842"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133901"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO51591.2021.9370308"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i04.5881"},{"key":"e_1_2_1_29_1","unstructured":"Shen Li Yanli Zhao Rohan Varma Omkar Salpekar Pieter Noordhuis Teng Li Adam Paszke Jeff Smith Brian Vaughan Pritam Damania et al. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. 
arXiv preprint arXiv:2006.15704 (2020)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2827988"},{"key":"e_1_2_1_31_1","first-page":"382","article-title":"Managing intra-operator parallelism in parallel database systems","volume":"95","author":"Mehta Manish","year":"1995","unstructured":"Manish Mehta and David J DeWitt. 1995. Managing intra-operator parallelism in parallel database systems. In VLDB, Vol. 95. 382\u2013394.","journal-title":"VLDB"},{"key":"e_1_2_1_32_1","volume-title":"Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878","author":"Miao Xupeng","year":"2022","unstructured":"Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878 (2022)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/2002938.2002940"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3025111.3025117"},{"key":"e_1_2_1_36_1","unstructured":"PyTorch. 2023. PyTorch 2.0. https:\/\/pytorch.org\/get-started\/pytorch-2.0"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553486"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_2_1_39_1","volume-title":"Tamara Norman, James Molloy, Jonathan Godwin","author":"Schaarschmidt Michael","year":"2021","unstructured":"Michael Schaarschmidt, Dominik Grewe, Dimitrios Vytiniotis, Adam Paszke, Georg Stefan Schmid, Tamara Norman, James Molloy, Jonathan Godwin, Norman Alexander Rink, and Vinod Nair. 2021. Automap: Towards Ergonomic Automated Parallelism for ML Models. 
arXiv preprint arXiv:2112.02958 (2021)."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588717"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","unstructured":"Maximilian E. Sch\u00fcle Tobias G\u00f6tz Alfons Kemper and Thomas Neumann. 2022. ArrayQL Integration into Code-Generating Database Systems. In EDBT. 10.5441\/002\/edbt.2022.04","DOI":"10.5441\/002\/edbt.2022.04"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTI.2015.13"},{"key":"e_1_2_1_43_1","volume-title":"Mesh-tensorflow: Deep learning for supercomputers. arXiv preprint arXiv:1811.02084","author":"Shazeer Noam","year":"2018","unstructured":"Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, and Cliff Young. 2018. Mesh-tensorflow: Deep learning for supercomputers. arXiv preprint arXiv:1811.02084 (2018)."},{"key":"e_1_2_1_44_1","volume-title":"International Conference on Machine Learning. PMLR, 31094\u201331116","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning. PMLR, 31094\u201331116."},{"key":"e_1_2_1_45_1","volume-title":"Jared Casper, and Bryan Catanzaro.","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick Le Gresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_2_1_46_1","volume-title":"European Conference on Parallel Processing. Springer, 90\u2013109","author":"Solomonik Edgar","year":"2011","unstructured":"Edgar Solomonik and James Demmel. 2011. 
Communication-optimal parallel 2.5 D matrix multiplication and LU factorization algorithms. In European Conference on Parallel Processing. Springer, 90\u2013109."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995441.1995446"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research)","author":"Tang Yuxin","unstructured":"Yuxin Tang, Zhimin Ding, Dimitrije Jankov, Binhang Yuan, Daniel Bourgeois, and Chris Jermaine. 2023. Auto-Differentiation of Relational Computations for Very Large Scale Machine Learning. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.), Vol. 202. PMLR, 33581\u201333598. https:\/\/proceedings.mlr.press\/v202\/tang23a.html"},{"key":"e_1_2_1_49_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_2_1_50_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Unger Colin","year":"2022","unstructured":"Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating {DNN} training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 
267\u2013284."},{"key":"e_1_2_1_51_1","volume-title":"Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730","author":"Vasilache Nicolas","year":"2018","unstructured":"Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018)."},{"key":"e_1_2_1_52_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 6000\u20136010."},{"key":"e_1_2_1_53_1","volume-title":"arXiv preprint arXiv:2105.04663","author":"Xu Yuanzhong","year":"2021","unstructured":"Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, and Marcello Maggioni. 2021. GSPMD: General and Scalable Parallelization for ML Computation Graphs. arXiv preprint arXiv:2105.04663 (2021)."},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","unstructured":"Binhang Yuan Dimitrije Jankov Jia Zou Yuxin Tang Daniel Bourgeois and Chris Jermaine. 2021. Tensor Relational Algebra for Distributed Machine Learning System Design. (2021).","DOI":"10.14778\/3457390.3457399"},{"key":"e_1_2_1_55_1","unstructured":"Xiru Zhang Michael Mckenna Jill P Mesirov and David L Waltz. 1990. An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in neural information processing systems. 
801\u2013809."},{"key":"e_1_2_1_56_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559\u2013578."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3734839.3734858","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,29]],"date-time":"2025-08-29T16:03:09Z","timestamp":1756483389000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3734839.3734858"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3]]},"references-count":56,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,3]]}},"alternative-id":["10.14778\/3734839.3734858"],"URL":"https:\/\/doi.org\/10.14778\/3734839.3734858","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,3]]},"assertion":[{"value":"2025-08-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}