{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,4]],"date-time":"2026-06-04T12:26:04Z","timestamp":1780575964059,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":27,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,8,29]],"date-time":"2022-08-29T00:00:00Z","timestamp":1661731200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,8,29]]},"DOI":"10.1145\/3545008.3545087","type":"proceedings-article","created":{"date-parts":[[2023,1,15]],"date-time":"2023-01-15T01:04:08Z","timestamp":1673744648000},"page":"1-11","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":33,"title":["Tesseract: Parallelize the Tensor Parallelism Efficiently"],"prefix":"10.1145","author":[{"given":"Boxiang","family":"Wang","sequence":"first","affiliation":[{"name":"HPC-AI Technology Inc., Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qifan","family":"Xu","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, United States of America"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhengda","family":"Bian","sequence":"additional","affiliation":[{"name":"HPC-AI Technology Inc., China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yang","family":"You","sequence":"additional","affiliation":[{"name":"School of Computing, National University of Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,1,13]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Takuya Akiba Shuji Suzuki and Keisuke Fukuda. 2017. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325(2017)."},{"key":"e_1_3_2_1_2_1","volume-title":"Advances in Neural Information Processing Systems, H.\u00a0Larochelle, M.\u00a0Ranzato, R.\u00a0Hadsell, M.F. Balcan, and H.\u00a0Lin (Eds.). Vol.\u00a033. Curran Associates","author":"Brown Tom","year":"1877","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared\u00a0D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H.\u00a0Larochelle, M.\u00a0Ranzato, R.\u00a0Hadsell, M.F. Balcan, and H.\u00a0Lin (Eds.). Vol.\u00a033. Curran Associates, Inc., 1877\u20131901. https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"},{"key":"e_1_3_2_1_3_1","unstructured":"Lynn\u00a0Elliot Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.\u00a0D. Dissertation. USA. AAI7010025."},{"key":"e_1_3_2_1_4_1","unstructured":"Tianqi Chen Bing Xu Chiyuan Zhang and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arxiv:1604.06174\u00a0[cs.LG]"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_3_2_1_6_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_3_2_1_7_1","unstructured":"Priya Goyal Piotr Doll\u00e1r Ross Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677(2017)."},{"key":"e_1_3_2_1_8_1","volume-title":"Introduction to Parallel Computing","author":"Grama Ananth","unstructured":"Ananth Grama, George Karypis, Vipin Kumar, and Anshul Gupta. 2003. Introduction to Parallel Computing(second ed.). Addison-Wesley."},{"key":"e_1_3_2_1_9_1","volume-title":"Advances in Neural Information Processing Systems, H.\u00a0Wallach, H.\u00a0Larochelle, A.\u00a0Beygelzimer, F.\u00a0d'Alch\u00e9-Buc, E.\u00a0Fox, and R.\u00a0Garnett (Eds.). Vol.\u00a032. Curran Associates","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc\u00a0V Le, Yonghui Wu, and zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems, H.\u00a0Wallach, H.\u00a0Larochelle, A.\u00a0Beygelzimer, F.\u00a0d'Alch\u00e9-Buc, E.\u00a0Fox, and R.\u00a0Garnett (Eds.). Vol.\u00a032. Curran Associates, Inc.https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/093f65e080a295f8076b1c5722a46aa2-Paper.pdf"},{"key":"e_1_3_2_1_10_1","unstructured":"Xianyan Jia Shutao Song Wei He Yangzihao Wang Haidong Rong Feihu Zhou Liqiang Xie Zhenyu Guo Yuanzhou Yang Liwei Yu Tiegang Chen Guangxiao Hu Shaohuai Shi and Xiaowen Chu. 2018. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. CoRR abs\/1807.11205(2018). arXiv:1807.11205http:\/\/arxiv.org\/abs\/1807.11205"},{"key":"e_1_3_2_1_11_1","unstructured":"Sameer Kumar Victor Bitorff Dehao Chen Chiachen Chou Blake Hechtman HyoukJoong Lee Naveen Kumar Peter Mattson Shibo Wang Tao Wang 2019. Scale mlperf-0.6 models on google tpu-v3 pods. arXiv preprint arXiv:1909.09756(2019)."},{"key":"e_1_3_2_1_12_1","volume-title":"Mixed Precision Training. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=r1gs9JgRZ","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=r1gs9JgRZ"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_14_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018)."},{"key":"e_1_3_2_1_15_1","volume-title":"Language models are unsupervised multitask learners. OpenAI blog 1, 8","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_18_1","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arxiv:1909.08053\u00a0[cs.CL]"},{"key":"e_1_3_2_1_19_1","volume-title":"Euro-Par 2011 Parallel Processing","author":"Solomonik Edgar","unstructured":"Edgar Solomonik and James Demmel. 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In Euro-Par 2011 Parallel Processing, Emmanuel Jeannot, Raymond Namyst, and Jean Roman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 90\u2013109."},{"key":"e_1_3_2_1_20_1","volume-title":"van\u00a0de Geijn and Jerrell Watts","author":"A.","year":"1995","unstructured":"Robert\u00a0A. van\u00a0de Geijn and Jerrell Watts. 1995. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Technical Report. USA."},{"key":"e_1_3_2_1_21_1","volume-title":"Advances in Neural Information Processing Systems, I.\u00a0Guyon, U.\u00a0Von Luxburg, S.\u00a0Bengio, H.\u00a0Wallach, R.\u00a0Fergus, S.\u00a0Vishwanathan, and R.\u00a0Garnett (Eds.). Vol.\u00a030. Curran Associates","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141\u00a0ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I.\u00a0Guyon, U.\u00a0Von Luxburg, S.\u00a0Bengio, H.\u00a0Wallach, R.\u00a0Fergus, S.\u00a0Vishwanathan, and R.\u00a0Garnett (Eds.). Vol.\u00a030. Curran Associates, Inc.https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_2_1_22_1","unstructured":"Qifan Xu Shenggui Li Chaoyu Gong and Yang You. 2021. An Efficient 2D Method for Training Super-Large Deep Learning Models. arxiv:2104.05343\u00a0[cs.LG]"},{"key":"e_1_3_2_1_23_1","unstructured":"Masafumi Yamazaki Akihiko Kasagi Akihiro Tabuchi Takumi Honda Masahiro Miwa Naoto Fukumoto Tsuguchika Tabaru Atsushi Ike and Kohta Nakashima. 2019. Yet another accelerated sgd: Resnet-50 training on imagenet in 74.7 seconds. arXiv preprint arXiv:1903.12650(2019)."},{"key":"e_1_3_2_1_24_1","unstructured":"Chris Ying Sameer Kumar Dehao Chen Tao Wang and Youlong Cheng. 2018. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992(2018)."},{"key":"e_1_3_2_1_25_1","unstructured":"Yang You Igor Gitman and Boris Ginsburg. 2017. Large Batch Training of Convolutional Networks. arxiv:1708.03888\u00a0[cs.CV]"},{"key":"e_1_3_2_1_26_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Syx4wnEtvH","author":"You Yang","year":"2020","unstructured":"Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Syx4wnEtvH"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225069"}],"event":{"name":"ICPP '22: 51st International Conference on Parallel Processing","location":"Bordeaux France","acronym":"ICPP '22"},"container-title":["Proceedings of the 51st International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3545008.3545087","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3545008.3545087","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:44Z","timestamp":1750186964000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3545008.3545087"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,29]]},"references-count":27,"alternative-id":["10.1145\/3545008.3545087","10.1145\/3545008"],"URL":"https:\/\/doi.org\/10.1145\/3545008.3545087","relation":{},"subject":[],"published":{"date-parts":[[2022,8,29]]},"assertion":[{"value":"2023-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}