{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T17:56:17Z","timestamp":1757613377166,"version":"3.44.0"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,5]]},"abstract":"<jats:p>Cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage are widely used to store raw data for machine learning applications. When the data is later processed, the analysis predominantly focuses on regions of interest (such as a small bounding box in a larger image) and discards uninteresting regions. Machine learning applications can significantly accelerate their I\/O if they push this data filtering step to the cloud. Prior work has proposed different methods to partially read array (tensor) objects, such as chunking, reading a contiguous byte range, and evaluating a lambda function. No method is optimal; estimating the total time and cost of a data retrieval requires an understanding of the data serialization order, the chunk size and platform-specific properties. This paper introduces ArrayMorph, a cloud-based array data storage system that automatically determines which is the best method to use to retrieve regions of interest from data on the cloud. ArrayMorph formulates data accesses as hyperslab queries, and optimizes them using a multi-phase cost-based approach. ArrayMorph seamlessly integrates with Python\/PyTorch-based ML applications, and is experimentally shown to transfer up to 9.8X less data than existing systems. 
This makes ML applications run up to 1.7X faster and 9X cheaper than prior solutions.<\/jats:p>","DOI":"10.14778\/3746405.3746437","type":"journal-article","created":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T17:06:20Z","timestamp":1756919180000},"page":"3189-3202","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["ArrayMorph: Optimizing Hyperslab Queries on the Cloud for Machine Learning Pipelines"],"prefix":"10.14778","volume":"18","author":[{"given":"Ruochen","family":"Jiang","sequence":"first","affiliation":[{"name":"The Ohio State University, Columbus, OH, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Spyros","family":"Blanas","sequence":"additional","affiliation":[{"name":"The Ohio State University, Columbus, OH, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,3]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation","author":"Agarwal Sharad","year":"2010","unstructured":"Sharad Agarwal, John Dunagan, Navendu Jain, Stefan Saroiu, Alec Wolman, and Harbinder Bhogan. 2010. Volley: automated data placement for geo-distributed cloud services. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (San Jose, California) (NSDI'10). USENIX Association, USA, 2."},{"key":"e_1_2_1_2_1","volume-title":"Article 219 (Nov.","author":"Alam Md Mahbub","year":"2022","unstructured":"Md Mahbub Alam, Luis Torgo, and Albert Bifet. 2022. A Survey on Spatiotemporal Data Analytics Systems. ACM Comput. Surv. 54, 10s, Article 219 (Nov. 2022), 38 pages."},{"key":"e_1_2_1_3_1","unstructured":"Argonne Leadership Computing Facility. 2025. Deep Learning I\/O (DLIO) Benchmark. https:\/\/github.com\/argonne-lcf\/dlio_benchmark\/tree\/main. 
Accessed: 2025-06-21."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742797"},{"key":"e_1_2_1_5_1","first-page":"3","article-title":"Characterizing I\/O in Machine Learning with MLPerf Storage","volume":"51","author":"Balmau Oana","year":"2022","unstructured":"Oana Balmau. 2022. Characterizing I\/O in Machine Learning with MLPerf Storage. SIGMOD Rec. 51, 3 (Nov. 2022), 47\u201348.","journal-title":"SIGMOD Rec."},{"key":"e_1_2_1_6_1","volume-title":"Proc. ACM Manag. Data 2, 1, Article 55 (March","author":"Bang Tiemo","year":"2024","unstructured":"Tiemo Bang, Chris Douglas, Natacha Crooks, and Joseph M. Hellerstein. 2024. SkyPIE: A Fast & Accurate Oracle for Object Placement. Proc. ACM Manag. Data 2, 1, Article 55 (March 2024), 27 pages."},{"key":"e_1_2_1_7_1","first-page":"12","article-title":"A Two's Complement Parallel Array Multiplication Algorithm","volume":"22","author":"Baugh Charles R.","year":"1973","unstructured":"Charles R. Baugh and Bruce A. Wooley. 1973. A Two's Complement Parallel Array Multiplication Algorithm. IEEE Trans. Comput. 22, 12 (Dec. 1973), 1045\u20131047.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807271"},{"key":"e_1_2_1_9_1","unstructured":"Microsoft Corporation. 2024. Image Detection Demo with Pytorch-Wildlife. https:\/\/cameratraps.readthedocs.io\/en\/latest\/demo\/image_detection_demo.html#3.-JSON-Format Accessed: 2025-06-21."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903741"},{"key":"e_1_2_1_11_1","volume-title":"DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications. 81\u201391","author":"Devarajan H.","year":"2021","unstructured":"H. Devarajan, H. Zheng, A. Kougkas, X.-H. Sun, and V. Vishwanath. 2021. DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications. 
81\u201391 (2021)."},{"key":"e_1_2_1_12_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018). https:\/\/arxiv.org\/abs\/1810.04805 Accessed: 2025-06-21."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2723709"},{"key":"e_1_2_1_14_1","unstructured":"OpenStack Foundation. 2024. OpenStack: Open Source Cloud Computing Platform. https:\/\/www.openstack.org\/ Accessed: 2025-06-21."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588195.3592995"},{"key":"e_1_2_1_16_1","doi-asserted-by":"crossref","first-page":"e58145","DOI":"10.7554\/eLife.58145","article-title":"anTraX, a software package for high-throughput video tracking of color-tagged insects","volume":"9","author":"Gal Asaf","year":"2020","unstructured":"Asaf Gal, Jonathan Saragosti, and Daniel JC Kronauer. 2020. anTraX, a software package for high-throughput video tracking of color-tagged insects. Elife 9 (2020), e58145.","journal-title":"Elife"},{"key":"e_1_2_1_17_1","unstructured":"HDF Group. 2023. Highly Scalable Data Service. https:\/\/github.com\/HDFGroup\/hsds Accessed: 2025-06-21."},{"key":"e_1_2_1_18_1","volume-title":"International Conference on Computational Science. Springer, 481\u2013492","author":"Grzesik Piotr","year":"2022","unstructured":"Piotr Grzesik and Dariusz Mrozek. 2022. Accelerating edge metagenomic analysis with serverless-based cloud offloading. In International Conference on Computational Science. Springer, 481\u2013492."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742795"},{"key":"e_1_2_1_20_1","volume-title":"Deep Residual Learning for Image Recognition. 
In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770\u2013778","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770\u2013778."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1093\/mnras\/stx1544"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3267809.3267827"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3459240"},{"key":"e_1_2_1_24_1","unstructured":"Ruochen Jiang. 2024. ArrayMorph. https:\/\/github.com\/ruochenj123\/ArrayMorph. Accessed: 2025-06-21."},{"key":"e_1_2_1_25_1","volume-title":"An overview of MODIS Land data processing and product status. Remote sensing of Environment 83, 1\u20132","author":"Justice CO","year":"2002","unstructured":"CO Justice, JRG Townshend, EF Vermote, E Masuoka, RE Wolfe, Nazmi Saleous, DP Roy, and JT Morisette. 2002. An overview of MODIS Land data processing and product status. Remote sensing of Environment 83, 1\u20132 (2002), 3\u201315."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137664"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the ACM International Conference on Supercomputing","author":"Kang Donghe","year":"2019","unstructured":"Donghe Kang, Vedang Patel, Ashwati Nair, Spyros Blanas, Yang Wang, and Srinivasan Parthasarathy. 2019. Henosis: workload-driven small array consolidation and placement for HDF5 applications on heterogeneous data stores. In Proceedings of the ACM International Conference on Supercomputing (Phoenix, Arizona) (ICS '19). Association for Computing Machinery, New York, NY, USA, 392\u2013402."},{"key":"e_1_2_1_28_1","volume-title":"Serverless Data Analytics with Flint. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). 
451\u2013455","author":"Kim Youngbin","year":"2018","unstructured":"Youngbin Kim and Jimmy Lin. 2018. Serverless Data Analytics with Flint. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). 451\u2013455."},{"key":"e_1_2_1_29_1","volume-title":"Jean Luca Bez, and Suren Byna","author":"Lewis Noah","year":"2025","unstructured":"Noah Lewis, Jean Luca Bez, and Suren Byna. 2025. I\/O in Machine Learning Applications on HPC Systems: A 360\u2013degree Survey. ACM Comput. Surv. 57, 10, Article 256 (May 2025), 41 pages."},{"key":"e_1_2_1_30_1","volume-title":"The Living Image of Animals","author":"LILA","year":"2023","unstructured":"LILA: The Living Image of Animals. 2023. SWG Camera Traps 2018\u20132022 Dataset. https:\/\/lila.science\/datasets\/swg-camera-traps. Accessed: 2025-06-21."},{"key":"e_1_2_1_31_1","unstructured":"Microsoft. 2023. Microsoft MegaDetector. https:\/\/github.com\/microsoft\/CameraTraps. Accessed: 2025-06-21."},{"key":"e_1_2_1_32_1","unstructured":"MinIO Inc. 2025. MinIO Object Storage. https:\/\/github.com\/minio\/minio. Accessed: 2025-06-21."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389758"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447786.3456239"},{"key":"e_1_2_1_35_1","unstructured":"NVIDIA. 2023. BERT for PyTorch. https:\/\/github.com\/NVIDIA\/DeepLearningExamples\/tree\/master\/PyTorch\/LanguageModeling\/BERT. Accessed: 2025-06-21."},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the 10th ACM International Systems and Storage Conference","author":"Oh Kwangsung","year":"2017","unstructured":"Kwangsung Oh, Abhishek Chandra, and Jon Weissman. 2017. TripS: automated multi-tiered data placement in a geo-distributed cloud environment. In Proceedings of the 10th ACM International Systems and Storage Conference (Haifa, Israel) (SYSTOR '17). 
Association for Computing Machinery, New York, NY, USA, Article 12, 11 pages."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1317331.1317337"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/3025111.3025117"},{"key":"e_1_2_1_39_1","volume-title":"John D'Uva, Mikhail Kislin, Dan H Sanes, Sarah D Kocher, Samuel S-H, Annegret L Falkner, Joshua W Shaevitz, and Mala Murthy.","author":"Pereira Talmo D","year":"2022","unstructured":"Talmo D Pereira, Nathaniel Tabris, Arie Matsliah, David M Turner, Junyu Li, Shruthi Ravindranath, Eleni S Papadoyannis, Edna Normand, David S Deutsch, Z. Yan Wang, Grace C McKenzie-Smith, Catalin C Mitelut, Marielisa Diez Castro, John D'Uva, Mikhail Kislin, Dan H Sanes, Sarah D Kocher, Samuel S-H, Annegret L Falkner, Joshua W Shaevitz, and Mala Murthy. 2022. SLEAP: A deep learning system for multi-animal pose tracking. Nature Methods 19, 4 (2022)."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380609"},{"key":"e_1_2_1_41_1","volume-title":"Camera Traps: A simulator and edge device application for classifying images. https:\/\/github.com\/tapis-project\/camera-traps Accessed: 2025-06-21.","author":"Project Tapis","year":"2022","unstructured":"Tapis Project. 2022. Camera Traps: A simulator and edge device application for classifying images. https:\/\/github.com\/tapis-project\/camera-traps Accessed: 2025-06-21."},{"key":"e_1_2_1_42_1","doi-asserted-by":"crossref","DOI":"10.1561\/9781638281498","volume-title":"Multidimensional Array Data Management. Foundations and Trends\u00ae in Databases 12, 2\u20133","author":"Rusu Florin","year":"2023","unstructured":"Florin Rusu. 2023. Multidimensional Array Data Management. Foundations and Trends\u00ae in Databases 12, 2\u20133 (2023), 69\u2013220."},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the Tenth International Conference on Data Engineering. 
IEEE Computer Society, USA, 328\u2013336","author":"Sarawagi Sunita","year":"1994","unstructured":"Sunita Sarawagi and Michael Stonebraker. 1994. Efficient Organization of Large Multidimensional Arrays. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, USA, 328\u2013336."},{"key":"e_1_2_1_44_1","volume-title":"Presto: SQL on Everything. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 1802\u20131813","author":"Sethi Raghav","year":"2019","unstructured":"Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, and Christopher Berner. 2019. Presto: SQL on Everything. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 1802\u20131813."},{"key":"e_1_2_1_45_1","volume-title":"2017 International Conference on Big Data, IoT and Data Science (BID). 49\u201353","author":"Singh Sachchidanand","year":"2017","unstructured":"Sachchidanand Singh. 2017. Optimize cloud computations using edge computing. In 2017 International Conference on Big Data, IoT and Data Science (BID). 49\u201353."},{"key":"e_1_2_1_46_1","unstructured":"William C Skamarock Joseph B Klemp Jimy Dudhia David O Gill Dale M Barker Michael G Duda Xiang-Yu Huang Wei Wang Jordan G Powers et al. 2008. A description of the advanced research WRF version 3. NCAR technical note 475 125 (2008) 10\u20135065."},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data","author":"Soroush Emad","year":"2011","unstructured":"Emad Soroush, Magdalena Balazinska, and Daniel Wang. 2011. ArrayStore: a storage manager for complex parallel array processing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (Athens, Greece) (SIGMOD '11). 
Association for Computing Machinery, New York, NY, USA, 253\u2013264."},{"key":"e_1_2_1_48_1","volume-title":"Chan Hee Song, and David Edward Carlyn","author":"Stevens Samuel","year":"2024","unstructured":"Samuel Stevens, Jiaman Wu, Matthew J. Thompson, Elizabeth G. Campolongo, Chan Hee Song, and David Edward Carlyn. 2024. BioCLIP. Accessed: 2025-06-21."},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19412\u201319424","author":"Stevens Samuel","year":"2024","unstructured":"Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G Campolongo, Chan Hee Song, David Edward Carlyn, Li Dong, Wasila M Dahdul, Charles Stewart, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su. 2024. BioCLIP: A Vision Foundation Model for the Tree of Life. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19412\u201319424."},{"key":"e_1_2_1_50_1","volume-title":"Tapis: An API Platform for Reproducible, Distributed Computational Research. In Advances in Information and Communication","author":"Stubbs Joe","year":"2021","unstructured":"Joe Stubbs, Richard Cardone, Mike Packard, Anagha Jamthe, Smruti Padhy, Steve Terry, Julia Looney, Joseph Meiring, Steve Black, Maytal Dahan, Sean Cleveland, and Gwen Jacobs. 2021. Tapis: An API Platform for Reproducible, Distributed Computational Research. In Advances in Information and Communication, Kohei Arai (Ed.). Springer International Publishing, Cham, 878\u2013900."},{"key":"e_1_2_1_51_1","unstructured":"The HDF Group. 2021. HSDS New Features: AWS Lambda and Direct Access. https:\/\/www.hdfgroup.org\/wp-content\/uploads\/2021\/10\/HSDS-New-Features-AWS-Lambda-and-Direct-Access.pdf. Accessed: 2025-06-21."},{"key":"e_1_2_1_52_1","unstructured":"The HDF Group. 2024. HDF5 Documentation: Dataspace Transfer Property List-Selection Conversion Mode. 
https:\/\/support.hdfgroup.org\/documentation\/hdf5\/latest\/_h5_s__u_g.html#subsubsec_dataspace_transfer_select Accessed: 2025-06-21."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687609"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-022-27980-y"},{"key":"e_1_2_1_55_1","volume-title":"Proc. VLDB Endow. 14","author":"Yang Yifei","year":"2021","unstructured":"Yifei Yang, Matt Youill, Matthew Woicik, Yizhou Liu, Xiangyao Yu, Marco Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2021. FlexPushdownDB: hybrid pushdown and caching in a cloud DBMS. Proc. VLDB Endow. 14, 11 (July 2021), 2101\u20132113."},{"key":"e_1_2_1_56_1","volume-title":"2020 IEEE 36th International Conference on Data Engineering (ICDE). 1802\u20131805","author":"Yu Xiangyao","year":"2020","unstructured":"Xiangyao Yu, Matt Youill, Matthew Woicik, Abdurrahman Ghanem, Marco Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2020. PushdownDB: Accelerating a DBMS Using S3 Computation. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 1802\u20131805."},{"key":"e_1_2_1_57_1","volume-title":"Proc. VLDB Endow. 11","author":"Rodriges Zalipynis Ramon Antonio","year":"2018","unstructured":"Ramon Antonio Rodriges Zalipynis. 2018. ChronosDB: distributed, file based, geospatial array DBMS. Proc. VLDB Endow. 11, 10 (June 2018), 1247\u20131261."},{"key":"e_1_2_1_58_1","volume-title":"Annual Conference on Innovative Data Systems Research (CIDR'22)","author":"Zhang Qizhen","year":"2022","unstructured":"Qizhen Zhang, Philip A Bernstein, Daniel S Berger, Badrish Chandramouli, Vincent Liu, and Boon Thau Loo. 2022. Compucache: Remote computable caching using spot vms. 
In Annual Conference on Innovative Data Systems Research (CIDR'22)."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915247"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3746405.3746437","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T19:52:44Z","timestamp":1757015564000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3746405.3746437"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5]]},"references-count":59,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,5]]}},"alternative-id":["10.14778\/3746405.3746437"],"URL":"https:\/\/doi.org\/10.14778\/3746405.3746437","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2025,5]]},"assertion":[{"value":"2025-09-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}