{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,15]],"date-time":"2025-08-15T02:23:08Z","timestamp":1755224588619,"version":"3.43.0"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"8","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62325602 and U21B2037"],"award-info":[{"award-number":["62325602 and U21B2037"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100006407","name":"Natural Science Foundation of Henan Province","doi-asserted-by":"crossref","award":["232102210025 and 232300421093"],"award-info":[{"award-number":["232102210025 and 232300421093"]}],"id":[{"id":"10.13039\/501100006407","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,8,31]]},"abstract":"<jats:p>\n            In environments where vision-based depth estimation systems, such as those utilizing infrared or imaging technologies, encounter limitations\u2014particularly in low-light conditions\u2014alternative approaches become essential. Echo depth estimation emerges as a compelling solution by leveraging the time delay of echoes to map the geometric structure of the surrounding environment. This method offers distinct advantages in specific scenarios, providing reliable data for accurate scene understanding and 3D reconstruction. Traditional echo depth estimation techniques primarily depend on spatial information captured by the encoder and depth predictions made by the decoder. However, these methods often fail to fully exploit the rich depth features present at different simultaneous frequencies. 
To address this challenge, we propose an echo depth estimation method via Attention-based Hierarchical Multi-scale Feature Fusion Network (AHMF-Net). This network is designed to extract spatial depth information from echo spectrograms across multiple scales and hierarchical levels, while fusing the most relevant information using an attention mechanism. AHMF-Net introduces two key modules in hierarchical levels: the Intra-layer Multi-scale Attention Feature Fusion (IMAF) module, which functions as the encoder to capture multi-scale features across varying granularities, and the Inter-layer Multi-Scale Detail Feature Fusion (IMDF) module, which integrates features from all encoding layers into the decoder to enable effective inter-layer multi-scale fusion. Additionally, the encoder incorporates an attention mechanism that enhances depth-related features by capturing channel dependencies at multiple scales. We evaluated AHMF-Net on the Replica, Matterport3D, and BatVision datasets, where it consistently outperformed state-of-the-art models in echo-based depth estimation, demonstrating superior accuracy and robustness. 
The source code is publicly available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/wjzhang-ai\/AHMF-Net\">https:\/\/github.com\/wjzhang-ai\/AHMF-Net<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3736768","type":"journal-article","created":{"date-parts":[[2025,5,26]],"date-time":"2025-05-26T21:18:14Z","timestamp":1748294294000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Echo Depth Estimation via Attention-based Hierarchical Multi-scale Feature Fusion Network"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8527-9826","authenticated-orcid":false,"given":"Wenjie","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China, Research Center of Intelligent Swarm Systems, Ministry of Education of the People\u2019s Republic of China, Zhengzhou, China, and National Supercomputing Center in Zhengzhou, Zhengzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5448-6237","authenticated-orcid":false,"given":"Jun","family":"Yin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0629-7588","authenticated-orcid":false,"given":"Peng","family":"Yu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7336-6252","authenticated-orcid":false,"given":"Yibo","family":"Guo","sequence":"additional","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China, Research Center of Intelligent Swarm Systems, Ministry of Education 
of the People\u2019s Republic of China, Zhengzhou, China, and National Supercomputing Center in Zhengzhou, Zhengzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5770-0417","authenticated-orcid":false,"given":"Xiaoheng","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China, Research Center of Intelligent Swarm Systems, Ministry of Education of the People\u2019s Republic of China, Zhengzhou, China, and National Supercomputing Center in Zhengzhou, Zhengzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1079-2705","authenticated-orcid":false,"given":"Shaohui","family":"Jin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China, Research Center of Intelligent Swarm Systems, Ministry of Education of the People\u2019s Republic of China, Zhengzhou, China, and National Supercomputing Center in Zhengzhou, Zhengzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6885-3451","authenticated-orcid":false,"given":"Mingliang","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou, China, Engineering Research Center of Intelligent Swarm Systems, Ministry of Education of the People\u2019s Republic of China, Zhengzhou, China, and National Supercomputing Center in Zhengzhou, Zhengzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2025,8,12]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00379"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3588571"},{"key":"e_1_3_1_4_2","first-page":"3955","article-title":"FSNet: Redesign self-supervised monodepth for full-scale depth prediction for autonomous driving","author":"Liu Yuxuan","year":"2023","unstructured":"Yuxuan Liu, Zhenhua Xu, Huaiyang Huang, 
Lujia Wang, and Ming Liu. 2023. FSNet: Redesign self-supervised monodepth for full-scale depth prediction for autonomous driving. IEEE Transactions on Automation Science and Engineering 21, 3 (2023), 3955\u20133965.","journal-title":"IEEE Transactions on Automation Science and Engineering"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.22364\/bjmc.2019.7.2.07"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3622788"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2024.3372078"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2022.3203110"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/VR51125.2022.00101"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/VR58804.2024.00059"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3167307"},{"key":"e_1_3_1_12_2","first-page":"9131","article-title":"Self-supervised deep monocular depth estimation with ambiguity boosting","volume":"44","author":"Gonzalez Bello Juan Luis","year":"2021","unstructured":"Juan Luis Gonzalez Bello and Munchurl Kim. 2021. Self-supervised deep monocular depth estimation with ambiguity boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021), 9131\u20139149.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00343"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00817"},{"key":"e_1_3_1_15_2","first-page":"54987","article-title":"Dynamo-depth: Fixing unsupervised depth estimation for dynamical scenes","volume":"36","author":"Sun Yihong","year":"2024","unstructured":"Yihong Sun and Bharath Hariharan. 2024. Dynamo-depth: Fixing unsupervised depth estimation for dynamical scenes. 
Advances in Neural Information Processing Systems 36 (2024), 54987\u201355005.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_16_2","first-page":"53025","article-title":"Iebins: Iterative elastic bins for monocular depth estimation","volume":"36","author":"Shao Shuwei","year":"2024","unstructured":"Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, and Zhengguo Li. 2024. Iebins: Iterative elastic bins for monocular depth estimation. Advances in Neural Information Processing Systems 36 (2024), 53025\u201353037.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"9","key":"e_1_3_1_17_2","first-page":"5314","article-title":"On the synergies between machine learning and binocular stereo for depth estimation from images: A survey","volume":"44","author":"Poggi Matteo","year":"2022","unstructured":"Matteo Poggi, Fabio Tosi, Konstantinos Batsos, Philippos Mordohai, and Stefano Mattoccia. 2022. On the synergies between machine learning and binocular stereo for depth estimation from images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 9 (2022), 5314\u20135334.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3032602"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICEIC54506.2022.9748249"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3471870"},{"key":"e_1_3_1_21_2","first-page":"917","article-title":"Sparse pseudo-lidar depth assisted monocular depth estimation","volume":"1","author":"Shao Shuwei","year":"2023","unstructured":"Shuwei Shao, Zhongcai Pei, Weihai Chen, Qiang Liu, Haosong Yue, and Zhengguo Li. 2023. Sparse pseudo-lidar depth assisted monocular depth estimation. 
IEEE Transactions on Intelligent Vehicles 9, 1 (2023), 917\u2013929.","journal-title":"IEEE Transactions on Intelligent Vehicles"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00895"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2952095"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA40945.2020.9196934"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_37"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746476"},{"key":"e_1_3_1_27_2","first-page":"1","volume-title":"Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems","author":"Brunetto Amandine","year":"2023","unstructured":"Amandine Brunetto, Sascha Hornauer, Stella X. Yu, and Fabien Moutarde. 2023. The audio-visual batvision dataset for research on sight and sound. In Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems, 1\u20138."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2699184"},{"key":"e_1_3_1_29_2","first-page":"1","article-title":"Depth map prediction from a single image using a multi-scale deep network","volume":"27","author":"Eigen David","year":"2014","unstructured":"David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems 27 (2014), 1\u20139.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2019.00116"},{"key":"e_1_3_1_31_2","unstructured":"Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. 2019. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv:1907.10326. 
Retrieved from https:\/\/arxiv.org\/abs\/1907.10326"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_37"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3263870"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3049869"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2023.3251921"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01196"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00581"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01778"},{"key":"e_1_3_1_40_2","unstructured":"Alexey Dosovitskiy. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3355461"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-023-01873-z"},{"issue":"3","key":"e_1_3_1_43_2","doi-asserted-by":"crossref","first-page":"1176","DOI":"10.1109\/TIP.2011.2163164","article-title":"Depth video enhancement based on weighted mode filtering","volume":"21","author":"Min Dongbo","year":"2011","unstructured":"Dongbo Min, Jiangbo Lu, and Minh N. Do. 2011. Depth video enhancement based on weighted mode filtering. 
IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society 21, 3 (2011), 1176\u20131190.","journal-title":"IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3340225"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01155"},{"key":"e_1_3_1_46_2","first-page":"1","article-title":"ATS-UNet: Attentional 2-D time sequence UNet for global ionospheric one-day-ahead prediction","volume":"21","author":"Xue Kaiyu","year":"2024","unstructured":"Kaiyu Xue, Chuang Shi, and Cheng Wang. 2024. ATS-UNet: Attentional 2-D time sequence UNet for global ionospheric one-day-ahead prediction. IEEE Geoscience and Remote Sensing Letters 21 (2024), 1\u20135.","journal-title":"IEEE Geoscience and Remote Sensing Letters"},{"key":"e_1_3_1_47_2","unstructured":"Yaopeng Peng, Milan Sonka, and Danny Z. Chen. 2023. U-Net v2: Rethinking the skip connections of U-Net for medical image segmentation. arXiv:2311.17791. Retrieved from https:\/\/arxiv.org\/abs\/2311.17791"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2016.32"},{"key":"e_1_3_1_49_2","unstructured":"Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. 2019. The replica dataset: A digital replica of indoor spaces. arXiv:1906.05797. Retrieved from https:\/\/arxiv.org\/abs\/1906.05797"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2017.00081"},{"key":"e_1_3_1_51_2","first-page":"1","article-title":"Pytorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. 
Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019), 1\u201312.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00446"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2016.32"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_1"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3736768","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,12]],"date-time":"2025-08-12T20:36:50Z","timestamp":1755031010000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3736768"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,12]]},"references-count":54,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2025,8,31]]}},"alternative-id":["10.1145\/3736768"],"URL":"https:\/\/doi.org\/10.1145\/3736768","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,8,12]]},"assertion":[{"value":"2024-11-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-04","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}