{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T15:24:51Z","timestamp":1771514691407,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":33,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"R&D Program of the Shaanxi Province of China","award":["No.2022GY-064"],"award-info":[{"award-number":["No.2022GY-064"]}]},{"name":"National Natural Science Foundation of China","award":["No.61971005"],"award-info":[{"award-number":["No.61971005"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3549201","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"6868-6874","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":54,"title":["Can Language Understand Depth?"],"prefix":"10.1145","author":[{"given":"Renrui","family":"Zhang","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"given":"Ziyao","family":"Zeng","sequence":"additional","affiliation":[{"name":"ShanghaiTech University, Shanghai, China"}]},{"given":"Ziyu","family":"Guo","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]},{"given":"Yafeng","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer, Baoji University of Arts and Science, Baoji, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_2_1","unstructured":"Wenjie Chang Yueyi Zhang and Zhiwei Xiong. 2021. Transformer-based Monocular Depth Estimation with Attention Supervision. (2021).  Wenjie Chang Yueyi Zhang and Zhiwei Xiong. 2021. Transformer-based Monocular Depth Estimation with Attention Supervision. (2021)."},{"key":"e_1_3_2_2_3_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly etal 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).  Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_2_4_1","volume-title":"Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems","author":"Eigen David","year":"2014","unstructured":"David Eigen , Christian Puhrsch , and Rob Fergus . 2014. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems , Vol. 27 ( 2014 ). David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems , Vol. 
27 (2014)."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00214"},{"key":"e_1_3_2_2_6_1","volume-title":"Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544","author":"Gao Peng","year":"2021","unstructured":"Peng Gao , Shijie Geng , Renrui Zhang , Teli Ma , Rongyao Fang , Yongfeng Zhang , Hongsheng Li , and Yu Qiao . 2021 . Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021). Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 2021. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021)."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1177\/0278364913491297"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.699"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_10_1","volume-title":"International Conference on Machine Learning. PMLR, 4904--4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia , Yinfei Yang , Ye Xia , Yi-Ting Chen , Zarana Parekh , Hieu Pham , Quoc Le , Yun-Hsuan Sung , Zhen Li , and Tom Duerig . 2021 . Scaling up visual and vision-language representation learning with noisy text supervision . In International Conference on Machine Learning. PMLR, 4904--4916 . Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904--4916."},{"key":"e_1_3_2_2_11_1","volume-title":"PLNet: Plane and Line Priors for Unsupervised Indoor Depth Estimation. In 2021 International Conference on 3D Vision (3DV). IEEE, 741--750","author":"Jiang Hualie","year":"2021","unstructured":"Hualie Jiang , Laiyan Ding , Junjie Hu , and Rui Huang . 2021 . PLNet: Plane and Line Priors for Unsupervised Indoor Depth Estimation. In 2021 International Conference on 3D Vision (3DV). IEEE, 741--750 . Hualie Jiang, Laiyan Ding, Junjie Hu, and Rui Huang. 2021. PLNet: Plane and Line Priors for Unsupervised Indoor Depth Estimation. In 2021 International Conference on 3D Vision (3DV). IEEE, 741--750."},{"key":"e_1_3_2_2_12_1","volume-title":"Dong Wook Ko, and Il Hong Suh","author":"Lee Jin Han","year":"2019","unstructured":"Jin Han Lee , Myung-Kyu Han , Dong Wook Ko, and Il Hong Suh . 2019 . From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326 (2019). Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. 2019. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326 (2019)."},{"key":"e_1_3_2_2_13_1","volume-title":"DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation. arXiv preprint arXiv:2203.14211","author":"Li Zhenyu","year":"2022","unstructured":"Zhenyu Li , Zehui Chen , Xianming Liu , and Junjun Jiang . 2022. DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation. arXiv preprint arXiv:2203.14211 ( 2022 ). Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. 2022. DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation. 
arXiv preprint arXiv:2203.14211 (2022)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00506"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00594"},{"key":"e_1_3_2_2_17_1","volume-title":"Advances in Neural Information Processing Systems","volume":"34","author":"Mao Mingyuan","year":"2021","unstructured":"Mingyuan Mao , Renrui Zhang , Honghui Zheng , Teli Ma , Yan Peng , Errui Ding , Baochang Zhang , Shumin Han , 2021 . Dual-stream network for visual recognition . Advances in Neural Information Processing Systems , Vol. 34 (2021). Mingyuan Mao, Renrui Zhang, Honghui Zheng, Teli Ma, Yan Peng, Errui Ding, Baochang Zhang, Shumin Han, et al. 2021. Dual-stream network for visual recognition. Advances in Neural Information Processing Systems, Vol. 34 (2021)."},{"key":"e_1_3_2_2_18_1","volume-title":"End-to-end Learning for Joint Depth and Image Reconstruction from Diffracted Rotation. arXiv preprint arXiv:2204.07076","author":"Mel Mazen","year":"2022","unstructured":"Mazen Mel , Muhammad Siddiqui , and Pietro Zanuttigh . 2022. End-to-end Learning for Joint Depth and Image Reconstruction from Diffracted Rotation. arXiv preprint arXiv:2204.07076 ( 2022 ). Mazen Mel, Muhammad Siddiqui, and Pietro Zanuttigh. 2022. End-to-end Learning for Joint Depth and Image Reconstruction from Diffracted Rotation. arXiv preprint arXiv:2204.07076 (2022)."},{"key":"e_1_3_2_2_19_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , 2021 . Learning transferable visual models from natural language supervision . In International Conference on Machine Learning. PMLR, 8748--8763 . Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_3_2_2_20_1","volume-title":"DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. arXiv preprint arXiv:2112.01518","author":"Rao Yongming","year":"2021","unstructured":"Yongming Rao , Wenliang Zhao , Guangyi Chen , Yansong Tang , Zheng Zhu , Guan Huang , Jie Zhou , and Jiwen Lu. 2021. DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. arXiv preprint arXiv:2112.01518 ( 2021 ). Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. 2021. DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. arXiv preprint arXiv:2112.01518 (2021)."},{"key":"e_1_3_2_2_21_1","first-page":"1571","article-title":"Make3D: Depth Perception from a Single Still Image","volume":"3","author":"Saxena Ashutosh","year":"2008","unstructured":"Ashutosh Saxena , Min Sun , and Andrew Y Ng . 2008 . Make3D: Depth Perception from a Single Still Image .. In Aaai , Vol. 3. 1571 -- 1576 . Ashutosh Saxena, Min Sun, and Andrew Y Ng. 2008. Make3D: Depth Perception from a Single Still Image.. In Aaai, Vol. 3. 
1571--1576.","journal-title":"Aaai"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3049869"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00107"},{"key":"e_1_3_2_2_25_1","volume-title":"Inferring point clouds from single monocular images by depth intermediation. arXiv preprint arXiv:1812.01402","author":"Zeng Wei","year":"2018","unstructured":"Wei Zeng , Sezer Karaoglu , and Theo Gevers . 2018. Inferring point clouds from single monocular images by depth intermediation. arXiv preprint arXiv:1812.01402 ( 2018 ). Wei Zeng, Sezer Karaoglu, and Theo Gevers. 2018. Inferring point clouds from single monocular images by depth intermediation. arXiv preprint arXiv:1812.01402 (2018)."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.10.107"},{"key":"e_1_3_2_2_27_1","volume-title":"Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. arXiv preprint arXiv:2111.03930","author":"Zhang Renrui","year":"2021","unstructured":"Renrui Zhang , Rongyao Fang , Peng Gao , Wei Zhang , Kunchang Li , Jifeng Dai , Yu Qiao , and Hongsheng Li. 2021a. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. arXiv preprint arXiv:2111.03930 ( 2021 ). Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. 2021a. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. arXiv preprint arXiv:2111.03930 (2021)."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00836"},{"key":"e_1_3_2_2_29_1","volume-title":"MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection. arXiv preprint arXiv:2203.13310","author":"Zhang Renrui","year":"2022","unstructured":"Renrui Zhang , Han Qiu , Tai Wang , Xuanzhuo Xu , Ziyu Guo , Yu Qiao , Peng Gao , and Hongsheng Li. 2022b. MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection. arXiv preprint arXiv:2203.13310 ( 2022 ). Renrui Zhang, Han Qiu, Tai Wang, Xuanzhuo Xu, Ziyu Guo, Yu Qiao, Peng Gao, and Hongsheng Li. 2022b. MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection. arXiv preprint arXiv:2203.13310 (2022)."},{"key":"e_1_3_2_2_30_1","volume-title":"VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts. arXiv preprint arXiv:2112.02399","author":"Zhang Renrui","year":"2021","unstructured":"Renrui Zhang , Longtian Qiu , Wei Zhang , and Ziyao Zeng . 2021b. VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts. arXiv preprint arXiv:2112.02399 ( 2021 ). Renrui Zhang, Longtian Qiu, Wei Zhang, and Ziyao Zeng. 2021b. VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts. arXiv preprint arXiv:2112.02399 (2021)."},{"key":"e_1_3_2_2_31_1","volume-title":"Chen Change Loy, and Bo Dai","author":"Zhou Chong","year":"2021","unstructured":"Chong Zhou , Chen Change Loy, and Bo Dai . 2021 a. DenseCLIP: Extract Free Dense Labels from CLIP. arXiv preprint arXiv:2112.01071 (2021). Chong Zhou, Chen Change Loy, and Bo Dai. 2021a. DenseCLIP: Extract Free Dense Labels from CLIP. arXiv preprint arXiv:2112.01071 (2021)."},{"key":"e_1_3_2_2_32_1","volume-title":"Chen Change Loy, and Ziwei Liu","author":"Zhou Kaiyang","year":"2021","unstructured":"Kaiyang Zhou , Jingkang Yang , Chen Change Loy, and Ziwei Liu . 2021 b. Learning to prompt for vision-language models. 
arXiv preprint arXiv:2109.01134 (2021). Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2021b. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134 (2021)."},{"key":"e_1_3_2_2_33_1","volume-title":"Detecting Twenty-thousand Classes using Image-level Supervision. arXiv preprint arXiv:2201.02605","author":"Zhou Xingyi","year":"2022","unstructured":"Xingyi Zhou , Rohit Girdhar , Armand Joulin , Phillip Kr\u00e4henb\u00fchl , and Ishan Misra . 2022. Detecting Twenty-thousand Classes using Image-level Supervision. arXiv preprint arXiv:2201.02605 ( 2022 ). Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Kr\u00e4henb\u00fchl, and Ishan Misra. 2022. Detecting Twenty-thousand Classes using Image-level Supervision. arXiv preprint arXiv:2201.02605 (2022)."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3549201","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3549201","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:18Z","timestamp":1750182558000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3549201"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":33,"alternative-id":["10.1145\/3503161.3549201","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3549201","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
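
The record above is the Crossref REST API "work" message for DOI 10.1145/3503161.3549201. A minimal sketch of fetching and reading such a record follows, assuming the public https://api.crossref.org/works/{DOI} endpoint and the third-party Python requests package; the field names used (title, author, page, references-count) all appear in the record above, and the snippet is illustrative, not part of the record itself.

import requests

DOI = "10.1145/3503161.3549201"

# Fetch the Crossref work record for this DOI; the JSON above is the "message" payload.
resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]

# Read a few fields that are present in the record.
title = work["title"][0]                        # "Can Language Understand Depth?"
authors = ", ".join(f'{a["given"]} {a["family"]}' for a in work["author"])
print(title)
print(authors)                                  # Renrui Zhang, Ziyao Zeng, Ziyu Guo, Yafeng Li
print(work["page"], work["references-count"])   # 6868-6874 33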