{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T09:09:52Z","timestamp":1769159392276,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":62,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547978","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:01Z","timestamp":1665416581000},"page":"6347-6358","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":26,"title":["Less is More"],"prefix":"10.1145","author":[{"given":"Yiran","family":"Wang","sequence":"first","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]},{"given":"Zhiyu","family":"Pan","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]},{"given":"Xingyi","family":"Li","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]},{"given":"Zhiguo","family":"Cao","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]},{"given":"Ke","family":"Xian","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]},{"given":"Jianming","family":"Zhang","sequence":"additional","affiliation":[{"name":"Adobe Research, San Jose, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"e_1_3_2_2_2_1","volume-title":"BEiT: BERT Pre- Training of Image Transformers. In International Conference on Learning Representations.","author":"Bao Hangbo","year":"2022","unstructured":"Hangbo Bao , Li Dong , Songhao Piao , and Furu Wei . 2022 . BEiT: BERT Pre- Training of Image Transformers. In International Conference on Learning Representations. Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2022. BEiT: BERT Pre- Training of Image Transformers. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_3_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4009--4018","author":"Bhat Shariq Farooq","year":"2021","unstructured":"Shariq Farooq Bhat , Ibraheem Alhashim , and PeterWonka. 2021 . Adabins: Depth estimation using adaptive bins . In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4009--4018 . Shariq Farooq Bhat, Ibraheem Alhashim, and PeterWonka. 2021. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4009--4018."},{"key":"e_1_3_2_2_4_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. In Advances in neural information processing systems Vol. 33. 1877--1901.  Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. In Advances in neural information processing systems Vol. 33. 1877--1901."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475564"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2017.2740321"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_8_1","volume-title":"International conference on machine learning. PMLR, 1691--1703","author":"Chen Mark","year":"2020","unstructured":"Mark Chen , Alec Radford , Rewon Child , Jeffrey Wu , Heewoo Jun , David Luan , and Ilya Sutskever . 2020 . Generative pretraining from pixels . In International conference on machine learning. PMLR, 1691--1703 . Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels. In International conference on machine learning. PMLR, 1691--1703."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01270-0_7"},{"key":"e_1_3_2_2_10_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,NAACL-HLT 2019","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . Bert: Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,NAACL-HLT 2019 , Minneapolis, MN, USA, June 2--7 , 2019, Volume 1 (Long and Short Papers). 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,NAACL-HLT 2019, Minneapolis, MN, USA, June 2--7, 2019, Volume 1 (Long and Short Papers). 4171--4186."},{"key":"e_1_3_2_2_11_1","volume-title":"International Conference on Learning Representations.","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , 2020 . An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale . In International Conference on Learning Representations. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_12_1","volume-title":"Advances in neural information processing systems","author":"Eigen David","unstructured":"David Eigen , Christian Puhrsch , and Rob Fergus . 2014. Depth map prediction from a single image using a multi-scale deep network . In Advances in neural information processing systems , Vol. 27 . 2366--2374. David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, Vol. 27. 2366--2374."},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00214"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1177\/0278364913491297"},{"key":"e_1_3_2_2_16_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV). 3828--3838","author":"Godard C.","unstructured":"C. Godard , O. Aodha , M. Firman , and G. Brostow . 2019. Digging into selfsupervised monocular depth estimation . In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV). 3828--3838 . C. Godard, O. Aodha, M. Firman, and G. Brostow. 2019. Digging into selfsupervised monocular depth estimation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV). 3828--3838."},{"key":"e_1_3_2_2_17_1","volume-title":"Advances in neural information processing systems","author":"Goodfellow Ian","unstructured":"Ian Goodfellow , Jean Pouget-Abadie , Mehdi Mirza , Bing Xu , David Warde-Farley , Sherjil Ozair , Aaron Courville , and Yoshua Bengio . 2014. Generative adversarial nets . In Advances in neural information processing systems , Vol. 27 . Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, Vol. 27."},{"key":"e_1_3_2_2_18_1","volume-title":"Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377","author":"He Kaiming","year":"2021","unstructured":"Kaiming He , Xinlei Chen , Saining Xie , Yanghao Li , Piotr Doll\u00e1r , and Ross Girshick . 2021. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 ( 2021 ). Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll\u00e1r, and Ross Girshick. 2021. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_20_1","unstructured":"Geoffrey Hinton Oriol Vinyals Jeff Dean etal 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2 7 (2015).  Geoffrey Hinton Oriol Vinyals Jeff Dean et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2 7 (2015)."},{"key":"e_1_3_2_2_21_1","volume-title":"Long short-term memory. Neural computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber . 1997. Long short-term memory. Neural computation 9, 8 ( 1997 ), 1735--1780. Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.179"},{"key":"e_1_3_2_2_23_1","volume-title":"Depth transfer: Depth extraction from video using non-parametric sampling","author":"Karsch Kevin","year":"2014","unstructured":"Kevin Karsch , Ce Liu , and Sing Bing Kang . 2014. Depth transfer: Depth extraction from video using non-parametric sampling . IEEE transactions on pattern analysis and machine intelligence 36, 11 ( 2014 ), 2144--2158. Kevin Karsch, Ce Liu, and Sing Bing Kang. 2014. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE transactions on pattern analysis and machine intelligence 36, 11 (2014), 2144--2158."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00166"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2016.32"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_3_2_2_27_1","volume-title":"Dong Wook Ko, and Il Hong Suh","author":"Lee Jin Han","year":"2019","unstructured":"Jin Han Lee , Myung-Kyu Han , Dong Wook Ko, and Il Hong Suh . 2019 . From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326 (2019). Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. 2019. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326 (2019)."},{"key":"e_1_3_2_2_28_1","volume-title":"Asian Conference on Computer Vision (ACCV). 663--678","author":"Li Ruibo","year":"2018","unstructured":"Ruibo Li , Ke Xian , Chunhua Shen , Zhiguo Cao , Hao Lu , and Lingxiao Hang . 2018 . Deep attention-based classification network for robust depth prediction . In Asian Conference on Computer Vision (ACCV). 663--678 . Ruibo Li, Ke Xian, Chunhua Shen, Zhiguo Cao, Hao Lu, and Lingxiao Hang. 2018. Deep attention-based classification network for robust depth prediction. In Asian Conference on Computer Vision (ACCV). 663--678."},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.549"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00271"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3001940"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_2_34_1","volume-title":"Convtransformer: A convolutional transformer network for video frame synthesis. arXiv preprint arXiv:2011.10185","author":"Liu Zhouyong","year":"2020","unstructured":"Zhouyong Liu , Shun Luo , Wubin Li , Jingben Lu , Yufan Wu , Shilei Sun , Chunguo Li , and Luxi Yang . 2020 . Convtransformer: A convolutional transformer network for video frame synthesis. arXiv preprint arXiv:2011.10185 (2020). Zhouyong Liu, Shun Luo, Wubin Li, Jingben Lu, Yufan Wu, Shilei Sun, Chunguo Li, and Luxi Yang. 2020. Convtransformer: A convolutional transformer network for video frame synthesis. arXiv preprint arXiv:2011.10185 (2020)."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392377"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00594"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2020.3017478"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01580"},{"key":"e_1_3_2_2_39_1","volume-title":"Improving language understanding by generative pre-training. OpenAI blog","author":"Radford Alec","year":"2018","unstructured":"Alec Radford , Karthik Narasimhan , Tim Salimans , and Ilya Sutskever . 2018. Improving language understanding by generative pre-training. OpenAI blog ( 2018 ). Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI blog (2018)."},{"key":"e_1_3_2_2_40_1","unstructured":"Alec Radford JeffreyWu Rewon Child David Luan Dario Amodei Ilya Sutskever etal 2019. Language models are unsupervised multitask learners. OpenAI blog (2019).  Alec Radford JeffreyWu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog (2019)."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01196"},{"key":"e_1_3_2_2_42_1","volume-title":"Towards robust monocular depth estimation: Mixing datasets for zeroshot cross-dataset transfer","author":"Ranftl Ren\u00e9","year":"2020","unstructured":"Ren\u00e9 Ranftl , Katrin Lasinger , David Hafner , Konrad Schindler , and Vladlen Koltun . 2020. Towards robust monocular depth estimation: Mixing datasets for zeroshot cross-dataset transfer . IEEE transactions on pattern analysis and machine intelligence 44, 03 ( 2020 ), 1623--1637. Ren\u00e9 Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. 2020. Towards robust monocular depth estimation: Mixing datasets for zeroshot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence 44, 03 (2020), 1623--1637."},{"key":"e_1_3_2_2_43_1","volume-title":"Structure-from-Motion Revisited. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4104--4113","author":"Sch\u00f6nberger Johannes Lutz","year":"2016","unstructured":"Johannes Lutz Sch\u00f6nberger and Jan-Michael Frahm . 2016 . Structure-from-Motion Revisited. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4104--4113 . Johannes Lutz Sch\u00f6nberger and Jan-Michael Frahm. 2016. Structure-from-Motion Revisited. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4104--4113."},{"key":"e_1_3_2_2_44_1","volume-title":"Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV)","volume":"9907","author":"Sch\u00f6nberger Johannes Lutz","year":"2016","unstructured":"Johannes Lutz Sch\u00f6nberger , Enliang Zheng , Marc Pollefeys , and Jan-Michael Frahm . 2016 . Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV) , Vol. 9907 . 501--518. Johannes Lutz Sch\u00f6nberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. 2016. Pixelwise View Selection for Unstructured Multi-View Stereo. In European Conference on Computer Vision (ECCV), Vol. 9907. 501--518."},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"e_1_3_2_2_47_1","volume-title":"BA-Net: Dense Bundle Adjustment Networks. In International Conference on Learning Representations.","author":"Tang Chengzhou","year":"2018","unstructured":"Chengzhou Tang and Ping Tan . 2018 . BA-Net: Dense Bundle Adjustment Networks. In International Conference on Learning Representations. Chengzhou Tang and Ping Tan. 2018. BA-Net: Dense Bundle Adjustment Networks. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_48_1","volume-title":"International Conference on Learning Representations.","author":"Teed Zachary","year":"2019","unstructured":"Zachary Teed and Jia Deng . 2019 . DeepV2D: Video to Depth with Differentiable Structure from Motion . In International Conference on Learning Representations. Zachary Teed and Jia Deng. 2019. DeepV2D: Video to Depth with Differentiable Structure from Motion. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58536-5_24"},{"key":"e_1_3_2_2_50_1","volume-title":"Advances in neural information processing systems","author":"Vaswani Ashish","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need . In Advances in neural information processing systems , Vol. 30 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, Vol. 30."},{"key":"e_1_3_2_2_51_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8953--8962","author":"Zhong Yiran","year":"2021","unstructured":"JianyuanWang, Yiran Zhong , Yuchao Dai , Stan Birchfield , Kaihao Zhang , Nikolai Smolyanskiy , and Hongdong Li . 2021 . Deep two-view structure-from-motion revisited . In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8953--8962 . JianyuanWang, Yiran Zhong, Yuchao Dai, Stan Birchfield, Kaihao Zhang, Nikolai Smolyanskiy, and Hongdong Li. 2021. Deep two-view structure-from-motion revisited. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8953--8962."},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00759"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00040"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00069"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00077"},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00578"},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00181"},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3323015"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478513.3480500"},{"key":"e_1_3_2_2_61_1","volume-title":"Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss. arXiv preprint arXiv:1812.03368","author":"Zhou Lipu","year":"2018","unstructured":"Lipu Zhou , Jiamin Ye , Montiel Abello , Shengze Wang , and Michael Kaess . 2018. Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss. arXiv preprint arXiv:1812.03368 ( 2018 ). Lipu Zhou, Jiamin Ye, Montiel Abello, Shengze Wang, and Michael Kaess. 2018. Unsupervised learning of monocular depth estimation with bundle adjustment, super-resolution and clip loss. arXiv preprint arXiv:1812.03368 (2018)."},{"key":"e_1_3_2_2_62_1","volume-title":"International Conference on Learning Representations.","author":"Zhu Xizhou","year":"2021","unstructured":"Xizhou Zhu , Weijie Su , Lewei Lu , Bin Li , Xiaogang Wang , and Jifeng Dai . 2021 . Deformable detr: Deformable transformers for end-to-end object detection . In International Conference on Learning Representations. Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable detr: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547978","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547978","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:31Z","timestamp":1750186831000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547978"}},"subtitle":["Consistent Video Depth Estimation with Masked Frames Modeling"],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":62,"alternative-id":["10.1145\/3503161.3547978","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547978","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}