{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:19:06Z","timestamp":1750220346604,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":27,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,24]],"date-time":"2021-08-24T00:00:00Z","timestamp":1629763200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the National Natural Science Foundation of China,the National Social Science Foundation of China,the Opening Project of State Key Laboratory of Digital Publishing Technology of Founder Group","award":["(62072463, 71531012),18ZDA309,413217003"],"award-info":[{"award-number":["(62072463, 71531012),18ZDA309,413217003"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,24]]},"DOI":"10.1145\/3460426.3463603","type":"proceedings-article","created":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T22:50:28Z","timestamp":1630536628000},"page":"127-134","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["RGB-D Scene Recognition based on Object-Scene Relation and Semantics-Preserving Attention"],"prefix":"10.1145","author":[{"given":"Yuhui","family":"Guo","sequence":"first","affiliation":[{"name":"Renmin University of China, Beijing, China"}]},{"given":"Xun","family":"Liang","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2021,9]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298974"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298916"},{"key":"e_1_3_2_1_4_1","volume-title":"Scales and Dataset Bias. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"Herranz Luis","year":"2016","unstructured":"Luis Herranz , Shuqiang Jiang , and Xiangyang Li . 2016 . Scene Recognition with CNNs: Objects , Scales and Dataset Bias. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 , Las Vegas, NV, USA, June 27--30 , 2016. 571--579. https:\/\/doi.org\/10.1109\/CVPR.2016.68 10.1109\/CVPR.2016.68 Luis Herranz, Shuqiang Jiang, and Xiangyang Li. 2016. Scene Recognition with CNNs: Objects, Scales and Dataset Bias. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. 571--579. https:\/\/doi.org\/10.1109\/CVPR.2016.68"},{"key":"e_1_3_2_1_5_1","volume-title":"DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"Johnson Justin","year":"2016","unstructured":"Justin Johnson , Andrej Karpathy , and Li Fei-Fei . 2016 . DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 , Las Vegas, NV, USA, June 27--30 , 2016. IEEE Computer Society, 4565--4574. https:\/\/doi.org\/10.1109\/CVPR.2016.494 10.1109\/CVPR.2016.494 Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. IEEE Computer Society, 4565--4574. https:\/\/doi.org\/10.1109\/CVPR.2016.494"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.140"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0660-x"},{"volume-title":"LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling. In 14th European Conference on Computer Vision,ECCV. https:\/\/doi.org\/10","author":"Li Zhen","key":"e_1_3_2_1_8_1","unstructured":"Zhen Li , Yukang Gan , Xiaodan Liang , Yizhou Yu , Hui Cheng , and Liang Lin . 9906:541--557,2016. LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling. In 14th European Conference on Computer Vision,ECCV. https:\/\/doi.org\/10 .1007\/978--3--319--46475--6_34 10.1007\/978--3--319--46475--6_34 Zhen Li, Yukang Gan, Xiaodan Liang, Yizhou Yu, Hui Cheng, and Liang Lin. 9906:541--557,2016. LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling. In 14th European Conference on Computer Vision,ECCV. https:\/\/doi.org\/10.1007\/978--3--319--46475--6_34"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2858826"},{"key":"e_1_3_2_1_10_1","volume-title":"Employing Deep Part-Object Relationships for Salient Object Detection. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Liu Yi","year":"2019","unstructured":"Yi Liu , Qiang Zhang , Dingwen Zhang , and Jungong Han . 2019 . Employing Deep Part-Object Relationships for Salient Object Detection. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019 , Seoul, Korea (South), October 27 - November 2, 2019. 1232--1241. https:\/\/doi.org\/10.1109\/ICCV.2019.00132 10.1109\/ICCV.2019.00132 Yi Liu, Qiang Zhang, Dingwen Zhang, and Jungong Han. 2019. Employing Deep Part-Object Relationships for Salient Object Detection. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. 1232--1241. https:\/\/doi.org\/10.1109\/ICCV.2019.00132"},{"volume-title":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1412","author":"Mao Junhua","key":"e_1_3_2_1_11_1","unstructured":"Junhua Mao , Wei Xu , Yi Yang , Jiang Wang , and Alan L. Yuille . 2015. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) . In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1412 .6632 Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2015. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1412.6632"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0695-z"},{"key":"e_1_3_2_1_13_1","volume-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross B. Girshick , and Jian Sun . 2015 . Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 , December 7 --12 , 2015, Montreal, Quebec, Canada. 91--99. http:\/\/papers.nips.cc\/paper\/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7--12, 2015, Montreal, Quebec, Canada. 91--99. http:\/\/papers.nips.cc\/paper\/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks"},{"key":"e_1_3_2_1_14_1","volume-title":"Proceedings, Part V (Lecture Notes in Computer Science","volume":"760","author":"Silberman Nathan","year":"2012","unstructured":"Nathan Silberman , Derek Hoiem , Pushmeet Kohli , and Rob Fergus . 2012 . Indoor Segmentation and Support Inference from RGBD Images. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7--13, 2012 , Proceedings, Part V (Lecture Notes in Computer Science , Vol. 7576). 746-- 760 . https:\/\/doi.org\/10.1007\/978--3--642--33715--4_54 10.1007\/978--3--642--33715--4_54 Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor Segmentation and Support Inference from RGBD Images. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7--13, 2012, Proceedings, Part V (Lecture Notes in Computer Science, Vol. 7576). 746--760. https:\/\/doi.org\/10.1007\/978--3--642--33715--4_54"},{"key":"e_1_3_2_1_15_1","volume-title":"Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3--6","author":"Socher Richard","year":"2012","unstructured":"Richard Socher , Brody Huval , Bharath Putta Bath , Christopher D. Manning , and Andrew Y. Ng . 2012. Convolutional-Recursive Deep Learning for 3D Object Classification . In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3--6 , 2012 , Lake Tahoe, Nevada, United States. 665--673. http:\/\/papers.nips.cc\/paper\/4773-convolutional-recursive-deep-learning-for-3d-object-classification Richard Socher, Brody Huval, Bharath Putta Bath, Christopher D. Manning, and Andrew Y. Ng. 2012. Convolutional-Recursive Deep Learning for 3D Object Classification. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3--6, 2012, Lake Tahoe, Nevada, United States. 665--673. http:\/\/papers.nips.cc\/paper\/4773-convolutional-recursive-deep-learning-for-3d-object-classification"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298655"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123300"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11226"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2017\/631"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2933728"},{"key":"e_1_3_2_1_21_1","volume-title":"Camilleri","author":"Tanti Marc","year":"2019","unstructured":"Marc Tanti , Albert Gatt , and Kenneth P . Camilleri . 2019 . On Architectures for Including Visual Information in Neural Language Models for Image Description. CoRR , Vol. abs\/ 1911 .03738 (2019). arxiv: 1911.03738 http:\/\/arxiv.org\/abs\/1911.03738 Marc Tanti, Albert Gatt, and Kenneth P. Camilleri. 2019. On Architectures for Including Visual Information in Neural Language Models for Image Description. CoRR, Vol. abs\/1911.03738 (2019). arxiv: 1911.03738 http:\/\/arxiv.org\/abs\/1911.03738"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-006-8614-1"},{"key":"e_1_3_2_1_24_1","volume-title":"Modality and Component Aware Feature Fusion for RGB-D Scene Classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"Wang Anran","year":"2016","unstructured":"Anran Wang , Jianfei Cai , Jiwen Lu , and Tat-Jen Cham . 2016 . Modality and Component Aware Feature Fusion for RGB-D Scene Classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 , Las Vegas, NV, USA, June 27--30 , 2016. 5995--6004. https:\/\/doi.org\/10.1109\/CVPR.2016.645 10.1109\/CVPR.2016.645 Anran Wang, Jianfei Cai, Jiwen Lu, and Tat-Jen Cham. 2016. Modality and Component Aware Feature Fusion for RGB-D Scene Classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. 5995--6004. https:\/\/doi.org\/10.1109\/CVPR.2016.645"},{"key":"e_1_3_2_1_25_1","volume-title":"Dense Captioning with Joint Inference and Visual Context. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017","author":"Yang Linjie","year":"2017","unstructured":"Linjie Yang , Kevin D. Tang , Jianchao Yang , and Li-Jia Li . 2017 . Dense Captioning with Joint Inference and Visual Context. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 , Honolulu, HI, USA, July 21--26 , 2017. IEEE Computer Society, 1978--1987. https:\/\/doi.org\/10.1109\/CVPR.2017.214 10.1109\/CVPR.2017.214 Linjie Yang, Kevin D. Tang, Jianchao Yang, and Li-Jia Li. 2017. Dense Captioning with Joint Inference and Visual Context. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. IEEE Computer Society, 1978--1987. https:\/\/doi.org\/10.1109\/CVPR.2017.214"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2723009"},{"key":"e_1_3_2_1_27_1","volume-title":"Discriminative Multi-modal Feature Fusion for RGBD Indoor Scene Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"Zhu Hongyuan","year":"2016","unstructured":"Hongyuan Zhu , Jean-Baptiste Weibel , and Shijian Lu . 2016 . Discriminative Multi-modal Feature Fusion for RGBD Indoor Scene Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 , Las Vegas, NV, USA, June 27--30 , 2016. 2969--2976. https:\/\/doi.org\/10.1109\/CVPR.2016.324 10.1109\/CVPR.2016.324 Hongyuan Zhu, Jean-Baptiste Weibel, and Shijian Lu. 2016. Discriminative Multi-modal Feature Fusion for RGBD Indoor Scene Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. 2969--2976. https:\/\/doi.org\/10.1109\/CVPR.2016.324"}],"event":{"name":"ICMR '21: International Conference on Multimedia Retrieval","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Taipei Taiwan","acronym":"ICMR '21"},"container-title":["Proceedings of the 2021 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460426.3463603","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460426.3463603","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:03Z","timestamp":1750191423000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460426.3463603"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,24]]},"references-count":27,"alternative-id":["10.1145\/3460426.3463603","10.1145\/3460426"],"URL":"https:\/\/doi.org\/10.1145\/3460426.3463603","relation":{},"subject":[],"published":{"date-parts":[[2021,8,24]]},"assertion":[{"value":"2021-09-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}