{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:32:26Z","timestamp":1750221146097,"version":"3.41.0"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"1s","license":[{"start":{"date-parts":[[2019,1,31]],"date-time":"2019-01-31T00:00:00Z","timestamp":1548892800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61572134, 61622204"],"award-info":[{"award-number":["61572134, 61622204"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003399","name":"Science and Technology Commission of Shanghai Municipality","doi-asserted-by":"publisher","award":["16QA1400500"],"award-info":[{"award-number":["16QA1400500"]}],"id":[{"id":"10.13039\/501100003399","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,1,31]]},"abstract":"<jats:p>Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they can provide rich clues on object contexts within the images. In this article, we introduce a novel method to learn the Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, aiming to explicitly model the spatial object contexts within the images. In particular, we explicitly apply the designed gate units to the extracted object features for important-object selection and noise removal. 
These selected object features are then organized into the proposed SFM, which is a compact and discriminative representation with the spatial information among objects preserved. Finally, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) as the classifiers on top of the SFM for content recognition. A novel multi-task learning framework with image classification loss, object localization loss, and grid labeling loss is also introduced to help better learn the model parameters. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach on Pascal VOC 2007\/2012 and MS-COCO benchmarks for image classification. In addition, the experimental results show that the SFMs learned from the image domain can be successfully transferred to CCV and FCVID benchmarks for video classification.<\/jats:p>","DOI":"10.1145\/3231739","type":"journal-article","created":{"date-parts":[[2019,2,5]],"date-time":"2019-02-05T20:40:21Z","timestamp":1549399221000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning"],"prefix":"10.1145","volume":"15","author":[{"given":"Rui-Wei","family":"Zhao","sequence":"first","affiliation":[{"name":"Fudan University, Shanghai, China"}]},{"given":"Qi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}]},{"given":"Zuxuan","family":"Wu","sequence":"additional","affiliation":[{"name":"University of Maryland, Maryland, USA"}]},{"given":"Jianguo","family":"Li","sequence":"additional","affiliation":[{"name":"Intel Labs China, Beijing, China"}]},{"given":"Yu-Gang","family":"Jiang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, 
China"}]}],"member":"320","published-online":{"date-parts":[[2019,2,5]]},"reference":[{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2874--2883","author":"Bell Sean","key":"e_1_2_1_1_1","unstructured":"Sean Bell , C. Lawrence Zitnick , Kavita Bala , and Ross B. Girshick . 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2874--2883 . Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross B. Girshick. 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2874--2883."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.28.6"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.414"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.112"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-014-0733-5"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10584-0_26"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2015.2389824"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems.","author":"Jaderberg Max","year":"2015","unstructured":"Max Jaderberg , Karen Simonyan , Andrew Zisserman , and Koray Kavukcuoglu . 2015 . Spatial transformer networks . 
In Proceedings of the Conference on Advances in Neural Information Processing Systems. Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. 2015. Spatial transformer networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2010.5540039"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00138-013-0567-0"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2015.2456412"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1282280.1282352"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2670560"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1991996.1992025"},{"volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems. 1097--1105","author":"Krizhevsky Alex","key":"e_1_2_1_19_1","unstructured":"Alex Krizhevsky , Ilya Sutskever , and Geoffrey E. Hinton . 2012. ImageNet classification with deep convolutional neural networks . In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1097--1105 . Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1097--1105."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.288"},{"key":"e_1_2_1_21_1","volume-title":"Deep learning. Nature 521, 7553","author":"LeCun Yann","year":"2015","unstructured":"Yann LeCun , Yoshua Bengio , and Geoffrey Hinton . 2015. Deep learning. Nature 521, 7553 ( 2015 ), 436--444. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. 
Nature 521, 7553 (2015), 436--444."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0660-x"},{"volume-title":"Proceedings of the European Conference on Computer Vision. 740--755","author":"Lin Tsung-Yi","key":"e_1_2_1_23_1","unstructured":"Tsung-Yi Lin , Michael Maire , Serge J. Belongie , James Hays , Pietro Perona , Deva Ramanan , Piotr Doll\u00e1r , and C. Lawrence Zitnick . 2014. Microsoft COCO: Common objects in context . In Proceedings of the European Conference on Computer Vision. 740--755 . Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision. 740--755."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.109"},{"volume-title":"Proceedings of the European Conference on Computer Vision. Springer International Publishing","author":"Liu Wei","key":"e_1_2_1_25_1","unstructured":"Wei Liu , Dragomir Anguelov , Dumitru Erhan , Christian Szegedy , Scott E. Reed , Cheng-Yang Fu , and Alexander C. Berg . 2016. SSD: Single shot multibox detector . In Proceedings of the European Conference on Computer Vision. Springer International Publishing , Amsterdam, The Netherlands, 21--37. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision. Springer International Publishing, Amsterdam, The Netherlands, 21--37."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"e_1_2_1_27_1","unstructured":"Jianwei Luo Jianguo Li Jun Wang Zhiguo Jiang and Yurong Chen. 2015. Deep attributes from context-aware regional neural codes. arXiv.org. arXiv:1509.02470v1  Jianwei Luo Jianguo Li Jun Wang Zhiguo Jiang and Yurong Chen. 2015. 
Deep attributes from context-aware regional neural codes. arXiv.org. arXiv:1509.02470v1"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-014-0723-7"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2016.07.008"},{"key":"e_1_2_1_30_1","first-page":"1","article-title":"Event fisher vectors: Robust encoding visual diversity of visual streams. In Proceedings of the British Machine Vision Conference. BMVA Press, Swansea","volume":"178","author":"Nagel Markus","year":"2015","unstructured":"Markus Nagel , Thomas Mensink , and Cees G. M. Snoek . 2015 . Event fisher vectors: Robust encoding visual diversity of visual streams. In Proceedings of the British Machine Vision Conference. BMVA Press, Swansea , UK , 178 . 1 -- 178 .12. Markus Nagel, Thomas Mensink, and Cees G. M. Snoek. 2015. Event fisher vectors: Robust encoding visual diversity of visual streams. In Proceedings of the British Machine Vision Conference. BMVA Press, Swansea, UK, 178.1--178.12.","journal-title":"UK"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.222"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818711"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems. 91--99","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross B. Girshick , and Jian Sun . 2015 . Faster R-CNN: Towards real-time object detection with region proposal networks . In Proceedings of the Conference on Advances in Neural Information Processing Systems. 91--99 . Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 
91--99."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the International Conference on Learning Representations Workshop.","author":"Sharma Shikhar","year":"2015","unstructured":"Shikhar Sharma , Ryan Kiros , and Ruslan Salakhutdinov . 2015 . Action recognition using visual attention . In Proceedings of the International Conference on Learning Representations Workshop. Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. 2015. Action recognition using visual attention. In Proceedings of the International Conference on Learning Representations Workshop."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems. 568--576","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014 . Two-stream convolutional networks for action recognition in videos . In Proceedings of the Conference on Advances in Neural Information Processing Systems. 568--576 . Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 568--576."},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman . 2015 . Very deep convolutional networks for large scale image recognition . In Proceedings of the International Conference on Learning Representations. Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large scale image recognition. 
In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2967199"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0620-5"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.251"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2612829"},{"key":"e_1_2_1_44_1","volume-title":"HCP: A flexible CNN framework for multi-label image classification","author":"Wei Yunchao","year":"2015","unstructured":"Yunchao Wei , Wei Xia , Min Lin , Junshi Huang , Bingbing Ni , Jian Dong , Yao Zhao , and Shuicheng Yan . 2015 . HCP: A flexible CNN framework for multi-label image classification . IEEE Trans. Pattern Anal. Mach. Intell . (2015), 1--8. Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. 2015. HCP: A flexible CNN framework for multi-label image classification. IEEE Trans. Pattern Anal. Mach. Intell. (2015), 1--8."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.152"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.339"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964328"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806222"},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 842--850","author":"Xiao Tianjun","year":"2015","unstructured":"Tianjun Xiao , Yichong Xu , Kuiyuan Yang , Jiaxing Zhang , Yuxin Peng , and Zheng Zhang . 2015 . The application of two-level attention models in deep convolutional neural network for fine-grained image classification . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 842--850 . 
Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. 2015. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 842--850."},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the International Conference on Machine Learning. 2048--2057","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu , Jimmy Ba , Ryan Kiros , Aaron Courville , Ruslan Salakhutdinov , Richard Zemel , and Yoshua Bengio . 2015 . Show, attend and tell: Neural image caption generation with visual attention . In Proceedings of the International Conference on Machine Learning. 2048--2057 . Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048--2057."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.427"},{"key":"e_1_2_1_52_1","volume-title":"Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai.","author":"Yang Hao","year":"2015","unstructured":"Hao Yang , Joey Tianyi Zhou , Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. 2015 . Can partial strong labels boost multi-label object recognition? arXiv:1504.05843. Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. 2015. Can partial strong labels boost multi-label object recognition? arXiv:1504.05843."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2962719"},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3021--3028","author":"Ye Guangnan","year":"2012","unstructured":"Guangnan Ye , Dong Liu , I- Hong Jhuo , and Shih-Fu Chang . 2012 . Robust late fusion with rank minimization . 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3021--3028 . Guangnan Ye, Dong Liu, I-Hong Jhuo, and Shih-Fu Chang. 2012. Robust late fusion with rank minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3021--3028."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.29.60"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2648498"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.30.72"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123379"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.557"},{"key":"e_1_2_1_60_1","volume-title":"Proceedings of the European Conference on Computer Vision. 391--405","author":"Lawrence Zitnick C.","year":"2014","unstructured":"C. Lawrence Zitnick and Piotr Doll\u00e1r . 2014 . Edge boxes: Locating object proposals from edges . In Proceedings of the European Conference on Computer Vision. 391--405 . C. Lawrence Zitnick and Piotr Doll\u00e1r. 2014. Edge boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision. 
391--405."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2548241"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3231739","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3231739","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:08:16Z","timestamp":1750208896000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3231739"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,1,31]]},"references-count":61,"journal-issue":{"issue":"1s","published-print":{"date-parts":[[2019,1,31]]}},"alternative-id":["10.1145\/3231739"],"URL":"https:\/\/doi.org\/10.1145\/3231739","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2019,1,31]]},"assertion":[{"value":"2017-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}