{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T15:36:46Z","timestamp":1774539406603,"version":"3.50.1"},"reference-count":155,"publisher":"Association for Computing Machinery (ACM)","issue":"1s","license":[{"start":{"date-parts":[[2019,1,24]],"date-time":"2019-01-24T00:00:00Z","timestamp":1548288000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,1,31]]},"abstract":"<jats:p>The multimedia community has witnessed the rise of deep learning\u2013based techniques in analyzing multimedia content more effectively. In the past decade, the convergence of deep-learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection, and regression, and has also fundamentally changed the landscape of several relatively new areas, such as semantic segmentation, captioning, and content generation. This article aims to review the development path of major tasks in multimedia analytics and take a look into future directions. We start by summarizing the fundamental deep techniques related to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. 
Moreover, the performance review of popular benchmarks gives a pathway to technology advancement and helps identify both milestone works and future directions.<\/jats:p>","DOI":"10.1145\/3279952","type":"journal-article","created":{"date-parts":[[2019,1,28]],"date-time":"2019-01-28T14:01:39Z","timestamp":1548684099000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["Deep Learning\u2013Based Multimedia Analytics"],"prefix":"10.1145","volume":"15","author":[{"given":"Wei","family":"Zhang","sequence":"first","affiliation":[{"name":"JD AI Research, Beijing, China"}]},{"given":"Ting","family":"Yao","sequence":"additional","affiliation":[{"name":"JD AI Research, Beijing, China"}]},{"given":"Shiai","family":"Zhu","sequence":"additional","affiliation":[{"name":"Ant Financial Group, Hangzhou, China"}]},{"given":"Abdulmotaleb El","family":"Saddik","sequence":"additional","affiliation":[{"name":"University of Ottawa, Ottawa, Canada"}]}],"member":"320","published-online":{"date-parts":[[2019,1,24]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_2_1_3_1","unstructured":"M. Arjovsky S. Chintala and L. Bottou. 2017. Wasserstein GAN. In arXiv:1701.07875.  M. Arjovsky S. Chintala and L. Bottou. 2017. Wasserstein GAN. In arXiv:1701.07875."},{"key":"e_1_2_1_4_1","unstructured":"Vijay Badrinarayanan Ankur Handa and Roberto Cipolla. 2015. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. In arXiv:1505.07293.  Vijay Badrinarayanan Ankur Handa and Roberto Cipolla. 2015. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. In arXiv:1505.07293."},{"key":"e_1_2_1_5_1","unstructured":"Dzmitry Bahdanau Kyunghyun Cho and Yoshua Bengio. 2014. 
Neural machine translation by jointly learning to align and translate. In arXiv:1409.0473.  Dzmitry Bahdanau Kyunghyun Cho and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In arXiv:1409.0473."},{"key":"e_1_2_1_6_1","unstructured":"Nicolas Ballas Li Yao Chris Pal and Aaron Courville. 2015. Delving deeper into convolutional networks for learning video representations. In arXiv:1511.06432.  Nicolas Ballas Li Yao Chris Pal and Aaron Courville. 2015. Delving deeper into convolutional networks for learning video representations. In arXiv:1511.06432."},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. Association for Computational Linguistics, 65--72","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie . 2005 . METEOR: An automatic metric for MT evaluation with improved correlation with human judgments . In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. Association for Computational Linguistics, 65--72 . Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. Association for Computational Linguistics, 65--72."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.339"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3578--3587","author":"Larry","unstructured":"Larry S. Davis and Bharat Singh. 2018. An analysis of scale invariance in object detection . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE Computer Society, 3578--3587 . Larry S. Davis and Bharat Singh. 2018. An analysis of scale invariance in object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3578--3587."},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 190--200","author":"David","unstructured":"David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation . In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 190--200 . David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 190--200."},{"key":"e_1_2_1_11_1","volume-title":"See and chat: Automatically generating viewer-level comments on images. Multimedia Tools and Applications","author":"Chen Jingwen","unstructured":"Jingwen Chen , Ting Yao , and Hongyang Chao . 2018. See and chat: Automatically generating viewer-level comments on images. Multimedia Tools and Applications . ( In press) . Jingwen Chen, Ting Yao, and Hongyang Chao. 2018. See and chat: Automatically generating viewer-level comments on images. Multimedia Tools and Applications. (In press)."},{"key":"e_1_2_1_12_1","volume-title":"Yuille","author":"Chen Liang-Chieh","year":"2014","unstructured":"Liang-Chieh Chen , George Papandreou , Iasonas Kokkinos , Kevin Murphy , and Alan L . Yuille . 2014 . Semantic image segmentation with deep convolutional nets and fully connected CRFs. In arXiv:1412.7062. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2014. Semantic image segmentation with deep convolutional nets and fully connected CRFs. 
In arXiv:1412.7062."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2699184"},{"key":"e_1_2_1_14_1","unstructured":"Liang-Chieh Chen George Papandreou Florian Schroff and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. In arXiv:1706.05587.  Liang-Chieh Chen George Papandreou Florian Schroff and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. In arXiv:1706.05587."},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Liang-Chieh Chen Yukun Zhu George Papandreou Florian Schroff and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In arXiv:1802.02611.  Liang-Chieh Chen Yukun Zhu George Papandreou Florian Schroff and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In arXiv:1802.02611.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"e_1_2_1_16_1","unstructured":"Xinlei Chen Hao Fang Tsung-Yi Lin Ramakrishna Vedantam Saurabh Gupta Piotr Doll\u00e1r and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. In arXiv:1504.00325.  Xinlei Chen Hao Fang Tsung-Yi Lin Ramakrishna Vedantam Saurabh Gupta Piotr Doll\u00e1r and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. In arXiv:1504.00325."},{"key":"e_1_2_1_17_1","volume-title":"Xception: Deep learning with depthwise separable convolutions. In arXiv:1610.02357.","author":"Chollet Fran\u00e7ois","year":"2016","unstructured":"Fran\u00e7ois Chollet . 2016 . Xception: Deep learning with depthwise separable convolutions. In arXiv:1610.02357. Fran\u00e7ois Chollet. 2016. Xception: Deep learning with depthwise separable convolutions. In arXiv:1610.02357."},{"key":"e_1_2_1_18_1","unstructured":"Djork-Arn\u00e9 Clevert Thomas Unterthiner and Sepp Hochreiter. 2015. 
Fast and accurate deep network learning by exponential linear units (ELUs). In arXiv:1511.07289.  Djork-Arn\u00e9 Clevert Thomas Unterthiner and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). In arXiv:1511.07289."},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Marius Cordts Mohamed Omran Sebastian Ramos Timo Rehfeld Markus Enzweiler Rodrigo Benenson Uwe Franke Stefan Roth and Bernt Schiele. 2016. The Cityscapes dataset for semantic urban scene understanding. In arXiv:1604.01685.  Marius Cordts Mohamed Omran Sebastian Ramos Timo Rehfeld Markus Enzweiler Rodrigo Benenson Uwe Franke Stefan Roth and Bernt Schiele. 2016. The Cityscapes dataset for semantic urban scene understanding. In arXiv:1604.01685.","DOI":"10.1109\/CVPR.2016.350"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.191"},{"key":"e_1_2_1_21_1","volume-title":"Advances in Neural Information Processing Systems","author":"Denton Emily L.","unstructured":"Emily L. Denton , Soumith Chintala , Arthur Szlam , and Rob Fergus . 2015. Deep generative image models using a Laplacian pyramid of adversarial networks . In Advances in Neural Information Processing Systems . The MIT Press , 1486--1494. Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems. The MIT Press, 1486--1494."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P15-2017"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-014-0733-5"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/1888089.1888092"},{"key":"e_1_2_1_26_1","unstructured":"Mostafa Gamal Mennatullah Siam and Moemen Abdel-Razek. 2018. 
ShuffleSeg: Real-time semantic segmentation network. In arXiv:1803.03816.  Mostafa Gamal Mennatullah Siam and Moemen Abdel-Razek. 2018. ShuffleSeg: Real-time semantic segmentation network. In arXiv:1803.03816."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_2_1_29_1","volume-title":"JMLR W&CP: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics","volume":"9","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio . 2010 . Understanding the difficulty of training deep feedforward neural networks . JMLR W&CP: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics , Vol. 9 . 249--256. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. JMLR W&CP: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Vol. 9. 249--256."},{"key":"e_1_2_1_30_1","first-page":"2672","article-title":"Generative adversarial nets","volume":"2","author":"Goodfellow Ian J.","year":"2014","unstructured":"Ian J. Goodfellow , Jean Pouget-Abadie , Mehdi Mirza , Bing Xu , David Warde-Farley , Sherjil Ozair , Aaron C. Courville , and Yoshua Bengio . 2014 . Generative adversarial nets . In Advances in Neural Information Processing Systems , Vol. 2. 2672 -- 2680 . Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, Vol. 2. 2672--2680.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_31_1","unstructured":"Alex Graves. 2013. Generating sequences with recurrent neural networks. In arXiv:1308.0850.  Alex Graves. 2013. Generating sequences with recurrent neural networks. 
In arXiv:1308.0850."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.337"},{"key":"e_1_2_1_33_1","volume-title":"Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, 6789","author":"Hahnloser Richard H. R.","year":"2000","unstructured":"Richard H. R. Hahnloser , Rahul Sarpeshkar , Misha Mahowald , Rodney J. Douglas , and H. Sebastian Seung . 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, 6789 ( 2000 ), 947--51. Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha Mahowald, Rodney J. Douglas, and H. Sebastian Seung. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, 6789 (2000), 947--51."},{"key":"e_1_2_1_34_1","series-title":"Lecture Notes in Computer Science","volume-title":"The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning","author":"Han Jun","unstructured":"Jun Han and Claudio Moraga . 1995. The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning , Vol. 930 , Lecture Notes in Computer Science . Springer , 195--201. Jun Han and Claudio Moraga. 1995. The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning, Vol. 930, Lecture Notes in Computer Science. Springer, 195--201."},{"key":"e_1_2_1_35_1","volume-title":"Mask R-CNN. In IEEE International Conference on Computer Vision. IEEE Computer Society, 2980--2988","author":"He Kaiming","unstructured":"Kaiming He , Georgia Gkioxari , Piotr Doll\u00e1r , and Ross B. Girshick . 2017 . Mask R-CNN. In IEEE International Conference on Computer Vision. IEEE Computer Society, 2980--2988 . Kaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross B. Girshick. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision. 
IEEE Computer Society, 2980--2988."},{"key":"e_1_2_1_36_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In arXiv:1406.4729.  Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In arXiv:1406.4729."},{"key":"e_1_2_1_37_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep residual learning for image recognition. In arXiv:1512.03385.  Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep residual learning for image recognition. In arXiv:1512.03385."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.123"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.450"},{"key":"e_1_2_1_40_1","unstructured":"Andrew G. Howard Menglong Zhu Bo Chen Dmitry Kalenichenko Weijun Wang Tobias Weyand Marco Andreetto and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. In arXiv:1704.04861.  Andrew G. Howard Menglong Zhu Bo Chen Dmitry Kalenichenko Weijun Wang Tobias Weyand Marco Andreetto and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. In arXiv:1704.04861."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00378"},{"key":"e_1_2_1_42_1","volume-title":"Weinberger","author":"Huang Gao","year":"2017","unstructured":"Gao Huang , Shichen Liu , Laurens van der Maaten , and Kilian Q . Weinberger . 2017 . CondenseNet: An efficient DenseNet using learned group convolutions. In arXiv:1711.09224. Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. CondenseNet: An efficient DenseNet using learned group convolutions. 
In arXiv:1711.09224."},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2261--2269","author":"Huang Gao","unstructured":"Gao Huang , Zhuang Liu , Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2261--2269 . Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2261--2269."},{"key":"e_1_2_1_44_1","unstructured":"Forrest N. Iandola Song Han Matthew W. Moskewicz Khalid Ashraf William J. Dally and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. In arXiv:1602.07360.  Forrest N. Iandola Song Han Matthew W. Moskewicz Khalid Ashraf William J. Dally and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. In arXiv:1602.07360."},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 32nd International Conference on Machine Learning","volume":"37","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy . 2015 . Batch normalization: Accelerating deep network training by reducing internal covariate shift . In Proceedings of the 32nd International Conference on Machine Learning , Vol. 37 . Omnipress, 448--456. Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37. 
Omnipress, 448--456."},{"key":"e_1_2_1_46_1","volume-title":"Efros","author":"Isola Phillip","year":"2016","unstructured":"Phillip Isola , Jun-Yan Zhu , Tinghui Zhou , and Alexei A . Efros . 2016 . Image-to-image translation with conditional adversarial networks. In arXiv:1611.07004. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. Image-to-image translation with conditional adversarial networks. In arXiv:1611.07004."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2823900"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7132--7141","author":"Jie Hu Li Shen","year":"2018","unstructured":"Li Shen Jie Hu and Gang Sun . 2018 . Squeeze-and-excitation networks . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7132--7141 . Li Shen Jie Hu and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7132--7141."},{"key":"e_1_2_1_49_1","unstructured":"Tero Karras Timo Aila Samuli Laine and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality stability and variation. In arXiv:1710.10196v2.  Tero Karras Timo Aila Samuli Laine and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality stability and variation. In arXiv:1710.10196v2."},{"key":"e_1_2_1_50_1","unstructured":"Alex Kendall Vijay Badrinarayanan and Roberto Cipolla. 2015. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In arXiv:1511.02680.  Alex Kendall Vijay Badrinarayanan and Roberto Cipolla. 2015. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. 
In arXiv:1511.02680."},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of the International Conference on Machine Learning","volume":"70","author":"Kim Taeksoo","year":"2017","unstructured":"Taeksoo Kim , Moonsu Cha , Hyunsoo Kim , Jung Kwon Lee , and Jiwon Kim . 2017 . Learning to discover cross-domain relations with generative adversarial networks . In Proceedings of the International Conference on Machine Learning , Vol. 70 . 1857--1865. Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Vol. 70. 1857--1865."},{"key":"e_1_2_1_52_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba . 2014 . Adam : A method for stochastic optimization. In arXiv:1412.6980. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In arXiv:1412.6980."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1020346032608"},{"key":"e_1_2_1_54_1","volume-title":"RON: Reverse connection with objectness prior networks for object detection. In arXiv:1707.01691.","author":"Kong Tao","year":"2017","unstructured":"Tao Kong , Fuchun Sun , Anbang Yao , Huaping Liu , Ming Lu , and Yurong Chen . 2017 . RON: Reverse connection with objectness prior networks for object detection. In arXiv:1707.01691. Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. 2017. RON: Reverse connection with objectness prior networks for object detection. In arXiv:1707.01691."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_2_1_56_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky , Ilya Sutskever , and Geoffrey E . Hinton . 2012 . ImageNet classification with deep convolutional neural networks. 
In Advances in Neural Information Processing Systems 25. 1097--1105. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. 1097--1105."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.162"},{"key":"e_1_2_1_58_1","unstructured":"Wei-Sheng Lai Jia-Bin Huang Narendra Ahuja and Ming-Hsuan Yang. 2017. Deep Laplacian pyramid networks for fast and accurate super-resolution. In arXiv:1704.03915.  Wei-Sheng Lai Jia-Bin Huang Narendra Ahuja and Ming-Hsuan Yang. 2017. Deep Laplacian pyramid networks for fast and accurate super-resolution. In arXiv:1704.03915."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1989.1.4.541"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_19"},{"key":"e_1_2_1_61_1","volume-title":"Unified spatio-temporal attention networks for action recognition in videos","author":"Li Dong","unstructured":"Dong Li , Ting Yao , Lingyu Duan , Tao Mei , and Yong Rui . 2018. Unified spatio-temporal attention networks for action recognition in videos . IEEE Transactions on Multimedia. (In Press) . Dong Li, Ting Yao, Lingyu Duan, Tao Mei, and Yong Rui. 2018. Unified spatio-temporal attention networks for action recognition in videos. IEEE Transactions on Multimedia. (In Press)."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911996.2912001"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-016-0117-4"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964320"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00782"},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the ACL Workshop on Text Summarization Branches Out. 10 pages.","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin . 2004 . 
Rouge: A package for automatic evaluation of summaries . In Proceedings of the ACL Workshop on Text Summarization Branches Out. 10 pages. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out. 10 pages."},{"key":"e_1_2_1_67_1","volume-title":"Reid","author":"Lin Guosheng","year":"2016","unstructured":"Guosheng Lin , Anton Milan , Chunhua Shen , and Ian D . Reid . 2016 . RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In arXiv:1611.06612. Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. 2016. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In arXiv:1611.06612."},{"key":"e_1_2_1_68_1","unstructured":"Min Lin Qiang Chen and Shuicheng Yan. 2013. Network in network. In arXiv:1312.4400.  Min Lin Qiang Chen and Shuicheng Yan. 2013. Network in network. In arXiv:1312.4400."},{"key":"e_1_2_1_69_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 936--944","author":"Lin Tsung-Yi","unstructured":"Tsung-Yi Lin , Piotr Doll\u00e1r , Ross B. Girshick , Kaiming He , Bharath Hariharan , and Serge J. Belongie . 2017. Feature pyramid networks for object detection . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 936--944 . Tsung-Yi Lin, Piotr Doll\u00e1r, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 936--944."},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.324"},{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 
IEEE Computer Society, 740--755","author":"Lin Tsung-Yi","unstructured":"Tsung-Yi Lin , Michael Maire , Serge Belongie , James Hays , Pietro Perona , Deva Ramanan , Piotr Doll\u00e1r , and C. Lawrence Zitnick . 2014. Microsoft COCO: Common objects in context . In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 740--755 . Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 740--755."},{"key":"e_1_2_1_72_1","unstructured":"Siqi Liu Zhenhai Zhu Ning Ye Sergio Guadarrama and Kevin Murphy. 2017. Optimization of image description metrics using policy gradient methods. In arXiv:1612.00370.  Siqi Liu Zhenhai Zhu Ning Ye Sergio Guadarrama and Kevin Murphy. 2017. Optimization of image description metrics using policy gradient methods. In arXiv:1612.00370."},{"key":"e_1_2_1_73_1","volume-title":"Reed","author":"Liu Wei","year":"2015","unstructured":"Wei Liu , Dragomir Anguelov , Dumitru Erhan , Christian Szegedy , and Scott E . Reed . 2015 . SSD : Single shot MultiBox detector. In arXiv:1512.02325. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott E. Reed. 2015. SSD: Single shot MultiBox detector. In arXiv:1512.02325."},{"key":"e_1_2_1_74_1","doi-asserted-by":"crossref","unstructured":"Jonathan Long Evan Shelhamer and Trevor Darrell. 2014. Fully convolutional networks for semantic segmentation. In arXiv:1411.4038.  Jonathan Long Evan Shelhamer and Trevor Darrell. 2014. Fully convolutional networks for semantic segmentation. 
In arXiv:1411.4038.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_8"},{"key":"e_1_2_1_77_1","volume-title":"ICML Workshop on Deep Learning for Audio, Speech and Language Processing","volume":"28","author":"Maas Andrew L.","unstructured":"Andrew L. Maas , Awni Y. Hannun , and Andrew Y. Ng . 2013. Rectifier nonlinearities improve neural network acoustic models . In ICML Workshop on Deep Learning for Audio, Speech and Language Processing , Vol. 28 . Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Vol. 28."},{"key":"e_1_2_1_78_1","volume-title":"Yuille","author":"Mao Junhua","year":"2014","unstructured":"Junhua Mao , Wei Xu , Yi Yang , Jiang Wang , and Alan L . Yuille . 2014 . Explain images with multimodal recurrent neural networks. In arXiv:1410.1090. Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Explain images with multimodal recurrent neural networks. In arXiv:1410.1090."},{"key":"e_1_2_1_79_1","unstructured":"Xiao-Jiao Mao Chunhua Shen and Yu-Bin Yang. 2016. Image restoration using very deep fully convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems. Curran Associates 2810--2818.   Xiao-Jiao Mao Chunhua Shen and Yu-Bin Yang. 2016. Image restoration using very deep fully convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems. Curran Associates 2810--2818."},{"key":"e_1_2_1_80_1","unstructured":"M. Mirza and S. Osindero. 2014. Conditional generative adversarial nets. In arXiv:1411.1784.  M. Mirza and S. Osindero. 2014. Conditional generative adversarial nets. 
In arXiv:1411.1784."},{"key":"e_1_2_1_81_1","volume-title":"Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computer Linguistics, 747--756","author":"Mitchell Margaret","year":"2012","unstructured":"Margaret Mitchell , Xufeng Han , Amit Goyal , 2012 . Midge: Generating image descriptions from computer vision detections . In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computer Linguistics, 747--756 . Margaret Mitchell, Xufeng Han, Amit Goyal, et al. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computer Linguistics, 747--756."},{"key":"e_1_2_1_82_1","volume-title":"Proceedings of the 27th International Conference on International Conference on Machine Learning. Omnipress, 807--814","author":"Nair Vinod","unstructured":"Vinod Nair and Geoffrey E. Hinton . 2010. Rectified linear units improve restricted Boltzmann machines . In Proceedings of the 27th International Conference on International Conference on Machine Learning. Omnipress, 807--814 . Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning. Omnipress, 807--814."},{"key":"e_1_2_1_83_1","unstructured":"Augustus Odena Christopher Olah and Jonathon Shlens. 2016. Conditional image synthesis with auxiliary classifier GANs. In arXiv:1610.09585.  Augustus Odena Christopher Olah and Jonathon Shlens. 2016. Conditional image synthesis with auxiliary classifier GANs. In arXiv:1610.09585."},{"key":"e_1_2_1_84_1","volume-title":"Proceedings of the International Joint Conference on Artificial Intelligence. 
AAAI Press, 3832--3838","author":"Pan Yingwei","year":"2016","unstructured":"Yingwei Pan , Yehao Li , Ting Yao , Tao Mei , Houqiang Li , and Yong Rui . 2016 . Learning deep intrinsic video representation by exploring temporal coherence and graph structure . In Proceedings of the International Joint Conference on Artificial Intelligence. AAAI Press, 3832--3838 . Yingwei Pan, Yehao Li, Ting Yao, Tao Mei, Houqiang Li, and Yong Rui. 2016. Learning deep intrinsic video representation by exploring temporal coherence and graph structure. In Proceedings of the International Joint Conference on Artificial Intelligence. AAAI Press, 3832--3838."},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.497"},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3084144"},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3127905"},{"key":"e_1_2_1_88_1","unstructured":"Yingwei Pan Ting Yao Houqiang Li and Tao Mei. 2017. Video captioning with transferred semantic attributes. In arXiv:1611.07675.  Yingwei Pan Ting Yao Houqiang Li and Tao Mei. 2017. Video captioning with transferred semantic attributes. In arXiv:1611.07675."},{"key":"e_1_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.1145\/2600428.2609568"},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_91_1","unstructured":"Adam Paszke Abhishek Chaurasia Sangpil Kim and Eugenio Culurciello. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. In arXiv:1606.02147.  Adam Paszke Abhishek Chaurasia Sangpil Kim and Eugenio Culurciello. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. In arXiv:1606.02147."},{"key":"e_1_2_1_92_1","unstructured":"Pauline Luc Camille Couprie Soumith Chintala and Jakob Verbeek. 2016. Semantic segmentation using adversarial networks. In arXiv:1611.08408.  Pauline Luc Camille Couprie Soumith Chintala and Jakob Verbeek. 
2016. Semantic segmentation using adversarial networks. In arXiv:1611.08408."},{"key":"e_1_2_1_93_1","doi-asserted-by":"crossref","unstructured":"Chao Peng Xiangyu Zhang Gang Yu Guiming Luo and Jian Sun. 2017. Large kernel matters - improve semantic segmentation by global convolutional network. In arXiv:1703.02719.  Chao Peng Xiangyu Zhang Gang Yu Guiming Luo and Jian Sun. 2017. Large kernel matters - improve semantic segmentation by global convolutional network. In arXiv:1703.02719.","DOI":"10.1109\/CVPR.2017.189"},{"key":"e_1_2_1_94_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1713--1721","author":"Pedro H.","unstructured":"Pedro H. O. Pinheiro and Ronan Collobert. 2015. From image-level to pixel-level labeling with Convolutional Networks . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1713--1721 . Pedro H. O. Pinheiro and Ronan Collobert. 2015. From image-level to pixel-level labeling with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1713--1721."},{"key":"e_1_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.1137\/0330046"},{"key":"e_1_2_1_96_1","volume-title":"THUMOS Challenge Workshop of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.","author":"Qiu Zhaofan","year":"2015","unstructured":"Zhaofan Qiu , Qing Li , Ting Yao , Tao Mei , and Yong Rui . 2015 . MSR Asia MSM at THUMOS Challenge 2015 . In THUMOS Challenge Workshop of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society. Zhaofan Qiu, Qing Li, Ting Yao, Tao Mei, and Yong Rui. 2015. MSR Asia MSM at THUMOS Challenge 2015. In THUMOS Challenge Workshop of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE Computer Society."},{"key":"e_1_2_1_97_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.435"},{"key":"e_1_2_1_98_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.590"},{"key":"e_1_2_1_99_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2759504"},{"key":"e_1_2_1_100_1","unstructured":"Alec Radford Luke Metz and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. In arXiv:1511.06434.  Alec Radford Luke Metz and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. In arXiv:1511.06434."},{"key":"e_1_2_1_101_1","volume-title":"Ross B. Girshick, and Ali Farhadi.","author":"Redmon Joseph","year":"2015","unstructured":"Joseph Redmon , Santosh Kumar Divvala , Ross B. Girshick, and Ali Farhadi. 2015 . You only look once: Unified, real-time object detection. In arXiv:1506.02640. Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. In arXiv:1506.02640."},{"key":"e_1_2_1_102_1","doi-asserted-by":"crossref","unstructured":"Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better faster stronger. In arXiv:1612.08242.  Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better faster stronger. In arXiv:1612.08242.","DOI":"10.1109\/CVPR.2017.690"},{"key":"e_1_2_1_103_1","volume-title":"Proceedings of the International Conference on International Conference on Machine Learning","volume":"48","author":"Reed Scott","year":"2016","unstructured":"Scott Reed , Zeynep Akata , Xinchen Yan , Lajanugen Logeswaran , Bernt Schiele , and Honglak Lee . 2016 . Generative adversarial text-to-image synthesis . In Proceedings of the International Conference on International Conference on Machine Learning , Vol. 48 . JMLR, 1060--1069. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. 
Generative adversarial text-to-image synthesis. In Proceedings of the International Conference on International Conference on Machine Learning, Vol. 48. JMLR, 1060--1069."},{"key":"e_1_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_2_1_105_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_2_1_106_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.61"},{"key":"e_1_2_1_107_1","doi-asserted-by":"crossref","unstructured":"Olaf Ronneberger Philipp Fischer and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In arXiv:1505.04597.  Olaf Ronneberger Philipp Fischer and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In arXiv:1505.04597.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"e_1_2_1_108_1","volume-title":"Williams","author":"Rumelhart David E.","year":"1988","unstructured":"David E. Rumelhart , Geoffrey E. Hinton , and Ronald J . Williams . 1988 . Learning Representations by Back-propagating Errors. In Neurocomputing : Foundations of Research. MIT Press , 696--699. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning Representations by Back-propagating Errors. In Neurocomputing: Foundations of Research. MIT Press, 696--699."},{"key":"e_1_2_1_109_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_1_110_1","volume-title":"On-line Learning in Neural Networks","author":"Saad David","unstructured":"David Saad . 1998. On-line Learning in Neural Networks . Cambridge University Press, New York , NY. David Saad. 1998. On-line Learning in Neural Networks. 
Cambridge University Press, New York, NY."},{"key":"e_1_2_1_111_1","doi-asserted-by":"publisher","DOI":"10.5555\/1886436.1886447"},{"key":"e_1_2_1_112_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.212"},{"key":"e_1_2_1_113_1","volume-title":"Proceedings of the International Conference on Neural Information Processing Systems, Vol 1. MIT Press","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014 . Two-stream convolutional networks for action recognition in videos . In Proceedings of the International Conference on Neural Information Processing Systems, Vol 1. MIT Press , Cambridge, 568--576. Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the International Conference on Neural Information Processing Systems, Vol 1. MIT Press, Cambridge, 568--576."},{"key":"e_1_2_1_114_1","unstructured":"K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556.  K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556."},{"key":"e_1_2_1_115_1","unstructured":"Shuochen Su Mauricio Delbracio Jue Wang Guillermo Sapiro Wolfgang Heidrich and Oliver Wang. 2016. Deep video deblurring. In arXiv:1611.08387.  Shuochen Su Mauricio Delbracio Jue Wang Guillermo Sapiro Wolfgang Heidrich and Oliver Wang. 2016. Deep video deblurring. In arXiv:1611.08387."},{"key":"e_1_2_1_116_1","volume-title":"Le","author":"Sutskever Ilya","year":"2014","unstructured":"Ilya Sutskever , Oriol Vinyals , and Quoc V . Le . 2014 . Sequence to sequence learning with neural networks. In NIPS. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS."},{"key":"e_1_2_1_117_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy Sergey Ioffe and Vincent Vanhoucke. 2016. 
Inception-v4 inception-ResNet and the impact of residual connections on learning. In arXiv:1602.07261.  Christian Szegedy Sergey Ioffe and Vincent Vanhoucke. 2016. Inception-v4 inception-ResNet and the impact of residual connections on learning. In arXiv:1602.07261.","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"e_1_2_1_118_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott E. Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich. 2014. Going deeper with convolutions. In arXiv:1409.4842.  Christian Szegedy Wei Liu Yangqing Jia Pierre Sermanet Scott E. Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich. 2014. Going deeper with convolutions. In arXiv:1409.4842.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_119_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy Vincent Vanhoucke Sergey Ioffe Jonathon Shlens and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. In arXiv:1512.00567.  Christian Szegedy Vincent Vanhoucke Sergey Ioffe Jonathon Shlens and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. In arXiv:1512.00567.","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_2_1_120_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_2_1_121_1","volume-title":"Lempitsky","author":"Ulyanov Dmitry","year":"2016","unstructured":"Dmitry Ulyanov , Andrea Vedaldi , and Victor S . Lempitsky . 2016 . Instance normalization: The missing ingredient for fast stylization. In arXiv:1607.08022. Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. 
In arXiv:1607.08022."},{"key":"e_1_2_1_122_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_2_1_123_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"e_1_2_1_124_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/N15-1173"},{"key":"e_1_2_1_125_1","unstructured":"Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. In arXiv:1603.07285.  Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. In arXiv:1603.07285."},{"key":"e_1_2_1_126_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_127_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 203--212","author":"Wu Qi","unstructured":"Qi Wu , Chunhua Shen , Lingqiao Liu , Anthony Dick , and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 203--212 . Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems?. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 203--212."},{"key":"e_1_2_1_128_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964328"},{"key":"e_1_2_1_129_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806222"},{"key":"e_1_2_1_130_1","unstructured":"Saining Xie Ross B. Girshick Piotr Doll\u00e1r Zhuowen Tu and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. In arXiv:1611.05431.  Saining Xie Ross B. Girshick Piotr Doll\u00e1r Zhuowen Tu and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. 
In arXiv:1611.05431."},{"key":"e_1_2_1_131_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_2_1_132_1","volume-title":"Proceedings of the International Conference on Machine Learning. PMLR","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu , Jimmy Ba , Ryan Kiros , Kyunghyun Cho , 2015 . Show, attend and tell: Neural image caption generation with visual attention . In Proceedings of the International Conference on Machine Learning. PMLR , 2048--2057. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, et al. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. PMLR, 2048--2057."},{"key":"e_1_2_1_133_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.41"},{"key":"e_1_2_1_134_1","doi-asserted-by":"publisher","DOI":"10.5555\/2145432.2145484"},{"key":"e_1_2_1_135_1","volume-title":"Salakhutdinov","author":"Yang Zhilin","year":"2016","unstructured":"Zhilin Yang , Ye Yuan , Yuexin Wu , William W. Cohen , and Ruslan R . Salakhutdinov . 2016 . Review networks for caption generation. In Proceedings of the Advances in Neural Information Processing Systems . 2361--2369. Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan R. Salakhutdinov. 2016. Review networks for caption generation. In Proceedings of the Advances in Neural Information Processing Systems. 2361--2369."},{"key":"e_1_2_1_136_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_2_1_137_1","volume-title":"CVPR ActivityNet Challenge Workshop.","author":"Yao Ting","year":"2017","unstructured":"Ting Yao , Yehao Li , Zhaofan Qiu , Fuchen Long , Yingwei Pan , Dong Li , and Tao Mei . 2017 . MSR Asia MSM at ActivityNet challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos . In CVPR ActivityNet Challenge Workshop. Ting Yao, Yehao Li, Zhaofan Qiu, Fuchen Long, Yingwei Pan, Dong Li, and Tao Mei. 
2017. MSR Asia MSM at ActivityNet challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos. In CVPR ActivityNet Challenge Workshop."},{"key":"e_1_2_1_138_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.12"},{"key":"e_1_2_1_139_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.559"},{"key":"e_1_2_1_140_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_2_1_141_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_2_1_142_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.310"},{"key":"e_1_2_1_143_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00256"},{"key":"e_1_2_1_144_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_2_1_145_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_20"},{"key":"e_1_2_1_146_1","unstructured":"Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. In arXiv:1511.07122.  Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. In arXiv:1511.07122."},{"key":"e_1_2_1_147_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.496"},{"key":"e_1_2_1_148_1","doi-asserted-by":"crossref","unstructured":"H. Zhang T. Xu H. Li S. Zhang X. Huang X. Wang and D. Metaxas. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In arXiv:1612.03242.  H. Zhang T. Xu H. Li S. Zhang X. Huang X. Wang and D. Metaxas. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In arXiv:1612.03242.","DOI":"10.1109\/ICCV.2017.629"},{"key":"e_1_2_1_149_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6848--6856","author":"Zhang Xiangyu","year":"2017","unstructured":"Xiangyu Zhang , Xinyu Zhou , Mengxiao Lin , and Jian Sun . 2017 . 
ShuffleNet: An extremely efficient convolutional neural network for mobile devices . Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6848--6856 . Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2017. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6848--6856."},{"key":"e_1_2_1_150_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00712"},{"key":"e_1_2_1_151_1","unstructured":"Hengshuang Zhao Jianping Shi Xiaojuan Qi Xiaogang Wang and Jiaya Jia. 2016. Pyramid scene parsing network. In arXiv:1612.01105.  Hengshuang Zhao Jianping Shi Xiaojuan Qi Xiaogang Wang and Jiaya Jia. 2016. Pyramid scene parsing network. In arXiv:1612.01105."},{"key":"e_1_2_1_152_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6154--6162","author":"Cai Zhaowei","year":"2018","unstructured":"Zhaowei Cai and Nuno Vasconcelos . 2018 . Cascade R-CNN: Delving into high quality object detection . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6154--6162 . Zhaowei Cai and Nuno Vasconcelos. 2018. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6154--6162."},{"key":"e_1_2_1_153_1","unstructured":"Luowei Zhou Chenliang Xu Parker Koch and Jason J Corso. 2016. Image caption generation with text-conditional semantic attention. In arXiv:1606.04621.  Luowei Zhou Chenliang Xu Parker Koch and Jason J Corso. 2016. Image caption generation with text-conditional semantic attention. 
In arXiv:1606.04621."},{"key":"e_1_2_1_154_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.244"},{"key":"e_1_2_1_155_1","doi-asserted-by":"crossref","unstructured":"Wentao Zhu Xiang Xiang Trac D. Tran and Xiaohui Xie. 2016. Adversarial deep structural networks for mammographic mass segmentation. In arXiv:1612.05970.  Wentao Zhu Xiang Xiang Trac D. Tran and Xiaohui Xie. 2016. Adversarial deep structural networks for mammographic mass segmentation. In arXiv:1612.05970.","DOI":"10.1101\/095786"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3279952","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3279952","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:01:50Z","timestamp":1750208510000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3279952"}},"subtitle":["A Review"],"short-title":[],"issued":{"date-parts":[[2019,1,24]]},"references-count":155,"journal-issue":{"issue":"1s","published-print":{"date-parts":[[2019,1,31]]}},"alternative-id":["10.1145\/3279952"],"URL":"https:\/\/doi.org\/10.1145\/3279952","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,1,24]]},"assertion":[{"value":"2018-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2019-01-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}