{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T06:57:09Z","timestamp":1760597829461,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2018,10,15]],"date-time":"2018-10-15T00:00:00Z","timestamp":1539561600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Natural Science Foundation of China (NSFC)","award":["No. 61628206"],"award-info":[{"award-number":["No. 61628206"]}]},{"name":"Australian Research Council (ARC)","award":["FT130101530"],"award-info":[{"award-number":["FT130101530"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2018,10,15]]},"DOI":"10.1145\/3240508.3240583","type":"proceedings-article","created":{"date-parts":[[2018,10,18]],"date-time":"2018-10-18T17:52:08Z","timestamp":1539885128000},"page":"672-680","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":23,"title":["Look Deeper See Richer"],"prefix":"10.1145","author":[{"given":"Ziwei","family":"Wang","sequence":"first","affiliation":[{"name":"The University of Queensland, Brisbane, Australia"}]},{"given":"Yadan","family":"Luo","sequence":"additional","affiliation":[{"name":"The University of Queensland, Brisbane, Australia"}]},{"given":"Yang","family":"Li","sequence":"additional","affiliation":[{"name":"The University of Queensland, Brisbane, Australia"}]},{"given":"Zi","family":"Huang","sequence":"additional","affiliation":[{"name":"The University of Queensland, Brisbane, Australia"}]},{"given":"Hongzhi","family":"Yin","sequence":"additional","affiliation":[{"name":"The University of Queensland, Brisbane, 
Australia"}]}],"member":"320","published-online":{"date-parts":[[2018,10,15]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization@ACL","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie . 2005 . METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments . In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization@ACL 2005, Ann Arbor, Michigan. 65--72. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization@ACL 2005, Ann Arbor, Michigan. 65--72."},{"key":"e_1_3_2_1_2_1","volume-title":"Heng Tao Shen, and Xuelong Li","author":"Bin Yi","year":"2018","unstructured":"Yi Bin , Yang Yang , Fumin Shen , Ning Xie , Heng Tao Shen, and Xuelong Li . 2018 . Describing Video with Attention based Bidirectional LSTM. IEEE Transactions on Cybernetics ( 2018). Yi Bin, Yang Yang, Fumin Shen, Ning Xie, Heng Tao Shen, and Xuelong Li. 2018. Describing Video with Attention based Bidirectional LSTM. IEEE Transactions on Cybernetics (2018)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_3_1","DOI":"10.1145\/3123266.3123391"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_4_1","DOI":"10.1109\/TIP.2016.2621673"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_5_1","DOI":"10.1145\/3219819.3219986"},{"unstructured":"X. Chen H. Fang TY Lin R. Vedantam S. Gupta P. Doll\u00c3r and C. L. Zitnick. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325 (2015).  X. Chen H. Fang TY Lin R. Vedantam S. Gupta P. Doll\u00c3r and C. L. Zitnick. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. 
arXiv:1504.00325 (2015).","key":"e_1_3_2_1_6_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_7_1","DOI":"10.1109\/TPAMI.2016.2599174"},{"key":"e_1_3_2_1_8_1","volume-title":"Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems","author":"Eigen David","year":"2014","unstructured":"David Eigen , Christian Puhrsch , and Rob Fergus . 2014 . Depth Map Prediction from a Single Image using a Multi-Scale Deep Network . In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems , Montreal, Quebec, Canada. 2366--2374. David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems , Montreal, Quebec, Canada. 2366--2374."},{"volume-title":"11th European Conference on Computer Vision, Heraklion, Crete, Greece, September, Proceedings, Part IV. 15--29","author":"Farhadi Ali","unstructured":"Ali Farhadi , Seyyed Mohammad Mohsen Hejrati , Mohammad Amin Sadeghi , Peter Young , Cyrus Rashtchian , Julia Hockenmaier , and David A. Forsyth . 2010. Every Picture Tells a Story: Generating Sentences from Images. In Computer Vision - ECCV , 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September, Proceedings, Part IV. 15--29 . Ali Farhadi, Seyyed Mohammad Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David A. Forsyth. 2010. Every Picture Tells a Story: Generating Sentences from Images. In Computer Vision - ECCV, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September, Proceedings, Part IV. 
15--29.","key":"e_1_3_2_1_9_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_10_1","DOI":"10.1109\/TMM.2017.2729019"},{"key":"e_1_3_2_1_11_1","volume-title":"Reid","author":"Garg Ravi","year":"2016","unstructured":"Ravi Garg , B. G. Vijay Kumar , Gustavo Carneiro , and Ian D . Reid . 2016 . Unsupervised CNN for Single View Depth Estimation : Geometry to the Rescue. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part VIII. 740--756. Ravi Garg, B. G. Vijay Kumar, Gustavo Carneiro, and Ian D. Reid. 2016. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part VIII. 740--756."},{"volume-title":"Unsupervised Monocular Depth Estimation with Left-Right Consistency. In IEEE Conference on Computer Vision and Pattern Recognition","author":"Godard Cl\u00e9ment","unstructured":"Cl\u00e9ment Godard , Oisin Mac Aodha , and Gabriel J. Brostow . 2017 . Unsupervised Monocular Depth Estimation with Left-Right Consistency. In IEEE Conference on Computer Vision and Pattern Recognition , Honolulu, HI, USA. 6602--6611. Cl\u00e9ment Godard, Oisin Mac Aodha, and Gabriel J. Brostow. 2017. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA. 6602--6611.","key":"e_1_3_2_1_12_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_13_1","DOI":"10.1145\/2964284.2967242"},{"key":"e_1_3_2_1_14_1","volume-title":"Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He , Xiangyu Zhang , Shaoqing Ren , and Jian Sun . 2016 . Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition , Las Vegas, NV, USA. 770--778. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA. 770--778."},{"key":"e_1_3_2_1_15_1","volume-title":"Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence","author":"Hodosh Micah","year":"2015","unstructured":"Micah Hodosh , Peter Young , and Julia Hockenmaier . 2015 . Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract) . In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence , Buenos Aires, Argentina. 4188--4192. Micah Hodosh, Peter Young, and Julia Hockenmaier. 2015. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract). In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina. 4188--4192."},{"key":"e_1_3_2_1_16_1","volume-title":"DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In IEEE Conference on Computer Vision and Pattern Recognition","author":"Johnson Justin","year":"2016","unstructured":"Justin Johnson , Andrej Karpathy , and Li Fei-Fei . 2016 . DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In IEEE Conference on Computer Vision and Pattern Recognition , Las Vegas, NV, USA. 4565--4574. Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA. 4565--4574."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_17_1","DOI":"10.1109\/TPAMI.2016.2598339"},{"key":"e_1_3_2_1_18_1","volume-title":"Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. 
In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy , Armand Joulin , and Fei-Fei Li . 2014 . Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems , Montreal, Quebec, Canada. 1889--1897. Andrej Karpathy, Armand Joulin, and Fei-Fei Li. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, Montreal, Quebec, Canada. 1889--1897."},{"key":"e_1_3_2_1_19_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba . 2014 . Adam : A Method for Stochastic Optimization. CoRR abs\/1412.6980 (2014). Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs\/1412.6980 (2014)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_20_1","DOI":"10.1109\/CVPR.2017.356"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_21_1","DOI":"10.1109\/ICCV.2017.142"},{"volume-title":"Recurrent Topic-Transition GAN for Visual Paragraph Generation. In IEEE International Conference on Computer Vision","author":"Liang Xiaodan","unstructured":"Xiaodan Liang , Zhiting Hu , Hao Zhang , Chuang Gan , and Eric P. Xing . 2017 . Recurrent Topic-Transition GAN for Visual Paragraph Generation. In IEEE International Conference on Computer Vision , Venice, Italy. 3382--3391. Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P. Xing. 2017. Recurrent Topic-Transition GAN for Visual Paragraph Generation. In IEEE International Conference on Computer Vision, Venice, Italy. 
3382--3391.","key":"e_1_3_2_1_22_1"},{"volume-title":"Computer Vision -- ECCV, David Fleet, Tomas Pajdla","author":"Lin Tsung-Yi","unstructured":"Tsung-Yi Lin , Michael Maire , Serge Belongie , James Hays , Pietro Perona , Deva Ramanan , Piotr Doll\u00e1r , and C. Lawrence Zitnick . 2014. Microsoft COCO: Common Objects in Context . In Computer Vision -- ECCV, David Fleet, Tomas Pajdla , Bernt Schiele , and Tinne Tuytelaars (Eds.). Springer International Publishing , Cham, 740--755. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision -- ECCV, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 740--755.","key":"e_1_3_2_1_23_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_24_1","DOI":"10.1109\/TPAMI.2015.2505283"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_25_1","DOI":"10.1109\/ICCV.2017.100"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_26_1","DOI":"10.1109\/CVPR.2017.345"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_27_1","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_1_28_1","volume-title":"Areas of Attention for Image Captioning. In IEEE International Conference on Computer Vision","author":"Pedersoli Marco","year":"2017","unstructured":"Marco Pedersoli , Thomas Lucas , Cordelia Schmid , and Jakob Verbeek . 2017 . Areas of Attention for Image Captioning. In IEEE International Conference on Computer Vision , Venice, Italy. 1251--1259. Marco Pedersoli, Thomas Lucas, Cordelia Schmid, and Jakob Verbeek. 2017. Areas of Attention for Image Captioning. In IEEE International Conference on Computer Vision, Venice, Italy. 1251--1259."},{"key":"e_1_3_2_1_29_1","volume-title":"Sequence Level Training with Recurrent Neural Networks. 
CoRR abs\/1511.06732","author":"Ranzato Marc'Aurelio","year":"2015","unstructured":"Marc'Aurelio Ranzato , Sumit Chopra , Michael Auli , and Wojciech Zaremba . 2015. Sequence Level Training with Recurrent Neural Networks. CoRR abs\/1511.06732 ( 2015 ). Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence Level Training with Recurrent Neural Networks. CoRR abs\/1511.06732 (2015)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_30_1","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_3_2_1_31_1","volume-title":"Self-Critical Sequence Training for Image Captioning. In IEEE Conference on Computer Vision and Pattern Recognition","author":"Rennie Steven J.","year":"2017","unstructured":"Steven J. Rennie , Etienne Marcheret , Youssef Mroueh , Jarret Ross , and Vaibhava Goel . 2017 . Self-Critical Sequence Training for Image Captioning. In IEEE Conference on Computer Vision and Pattern Recognition , Honolulu, HI, USA. 1179--1195. Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-Critical Sequence Training for Image Captioning. In IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA. 1179--1195."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_32_1","DOI":"10.5555\/3172077.3172270"},{"volume-title":"Graph-Structured Representations for Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition","author":"Teney Damien","unstructured":"Damien Teney , Lingqiao Liu , and Anton van den Hengel. 2017 . Graph-Structured Representations for Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition , Honolulu, HI, USA. Damien Teney, Lingqiao Liu, and Anton van den Hengel. 2017. Graph-Structured Representations for Visual Question Answering. 
In IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","key":"e_1_3_2_1_33_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_34_1","DOI":"10.1109\/CVPR.2015.7299087"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_35_1","DOI":"10.1109\/CVPR.2015.7298935"},{"volume-title":"IEEE Conference on Computer Vision and Pattern Recognition","author":"Shen Chunhua","unstructured":"Qi Wu , Chunhua Shen , Lingqiao Liu , Anthony R. Dick , and Anton van den Hengel. 2016. What Value Do Explicit High Level Concepts Have in Vision to Language Problems? . In IEEE Conference on Computer Vision and Pattern Recognition , Las Vegas, NV, USA. 203--212. Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony R. Dick, and Anton van den Hengel. 2016. What Value Do Explicit High Level Concepts Have in Vision to Language Problems?. In IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA. 203--212.","key":"e_1_3_2_1_36_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_37_1","DOI":"10.1145\/3123266.3123448"},{"key":"e_1_3_2_1_38_1","volume-title":"Proceedings of the 32nd International Conference on Machine Learning","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu , Jimmy Ba , Ryan Kiros , Kyunghyun Cho , Aaron C. Courville , Ruslan Salakhutdinov , Richard S. Zemel , and Yoshua Bengio . 2015 . Show, Attend and Tell: Neural Image Caption Generation with Visual Attention . In Proceedings of the 32nd International Conference on Machine Learning , Lille, France. 2048--2057. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France. 
2048--2057."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_39_1","DOI":"10.1109\/TIP.2018.2855422"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_40_1","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_3_2_1_41_1","volume-title":"Boosting Image Captioning with Attributes. In IEEE International Conference on Computer Vision","author":"Yao Ting","year":"2017","unstructured":"Ting Yao , Yingwei Pan , Yehao Li , Zhaofan Qiu , and Tao Mei . 2017 . Boosting Image Captioning with Attributes. In IEEE International Conference on Computer Vision , Venice, Italy. 4904--4912. Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting Image Captioning with Attributes. In IEEE International Conference on Computer Vision, Venice, Italy. 4904--4912."},{"key":"e_1_3_2_1_42_1","volume-title":"SPTF: A Scalable Probabilistic Tensor Factorization Model for Semantic-Aware Behavior Prediction. In IEEE International Conference on Data Mining","author":"Yin Hongzhi","year":"2017","unstructured":"Hongzhi Yin , Hongxu Chen , Xiaoshuai Sun , Hao Wang , Yang Wang , and Quoc Viet Hung Nguyen . 2017 . SPTF: A Scalable Probabilistic Tensor Factorization Model for Semantic-Aware Behavior Prediction. In IEEE International Conference on Data Mining , New Orleans, LA, USA,. 585--594. Hongzhi Yin, Hongxu Chen, Xiaoshuai Sun, Hao Wang, Yang Wang, and Quoc Viet Hung Nguyen. 2017. SPTF: A Scalable Probabilistic Tensor Factorization Model for Semantic-Aware Behavior Prediction. In IEEE International Conference on Data Mining, New Orleans, LA, USA,. 585--594."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_43_1","DOI":"10.1007\/978-3-642-25853-4_29"},{"key":"e_1_3_2_1_44_1","volume-title":"Image Captioning with Semantic Attention. In IEEE Conference on Computer Vision and Pattern Recognition","author":"You Quanzeng","year":"2016","unstructured":"Quanzeng You , Hailin Jin , ZhaowenWang, Chen Fang , and Jiebo Luo . 2016 . Image Captioning with Semantic Attention. 
In IEEE Conference on Computer Vision and Pattern Recognition , Las Vegas, NV, USA. 4651--4659. Quanzeng You, Hailin Jin, ZhaowenWang, Chen Fang, and Jiebo Luo. 2016. Image Captioning with Semantic Attention. In IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA. 4651--4659."}],"event":{"sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"acronym":"MM '18","name":"MM '18: ACM Multimedia Conference","location":"Seoul Republic of Korea"},"container-title":["Proceedings of the 26th ACM international conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3240508.3240583","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3240508.3240583","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:57:34Z","timestamp":1750208254000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3240508.3240583"}},"subtitle":["Depth-aware Image Paragraph Captioning"],"short-title":[],"issued":{"date-parts":[[2018,10,15]]},"references-count":44,"alternative-id":["10.1145\/3240508.3240583","10.1145\/3240508"],"URL":"https:\/\/doi.org\/10.1145\/3240508.3240583","relation":{},"subject":[],"published":{"date-parts":[[2018,10,15]]},"assertion":[{"value":"2018-10-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}