{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T18:26:14Z","timestamp":1777487174650,"version":"3.51.4"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2019,8,8]],"date-time":"2019-08-08T00:00:00Z","timestamp":1565222400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61673402, 61273270, and 60802069"],"award-info":[{"award-number":["61673402, 61273270, and 60802069"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003453","name":"Natural Science Foundation of Guangdong","doi-asserted-by":"crossref","award":["2017A030311029"],"award-info":[{"award-number":["2017A030311029"]}],"id":[{"id":"10.13039\/501100003453","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Key R&D Program of China","award":["2018YFB1601101"],"award-info":[{"award-number":["2018YFB1601101"]}]},{"name":"Science and Technology Program of Guangzhou","award":["201704020180 and 201604020024"],"award-info":[{"award-number":["201704020180 and 201604020024"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities of China","doi-asserted-by":"crossref","award":["17lgzd08"],"award-info":[{"award-number":["17lgzd08"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. 
Appl."],"published-print":{"date-parts":[[2019,8,31]]},"abstract":"<jats:p>In this article, we propose a novel Pseudo-3D Attention Transfer network with Content-aware Strategy (P3DAT-CAS) for the image captioning task. Our model is composed of three parts: the Pseudo-3D Attention (P3DA) network, the P3DA-based Transfer (P3DAT) network, and the Content-aware Strategy (CAS). First, we propose P3DA to take full advantage of three-dimensional (3D) information in convolutional feature maps and capture more details. Most existing attention-based models only extract the 2D spatial representation from convolutional feature maps to decide which area should be paid more attention to. However, convolutional feature maps are 3D, and different channel features can detect diverse semantic attributes associated with images. P3DA is proposed to combine 2D spatial maps with 1D semantic-channel attributes and generate more informative captions. Second, we design the transfer network to maintain and transfer the key previous attention information. Traditional attention-based approaches only utilize the current attention information to predict words directly, whereas the transfer network is able to learn long-term attention dependencies and explore global modeling patterns. Finally, we present CAS to provide a more relevant and distinct caption for each image. A captioning model trained by maximum likelihood estimation may generate captions that have a weak correlation with image contents, resulting in a cross-modal gap between vision and linguistics. CAS, in contrast, helps convey the meaningful visual contents accurately. 
P3DAT-CAS is evaluated on Flickr30k and MSCOCO, and it achieves very competitive performance among the state-of-the-art models.<\/jats:p>","DOI":"10.1145\/3336495","type":"journal-article","created":{"date-parts":[[2019,8,8]],"date-time":"2019-08-08T12:30:31Z","timestamp":1565267431000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Pseudo-3D Attention Transfer Network with Content-aware Strategy for Image Captioning"],"prefix":"10.1145","volume":"15","author":[{"given":"Jie","family":"Wu","sequence":"first","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4884-323X","authenticated-orcid":false,"given":"Haifeng","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China"}]},{"given":"Liang","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2019,8,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_2_1_3_1","volume-title":"Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","volume":"29","author":"Banerjee Satanjeev","year":"2005"},{"key":"e_1_2_1_5_1","unstructured":"Samy Bengio Oriol Vinyals Navdeep Jaitly and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. 
In Advances in Neural Information Processing Systems. 1171--1179."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 6706--6713","author":"Chen Hui","year":"2018"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.667"},{"key":"e_1_2_1_8_1","volume-title":"Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325","author":"Chen Xinlei","year":"2015"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2422--2431","author":"Chen Xinlei"},{"key":"e_1_2_1_10_1","volume-title":"Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467","author":"Devlin Jacob","year":"2015"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700--4708","author":"Huang Gao"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.277"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_2_1_18_1","volume-title":"Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980","author":"Kingma Diederik","year":"2014"},{"key":"e_1_2_1_19_1","volume-title":"Zemel","author":"Kiros Ryan","year":"2014"},{"key":"e_1_2_1_20_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012"},{"key":"e_1_2_1_21_1","volume-title":"Text Summarization Branches Out: Proceedings of the ACL-04 Workshop","volume":"8","author":"Lin Chin-Yew","year":"2004"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.443"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.100"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_2_1_25_1","volume-title":"Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632","author":"Mao Junhua","year":"2014"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the Annual Conference of the International Speech Communication Association. 1045--1048","author":"Mikolov Tom\u00e1\u0161"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 27th International Conference on Machine Learning (ICML-10)","author":"Nair Vinod"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_2_1_30_1","volume-title":"Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732","author":"Ranzato Marc\u2019Aurelio","year":"2015"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_1_33_1","volume-title":"Very deep convolutional networks for large-scale image recognition. 
arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2393347.2393424"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3226037"},{"key":"e_1_2_1_39_1","first-page":"1","article-title":"Image captioning using region-based attention joint with time-varying attention","volume":"49","author":"Wang Weixuan","year":"2019","journal-title":"Neural Process. Lett."},{"key":"e_1_2_1_40_1","unstructured":"Yingce Xia Fei Tian Lijun Wu Jianxin Lin Tao Qin Nenghai Yu and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems. 1782--1792."},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the International Conference on Machine Learning. 2048--2057","author":"Xu Kelvin","year":"2015"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1049\/el.2017.2351"},{"key":"e_1_2_1_43_1","first-page":"1","article-title":"Adaptive syncretic attention for constrained image captioning","volume":"49","author":"Yang Liang","year":"2019","journal-title":"Neural Process. 
Lett."},{"key":"e_1_2_1_44_1","volume-title":"Salakhutdinov","author":"Yang Zhilin","year":"2016"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3336495","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3336495","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T19:07:25Z","timestamp":1750273645000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3336495"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,8]]},"references-count":46,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2019,8,31]]}},"alternative-id":["10.1145\/3336495"],"URL":"https:\/\/doi.org\/10.1145\/3336495","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,8,8]]},"assertion":[{"value":"2018-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-08-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}