{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T21:20:37Z","timestamp":1767907237479,"version":"3.49.0"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"2s","license":[{"start":{"date-parts":[[2019,4,30]],"date-time":"2019-04-30T00:00:00Z","timestamp":1556582400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["No. 2017YFB1002203"],"award-info":[{"award-number":["No. 2017YFB1002203"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,4,30]]},"abstract":"<jats:p>In recent years, Visual Question Answering (VQA) has attracted increasing attention due to its requirement on cross-modal understanding and reasoning of vision and language. VQA is proposed to automatically answer natural language questions with reference to a given image. VQA is challenging, because the reasoning process on a visual domain needs a full understanding of the spatial relationship, semantic concepts, as well as the common sense for a real image. However, most existing approaches jointly embed the abstract low-level visual features and high-level question features to infer answers. These works have limited reasoning ability due to the lack of modeling of the rich spatial context of regions, high-level semantics of images, and knowledge across multiple sources. 
To address these challenges, we propose multi-source multi-level attention networks for visual question answering that benefit both spatial inference, through visual attention on context-aware region representations, and reasoning, through semantic attention on concepts and external knowledge. Specifically, we learn to reason on image representations by question-guided attention at different levels across multiple sources, including region-level and concept-level representations from the image source and sentence-level representations from the external knowledge base. First, we encode region-based middle-level outputs from Convolutional Neural Networks (CNNs) into spatially embedded representations with a multi-directional two-dimensional recurrent neural network and then locate the answer-related regions with a multilayer perceptron as visual attention. Second, we generate semantic concepts from high-level semantics in CNNs and select the question-related concepts as concept attention. Third, we query semantic knowledge from a general knowledge base by concepts and select question-related knowledge as knowledge attention. Finally, we jointly optimize visual attention, concept attention, knowledge attention, and question embedding with a softmax classifier to infer the final answer. 
Extensive experiments show the proposed approach achieved significant improvement on two very challenging VQA datasets.<\/jats:p>","DOI":"10.1145\/3316767","type":"journal-article","created":{"date-parts":[[2019,7,19]],"date-time":"2019-07-19T13:17:14Z","timestamp":1563542234000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["Multi-source Multi-level Attention Networks for Visual Question Answering"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1294-3722","authenticated-orcid":false,"given":"Dongfei","family":"Yu","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, Anhui Province, China"}]},{"given":"Jianlong","family":"Fu","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"given":"Xinmei","family":"Tian","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, Anhui Province, China"}]},{"given":"Tao","family":"Mei","sequence":"additional","affiliation":[{"name":"JD AI Research, Chaoyang District Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2019,7,19]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1955--1960","author":"Agrawal A.","unstructured":"A. Agrawal, D. Batra, and D. Parikh. 2016. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1955--1960."},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086","author":"Anderson P.","unstructured":"P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 2612--2620","author":"Ben-younes H.","unstructured":"H. Ben-younes, R. Cadene, M. Cord, and N. Thome. 2017. MUTAN: Multimodal Tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2612--2620."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","unstructured":"A. Das H. Agrawal C. L. Zitnick D. Parikh and D. Batra. 2017. Human attention in visual question answering: Do humans and deep networks look at the same regions? Comput. Vis. Image Understand. 163 C (2017) 90--100. 10.1016\/j.cviu.2017.10.001","DOI":"10.1016\/j.cviu.2017.10.001"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625--2634","author":"Donahue J.","unstructured":"J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625--2634."},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1473--1482","author":"Fang H.","unstructured":"H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. 2015. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
1473--1482."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2014.2380211"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.230"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 457--468","author":"Fukui A.","unstructured":"A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 457--468."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","unstructured":"H. Gao J. Mao J. Zhou Z. Huang L. Wang and W. Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Advances in Neural Information Processing Systems. 2296--2304.","DOI":"10.5555\/2969442.2969496"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6904--6913","author":"Goyal Y.","unstructured":"Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6904--6913."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/1776814.1776875"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778","author":"He K.","unstructured":"K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
770--778."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2614132"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2710635"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2013.2265601"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3588--3597","author":"Hu H.","unstructured":"H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. 2018. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3588--3597."},{"key":"e_1_2_1_21_1","unstructured":"I. Ilievski S. Yan and J. Feng. 2016. A focused dynamic attention model for visual question answering. In arXiv preprint arXiv:1604.01485."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_44"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 1965--1973","author":"Kafle K.","unstructured":"K. Kafle and C. Kanan. 2017. An analysis of visual question answering algorithms. In Proceedings of the IEEE International Conference on Computer Vision. 1965--1973."},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 5th International Conference on Learning Representations.","author":"Kim J.","unstructured":"J. Kim, K. On, W. Lim, J. Kim, J. Ha, and B. Zhang. 2017. Hadamard product for low-rank bilinear pooling. In Proceedings of the 5th International Conference on Learning Representations."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","unstructured":"R. Kiros Y. Zhu R. Salakhutdinov R. S. Zemel A. Torralba R. Urtasun and S. Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 
3294--3302.","DOI":"10.5555\/2969442.2969607"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1338--1346","author":"Li Q.","unstructured":"Q. Li, J. Fu, D. Yu, T. Mei, and J. Luo. 2018. Tell-and-answer: Towards explainable visual question answering using attributes and captions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1338--1346."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6135--6143","author":"Liang L.","unstructured":"L. Liang, L. Jiang, L. Gao, L. Li, and A. Hauptmann. 2018. Focal visual-text attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6135--6143."},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the European Conference on Computer Vision. 740--755","author":"Lin T.","unstructured":"T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision. 740--755."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/3298239.3298450"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","unstructured":"J. Lu J. Yang D. Batra and D. Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems. 289--297.","DOI":"10.5555\/3157096.3157129"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.9"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 3rd International Conference on Learning Representations.","author":"Mao J.","unstructured":"J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. 2015. 
Deep captioning with multimodal recurrent neural networks (m-RNN). In Proceedings of the 3rd International Conference on Learning Representations."},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 299--307","author":"Nam H.","unstructured":"H. Nam, J. Ha, and J. Kim. 2017. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 299--307."},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 30--38","author":"Noh H.","unstructured":"H. Noh, P. H. Seo, and B. Han. 2016. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 30--38."},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4594--4602","author":"Pan Y.","unstructured":"Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4594--4602."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6504--6512","author":"Pan Y.","unstructured":"Y. Pan, T. Yao, H. Li, and T. Mei. 2017. Video captioning with transferred semantic attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6504--6512."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","unstructured":"M. Ren R. Kiros and R. S. Zemel. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems. 2953--2961.","DOI":"10.5555\/2969442.2969570"},{"key":"e_1_2_1_39_1","doi-asserted-by":"crossref","unstructured":"M. Rohrbach. 2017. 
Attributes as semantic units between natural language and visual recognition. In Visual Attributes. 301--330.","DOI":"10.1007\/978-3-319-50077-5_12"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","unstructured":"A. Santoro D. Raposo D. G. T. Barrett M. Malinowski R. Pascanu P. Battaglia and T. Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems. 4967--4976.","DOI":"10.5555\/3295222.3295250"},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4613--4621","author":"Shih K. J.","unstructured":"K. J. Shih, S. Singh, and D. Hoiem. 2016. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4613--4621."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 3rd International Conference on Learning Representations.","author":"Simonyan K.","unstructured":"K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations."},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156--3164","author":"Vinyals O.","unstructured":"O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156--3164."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3115432"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/3061053.3061108"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 203--212","author":"Wu Q.","unstructured":"Q. Wu, C. Shen, L. Liu, A. Dick, and A. Hengel. 2016. 
What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 203--212."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2708709"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4622--4630","author":"Wu Q.","unstructured":"Q. Wu, P. Wang, C. Shen, A. Dick, and A. Hengel. 2016. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4622--4630."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045643"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the European Conference on Computer Vision. 451--466","author":"Xu H.","unstructured":"H. Xu and K. Saenko. 2016. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In Proceedings of the European Conference on Computer Vision. 451--466."},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 21--29","author":"Yang Z.","unstructured":"Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 21--29."},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 4894--4902","author":"Yao T.","unstructured":"T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision. 4894--4902."},{"key":"e_1_2_1_53_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4651--4659","author":"You Q.","unstructured":"Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. 
2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4651--4659."},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4709--4717","author":"Yu D.","unstructured":"D. Yu, J. Fu, T. Mei, and Y. Rui. 2017. Multi-level attention network for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4709--4717."},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 1291--1300","author":"Zhu C.","unstructured":"C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma. 2017. Structured attentions for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 1291--1300."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3316767","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3316767","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:08:02Z","timestamp":1750208882000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3316767"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,4,30]]},"references-count":55,"journal-issue":{"issue":"2s","published-print":{"date-parts":[[2019,4,30]]}},"alternative-id":["10.1145\/3316767"],"URL":"https:\/\/doi.org\/10.1145\/3316767","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,4,30]]},"assertion":[{"value":"2018-06-01","order":0,"name":"received","label":"Rece
ived","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}