{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:29:50Z","timestamp":1750220990943,"version":"3.41.0"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"2s","license":[{"start":{"date-parts":[[2019,4,30]],"date-time":"2019-04-30T00:00:00Z","timestamp":1556582400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100002858","name":"Postdoctoral Science Foundation of China","doi-asserted-by":"crossref","award":["2017M621795"],"award-info":[{"award-number":["2017M621795"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Nanjing University of Posts and Telecommunications Program","award":["NY218026"],"award-info":[{"award-number":["NY218026"]}]},{"name":"Nature Science Foundation of Jiangsu for Distinguished Young Scientist","award":["BK20170039"],"award-info":[{"award-number":["BK20170039"]}]},{"name":"Postdoctoral Research Plan of Jiangsu Province","award":["1701167B"],"award-info":[{"award-number":["1701167B"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61401218 and 61571238"],"award-info":[{"award-number":["61401218 and 61571238"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"crossref","award":["BK20150856"],"award-info":[{"award-number":["BK20150856"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. 
Appl."],"published-print":{"date-parts":[[2019,4,30]]},"abstract":"<jats:p>Visual Question Answering (VQA) is a hot spot at the intersection of computer vision and natural language processing research, and its progress has enabled many high-level applications. This work describes a novel VQA model based on semantic concept network construction and deep walk. Extracting visual image semantic representation is a significant and effective method for bridging the semantic gap. Moreover, current research has shown that co-occurrence patterns of concepts can enhance semantic representation. This work is motivated by the challenge that semantic concepts have complex interrelations whose structure resembles a network. Therefore, we construct a semantic concept network by leveraging Word Activation Forces (WAFs), and mine the co-occurrence patterns of semantic concepts using deep walk. The model then performs multinomial logistic regression on the basis of the extracted deep walk vector along with the visual image feature and question feature. The proposed model effectively integrates visual and semantic features of the image and natural language question. The experimental results show that our algorithm outperforms competitive baselines on three benchmark image QA datasets. 
Furthermore, through experiments in image annotation refinement and semantic analysis on the pre-labeled LabelMe dataset, we verify the effectiveness of our constructed concept network for mining concept co-occurrence patterns, sensible concept clusters, and hierarchies.<\/jats:p>","DOI":"10.1145\/3300938","type":"journal-article","created":{"date-parts":[[2019,7,3]],"date-time":"2019-07-03T13:47:53Z","timestamp":1562161673000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Semantic Concept Network and Deep Walk-based Visual Question Answering"],"prefix":"10.1145","volume":"15","member":"320","published-online":{"date-parts":[[2019,7,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00522"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and VQA. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918). 1--15."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201910)","author":"Choi Myung Jin","key":"e_1_2_1_4_1","unstructured":"Myung Jin Choi, Joseph J. Lim, Antonio Torralba, and Alan S. Willsky. 2010. Exploiting hierarchical context on a large database of object categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201910). 
129--136."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.48"},{"volume-title":"Encyclopedia of Language and Linguistics","author":"Fellbaum Christiane","key":"e_1_2_1_6_1","unstructured":"Christiane Fellbaum. 2005. WordNet and wordnets. In Encyclopedia of Language and Linguistics, Alex Barber (Ed.). Elsevier, 665--670."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2015.2469281"},{"key":"e_1_2_1_8_1","volume-title":"Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach.","author":"Fukui Akira","year":"2016","unstructured":"Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. Retrieved from http:\/\/arxiv.org\/abs\/1606.01847."},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Jun Guo Hanliang Guo and Zhanyi Wang. {n.d.}. An activation force-based affinity measure for analyzing complex networks. Retrieved from http:\/\/www.nature.com\/srep\/2011\/111012\/srep00113\/full\/srep00113.html.","DOI":"10.1038\/srep00113"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2614132"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2710635"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2043612.2043613"},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1109\/TCYB.2013.2265601","article-title":"Image annotation by multiple-instance learning with discriminative feature mapping and selection","volume":"44","author":"Hong Richang","year":"2014","unstructured":"Richang Hong, Meng Wang, Yue Gao, Dacheng Tao, Xuelong Li, and Xindong Wu. 2014. Image annotation by multiple-instance learning with discriminative feature mapping and selection. IEEE Trans. Cybernet. 44, 5 (May 2014), 669--680.","journal-title":"IEEE Trans. 
Cybernet."},{"key":"e_1_2_1_14_1","unstructured":"Ilija Ilievski Shuicheng Yan and Jiashi Feng. 2016. A focused dynamic attention model for visual question answering. Retrieved from http:\/\/arxiv.org\/abs\/1604.01485."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.538"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.06.005"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 5th International Conference on Learning Representations (ICLR\u201917)","author":"Kim Jin Hwa","year":"2017","unstructured":"Jin Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung Woo Ha, and Byoung Tak Zhang. 2017. Hadamard product for low-rank bilinear pooling. In Proceedings of the 5th International Conference on Learning Representations (ICLR\u201917). 1--14."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969607"},{"volume-title":"CognitiveCam: A Visual Question Answering Application","author":"Kolluru S. K.","key":"e_1_2_1_19_1","unstructured":"S. K. Kolluru, Shreyans Shrimal, and Sudharsan Krishnaswamy. 2017. CognitiveCam: A Visual Question Answering Application. Springer, Singapore, 85--90."},{"key":"e_1_2_1_20_1","volume-title":"Tell-and-answer: Towards explainable visual question answering using attributes and captions.","author":"Li Qing","year":"2018","unstructured":"Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. 2018. Tell-and-answer: Towards explainable visual question answering using attributes and captions. Retrieved from http:\/\/arxiv.org\/abs\/1801.09041."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2477035"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2624140"},{"key":"e_1_2_1_23_1","first-page":"1","article-title":"Deep collaborative embedding for social image understanding","volume":"99","author":"Li Zechao","year":"2018","unstructured":"Zechao Li, Jinhui Tang, and Tao Mei. 2018. 
Deep collaborative embedding for social image understanding. IEEE Trans. Pattern Anal. Mach. Intell. 99 (2018), 1--14.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2723400"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298917"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2014.2308433"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","unstructured":"Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Retrieved from http:\/\/arxiv.org\/abs\/1410.0210.","DOI":"10.5555\/2968826.2969014"},{"key":"e_1_2_1_28_1","volume-title":"Yuille","author":"Mao Junhua","year":"2014","unstructured":"Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). Retrieved from http:\/\/arxiv.org\/abs\/1412.6632."},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201913)","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR\u201913). 1--12."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999792.2999959"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.11"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","unstructured":"Bryan Perozzi Rami Al-Rfou and Steven Skiena. 2014. DeepWalk: Online learning of social representations. Retrieved from http:\/\/arxiv.org\/abs\/1403.6652. 
10.1145\/2623330.2623732","DOI":"10.1145\/2623330.2623732"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969570"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201913)","author":"Richardson Matthew","year":"2013","unstructured":"Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201913). 193--203."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-007-0090-8"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294996.3295124"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","unstructured":"Evan Shelhamer Jonathan Long and Trevor Darrell. 2016. Fully convolutional networks for semantic segmentation. Retrieved from http:\/\/arxiv.org\/abs\/1605.06211. 10.1109\/TPAMI.2016.2572683","DOI":"10.1109\/TPAMI.2016.2572683"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.499"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"e_1_2_1_40_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. Retrieved from http:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-2034"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00444"},{"key":"e_1_2_1_44_1","first-page":"1","article-title":"Biometric surveillance using visual question answering","volume":"13","author":"Toor Andeep S.","year":"2018","unstructured":"Andeep S. Toor, Harry Wechsler, and Michele Nappi. 2018. 
Biometric surveillance using visual question answering. Pattern Recogn. Lett. 13, 33 (2018), 1--8.","journal-title":"Pattern Recogn. Lett."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_46_1","first-page":"325","article-title":"Verb semantics and lexical selection","volume":"14","author":"Wang Ye Yi","year":"1994","unstructured":"Ye Yi Wang. 1994. Verb semantics and lexical selection. Comput. Sci. 14, 101 (1994), 325--327.","journal-title":"Comput. Sci."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.29"},{"key":"e_1_2_1_48_1","volume-title":"Dick","author":"Wu Qi","year":"2015","unstructured":"Qi Wu, Chunhua Shen, Anton van den Hengel, Lingqiao Liu, and Anthony R. Dick. 2015. Image captioning with an intermediate attributes layer. Retrieved from http:\/\/arxiv.org\/abs\/1506.01144."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.500"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2010.5540015"},{"key":"e_1_2_1_51_1","unstructured":"Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C. Courville Ruslan Salakhutdinov Richard S. Zemel and Yoshua Bengio. 2015. Show attend and tell: Neural image caption generation with visual attention. Retrieved from http:\/\/arxiv.org\/abs\/1502.03044."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2846664"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.202"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1029\/2018GL077787"},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the Workshop at the European Conference on Computer Vision (ECCV\u201914)","author":"Zhang Ziming","year":"2014","unstructured":"Ziming Zhang, Yuting Chen, and Venkatesh Saligrama. 2014. A novel visual word co-occurrence model for person re-identification. 
In Proceedings of the Workshop at the European Conference on Computer Vision (ECCV\u201914). 122--133."},{"key":"e_1_2_1_56_1","unstructured":"Bolei Zhou Yuandong Tian Sainbayar Sukhbaatar Arthur Szlam and Rob Fergus. 2015. Simple baseline for visual question answering. Retrieved from http:\/\/arxiv.org\/abs\/1512.02167."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2010.204"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3300938","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3300938","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:25:23Z","timestamp":1750206323000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3300938"}},"subtitle":[],"editor":[{"given":"Qun","family":"Li","sequence":"first","affiliation":[]},{"given":"Fu","family":"Xiao","sequence":"additional","affiliation":[]},{"given":"Le","family":"An","sequence":"additional","affiliation":[]},{"given":"Xianzhong","family":"Long","sequence":"additional","affiliation":[]},{"given":"Xiaochuan","family":"Sun","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2019,4,30]]},"references-count":57,"journal-issue":{"issue":"2s","published-print":{"date-parts":[[2019,4,30]]}},"alternative-id":["10.1145\/3300938"],"URL":"https:\/\/doi.org\/10.1145\/3300938","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2019,4,30]]},"assertion":[{"value":"2018-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publica
tion History"}},{"value":"2018-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}