{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:11:07Z","timestamp":1760238667624,"version":"build-2065373602"},"reference-count":46,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2020,8,30]],"date-time":"2020-08-30T00:00:00Z","timestamp":1598745600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61672338","61873160"],"award-info":[{"award-number":["61672338","61873160"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>At present, the state-of-the-art approaches of Visual Question Answering (VQA) mainly use the co-attention model to relate each visual object with text objects, which can achieve the coarse interactions between multimodalities. However, they ignore the dense self-attention within question modality. In order to solve this problem and improve the accuracy of VQA tasks, in the present paper, an effective Dense Co-Attention Networks (DCAN) is proposed. First, to better capture the relationship between words that are relatively far apart and make the extracted semantics more robust, the Bidirectional Long Short-Term Memory (Bi-LSTM) neural network is introduced to encode questions and answers; second, to realize the fine-grained interactions between the question words and image regions, a dense multimodal co-attention model is proposed. The model\u2019s basic components include the self-attention unit and the guided-attention unit, which are cascaded in depth to form a hierarchical structure. The experimental results on the VQA-v2 dataset show that DCAN has obvious performance advantages, which makes VQA applicable to a wider range of AI scenarios.<\/jats:p>","DOI":"10.3390\/s20174897","type":"journal-article","created":{"date-parts":[[2020,8,30]],"date-time":"2020-08-30T06:06:22Z","timestamp":1598767582000},"page":"4897","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["An Effective Dense Co-Attention Networks for Visual Question Answering"],"prefix":"10.3390","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4281-530X","authenticated-orcid":false,"given":"Shirong","family":"He","sequence":"first","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8861-5461","authenticated-orcid":false,"given":"Dezhi","family":"Han","sequence":"additional","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China"}]}],"member":"1968","published-online":{"date-parts":[[2020,8,30]]},"reference":[{"key":"ref_1","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., and Bengio, Y. (2015, January 6\u201311). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., and Schiele, B. (2016, January 11\u201314). Grounding of Textual Phrases in Images by Reconstruction. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_49"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"35662","DOI":"10.1109\/ACCESS.2020.2975093","article-title":"Multimodal Encoder-Decoder Attention Networks for Visual Question Answering","volume":"8","author":"Chen","year":"2020","journal-title":"IEEE Access"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Liang, W., Long, J., Li, C., Xu, J., Ma, N., and Lei, X. (2020). A Fast Defogging Image Recognition Algorithm based on Bilateral Hybrid Filtering. ACM Trans. Multimed. Comput. Commun. Appl.","DOI":"10.1145\/3391297"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1437","DOI":"10.1109\/TNNLS.2019.2920267","article-title":"Multimodal Deep Network Embedding with Integrated Structure and Attribute Information","volume":"31","author":"Zheng","year":"2019","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"4218","DOI":"10.1109\/JIOT.2020.2966870","article-title":"Edge-Computing-based Trustworthy Data Collection Model in the Internet of Things","volume":"7","author":"Wang","year":"2020","journal-title":"IEEE Internet Things J."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"6663","DOI":"10.1109\/TII.2019.2962844","article-title":"Privacy-Enhanced Data Collection Based on Deep Learning for Internet of Vehicles","volume":"16","author":"Wang","year":"2020","journal-title":"IEEE Trans Ind. Inform."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"9076","DOI":"10.1109\/JIOT.2019.2927497","article-title":"An Efficient and Safe Road Condition Monitoring Authentication Scheme Based on Fog Computing","volume":"6","author":"Cui","year":"2019","journal-title":"IEEE Internet Things J."},{"key":"ref_9","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2015, January 7\u20139). Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1242","DOI":"10.1109\/TPAMI.2018.2828437","article-title":"Visual Dialog","volume":"41","author":"Das","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Bigham, J.P., Jayant, C., Ji, H., Little, G., Miller, A., Miller, R., Tatarowicz, A., White, B., White, S., and Yeh, T. (2010). VizWiz: Nearly Real-time Answers to Visual Questions. User Interface Softw. Technol., 333\u2013342.","DOI":"10.1145\/1866029.1866080"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Wang, T., Luo, H., Zeng, X., Yu, Z., Liu, A., and Sangaiah, A.K. (2020). Mobility Based Trust Evaluation for Heterogeneous Electric Vehicles Network in Smart Cities. IEEE Trans. Intell. Transp.","DOI":"10.1109\/TITS.2020.2997377"},{"key":"ref_13","unstructured":"Mnih, V., Heess, N.M.O., Graves, A., and Kavukcuoglu, K. (2014, January 8\u201313). Recurrent Models of Visual Attention. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_14","unstructured":"Dozat, T., and Manning, C.D. (2017, January 24\u201326). Deep Biaffine Attention for Neural Dependency Parsing. Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France."},{"key":"ref_15","unstructured":"Han, D., Pan, N., and Li, K.-C. (2020). A Traceable and Revocable Ciphertext-policy Attribute-based Encryption Scheme Based on Privacy Protection. IEEE Trans. Dependable Secur. Comput."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Liang, W., Zhang, D., Lei, X., Tang, M., and Zomaya, Y. (2020). Circuit Copyright Blockchain: Blockchain-based Homomorphic Encryption for IP Circuit Protection. IEEE Trans. Emerg. Top. Comput.","DOI":"10.1109\/TETC.2020.2993032"},{"key":"ref_17","unstructured":"Lu, J., Yang, J., Batra, D., and Parikh, D. (2016, January 5\u201310). Hierarchical Question-Image Co-Attention for Visual Question Answering. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Yu, Z., Yu, J., Fan, J., and Tao, D. (2017, January 22\u201329). Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.202"},{"key":"ref_19","unstructured":"Kim, J.-H., Jun, J., and Zhang, B.-T. (2018, January 3\u20138). Bilinear Attention Networks. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Nguyen, D.-K., and Okatani, T. (2018, January 18\u201323). Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00637"},{"key":"ref_21","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention is All you Need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_22","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the NAACL-HLT, Minneapolis, MN, USA."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"18207","DOI":"10.1109\/ACCESS.2020.2968492","article-title":"Fabric-iot: A Blockchain-Based Access Control System in IoT","volume":"8","author":"Liu","year":"2020","journal-title":"IEEE Access"},{"key":"ref_24","unstructured":"Gao, P., You, H., Zhang, Z., Wang, X., and Li, H. (2019, January 23\u201325). Multi-modality latent interaction network for visual question answering. Proceedings of the IEEE International Conference on Computer Vision, Thessaloniki, Greece."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Yu, Z., Yu, J., Cui, Y., Tao, D., and Tian, Q. (2019, January 16\u201320). Deep Modular Co-Attention Networks for Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00644"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2019). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. Int. J. Comput. Vis., 398\u2013414.","DOI":"10.1007\/s11263-018-1116-0"},{"key":"ref_27","unstructured":"Hu, H., Chao, W.-L., and Sha, F. (2018, January 18\u201323). Learning Answer Embeddings for Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Shih, K.J., Singh, S., and Hoiem, D. (2016, January 27\u201330). Where to Look: Focus Regions for Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.499"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. (2016, January 1\u20135). Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA.","DOI":"10.18653\/v1\/D16-1044"},{"key":"ref_30","unstructured":"Li, R., and Jia, J. (2016, January 5\u201310). Visual Question Answering with Question Representation Update (QRU). Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Teney, D., Anderson, P., He, X., and Van Den Hengel, A. (2018, January 18\u201323). Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00444"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations","volume":"123","author":"Krishna","year":"2017","journal-title":"Int. J. Comput. Vis."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ben-younes, H., Cad\u00e8ne, R., Cord, M., and Thome, N. (2017, January 22\u201329). MUTAN: Multimodal Tucker Fusion for Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.285"},{"key":"ref_34","unstructured":"Wu, C., Liu, J., Wang, X., and Li, R. (February, January 27). Differential Networks for Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_35","unstructured":"Kim, J.-H., On, K.-W., Lim, W., Kim, J., Ha, J.-W., and Zhang, B.-T. (2017, January 24\u201326). Hadamard product for low-rank bilinear pooling. Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"5947","DOI":"10.1109\/TNNLS.2018.2817340","article-title":"Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering","volume":"29","author":"Yu","year":"2018","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_37","unstructured":"Yu, Z., Cui, Y., Yu, J., Tao, D., and Tian, Q. (2019). Multimodal Unified Attention Networks for Vision-and-Language Interactions. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C.D. (2014, January 25\u201329). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition Supplementary Materials. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft COCO: Common Objects in Context. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Gao, P., Jiang, Z., You, H., Lu, P., Hoi, S.C.H., Wang, X., and Li, H. (2019, January 16\u201320). Dynamic fusion with intra- and inter-modality attention flow for visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00680"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/j.jpdc.2019.04.007","article-title":"A Risk Defense Method Based on Microscopic State Prediction with Partial Information Observations in Social Networks","volume":"131","author":"Wu","year":"2019","journal-title":"J. Parallel Distrib. Comput."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"20493","DOI":"10.1109\/ACCESS.2020.2968853","article-title":"Reinforcement-Based Robust Variable Pitch Control of Wind Turbines","volume":"8","author":"Chen","year":"2020","journal-title":"IEEE Access"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"6392","DOI":"10.1109\/JIOT.2020.2974281","article-title":"Deep Reinforcement Learning for Resource Protection and Real-Time Detection in IoT Environment","volume":"7","author":"Liang","year":"2020","journal-title":"IEEE Internet Things J."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Tian, Q., Han, D., Li, K.-C., Liu, X., Duan, L., and Castiglione, A. (2020). An intrusion detection approach based on improved deep belief network. Appl. Intell.","DOI":"10.1007\/s10489-020-01694-4"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"179273","DOI":"10.1109\/ACCESS.2019.2956157","article-title":"EduRSS: A Blockchain-Based Educational Records Secure Storage and Sharing Scheme","volume":"7","author":"Li","year":"2019","journal-title":"IEEE Access"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/17\/4897\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:04:48Z","timestamp":1760177088000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/17\/4897"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,8,30]]},"references-count":46,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2020,9]]}},"alternative-id":["s20174897"],"URL":"https:\/\/doi.org\/10.3390\/s20174897","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2020,8,30]]}}}