{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T08:26:34Z","timestamp":1767860794330,"version":"3.49.0"},"reference-count":48,"publisher":"National Library of Serbia","issue":"1","license":[{"start":{"date-parts":[[2025,1,1]],"date-time":"2025-01-01T00:00:00Z","timestamp":1735689600000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["ComSIS","COMPUT SCI INF SYST","COMPUT SCI INFORM SY","COMPUTER SCI INFORM","COMSIS J"],"published-print":{"date-parts":[[2025]]},"abstract":"<jats:p>Visual Question Answering (VQA) is an emerging field of deep learning that combines image and question features and generates collaborative feature representations for classification by uniquely fusing the components. To enhance the effectiveness of models, it is crucial to fully utilize the semantic information from both text and vision. Some researchers have improved the accuracy of the model's training by either adding new features or enhancing the model's ability to extract more detailed information. However, these methods have made experimentation more challenging and expensive. We propose a model called asynchronous self-attention model (ASAM) that makes use of an asynchronous self-attention component and a controller, integrating the asynchronous self-attention mechanism and collaborative attention mechanism effectively to leverage the rich semantic information of the underlying visuals. It realizes an end-to-end training framework that can extract and exploit the rich representational information of the underlying visual images while performing coordinated attention with text features, as it does not over-emphasize fine-grained but finds a balance within it, thus allowing the model to learn more valuable information. 
Extensive ablation experiments were conducted on the proposed ASAM using the VQA v2 dataset to verify its effectiveness. The results of the experiments demonstrate that the proposed model outperforms other state-of-the-art models, without increasing the model complexity and the number of parameters.<\/jats:p>","DOI":"10.2298\/csis240321003l","type":"journal-article","created":{"date-parts":[[2025,1,21]],"date-time":"2025-01-21T08:54:23Z","timestamp":1737449663000},"page":"199-217","source":"Crossref","is-referenced-by-count":1,"title":["ASAM: Asynchronous self-attention model for visual question answering"],"prefix":"10.2298","volume":"22","author":[{"given":"Han","family":"Liu","sequence":"first","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dezhi","family":"Han","sequence":"additional","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shukai","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jingya","family":"Shi","sequence":"additional","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huafeng","family":"Wu","sequence":"additional","affiliation":[{"name":"Merchant Marine College, Shanghai Maritime University Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yachao","family":"Zhou","sequence":"additional","affiliation":[{"name":"Shanghai Anheng Times Information Technology Co., Ltd. 
Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kuan-Ching","family":"Li","sequence":"additional","affiliation":[{"name":"Dept of Computer Science and Information Engineering, Providence University Taiwan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1078","reference":[{"key":"ref1","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6077-6086 (2018)","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref2","doi-asserted-by":"crossref","unstructured":"Chen, C., Han, D., Chang, C.C.: Caan: Context-aware attention network for visual question answering. Pattern Recognition 132, 108980 (2022)","DOI":"10.1016\/j.patcog.2022.108980"},{"key":"ref3","doi-asserted-by":"crossref","unstructured":"Chen, C., Han, D., Chang, C.: MPCCT: multimodal vision-language learning paradigm with context-based compact transformer. Pattern Recognit. 147, 110084 (2024)","DOI":"10.1016\/j.patcog.2023.110084"},{"key":"ref4","doi-asserted-by":"crossref","unstructured":"Chen, C., Han, D., Shen, X.: CLVIN: complete language-vision interaction network for visual question answering. Knowl. Based Syst. 275, 110706 (2023)","DOI":"10.1016\/j.knosys.2023.110706"},{"key":"ref5","doi-asserted-by":"crossref","unstructured":"Diao, C., Zhang, D., Liang, W., Li, K.C., Hong, Y., Gaudiot, J.L.: A novel spatial-temporal multi-scale alignment graph neural network security model for vehicles prediction. 
IEEE Transactions on Intelligent Transportation Systems (2022)","DOI":"10.1109\/TITS.2022.3140229"},{"key":"ref6","doi-asserted-by":"crossref","unstructured":"Diao, C., Zhang, D., Liang, W., Li, K., Hong, Y., Gaudiot, J.: A novel spatial-temporal multiscale alignment graph neural network security model for vehicles prediction. IEEE Trans. Intell. Transp. Syst. 24(1), 904-914 (2023)","DOI":"10.1109\/TITS.2022.3140229"},{"key":"ref7","doi-asserted-by":"crossref","unstructured":"Ding, Y., Yu, J., Liu, B., Hu, Y., Cui, M., Wu, Q.: Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 5089-5098 (2022)","DOI":"10.1109\/CVPR52688.2022.00503"},{"key":"ref8","doi-asserted-by":"crossref","unstructured":"Fan, Y., Xu, B., Zhang, L., Song, J., Zomaya, A.Y., Li, K.: Validating the integrity of convolutional neural network predictions based on zero-knowledge proof. Inf. Sci. 625, 125-140 (2023)","DOI":"10.1016\/j.ins.2023.01.036"},{"key":"ref9","unstructured":"Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W.: Are you talking to a machine? dataset and methods for multilingual image question. Advances in neural information processing systems 28 (2015)"},{"key":"ref10","doi-asserted-by":"crossref","unstructured":"Gao, P., Jiang, Z., You, H., Lu, P., Hoi, S.C., Wang, X., Li, H.: Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp. 6639-6648 (2019)","DOI":"10.1109\/CVPR.2019.00680"},{"key":"ref11","doi-asserted-by":"crossref","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 
6904-6913 (2017)","DOI":"10.1109\/CVPR.2017.670"},{"key":"ref12","doi-asserted-by":"crossref","unstructured":"Guo, W., Zhang, Y., Yang, J., Yuan, X.: Re-attention for visual question answering. IEEE Trans. Image Process. 30, 6730-6743 (2021)","DOI":"10.1109\/TIP.2021.3097180"},{"key":"ref13","doi-asserted-by":"crossref","unstructured":"Guo, Z., Han, D.: Multi-modal explicit sparse attention networks for visual question answering. Sensors 20(23), 6758 (2020)","DOI":"10.3390\/s20236758"},{"key":"ref14","doi-asserted-by":"crossref","unstructured":"Guo, Z., Han, D.: Sparse co-attention visual question answering networks based on thresholds. Applied Intelligence 53(1), 586-600 (2023)","DOI":"10.1007\/s10489-022-03559-4"},{"key":"ref15","doi-asserted-by":"crossref","unstructured":"Guo, Z., Han, D., Massetto, F.I., Li, K.C.: Double-layer affective visual question answering network. Computer Science and Information Systems 18(1), 155-168 (2021)","DOI":"10.2298\/CSIS200515038G"},{"key":"ref16","doi-asserted-by":"crossref","unstructured":"Han, D., Zhou, S., Li, K.C., de Mello, R.F.: Cross-modality co-attention networks for visual question answering. Soft Computing 25, 5411-5421 (2021)","DOI":"10.1007\/s00500-020-05539-7"},{"key":"ref17","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 770-778. IEEE Computer Society (2016)","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref18","unstructured":"Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. Advances in neural information processing systems 31 (2018)"},{"key":"ref19","doi-asserted-by":"crossref","unstructured":"Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. 
International journal of computer vision 123, 32-73 (2017)","DOI":"10.1007\/s11263-016-0981-7"},{"key":"ref20","doi-asserted-by":"crossref","unstructured":"Li, H., Han, D., Chen, C., Chang, C., Li, K., Li, D.: A visual question answering network merging high- and low-level semantic information. IEICE Trans. Inf. Syst. 106(5), 581-589 (2023)","DOI":"10.1587\/transinf.2022DLP0002"},{"key":"ref21","doi-asserted-by":"crossref","unstructured":"Li, J., Han, D., Wu, Z., Wang, J., Li, K., Castiglione, A.: A novel system for medical equipment supply chain traceability based on alliance chain and attribute and role access control. Future Gener. Comput. Syst. 142, 195-211 (2023)","DOI":"10.1016\/j.future.2022.12.037"},{"key":"ref22","doi-asserted-by":"crossref","unstructured":"Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: Proceedings of the IEEE\/CVF international conference on computer vision. pp. 10313-10322 (2019)","DOI":"10.1109\/ICCV.2019.01041"},{"key":"ref23","doi-asserted-by":"crossref","unstructured":"Li, S., Gong, C., Zhu, Y., Luo, C., Hong, Y., Lv, X.: Context-aware multi-level question embedding fusion for visual question answering. Inf. Fusion 102, 102000 (2024)","DOI":"10.1016\/j.inffus.2023.102000"},{"key":"ref24","doi-asserted-by":"crossref","unstructured":"Liang, W., Yang, Y., Yang, C., Hu, Y., Xie, S., Li, K., Cao, J.: Pdpchain: A consortium blockchain-based privacy protection scheme for personal data. IEEE Trans. Reliab. 72(2), 586-598 (2023)","DOI":"10.1109\/TR.2022.3190932"},{"key":"ref25","unstructured":"Lin, W., Chen, J., Mei, J., Coca, A., Byrne, B.: Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) 
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (2023)"},{"key":"ref26","doi-asserted-by":"crossref","unstructured":"Long, J., Liang, W., Li, K.C., Wei, Y., Marino, M.D.: A regularized cross-layer ladder network for intrusion detection in industrial internet of things. IEEE Transactions on Industrial Informatics 19(2), 1747-1755 (2022)","DOI":"10.1109\/TII.2022.3204034"},{"key":"ref27","unstructured":"Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. Advances in neural information processing systems 29 (2016)"},{"key":"ref28","doi-asserted-by":"crossref","unstructured":"Ma, F., Zhou, Y., Rao, F., Zhang, Y., Sun, X.: Image captioning with multi-context synthetic data. In: Wooldridge, M.J., Dy, J.G., Natarajan, S. (eds.) Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada. pp. 4089-4097. AAAI Press (2024)","DOI":"10.1609\/aaai.v38i5.28203"},{"key":"ref29","doi-asserted-by":"crossref","unstructured":"Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: A neural-based approach to answering questions about images. In: Proceedings of the IEEE international conference on computer vision. pp. 1-9 (2015)","DOI":"10.1109\/ICCV.2015.9"},{"key":"ref30","doi-asserted-by":"crossref","unstructured":"Mao, A., Yang, Z., Lin, K., Xuan, J., Liu, Y.J.: Positional attention guided transformer-like architecture for visual question answering. IEEE Transactions on Multimedia (2022)","DOI":"10.1109\/TMM.2022.3216770"},{"key":"ref31","doi-asserted-by":"crossref","unstructured":"Nam, H., Ha, J.W., Kim, J.: Dual attention networks for multimodal reasoning and matching. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 299-307 (2017)","DOI":"10.1109\/CVPR.2017.232"},{"key":"ref32","doi-asserted-by":"crossref","unstructured":"Nguyen, B.X., Do, T., Tran, H., Tjiputra, E., Tran, Q.D., Nguyen, A.: Coarse-to-fine reasoning for visual question answering. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. pp. 4558-4566 (2022)","DOI":"10.1109\/CVPRW56347.2022.00502"},{"key":"ref33","doi-asserted-by":"crossref","unstructured":"Peng, L., Yang, Y., Wang, Z., Huang, Z., Shen, H.T.: Mra-net: Improving vqa via multi-modal relation attention network. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(1), 318-329 (2020)","DOI":"10.1109\/TPAMI.2020.3004830"},{"key":"ref34","doi-asserted-by":"crossref","unstructured":"Qin, B., Hu, H., Zhuang, Y.: Deep residual weight-sharing attention network with low-rank attention for visual question answering. IEEE Transactions on Multimedia (2022)","DOI":"10.1109\/TMM.2022.3173131"},{"key":"ref35","unstructured":"Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. Advances in neural information processing systems 28 (2015)"},{"key":"ref36","unstructured":"Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)"},{"key":"ref37","doi-asserted-by":"crossref","unstructured":"Shen, X., Han, D., Zong, L., Guo, Z., Hua, J.: Relational reasoning and adaptive fusion for visual question answering. Appl. Intell. 54(6), 5062-5080 (2024)","DOI":"10.1007\/s10489-024-05437-7"},{"key":"ref38","doi-asserted-by":"crossref","unstructured":"Sturman, D.J., Zeltzer, D.: A survey of glove-based input. 
IEEE Computer graphics and Applications 14(1), 30-39 (1994)","DOI":"10.1109\/38.250916"},{"key":"ref39","doi-asserted-by":"crossref","unstructured":"Teney, D., Anderson, P., He, X., Van Den Hengel, A.: Tips and tricks for visual question answering: Learnings from the 2017 challenge. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4223-4232 (2018)","DOI":"10.1109\/CVPR.2018.00444"},{"key":"ref40","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. pp. 5998-6008 (2017)"},{"key":"ref41","doi-asserted-by":"crossref","unstructured":"Wang, Y., Yasunaga, M., Ren, H., Wada, S., Leskovec, J.: Vqa-gnn: Reasoning with multimodal knowledge via graph neural networks for visual question answering. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision. pp. 21582-21592 (2023)","DOI":"10.1109\/ICCV51070.2023.01973"},{"key":"ref42","doi-asserted-by":"crossref","unstructured":"Xia, H., Lan, R., Li, H., Song, S.: ST-VQA: shrinkage transformer with accurate alignment for visual question answering. Appl. Intell. 53(18), 20967-20978 (2023)","DOI":"10.1007\/s10489-023-04564-x"},{"key":"ref43","doi-asserted-by":"crossref","unstructured":"Yan, S., Bai, M., Chen, W., Zhou, X., Huang, Q., Li, L.E.: Vigor: Improving visual grounding of large vision language models with fine-grained reward modeling. CoRR abs\/2402.06118 (2024)","DOI":"10.1007\/978-3-031-73030-6_3"},{"key":"ref44","doi-asserted-by":"crossref","unstructured":"Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 
21-29 (2016)","DOI":"10.1109\/CVPR.2016.10"},{"key":"ref45","doi-asserted-by":"crossref","unstructured":"Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. pp. 6281-6290. Computer Vision Foundation \/ IEEE (2019)","DOI":"10.1109\/CVPR.2019.00644"},{"key":"ref46","doi-asserted-by":"crossref","unstructured":"Yu, Z., Yu, J., Xiang, C., Fan, J., Tao, D.: Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE transactions on neural networks and learning systems 29(12), 5947-5959 (2018)","DOI":"10.1109\/TNNLS.2018.2817340"},{"key":"ref47","doi-asserted-by":"crossref","unstructured":"Zheng, W., Yin, L., Chen, X., Ma, Z., Liu, S., Yang, B.: Knowledge base graph embedding module design for visual question answering model. Pattern recognition 120, 108153 (2021)","DOI":"10.1016\/j.patcog.2021.108153"},{"key":"ref48","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Ren, T., Zhu, C., Sun, X., Liu, J., Ding, X., Xu, M., Ji, R.: TRAR: routing the attention spans in transformer for visual question answering. In: 2021 IEEE\/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp. 2054-2064. 
IEEE (2021)","DOI":"10.1109\/ICCV48922.2021.00208"}],"container-title":["Computer Science and Information Systems"],"original-title":[],"language":"en","deposited":{"date-parts":[[2025,3,5]],"date-time":"2025-03-05T09:25:44Z","timestamp":1741166744000},"score":1,"resource":{"primary":{"URL":"https:\/\/doiserbia.nb.rs\/Article.aspx?ID=1820-02142500003L"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":48,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025]]}},"URL":"https:\/\/doi.org\/10.2298\/csis240321003l","relation":{},"ISSN":["1820-0214","2406-1018"],"issn-type":[{"value":"1820-0214","type":"print"},{"value":"2406-1018","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025]]}}}