{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T21:54:34Z","timestamp":1740174874436,"version":"3.37.3"},"reference-count":40,"publisher":"Wiley","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61672338","61873160"],"award-info":[{"award-number":["61672338","61873160"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Mobile Information Systems"],"published-print":{"date-parts":[[2021,10,18]]},"abstract":"<jats:p>Visual question answering (VQA) is the natural language question-answering of visual images. The model of VQA needs to make corresponding answers according to specific questions based on understanding images, the most important of which is to understand the relationship between images and language. Therefore, this paper proposes a new model, Representation of Dense Multimodality Fusion Encoder Based on Transformer, for short, RDMMFET, which can learn the related knowledge between vision and language. The RDMMFET model consists of three parts: dense language encoder, image encoder, and multimodality fusion encoder. In addition, we designed three types of pretraining tasks: masked language model, masked image model, and multimodality fusion task. These pretraining tasks can help to understand the fine-grained alignment between text and image regions. Simulation results on the VQA v2.0 data set show that the RDMMFET model can work better than the previous model. Finally, we conducted detailed ablation studies on the RDMMFET model and provided the results of attention visualization, which proves that the RDMMFET model can significantly improve the effect of VQA.<\/jats:p>","DOI":"10.1155\/2021\/2662064","type":"journal-article","created":{"date-parts":[[2021,10,19]],"date-time":"2021-10-19T11:35:08Z","timestamp":1634643308000},"page":"1-9","source":"Crossref","is-referenced-by-count":0,"title":["RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer"],"prefix":"10.1155","volume":"2021","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8194-0018","authenticated-orcid":true,"given":"Xu","family":"Zhang","sequence":"first","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8861-5461","authenticated-orcid":true,"given":"DeZhi","family":"Han","sequence":"additional","affiliation":[{"name":"College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China"}]},{"given":"Chin-Chen","family":"Chang","sequence":"additional","affiliation":[{"name":"Department of Information Engineering and Computer Science, Feng Chia University, Taichung 40724, Taiwan"}]}],"member":"311","reference":[{"key":"1","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr42600.2020.01081"},{"key":"2","doi-asserted-by":"publisher","DOI":"10.1561\/0600000079"},{"key":"3","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"4","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0227222"},{"article-title":"Attention is all you need","year":"2017","author":"A. Vaswani","key":"5"},{"article-title":"Multimodal compact bilinear pooling for visual question answering and visual grounding","year":"2016","author":"A. Fukui","key":"6"},{"article-title":"Bilinear attention networks","year":"2018","author":"J. H. Kim","key":"7"},{"key":"8","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2019.00680"},{"key":"9","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2019.00644"},{"key":"10","doi-asserted-by":"publisher","DOI":"10.1109\/iccv.2019.00502"},{"article-title":"Bert: pre-training of deep bidirectional transformers for language understanding","year":"2018","author":"J. Devlin","key":"11"},{"key":"12","doi-asserted-by":"publisher","DOI":"10.1109\/iccv.2019.00756"},{"article-title":"Deep captioning with multimodal recurrent neural networks (m-rnn)","year":"2014","author":"J. Mao","key":"13"},{"first-page":"203","article-title":"What value do explicit high level concepts have in vision to language problems?","author":"Q. Wu","key":"14"},{"key":"15","doi-asserted-by":"publisher","DOI":"10.1007\/s00500-020-05539-7"},{"article-title":"Hadamard product for low-rank bilinear pooling","year":"2016","author":"J. H. Kim","key":"16"},{"key":"17","doi-asserted-by":"publisher","DOI":"10.1109\/iccv.2017.202"},{"key":"18","doi-asserted-by":"publisher","DOI":"10.1109\/tnnls.2018.2817340"},{"article-title":"Vilt: vision-and-language transformer without convolution or region supervision","year":"2021","author":"W. Kim","key":"19"},{"article-title":"Attention bottlenecks for multimodal fusion","year":"2021","author":"A. Nagrani","key":"20"},{"article-title":"Semi-supervised sequence learning","year":"2015","author":"A. M. Dai","key":"21"},{"key":"22","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2014.81"},{"key":"23","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n18-1202"},{"issue":"8","key":"24","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"A. Radford","year":"2019","journal-title":"OpenAI blog"},{"article-title":"Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","year":"2019","author":"J. Lu","key":"25"},{"key":"26","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/d19-1514"},{"key":"27","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/d19-1219"},{"key":"28","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"article-title":"Visualbert: a simple and performant baseline for vision and language","year":"2019","author":"L. H. Li","key":"29"},{"article-title":"Vl-bert: pre-training of generic visual-linguistic representations","year":"2019","author":"W. Su","key":"30"},{"article-title":"Faster R-CNN: towards real-time object detection with region proposal networks","year":"2015","author":"S. Ren","key":"31"},{"key":"32","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2018.00636"},{"issue":"1","key":"33","article-title":"Deep semantic role labeling with self-attention","volume":"32","author":"Z. Tan","year":"2018","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"34","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/p19-1580"},{"key":"35","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"36","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"37","first-page":"11","article-title":"Gqa: a new dataset for compositional question answering over real-world images","author":"D. A. Hudson","year":"2019"},{"key":"38","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2016.540"},{"article-title":"Adam: a method for stochastic optimization","year":"2014","author":"D. P. Kingma","key":"39"},{"article-title":"Multimodal unified attention networks for vision-and-language interactions","year":"2019","author":"Z. Yu","key":"40"}],"container-title":["Mobile Information Systems"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/downloads.hindawi.com\/journals\/misy\/2021\/2662064.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/journals\/misy\/2021\/2662064.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/journals\/misy\/2021\/2662064.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,10,19]],"date-time":"2021-10-19T11:35:23Z","timestamp":1634643323000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.hindawi.com\/journals\/misy\/2021\/2662064\/"}},"subtitle":[],"editor":[{"given":"Chin-Ling","family":"Chen","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":40,"alternative-id":["2662064","2662064"],"URL":"https:\/\/doi.org\/10.1155\/2021\/2662064","relation":{},"ISSN":["1875-905X","1574-017X"],"issn-type":[{"type":"electronic","value":"1875-905X"},{"type":"print","value":"1574-017X"}],"subject":[],"published":{"date-parts":[[2021,10,18]]}}}