{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T17:23:49Z","timestamp":1777051429892,"version":"3.51.4"},"reference-count":26,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2024,10,11]],"date-time":"2024-10-11T00:00:00Z","timestamp":1728604800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science and Technology Council, Taiwan","award":["NSTC112-2221-E-027-101"],"award-info":[{"award-number":["NSTC112-2221-E-027-101"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Information sharing on social media has become a common practice for people around the world. Since it is difficult to check user-generated content on social media, huge amounts of rumors and misinformation are being spread with authentic information. On the one hand, most of the social platforms identify rumors through manual fact-checking, which is very inefficient. On the other hand, with an emerging form of misinformation that contains inconsistent image\u2013text pairs, it would be beneficial if we could compare the meaning of multimodal content within the same post for detecting image\u2013text inconsistency. In this paper, we propose a novel approach to misinformation detection by multimodal feature fusion with transformers and credibility assessment with self-attention-based Bi-RNN networks. Firstly, captions are derived from images using an image captioning module to obtain their semantic descriptions. These are compared with surrounding text by fine-tuning transformers for consistency check in semantics. Then, to further aggregate sentiment features into text representation, we fine-tune a separate transformer for text sentiment classification, where the output is concatenated to augment text embeddings. Finally, Multi-Cell Bi-GRUs with self-attention are used to train the credibility assessment model for misinformation detection. From the experimental results on tweets, the best performance with an accuracy of 0.904 and an F1-score of 0.921 can be obtained when applying feature fusion of augmented embeddings with sentiment classification results. This shows the potential of the innovative way of applying transformers in our proposed approach to misinformation detection. Further investigation is needed to validate the performance on various types of multimodal discrepancies.<\/jats:p>","DOI":"10.3390\/bdcc8100134","type":"journal-article","created":{"date-parts":[[2024,10,11]],"date-time":"2024-10-11T08:10:16Z","timestamp":1728634216000},"page":"134","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Augmenting Multimodal Content Representation with Transformers for Misinformation Detection"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6076-7380","authenticated-orcid":false,"given":"Jenq-Haur","family":"Wang","sequence":"first","affiliation":[{"name":"Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 106, Taiwan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1701-2668","authenticated-orcid":false,"given":"Mehdi","family":"Norouzi","sequence":"additional","affiliation":[{"name":"Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA"}]},{"given":"Shu Ming","family":"Tsai","sequence":"additional","affiliation":[{"name":"Inventory Department, Cheng Hsin General Hospital, Taipei 112, Taiwan"}]}],"member":"1968","published-online":{"date-parts":[[2024,10,11]]},"reference":[{"key":"ref_1","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_2","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"e928","DOI":"10.7717\/peerj-cs.928","article-title":"Multi-modal affine fusion network for social media rumor detection","volume":"8","author":"Fu","year":"2022","journal-title":"PeerJ Comput. Sci."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Han, H., Ke, Z., Nie, X., Dai, L., and Slamu, W. (2023). Multimodal fusion with dual-attention based on textual double-embedding networks for rumor detection. Appl. Sci., 13.","DOI":"10.3390\/app13084886"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"120537","DOI":"10.1016\/j.eswa.2023.120537","article-title":"Multi-modal fusion using fine-tuned self-attention and transfer learning for veracity analysis of web information","volume":"229","author":"Meel","year":"2023","journal-title":"Expert Syst. Appl."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Huang, M., Jia, S., Chang, M.-C., and Lyu, S. (2022, January 7\u201313). Text-image de-contextualization detection using vision-language models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), Virtual.","DOI":"10.1109\/ICASSP43922.2022.9746193"},{"key":"ref_7","unstructured":"Aneja, S., Bregler, C., and Niessner, M. (2023, January 7\u201314). Cosmos: Catching out-of-context image misuse with self-supervised learning. Proceedings of the AAAI 2023, Washington DC, USA."},{"key":"ref_8","unstructured":"Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (May, January 26). ALBERT: A lite BERT for self-supervised learning of language representations. Proceedings of the ICLR 2020, Virtual."},{"key":"ref_9","unstructured":"Ma, J., Gao, W., Mitra, P., Kwon, S., Jansen, B.J., Wong, K.-F., and Cha, M. (2016, January 9\u201315). Detecting rumors from microblogs with recurrent neural networks. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Yu, F., Liu, Q., Wu, S., Wang, L., and Tan, T. (2017, January 19\u201325). A convolutional approach for misinformation identification. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), Melbourne, VIC, Australia.","DOI":"10.24963\/ijcai.2017\/545"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Chen, T., Li, X., Yin, H., and Zhang, J. (2018, January 3\u20136). Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018), Melbourne, VIC, Australia.","DOI":"10.1007\/978-3-030-04503-6_4"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Sampson, J., Morstatter, F., Wu, L., and Liu, H. (2016, January 24). Leveraging the implicit structure within social media for emergent rumor detection. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, IN, USA.","DOI":"10.1145\/2983323.2983697"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ma, J., Gao, W., and Wong, K.-F. (2018, January 23\u201327). Detect rumor and stance jointly by neural multi-task learning. Proceedings of the Companion Proceedings of the Web Conference 2018 (WWW 2018), Lyon, France.","DOI":"10.1145\/3184558.3188729"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Jin, Z., Cao, J., Guo, H., Zhang, Y., and Luo, J. (2017, January 23\u201327). Multimodal fusion with recurrent neural networks for rumor detection on microblogs. Proceedings of the 25th ACM international conference on Multimedia (MM 2017), Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123454"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1109\/MMUL.2022.3146568","article-title":"Multimodal fusion network with contrary latent topic memory for rumor detection","volume":"29","author":"Chen","year":"2022","journal-title":"IEEE MultiMedia"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1795","DOI":"10.1007\/s10796-022-10315-z","article-title":"Rumor classification through a multimodal fusion framework and ensemble learning","volume":"25","author":"Azri","year":"2023","journal-title":"Inf. Syst. Front."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Hu, X., Guo, Z., Chen, J., Wen, L., and Yu, P.S. (2023, January 23\u201327). Mr2: A benchmark for multimodal retrieval-augmented rumor detection in social media. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan.","DOI":"10.1145\/3539618.3591896"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Liu, X., Pang, M., Li, Q., Zhou, J., Wang, H., and Yang, D. (2024). MVACLNet: A multimodal virtual augmentation contrastive learning network for rumor detection. Algorithms, 17.","DOI":"10.3390\/a17050199"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"102097","DOI":"10.1016\/j.ipm.2019.102097","article-title":"An image-text consistency driven multimodal sentiment analysis approach for social media","volume":"56","author":"Zhao","year":"2019","journal-title":"Inf. Process. Manag."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Chen, H., Ding, G., Lin, Z., Zhao, S., and Han, J. (2019, January 21\u201325). Cross-modal image-text retrieval with semantic consistency. Proceedings of the 27th ACM International Conference on Multimedia (MM 2019), Nice, France.","DOI":"10.1145\/3343031.3351055"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Lee, K.-H., Chen, X., Hua, G., Hu, H., and He, X. (2018, January 8\u201314). Stacked cross attention for image-text matching. Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany.","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"M\u00fcller-Budack, E., Theiner, J., Diering, S., Idahl, M., and Ewerth, R. (2020, January 8\u201311). Multimodal analytics for real-world news using measures of cross-modal entity consistency. Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR 2020), Dublin, Ireland.","DOI":"10.1145\/3372278.3390670"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1177\/0146167203029005010","article-title":"Lying words: Predicting deception from linguistic styles","volume":"29","author":"Newman","year":"2003","journal-title":"Personal. Soc. Psychol. Bull."},{"key":"ref_24","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014, January 13). Empirical evaluation of gated recurrent neural networks on sequence modelling. Proceedings of the NIPS, Montreal, QC, Canada."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, J.-H., Norouzi, M., and Tsai, S.M. (2022, January 5\u201317). Multimodal content veracity assessment with bidirectional transformers and self-attention-based bi-GRU networks. Proceedings of the IEEE International Conference on Multimedia Big Data (BigMM 2022), Naples, Italy.","DOI":"10.1109\/BigMM55396.2022.00030"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/10\/134\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:11:08Z","timestamp":1760112668000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/10\/134"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,11]]},"references-count":26,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2024,10]]}},"alternative-id":["bdcc8100134"],"URL":"https:\/\/doi.org\/10.3390\/bdcc8100134","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,11]]}}}