{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,29]],"date-time":"2026-05-29T20:16:13Z","timestamp":1780085773170,"version":"3.54.0"},"reference-count":53,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2023,10,7]],"date-time":"2023-10-07T00:00:00Z","timestamp":1696636800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2021YFC3300500"],"award-info":[{"award-number":["2021YFC3300500"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>The integration of information from multiple modalities is a highly active area of research. Previous techniques have predominantly focused on fusing shallow features or high-level representations generated by deep unimodal networks, which only capture a subset of the hierarchical relationships across modalities. However, previous methods are often limited to exploiting the fine-grained statistical features inherent in multimodal data. This paper proposes an approach that densely integrates representations by computing image features\u2019 means and standard deviations. The global statistics of features afford a holistic perspective, capturing the overarching distribution and trends inherent in the data, thereby facilitating enhanced comprehension and characterization of multimodal data. We also leverage a Transformer-based fusion encoder to effectively capture global variations in multimodal features. To further enhance the learning process, we incorporate a contrastive loss function that encourages the discovery of shared information across different modalities. To validate the effectiveness of our approach, we conduct experiments on three widely used multimodal sentiment analysis datasets. The results demonstrate the efficacy of our proposed method, achieving significant performance improvements compared to existing approaches.<\/jats:p>","DOI":"10.3390\/e25101421","type":"journal-article","created":{"date-parts":[[2023,10,7]],"date-time":"2023-10-07T14:03:03Z","timestamp":1696687383000},"page":"1421","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Multi-Modal Representation via Contrastive Learning with Attention Bottleneck Fusion and Attentive Statistics Features"],"prefix":"10.3390","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1552-9033","authenticated-orcid":false,"given":"Qinglang","family":"Guo","sequence":"first","affiliation":[{"name":"School of Cyber Science and Technology, University of Science and Technology of China, Heifei 230027, China"},{"name":"National Engineering Research Center for Public Safety Risk Perception and Control by Big Data (RPP), CETC Academy of Electronics and Information Technology Group Co., Ltd., China Academic of Electronics and Information Technology, Beijing 100041, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yong","family":"Liao","sequence":"additional","affiliation":[{"name":"National Engineering Research Center for Public Safety Risk Perception and Control by Big Data (RPP), CETC Academy of Electronics and Information Technology Group Co., Ltd., China Academic of Electronics and Information Technology, Beijing 100041, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0519-7434","authenticated-orcid":false,"given":"Zhe","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shenglin","family":"Liang","sequence":"additional","affiliation":[{"name":"School of Telecommunications Engineering, Xidian University, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2023,10,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Colombo, P., Chapuis, E., Labeau, M., and Clavel, C. (2021, January 7\u201311). Improving Multimodal fusion via Mutual Dependency Maximisation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual.","DOI":"10.18653\/v1\/2021.emnlp-main.21"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Han, W., Chen, H., and Poria, S. (2021, January 7\u201311). Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual.","DOI":"10.18653\/v1\/2021.emnlp-main.723"},{"key":"ref_3","unstructured":"Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A.Y. (July, January 28). Multimodal deep learning. Proceedings of the ICML, Bellevue, WA, USA."},{"key":"ref_4","first-page":"2949","article-title":"Multimodal learning with deep boltzmann machines","volume":"25","author":"Srivastava","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1692","DOI":"10.1109\/JPROC.2010.2057231","article-title":"Audiovisual information fusion in human\u2013computer interfaces and intelligent environments: A survey","volume":"98","author":"Shivappa","year":"2010","journal-title":"Proc. IEEE"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Feng, F., Wang, X., and Li, R. (2014, January 7). Cross-modal retrieval with correspondence autoencoder. Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA.","DOI":"10.1145\/2647868.2654902"},{"key":"ref_7","unstructured":"Oord, A.v.d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020, January 13\u201319). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"ref_9","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020, January 13\u201318). A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_10","unstructured":"Tian, Y., Krishnan, D., and Isola, P. (2020). Proceedings of the European Conference on Computer Vision, Springer."},{"key":"ref_11","unstructured":"Liu, Y., Yi, L., Zhang, S., Fan, Q., Funkhouser, T., and Dong, H. (2020). P4contrast: Contrastive learning with pairs of point-pixel pairs for rgb-d scene understanding. arXiv."},{"key":"ref_12","first-page":"25","article-title":"Self-supervised multimodal versatile networks","volume":"33","author":"Alayrac","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Murthygowda, M.Y., Krishnegowda, R.G., and Venkataramu, S.S. (2023). An integrated multi-level feature fusion framework for crowd behaviour prediction and analysis. Int. J. Electr. Comput. Eng. (IJECE), 30.","DOI":"10.11591\/ijeecs.v30.i3.pp1369-1380"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liang, M., Wei, M., Li, Y., Tian, H., and Li, Y. (2023). Improvement and Application of Fusion Scheme in Automatic Medical Image Analysis. Asian J. Sci. Technol.","DOI":"10.54097\/ajst.v5i3.8018"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"3466","DOI":"10.1109\/JBHI.2022.3165640","article-title":"Computer-aided recognition based on decision-level multimodal fusion for depression","volume":"26","author":"Zhang","year":"2022","journal-title":"IEEE J. Biomed. Health Inform."},{"key":"ref_16","first-page":"2007","article-title":"3D Vehicle Detection Algorithm Based on Multimodal Decision-Level Fusion","volume":"135","author":"Shi","year":"2023","journal-title":"CMES-Comput. Model. Eng. Sci."},{"key":"ref_17","unstructured":"Islam, M.M., and Iqbal, T. (March, January 22). Mumu: Cooperative multitask learning-based guided multimodal fusion. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Shankar, S. (2022, January 22\u201327). Multimodal fusion via cortical network inspired losses. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland.","DOI":"10.18653\/v1\/2022.acl-long.83"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"424","DOI":"10.1016\/j.inffus.2022.09.025","article-title":"Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions","volume":"91","author":"Gandhi","year":"2023","journal-title":"Inf. Fusion"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, Z., Mak, M.-W., and Meng, H.M.-L. (2023, January 4\u201310). Discriminative Speaker Representation Via Contrastive Learning with Class-Aware Attention in Angular Space. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10096230"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sheng, J., Lam, S.-K., Li, Z., Zhang, J., Teng, X., Zhang, Y., and Cai, J. (2023, January 12\u201315). Multi-view Contrastive Learning with Additive Margin for Adaptive Nasopharyngeal Carcinoma Radiotherapy Prediction. Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, Thessaloniki, Greece.","DOI":"10.1145\/3591106.3592261"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Li, Z., and Mak, M.-W. (2022, January 7\u201310). Speaker representation learning via contrastive loss with maximal speaker separability. Proceedings of the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand.","DOI":"10.23919\/APSIPAASC55919.2022.9980014"},{"key":"ref_23","first-page":"857","article-title":"Self-supervised learning: Generative or contrastive","volume":"35","author":"Liu","year":"2021","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"4037","DOI":"10.1109\/TPAMI.2020.2992393","article-title":"Self-supervised visual feature learning with deep neural networks: A survey","volume":"43","author":"Jing","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., and Makedon, F. (2020). A survey on contrastive self-supervised learning. Technologies, 9.","DOI":"10.3390\/technologies9010002"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"3570","DOI":"10.1109\/ACCESS.2020.3048088","article-title":"Knowledge-guided sentiment analysis via learning from natural language explanations","volume":"9","author":"Ke","year":"2021","journal-title":"IEEE Access"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"148489","DOI":"10.1109\/ACCESS.2020.3015854","article-title":"AgglutiFiT: Efficient low-resource agglutinative language model fine-tuning","volume":"8","author":"Li","year":"2020","journal-title":"IEEE Access"},{"key":"ref_28","unstructured":"Li, X., Li, Z., Sheng, J., and Slamu, W. (2020). Proceedings of the China National Conference on Chinese Computational Linguistics, Springer."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Yan, Y., Li, R., Wang, S., Zhang, F., Wu, W., and Xu, W. (2021). Consert: A contrastive framework for self-supervised sentence representation transfer. arXiv.","DOI":"10.18653\/v1\/2021.acl-long.393"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Gao, T., Yao, X., and Chen, D. (2021, January 7\u201311). SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual.","DOI":"10.18653\/v1\/2021.emnlp-main.552"},{"key":"ref_31","unstructured":"Wu, Z., Wang, S., Gu, J., Khabsa, M., Sun, F., and Ma, H. (2020). Clear: Contrastive learning for sentence representation. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Chen, X., and He, K. (2021, January 20\u201325). Exploring simple siamese representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01549"},{"key":"ref_33","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Huang, P.Y., Patrick, M., Hu, J., Neubig, G., Metze, F., and Hauptmann, A.G. (June, January 6\u2013). Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online.","DOI":"10.18653\/v1\/2021.naacl-main.195"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Yuan, X., Lin, Z., Kuen, J., Zhang, J., Wang, Y., Maire, M., Kale, A., and Faieta, B. (2021, January 20\u201325). Multimodal contrastive training for visual representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00692"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Nojavanasghari, B., Gopinath, D., Koushik, J., Baltru\u0161aitis, T., and Morency, L.P. (2016, January 12\u201316). Deep multimodal fusion for persuasiveness prediction. Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan.","DOI":"10.1145\/2993148.2993176"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1109\/MSP.2017.2738401","article-title":"Deep multimodal learning: A survey on recent advances and trends","volume":"34","author":"Ramachandram","year":"2017","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_38","first-page":"1","article-title":"Improved multimodal deep learning with variation of information","volume":"27","author":"Sohn","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_39","unstructured":"Niu, T., Zhu, S., Pang, L., and Saddik, A.E. (2016). Proceedings of the International Conference on Multimedia Modeling, Springer."},{"key":"ref_40","unstructured":"Cai, Y., Cai, H., and Wan, X. (August, January 28). Multi-modal sarcasm detection in twitter with hierarchical fusion model. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Xu, N., and Mao, W. (2017, January 6\u201310). Multisentinet: A deep semantic network for multimodal sentiment analysis. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore.","DOI":"10.1145\/3132847.3133142"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., and Funtowicz, M. (2020, January 16\u201320). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"ref_43","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Huang, L., Ma, D., Li, S., Zhang, X., and Wang, H. (2019, January 3\u20137). Text Level Graph Neural Network for Text Classification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1345"},{"key":"ref_45","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"4014","DOI":"10.1109\/TMM.2020.3035277","article-title":"Image-text multimodal emotion classification via multi-view attentional network","volume":"23","author":"Yang","year":"2020","journal-title":"IEEE Trans. Multimed."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Xu, N. (2017, January 22\u201324). Analyzing multimodal public sentiment based on hierarchical semantic attentional network. Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China.","DOI":"10.1109\/ISI.2017.8004895"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Xu, N., Mao, W., and Chen, G. (2018, January 8\u201312). A co-memory network for multimodal sentiment analysis. Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA.","DOI":"10.1145\/3209978.3210093"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Yang, X., Feng, S., Zhang, Y., and Wang, D. (2021, January 1\u20136). Multimodal sentiment detection based on multi-channel graph neural networks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event.","DOI":"10.18653\/v1\/2021.acl-long.28"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Schifanella, R., De Juan, P., Tetreault, J., and Cao, L. (2016, January 15\u201319). Detecting sarcasm in multimodal social platforms. Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands.","DOI":"10.1145\/2964284.2964321"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Xu, N., Zeng, Z., and Mao, W. (2020, January 5\u201310). Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.349"},{"key":"ref_53","first-page":"4271","article-title":"Funnel-transformer: Filtering out sequential redundancy for efficient language processing","volume":"33","author":"Dai","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/10\/1421\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:02:25Z","timestamp":1760130145000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/10\/1421"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,7]]},"references-count":53,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2023,10]]}},"alternative-id":["e25101421"],"URL":"https:\/\/doi.org\/10.3390\/e25101421","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,7]]}}}