{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,15]],"date-time":"2026-05-15T17:36:11Z","timestamp":1778866571420,"version":"3.51.4"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,2,25]],"date-time":"2023-02-25T00:00:00Z","timestamp":1677283200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001381","name":"National Research Foundation, Singapore","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001381","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,5,31]]},"abstract":"<jats:p>Making each modality in multi-modal data contribute is of vital importance to learning a versatile multi-modal model. Existing methods, however, are often dominated by one or few of modalities during model training, resulting in sub-optimal performance. In this article, we refer to this problem as modality bias and attempt to study it in the context of multi-modal classification systematically and comprehensively. After stepping into several empirical analyses, we recognize that one modality affects the model prediction more just because this modality has a spurious correlation with instance labels. To primarily facilitate the evaluation on the modality bias problem, we construct two datasets, respectively, for the colored digit recognition and video action recognition tasks in line with the Out-of-Distribution (OoD) protocol. Collaborating with the benchmarks in the visual question answering task, we empirically justify the performance degradation of the existing methods on these OoD datasets, which serves as evidence to justify the modality bias learning. In addition, to overcome this problem, we propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned according to the training set statistics. Thereafter, we apply this method on 10 baselines in total to test its effectiveness. From the results on four datasets regarding the above three tasks, our method yields remarkable performance improvements compared with the baselines, demonstrating its superiority on reducing the modality bias problem.<\/jats:p>","DOI":"10.1145\/3565266","type":"journal-article","created":{"date-parts":[[2022,9,29]],"date-time":"2022-09-29T11:48:36Z","timestamp":1664452116000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":45,"title":["On Modality Bias Recognition and Reduction"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8691-5372","authenticated-orcid":false,"given":"Yangyang","family":"Guo","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1476-0273","authenticated-orcid":false,"given":"Liqiang","family":"Nie","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen), China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7436-0162","authenticated-orcid":false,"given":"Harry","family":"Cheng","sequence":"additional","affiliation":[{"name":"Shandong University, Qingdao, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1052-8322","authenticated-orcid":false,"given":"Zhiyong","family":"Cheng","sequence":"additional","affiliation":[{"name":"Shandong Artificial Intelligence Institute, Jinan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4846-2015","authenticated-orcid":false,"given":"Mohan","family":"Kankanhalli","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1109-5028","authenticated-orcid":false,"given":"Alberto","family":"Del Bimbo","sequence":"additional","affiliation":[{"name":"University of Florence, Florence, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,2,25]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00522"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.12"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i8.16829"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00821"},{"key":"e_1_3_3_9_2","first-page":"813","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning. PMLR, 813\u2013824."},{"key":"e_1_3_3_10_2","first-page":"839","volume-title":"Advances in Neural Information Processing Systems","author":"Cad\u00e8ne R\u00e9mi","year":"2019","unstructured":"R\u00e9mi Cad\u00e8ne, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, and Devi Parikh. 2019. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems. MIT, 839\u2013850."},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.aal4230"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00610"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01081"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01471"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.168"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1418"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00482"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3278721.3278729"},{"key":"e_1_3_3_20_2","first-page":"6447","volume-title":"Advances in Neural Information Processing Systems","author":"Dou Qi","year":"2019","unstructured":"Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, and Ben Glocker. 2019. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems. MIT, 6447\u20136458."},{"key":"e_1_3_3_21_2","first-page":"2922","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Engstrom Logan","year":"2020","unstructured":"Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, and Aleksander Madry. 2020. Identifying statistical bias in dataset replication. In Proceedings of the International Conference on Machine Learning. PMLR, 2922\u20132932."},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_3_23_2","volume-title":"Advances in Neural Information Processing Systems","author":"Gat Itai","year":"2020","unstructured":"Itai Gat, Idan Schwartz, Alexander G. Schwing, and Tamir Hazan. 2020. Removing bias in multi-modal classifiers: Regularization by maximizing functional entropies. In Advances in Neural Information Processing Systems. MIT."},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00342"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331186"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2021\/98"},{"key":"e_1_3_3_28_2","article-title":"Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view","author":"Guo Yangyang","year":"2021","unstructured":"Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Qi Tian, and Min Zhang. 2021. Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view. IEEE Trans. Image Process. 31, 227\u2013238.","journal-title":"IEEE Trans. Image Process."},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3425663"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00823"},{"key":"e_1_3_3_32_2","first-page":"2712","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Hendrycks Dan","year":"2019","unstructured":"Dan Hendrycks, Kimin Lee, and Mantas Mazeika. 2019. Using pre-training can improve model robustness and uncertainty. In Proceedings of the International Conference on Machine Learning. PMLR, 2712\u20132721."},{"key":"e_1_3_3_33_2","article-title":"Distilling the knowledge in a neural network","volume":"1503","author":"Hinton Geoffrey E.","year":"2015","unstructured":"Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR abs\/1503.02531.","journal-title":"CoRR"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.167"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6776"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.215"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.215"},{"key":"e_1_3_3_38_2","article-title":"The kinetics human action video dataset","volume":"1705","author":"Kay Will","year":"2017","unstructured":"Will Kay, Jo\u00e3o Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The kinetics human action video dataset. CoRR abs\/1705.06950.","journal-title":"CoRR"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01472"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2018.00283"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_3_3_42_2","first-page":"7082","volume-title":"Advances in Neural Information Processing","author":"Lee Hyuck","year":"2021","unstructured":"Hyuck Lee, Seungjae Shin, and Heeyoung Kim. 2021. ABC: Auxiliary balanced classifier for class-imbalanced semi-supervised learning. In Advances in Neural Information Processing. 7082\u20137094."},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3418217"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00980"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01470"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01470"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBIOM.2018.2890577"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377882"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1302"},{"key":"e_1_3_3_50_2","volume-title":"CoRR","author":"Perez Ethan","year":"2022","unstructured":"Ethan Perez, Saffron Huang, H. Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. CoRR abs\/2202.03286."},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"e_1_3_3_52_2","first-page":"46","volume-title":"Proceedings of the Computer Vision and Pattern Recognition Workshops","author":"Pernici Federico","year":"2019","unstructured":"Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. 2019. Maximally compact and separated features with regular polytope networks. In Proceedings of the Computer Vision and Pattern Recognition Workshops. IEEE, 46\u201353."},{"key":"e_1_3_3_53_2","first-page":"1548","volume-title":"Advances in Neural Information Processing Systems","author":"Ramakrishnan Sainandan","year":"2018","unstructured":"Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. 2018. Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems. MIT, 1548\u20131558."},{"key":"e_1_3_3_54_2","first-page":"5389","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Recht Benjamin","year":"2019","unstructured":"Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the International Conference on Machine Learning. PMLR, 5389\u20135400."},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00268"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.468"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00145"},{"key":"e_1_3_3_59_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Wang Haohan","year":"2019","unstructured":"Haohan Wang, Zexue He, Zachary C. Lipton, and Eric P. Xing. 2019. Learning robust representations by projecting superficial statistics out. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00552"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00078"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00541"},{"key":"e_1_3_3_63_2","first-page":"8601","volume-title":"Advances in Neural Information Processing Systems","author":"Wu Jialin","year":"2019","unstructured":"Jialin Wu and Raymond J. Mooney. 2019. Self-critical reasoning for robust visual question answering. In Advances in Neural Information Processing Systems. MIT, 8601\u20138611."},{"key":"e_1_3_3_64_2","first-page":"4134","volume-title":"Annual Meeting of the Association for Computational Linguistics","author":"Zhang Guanhua","year":"2020","unstructured":"Guanhua Zhang, Bing Bai, Junqi Zhang, Kun Bai, Conghui Zhu, and Tiejun Zhao. 2020. Demographics should not be the reason of toxicity: Mitigating discrimination in text classifications with instance weighting. In Annual Meeting of the Association for Computational Linguistics. ACL, 4134\u20134145."},{"key":"e_1_3_3_65_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Zhang Yan","year":"2018","unstructured":"Yan Zhang, Jonathon S. Hare, and Adam Pr\u00fcgel-Bennett. 2018. Learning to count objects in natural images for visual question answering. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_66_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16458"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366710"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3565266","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3565266","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:43Z","timestamp":1750178263000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3565266"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,25]]},"references-count":66,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,5,31]]}},"alternative-id":["10.1145\/3565266"],"URL":"https:\/\/doi.org\/10.1145\/3565266","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,25]]},"assertion":[{"value":"2021-12-09","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-09-26","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}