{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T14:56:11Z","timestamp":1775228171939,"version":"3.50.1"},"reference-count":23,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2021,7,12]],"date-time":"2021-07-12T00:00:00Z","timestamp":1626048000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["20J13009"],"award-info":[{"award-number":["20J13009"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100017530","name":"Major Scientific Project of Zhejiang Laboratory","doi-asserted-by":"publisher","award":["2020ND8AD01"],"award-info":[{"award-number":["2020ND8AD01"]}],"id":[{"id":"10.13039\/501100017530","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Depression is a severe psychological condition that affects millions of people worldwide. As depression has received more attention in recent years, it has become imperative to develop automatic methods for detecting depression. Although numerous machine learning methods have been proposed for estimating the levels of depression via audio, visual, and audiovisual emotion sensing, several challenges still exist. For example, it is difficult to extract long-term temporal context information from long sequences of audio and visual data, and it is also difficult to select and fuse useful multi-modal information or features effectively. In addition, how to include other information or tasks to enhance the estimation accuracy is also one of the challenges. In this study, we propose a multi-modal adaptive fusion transformer network for estimating the levels of depression. Transformer-based models have achieved state-of-the-art performance in language understanding and sequence modeling. Thus, the proposed transformer-based network is utilized to extract long-term temporal context information from uni-modal audio and visual data in our work. This is the first transformer-based approach for depression detection. We also propose an adaptive fusion method for adaptively fusing useful multi-modal features. Furthermore, inspired by current multi-task learning work, we also incorporate an auxiliary task (depression classification) to enhance the main task of depression level regression (estimation). The effectiveness of the proposed method has been validated on a public dataset (AVEC 2019 Detecting Depression with AI Sub-challenge) in terms of the PHQ-8 scores. Experimental results indicate that the proposed method achieves better performance compared with currently state-of-the-art methods. 
Our proposed method achieves a concordance correlation coefficient (CCC) of 0.733 on AVEC 2019, which is 6.2% higher than that of the state-of-the-art method (CCC = 0.696).<\/jats:p>","DOI":"10.3390\/s21144764","type":"journal-article","created":{"date-parts":[[2021,7,12]],"date-time":"2021-07-12T21:56:52Z","timestamp":1626127012000},"page":"4764","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":87,"title":["Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8094-1991","authenticated-orcid":false,"given":"Hao","family":"Sun","sequence":"first","affiliation":[{"name":"School of Software Technology, Zhejiang University, Hangzhou 315048, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5340-7053","authenticated-orcid":false,"given":"Jiaqing","family":"Liu","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Kusatsushi 5250058, Shiga, Japan"}]},{"given":"Shurong","family":"Chai","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Kusatsushi 5250058, Shiga, Japan"}]},{"given":"Zhaolin","family":"Qiu","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Zhejiang University, Hangzhou 315048, China"}]},{"given":"Lanfen","family":"Lin","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Zhejiang University, Hangzhou 315048, China"}]},{"given":"Xinyin","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Education, Soochow University, Suzhou 215006, China"}]},{"given":"Yenwei","family":"Chen","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Kusatsushi 5250058, Shiga, Japan"},{"name":"Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou 311121, China"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,12]]},"reference":[{"key":"ref_1","unstructured":"Trinh, T., Dai, A., Luong, T., and Le, Q. (2018, January 10\u201315). Learning longer-term dependencies in RNNs with auxiliary losses. Proceedings of the International Conference on Machine Learning (PMLR 2018), Stockholm, Sweden."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Yin, S., Liang, C., Ding, H., and Wang, S. (2019, January 21). A multi-modal hierarchical recurrent neural network for depression detection. Proceedings of the 9th International on Audio\/Visual Emotion Challenge and Workshop, Nice, France.","DOI":"10.1145\/3347320.3357696"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1109\/MIS.2019.2925204","article-title":"Multitask representation learning for multimodal estimation of depression level","volume":"34","author":"Qureshi","year":"2019","journal-title":"IEEE Intell. Syst."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Ray, A., Kumar, S., Reddy, R., Mukherjee, P., and Garg, R. (2019, January 21). Multi-level attention network using text, audio and video for depression prediction. 
Proceedings of the 9th International on Audio\/Visual Emotion Challenge and Workshop, Nice, France.","DOI":"10.1145\/3347320.3357697"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Ringeval, F., Schuller, B., Valstar, M., Cummins, N., Cowie, R., Tavabi, L., Schmitt, M., Alisamir, S., Amiriparian, S., and Messner, E.M. (2019, January 21). AVEC 2019 workshop and challenge: State-of-mind, detecting depression with AI, and cross-cultural affect recognition. Proceedings of the 9th International on Audio\/Visual Emotion Challenge and Workshop, Nice, France.","DOI":"10.1145\/3347320.3357688"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Liu, J.Q., Huang, Y., Huang, X.Y., Xia, X.T., Niu, X.X., and Chen, Y.W. (2019). Multimodal behavioral dataset of depressive symptoms in Chinese college students\u2013preliminary study. Innovation in Medicine and Healthcare Systems, and Multimedia, Springer.","DOI":"10.1007\/978-981-13-8566-7_17"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Fan, W., He, Z., Xing, X., Cai, B., and Lu, W. (2019, January 21). Multi-modality depression detection via multi-scale temporal dilated CNNs. Proceedings of the 9th International on Audio\/Visual Emotion Challenge and Workshop, Nice, France.","DOI":"10.1145\/3347320.3357695"},{"key":"ref_8","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. arXiv."},{"key":"ref_9","unstructured":"Wang, Y., Wang, Z., Li, C., Zhang, Y., and Wang, H. (2020). A Multitask Deep Learning Approach for User Depression Detection on Sina Weibo. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., and Wu, Y. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Delbrouck, J.B., Tits, N., Brousmiche, M., and Dupont, S. (2020). A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis. arXiv.","DOI":"10.18653\/v1\/2020.challengehml-1.1"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"302","DOI":"10.1192\/bjp.bp.110.078139","article-title":"State-dependent alteration in face emotion recognition in depression","volume":"198","author":"Anderson","year":"2011","journal-title":"Br. J. Psychiatry"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1007\/s12193-013-0123-2","article-title":"Multimodal assistive technologies for depression diagnosis and monitoring","volume":"7","author":"Joshi","year":"2013","journal-title":"J. Multimodal User Interfaces"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Rodrigues Makiuchi, M., Warnita, T., Uto, K., and Shinoda, K. (2019, January 21). Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. Proceedings of the 9th International on Audio\/Visual Emotion Challenge and Workshop, Nice, France.","DOI":"10.1145\/3347320.3357694"},{"key":"ref_15","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. 
arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"509","DOI":"10.3928\/0048-5713-20020901-06","article-title":"The PHQ-9: A New Depression Diagnostic and Severity Measure","volume":"32","author":"Kroenke","year":"2002","journal-title":"Psychiatr. Ann."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"255","DOI":"10.2307\/2532051","article-title":"A concordance correlation coefficient to evaluate reproducibility","volume":"45","author":"Lawrence","year":"1989","journal-title":"Biometrics"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1016\/j.jad.2008.06.026","article-title":"The PHQ-8 as a measure of current depression in the general population","volume":"114","author":"Kroenke","year":"2009","journal-title":"J. Affect. Disord."},{"key":"ref_19","unstructured":"Gratch, J., Artstein, R., Lucas, G.M., Stratou, G., Scherer, S., Nazarian, A., Wood, R., Boberg, J., DeVault, D., and Marsella, S. (2014, January 26\u201331). The Distress Analysis Interview Corpus of Human and Computer Interviews. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914), Reykjavik, Iceland."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Eyben, F., W\u00f6llmer, M., and Schuller, B. (2010, January 25\u201329). Opensmile: The munich versatile and fast open-source audio feature extractor. Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy.","DOI":"10.1145\/1873951.1874246"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Baltru\u0161aitis, T., Robinson, P., and Morency, L.P. (2016, January 7\u201310). Openface: An open source facial behavior analysis toolkit. Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA.","DOI":"10.1109\/WACV.2016.7477553"},{"key":"ref_22","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_23","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019). Pytorch: An imperative style, high-performance deep learning library. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4764\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:29:30Z","timestamp":1760164170000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/14\/4764"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,12]]},"references-count":23,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["s21144764"],"URL":"https:\/\/doi.org\/10.3390\/s21144764","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,12]]}}}