{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:05:55Z","timestamp":1760148355427,"version":"build-2065373602"},"reference-count":34,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2023,4,23]],"date-time":"2023-04-23T00:00:00Z","timestamp":1682208000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62006120","KYCX21-0878"],"award-info":[{"award-number":["62006120","KYCX21-0878"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Graduate Research Practice Innovation Plan of Jiangsu in 2021","award":["62006120","KYCX21-0878"],"award-info":[{"award-number":["62006120","KYCX21-0878"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>In recent years, convolutional neural networks have held the leading position in ground-based cloud image classification tasks. However, this approach introduces too much inductive bias, fails to perform global modeling, and the performance of convolutional neural network models gradually saturates as the amount of data increases. In this paper, we propose a novel method for ground-based cloud image recognition based on the multi-modal Swin Transformer (MMST), which discards the idea of using convolution to extract visual features and mainly consists of an attention mechanism module and linear layers. The Swin Transformer, the visual backbone network of MMST, enables the model to achieve better performance in downstream tasks through pre-trained weights obtained from the large-scale ImageNet dataset and significantly shortens the transfer learning time. 
At the same time, the multi-modal information fusion network uses multiple linear layers and a residual structure to thoroughly learn multi-modal features, further improving the model\u2019s performance. MMST is evaluated on the multi-modal ground-based cloud public dataset MGCD. Compared with state-of-the-art methods, the classification accuracy reaches 91.30%, which verifies its validity in ground-based cloud image classification and shows that models based on the Transformer architecture can also achieve strong results in ground-based cloud image recognition.<\/jats:p>","DOI":"10.3390\/s23094222","type":"journal-article","created":{"date-parts":[[2023,4,24]],"date-time":"2023-04-24T03:04:08Z","timestamp":1682305448000},"page":"4222","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["MMST: A Multi-Modal Ground-Based Cloud Image Classification Method"],"prefix":"10.3390","volume":"23","author":[{"given":"Liang","family":"Wei","sequence":"first","affiliation":[{"name":"College of Mechanical and Electronic Engineering, Nanjing Forestry University, Nanjing 210037, China"}]},{"given":"Tingting","family":"Zhu","sequence":"additional","affiliation":[{"name":"College of Mechanical and Electronic Engineering, Nanjing Forestry University, Nanjing 210037, China"}]},{"given":"Yiren","family":"Guo","sequence":"additional","affiliation":[{"name":"College of Mechanical and Electronic Engineering, Nanjing Forestry University, Nanjing 210037, China"}]},{"given":"Chao","family":"Ni","sequence":"additional","affiliation":[{"name":"College of Mechanical and Electronic Engineering, Nanjing Forestry University, Nanjing 210037, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,4,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"012020","DOI":"10.1088\/1742-6596\/2035\/1\/012020","article-title":"Cloud Classification of Ground-Based Cloud Images Based on 
Convolutional Neural Network","volume":"2035","author":"Zhu","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"410","DOI":"10.1175\/2010JTECHA1385.1","article-title":"Cloud Classification Based on Structure Features of Infrared Images","volume":"28","author":"Liu","year":"2011","journal-title":"J. Atmos. Ocean. Technol."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"557","DOI":"10.5194\/amt-3-557-2010","article-title":"Automatic Cloud Classification of Whole Sky Images","volume":"3","author":"Heinle","year":"2010","journal-title":"Atmos. Meas. Tech."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2667","DOI":"10.1080\/01431161.2018.1530807","article-title":"A Local Binary Pattern Classification Approach for Cloud Types Derived from All-Sky Imagers","volume":"40","author":"Oikonomou","year":"2019","journal-title":"Int. J. Remote Sens."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"789","DOI":"10.1175\/JTECH-D-15-0015.1","article-title":"MCLOUD: A Multiview Visual Feature Extraction Mechanism for Ground-Based Cloud Image Categorization","volume":"33","author":"Xiao","year":"2016","journal-title":"J. Atmos. Ocean. Technol."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"8665","DOI":"10.1029\/2018GL077787","article-title":"CloudNet: Ground-Based Cloud Classification with Deep Convolutional Neural Network","volume":"45","author":"Zhang","year":"2018","journal-title":"Geophys. Res. Lett."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"e2020GL087338","DOI":"10.1029\/2020GL087338","article-title":"Ground-Based Cloud Classification Using Task-Based Graph Convolutional Network","volume":"47","author":"Liu","year":"2020","journal-title":"Geophys. Res. Lett."},{"key":"ref_8","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). 
Proceedings of the Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_9","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Li, X., Qiu, B., Cao, G., Wu, C., and Zhang, L. (2022). A Novel Method for Ground-Based Cloud Image Classification Using Transformer. Remote Sens., 14.","DOI":"10.3390\/rs14163978"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"4199","DOI":"10.1021\/cr5006292","article-title":"Atmospheric Processes and Their Controlling Influence on Cloud Condensation Nuclei Activity","volume":"115","author":"Farmer","year":"2015","journal-title":"Chem. Rev."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Liu, S., Li, M., Zhang, Z., Xiao, B., and Durrani, T.S. (2020). Multi-Evidence and Multi-Modal Fusion Network for Ground-Based Cloud Recognition. Remote Sens., 12.","DOI":"10.3390\/rs12030464"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., and Dong, L. (2022, January 18\u201324). Swin Transformer V2: Scaling Up Capacity and Resolution. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01170"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 10\u201317). Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. 
Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zheng, Z., Zhao, Y., Li, A., and Yu, Q. (2022). Wild Terrestrial Animal Re-Identification Based on an Improved Locally Aware Transformer with a Cross-Attention Mechanism. Animals, 12.","DOI":"10.3390\/ani12243503"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Li, A., Zhao, Y., and Zheng, Z. (2022). Novel Recursive BiFPN Combining with Swin Transformer for Wildland Fire Smoke Detection. Forests, 13.","DOI":"10.3390\/f13122032"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-Attention with Relative Position Representations. arXiv.","DOI":"10.18653\/v1\/N18-2074"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chattopadhay, A., Sarkar, A., Howlader, P., and Balasubramanian, V.N. (2018, January 12\u201315). Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA.","DOI":"10.1109\/WACV.2018.00097"},{"key":"ref_19","unstructured":"Hendrycks, D., and Gimpel, K. (2020). Gaussian Error Linear Units (GELUs). arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Shan, Y., Hoens, T.R., Jiao, J., Wang, H., Yu, D., and Mao, J. (2016, January 13). Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA.","DOI":"10.1145\/2939672.2939704"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 
arXiv, 1026\u20131034.","DOI":"10.1109\/ICCV.2015.123"},{"key":"ref_22","unstructured":"Glorot, X., and Bengio, Y. (2010, January 31). Understanding the Difficulty of Training Deep Feedforward Neural Networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics; JMLR Workshop and Conference Proceedings, Sardinia, Italy."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"971","DOI":"10.1109\/TPAMI.2002.1017623","article-title":"Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns","volume":"24","author":"Ojala","year":"2002","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1657","DOI":"10.1109\/TIP.2010.2044957","article-title":"A Completed Modeling of Local Binary Pattern Operator for Texture Classification","volume":"19","author":"Guo","year":"2010","journal-title":"IEEE Trans. Image Process."},{"key":"ref_25","unstructured":"Csurka, G., Dance, C.R., Fan, L., Willamowski, J., and Bray, C. (2004, January 10\u221214). Visual Categorization with Bags of Keypoints. Proceedings of the Workshop on Statistical Learning in Computer Vision, ECCV, Prague, Czech Republic."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"2169","DOI":"10.1109\/CVPR.2006.68","article-title":"Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories","volume":"Volume 2","author":"Lazebnik","year":"2006","journal-title":"IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR\u201906)"},{"key":"ref_27","unstructured":"Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. 
arXiv.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1186\/s13638-018-1062-0","article-title":"Deep Multimodal Fusion for Ground-Based Cloud Classification in Weather Station Networks","volume":"2018","author":"Liu","year":"2018","journal-title":"J. Wirel. Com. Netw."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"816","DOI":"10.1109\/LGRS.2017.2681658","article-title":"Deep Convolutional Activations-Based Features for Ground-Based Cloud Classification","volume":"14","author":"Shi","year":"2017","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Liu, S., Li, M., Zhang, Z., Xiao, B., and Cao, X. (2018). Multimodal Ground-Based Cloud Classification Using Joint Fusion Convolutional Neural Network. Remote Sens., 10.","DOI":"10.3390\/rs10060822"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"101991","DOI":"10.1016\/j.adhoc.2019.101991","article-title":"Deep Tensor Fusion Network for Multimodal Ground-Based Cloud Classification in Weather Station Networks","volume":"96","author":"Li","year":"2020","journal-title":"Ad Hoc Netw."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"85688","DOI":"10.1109\/ACCESS.2019.2926092","article-title":"Hierarchical Multimodal Fusion for Ground-Based Cloud Classification in Weather Station Networks","volume":"7","author":"Liu","year":"2019","journal-title":"IEEE Access."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"e4794","DOI":"10.1002\/nbm.4794","article-title":"Impact of Deep Learning Architectures on Accelerated Cardiac T1 Mapping Using MyoMapNet","volume":"35","author":"Amyar","year":"2022","journal-title":"NMR Biomed."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/9\/4222\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:22:00Z","timestamp":1760124120000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/9\/4222"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,23]]},"references-count":34,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2023,5]]}},"alternative-id":["s23094222"],"URL":"https:\/\/doi.org\/10.3390\/s23094222","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2023,4,23]]}}}