{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T04:38:12Z","timestamp":1769747892840,"version":"3.49.0"},"reference-count":38,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2024,9,24]],"date-time":"2024-09-24T00:00:00Z","timestamp":1727136000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Comput. Neurosci."],"abstract":"<jats:sec><jats:title>Introduction<\/jats:title><jats:p>With the great success of Transformers in the field of machine learning, it is also gradually attracting widespread interest in the field of remote sensing (RS). However, the research in the field of remote sensing has been hampered by the lack of large labeled data sets and the inconsistency of data modes caused by the diversity of RS platforms. With the rise of self-supervised learning (SSL) algorithms in recent years, RS researchers began to pay attention to the application of \u201cpre-training and fine-tuning\u201d paradigm in RS. However, there are few researches on multi-modal data fusion in remote sensing field. Most of them choose to use only one of the modal data or simply splice multiple modal data roughly.<\/jats:p><\/jats:sec><jats:sec><jats:title>Method<\/jats:title><jats:p>In order to study a more efficient multi-modal data fusion scheme, we propose a multi-modal fusion mechanism based on gated unit control (MGSViT). In this paper, we pretrain the ViT model based on BigEarthNet dataset by combining two commonly used SSL algorithms, and propose an intra-modal and inter-modal gated fusion unit for feature learning by combining multispectral (MS) and synthetic aperture radar (SAR). 
Our method can effectively combine different modal data to extract key feature information.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results and discussion<\/jats:title><jats:p>After fine-tuning and comparison experiments, we outperform the most advanced algorithms in all downstream classification tasks. The validity of our proposed method is verified.<\/jats:p><\/jats:sec>","DOI":"10.3389\/fncom.2024.1404623","type":"journal-article","created":{"date-parts":[[2024,9,24]],"date-time":"2024-09-24T00:28:17Z","timestamp":1727137697000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Multi-label remote sensing classification with self-supervised gated multi-modal transformers"],"prefix":"10.3389","volume":"18","author":[{"given":"Na","family":"Liu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ye","family":"Yuan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guodong","family":"Wu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sai","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jie","family":"Leng","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lihong","family":"Wan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2024,9,24]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2110.02095","article-title":"Exploring the limits of large scale pre-training","author":"Abnar","year":"2021","journal-title":"arXiv"},{"key":"B2","doi-asserted-by":"publisher","first-page":"22300","DOI":"10.48550\/arXiv.2209.06640","article-title":"Revisiting neural scaling laws in language and 
vision","volume":"35","author":"Alabdulmohsin","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B3","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1702.01992","article-title":"Gated multimodal units for information fusion","author":"Arevalo","year":"2017","journal-title":"arXiv"},{"key":"B4","first-page":"213","article-title":"\u201cEnd-to-end object detection with transformers,\u201d","volume-title":"European conference on computer vision","author":"Carion","year":"2020"},{"key":"B5","first-page":"9650","article-title":"\u201cEmerging properties in self-supervised vision transformers,\u201d","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Caron","year":"2021"},{"key":"B6","first-page":"9640","article-title":"\u201cAn empirical study of training self-supervised vision transformers,\u201d","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Chen","year":"2021"},{"key":"B7","doi-asserted-by":"publisher","first-page":"197","DOI":"10.48550\/arXiv.2207.08051","article-title":"Satmae: pre-training transformers for temporal and multi-spectral satellite imagery","volume":"35","author":"Cong","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B8","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2010.11929","article-title":"An image is worth 16x16 words: transformers for image recognition at scale","author":"Dosovitskiy","year":"2020","journal-title":"arXiv"},{"key":"B9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/LGRS.2022.3201489","article-title":"Satvit: pretraining transformers for earth observation","volume":"19","author":"Fuller","year":"2022","journal-title":"IEEE Geosci. Remote Sens. 
Lett"},{"key":"B10","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2209.14969","article-title":"Transfer learning with pretrained remote sensing transformers","author":"Green","year":"2022","journal-title":"arXiv"},{"key":"B11","doi-asserted-by":"publisher","first-page":"21271","DOI":"10.48550\/arXiv.2006.07733","article-title":"Bootstrap your own latent-a new approach to self-supervised learning","volume":"33","author":"Grill","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B12","first-page":"16000","article-title":"\u201cMasked autoencoders are scalable vision learners,\u201d","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"He","year":"2022"},{"key":"B13","doi-asserted-by":"crossref","first-page":"3241","DOI":"10.1109\/IGARSS47720.2021.9553741","article-title":"\u201cMulti-modal self-supervised representation learning for earth observation,\u201d","volume-title":"2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS","author":"Jain","year":"2021"},{"key":"B14","doi-asserted-by":"publisher","first-page":"7797","DOI":"10.1109\/JSTARS.2022.3204888","article-title":"Self-supervised learning for invariant representations from multi-spectral and sar images","volume":"15","author":"Jain","year":"2022","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens"},{"key":"B15","doi-asserted-by":"publisher","first-page":"2598","DOI":"10.1109\/TGRS.2020.3007029","article-title":"Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast","volume":"59","author":"Kang","year":"2021","journal-title":"IEEE Trans. Geosci. 
Remote Sens"},{"key":"B16","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2001.08361","article-title":"Scaling laws for neural language models","author":"Kaplan","year":"2020","journal-title":"arXiv"},{"key":"B17","doi-asserted-by":"crossref","first-page":"1516","DOI":"10.1109\/ICCT46805.2019.8947072","article-title":"\u201cA survey of recent advances in transfer learning,\u201d","volume-title":"2019 IEEE 19th International Conference on Communication Technology (ICCT)","author":"Liang","year":"2019"},{"key":"B18","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48","article-title":"\u201cMicrosoft coco: common objects in context,\u201d","author":"Lin","year":"2014","journal-title":"ECCV. European Conference on Computer Vision"},{"key":"B19","first-page":"12009","article-title":"\u201cSwin transformer v2: scaling up capacity and resolution,\u201d","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Liu","year":"2022"},{"key":"B20","doi-asserted-by":"publisher","first-page":"6730","DOI":"10.1109\/IGARSS39084.2020.9324501","article-title":"\u201cTraining general representations for remote sensing using in-domain knowledge,\u201d","author":"Neumann","year":"2020","journal-title":"IGARSS"},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1807.03748","author":"Oord","year":"2018","journal-title":"Representation learning with contrastive predictive coding"},{"key":"B22","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2208.06366","article-title":"Beit v2: masked image modeling with vector-quantized visual tokenizers","author":"Peng","year":"2022","journal-title":"arXiv"},{"key":"B23","doi-asserted-by":"publisher","first-page":"86","DOI":"10.3390\/rs12010086","article-title":"Convolutional neural network for remote-sensing scene classification: transfer learning analysis","volume":"12","author":"Pires de Lima","year":"2019","journal-title":"Remote 
Sens"},{"key":"B24","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"Imagenet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis"},{"key":"B25","first-page":"1422","article-title":"\u201cSelf-supervised vision transformers for land-cover segmentation and classification,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Scheibenreif","year":"2022"},{"key":"B26","doi-asserted-by":"publisher","first-page":"1285","DOI":"10.1109\/TMI.2016.2528162","article-title":"Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning","volume":"35","author":"Shin","year":"2016","journal-title":"IEEE Trans. Med. Imaging"},{"key":"B27","first-page":"1182","article-title":"\u201cSelf-supervised learning of remote sensing scene representations using contrastive multiview coding,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Stojnic","year":"2021"},{"key":"B28","doi-asserted-by":"crossref","first-page":"5901","DOI":"10.1109\/IGARSS.2019.8900532","article-title":"\u201cBigearthnet: a large-scale benchmark archive for remote sensing image understanding,\u201d","volume-title":"IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium","author":"Sumbul","year":"2019"},{"key":"B29","doi-asserted-by":"publisher","first-page":"174","DOI":"10.1109\/MGRS.2021.3089174","article-title":"Bigearthnet-mm: a large-scale, multimodal, multilabel benchmark archive for remote sensing image classification and retrieval [software and data sets]","volume":"9","author":"Sumbul","year":"2021","journal-title":"IEEE Geosci. Remote Sens. 
Mag"},{"key":"B30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/LGRS.2020.3038420","article-title":"Remote sensing image scene classification with self-supervised paradigm under limited labeled samples","volume":"19","author":"Tao","year":"2020","journal-title":"IEEE Geosci. Remote Sens. Lett"},{"key":"B31","doi-asserted-by":"publisher","author":"Tay","year":"2022","DOI":"10.48550\/arXiv.2207.10551"},{"key":"B32","doi-asserted-by":"crossref","first-page":"3034","DOI":"10.1109\/ICPR48806.2021.9413112","article-title":"\u201cThe color out of space: learning self-supervised representations for earth observation imagery,\u201d","volume-title":"2020 25th International Conference on Pattern Recognition (ICPR)","author":"Vincenzi","year":"2021"},{"key":"B33","doi-asserted-by":"publisher","first-page":"5608020","DOI":"10.1109\/TGRS.2022.3176603","article-title":"An empirical study of remote sensing pretraining","volume":"61","author":"Wang","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens"},{"key":"B34","first-page":"139","article-title":"\u201cSelf-supervised vision transformers for joint sar-optical representation learning,\u201d","volume-title":"IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium","author":"Wang","year":""},{"key":"B35","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2211.07044","article-title":"Ssl4eo-s12: a large-scale multi-modal, multi-temporal dataset for self-supervised learning in earth observation","author":"Wang","year":"","journal-title":"arXiv"},{"key":"B36","doi-asserted-by":"publisher","first-page":"38571","DOI":"10.48550\/arXiv.2204.12484","article-title":"Vitpose: simple vision transformer baselines for human pose estimation","volume":"35","author":"Xu","year":"2022","journal-title":"Adv. Neural Inf. Process. 
Syst"},{"key":"B37","doi-asserted-by":"publisher","first-page":"474","DOI":"10.1109\/JSTARS.2020.3036602","article-title":"Self-supervised pretraining of transformers for satellite image time series classification","volume":"14","author":"Yuan","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens"},{"key":"B38","first-page":"12104","article-title":"\u201cScaling vision transformers,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhai","year":"2022"}],"updated-by":[{"DOI":"10.3389\/fncom.2025.1665406","type":"corrigendum","label":"Corrigendum","source":"publisher","updated":{"date-parts":[[2025,8,18]],"date-time":"2025-08-18T00:00:00Z","timestamp":1755475200000}}],"container-title":["Frontiers in Computational Neuroscience"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2024.1404623\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,18]],"date-time":"2025-08-18T10:43:04Z","timestamp":1755513784000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2024.1404623\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,24]]},"references-count":38,"alternative-id":["10.3389\/fncom.2024.1404623"],"URL":"https:\/\/doi.org\/10.3389\/fncom.2024.1404623","relation":{"corrigendum":[{"id-type":"doi","id":"10.3389\/fncom.2025.1665406","asserted-by":"object"}]},"ISSN":["1662-5188"],"issn-type":[{"value":"1662-5188","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,24]]},"article-number":"1404623"}}