{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T14:23:22Z","timestamp":1762957402749,"version":"3.37.3"},"reference-count":48,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2024,2,26]],"date-time":"2024-02-26T00:00:00Z","timestamp":1708905600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,26]],"date-time":"2024-02-26T00:00:00Z","timestamp":1708905600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Modeling in computer vision has long been dominated by convolutional neural networks (CNNs). Recently, in light of the excellent performance of self-attention mechanism in the language field, transformers tailored for visual data have drawn significant attention and triumphed over CNNs in various vision tasks. These vision transformers heavily rely on large-scale pre-training to achieve competitive accuracy, which not only hinders the freedom of architectural design in downstream tasks like object detection, but also causes learning bias and domain mismatch in the fine-tuning stages. To this end, we aim to get rid of the \u201cpre-train and fine-tune\u201d paradigm of vision transformer and train transformer based object detector from scratch. Some earlier works in the CNNs era have successfully trained CNNs based detectors without pre-training, unfortunately, their findings do not generalize well when the backbone is switched from CNNs to a vision transformer. Instead of proposing a specific vision transformer based detector, in this work, our goal is to reveal the insights of training vision transformer based detectors from scratch. In particular, we expect those insights to help other researchers and practitioners, and inspire more interesting research in other fields, such as remote sensing, visual-linguistic pre-training, etc. One of the key findings is that both architectural changes and more epochs play critical roles in training vision transformer based detectors from scratch. Experiments on the MS COCO dataset demonstrate that vision transformer based detectors trained from scratch can also achieve similar performance to their counterparts with ImageNet pre-training.<\/jats:p>","DOI":"10.1007\/s11263-024-01988-x","type":"journal-article","created":{"date-parts":[[2024,2,26]],"date-time":"2024-02-26T16:02:11Z","timestamp":1708963331000},"page":"2929-2942","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Training Object Detectors from Scratch: An Empirical Study in the Era of Vision Transformer"],"prefix":"10.1007","volume":"132","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3794-3972","authenticated-orcid":false,"given":"Weixiang","family":"Hong","sequence":"first","affiliation":[]},{"given":"Wang","family":"Ren","sequence":"additional","affiliation":[]},{"given":"Jiangwei","family":"Lao","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7731-9341","authenticated-orcid":false,"given":"Lele","family":"Xie","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8161-9168","authenticated-orcid":false,"given":"Liheng","family":"Zhong","sequence":"additional","affiliation":[]},{"given":"Jian","family":"Wang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1872-2592","authenticated-orcid":false,"given":"Jingdong","family":"Chen","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2880-4698","authenticated-orcid":false,"given":"Honghai","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Wei","family":"Chu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,26]]},"reference":[{"key":"1988_CR1","doi-asserted-by":"crossref","unstructured":"Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR.2018.00644"},{"key":"1988_CR2","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In The European conference on computer vision (ECCV).","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"1988_CR3","unstructured":"Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., & Lin, D. (2019). Mmdetection: Open MMLab detection toolbox and benchmark. arXiv:1906.07155."},{"key":"1988_CR4","unstructured":"Chen, Y., Zhang, Z., Cao, Y., Wang, L., Lin, S., & Hu, H. (2020). Reppoints v2: Verification meets regression for object detection. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR5","unstructured":"Cheng, B., Schwing, A.G., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR6","doi-asserted-by":"crossref","unstructured":"Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV.2017.89"},{"key":"1988_CR7","doi-asserted-by":"crossref","unstructured":"d\u2019Ascoli, S., Touvron, H., Leavitt, M., Morcos, A., Biroli, G., & Sagun, L. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. In International conference on machine learning (ICML).","DOI":"10.1088\/1742-5468\/ac9830"},{"key":"1988_CR8","unstructured":"Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL HLT."},{"key":"1988_CR9","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations (ICLR)."},{"key":"1988_CR10","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV.2015.169"},{"key":"1988_CR11","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR.2014.81"},{"key":"1988_CR12","doi-asserted-by":"crossref","unstructured":"Guo, J., Han, K., Wu. H., Tang, Y., Chen, X., Wang, Y., Xu, C. (2022). Cmt: Convolutional neural networks meet vision transformers. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR52688.2022.01186"},{"key":"1988_CR13","doi-asserted-by":"crossref","unstructured":"Gupta, S., Hoffman, J., & Malik, J. (2016). Cross modal distillation for supervision transfer. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR.2016.309"},{"key":"1988_CR14","unstructured":"Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR15","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In The IEEE International Conference on Computer Vision (ICCV).","DOI":"10.1109\/ICCV.2015.123"},{"key":"1988_CR16","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR.2016.90"},{"key":"1988_CR17","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017). Mask R-CNN. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV.2017.322"},{"key":"1988_CR18","doi-asserted-by":"crossref","unstructured":"He, K., Girshick, R., & Dollar, P. (2019). Rethinking imagenet pre-training. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV.2019.00502"},{"key":"1988_CR19","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., van\u00a0der Maaten, L., & Weinberger, K.Q. (2017). Densely connected convolutional networks. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR.2017.243"},{"key":"1988_CR20","unstructured":"Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR21","doi-asserted-by":"crossref","unstructured":"Law, H., & Deng, J. (2018). Cornernet: Detecting objects as paired keypoints. In The European conference on computer vision (ECCV).","DOI":"10.1007\/978-3-030-01264-9_45"},{"key":"1988_CR22","doi-asserted-by":"crossref","unstructured":"Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., & Feichtenhofer, C. (2022). Mvitv2: Improved multiscale vision transformers for classification and detection. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR52688.2022.00476"},{"key":"1988_CR23","doi-asserted-by":"crossref","unstructured":"Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2018). Detnet: A backbone network for object detection. In The European conference on computer vision (ECCV).","DOI":"10.1007\/978-3-030-01240-3_21"},{"key":"1988_CR24","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., & Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In The European conference on computer vision (ECCV).","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"1988_CR25","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., & Berg, A.C. (2016). SSD: Single shot multibox detector. In The European conference on computer vision (ECCV).","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"1988_CR26","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"1988_CR27","unstructured":"Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In international conference on learning representations (ICLR)."},{"key":"1988_CR28","unstructured":"Matan, O., Burges, C.J.C., LeCun, Y., & Denker, J. (1992). Multi-digit recognition using a space displacement neural network. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR29","first-page":"1345","volume":"22","author":"SJ Pan","year":"2010","unstructured":"Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE TKDE, 22, 1345\u20131359.","journal-title":"IEEE TKDE"},{"key":"1988_CR30","doi-asserted-by":"crossref","unstructured":"Peng, Z., Huang, W., Gu, S., Xie, L., Wang, Y., Jiao, J., & Ye, Q. (2021). Conformer: Local features coupling global representations for visual recognition. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV48922.2021.00042"},{"key":"1988_CR31","unstructured":"Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., & Dosovitskiy, A. (2021). Do vision transformers see like convolutional neural networks? In: Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan JW (Eds.) Advances in neural information processing systems (NIPS)."},{"key":"1988_CR32","unstructured":"Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR33","unstructured":"Rowley, H., Baluja, S., & Kanade, T. (1996). Human face detection in visual scenes. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR34","doi-asserted-by":"crossref","unstructured":"Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A., & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV).","DOI":"10.1007\/s11263-015-0816-y"},{"key":"1988_CR35","doi-asserted-by":"crossref","unstructured":"Shen, Z., Liu, Z., Li, J., Jiang, Y.G., Chen, Y., & Xue, X. (2017). DSOD: Learning deeply supervised object detectors from scratch. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV.2017.212"},{"key":"1988_CR36","doi-asserted-by":"crossref","unstructured":"Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., & Vaswani, A. (2021). Bottleneck transformers for visual recognition. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR46437.2021.01625"},{"key":"1988_CR37","doi-asserted-by":"crossref","unstructured":"Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L., Yuan, Z., Wang, C., & Luo, P. (2021). Sparse R-CNN: end-to-end object detection with learnable proposals. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR46437.2021.01422"},{"key":"1988_CR38","unstructured":"Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object detection. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR39","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR.2016.308"},{"key":"1988_CR40","doi-asserted-by":"crossref","unstructured":"Tian, Z., Shen, C., Chen, H., & He, T. (2019). Fcos: Fully convolutional one-stage object detection. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV.2019.00972"},{"key":"1988_CR41","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jegou, H. (2021). Training data-efficient image transformers & distillation through attention. In International conference on machine learning (ICML)."},{"key":"1988_CR42","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Lu., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR43","doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"1988_CR44","doi-asserted-by":"crossref","unstructured":"Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). Cvt: Introducing convolutions to vision transformers. In The IEEE international conference on computer vision (ICCV).","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"1988_CR45","unstructured":"Xiao, T., Singh, M., Mintun, E., Darrell, T., Doll\u00e1r, P., & Girshick, R. (2021). Early convolutions help transformers see better. In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR46","doi-asserted-by":"crossref","unstructured":"Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S., & Huang, F. (2021). Raise a child in large language model: Towards effective and generalizable fine-tuning. In EMNLP.","DOI":"10.18653\/v1\/2021.emnlp-main.749"},{"key":"1988_CR47","unstructured":"Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems (NIPS)."},{"key":"1988_CR48","doi-asserted-by":"crossref","unstructured":"Zhang, S., Chi, C., Yao, Y., Lei, Z., & Li, S.Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In The IEEE conference on computer vision and pattern recognition (CVPR).","DOI":"10.1109\/CVPR42600.2020.00978"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-01988-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-024-01988-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-01988-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,11]],"date-time":"2024-07-11T14:16:32Z","timestamp":1720707392000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-024-01988-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,26]]},"references-count":48,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["1988"],"URL":"https:\/\/doi.org\/10.1007\/s11263-024-01988-x","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"type":"print","value":"0920-5691"},{"type":"electronic","value":"1573-1405"}],"subject":[],"published":{"date-parts":[[2024,2,26]]},"assertion":[{"value":"6 May 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 January 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 February 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}