{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T02:38:27Z","timestamp":1775011107627,"version":"3.50.1"},"reference-count":32,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2022,5,2]],"date-time":"2022-05-02T00:00:00Z","timestamp":1651449600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Provincial Key Research and Development Plan","award":["BE2016032"],"award-info":[{"award-number":["BE2016032"]}]},{"name":"Provincial Key Research and Development Plan","award":["BE2010019"],"award-info":[{"award-number":["BE2010019"]}]},{"name":"Major Scientific and Technological Support and Independent Innovation Project","award":["BE2016032"],"award-info":[{"award-number":["BE2016032"]}]},{"name":"Major Scientific and Technological Support and Independent Innovation Project","award":["BE2010019"],"award-info":[{"award-number":["BE2010019"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>With the development of artificial intelligence technology and the popularity of intelligent production projects, intelligent inspection systems have gradually become a hot topic in the industrial field. As a fundamental problem in the field of computer vision, how to achieve object detection in the industry while taking into account the accuracy and real-time detection is an important challenge in the development of intelligent detection systems. The detection of defects on steel surfaces is an important application of object detection in the industry. Correct and fast detection of surface defects can greatly improve productivity and product quality. To this end, this paper introduces the MSFT-YOLO model, which is improved based on the one-stage detector. The MSFT-YOLO model is proposed for the industrial scenario in which the image background interference is great, the defect category is easily confused, the defect scale changes a great deal, and the detection results of small defects are poor. By adding the TRANS module, which is designed based on Transformer, to the backbone and detection headers, the features can be combined with global information. The fusion of features at different scales by combining multi-scale feature fusion structures enhances the dynamic adjustment of the detector to objects at different scales. To further improve the performance of MSFT-YOLO, we also introduce plenty of effective strategies, such as data augmentation and multi-step training methods. 
The test results on the NEU-DET dataset show that MSFT-YOLO can achieve real-time detection, and the average detection accuracy of MSFT-YOLO is 75.2%, improving by about 7% over the baseline model (YOLOv5) and by 18% over Faster R-CNN, which is a clear and encouraging advantage.<\/jats:p>","DOI":"10.3390\/s22093467","type":"journal-article","created":{"date-parts":[[2022,5,3]],"date-time":"2022-05-03T08:26:35Z","timestamp":1651566395000},"page":"3467","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":266,"title":["MSFT-YOLO: Improved YOLOv5 Based on Transformer for Detecting Defects of Steel Surface"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4406-1900","authenticated-orcid":false,"given":"Zexuan","family":"Guo","sequence":"first","affiliation":[{"name":"School of Modern Post, Beijing University of Posts and Telecommunications, Beijing 100876, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6858-3159","authenticated-orcid":false,"given":"Chensheng","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China"}]},{"given":"Guang","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China"}]},{"given":"Zeyuan","family":"Huang","sequence":"additional","affiliation":[{"name":"Teaching Affairs Office, Beijing University of Posts and Telecommunications, Beijing 100876, China"}]},{"given":"Guo","family":"Li","sequence":"additional","affiliation":[{"name":"School of Modern Post, Beijing University of Posts and Telecommunications, Beijing 100876, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Kim, S., Kim, W., Noh, Y.K., and Park, F.C. (2017, May 14\u201319). Transfer learning for automated optical inspection. Proceedings of the International Joint Conference on Neural Networks, Anchorage, AK, USA.","DOI":"10.1109\/IJCNN.2017.7966162"},{"key":"ref_2","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G. (2012, December 3\u20136). ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. arXiv.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger. arXiv.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_5","unstructured":"Redmon, J., and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv."},{"key":"ref_6","unstructured":"Bochkovskiy, A., Wang, C.Y., and Liao, H. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1109\/TPAMI.2015.2437384","article-title":"Region-Based Convolutional Networks for Accurate Object Detection and Segmentation","volume":"38","author":"Girshick","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015). Fast R-CNN.
arXiv.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_9","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2015, January 7\u201312). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_10","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_13","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., and Houlsby, N. (2020). An Image is Worth 16 \u00d7 16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_14","unstructured":"Naseer, M., Ranasinghe, K., Khan, S.H., Hayat, M., Shahbaz Khan, F., and Yang, M.H. (2012, January 3\u20136). Intriguing Properties of Vision Transformers. Proceedings of the NIPS 2012: Neural Information Processing Systems Conference, Lake Tahoe, NV, USA."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1585","DOI":"10.1111\/mice.12686","article-title":"Autonomous detection of damage to multiple steel surfaces from 360\u00b0 panoramas using deep neural networks","volume":"36","author":"Luo","year":"2021","journal-title":"Comput.-Aided Civ. Infrastruct. Eng."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Zhang, H., Wang, Y., Dayoub, F., and Sunderhauf, N. (2021, January 20\u201325). VarifocalNet: An IoU-aware Dense Object Detector. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00841"},{"key":"ref_17","unstructured":"Zhou, X., Koltun, V., and Krhenb\u00fchl, P. (2021). Probabilistic two-stage detection. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Munawar, H.S., Hammad, A.W., Haddad, A., Soares, C.A.P., and Waller, S.T. (2021). Image-based crack detection methods: A review. Infrastructures, 6.","DOI":"10.3390\/infrastructures6080115"},{"key":"ref_19","unstructured":"Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021). YOLOX: Exceeding YOLO Series in 2021. arXiv."},{"key":"ref_20","first-page":"1922","article-title":"FCOS: A simple and strong anchor-free object detector","volume":"44","author":"Tian","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_21","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2010). Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wang, C.Y., Bochkovskiy, A., and Liao, H. (2020). Scaled-YOLOv4: Scaling Cross Stage Partial Network. arXiv.","DOI":"10.1109\/CVPR46437.2021.01283"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). 
Feature Pyramid Networks for Object Detection, IEEE Computer Society.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018, June 18\u201323). Path Aggregation Network for Instance Segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00913"},{"key":"ref_25","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Ghiasi, G., Lin, T.Y., and Le, Q.V. (2019, June 15\u201320). NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00720"},{"key":"ref_27","unstructured":"Liu, S., Huang, D., and Wang, Y. (2019). Learning Spatial Fusion for Single-Shot Object Detection. arXiv."},{"key":"ref_28","first-page":"12051","article-title":"Parallel Residual Bi-Fusion Feature Pyramid Network for Accurate Single-Shot Object Detection","volume":"1911","author":"Hsieh","year":"2019","journal-title":"IEEE Trans. Image Processing"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Kong, T., Sun, F., Huang, W., Tan, C., and Liu, H. (2018, September 8\u201314). Deep Feature Pyramid Reconfiguration for Object Detection. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01228-1_11"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zhao, Q., Sheng, T., Wang, Y., Tang, Z., Chen, Y., Cai, L., and Ling, H. (2019, January 27\u2013February 1). M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA.","DOI":"10.1609\/aaai.v33i01.33019259"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. (2017). Mixup: Beyond Empirical Risk Minimization. arXiv.","DOI":"10.1007\/978-1-4899-7687-1_79"},{"key":"ref_32","first-page":"19365","article-title":"Self-adaptive training: Beyond empirical risk minimization","volume":"33","author":"Huang","year":"2020","journal-title":"Adv. Neural Inf. Processing Syst."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/9\/3467\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:05:27Z","timestamp":1760137527000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/9\/3467"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,2]]},"references-count":32,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2022,5]]}},"alternative-id":["s22093467"],"URL":"https:\/\/doi.org\/10.3390\/s22093467","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,2]]}}}
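The abstract in the record above describes the TRANS module only as a Transformer-based block added to the backbone and detection heads so that features are combined with global information. The following is a minimal, illustrative PyTorch sketch of a standard Transformer encoder block applied to a CNN feature map, in that spirit only; the class name TransBlockSketch, the hyperparameters, and the placement are assumptions for illustration and are not taken from the paper's implementation.

import torch
import torch.nn as nn

class TransBlockSketch(nn.Module):
    """Illustrative Transformer encoder block over a (B, C, H, W) feature map (not the paper's code)."""

    def __init__(self, channels: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten the spatial grid into a token sequence: (B, C, H, W) -> (B, H*W, C).
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)
        # Multi-head self-attention with a residual connection (pre-norm style),
        # letting every spatial position attend to the whole feature map.
        q = self.norm1(seq)
        attn_out, _ = self.attn(q, q, q)
        seq = seq + attn_out
        # Position-wise MLP with a residual connection.
        seq = seq + self.mlp(self.norm2(seq))
        # Restore the spatial layout: (B, H*W, C) -> (B, C, H, W).
        return seq.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    feature_map = torch.randn(1, 64, 20, 20)  # stand-in for a backbone feature map
    out = TransBlockSketch(channels=64)(feature_map)
    print(out.shape)  # torch.Size([1, 64, 20, 20])

Under these assumptions, such a block would typically sit after a late backbone stage or just before a detection head, where mixing in global context is most helpful for separating easily confused defect categories; this matches the abstract's high-level description but is only one plausible arrangement.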