{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T04:10:14Z","timestamp":1773547814030,"version":"3.50.1"},"reference-count":62,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2022,8,25]],"date-time":"2022-08-25T00:00:00Z","timestamp":1661385600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"State Key Laboratory of Rail Traffic Control &amp; Safety","award":["RCS2021ZT003"],"award-info":[{"award-number":["RCS2021ZT003"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The state monitoring of the railway track line is one of the important tasks to ensure the safety of the railway transportation system. While the defect recognition result, that is, the inspection report, is the main basis for the maintenance decision. Most previous attempts have proposed intelligent detection methods to achieve rapid and accurate inspection of the safety state of the railway track line. However, there are few investigations on the automatic generation of inspection reports. Fortunately, inspired by the recent advances and successes in dense captioning, such technologies can be investigated and used to generate textual information on the type, position, status, and interrelationship of the key components from the field images. To this end, based on the work of DenseCap, a railway track line image captioning model (RTLCap for short) is proposed, which replaces VGG16 with ResNet-50-FPN as the backbone of the model to extract more powerful image features. In addition, towards the problems of object occlusion and category imbalance in the field images, Soft-NMS and Focal Loss are applied in RTLCap to promote defect description performance. 
After that, to improve the image processing speed of RTLCap and reduce the complexity of the model, a reconstructed RTLCap model named Faster RTLCap is presented with the help of YOLOv3. In the encoder part, a multi-level regional feature localization, mapping, and fusion module (MFLMF) is proposed to extract regional features, and an SPP (Spatial Pyramid Pooling) layer is employed after MFLMF to reduce the number of model parameters. In the decoder part, a stacked LSTM is adopted as the language model for better language representation learning. Both quantitative and qualitative experimental results demonstrate the effectiveness of the proposed methods.<\/jats:p>","DOI":"10.3390\/s22176419","type":"journal-article","created":{"date-parts":[[2022,8,30]],"date-time":"2022-08-30T01:37:55Z","timestamp":1661823475000},"page":"6419","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":14,"title":["Automatic Defect Description of Railway Track Line Image Based on Dense Captioning"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0373-3598","authenticated-orcid":false,"given":"Dehua","family":"Wei","sequence":"first","affiliation":[{"name":"School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0341-966X","authenticated-orcid":false,"given":"Xiukun","family":"Wei","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China"}]},{"given":"Limin","family":"Jia","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,8,25]]},"reference":[{"key":"ref_1","first-page":"760","article-title":"Rail component detection, optimization, and assessment for automatic rail 
track inspection","volume":"15","author":"Li","year":"2013","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_2","first-page":"41","article-title":"Overall comments on track technology of high-speed railway","volume":"1","author":"Zuwen","year":"2007","journal-title":"J. Railw. Eng. Soc."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Johnson, J., Karpathy, A., and Li, F.-F. (2016, January 27\u201330). Densecap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.494"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Yang, L., Tang, K., Yang, J., and Li, L.J. (2017, January 21\u201326). Dense captioning with joint inference and visual context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.214"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wang, T.J.J., Tavakoli, H.R., Sj\u00f6berg, M., and Laaksonen, J. (2019, January 25). Geometry-aware relational exemplar attention for dense captioning. Proceedings of the 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, Nice, France.","DOI":"10.1145\/3347450.3357656"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Yin, G., Sheng, L., Liu, B., Yu, N., Wang, X., and Shao, J. (2019, January 15\u201320). Context and attribute grounded dense captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00640"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Zhang, Y., Shi, Y., Yu, W., Nie, L., He, G., Fan, Y., and Yang, Z. (2019). Dense Image Captioning Based on Precise Feature Extraction. 
International Conference on Neural Information Processing, Springer.","DOI":"10.1007\/978-3-030-36802-9_10"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1016\/j.neucom.2019.09.055","article-title":"Cross-scale fusion detection with global attribute for dense captioning","volume":"373","author":"Zhao","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3295748","article-title":"A comprehensive survey of deep learning for image captioning","volume":"51","author":"Hossain","year":"2019","journal-title":"ACM Comput. Surv. (CsUR)"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_12","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Bodla, N., Singh, B., Chellappa, R., and Davis, L.S. (2017, January 22\u201329). Soft-NMS\u2013improving object detection with one line of code. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.593"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. 
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_15","unstructured":"Redmon, J., and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1904","DOI":"10.1109\/TPAMI.2015.2389824","article-title":"Spatial pyramid pooling in deep convolutional networks for visual recognition","volume":"37","author":"He","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"418","DOI":"10.1109\/TSMCC.2007.893278","article-title":"A real-time visual inspection system for railway maintenance: Automatic hexagonal-headed bolts detection","volume":"37","author":"Marino","year":"2007","journal-title":"IEEE Trans. Syst. Man Cybern. Part C Appl. Rev."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"De Ruvo, P., Distante, A., Stella, E., and Marino, F. (2009, January 7\u201310). A GPU-based vision system for real time detection of fastening elements in railway inspection. Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt.","DOI":"10.1109\/ICIP.2009.5414438"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Gibert, X., Patel, V.M., and Chellappa, R. (2015, January 5\u20139). Robust fastener detection for autonomous visual railway track inspection. Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV.2015.98"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1109\/TITS.2016.2568758","article-title":"Deep multitask learning for railway track inspection","volume":"18","author":"Gibert","year":"2016","journal-title":"IEEE Trans. Intell. Transp. 
Syst."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1016\/j.engappai.2019.01.008","article-title":"Railway track fastener defect detection based on image processing and deep learning techniques: A comparative study","volume":"80","author":"Wei","year":"2019","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Li, X., and Chen, H. (2019, January 12\u201314). Railway fastener defect detection based on deep convolutional networks. Proceedings of the Eleventh International Conference on Graphics and Image Processing (ICGIP 2019), Hangzhou, China.","DOI":"10.1117\/12.2557231"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Qi, H., Xu, T., Wang, G., Cheng, Y., and Chen, C. (2020). MYOLOv3-Tiny: A new convolutional neural network architecture for real-time detection of track fasteners. Comput. Ind., 123.","DOI":"10.1016\/j.compind.2020.103303"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Bai, T., Yang, J., Xu, G., and Yao, D. (2021). An optimized railway fastener detection method based on modified Faster R-CNN. Measurement, 182.","DOI":"10.1016\/j.measurement.2021.109742"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Faghih-Roohi, S., Hajizadeh, S., N\u00fa\u00f1ez, A., Babuska, R., and De Schutter, B. (2016, January 24\u201329). Deep convolutional neural networks for detection of rail surface defects. Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada.","DOI":"10.1109\/IJCNN.2016.7727522"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Liang, Z., Zhang, H., Liu, L., He, Z., and Zheng, K. (2018, January 4\u20138). Defect Detection of Rail Surface with Deep Convolutional Neural Networks. 
Proceedings of the 2018 13th World Congress on Intelligent Control and Automation (WCICA), Changsha, China.","DOI":"10.1109\/WCICA.2018.8630525"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"James, A., Jie, W., Xulei, Y., Chenghao, Y., Ngan, N.B., Yuxin, L., Yi, S., Chandrasekhar, V., and Zeng, Z. (2018, January 12\u201314). TrackNet-A Deep Learning Based Fault Detection for Railway Track Inspection. Proceedings of the 2018 International Conference on Intelligent Rail Transportation (ICIRT), Singapore.","DOI":"10.1109\/ICIRT.2018.8641608"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Shang, L., Yang, Q., Wang, J., Li, S., and Lei, W. (2018, January 11\u201314). Detection of rail surface defects based on CNN image recognition and classification. Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Korea.","DOI":"10.23919\/ICACT.2018.8323642"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1049\/iet-est.2020.0041","article-title":"Research on deep learning method for rail surface defect detection","volume":"10","author":"Feng","year":"2020","journal-title":"IET Electr. Syst. 
Transp."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"61973","DOI":"10.1109\/ACCESS.2020.2984264","article-title":"Multi-target defect identification for railway track line based on image processing and improved YOLOv3 model","volume":"8","author":"Wei","year":"2020","journal-title":"IEEE Access"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"21798","DOI":"10.1109\/ACCESS.2021.3055512","article-title":"A Deep Extractor for Visual Rail Surface Inspection","volume":"9","author":"Zhang","year":"2021","journal-title":"IEEE Access"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1694","DOI":"10.1109\/TII.2021.3085848","article-title":"Attention Network for Rail Surface Defect Detection via CASIoU-Guided Center-Point Estimation","volume":"18","author":"Ni","year":"2021","journal-title":"IEEE Trans. Ind. Inform."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"362","DOI":"10.1111\/mice.12625","article-title":"Automatic railroad track components inspection using real-time instance segmentation","volume":"36","author":"Guo","year":"2021","journal-title":"Comput.-Aided Civ. Infrastruct. Eng."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1111\/mice.12710","article-title":"Hybrid deep learning architecture for rail surface segmentation and surface defect detection","volume":"37","author":"Wu","year":"2022","journal-title":"Comput.-Aided Civ. Infrastruct. Eng."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Bai, T., Gao, J., Yang, J., and Yao, D. (2021). A study on railway surface defects detection based on machine vision. Entropy, 23.","DOI":"10.3390\/e23111437"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_37","unstructured":"Karpathy, A., Joulin, A., and Li, F.-F. (2014). Deep fragment embeddings for bidirectional image sentence mapping. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Nickolls, J. (2007, January 19\u201321). GPU parallel computing architecture and CUDA programming model. Proceedings of the 2007 IEEE Hot Chips 19 Symposium (HCS), Stanford, CA, USA.","DOI":"10.1109\/HOTCHIPS.2007.7482491"},{"key":"ref_40","unstructured":"Kingma, D., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. Comput. Sci."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Geng, M., Wang, Y., Xiang, T., and Tian, Y. (2016). Deep transfer learning for person re-identification. arXiv.","DOI":"10.1109\/CVPR.2016.146"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., and Shamma, D.A. (2016). Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv.","DOI":"10.1007\/s11263-016-0981-7"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Bang, S., and Kim, H. (2020). Context-based information generation for managing UAV-acquired data using image captioning. Autom. Constr., 112.","DOI":"10.1016\/j.autcon.2020.103116"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Dutta, A., and Zisserman, A. (2019, January 21\u201325). The VIA annotation software for images, audio and video. 
Proceedings of the 27th ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3350535"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 7\u201313). Vqa: Visual question answering. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.279"},{"key":"ref_46","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1007\/s11263-019-01247-4","article-title":"Deep learning for generic object detection: A survey","volume":"128","author":"Liu","year":"2020","journal-title":"Int. J. Comput. Vis."},{"key":"ref_48","unstructured":"Zou, Z., Shi, Z., Guo, Y., and Ye, J. (2019). Object detection in 20 years: A survey. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask r-cnn. 
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016). Ssd: Single shot multibox detector. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_53","unstructured":"Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., and Berg, A.C. (2017). Dssd: Deconvolutional single shot detector. arXiv."},{"key":"ref_54","unstructured":"Li, Z., and Zhou, F. (2017). FSSD: Feature fusion single shot multibox detector. arXiv."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_57","unstructured":"Bochkovskiy, A., Wang, C.Y., and Liao, H.Y.M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"2222","DOI":"10.1109\/TNNLS.2016.2582924","article-title":"LSTM: A search space odyssey","volume":"28","author":"Greff","year":"2016","journal-title":"IEEE Trans. Neural Netw. Learn. 
Syst."},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Wang, C., Yang, H., Bartz, C., and Meinel, C. (2016, January 15\u201319). Image captioning with deep bidirectional LSTMs. Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherland.","DOI":"10.1145\/2964284.2964299"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Yu, L., Qu, J., Gao, F., and Tian, Y. (2019). A novel hierarchical algorithm for bearing fault diagnosis based on stacked LSTM. Shock Vib., 2019.","DOI":"10.1155\/2019\/2756284"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014). Microsoft coco: Common objects in context. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-10602-1_48"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6419\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:15:31Z","timestamp":1760141731000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6419"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,25]]},"references-count":62,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2022,9]]}},"alternative-id":["s22176419"],"URL":"https:\/\/doi.org\/10.3390\/s22176419","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,8,25]]}}}