{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T03:24:18Z","timestamp":1762917858027,"version":"build-2065373602"},"reference-count":44,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2018,3,4]],"date-time":"2018-03-04T00:00:00Z","timestamp":1520121600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"111 Project of China","award":["B14010"],"award-info":[{"award-number":["B14010"]}]},{"name":"Chang Jiang Scholars Programme","award":["T2012122"],"award-info":[{"award-number":["T2012122"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>With the development of deep neural networks, many object detection frameworks have shown great success in the fields of smart surveillance, self-driving cars, and facial recognition. However, the data sources are usually videos, and the object detection frameworks are mostly established on still images and only use the spatial information, which means that the feature consistency cannot be ensured because the training procedure loses temporal information. To address these problems, we propose a single, fully-convolutional neural network-based object detection framework that involves temporal information by using Siamese networks. In the training procedure, first, the prediction network combines the multiscale feature map to handle objects of various sizes. Second, we introduce a correlation loss by using the Siamese network, which provides neighboring frame features. This correlation loss represents object co-occurrences across time to aid the consistent feature generation. Since the correlation loss should use the information of the track ID and detection label, our video object detection network has been evaluated on the large-scale ImageNet VID dataset where it achieves a 69.5% mean average precision (mAP).<\/jats:p>","DOI":"10.3390\/s18030774","type":"journal-article","created":{"date-parts":[[2018,3,6]],"date-time":"2018-03-06T07:37:25Z","timestamp":1520321845000},"page":"774","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["Deep Spatial-Temporal Joint Feature Representation for Video Object Detection"],"prefix":"10.3390","volume":"18","author":[{"given":"Baojun","family":"Zhao","sequence":"first","affiliation":[{"name":"School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China"},{"name":"Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5620-406X","authenticated-orcid":false,"given":"Boya","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China"},{"name":"Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Linbo","family":"Tang","sequence":"additional","affiliation":[{"name":"School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China"},{"name":"Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuqi","family":"Han","sequence":"additional","affiliation":[{"name":"School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China"},{"name":"Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenzheng","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China"},{"name":"Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2018,3,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"541","DOI":"10.1162\/neco.1989.1.4.541","article-title":"Backpropagation applied to handwritten zip code recognition","volume":"1","author":"LeCun","year":"1989","journal-title":"Neural Comput."},{"key":"ref_2","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"2012","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_3","unstructured":"Simonyan, K., and Zisserman, A. (arXiv, 2014). Very deep convolutional networks for large-scale image recognition, arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Los Alamitos, CA, USA.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster r-cnn: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_7","first-page":"379","article-title":"R-fcn: Object detection via region-based fully convolutional networks","volume":"2016","author":"Dai","year":"2016","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zhong, J., Lei, T., and Yao, G. (2017). Robust Vehicle Detection in Aerial Images Based on Cascaded Convolutional Neural Networks. Sensors, 17.","DOI":"10.3390\/s17122720"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A.C. (2016). Ssd: Single shot multibox detector. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Oh, S.I., and Kang, H.B. (2017). Object Detection and Classification by Decision-Level Fusion for Intelligent Vehicle Systems. Sensors, 17.","DOI":"10.3390\/s17010207"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhu, X., Xiong, Y., Dai, J., Yuan, L., and Wei, Y. (arXiv, 2016). Deep feature flow for video recognition, arXiv.","DOI":"10.1109\/CVPR.2017.441"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., and Wang, X. (2017). T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Systems Video Technol.","DOI":"10.1109\/TCSVT.2017.2736553"},{"key":"ref_14","unstructured":"Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., and Huang, T.S. (arXiv, 2016). Seq-nms for video object detection, arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kang, K., Ouyang, W., Li, H., and Wang, X. (2016, January 27\u201330). Object detection from video tubelets with convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR.2016.95"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Lee, B., Erdenee, E., Jin, S., Nam, M.Y., Jung, Y.G., and Rhee, P.K. (2016). Multi-class multi-object tracking using changing point detection. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-48881-3_6"},{"key":"ref_17","unstructured":"Chopra, S., Hadsell, R., and LeCun, Y. (2005, January 20\u201325). Learning a similarity metric discriminatively, with application to face verification. Proceedings of the IEEE CVPR Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA."},{"key":"ref_18","first-page":"207","article-title":"Distance metric learning for large margin nearest neighbor classification","volume":"10","author":"Weinberger","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bodla, N., Singh, B., Chellappa, R., and Davis, L.S. (2017, January 22\u201329). Soft-NMS\u2014Improving Object Detection with One Line of Code. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.593"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008, January 23\u201328). A discriminatively trained, multiscale, deformable part model. Proceedings of the IEEE CVPR Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA.","DOI":"10.1109\/CVPR.2008.4587597"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1007\/s11263-013-0620-5","article-title":"Selective search for object recognition","volume":"104","author":"Uijlings","year":"2013","journal-title":"Int. J. Comput. Vis."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"Imagenet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Gkioxari, G., and Malik, J. (2015, January 7\u201312). Finding action tubes. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298676"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Peng, X., and Schmid, C. (2016). Multi-region two-stream R-CNN for action detection. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46493-0_45"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Hou, R., Chen, C., and Shah, M. (2017, January 22\u201329). Tube convolutional neural network (T-CNN) for action detection in videos. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.620"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Li, C., Stevens, A., Chen, C., Pu, Y., Gan, Z., and Carin, L. (2016, January 27\u201330). Learning Weight Uncertainty with Stochastic Gradient MCMC for Shape Classification. Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR.2016.611"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Luciano, L., and Hamza, A.B. (2017). Deep learning with geodesic moments for 3D shape classification. Pattern Recognit. Lett.","DOI":"10.1016\/j.patrec.2017.05.011"},{"key":"ref_29","unstructured":"Nair, V., and Hinton, G.E. (2010, January 21\u201324). Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Erhan, D., Szegedy, C., Toshev, A., and Anguelov, D. (2014, January 23\u201328). Scalable object detection using deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.276"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Hosang, J., Benenson, R., and Schiele, B. (arXiv, 2014). How good are detection proposals, really?, arXiv.","DOI":"10.5244\/C.28.24"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Shrivastava, A., Gupta, A., and Girshick, R. (2016, January 27\u201330). Training region-based object detectors with online hard example mining. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR.2016.89"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1214\/aoms\/1177703732","article-title":"Robust Estimation of a Location Parameter","volume":"35","author":"Huber","year":"1964","journal-title":"Ann. Math. Stat."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1109\/TPAMI.2014.2345390","article-title":"High-speed tracking with kernelized correlation filters","volume":"37","author":"Henriques","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"26877","DOI":"10.3390\/s151026877","article-title":"Visual Tracking Based on Extreme Learning Machine and Sparse Representation","volume":"15","author":"Baoxian","year":"2015","journal-title":"Sensors"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., and Torr, P.H. (2016). Fully-convolutional siamese networks for object tracking. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Zhao, Z., Han, Y., Xu, T., Li, X., Song, H., and Luo, J. (2017). A Reliable and Real-Time Tracking Method with Color Distribution. Sensors, 17.","DOI":"10.3390\/s17102303"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zhu, X., Wang, Y., Dai, J., Yuan, L., and Wei, Y. (arXiv, 2017). Flow-Guided Feature Aggregation for Video Object Detection, arXiv.","DOI":"10.1109\/ICCV.2017.52"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Kang, K., Li, H., Xiao, T., Ouyang, W., Yan, J., Liu, X., and Wang, X. (2017, January 21\u201326). Object detection in videos with tubelet proposal networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, HI, USA.","DOI":"10.1109\/CVPR.2017.101"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Kwak, S., Cho, M., Laptev, I., and Ponce, J. (2015, January 7\u201313). Unsupervised Object Discovery and Tracking in Video Collections. Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA.","DOI":"10.1109\/ICCV.2015.363"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Tripathi, S., Lipton, Z., Belongie, S., and Nguyen, T. (2016, January 19\u201322). Context Matters: Refining Object Detection in Video with Recurrent Neural Networks. Proceedings of the British Machine Vision Conference, York, UK.","DOI":"10.5244\/C.30.44"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Lu, Y., Lu, C., and Tang, C.K. (2017, January 22\u201329). Online Video Object Detection Using Association LSTM. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.257"},{"key":"ref_43","unstructured":"Glorot, X., and Bengio, Y. (2010, January 23\u201324). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sanya, China."},{"key":"ref_44","unstructured":"Ferrari, V., Schmid, C., Civera, J., Leistner, C., and Prest, A. (2012, January 16\u201321). Learning object class detectors from weakly annotated video. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/3\/774\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T14:57:28Z","timestamp":1760194648000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/3\/774"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,3,4]]},"references-count":44,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2018,3]]}},"alternative-id":["s18030774"],"URL":"https:\/\/doi.org\/10.3390\/s18030774","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2018,3,4]]}}}