{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T15:15:23Z","timestamp":1773155723517,"version":"3.50.1"},"reference-count":50,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2018,12,4]],"date-time":"2018-12-04T00:00:00Z","timestamp":1543881600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key R&D Program of China","award":["2016YB1200401"],"award-info":[{"award-number":["2016YB1200401"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Foreground detection, which extracts moving objects from videos, is an important and fundamental problem of video analysis. Classic methods often build background models based on some hand-craft features. Recent deep neural network (DNN) based methods can learn more effective image features by training, but most of them do not use temporal feature or use simple hand-craft temporal features. In this paper, we propose a new dual multi-scale 3D fully-convolutional neural network for foreground detection problems. It uses an encoder\u2013decoder structure to establish a mapping from image sequences to pixel-wise classification results. We also propose a two-stage training procedure, which trains the encoder and decoder separately to improve the training results. With multi-scale architecture, the network can learning deep and hierarchical multi-scale features in both spatial and temporal domains, which is proved to have good invariance for both spatial and temporal scales. We used the CDnet dataset, which is currently the largest foreground detection dataset, to evaluate our method. 
The experiment results show that the proposed method achieves state-of-the-art results in most test scenes, comparing to current DNN based methods.<\/jats:p>","DOI":"10.3390\/s18124269","type":"journal-article","created":{"date-parts":[[2018,12,4]],"date-time":"2018-12-04T11:56:18Z","timestamp":1543924578000},"page":"4269","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":23,"title":["Foreground Detection with Deeply Learned Multi-Scale Spatial-Temporal Features"],"prefix":"10.3390","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8288-9549","authenticated-orcid":false,"given":"Yao","family":"Wang","sequence":"first","affiliation":[{"name":"School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing 100044, China"},{"name":"Key Laboratory of Vehicle Advanced Manufacturing, Measuring and Control Technology (Beijing Jiaotong University), Ministry of Education, Beijing 100044, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zujun","family":"Yu","sequence":"additional","affiliation":[{"name":"School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing 100044, China"},{"name":"Key Laboratory of Vehicle Advanced Manufacturing, Measuring and Control Technology (Beijing Jiaotong University), Ministry of Education, Beijing 100044, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5436-6660","authenticated-orcid":false,"given":"Liqiang","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing 100044, China"},{"name":"Key Laboratory of Vehicle Advanced Manufacturing, Measuring and Control Technology (Beijing Jiaotong University), Ministry of Education, Beijing 100044, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2018,12,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1773","DOI":"10.1109\/TITS.2013.2266661","article-title":"Looking at Vehicles on the Road: A Survey of Vision-Based Vehicle Detection, Tracking, and Behavior Analysis","volume":"14","author":"Sivaraman","year":"2013","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"527","DOI":"10.1109\/TITS.2011.2174358","article-title":"Adaptive Multicue Background Subtraction for Robust Vehicle Counting and Classification","volume":"13","author":"Unzueta","year":"2012","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1016\/j.cosrev.2014.04.001","article-title":"Traditional and recent approaches in background modeling for foreground detection: An overview","volume":"11\u201312","author":"Bouwmans","year":"2014","journal-title":"Comput. Sci. Rev."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/j.cviu.2013.12.005","article-title":"A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos","volume":"122","author":"Sobral","year":"2014","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Maddalena, L., and Petrosino, A. (2018). Background Subtraction for Moving Object Detection in RGBD Data: A Survey. J. Imaging, 4.","DOI":"10.3390\/jimaging4050071"},{"key":"ref_6","unstructured":"Stauffer, C., and Grimson, W.E.L. (1999, January 23\u201325). Adaptive background mixture models for real-time tracking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, USA."},{"key":"ref_7","unstructured":"Vernon, D. (2000). Non-Parametric Model for Background Subtraction, Springer. 
ECCV 2000."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liao, S., Zhao, G., Kellokumpu, V., Pietikainen, M., and Li, S.Z. (2010, January 13\u201318). Modeling pixel process with scale invariant local patterns for background subtraction in complex scenes. Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA.","DOI":"10.1109\/CVPR.2010.5539817"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1016\/j.cviu.2013.10.015","article-title":"Object detection based on spatiotemporal background models","volume":"122","author":"Yoshinaga","year":"2014","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Moshe, Y., Hel-Or, H., and Hel-Or, Y. (2012, January 16\u201321). Foreground detection using spatiotemporal projection kernels. Proceedings of the Computer 2012 IEEE Conference on Vision and Pattern Recognition (CVPR), Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248056"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1109\/TIP.2014.2378053","article-title":"SuBSENSE: A Universal Change Detection Method With Local Adaptive Sensitivity","volume":"24","author":"Bilodeau","year":"2015","journal-title":"IEEE Trans. Image Process."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1778","DOI":"10.1109\/TPAMI.2005.213","article-title":"Bayesian modeling of dynamic scenes for object detection","volume":"27","author":"Sheikh","year":"2005","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. 
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_15","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2016, January 21\u201326). YOLO9000: Better, Faster, Stronger. Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"617","DOI":"10.1109\/LGRS.2018.2797538","article-title":"Multiscale Fully Convolutional Network for Foreground Object Detection in Infrared Videos","volume":"15","author":"Zeng","year":"2018","journal-title":"IEEE Geosci. Sens. 
Lett."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"635","DOI":"10.1016\/j.patcog.2017.09.040","article-title":"A deep convolutional neural network for video sequence background subtraction","volume":"76","author":"Babaee","year":"2018","journal-title":"Pattern Recognit."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"254","DOI":"10.1109\/TITS.2017.2754099","article-title":"Deep Background Modeling Using Fully Convolutional Network","volume":"19","author":"Yang","year":"2018","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Cinelli, L.P., Thomaz, L.A., Silva, A.F., Silva, E.A.B., and Netto, S.L. (2017, January 3\u20136). Foreground Segmentation for Anomaly Detection in Surveillance Videos Using Deep Residual Networks. Proceedings of the XXXV Simp\u00f3sio Brasileiro De Telecomunica\u00e7\u00f5es E Processamento De Sinais, Sao Pedro, Brazil.","DOI":"10.14209\/sbrt.2017.74"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhao, X., Chen, Y., Tang, M., and Wang, J. (2017, January 10\u201314). Joint background reconstruction and foreground segmentation via a two-stage convolutional neural network. Proceedings of the IEEE International Conference on Multimedia and Expo, Hong Kong, China.","DOI":"10.1109\/ICME.2017.8019397"},{"key":"ref_24","unstructured":"Chen, Y., Wang, J., Zhu, B., Tang, M., Lu, H., and Member, S. (2017). Pixel-wise Deep Sequence Learning for Moving Object Detection. IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1016\/j.patrec.2016.09.014","article-title":"Interactive deep learning method for segmenting moving objects","volume":"96","author":"Wang","year":"2017","journal-title":"Pattern Recognit. Lett."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Braham, M., and Droogenbroeck, M.V. (2016, January 23\u201325). 
Deep Background Subtraction with Scene-Specific Convolutional Neural Networks. Proceedings of the 23rd International Conference on System, Signals and Image Processing, Bratislava, Slovakia.","DOI":"10.1109\/IWSSIP.2016.7502717"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning Spatiotemporal Features with 3D Convolutional Networks. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1709","DOI":"10.1109\/TIP.2010.2101613","article-title":"ViBe: A Universal Background Subtraction Algorithm for Video Sequences","volume":"20","author":"Barnich","year":"2011","journal-title":"IEEE Trans. Image Process."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1016\/j.cviu.2013.12.003","article-title":"A texton-based kernel density estimation approach for background modeling under extreme conditions","volume":"122","author":"Spampinato","year":"2014","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1109\/TGRS.1990.572934","article-title":"Texture Unit, Texture Spectrum, And Texture Analysis","volume":"28","author":"He","year":"1990","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"831","DOI":"10.1109\/34.868684","article-title":"A Bayesian computer vision system for modeling human interactions","volume":"22","author":"Oliver","year":"2000","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Monnet, A., Mittal, A., Paragios, N., and Ramesh, V. (2003, January 13\u201316). Background modeling and subtraction of dynamic scenes. 
Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France.","DOI":"10.1109\/ICCV.2003.1238641"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1168","DOI":"10.1109\/TIP.2008.924285","article-title":"A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications","volume":"17","author":"Maddalena","year":"2008","journal-title":"IEEE Trans. Image Process."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Maddalena, L., and Petrosino, A. (2012, January 16\u201321). The SOBS algorithm: What are the limits?. Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA.","DOI":"10.1109\/CVPRW.2012.6238922"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1016\/j.patcog.2014.09.009","article-title":"Self-adaptive SOM-CNN neural system for dynamic object detection in normal and complex scenarios","volume":"48","year":"2015","journal-title":"Pattern Recognit."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Maddalena, L., and Petrosino, A. (2018). Self-organizing background subtraction using color and depth data. Multimed. Tools Appl.","DOI":"10.1007\/s11042-018-6741-7"},{"key":"ref_37","unstructured":"Chacon, M., Ramirez, G., and Gonzalez-Duarte, S. (2013, January 4\u20139). Improvement of a neural-fuzzy motion detection vision model for complex scenario conditions. Proceedings of the International Joint Conference on Neural Networks, Dallas, TX, USA."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1614","DOI":"10.1109\/TNN.2007.896861","article-title":"Neural Network Approach to Background Modeling for Video Object Segmentation","volume":"18","author":"Culibrk","year":"2007","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Zeng, D., Zhu, M., and Kuijper, A. (arXiv, 2018). 
Combining Background Subtraction Algorithms with Convolutional Neural Network, arXiv.","DOI":"10.1117\/1.JEI.28.1.013011"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Sultana, M., Mahmood, A., Javed, S., and Jung, S.K. (arXiv, 2018). Unsupervised Deep Context Prediction for Background Foreground Separation, arXiv.","DOI":"10.1007\/s00138-018-0993-0"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Bakkay, M.C., Rashwan, H.A., Salmane, H., Khoudour, L., Puigtt, D., and Ruichek, Y. (2018, January 7\u201310). BSCGAN: Deep Background Subtraction with Conditional Generative Adversarial Networks. Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece.","DOI":"10.1109\/ICIP.2018.8451603"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"23023","DOI":"10.1007\/s11042-017-5460-9","article-title":"End-to-end video background subtraction with 3d convolutional neural networks","volume":"77","author":"Sakkos","year":"2018","journal-title":"Multimed. Tools Appl."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"43450","DOI":"10.1109\/ACCESS.2018.2861223","article-title":"A 3D Atrous Convolutional Long Short-Term Memory Network for Background Subtraction","volume":"6","author":"Hu","year":"2018","journal-title":"IEEE Access"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs","volume":"40","author":"Chen","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Li, F.-F. (2014, January 23). Large-Scale Video Classification with Convolutional Neural Networks. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_47","unstructured":"Kingma, D.P., and Ba, J.L. (2015, January 7\u20139). Adam: A Method for Stochastic Optimization. Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Wang, Y., Jodoin, P.M., Porikli, F., Konrad, J., Benezeth, Y., and Ishwar, P. (2014, January 23\u201328). CDnet 2014: An Expanded Change Detection Benchmark Dataset. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA.","DOI":"10.1109\/CVPRW.2014.126"},{"key":"ref_49","unstructured":"Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., and Devin, M. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, Google Research. Technical Report."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"St-Charles, P.L., Bilodeau, G.A., and Bergevin, R. (2015, January 5\u20139). A Self\u2013Adjusting Approach to Change Detection Based on Background Word Consensus. 
Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV.2015.137"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/12\/4269\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:31:09Z","timestamp":1760196669000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/12\/4269"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,12,4]]},"references-count":50,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2018,12]]}},"alternative-id":["s18124269"],"URL":"https:\/\/doi.org\/10.3390\/s18124269","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,12,4]]}}}