{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T07:07:13Z","timestamp":1760425633214,"version":"build-2065373602"},"reference-count":42,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2018,9,1]],"date-time":"2018-09-01T00:00:00Z","timestamp":1535760000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Nowadays, video surveillance has become ubiquitous with the quick development of artificial intelligence. Multi-object detection (MOD) is a key step in video surveillance and has been widely studied for a long time. The majority of existing MOD algorithms follow the \u201cdivide and conquer\u201d pipeline and utilize popular machine learning techniques to optimize algorithm parameters. However, this pipeline is usually suboptimal since it decomposes the MOD task into several sub-tasks and does not optimize them jointly. In addition, the frequently used supervised learning methods rely on the labeled data which are scarce and expensive to obtain. Thus, we propose an end-to-end Unsupervised Multi-Object Detection framework for video surveillance, where a neural model learns to detect objects from each video frame by minimizing the image reconstruction error. Moreover, we propose a Memory-Based Recurrent Attention Network to ease detection and training. 
The proposed model was evaluated on both synthetic and real datasets, exhibiting its potential.<\/jats:p>","DOI":"10.3390\/sym10090375","type":"journal-article","created":{"date-parts":[[2018,9,3]],"date-time":"2018-09-03T10:50:51Z","timestamp":1535971851000},"page":"375","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Unsupervised Multi-Object Detection for Video Surveillance Using Memory-Based Recurrent Attention Networks"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1658-0399","authenticated-orcid":false,"given":"Zhen","family":"He","sequence":"first","affiliation":[{"name":"College of Intelligence Science, National University of Defense Technology, Changsha 410073, China"},{"name":"Department of Computer Science, University College London, London WC1E 6BT, UK"}]},{"given":"Hangen","family":"He","sequence":"additional","affiliation":[{"name":"College of Intelligence Science, National University of Defense Technology, Changsha 410073, China"}]}],"member":"1968","published-online":{"date-parts":[[2018,9,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1627","DOI":"10.1109\/TPAMI.2009.167","article-title":"Object detection with discriminatively trained part-based models","volume":"32","author":"Felzenszwalb","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1007\/BF00994018","article-title":"Support-vector networks","volume":"20","author":"Cortes","year":"1995","journal-title":"Mach. Learn."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_5","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2015, January 7\u201312). Faster r-cnn: Towards real-time object detection with region proposal networks. Proceedings of the Advances in neural information processing systems, Montreal, QC, Canada."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask r-cnn. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2179","DOI":"10.1109\/TPAMI.2008.260","article-title":"Monocular pedestrian detection: Survey and experiments","volume":"31","author":"Enzweiler","year":"2009","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Li, F.-F. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","article-title":"The pascal visual object classes (voc) challenge","volume":"88","author":"Everingham","year":"2010","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1231","DOI":"10.1177\/0278364913491297","article-title":"Vision meets robotics: The KITTI dataset","volume":"32","author":"Geiger","year":"2013","journal-title":"Int. J. Robot. Res."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014). Microsoft coco: Common objects in context. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"541","DOI":"10.1162\/neco.1989.1.4.541","article-title":"Backpropagation applied to handwritten zip code recognition","volume":"1","author":"LeCun","year":"1989","journal-title":"Neural Comput."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_14","unstructured":"Huang, L., Yang, Y., Deng, Y., and Yu, Y. (arXiv, 2015). Densebox: Unifying landmark localization with end to end object detection, arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016). Ssd: Single shot multibox detector. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Ristani, E., Solera, F., Zou, R., Cucchiara, R., and Tomasi, C. (2016). 
Performance measures and a data set for multi-target, multi-camera tracking. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-48881-3_2"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"533","DOI":"10.1038\/323533a0","article-title":"Learning representations by back-propagating errors","volume":"323","author":"Rumelhart","year":"1986","journal-title":"Nature"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"2451","DOI":"10.1162\/089976600300015015","article-title":"Learning to forget: Continual prediction with LSTM","volume":"12","author":"Gers","year":"2000","journal-title":"Neural Comput."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (arXiv, 2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_21","unstructured":"Jang, E., Gu, S., and Poole, B. (arXiv, 2016). Categorical Reparameterization with Gumbel-Softmax, arXiv."},{"key":"ref_22","unstructured":"Jaderberg, M., Simonyan, K., and Zisserman, A. (2015, January 7\u201312). Spatial transformer networks. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_24","unstructured":"Graves, A., Wayne, G., and Danihelka, I. (arXiv, 2014). 
Neural turing machines, arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"471","DOI":"10.1038\/nature20101","article-title":"Hybrid computing using a neural network with dynamic external memory","volume":"538","author":"Graves","year":"2016","journal-title":"Nature"},{"key":"ref_26","unstructured":"Eslami, S.A., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., and Hinton, G.E. (2016, January 5\u201310). Attend, infer, repeat: Fast scene understanding with generative models. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_27","unstructured":"Kingma, D., and Ba, J. (arXiv, 2014). Adam: A method for stochastic optimization, arXiv."},{"key":"ref_28","unstructured":"Stiefelhagen, R., Bernardin, K., Bowers, R., Garofolo, J., Mostefa, D., and Soundararajan, P. (2006). The CLEAR 2006 evaluation. International Evaluation Workshop on Classification of Events, Activities and Relationships, Springer."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Bloisi, D., and Iocchi, L. (2012). Independent multimodal background subtraction. CompIMAGE, CRC Press.","DOI":"10.1201\/b12753-8"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yang, B., Yan, J., Lei, Z., and Li, S.Z. (2016, January 27\u201330). Craft objects from images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.650"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Ren, J., Chen, X., Liu, J., Sun, W., Pang, J., Yan, Q., Tai, Y.-W., and Xu, L. (2017, January 21\u201326). Accurate single stage detector using recurrent rolling convolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.87"},{"key":"ref_32","unstructured":"Kulkarni, T.D., Whitney, W.F., Kohli, P., and Tenenbaum, J. (2015, January 7\u201312). Deep convolutional inverse graphics network. 
Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_33","unstructured":"Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016, January 5\u201310). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_34","unstructured":"Rolfe, J.T. (arXiv, 2016). Discrete variational autoencoders, arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"593","DOI":"10.1162\/NECO_a_00086","article-title":"Learning a generative model of images by factoring appearance and shape","volume":"23","author":"Heess","year":"2011","journal-title":"Neural Comput."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Moreno, P., Williams, C.K., Nash, C., and Kohli, P. (2016, January 5\u201310). Overcoming occlusion with inverse graphics. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain.","DOI":"10.1007\/978-3-319-49409-8_16"},{"key":"ref_37","unstructured":"Huang, J., and Murphy, K. (arXiv, 2015). Efficient inference in occlusion-aware generative models of images, arXiv."},{"key":"ref_38","unstructured":"Yan, X., Yang, J., Yumer, E., Guo, Y., and Lee, H. (2016, January 5\u201310). Perspective transformer nets: Learning single-view 3d object reconstruction without 3D supervision. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_39","unstructured":"Rezende, D.J., Eslami, S.A., Mohamed, S., Battaglia, P., Jaderberg, M., and Heess, N. (2016, January 5\u201310). Unsupervised learning of 3D structure from images. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Stewart, R., and Ermon, S. (2017, January 4\u20139). 
Label-free supervision of neural networks with physics and domain knowledge. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.10934"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Wu, J., Tenenbaum, J.B., and Kohli, P. (2017, January 21\u201326). Neural scene de-rendering. Proceedings of the Computer Vision Foundation, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.744"},{"key":"ref_42","unstructured":"Graves, A. (arXiv, 2016). Adaptive computation time for recurrent neural networks, arXiv."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/10\/9\/375\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:18:18Z","timestamp":1760195898000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/10\/9\/375"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,9,1]]},"references-count":42,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2018,9]]}},"alternative-id":["sym10090375"],"URL":"https:\/\/doi.org\/10.3390\/sym10090375","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2018,9,1]]}}}