{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T14:15:49Z","timestamp":1777472149398,"version":"3.51.4"},"reference-count":52,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2018,2,20]],"date-time":"2018-02-20T00:00:00Z","timestamp":1519084800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The paper presents an emerging issue of fine-grained pedestrian action recognition that induces an advanced pre-crash safety to estimate a pedestrian intention in advance. The fine-grained pedestrian actions include visually slight differences (e.g., walking straight and crossing), which are difficult to distinguish from each other. It is believed that the fine-grained action recognition induces a pedestrian intention estimation for helpful advanced driver-assistance systems (ADAS). The following difficulties have been studied to achieve a fine-grained and accurate pedestrian action recognition: (i) In order to analyze the fine-grained motion of a pedestrian appearance in the vehicle-mounted drive recorder, a method to describe subtle changes in motion characteristics occurring in a short time is necessary; (ii) even when the background moves greatly due to the driving of the vehicle, it is necessary to detect changes in subtle motion of the pedestrian; (iii) the collection of large-scale fine-grained actions is very difficult, and therefore a relatively small database should be focused. We find out how to learn an effective recognition model with only a small-scale database. Here, we have thoroughly evaluated several types of configurations to explore an effective approach in fine-grained pedestrian action recognition without a large-scale database. 
Moreover, two different datasets have been collected in order to raise the issue. Finally, our proposal attained 91.01% on the National Traffic Safety and Environment Laboratory database (NTSEL) and 53.23% on the near-miss driving recorder database (NDRDB). These results improve on the baseline two-stream fusion convnets by +8.28% and +6.53%, respectively.<\/jats:p>","DOI":"10.3390\/s18020627","type":"journal-article","created":{"date-parts":[[2018,2,20]],"date-time":"2018-02-20T11:56:13Z","timestamp":1519127773000},"page":"627","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":25,"title":["Temporal and Fine-Grained Pedestrian Action Recognition on Driving Recorder Database"],"prefix":"10.3390","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8844-165X","authenticated-orcid":false,"given":"Hirokatsu","family":"Kataoka","sequence":"first","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yutaka","family":"Satoh","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yoshimitsu","family":"Aoki","sequence":"additional","affiliation":[{"name":"Department of Electronics and Electrical Engineering, Keio University, Yokohama 223-8522, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shoko","family":"Oikawa","sequence":"additional","affiliation":[{"name":"Tokyo Metropolitan University, Tokyo 192-0364, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1221-4252","authenticated-orcid":false,"given":"Yasuhiro","family":"Matsui","sequence":"additional","affiliation":[{"name":"National Traffic Safety and Environment Laboratory, Tokyo 182-0012, 
Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2018,2,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1239","DOI":"10.1109\/TPAMI.2009.122","article-title":"Survey of Pedestrian Detection for Advanced Driver Assistance Systems","volume":"32","author":"Geronimo","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Benenson, R., Omran, M., Hosang, J., and Schiele, B. (2014, January 6\u201312). Ten years of pedestrian detection, what have we learned?. Proceedings of the European Conference on Computer Vision Workshop (ECCVW), Zurich, Switzerland.","DOI":"10.1007\/978-3-319-16181-5_47"},{"key":"ref_3","unstructured":"Dalal, N., and Triggs, B. (2005, January 20\u201325). Histograms of Oriented Gradients for Human Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA."},{"key":"ref_4","unstructured":"Viola, P., and Jones, M. (2001, January 8\u201314). Rapid Object Detection using a Boosted Cascaded of Simple Features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1627","DOI":"10.1109\/TPAMI.2009.167","article-title":"Object Detection with Discriminatively Trained Part-Based Models","volume":"32","author":"Felzenszwalb","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_6","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the International Conference on Learning Representation (ICLR), San Diego, CA, USA."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_8","unstructured":"Zhang, S., Benenson, R., Omran, M., Hosang, J., and Schiele, B. (July, January 26). How Far are We from Solving Pedestrian Detection?. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zhang, L., Lin, L., Liang, X., and He, K. (2016, January 11\u201314). Is Faster R-CNN Doing Well for Pedestrian Detection?. Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46475-6_28"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Watanabe, T., Ito, S., and Yokoi, K. (2009, January 13\u201316). Co-occurrence Histograms of Oriented Gradients for Pedestrian Detection. Proceedings of the 3rd Pacific-Rim Symposium on Image and Video Technology (PSIVT), Tokyo, Japan.","DOI":"10.1007\/978-3-540-92957-4_4"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Kataoka, H., Tamura, K., Iwata, K., Satoh, Y., Matsui, Y., and Aoki, Y. (2014). Extended Feature Descriptor and Vehicle Motion Model with Tracking-by-detection for Pedestrian Active Safety. IEICE Trans. Inf. Syst., 296\u2013304.","DOI":"10.1587\/transinf.E97.D.296"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dollar, P., Tu, Z., Perona, P., and Belongie, S. (2009, January 7\u201310). Integral Channel Features. Proceedings of the British Machine Vision Conference (BMVC), London, UK.","DOI":"10.5244\/C.23.91"},{"key":"ref_13","unstructured":"Csurka, G., Dance, C.R., Fan, L., Willamowski, J., and Bray, C. (2004, January 11\u201314). Visual Categorization with Bags of Keypoints. 
Proceedings of the Workshop on Statistical Learning in Computer Vision (ECCVW), Prague, Czech Republic."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Jegou, H., Douze, M., Schmid, C., and Perez, P. (2010, January 13\u201318). Aggregating local descriptors into a compact image representation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA.","DOI":"10.1109\/CVPR.2010.5540039"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Perronnin, F., Sanchez, J., and Mensink, T. (2010, January 5\u201311). Improving the fisher kernel for large-scale image classification. Proceedings of the European Conference on Computer Vision (ECCV), Heraklion, Greece.","DOI":"10.1007\/978-3-642-15561-1_11"},{"key":"ref_16","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., and Reed, S. (2015, January 7\u201312). Going Deeper with Convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast R-CNN. 
Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_20","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2015, January 7\u201312). Faster R-CNN: Towards real-time object detection with region proposal networks. Proceedings of the Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016, January 11\u201314). SSD: Single Shot MultiBox Detector. Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, Faster, Stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Du, X., El-Khamy, M., Lee, J., and Davis, L.S. (2017, January 24\u201331). Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection. Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA.","DOI":"10.1109\/WACV.2017.111"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, X., Xiao, T., Jiang, Y., Shao, S., Sun, J., and Shen, C. (arXiv, 2017). 
Repulsion Loss: Detecting Pedestrians in a Crowd, arXiv.","DOI":"10.1109\/CVPR.2018.00811"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Dalal, N., Triggs, B., and Schmid, C. (2006, January 7\u201313). Human Detection using Oriented Histograms of Flow and Appearance. Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria.","DOI":"10.1007\/11744047_33"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Gonz\u00e1lez, A., V\u00e1zquez, D., Ramos, S., L\u00f3pez, A., and Amores, J. (2015, January 17\u201319). Spatiotemporal Stacked Sequential Learning for Pedestrian Detection. Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Santiago de Compostela, Spain.","DOI":"10.1007\/978-3-319-19390-8_1"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1007\/s11263-005-1838-7","article-title":"On Space-Time Interest Points","volume":"64","author":"Laptev","year":"2005","journal-title":"Int. J. Comput. Vis."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1007\/s11263-012-0594-8","article-title":"Dense Trajectories and Motion Boundary Descriptors for Action Recognition","volume":"103","author":"Wang","year":"2013","journal-title":"Int. J. Comput. Vis."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, January 1\u20138). Action Recognition with Improved Trajectories. Proceedings of the International Conference on Computer Vision (ICCV), Sydney, Australia.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning Spatiotemporal Features with 3D Convolutional Networks. Proceedings of the International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_32","unstructured":"Simonyan, K., and Zisserman, A. 
(2014, January 8\u201313). Two-stream convolutional networks for action recognition. Proceedings of the Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wang, L., Qiao, Y., and Tang, X. (2015, January 7\u201312). Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299059"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"743","DOI":"10.1109\/TPAMI.2011.155","article-title":"Pedestrian Detection: An Evaluation of the State of the Art","volume":"34","author":"Dollar","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., and Urtasun, R. (2012, January 16\u201321). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Dollar, P., Wojek, C., Schiele, B., and Perona, P. (2009, January 20\u201325). Pedestrian Detection: A Benchmark. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206631"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Menze, M., and Geiger, A. (2015, January 7\u201312). Object Scene Flow for Autonomous Vehicles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298925"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully Convolutional Networks for Semantic Segmentation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., and Urtasun, R. (2016, January 27\u201330). Monocular 3D Object Detection for Autonomous Driving. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.236"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Bai, M., Luo, W., Kundu, K., and Urtasun, R. (2016, January 8\u201316). Exploiting Semantic Information and Deep Matching for Optical Flow. Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46466-4_10"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Luo, W., Schwing, A., and Urtasun, R. (2016, January 27\u201330). Efficient Deep Learning for Stereo Matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.614"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Kundu, A., Vineet, V., and Koltun, V. (2016, January 27\u201330). Feature Space Optimization for Semantic Video Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.345"},{"key":"ref_43","unstructured":"(2018, February 09). First Workshop on Fine-Grained Visual Categorization. Available online: https:\/\/sites.google.com\/site\/cvprfgvc\/."},{"key":"ref_44","unstructured":"(2018, February 09). Hiyari-Hatto Database. Available online: http:\/\/web.tuat.ac.jp\/~smrc\/drcenter.html."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016, January 27\u201330). Convolutional Two-Stream Network Fusion for Video Action Recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.213"},{"key":"ref_46","unstructured":"Donahue, J., Jia, Y., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2014, January 21\u201326). DeCAF: A deep convolutional activation feature for generic visual recognition. Proceedings of the International Conference on Machine Learning (ICML), Beijing, China."},{"key":"ref_47","unstructured":"Wang, L., Xiong, Y., Wang, Z., and Qiao, Y. (arXiv, 2015). Towards Good Practices for Very Deep Two-Stream ConvNets, arXiv."},{"key":"ref_48","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (arXiv, 2012). UCF101: A Dataset of 101 Human Action Classes From Videos in the Wild, arXiv."},{"key":"ref_49","unstructured":"Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. (2014, January 8\u201313). Learning Deep Features for Scene Recognition using Places Database. Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Hwang, S., Park, J., Kim, N., Choi, Y., and Kweon, I.S. (2015, January 7\u201312). Multispectral Pedestrian Detection: Benchmark Dataset and Baseline. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298706"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Gonz\u00e1lez, A., Fang, Z., Socarras, Y., Serrat, J., V\u00e1zquez, D., Xu, J., and L\u00f3pez, A.M. (2016). Pedestrian Detection at Day\/Night Time with Visible and FIR Cameras: A Comparison. Sensors, 16.","DOI":"10.3390\/s16060820"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Fang, Z., V\u00e1zquez, D., and L\u00f3pez, A.M. (2017). On-Board Detection of Pedestrian Intentions. 
Sensors, 17.","DOI":"10.3390\/s17102193"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/2\/627\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T14:55:39Z","timestamp":1760194539000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/18\/2\/627"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,2,20]]},"references-count":52,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2018,2]]}},"alternative-id":["s18020627"],"URL":"https:\/\/doi.org\/10.3390\/s18020627","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,2,20]]}}}