{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,22]],"date-time":"2026-03-22T06:35:26Z","timestamp":1774161326258,"version":"3.50.1"},"reference-count":59,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2018,8,5]],"date-time":"2018-08-05T00:00:00Z","timestamp":1533427200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61772052"],"award-info":[{"award-number":["61772052"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>The emergence of new wearable technologies, such as action cameras and smart glasses, has driven the use of the first-person perspective in computer applications. This field is now attracting the attention and investment of researchers aiming to develop methods to process first-person vision (FPV) video. The current approaches present particular combinations of different image features and quantitative methods to accomplish specific objectives, such as object detection, activity recognition, user\u2013machine interaction, etc. FPV-based navigation is necessary in some special areas, where Global Positioning System (GPS) or other radio-wave strength methods are blocked, and is especially helpful for visually impaired people. In this paper, we propose a hybrid structure with a convolutional neural network (CNN) and local image features to achieve FPV pedestrian navigation. A novel end-to-end trainable global pooling operator, called AlphaMEX, has been designed to improve the scene classification accuracy of CNNs. 
A scale-invariant feature transform (SIFT)-based tracking algorithm is employed for movement estimation and trajectory tracking of the person through each frame of FPV images. Experimental results demonstrate the effectiveness of the proposed method. The top-1 error rate of the proposed AlphaMEX-ResNet outperforms the original ResNet (k = 12) by 1.7% on the ImageNet dataset. The CNN-SIFT hybrid pedestrian navigation system reaches 0.57 m average absolute error, which is an adequate accuracy for pedestrian navigation. Both positions and movements can be well estimated by the proposed pedestrian navigation algorithm with a single wearable camera.<\/jats:p>","DOI":"10.3390\/rs10081229","type":"journal-article","created":{"date-parts":[[2018,8,7]],"date-time":"2018-08-07T03:44:18Z","timestamp":1533613458000},"page":"1229","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["A CNN-SIFT Hybrid Pedestrian Navigation Method Based on First-Person Vision"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3508-027X","authenticated-orcid":false,"given":"Qi","family":"Zhao","sequence":"first","affiliation":[{"name":"School of Electronic and Information Engineering, Beihang University, Xueyuan Road, Beijing 100191, China"}]},{"given":"Boxue","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Electronic and Information Engineering, Beihang University, Xueyuan Road, Beijing 100191, China"}]},{"given":"Shuchang","family":"Lyu","sequence":"additional","affiliation":[{"name":"School of Electronic and Information Engineering, Beihang University, Xueyuan Road, Beijing 100191, China"}]},{"given":"Hong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Image Processing Center, Beihang University, Xueyuan Road, Beijing 100191, China"}]},{"given":"Daniel","family":"Sun","sequence":"additional","affiliation":[{"name":"Data61, CSIRO, Canberra, ACT 
2601, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9005-7112","authenticated-orcid":false,"given":"Guoqiang","family":"Li","sequence":"additional","affiliation":[{"name":"School of Software, Shanghai Jiao Tong University, Shanghai 200240, China"}]},{"given":"Wenquan","family":"Feng","sequence":"additional","affiliation":[{"name":"School of Electronic and Information Engineering, Beihang University, Xueyuan Road, Beijing 100191, China"}]}],"member":"1968","published-online":{"date-parts":[[2018,8,5]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"744","DOI":"10.1109\/TCSVT.2015.2409731","article-title":"The evolution of First-Person Vision methods: A survey","volume":"25","author":"Betancourt","year":"2015","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"2442","DOI":"10.1109\/JPROC.2012.2200554","article-title":"First-person vision","volume":"100","author":"Kanade","year":"2012","journal-title":"Proc. IEEE"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Li, C., and Kitani, K.M. (2013, January 1\u20138). Model recommendation with virtual probes for egocentric hand detection. Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, Australia.","DOI":"10.1109\/ICCV.2013.326"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Li, C., and Kitani, K.M. (2013, January 23\u201328). Pixel-level hand detection in ego-centric videos. Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), Orlando, FL, USA.","DOI":"10.1109\/CVPR.2013.458"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wang, H., Choudhury, R.R., Nelakuditi, S., and Bao, X. (2013, January 26\u201327). InSight: Recognizing humans without face recognition. 
Proceedings of the 14th Workshop on Mobile Computing Systems and Applications, Jekyll Island, GA, USA.","DOI":"10.1145\/2444776.2444786"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"114","DOI":"10.1186\/1743-0003-10-114","article-title":"Hand contour detection in wearable camera video using an adaptive histogram region of interest","volume":"10","author":"Zariffa","year":"2013","journal-title":"J. Neuroeng. Rehabilit."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Narayan, S., Kankanhalli, M.S., and Ramakrishnan, K.R. (2014, January 23\u201328). Action and interaction recognition in first-person videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA.","DOI":"10.1109\/CVPRW.2014.82"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Matsuo, K., Yamada, K., Ueno, S., and Naito, S. (2014, January 23\u201328). An attention-based activity recognition for egocentric video. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Columbus, OH, USA.","DOI":"10.1109\/CVPRW.2014.87"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Damen, D., Leelasawassuk, T., Haines, O., Calway, A., and Mayol-Cuevas, W. (2014, January 6\u20137). Multi-user egocentric online system for unsupervised assistance on object usage. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-16199-0_34"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Poleg, Y., Arora, C., and Peleg, S. (2014, January 23\u201328). Temporal segmentation of egocentric videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.325"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zheng, K., Lin, Y., Zhou, Y., Salvi, D., Fan, X., Guo, D., and Wang, S. (2014, January 6\u20137). 
Video-based action detection using multiple wearable cameras. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-16178-5_51"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Castle, R.O., Gawley, D.J., Klein, G., and Murray, D.W. (2007, January 10\u201313). Video-rate recognition and localization for wearable cameras. Proceedings of the British Machine Vision Conference, Coventry, UK.","DOI":"10.5244\/C.21.112"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Castle, R.O., Gawley, D.J., Klein, G., and Murray, D.W. (2007, January 10\u201314). Towards simultaneous recognition, localization and mapping for hand-held and wearable cameras. Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Roma, Italy.","DOI":"10.1109\/ROBOT.2007.364109"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1052","DOI":"10.1109\/TPAMI.2007.1049","article-title":"MonoSLAM: Real-time single camera SLAM","volume":"29","author":"Davison","year":"2007","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_15","unstructured":"Castle, R., Klein, G., and Murray, D.W. (October, January 28). Video-rate localization in multiple maps for wearable augmented reality. Proceedings of the 12th IEEE International Symposium on Wearable Computers (ISWC), Pittsburgh, PA, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Kameda, Y., and Ohta, Y. (2010, January 23\u201326). Image retrieval of First-Person Vision for pedestrian navigation in urban area. Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey.","DOI":"10.1109\/ICPR.2010.1140"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Li, T., Zhang, H., Gao, Z., Chen, Q., and Niu, X. (2018). High-Accuracy Positioning in Urban Environments Using Single-Frequency Multi-GNSS RTK\/MEMS-IMU Integration. 
Remote Sens., 10.","DOI":"10.3390\/rs10020205"},{"key":"ref_18","first-page":"1097","article-title":"ImageNet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Lowe, D.G. (1999, January 20\u201327). Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece.","DOI":"10.1109\/ICCV.1999.790410"},{"key":"ref_20","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_21","unstructured":"Glorot, X., Bordes, A., and Bengio, Y. (2011, January 11\u201313). Deep sparse rectifier neural networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA."},{"key":"ref_22","unstructured":"Krizhevsky, A., and Hinton, G. (2009, April 08). Learning Multiple Layers of Features from Tiny Images. Available online: http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.222.9220&rep=rep1&type=pdf."},{"key":"ref_23","unstructured":"Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A.Y. (2011, January 12\u201317). Reading digits in natural images with unsupervised feature learning. Proceedings of the Advances in Neural Information Processing Systems Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Fathi, A., Hodgins, J.K., and Rehg, J.M. (2012, January 16\u201321). Social interactions: A first-person perspective. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6247805"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Yarbus, A.L. (1967). Eye movements during perception of complex objects. Eye Movements and Vision, Springer.","DOI":"10.1007\/978-1-4899-5379-7"},{"key":"ref_27","unstructured":"Bisio, I., Delfino, A., Lavagetto, F., and Marchese, M. (2016). Opportunistic detection methods for emotion-aware smartphone applications. Psychology and Mental Health: Concepts, Methodologies, Tools, and Applications, IGI Global."},{"key":"ref_28","unstructured":"Philipose, M., and Ren, X. (2009, January 20\u201325). Egocentric recognition of handled objects: Benchmark and analysis. Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Miami, FL, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Betancourt, A., L\u00f3pez, M.M., Regazzoni, C.S., and Rauterberg, M. (2014, January 23\u201328). A sequential classifier for hand detection in the framework of egocentric vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Columbus, OH, USA.","DOI":"10.1109\/CVPRW.2014.92"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Lu, Z., and Grauman, K. (2013, January 23\u201328). Story-driven summarization for egocentric video. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.350"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Hodges, S., Williams, L., Berry, E., Izadi, S., Srinivasan, J., Butler, A., and Wood, K. (2006, January 17\u201321). SenseCam: A retrospective memory aid. Proceedings of the International Conference on Ubiquitous Computing, Orange County, CA, USA.","DOI":"10.1007\/11853565_11"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Sundaram, S., and Cuevas, W.W.M. (2009, January 20\u201325). High level activity recognition using low resolution wearable vision. Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5204355"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ogaki, K., Kitani, K.M., Sugano, Y., and Sato, Y. (2012, January 16\u201321). Coupling eye-motion and ego-motion features for first-person activity recognition. Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA.","DOI":"10.1109\/CVPRW.2012.6239188"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask r-cnn. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Takizawa, H., Orita, K., Aoyagi, M., Ezaki, N., and Mizuno, S. (2015, January 2\u20137). A Spot Navigation System for the Visually Impaired by Use of SIFT-Based Image Matching. Proceedings of the International Conference on Universal Access in Human-Computer Interaction, Los Angeles, CA, USA.","DOI":"10.1007\/978-3-319-20687-5_16"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Hu, M., Chen, J., and Shi, C. 
(2015, January 8\u201312). Three-dimensional mapping based on SIFT and RANSAC for mobile robot. Proceedings of the 2015 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), Shenyang, China.","DOI":"10.1109\/CYBER.2015.7287924"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1587","DOI":"10.1109\/TNNLS.2017.2676130","article-title":"Convolution in convolution for network in network","volume":"29","author":"Pang","year":"2017","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1163","DOI":"10.1109\/TNNLS.2015.2495161","article-title":"Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining","volume":"27","author":"Zhang","year":"2016","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Kim, Y. (arXiv, 2014). Convolutional neural networks for sentence classification, arXiv.","DOI":"10.3115\/v1\/D14-1181"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_41","unstructured":"Lin, M., Chen, Q., and Yan, S. (arXiv, 2013). Network in network, arXiv."},{"key":"ref_42","unstructured":"Srivastava, R.K., Greff, K., and Schmidhuber, J. (2015, January 7\u201312). Training very deep networks. Proceedings of the Twenty-ninth Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Weinberger, K.Q., and Maaten, L. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1016\/j.jcss.2016.11.002","article-title":"Elasticity management of streaming data analytics flows on clouds","volume":"89","author":"Khoshkbarforoushha","year":"2017","journal-title":"J. Comput. Syst. Sci."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"120","DOI":"10.1109\/TETC.2016.2597546","article-title":"Distribution based workload modelling of continuous queries in clouds","volume":"5","author":"Khoshkbarforoushha","year":"2017","journal-title":"IEEE Trans. Emerg. Top. Comput."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"2126","DOI":"10.1109\/TPDS.2013.272","article-title":"Task-tree based large-scale mosaicking for massive remote sensed imageries with dynamic dag scheduling","volume":"25","author":"Ma","year":"2014","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1497","DOI":"10.1109\/TPDS.2014.2322362","article-title":"A parallel file system with application-aware data layout policies for massive remote sensing image processing in digital earth","volume":"26","author":"Wang","year":"2015","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"1336","DOI":"10.1109\/TC.2014.2317188","article-title":"CloudGenius: A hybrid decision support method for automating the migration of web application clusters to public clouds","volume":"64","author":"Menzel","year":"2015","journal-title":"IEEE Trans. 
Comput."},{"key":"ref_50","unstructured":"Zeiler, M.D., and Fergus, R. (arXiv, 2013). Stochastic pooling for regularization of deep convolutional neural networks, arXiv."},{"key":"ref_51","unstructured":"Lee, C.Y., Gallagher, P.W., and Tu, Z. (2016, January 7\u201311). Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Cohen, N., Sharir, O., and Shashua, A. (2016, January 27\u201330). Deep simnets. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.517"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Hartley, R., and Zisserman, A. (2003). Multiple View Geometry in Computer Vision, Cambridge University Press.","DOI":"10.1017\/CBO9780511811685"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"138","DOI":"10.1006\/cviu.1999.0832","article-title":"MLESAC: A new robust estimator with application to estimating image geometry","volume":"78","author":"Torr","year":"2000","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_55","unstructured":"Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013, January 16\u201321). On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA."},{"key":"ref_56","unstructured":"Springenberg, J.T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (arXiv, 2014). Striving for simplicity: The all convolutional net, arXiv."},{"key":"ref_57","unstructured":"Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2015, January 10\u201312). Deeply-supervised nets. 
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K.Q. (2016, January 11\u201314). Deep networks with stochastic depth. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46493-0_39"},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"1150","DOI":"10.1109\/TPAMI.2006.141","article-title":"Full-frame video stabilization with motion inpainting","volume":"28","author":"Matsushita","year":"2006","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/10\/8\/1229\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:16:43Z","timestamp":1760195803000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/10\/8\/1229"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,8,5]]},"references-count":59,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2018,8]]}},"alternative-id":["rs10081229"],"URL":"https:\/\/doi.org\/10.3390\/rs10081229","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,8,5]]}}}