{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T10:05:12Z","timestamp":1776852312590,"version":"3.51.2"},"reference-count":336,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2019,11,22]],"date-time":"2019-11-22T00:00:00Z","timestamp":1574380800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2019,11,22]],"date-time":"2019-11-22T00:00:00Z","timestamp":1574380800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003246","name":"Nederlandse Organisatie voor Wetenschappelijk Onderzoek","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100003246","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Multimed Info Retr"],"published-print":{"date-parts":[[2020,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Higher dimensional data such as video and 3D are the leading edge of multimedia retrieval and computer vision research. In this survey, we give a comprehensive overview and key insights into the state of the art of higher dimensional features from deep learning and also traditional approaches. Current approaches are frequently using 3D information from the sensor or are using 3D in modeling and understanding the 3D world. With the growth of prevalent application areas such as 3D games, self-driving automobiles, health monitoring and sports activity training, a wide variety of new sensors have allowed researchers to develop feature description models beyond 2D. Although higher dimensional data enhance the performance of methods on numerous tasks, they can also introduce new challenges and problems. 
The higher dimensionality of the data often leads to more complicated structures which present additional problems in both extracting meaningful content and in adapting it for current machine learning algorithms. Due to the major importance of the evaluation process, we also present an overview of the current datasets and benchmarks. Moreover, based on more than 330 papers from this study, we present the major challenges and future directions.<\/jats:p>","DOI":"10.1007\/s13735-019-00183-w","type":"journal-article","created":{"date-parts":[[2019,11,22]],"date-time":"2019-11-22T07:04:19Z","timestamp":1574406259000},"page":"135-170","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":152,"title":["A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision"],"prefix":"10.1007","volume":"9","author":[{"given":"Theodoros","family":"Georgiou","sequence":"first","affiliation":[]},{"given":"Yu","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Wei","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Michael","family":"Lew","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2019,11,22]]},"reference":[{"key":"183_CR1","unstructured":"Abu-El-Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675"},{"key":"183_CR2","unstructured":"Agostinelli F, Hoffman M, Sadowski P, Baldi P (2014) Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830"},{"key":"183_CR3","doi-asserted-by":"crossref","unstructured":"Alahi A, Ortiz R, Vandergheynst P (2012) Freak: fast retina keypoint. In: Proceedings of the CVPR. 
IEEE, pp 510\u2013517","DOI":"10.1109\/CVPR.2012.6247715"},{"key":"183_CR4","doi-asserted-by":"crossref","unstructured":"Alexandre LA (2016) 3D object recognition using convolutional neural networks with transfer learning between input channels. In: Intelligent autonomous systems, vol 13. Springer, pp 889\u2013898","DOI":"10.1007\/978-3-319-08338-4_64"},{"key":"183_CR5","doi-asserted-by":"crossref","unstructured":"Allaire S, Kim JJ, Breen SL, Jaffray DA, Pekar V (2008) Full orientation invariance and improved feature selectivity of 3D SIFT with application to medical image analysis. In: Proceedings of the CVPRW. IEEE, pp 1\u20138","DOI":"10.1109\/CVPRW.2008.4563023"},{"key":"183_CR6","unstructured":"Anne\u00a0Hendricks L, Wang O, Shechtman E, Sivic J, Darrell T, Russell B (2017) Localizing moments in video with natural language. In: ICCV. IEEE, pp 5803\u20135812"},{"key":"183_CR7","doi-asserted-by":"crossref","unstructured":"Aubry M, Schlickewei U, Cremers D (2011) The wave kernel signature: a quantum mechanical approach to shape analysis. In: ICCVW. IEEE, pp 1626\u20131633","DOI":"10.1109\/ICCVW.2011.6130444"},{"key":"183_CR8","unstructured":"Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450"},{"key":"183_CR9","doi-asserted-by":"crossref","unstructured":"Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: International workshop on human behavior understanding. Springer, pp 29\u201339","DOI":"10.1007\/978-3-642-25446-8_4"},{"key":"183_CR10","doi-asserted-by":"crossref","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","volume":"39","author":"V Badrinarayanan","year":"2017","unstructured":"Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder\u2013decoder architecture for image segmentation. 
Trans Pattern Anal Mach Intell 39:2481\u20132495","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR11","doi-asserted-by":"crossref","unstructured":"Barekatain M, Mart\u00ed M, Shih HF, Murray S, Nakayama K, Matsuo Y, Prendinger H (2017) Okutama-action: an aerial view video dataset for concurrent human action detection. In: Proceedings of the CVPRW. IEEE, pp 28\u201335","DOI":"10.1109\/CVPRW.2017.267"},{"key":"183_CR12","doi-asserted-by":"crossref","unstructured":"Bay H, Tuytelaars T, Van\u00a0Gool L (2006) Surf: speeded up robust features. In: Proceedings of the ECCV. Springer, pp 404\u2013417","DOI":"10.1007\/11744023_32"},{"key":"183_CR13","unstructured":"Beaudet PR (1978) Rotationally invariant image operators. In: Proceedings 4th international joint conference pattern recognition, Tokyo, Japan, 1978"},{"key":"183_CR14","doi-asserted-by":"crossref","unstructured":"Behley J, Steinhage V, Cremers AB (2013) Laser-based segment classification using a mixture of bag-of-words. In: IROS. IEEE, pp 4195\u20134200","DOI":"10.1109\/IROS.2013.6696957"},{"key":"183_CR15","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1109\/34.993558","volume":"24","author":"S Belongie","year":"2002","unstructured":"Belongie S, Malik J, Puzicha J (2002) Shape matching and object recognition using shape contexts. Trans Pattern Anal Mach Intell 24:509\u2013522","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR16","doi-asserted-by":"crossref","unstructured":"Bilen H, Fernando B, Gavves E, Vedaldi A, Gould S (2016) Dynamic image networks for action recognition. In: Proceedings of the CVPR. IEEE, pp 3034\u20133042","DOI":"10.1109\/CVPR.2016.331"},{"key":"183_CR17","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1023\/A:1007939232436","volume":"26","author":"MJ Black","year":"1998","unstructured":"Black MJ, Jepson AD (1998) Eigentracking: robust matching and tracking of articulated objects using a view-based representation. 
Int J Comput Vis 26:63\u201384","journal-title":"Int J Comput Vis"},{"key":"183_CR18","doi-asserted-by":"crossref","unstructured":"Bo L, Lai K, Ren X, Fox D (2011) Object recognition with hierarchical kernel descriptors. In: Proceedings of the CVPR. IEEE, pp 1729\u20131736","DOI":"10.1109\/CVPR.2011.5995719"},{"key":"183_CR19","unstructured":"Bo L, Ren X, Fox D (2010) Kernel descriptors for visual recognition. In: Advances in neural information processing systems, vol 23. Curran Associates, Inc., pp 244\u2013252"},{"key":"183_CR20","doi-asserted-by":"crossref","unstructured":"Bo L, Ren X, Fox D (2011) Depth kernel descriptors for object recognition. In: IROS. IEEE, pp 821\u2013826","DOI":"10.1109\/IROS.2011.6095119"},{"key":"183_CR21","doi-asserted-by":"crossref","first-page":"387","DOI":"10.1007\/978-3-319-00065-7_27","volume-title":"Experimental robotics","author":"L Bo","year":"2013","unstructured":"Bo L, Ren X, Fox D (2013) Unsupervised feature learning for RGB-D based object recognition. In: Desai J, Dudek G, Khatib O, Kumar V (eds) Experimental robotics. Springer, Heidelberg, pp 387\u2013402"},{"key":"183_CR22","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/34.910878","volume":"23","author":"AF Bobick","year":"2001","unstructured":"Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. Trans Pattern Anal Mach Intell 23:257\u2013267","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR23","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1007\/BF00332918","volume":"59","author":"H Bourlard","year":"1988","unstructured":"Bourlard H, Kamp Y (1988) Auto-association by multilayer perceptrons and singular value decomposition. Biol Cybern 59:291\u2013294","journal-title":"Biol Cybern"},{"key":"183_CR24","doi-asserted-by":"crossref","unstructured":"Bregonzio M, Gong S, Xiang T (2009) Recognising action as clouds of space-time interest points. In: Proceedings of the CVPR. 
IEEE, pp 1948\u20131955","DOI":"10.1109\/CVPR.2009.5206779"},{"key":"183_CR25","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1002\/cem.1122","volume":"22","author":"R Bro","year":"2008","unstructured":"Bro R, Acar E, Kolda TG (2008) Resolving the sign ambiguity in the singular value decomposition. J Chemometr 22:135\u2013140","journal-title":"J Chemometr"},{"key":"183_CR26","unstructured":"Brock A, Lim T, Ritchie J, Weston N (2016) Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236"},{"key":"183_CR27","first-page":"1","volume":"3D","author":"A Bronstein","year":"2010","unstructured":"Bronstein A, Bronstein M, Ovsjanikov M (2010) 3D features, surface descriptors, and object descriptors. Imaging Anal Appl 3D:1\u201327","journal-title":"Imaging Anal Appl"},{"key":"183_CR28","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1899404.1899405","volume":"30","author":"AM Bronstein","year":"2011","unstructured":"Bronstein AM, Bronstein MM, Guibas LJ, Ovsjanikov M (2011) Shape google: geometric words and expressions for invariant shape retrieval. Trans Graph 30:1","journal-title":"Trans Graph"},{"key":"183_CR29","unstructured":"Bronstein MM, Kokkinos I (2010) Scale-invariant heat kernel signatures for non-rigid shape recognition. In: Proceedings of the CVPR. IEEE, pp 1704\u20131711"},{"key":"183_CR30","unstructured":"Caba\u00a0Heilbron F, Escorcia V, Ghanem B, Carlos\u00a0Niebles J (2015) Activitynet: a large-scale video benchmark for human activity understanding. In: Proceedings of the CVPR. IEEE, pp 961\u2013970"},{"key":"183_CR31","doi-asserted-by":"crossref","unstructured":"Calli B, Singh A, Walsman A, Srinivasa S, Abbeel P, Dollar AM (2015) The ycb object and model set: towards common benchmarks for manipulation research. In: ICAR. 
IEEE, pp 510\u2013517","DOI":"10.1109\/ICAR.2015.7251504"},{"key":"183_CR32","doi-asserted-by":"crossref","unstructured":"Cao L, Liu Z, Huang TS (2010) Cross-dataset action detection. In: Proceedings of the CVPR. IEEE, pp 1998\u20132005","DOI":"10.1109\/CVPR.2010.5539875"},{"key":"183_CR33","doi-asserted-by":"crossref","unstructured":"Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the CVPR. IEEE, pp 4724\u20134733","DOI":"10.1109\/CVPR.2017.502"},{"key":"183_CR34","doi-asserted-by":"crossref","first-page":"396","DOI":"10.1016\/j.cviu.2011.09.010","volume":"116","author":"B Chakraborty","year":"2012","unstructured":"Chakraborty B, Holte MB, Moeslund TB, Gonz\u00e0lez J (2012) Selective spatio-temporal interest points. Comput Vis Image Underst 116:396\u2013410","journal-title":"Comput Vis Image Underst"},{"key":"183_CR35","unstructured":"Chang AX, Funkhouser T, Guibas L, Hanrahan P, Huang Q, Li Z, Savarese S, Savva M, Song S, Su H et al (2015) Shapenet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012"},{"key":"183_CR36","doi-asserted-by":"crossref","unstructured":"Chen DY, Tian XP, Shen YT, Ouhyoung M (2003) On visual similarity based 3D model retrieval. In: Computer graphics forum. Wiley Online Library, pp 223\u2013232","DOI":"10.1111\/1467-8659.00669"},{"key":"183_CR37","doi-asserted-by":"crossref","first-page":"1252","DOI":"10.1016\/j.patrec.2007.02.009","volume":"28","author":"H Chen","year":"2007","unstructured":"Chen H, Bhanu B (2007) 3D free-form object recognition in range images using local surface patches. Pattern Recogn Lett 28:1252\u20131262","journal-title":"Pattern Recogn Lett"},{"key":"183_CR38","doi-asserted-by":"crossref","unstructured":"Cheng G, Zhou P, Han J (2016) RIFD-CNN: rotation-invariant and fisher discriminative convolutional neural networks for object detection. In: Proceedings of the CVPR. 
IEEE, pp 2884\u20132893","DOI":"10.1109\/CVPR.2016.315"},{"key":"183_CR39","doi-asserted-by":"crossref","unstructured":"Cheung W, Hamarneh G (2007) N-SIFT: N-dimensional scale invariant feature transform for matching medical images. In: 2007 4th IEEE international symposium on biomedical imaging: from nano to macro. IEEE, pp 720\u2013723","DOI":"10.1109\/ISBI.2007.356953"},{"key":"183_CR40","doi-asserted-by":"crossref","unstructured":"Cho K, Van\u00a0Merri\u00ebnboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder\u2013decoder for statistical machine translation. arXiv preprint arXiv:1406.1078","DOI":"10.3115\/v1\/D14-1179"},{"key":"183_CR41","unstructured":"Choi S, Zhou QY, Miller S, Koltun V (2016) A large dataset of object scans. arXiv:1602.02481"},{"key":"183_CR42","unstructured":"Clevert DA, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289"},{"key":"183_CR43","unstructured":"Cocosco CA, Kollokian V, Kwan RKS, Pike GB, Evans AC (1997) Brainweb: online interface to a 3D MRI simulated brain database. In: NeuroImage. Citeseer"},{"key":"183_CR44","unstructured":"Cooijmans T, Ballas N, Laurent C, G\u00fcl\u00e7ehre \u00c7, Courville A (2016) Recurrent batch normalization. arXiv preprint arXiv:1603.09025"},{"key":"183_CR45","unstructured":"Couprie C (2012) Multi-label energy minimization for object class segmentation. In: EUSIPCO. IEEE, pp 2233\u20132237"},{"key":"183_CR46","unstructured":"Couprie C, Farabet C, Najman L, LeCun Y (2013) Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572"},{"key":"183_CR47","doi-asserted-by":"crossref","unstructured":"Dai A, Chang AX, Savva M, Halber M, Funkhouser T, Nie\u00dfner M (2017) Scannet: Richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the CVPR. 
IEEE, pp 5828\u20135839","DOI":"10.1109\/CVPR.2017.261"},{"key":"183_CR48","doi-asserted-by":"crossref","unstructured":"Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of the CVPR. IEEE, pp 886\u2013893","DOI":"10.1109\/CVPR.2005.177"},{"key":"183_CR49","doi-asserted-by":"crossref","first-page":"2758","DOI":"10.1109\/TIP.2012.2183142","volume":"21","author":"T Darom","year":"2012","unstructured":"Darom T, Keller Y (2012) Scale-invariant features for 3-D mesh models. IEEE Trans Image Process 21:2758\u20132769","journal-title":"IEEE Trans Image Process"},{"key":"183_CR50","unstructured":"Deng L, Yang M, Li T, He Y, Wang C (2019) RFBNet: deep multimodal networks with residual fusion blocks for RGB-D semantic segmentation. arXiv preprint arXiv:1907.00135"},{"key":"183_CR51","doi-asserted-by":"crossref","unstructured":"Deng Z, Todorovic S, Jan\u00a0Latecki L (2015) Semantic segmentation of RGBD images with mutex constraints. In: ICCV. IEEE, pp 1733\u20131741","DOI":"10.1109\/ICCV.2015.202"},{"key":"183_CR52","doi-asserted-by":"crossref","unstructured":"Doll\u00e1r P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: Workshop on visual surveillance and performance evaluation of tracking and surveillance. IEEE, pp 65\u201372","DOI":"10.1109\/VSPETS.2005.1570899"},{"key":"183_CR53","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1016\/j.neuroimage.2017.04.039","volume":"170","author":"J Dolz","year":"2017","unstructured":"Dolz J, Desrosiers C, Ayed IB (2017) 3D fully convolutional networks for subcortical segmentation in MRI: a large-scale study. NeuroImage 170:456\u2013470","journal-title":"NeuroImage"},{"key":"183_CR54","doi-asserted-by":"crossref","unstructured":"Donahue J, Anne\u00a0Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. 
In: Proceedings of the CVPR. IEEE, pp 2625\u20132634","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"183_CR55","unstructured":"Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the CVPR. IEEE, pp 1110\u20131118"},{"key":"183_CR56","doi-asserted-by":"crossref","unstructured":"Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV. IEEE, pp 2650\u20132658","DOI":"10.1109\/ICCV.2015.304"},{"key":"183_CR57","doi-asserted-by":"crossref","unstructured":"Eitel A, Springenberg JT, Spinello L, Riedmiller M, Burgard W (2015) Multimodal deep learning for robust RGB-D object recognition. In: IROS. IEEE, pp 681\u2013687","DOI":"10.1109\/IROS.2015.7353446"},{"key":"183_CR58","first-page":"412","volume":"14","author":"H ElNaghy","year":"2013","unstructured":"ElNaghy H, Hamad S, Khalifa ME (2013) Taxonomy for 3D content-based object retrieval methods. Int J Res Rev Appl Sci 14:412\u2013446","journal-title":"Int J Res Rev Appl Sci"},{"key":"183_CR59","doi-asserted-by":"crossref","unstructured":"Endres F, Hess J, Engelhard N, Sturm J, Cremers D, Burgard W (2012) An evaluation of the RGB-D slam system. In: ICRA. IEEE, pp 1691\u20131696","DOI":"10.1109\/ICRA.2012.6225199"},{"key":"183_CR60","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1109\/TRO.2013.2279412","volume":"30","author":"F Endres","year":"2014","unstructured":"Endres F, Hess J, Sturm J, Cremers D, Burgard W (2014) 3-d mapping with an RGB-D camera. Trans Robot 30:177\u2013187","journal-title":"Trans Robot"},{"key":"183_CR61","doi-asserted-by":"crossref","unstructured":"Engelcke M, Rao D, Wang DZ, Tong CH, Posner I (2017) Vote3deep: fast object detection in 3D point clouds using efficient convolutional neural networks. In: ICRA. 
IEEE, pp 1355\u20131361","DOI":"10.1109\/ICRA.2017.7989161"},{"key":"183_CR62","doi-asserted-by":"crossref","unstructured":"Fan Y, Qian Y, Xie FL, Soong FK (2014) TTS synthesis with bidirectional LSTM based recurrent neural networks. In: Fifteenth annual conference of the international speech communication association","DOI":"10.21437\/Interspeech.2014-443"},{"key":"183_CR63","unstructured":"Farabet C, Couprie C, Najman L, LeCun Y (2012) Scene parsing with multiscale feature learning, purity trees, and optimal covers. In: Proceedings of the ICML. Omnipress, pp 1857\u20131864"},{"key":"183_CR64","doi-asserted-by":"crossref","first-page":"1915","DOI":"10.1109\/TPAMI.2012.231","volume":"35","author":"C Farabet","year":"2013","unstructured":"Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. Trans Pattern Anal Mach Intell 35:1915\u20131929","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR65","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the CVPR. IEEE, pp 1933\u20131941","DOI":"10.1109\/CVPR.2016.213"},{"key":"183_CR66","doi-asserted-by":"crossref","unstructured":"Fernando B, Gavves S, Mogrovejo O, Antonio J, Ghodrati A, Tuytelaars T (2015) Modeling video evolution for action recognition. In: Proceedings of the CVPR. IEEE, pp 5378\u20135387","DOI":"10.1109\/CVPR.2015.7299176"},{"key":"183_CR67","doi-asserted-by":"crossref","unstructured":"Firman M (2016) RGBD datasets: past, present and future. In: Proceedings of the CVPRW. IEEE, pp 19\u201331","DOI":"10.1109\/CVPRW.2016.88"},{"key":"183_CR68","doi-asserted-by":"crossref","unstructured":"Flint A, Dick A, Van Den\u00a0Hengel A (2007) Thrift: local 3D structure recognition. In: DICTA. 
IEEE, pp 182\u2013188","DOI":"10.1109\/DICTA.2007.4426794"},{"key":"183_CR69","doi-asserted-by":"crossref","unstructured":"Frome A, Huber D, Kolluri R, B\u00fclow T, Malik J (2004) Recognizing objects in range data using regional point descriptors. In: Proceedings of the ECCV. Springer, pp 224\u2013237","DOI":"10.1007\/978-3-540-24672-5_18"},{"key":"183_CR70","doi-asserted-by":"crossref","unstructured":"Gao J, Sun C, Yang Z, Nevatia R (2017) Tall: temporal activity localization via language query. In: ICCV. IEEE, pp 5267\u20135275","DOI":"10.1109\/ICCV.2017.563"},{"key":"183_CR71","doi-asserted-by":"crossref","first-page":"1142","DOI":"10.1016\/j.patcog.2009.07.012","volume":"43","author":"Y Gao","year":"2010","unstructured":"Gao Y, Dai Q, Zhang NY (2010) 3D model comparison using spatial structure circular descriptor. Pattern Recognit 43:1142\u20131151","journal-title":"Pattern Recognit"},{"key":"183_CR72","doi-asserted-by":"crossref","unstructured":"Garcia N (2018) Temporal aggregation of visual features for large-scale image-to-video retrieval. In: Proceedings of the 2018 ACM on international conference on multimedia retrieval. ACM, pp 489\u2013492","DOI":"10.1145\/3206025.3206083"},{"key":"183_CR73","doi-asserted-by":"crossref","unstructured":"Garcia N, Vogiatzis G (2017) Dress like a star: Retrieving fashion products from videos. In: ICCVW. IEEE, pp 2293\u20132299","DOI":"10.1109\/ICCVW.2017.270"},{"key":"183_CR74","unstructured":"Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Garcia-Rodriguez J (2017) A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857"},{"key":"183_CR75","doi-asserted-by":"crossref","unstructured":"Geiger A (2012) Are we ready for autonomous driving? The kitti vision benchmark suite. In: Proceedings of the CVPR. 
IEEE, pp 3354\u20133361","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"183_CR76","doi-asserted-by":"crossref","unstructured":"Georgiou T, Schmitt S, Olhofer M, Liu Y, B\u00e4ck T, Lew M (2018) Learning fluid flows. In: IJCNN. IEEE, pp 1\u20138","DOI":"10.1109\/IJCNN.2018.8489664"},{"key":"183_CR77","first-page":"115","volume":"3","author":"FA Gers","year":"2002","unstructured":"Gers FA, Schraudolph NN, Schmidhuber J (2002) Learning precise timing with LSTM recurrent networks. J Mach Learn Res 3:115\u2013143","journal-title":"J Mach Learn Res"},{"key":"183_CR78","unstructured":"Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: AISTATS, pp 315\u2013323. PMLR"},{"key":"183_CR79","volume-title":"Deep learning","author":"I Goodfellow","year":"2016","unstructured":"Goodfellow I, Bengio Y, Courville A (2016) Deep learning, vol 1. MIT Press, Cambridge"},{"key":"183_CR80","unstructured":"Goodfellow IJ, Warde-Farley D, Mirza M, Courville A, Bengio Y (2013) Maxout networks. In: Proceedings of the ICML. Omnipress, pp III\u20131319\u2013III\u20131327"},{"key":"183_CR81","doi-asserted-by":"crossref","unstructured":"Goyal R, Kahou SE, Michalski V, Materzynska J, Westphal S, Kim H, Haenel V, Fruend I, Yianilos P, Mueller-Freitag M, et\u00a0al. (2017) The \u201csomething something\u201d video database for learning and evaluating visual common sense. In: ICCV. IEEE, p\u00a03","DOI":"10.1109\/ICCV.2017.622"},{"key":"183_CR82","doi-asserted-by":"crossref","first-page":"2222","DOI":"10.1109\/TNNLS.2016.2582924","volume":"28","author":"K Greff","year":"2017","unstructured":"Greff K, Srivastava RK, Koutn\u00edk J, Steunebrink BR, Schmidhuber J (2017) LSTM: a search space odyssey. 
Trans Neural Netw Learn Syst 28:2222\u20132232","journal-title":"Trans Neural Netw Learn Syst"},{"key":"183_CR83","doi-asserted-by":"crossref","first-page":"763","DOI":"10.1007\/s00138-019-01027-7","volume":"30","author":"W Guo","year":"2019","unstructured":"Guo W, Hu W, Liu C, Lu T (2019) 3D object recognition from cluttered and occluded scenes with a compact local feature. Mach Vis Appl 30:763\u2013783","journal-title":"Mach Vis Appl"},{"key":"183_CR84","doi-asserted-by":"crossref","unstructured":"Guo Y, Bennamoun M, Sohel F, Lu M, Wan J (2014) 3D object recognition in cluttered scenes with local surface features: a survey. Trans Pattern Anal Mach Intell pp 2270\u20132287","DOI":"10.1109\/TPAMI.2014.2316828"},{"key":"183_CR85","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1007\/s13735-017-0141-z","volume":"7","author":"Y Guo","year":"2018","unstructured":"Guo Y, Liu Y, Georgiou T, Lew MS (2018) A review of semantic segmentation using deep neural networks. Int J Multi Inf Retrieval 7:87\u201393","journal-title":"Int J Multi Inf Retrieval"},{"key":"183_CR86","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1016\/j.neucom.2015.09.116","volume":"187","author":"Y Guo","year":"2016","unstructured":"Guo Y, Liu Y, Oerlemans A, Lao S, Wu S, Lew MS (2016) Deep learning for visual understanding: a review. Neurocomputing 187:27\u201348","journal-title":"Neurocomputing"},{"key":"183_CR87","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1007\/s11263-013-0627-y","volume":"105","author":"Y Guo","year":"2013","unstructured":"Guo Y, Sohel F, Bennamoun M, Lu M, Wan J (2013) Rotational projection statistics for 3D local surface description and object recognition. 
Int J Comput Vis 105:63\u201386","journal-title":"Int J Comput Vis"},{"key":"183_CR88","doi-asserted-by":"crossref","first-page":"196","DOI":"10.1016\/j.ins.2014.09.015","volume":"293","author":"Y Guo","year":"2015","unstructured":"Guo Y, Sohel F, Bennamoun M, Wan J, Lu M (2015) A novel local surface feature for 3D object recognition under clutter and occlusion. Inf Sci 293:196\u2013213","journal-title":"Inf Sci"},{"key":"183_CR89","unstructured":"Guo Y, Sohel FA, Bennamoun M, Lu M, Wan J (2013) TriSI: a distinctive local surface descriptor for 3D modeling and object recognition. In: GRAPP\/IVAPP, pp 86\u201393"},{"key":"183_CR90","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1007\/s11263-014-0777-6","volume":"112","author":"S Gupta","year":"2015","unstructured":"Gupta S, Arbel\u00e1ez P, Girshick R, Malik J (2015) Indoor scene understanding with RGB-D images: bottom-up segmentation, object detection and semantic segmentation. Int J Comput Vis 112:133\u2013149","journal-title":"Int J Comput Vis"},{"key":"183_CR91","doi-asserted-by":"crossref","unstructured":"Gupta S, Arbelaez P, Malik J (2013) Perceptual organization and recognition of indoor scenes from RGB-D images. In: Proceedings of the CVPR. IEEE, pp 564\u2013571","DOI":"10.1109\/CVPR.2013.79"},{"key":"183_CR92","doi-asserted-by":"crossref","unstructured":"Gupta S, Girshick R, Arbel\u00e1ez P, Malik J (2014) Learning rich features from RGB-D images for object detection and segmentation. In: Proceedings of the ECCV. Springer, pp 345\u2013360","DOI":"10.1007\/978-3-319-10584-0_23"},{"key":"183_CR93","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1007\/s11263-016-0917-2","volume":"121","author":"S Hadfield","year":"2017","unstructured":"Hadfield S, Lebeda K, Bowden R (2017) Hollywood 3D: what are the best 3D features for action recognition? 
Int J Comput Vis 121:95\u2013110","journal-title":"Int J Comput Vis"},{"key":"183_CR94","unstructured":"Handa A, Patraucean V, Badrinarayanan V, Stent S, Cipolla R (2016) Understanding real world indoor scenes with synthetic data. In: Proceedings of the CVPR. IEEE, pp 4077\u20134085"},{"key":"183_CR95","doi-asserted-by":"crossref","unstructured":"Harris C, Stephens M (1988) A combined corner and edge detector. In: Alvey vision conference. Citeseer, pp 10\u20135244","DOI":"10.5244\/C.2.23"},{"key":"183_CR96","doi-asserted-by":"crossref","unstructured":"Hassner T (2013) A critical review of action recognition benchmarks. In: Proceedings of the CVPRW. IEEE, pp 245\u2013250","DOI":"10.1109\/CVPRW.2013.43"},{"key":"183_CR97","unstructured":"Hazirbas C, Ma L, Domokos C, Cremers D (2016) Fusenet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In: ACCV. Springer, pp 213\u2013228"},{"key":"183_CR98","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: ICCV. IEEE, pp 1026\u20131034","DOI":"10.1109\/ICCV.2015.123"},{"key":"183_CR99","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the CVPR. IEEE, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"183_CR100","unstructured":"Hegde V, Zadeh R (2016) Fusionnet: 3d object classification using multiple data representations. arXiv preprint arXiv:1607.05695"},{"key":"183_CR101","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/j.imavis.2017.01.010","volume":"60","author":"S Herath","year":"2017","unstructured":"Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. 
Image Vis Comput 60:4\u201321","journal-title":"Image Vis Comput"},{"key":"183_CR102","doi-asserted-by":"crossref","unstructured":"Hermans A, Floros G, Leibe B (2014) Dense 3D semantic mapping of indoor scenes from RGB-D images. In: ICRA. IEEE, pp 2631\u20132638","DOI":"10.1109\/ICRA.2014.6907236"},{"key":"183_CR103","doi-asserted-by":"crossref","unstructured":"Hinterstoisser S, Holzer S, Cagniart C, Ilic S, Konolige K, Navab N, Lepetit V (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: ICCV. IEEE, pp 858\u2013865","DOI":"10.1109\/ICCV.2011.6126326"},{"key":"183_CR104","doi-asserted-by":"crossref","unstructured":"Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K, Navab N (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: ACCV. Springer, pp 548\u2013562","DOI":"10.1007\/978-3-642-33885-4_60"},{"key":"183_CR105","doi-asserted-by":"crossref","unstructured":"Hinterstoisser S, Lepetit V, Rajkumar N, Konolige K (2016) Going further with point pair features. In: Proceedings of the ECCV. Springer, pp 834\u2013848","DOI":"10.1007\/978-3-319-46487-9_51"},{"key":"183_CR106","doi-asserted-by":"crossref","first-page":"1527","DOI":"10.1162\/neco.2006.18.7.1527","volume":"18","author":"GE Hinton","year":"2006","unstructured":"Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18:1527\u20131554","journal-title":"Neural Comput"},{"key":"183_CR107","doi-asserted-by":"crossref","first-page":"504","DOI":"10.1126\/science.1127647","volume":"313","author":"GE Hinton","year":"2006","unstructured":"Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504\u2013507","journal-title":"Science"},{"key":"183_CR108","unstructured":"Hinton GE, Sejnowski TJ (1986) Learning and relearning in Boltzmann machines. 
In: Parallel distributed processing: explorations in the microstructure of cognition, vol 1, p 2"},{"key":"183_CR109","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735\u20131780","journal-title":"Neural Comput"},{"key":"183_CR110","doi-asserted-by":"crossref","unstructured":"H\u00f6ft N, Schulz H, Behnke S (2014) Fast semantic segmentation of RGB-D scenes with GPU-accelerated deep neural networks. In: Joint German\/Austrian conference on artificial intelligence. Springer, pp 80\u201385","DOI":"10.1007\/978-3-319-11206-0_9"},{"key":"183_CR111","doi-asserted-by":"crossref","unstructured":"Holmes DR, Workman EL, Robb RA (2005) The NLM-Mayo image collection: common access to uncommon data. In: MICCAI workshop","DOI":"10.54294\/2wypjk"},{"key":"183_CR112","unstructured":"Horn BKP (1984) Extended Gaussian images. In: Proceedings, pp 1671\u20131686"},{"key":"183_CR113","doi-asserted-by":"crossref","unstructured":"Hua BS, Pham QH, Nguyen DT, Tran MK, Yu LF, Yeung SK (2016) Scenenn: a scene meshes dataset with annotations. In: 3DV","DOI":"10.1109\/3DV.2016.18"},{"key":"183_CR114","doi-asserted-by":"crossref","unstructured":"Huang G, Liu Z, Weinberger KQ, van\u00a0der Maaten L (2017) Densely connected convolutional networks. In: Proceedings of the CVPR. IEEE, pp 2261\u20132269","DOI":"10.1109\/CVPR.2017.243"},{"key":"183_CR115","doi-asserted-by":"crossref","unstructured":"Huang L, Yang D, Lang B, Deng J (2018) Decorrelated batch normalization. In: Proceedings of the CVPR. 
IEEE, pp 791\u2013800","DOI":"10.1109\/CVPR.2018.00089"},{"key":"183_CR116","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.cviu.2016.10.018","volume":"155","author":"H Idrees","year":"2017","unstructured":"Idrees H, Zamir AR, Jiang YG, Gorban A, Laptev I, Sukthankar R, Shah M (2017) The thumos challenge on action recognition for videos \u201cin the wild\u201d. Comput Vis Image Underst 155:1\u201323","journal-title":"Comput Vis Image Underst"},{"key":"183_CR117","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1145\/3042064","volume":"50","author":"A Ioannidou","year":"2017","unstructured":"Ioannidou A, Chatzilari E, Nikolopoulos S, Kompatsiaris I (2017) Deep learning advances in computer vision with 3D data: a survey. ACM Comput Surv 50:20","journal-title":"ACM Comput Surv"},{"key":"183_CR118","unstructured":"Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the ICML, pp 448\u2013456. Omnipress"},{"key":"183_CR119","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1007\/978-1-4471-4640-7_8","volume-title":"Consumer depth cameras for computer vision","author":"A Janoch","year":"2013","unstructured":"Janoch A, Karayev S, Jia Y, Barron JT, Fritz M, Saenko K, Darrell T (2013) A category-level 3D object dataset: putting the kinect to work. In: Fossati A, Gall J, Grabner H, Ren X, Konolige K (eds) Consumer depth cameras for computer vision. Springer, Berlin, pp 141\u2013165"},{"key":"183_CR120","doi-asserted-by":"crossref","unstructured":"Jarrett K, Kavukcuoglu K, LeCun Y, et\u00a0al. (2009) What is the best multi-stage architecture for object recognition? In: ICCV. 
IEEE, pp 2146\u20132153","DOI":"10.1109\/ICCV.2009.5459469"},{"key":"183_CR121","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","volume":"35","author":"S Ji","year":"2013","unstructured":"Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. Trans Pattern Anal Mach Intell 35:221\u2013231","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR122","unstructured":"Jiang Y, Moseson S, Saxena A (2011) Efficient grasping from RGBD images: learning using a new rectangle representation. In: ICRA. IEEE, pp 3304\u20133311"},{"key":"183_CR123","doi-asserted-by":"crossref","first-page":"352","DOI":"10.1109\/TPAMI.2017.2670560","volume":"40","author":"YG Jiang","year":"2018","unstructured":"Jiang YG, Wu Z, Wang J, Xue X, Chang SF (2018) Exploiting feature and class relationships in video categorization with regularized deep neural networks. Trans Pattern Anal Mach Intell 40:352\u2013364","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR124","doi-asserted-by":"crossref","unstructured":"Jin X, Xu C, Feng J, Wei Y, Xiong J, Yan S (2016) Deep learning with s-shaped rectified linear activation units. In: AAAI conference on artificial intelligence, pp 1737\u20131743","DOI":"10.1609\/aaai.v30i1.10287"},{"key":"183_CR125","doi-asserted-by":"crossref","first-page":"635","DOI":"10.1016\/S0262-8856(98)00074-2","volume":"16","author":"AE Johnson","year":"1998","unstructured":"Johnson AE, Hebert M (1998) Surface matching for object recognition in complex three-dimensional scenes. Image Vis Comput 16:635\u2013651","journal-title":"Image Vis Comput"},{"key":"183_CR126","doi-asserted-by":"crossref","first-page":"433","DOI":"10.1109\/34.765655","volume":"21","author":"AE Johnson","year":"1999","unstructured":"Johnson AE, Hebert M (1999) Using spin images for efficient object recognition in cluttered 3D scenes. 
Trans Pattern Anal Mach Intell 21:433\u2013449","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR127","doi-asserted-by":"crossref","unstructured":"Kadir T, Brady M (2003) Scale saliency: a novel approach to salient feature and scale selection. In: VIE, pp 25\u201328. IET","DOI":"10.1049\/cp:20030478"},{"key":"183_CR128","unstructured":"Kang SM, Wildes RP (2016) Review of action recognition and detection methods. arXiv preprint arXiv:1610.06906"},{"key":"183_CR129","doi-asserted-by":"crossref","unstructured":"Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the CVPR. IEEE, pp 1725\u20131732","DOI":"10.1109\/CVPR.2014.223"},{"key":"183_CR130","unstructured":"Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P et\u00a0al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950"},{"key":"183_CR131","unstructured":"Ke Y, Sukthankar R, Hebert M (2005) Efficient visual event detection using volumetric features. In: ICCV. IEEE, pp 166\u2013173"},{"key":"183_CR132","doi-asserted-by":"crossref","unstructured":"Kerl C, Sturm J, Cremers D (2013) Dense visual slam for RGB-D cameras. In: IROS. IEEE, pp 2100\u20132106","DOI":"10.1109\/IROS.2013.6696650"},{"key":"183_CR133","doi-asserted-by":"crossref","unstructured":"Khan SH, Bennamoun M, Sohel F, Togneri R (2014) Geometry driven semantic labeling of indoor scenes. In: Proceedings of the ECCV. Springer, pp 679\u2013694","DOI":"10.1007\/978-3-319-10590-1_44"},{"key":"183_CR134","unstructured":"Klambauer G, Unterthiner T, Mayr A, Hochreiter S (2017) Self-normalizing neural networks. In: Advances in neural information processing systems, vol 30. 
Curran Associates, Inc., pp 971\u2013980"},{"key":"183_CR135","doi-asserted-by":"crossref","unstructured":"Klaser A, Marsza\u0142ek M, Schmid C (2008) A spatio-temporal descriptor based on 3D-gradients. In: BMVC, pp 275\u20131. BMVA Press","DOI":"10.5244\/C.22.99"},{"key":"183_CR136","doi-asserted-by":"crossref","unstructured":"Knopp J, Prasad M, Willems G, Timofte R, Van\u00a0Gool L (2010) Hough transform and 3D surf for robust three dimensional classification. In: Proceedings of the ECCV. Springer, pp 589\u2013602","DOI":"10.1007\/978-3-642-15567-3_43"},{"key":"183_CR137","doi-asserted-by":"crossref","first-page":"367","DOI":"10.1007\/BF00318371","volume":"55","author":"JJ Koenderink","year":"1987","unstructured":"Koenderink JJ, van Doorn AJ (1987) Representation of local geometry in the visual system. Biol Cybern 55:367\u2013375","journal-title":"Biol Cybern"},{"key":"183_CR138","unstructured":"Koppula HS, Anand A, Joachims T, Saxena A (2011) Semantic labeling of 3D point clouds for indoor scenes. In: Advances in neural information processing systems, vol 24. Curran Associates, Inc., pp 244\u2013252"},{"key":"183_CR139","doi-asserted-by":"crossref","unstructured":"Kovashka A, Grauman K (2010) Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: Proceedings of the CVPR. IEEE, pp 2046\u20132053","DOI":"10.1109\/CVPR.2010.5539881"},{"key":"183_CR140","unstructured":"Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, vol 25. Curran Associates, Inc., pp 1097\u20131105"},{"key":"183_CR141","doi-asserted-by":"crossref","unstructured":"Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: ICCV. 
IEEE, pp 2556\u20132563","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"183_CR142","doi-asserted-by":"crossref","unstructured":"Lai K, Bo L, Ren X, Fox D (2011) A large-scale hierarchical multi-view RGB-D object dataset. In: ICRA. IEEE, pp 1817\u20131824","DOI":"10.1109\/ICRA.2011.5980382"},{"key":"183_CR143","doi-asserted-by":"crossref","unstructured":"Lai K, Bo L, Ren X, Fox D (2013) RGB-D object recognition: features, algorithms, and a large scale benchmark. In: Consumer depth cameras for computer vision. Springer, pp 167\u2013192","DOI":"10.1007\/978-1-4471-4640-7_9"},{"key":"183_CR144","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1007\/s11263-005-1838-7","volume":"64","author":"I Laptev","year":"2005","unstructured":"Laptev I (2005) On space-time interest points. Int J Comput Vis 64:107\u2013123","journal-title":"Int J Comput Vis"},{"key":"183_CR145","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1016\/j.cviu.2006.11.023","volume":"108","author":"I Laptev","year":"2007","unstructured":"Laptev I, Caputo B, Sch\u00fcldt C, Lindeberg T (2007) Local velocity-adapted motion events for spatio-temporal recognition. Comput Vis Image Underst 108:207\u2013229","journal-title":"Comput Vis Image Underst"},{"key":"183_CR146","doi-asserted-by":"crossref","unstructured":"Laptev I, Lindeberg T (2004) Velocity adaptation of space-time interest points. In: ICPR. IEEE, pp 52\u201356","DOI":"10.1109\/ICPR.2004.1334003"},{"key":"183_CR147","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1007\/11676959_8","volume-title":"Spatial coherence for visual motion analysis","author":"I Laptev","year":"2006","unstructured":"Laptev I, Lindeberg T (2006) Local descriptors for spatio-temporal recognition. In: MacLean WJ (ed) Spatial coherence for visual motion analysis. Springer, Berlin, pp 91\u2013103"},{"key":"183_CR148","doi-asserted-by":"crossref","unstructured":"Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. 
In: Proceedings of the CVPR. IEEE, pp 1\u20138","DOI":"10.1109\/CVPR.2008.4587756"},{"key":"183_CR149","doi-asserted-by":"crossref","first-page":"6993","DOI":"10.1007\/s11042-016-3330-5","volume":"76","author":"G Lara L\u00f3pez","year":"2017","unstructured":"Lara L\u00f3pez G, Pena P\u00e9rez Negr\u00f3n A, De Antonio Jim\u00e9nez A, Ram\u00edrez Rodr\u00edguez J, Imbert Paredes R (2017) Comparative analysis of shape descriptors for 3D objects. Multimed Tools Appl 76:6993\u20137040","journal-title":"Multimed Tools Appl"},{"key":"183_CR150","doi-asserted-by":"crossref","unstructured":"Laurent C, Pereyra G, Brakel P, Zhang Y, Bengio Y (2016) Batch normalized recurrent neural networks. In: ICASSP. IEEE, pp 2657\u20132661","DOI":"10.1109\/ICASSP.2016.7472159"},{"key":"183_CR151","unstructured":"LeCun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard WE, Jackel LD (1990) Handwritten digit recognition with a back-propagation network. In: Advances in neural information processing systems, pp 396\u2013404"},{"issue":"11","key":"183_CR152","first-page":"2278","volume":"86","author":"Y LeCun","year":"1998","unstructured":"LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proceedings 86(11):2278\u20132324","journal-title":"Proceedings"},{"key":"183_CR153","unstructured":"Lee CY, Xie S, Gallagher P, Zhang Z, Tu Z (2015) Deeply-supervised nets. In: AISTATS. PMLR, pp 562\u2013570"},{"key":"183_CR154","unstructured":"Li B, Lu Y, Li C, Godil A, Schreck T, Aono M, Burtscher M, Fu H, Furuya T, Johan H, et\u00a0al. (2014) Shrec\u201914 track: extended large scale sketch-based 3D shape retrieval. In: Eurographics workshop on 3DOR, pp 121\u2013130"},{"key":"183_CR155","doi-asserted-by":"crossref","unstructured":"Li B, Zhang T, Xia T (2016) Vehicle detection from 3D lidar using fully convolutional network. 
arXiv preprint arXiv:1608.07916","DOI":"10.15607\/RSS.2016.XII.042"},{"key":"183_CR156","doi-asserted-by":"crossref","unstructured":"Li W, Zhang Z, Liu Z (2010) Action recognition based on a bag of 3D points. In: Proceedings of the CVPRW. IEEE, pp 9\u201314","DOI":"10.1109\/CVPRW.2010.5543273"},{"key":"183_CR157","doi-asserted-by":"crossref","first-page":"10323","DOI":"10.1109\/ACCESS.2017.2712789","volume":"5","author":"Y Li","year":"2017","unstructured":"Li Y, Xia R, Huang Q, Xie W, Li X (2017) Survey of spatio-temporal interest point detection algorithms in video. IEEE Access 5:10323\u201310331","journal-title":"IEEE Access"},{"key":"183_CR158","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1007\/s11760-017-1172-x","volume":"12","author":"Y Li","year":"2018","unstructured":"Li Y, Xia R, Xie W (2018) A unified model of appearance and motion of video and its application in stip detection. Signal Image Video Process 12:403\u2013410","journal-title":"Signal Image Video Process"},{"key":"183_CR159","doi-asserted-by":"crossref","unstructured":"Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L (2016) Lstm-cf: Unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In: Proceedings of the ECCV. Springer, pp 541\u2013557","DOI":"10.1007\/978-3-319-46475-6_34"},{"key":"183_CR160","unstructured":"Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L (2016) RGB-D scene labeling with long short-term memorized fusion model. arXiv preprint arXiv:1604.05000"},{"key":"183_CR161","doi-asserted-by":"crossref","unstructured":"Lin G, Milan A, Shen C, Reid I (2017) Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the CVPR. IEEE","DOI":"10.1109\/CVPR.2017.549"},{"key":"183_CR162","unstructured":"Lin M, Chen Q, Yan S (2013) Network in network. 
arXiv preprint arXiv:1312.4400"},{"key":"183_CR163","doi-asserted-by":"crossref","unstructured":"Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Doll\u00e1r P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: Proceedings of the ECCV. Springer, pp 740\u2013755","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"183_CR164","doi-asserted-by":"crossref","unstructured":"Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Proceedings of the ECCV. Springer, pp 816\u2013833","DOI":"10.1007\/978-3-319-46487-9_50"},{"key":"183_CR165","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s11042-016-4205-5","volume":"77","author":"Y Liu","year":"2018","unstructured":"Liu Y, Guo Y, Georgiou T, Lew MS (2018) Fusion that matters: convolutional fusion networks for visual recognition. Multimed Tools Appl 77:1\u201328","journal-title":"Multimed Tools Appl"},{"key":"183_CR166","doi-asserted-by":"crossref","first-page":"1235","DOI":"10.1016\/j.cviu.2009.06.005","volume":"113","author":"TWR Lo","year":"2009","unstructured":"Lo TWR, Siebert JP (2009) Local feature extraction and matching on range images: 2.5 d SIFT. Comput Vis Image Underst 113:1235\u20131250","journal-title":"Comput Vis Image Underst"},{"key":"183_CR167","doi-asserted-by":"crossref","unstructured":"Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the CVPR. IEEE, pp 3431\u20133440","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"183_CR168","doi-asserted-by":"crossref","unstructured":"Lowe DG (1999) Object recognition from local scale-invariant features. In: ICCV. IEEE, pp 1150\u20131157","DOI":"10.1109\/ICCV.1999.790410"},{"key":"183_CR169","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1023\/B:VISI.0000029664.99615.94","volume":"60","author":"DG Lowe","year":"2004","unstructured":"Lowe DG (2004) Distinctive image features from scale-invariant keypoints. 
Int J Comput Vis 60:91\u2013110","journal-title":"Int J Comput Vis"},{"key":"183_CR170","unstructured":"Lucas BD, Kanade T et\u00a0al (1981) An iterative image registration technique with an application to stereo vision. In: IJCAI. Vancouver, BC, Canada"},{"key":"183_CR171","unstructured":"Luong MT, Sutskever I, Le QV, Vinyals O, Zaremba W (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206"},{"key":"183_CR172","unstructured":"Maas AL, Hannun AY, Ng AY (2013) Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the ICML. Omnipress, p\u00a03"},{"key":"183_CR173","doi-asserted-by":"crossref","unstructured":"Maes C, Fabry T, Keustermans J, Smeets D, Suetens P, Vandermeulen D (2010) Feature detection on 3D face surfaces for pose normalisation and recognition. In: BTAS. IEEE, pp 1\u20136","DOI":"10.1109\/BTAS.2010.5634543"},{"key":"183_CR174","doi-asserted-by":"crossref","unstructured":"Marcos D, Volpi M, Tuia D (2016) Learning rotation invariant convolutional filters for texture classification. In: ICPR. IEEE, pp 2012\u20132017","DOI":"10.1109\/ICPR.2016.7899932"},{"key":"183_CR175","doi-asserted-by":"crossref","unstructured":"Marszalek M, Laptev I, Schmid C (2009) Actions in context. In: Proceedings of the CVPR. IEEE, pp 2929\u20132936","DOI":"10.1109\/CVPR.2009.5206557"},{"key":"183_CR176","doi-asserted-by":"crossref","unstructured":"Masci J, Meier U, Cire\u015fan D, Schmidhuber J (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. In: ICANN. Springer, pp 52\u201359","DOI":"10.1007\/978-3-642-21735-7_7"},{"key":"183_CR177","doi-asserted-by":"crossref","unstructured":"Matikainen P, Hebert M, Sukthankar R (2009) Trajectons: action recognition through the motion analysis of tracked features. In: ICCVW. 
IEEE, pp 514\u2013521","DOI":"10.1109\/ICCVW.2009.5457659"},{"key":"183_CR178","doi-asserted-by":"crossref","unstructured":"Matsuda T, Furuya T, Ohbuchi R (2015) Lightweight binary voxel shape features for 3D data matching and retrieval. In: International conference on multimedia big data. IEEE, pp 100\u2013107","DOI":"10.1109\/BigMM.2015.66"},{"key":"183_CR179","doi-asserted-by":"crossref","unstructured":"Maturana D, Scherer S (2015) Voxnet: A 3D convolutional neural network for real-time object recognition. In: IROS. IEEE, pp 922\u2013928","DOI":"10.1109\/IROS.2015.7353481"},{"key":"183_CR180","unstructured":"McCormac J, Handa A, Leutenegger S, Davison AJ (2016) Scenenet RGB-D: 5m photorealistic images of synthetic indoor trajectories with ground truth. arXiv preprint arXiv:1612.05079"},{"key":"183_CR181","doi-asserted-by":"crossref","unstructured":"Memisevic R, Hinton G (2007) Unsupervised learning of image transformations. In: Proceedings of the CVPR. IEEE, pp 1\u20138","DOI":"10.1109\/CVPR.2007.383036"},{"key":"183_CR182","doi-asserted-by":"crossref","unstructured":"Messing R, Pal C, Kautz H (2009) Activity recognition using the velocity histories of tracked keypoints. In: ICCV. IEEE, pp 104\u2013111","DOI":"10.1109\/ICCV.2009.5459154"},{"key":"183_CR183","doi-asserted-by":"crossref","first-page":"1615","DOI":"10.1109\/TPAMI.2005.188","volume":"27","author":"K Mikolajczyk","year":"2005","unstructured":"Mikolajczyk K, Schmid C (2005) A performance evaluation of local descriptors. Trans Pattern Anal Mach Intell 27:1615\u20131630","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR184","doi-asserted-by":"crossref","first-page":"271","DOI":"10.1016\/S0262-8856(00)00076-7","volume":"19","author":"F Mokhtarian","year":"2001","unstructured":"Mokhtarian F, Khalili N, Yuen P (2001) Multi-scale free-form 3D object recognition using 3D models. 
Image Vis Comput 19:271\u2013281","journal-title":"Image Vis Comput"},{"key":"183_CR185","unstructured":"Monfort M, Andonian A, Zhou B, Ramakrishnan K, Bargal SA, Yan Y, Brown L, Fan Q, Gutfreund D, Vondrick C et\u00a0al. (2019) Moments in time dataset: one million videos for event understanding. Trans Pattern Anal Mach Intell 1\u20131"},{"key":"183_CR186","unstructured":"M\u00fcller AC, Behnke S (2014) Learning depth-sensitive conditional random fields for semantic segmentation of RGB-D images. In: ICRA. IEEE, pp 6232\u20136237"},{"key":"183_CR187","doi-asserted-by":"crossref","first-page":"1255","DOI":"10.1109\/TRO.2017.2705103","volume":"33","author":"R Mur-Artal","year":"2017","unstructured":"Mur-Artal R, Tard\u00f3s JD (2017) Orb-slam2: an open-source slam system for monocular, stereo, and RGB-D cameras. Trans Robot 33:1255\u20131262","journal-title":"Trans Robot"},{"key":"183_CR188","unstructured":"Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Proceedings of the ICML. Omnipress, pp 807\u2013814"},{"key":"183_CR189","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1016\/j.neucom.2012.08.064","volume":"120","author":"ER Nascimento","year":"2013","unstructured":"Nascimento ER, Oliveira GL, Vieira AW, Campos MF (2013) On the development of a robust, fast and lightweight keypoint descriptor. Neurocomputing 120:141\u2013155","journal-title":"Neurocomputing"},{"key":"183_CR190","unstructured":"Ng JYH, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the CVPR. IEEE, pp 4694\u20134702"},{"key":"183_CR191","unstructured":"Ngiam J, Chen Z, Koh PW, Ng AY (2011) Learning deep energy models. In: Proceedings of the ICML. 
Omnipress, pp 1105\u20131112"},{"key":"183_CR192","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1016\/j.compmedimag.2009.05.006","volume":"33","author":"D Ni","year":"2009","unstructured":"Ni D, Chui YP, Qu Y, Yang X, Qin J, Wong TT, Ho SS, Heng PA (2009) Reconstruction of volumetric ultrasound panorama based on improved 3D SIFT. Comput Med Imaging Graph 33:559\u2013566","journal-title":"Comput Med Imaging Graph"},{"key":"183_CR193","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1007\/s11263-007-0122-4","volume":"79","author":"JC Niebles","year":"2008","unstructured":"Niebles JC, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories using spatial-temporal words. Int J Comput Vis 79:299\u2013318","journal-title":"Int J Comput Vis"},{"key":"183_CR194","doi-asserted-by":"crossref","unstructured":"Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In: ICCV. IEEE, pp 1520\u20131528","DOI":"10.1109\/ICCV.2015.178"},{"key":"183_CR195","doi-asserted-by":"crossref","unstructured":"Novatnack J, Nishino K (2008) Scale-dependent\/invariant local 3D shape descriptors for fully automatic registration of multiple sets of range images. In: Proceedings of the ECCV. Springer, pp 440\u2013453","DOI":"10.1007\/978-3-540-88690-7_33"},{"key":"183_CR196","doi-asserted-by":"crossref","unstructured":"Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proceedings of the ECCV. Springer, pp 490\u2013503","DOI":"10.1007\/11744085_38"},{"key":"183_CR197","doi-asserted-by":"crossref","first-page":"710","DOI":"10.1109\/TSMCB.2005.861864","volume":"36","author":"A Oikonomopoulos","year":"2005","unstructured":"Oikonomopoulos A, Patras I, Pantic M (2005) Spatiotemporal salient points for visual recognition of human actions. 
Trans Syst Man Cybern B (Cybern) 36:710\u2013719","journal-title":"Trans Syst Man Cybern B (Cybern)"},{"key":"183_CR198","doi-asserted-by":"crossref","first-page":"971","DOI":"10.1109\/TPAMI.2002.1017623","volume":"24","author":"T Ojala","year":"2002","unstructured":"Ojala T, Pietikainen M, Maenpaa T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Trans Pattern Anal Mach Intell 24:971\u2013987","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR199","doi-asserted-by":"crossref","first-page":"831","DOI":"10.1109\/34.868684","volume":"22","author":"NM Oliver","year":"2000","unstructured":"Oliver NM, Rosario B, Pentland AP (2000) A bayesian computer vision system for modeling human interactions. Trans Pattern Anal Mach Intell 22:831\u2013843","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR200","doi-asserted-by":"crossref","unstructured":"Oreifej O, Liu Z (2013) Hon4d: Histogram of oriented 4D normals for activity recognition from depth sequences. In: Proceedings of the CVPR. IEEE, pp 716\u2013723","DOI":"10.1109\/CVPR.2013.98"},{"key":"183_CR201","doi-asserted-by":"crossref","first-page":"807","DOI":"10.1145\/571647.571648","volume":"21","author":"R Osada","year":"2002","unstructured":"Osada R, Funkhouser T, Chazelle B, Dobkin D (2002) Shape distributions. Trans Graph 21:807\u2013832","journal-title":"Trans Graph"},{"key":"183_CR202","unstructured":"Park SJ, Hong KS, Lee S (2017) Rdfnet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In: ICCV. IEEE, pp 4990\u20134999"},{"key":"183_CR203","doi-asserted-by":"crossref","first-page":"976","DOI":"10.1016\/j.imavis.2009.11.014","volume":"28","author":"R Poppe","year":"2010","unstructured":"Poppe R (2010) A survey on vision-based human action recognition. 
Image Vis Comput 28:976\u2013990","journal-title":"Image Vis Comput"},{"key":"183_CR204","unstructured":"Poultney C, Chopra S, Cun YL et\u00a0al. (2007) Efficient learning of sparse representations with an energy-based model. In: Advances in neural information processing systems, pp 1137\u20131144"},{"key":"183_CR205","unstructured":"Qi CR, Liu W, Wu C, Su H, Guibas LJ (2017) Frustum pointnets for 3D object detection from RGB-D data. arXiv preprint arXiv:1711.08488"},{"key":"183_CR206","unstructured":"Qi CR, Su H, Mo K, Guibas LJ (2017) Pointnet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the CVPR. IEEE"},{"key":"183_CR207","unstructured":"Qi CR, Su H, Nie\u00dfner M, Dai A, Yan M, Guibas LJ (2016) Volumetric and multi-view CNNs for object classification on 3D data. In: Proceedings of the CVPR. IEEE, pp 5648\u20135656"},{"key":"183_CR208","doi-asserted-by":"crossref","unstructured":"Qi X, Liao R, Jia J, Fidler S, Urtasun R (2017) 3D graph neural networks for RGBD semantic segmentation. In: ICCV. IEEE, pp 5199\u20135208","DOI":"10.1109\/ICCV.2017.556"},{"key":"183_CR209","unstructured":"Quadros A, Underwood JP, Douillard B (2013) Sydney urban objects dataset. http:\/\/www.acfr.usyd.edu.au\/papers\/SydneyUrbanObjectsDataset.shtml"},{"key":"183_CR210","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1016\/j.image.2018.03.015","volume":"65","author":"S Quan","year":"2018","unstructured":"Quan S, Ma J, Ma T, Hu F, Fang B (2018) Representing local shape geometry from multi-view silhouette perspective: a distinctive and robust binary 3D feature. Signal Process Image Commun 65:67\u201380","journal-title":"Signal Process Image Commun"},{"key":"183_CR211","doi-asserted-by":"crossref","first-page":"2430","DOI":"10.1109\/TPAMI.2016.2533389","volume":"38","author":"H Rahmani","year":"2016","unstructured":"Rahmani H, Mahmood A, Huynh D, Mian A (2016) Histogram of oriented principal components for cross-view action recognition. 
Trans Pattern Anal Mach Intell 38:2430\u20132443","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR212","doi-asserted-by":"crossref","unstructured":"Rahmani H, Mahmood A, Huynh DQ, Mian A (2014) Hopc: histogram of oriented principal components of 3D pointclouds for action recognition. In: Proceedings of the ECCV. Springer, pp 742\u2013757","DOI":"10.1007\/978-3-319-10605-2_48"},{"key":"183_CR213","first-page":"25","volume":"1","author":"M Regneri","year":"2013","unstructured":"Regneri M, Rohrbach M, Wetzel D, Thater S, Schiele B, Pinkal M (2013) Grounding action descriptions in videos. Trans ACL 1:25\u201336","journal-title":"Trans ACL"},{"key":"183_CR214","unstructured":"Ren M, Liao R, Urtasun R, Sinz FH, Zemel RS (2016) Normalizing the normalizers: comparing and extending network normalization schemes. arXiv preprint arXiv:1611.04520"},{"key":"183_CR215","unstructured":"Ren X, Bo L, Fox D (2012) Rgb-(d) scene labeling: features and algorithms. In: Proceedings of the CVPR. IEEE, pp 2759\u20132766"},{"key":"183_CR216","doi-asserted-by":"crossref","first-page":"1179","DOI":"10.1109\/LRA.2016.2532924","volume":"1","author":"C Rennie","year":"2016","unstructured":"Rennie C, Shome R, Bekris KE, De Souza AF (2016) A dataset for improved RGBD-based object detection and pose estimation for warehouse pick-and-place. Robot Autom Lett 1:1179\u20131185","journal-title":"Robot Autom Lett"},{"key":"183_CR217","doi-asserted-by":"crossref","unstructured":"Richter SR, Vineet V, Roth S, Koltun V (2016) Playing for data: Ground truth from computer games. In: Proceedings of the ECCV. Springer, pp 102\u2013118","DOI":"10.1007\/978-3-319-46475-6_7"},{"key":"183_CR218","unstructured":"Rifai S, Vincent P, Muller X, Glorot X, Bengio Y (2011) Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the ICML. 
Omnipress, pp 833\u2013840"},{"key":"183_CR219","doi-asserted-by":"crossref","unstructured":"Rios-Cabrera R, Tuytelaars T (2013) Discriminatively trained templates for 3D object detection: a real time scalable approach. In: ICCV. IEEE, pp 2048\u20132055","DOI":"10.1109\/ICCV.2013.256"},{"key":"183_CR220","doi-asserted-by":"crossref","unstructured":"Rodriguez MD, Ahmed J, Shah M (2008) Action mach a spatio-temporal maximum average correlation height filter for action recognition. In: Proceedings of the CVPR. IEEE, pp 1\u20138","DOI":"10.1109\/CVPR.2008.4587727"},{"key":"183_CR221","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1016\/S0262-8856(96)01127-4","volume":"15","author":"K Rohr","year":"1997","unstructured":"Rohr K (1997) On 3D differential operators for detecting point landmarks. Image Vis Comput 15:219\u2013233","journal-title":"Image Vis Comput"},{"key":"183_CR222","doi-asserted-by":"crossref","unstructured":"Ros G, Sellart L, Materzynska J, Vazquez D, Lopez AM (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of the CVPR. IEEE, pp 3234\u20133243","DOI":"10.1109\/CVPR.2016.352"},{"key":"183_CR223","doi-asserted-by":"crossref","unstructured":"Rosten E, Drummond T (2006) Machine learning for high-speed corner detection. In: Proceedings of the ECCV. Springer, pp 430\u2013443","DOI":"10.1007\/11744023_34"},{"key":"183_CR224","doi-asserted-by":"crossref","unstructured":"Rublee E, Rabaud V, Konolige K, Bradski GR (2011) Orb: An efficient alternative to SIFT or SURF. In: ICCV, vol\u00a011. Citeseer, p\u00a02","DOI":"10.1109\/ICCV.2011.6126544"},{"key":"183_CR225","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. 
Int J Comput Vis 115:211\u2013252","journal-title":"Int J Comput Vis"},{"key":"183_CR226","unstructured":"Rustamov RM (2007) Laplace-beltrami eigenfunctions for deformation invariant shape representation. In: Proceedings of the ESGP. Eurographics Association, pp 225\u2013233"},{"key":"183_CR227","doi-asserted-by":"crossref","unstructured":"Rusu RB, Blodow N, Beetz M (2009) Fast point feature histograms (FPFH) for 3D registration. In: ICRA. IEEE, pp 3212\u20133217","DOI":"10.1109\/ROBOT.2009.5152473"},{"key":"183_CR228","doi-asserted-by":"crossref","unstructured":"Rusu RB, Blodow N, Marton ZC, Beetz M (2008) Aligning point cloud views using persistent feature histograms. In: IROS. IEEE, pp 3384\u20133391","DOI":"10.1109\/IROS.2008.4650967"},{"key":"183_CR229","doi-asserted-by":"crossref","first-page":"206","DOI":"10.1108\/02602280410525995","volume":"24","author":"A Saeed Mian","year":"2004","unstructured":"Saeed Mian A, Bennamoun M, Owens R (2004) Automated 3D model-based free-form object recognition. Sens Rev 24:206\u2013215","journal-title":"Sens Rev"},{"key":"183_CR230","unstructured":"Salakhutdinov R (2008) Learning and evaluating boltzmann machines. Technical Report, Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto"},{"key":"183_CR231","unstructured":"Salakhutdinov R, Hinton G (2009) Deep boltzmann machines. In: AISTATS. PMLR, pp 448\u2013455"},{"key":"183_CR232","unstructured":"Salakhutdinov R, Larochelle H (2010) Efficient learning of deep boltzmann machines. In: AISTATS. PMLR, pp 693\u2013700"},{"key":"183_CR233","unstructured":"Salimans T, Kingma DP (2016) Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in neural information processing systems, vol 29. 
Curran Associates, Inc., pp 901\u2013909"},{"key":"183_CR234","doi-asserted-by":"crossref","unstructured":"Saputra MRU, Markham A, Trigoni N (2018) Visual slam and structure from motion in dynamic environments: a survey. CSUR p.\u00a037","DOI":"10.1145\/3177853"},{"key":"183_CR235","doi-asserted-by":"crossref","unstructured":"Savarese S, Fei-Fei L (2007) 3D generic object categorization, localization and pose estimation. In: ICCV. IEEE, pp 1\u20138","DOI":"10.1109\/ICCV.2007.4408987"},{"key":"183_CR236","doi-asserted-by":"crossref","unstructured":"Savva M, Chang AX, Hanrahan P (2015) Semantically-enriched 3D models for common-sense knowledge. In: Proceedings of the CVPRW. IEEE, pp 24\u201331","DOI":"10.1109\/CVPRW.2015.7301289"},{"key":"183_CR237","doi-asserted-by":"crossref","unstructured":"Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: ICPR. IEEE, pp 32\u201336","DOI":"10.1109\/ICPR.2004.1334462"},{"key":"183_CR238","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","volume":"45","author":"M Schuster","year":"1997","unstructured":"Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. Trans Signal Process 45:2673\u20132681","journal-title":"Trans Signal Process"},{"key":"183_CR239","doi-asserted-by":"crossref","unstructured":"Scovanner P, Ali S, Shah M (2007) A 3-dimensional SIFT descriptor and its application to action recognition. In: Proceedings of the ICM, pp 357\u2013360. ACM","DOI":"10.1145\/1291233.1291311"},{"key":"183_CR240","doi-asserted-by":"crossref","unstructured":"Sebe N, Lew MS, Huang TS (2004) The state-of-the-art in human\u2013computer interaction. In: International workshop on computer vision in human\u2013computer interaction. Springer, pp 1\u20136","DOI":"10.1007\/978-3-540-24837-8_1"},{"key":"183_CR241","unstructured":"Sedaghat N, Zolfaghari M, Amiri E, Brox T (2016) Orientation-boosted voxel nets for 3D object recognition. 
arXiv preprint arXiv:1604.03351"},{"key":"183_CR242","doi-asserted-by":"crossref","unstructured":"Shahroudy A, Liu J, Ng TT, Wang G (2016) Ntu rgb+ d: a large scale dataset for 3D human activity analysis. In: Proceedings of the CVPR. IEEE, pp 1010\u20131019","DOI":"10.1109\/CVPR.2016.115"},{"key":"183_CR243","doi-asserted-by":"crossref","unstructured":"Shechtman E, Irani M (2005) Space-time behavior based correlation. In: Proceedings of the CVPR. IEEE, pp 405\u2013412","DOI":"10.1109\/CVPR.2005.328"},{"key":"183_CR244","doi-asserted-by":"crossref","first-page":"2045","DOI":"10.1109\/TPAMI.2007.1119","volume":"29","author":"E Shechtman","year":"2007","unstructured":"Shechtman E, Irani M (2007) Space-time behavior-based correlation-or-how to tell if two underlying motion fields are similar without computing them? Trans Pattern Anal Mach Intell 29:2045\u20132056","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR245","doi-asserted-by":"crossref","first-page":"2339","DOI":"10.1109\/LSP.2015.2480802","volume":"22","author":"B Shi","year":"2015","unstructured":"Shi B, Bai S, Zhou Z, Bai X (2015) Deeppano: Deep panoramic representation for 3-d shape recognition. Signal Process Lett 22:2339\u20132343","journal-title":"Signal Process Lett"},{"key":"183_CR246","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1016\/j.patcog.2006.04.034","volume":"40","author":"JL Shih","year":"2007","unstructured":"Shih JL, Lee CH, Wang JT (2007) A new 3D model retrieval approach based on the elevation descriptor. Pattern Recognit 40:283\u2013295","journal-title":"Pattern Recognit"},{"key":"183_CR247","doi-asserted-by":"crossref","unstructured":"Shilane P, Min P, Kazhdan M, Funkhouser T (2004) The princeton shape benchmark. In: Shape modeling applications, 2004. Proceedings. 
IEEE, pp 167\u2013178","DOI":"10.1109\/SMI.2004.1314504"},{"issue":"2","key":"183_CR248","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1016\/S0378-3758(00)00115-4","volume":"90","author":"H Shimodaira","year":"2000","unstructured":"Shimodaira H (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. J Stat Plan Inference 90(2):227\u2013244","journal-title":"J Stat Plan Inference"},{"key":"183_CR249","doi-asserted-by":"crossref","unstructured":"Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from single depth images. In: Proceedings of the CVPR. IEEE, pp 1297\u20131304","DOI":"10.1109\/CVPR.2011.5995316"},{"key":"183_CR250","doi-asserted-by":"crossref","unstructured":"Silberman N, Fergus R (2011) Indoor scene segmentation using a structured light sensor. In: ICCVW. IEEE, pp 601\u2013608","DOI":"10.1109\/ICCVW.2011.6130298"},{"key":"183_CR251","doi-asserted-by":"crossref","unstructured":"Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from RGBD images. In: Proceedings of the ECCV. Springer, pp 746\u2013760","DOI":"10.1007\/978-3-642-33715-4_54"},{"key":"183_CR252","unstructured":"Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034"},{"key":"183_CR253","unstructured":"Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, vol 27. Curran Associates, Inc., pp 568\u2013576"},{"key":"183_CR254","unstructured":"Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. 
arXiv preprint arXiv:1409.1556"},{"key":"183_CR255","doi-asserted-by":"crossref","unstructured":"Singh A, Sha J, Narayan KS, Achim T, Abbeel P (2014) Bigbird: a large-scale 3D database of object instances. In: ICRA. IEEE, pp 509\u2013516","DOI":"10.1109\/ICRA.2014.6906903"},{"key":"183_CR256","doi-asserted-by":"crossref","first-page":"1107","DOI":"10.1007\/s10462-018-9651-1","volume":"52","author":"T Singh","year":"2019","unstructured":"Singh T, Vishwakarma DK (2019) Video benchmarks of human action datasets: a review. Artif Intell Rev 52:1107\u20131154","journal-title":"Artif Intell Rev"},{"key":"183_CR257","unstructured":"Socher R, Huval B, Bath BP, Manning CD, Ng AY (2012) Convolutional-recursive deep learning for 3d object classification. In: Advances in neural information processing systems. Curran Associates, Inc., p\u00a08"},{"key":"183_CR258","doi-asserted-by":"crossref","unstructured":"Song S, Lichtenberg SP, Xiao J (2015) Sun RGB-D: A RGB-D scene understanding benchmark suite. In: Proceedings of the CVPR. IEEE, pp 567\u2013576","DOI":"10.1109\/CVPR.2015.7298655"},{"key":"183_CR259","doi-asserted-by":"crossref","unstructured":"Song S, Yu F, Zeng A, Chang AX, Savva M, Funkhouser T (2017) Semantic scene completion from a single depth image. In: Proceedings of the CVPR. IEEE, pp 1746\u20131754","DOI":"10.1109\/CVPR.2017.28"},{"key":"183_CR260","doi-asserted-by":"crossref","unstructured":"Song Y, Morency LP, Davis R (2013) Action recognition by hierarchical sequence summarization. In: Proceedings of the CVPR. IEEE, pp 3562\u20133569","DOI":"10.1109\/CVPR.2013.457"},{"key":"183_CR261","unstructured":"Soomro K, Zamir AR, Shah M (2012) Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402"},{"key":"183_CR262","unstructured":"Srivastava RK, Greff K, Schmidhuber J (2015) Highway networks. 
arXiv preprint arXiv:1505.00387"},{"key":"183_CR263","doi-asserted-by":"crossref","unstructured":"Strasdat H, Davison AJ, Montiel JM, Konolige K (2011) Double window optimisation for constant time visual slam. In: ICCV. IEEE, pp 2352\u20132359","DOI":"10.1109\/ICCV.2011.6126517"},{"key":"183_CR264","doi-asserted-by":"crossref","unstructured":"St\u00fcckler J, Biresev N, Behnke S (2012) Semantic mapping using object-class segmentation of RGB-D images. In: IROS. IEEE, pp 3005\u20133010","DOI":"10.1109\/IROS.2012.6385983"},{"key":"183_CR265","doi-asserted-by":"crossref","first-page":"599","DOI":"10.1007\/s11554-013-0379-5","volume":"10","author":"J St\u00fcckler","year":"2015","unstructured":"St\u00fcckler J, Waldvogel B, Schulz H, Behnke S (2015) Dense real-time mapping of object-class semantics from RGB-D video. J Real-Time Image Process 10:599\u2013609","journal-title":"J Real-Time Image Process"},{"key":"183_CR266","doi-asserted-by":"crossref","unstructured":"Su H, Maji S, Kalogerakis E, Learned-Miller E (2015) Multi-view convolutional neural networks for 3d shape recognition. In: ICCV. IEEE, pp 945\u2013953","DOI":"10.1109\/ICCV.2015.114"},{"key":"183_CR267","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1007\/s11263-013-0644-x","volume":"106","author":"D Sun","year":"2014","unstructured":"Sun D, Roth S, Black MJ (2014) A quantitative analysis of current practices in optical flow estimation and the principles behind them. Int J Comput Vis 106:115\u2013137","journal-title":"Int J Comput Vis"},{"key":"183_CR268","doi-asserted-by":"crossref","unstructured":"Sun J, Ovsjanikov M, Guibas L (2009) A concise and provably informative multi-scale signature based on heat diffusion. In: Computer graphics forum. Wiley Online Library, pp 1383\u20131392","DOI":"10.1111\/j.1467-8659.2009.01515.x"},{"key":"183_CR269","unstructured":"Sun J, Wu X, Yan S, Cheong LF, Chua TS, Li J (2009) Hierarchical spatio-temporal context modeling for action recognition. 
In: Proceedings of the CVPR. IEEE, pp 2004\u20132011"},{"key":"183_CR270","doi-asserted-by":"crossref","unstructured":"Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI conference on artificial intelligence","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"183_CR271","doi-asserted-by":"crossref","unstructured":"Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A, et\u00a0al. (2015) Going deeper with convolutions. In: Proceedings of the CVPR. IEEE, pp 1\u20139","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"183_CR272","unstructured":"Tang S, Wang X, Lv X, Han TX, Keller J, He Z, Skubic M, Lao S (2012) Histogram of oriented normal vectors for object recognition with a depth sensor. In: ACCV. Springer, pp 525\u2013538"},{"key":"183_CR273","unstructured":"Tangelder JW, Veltkamp RC (2004) A survey of content based 3D shape retrieval methods. In: Shape modeling applications, 2004. IEEE, pp 145\u2013156"},{"key":"183_CR274","doi-asserted-by":"crossref","unstructured":"Taylor GW, Fergus R, LeCun Y, Bregler C (2010) Convolutional learning of spatio-temporal features. In: Proceedings of the ECCV. Springer, pp 140\u2013153","DOI":"10.1007\/978-3-642-15567-3_11"},{"key":"183_CR275","doi-asserted-by":"crossref","unstructured":"Teichman A, Levinson J, Thrun S (2011) Towards 3D object recognition via classification of arbitrary object tracks. In: ICRA. IEEE, pp 4034\u20134041","DOI":"10.1109\/ICRA.2011.5979636"},{"key":"183_CR276","doi-asserted-by":"crossref","first-page":"804","DOI":"10.1177\/0278364912442751","volume":"31","author":"A Teichman","year":"2012","unstructured":"Teichman A, Thrun S (2012) Tracking-based semi-supervised learning. 
Int J Robot Res 31:804\u2013818","journal-title":"Int J Robot Res"},{"key":"183_CR277","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1109\/TPAMI.2017.2665623","volume":"40","author":"A Tejani","year":"2017","unstructured":"Tejani A, Kouskouridas R, Doumanoglou A, Tang D, Kim TK (2017) Latent-class hough forests for 6 DoF object pose estimation. Trans Pattern Anal Mach Intell 40:119\u2013132","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR278","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1109\/TPAMI.2017.2665623","volume":"40","author":"A Tejani","year":"2018","unstructured":"Tejani A, Kouskouridas R, Doumanoglou A, Tang D, Kim TK (2018) Latent-class hough forests for 6 dof object pose estimation. Trans Pattern Anal Mach Intell 40:119\u2013132","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR279","doi-asserted-by":"crossref","unstructured":"Thomee B, Huiskes MJ, Bakker E, Lew MS (2008) Large scale image copy detection evaluation. In: ICMIR. ACM, pp 59\u201366","DOI":"10.1145\/1460096.1460108"},{"key":"183_CR280","unstructured":"Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li LJ (2015) The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817"},{"key":"183_CR281","doi-asserted-by":"crossref","unstructured":"Tombari F, Salti S, Di\u00a0Stefano L (2010) Unique signatures of histograms for local surface description. In: Proceedings of the ECCV. Springer, pp 356\u2013369","DOI":"10.1007\/978-3-642-15558-1_26"},{"key":"183_CR282","doi-asserted-by":"crossref","unstructured":"Tombari F, Salti S, Di\u00a0Stefano L (2011) A combined texture-shape descriptor for enhanced 3D feature matching. In: ICIP. 
IEEE, pp 809\u2013812","DOI":"10.1109\/ICIP.2011.6116679"},{"key":"183_CR283","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1007\/s11263-012-0545-4","volume":"102","author":"F Tombari","year":"2013","unstructured":"Tombari F, Salti S, Di Stefano L (2013) Performance evaluation of 3D keypoint detectors. Int J Comput Vis 102:198\u2013220","journal-title":"Int J Comput Vis"},{"key":"183_CR284","doi-asserted-by":"crossref","unstructured":"Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: ICCV. IEEE, pp 4489\u20134497","DOI":"10.1109\/ICCV.2015.510"},{"key":"183_CR285","doi-asserted-by":"crossref","unstructured":"Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the CVPR. IEEE, pp 6450\u20136459","DOI":"10.1109\/CVPR.2018.00675"},{"key":"183_CR286","doi-asserted-by":"crossref","unstructured":"Trottier L, Gigu P, Chaib-draa B, et\u00a0al. (2017) Parametric exponential linear unit for deep convolutional neural networks. In: ICMLA. IEEE, pp 207\u2013214","DOI":"10.1109\/ICMLA.2017.00038"},{"key":"183_CR287","unstructured":"Ulyanov D, Vedaldi A, Lempitsky V (2016) Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022"},{"key":"183_CR288","doi-asserted-by":"crossref","unstructured":"Valada A, Mohan R, Burgard W (2019) Self-supervised model adaptation for multimodal semantic segmentation. Int J Comput Vis","DOI":"10.1007\/s11263-019-01188-y"},{"key":"183_CR289","doi-asserted-by":"crossref","first-page":"1510","DOI":"10.1109\/TPAMI.2017.2712608","volume":"40","author":"G Varol","year":"2017","unstructured":"Varol G, Laptev I, Schmid C (2017) Long-term temporal convolutions for action recognition. 
Trans Pattern Anal Mach Intell 40:1510\u20131517","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR290","doi-asserted-by":"crossref","unstructured":"Vieira AW, Nascimento ER, Oliveira GL, Liu Z, Campos MF (2012) Stop: space-time occupancy patterns for 3D action recognition from depth map sequences. In: Iberoamerican congress on pattern recognition. Springer, pp 252\u2013259","DOI":"10.1007\/978-3-642-33275-3_31"},{"key":"183_CR291","doi-asserted-by":"crossref","unstructured":"Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the ICML, pp 1096\u20131103. ACM","DOI":"10.1145\/1390156.1390294"},{"key":"183_CR292","first-page":"3371","volume":"11","author":"P Vincent","year":"2010","unstructured":"Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371\u20133408","journal-title":"J Mach Learn Res"},{"key":"183_CR293","unstructured":"Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the CVPR. IEEE, p\u00a03"},{"key":"183_CR294","doi-asserted-by":"crossref","unstructured":"Wang A, Lu J, Wang G, Cai J, Cham TJ (2014) Multi-modal unsupervised feature learning for RGB-D scene labeling. In: Proceedings of the ECCV. Springer, pp 453\u2013467","DOI":"10.1007\/978-3-319-10602-1_30"},{"key":"183_CR295","unstructured":"Wang C, Pelillo M, Siddiqi K (2019) Dominant set clustering and pooling for multi-view 3D object recognition. arXiv preprint arXiv:1906.01592"},{"key":"183_CR296","unstructured":"Wang DZ, Posner I, Newman P (2012) What could move? finding cars, pedestrians and bicyclists in 3D laser data. In: ICRA. IEEE, pp 4038\u20134044"},{"key":"183_CR297","unstructured":"Wang G, Luo P, Wang X, Lin L, et\u00a0al. 
(2018) Kalman normalization: Normalizing internal representations across network layers. In: Advances in neural information processing systems, vol 31. Curran Associates, Inc., pp 21\u201331"},{"key":"183_CR298","doi-asserted-by":"crossref","unstructured":"Wang H, Kl\u00e4ser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: Proceedings of the CVPR. IEEE, pp 3169\u20133176","DOI":"10.1109\/CVPR.2011.5995407"},{"key":"183_CR299","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1007\/s11263-012-0594-8","volume":"103","author":"H Wang","year":"2013","unstructured":"Wang H, Kl\u00e4ser A, Schmid C, Liu CL (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103:60\u201379","journal-title":"Int J Comput Vis"},{"key":"183_CR300","doi-asserted-by":"crossref","unstructured":"Wang H, Schmid C (2013) Action recognition with improved trajectories. In: ICCV. IEEE, pp 3551\u20133558","DOI":"10.1109\/ICCV.2013.441"},{"key":"183_CR301","doi-asserted-by":"crossref","unstructured":"Wang J, Liu Z, Wu Y, Yuan J (2012) Mining actionlet ensemble for action recognition with depth cameras. In: Proceedings of the CVPR. IEEE, pp 1290\u20131297","DOI":"10.1109\/CVPR.2012.6247813"},{"key":"183_CR302","doi-asserted-by":"crossref","first-page":"914","DOI":"10.1109\/TPAMI.2013.198","volume":"36","author":"J Wang","year":"2014","unstructured":"Wang J, Liu Z, Wu Y (2014) Learning actionlet ensemble for 3D human action recognition. Trans Pattern Anal Mach Intell 36:914\u2013927","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR303","doi-asserted-by":"crossref","unstructured":"Wang J, Wang Z, Tao D, See S, Wang G (2016) Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. In: Proceedings of the ECCV. 
Springer, pp 664\u2013679","DOI":"10.1007\/978-3-319-46454-1_40"},{"key":"183_CR304","doi-asserted-by":"crossref","unstructured":"Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the CVPR. IEEE, pp 4305\u20134314","DOI":"10.1109\/CVPR.2015.7299059"},{"key":"183_CR305","unstructured":"Wang L, Xiong Y, Wang Z, Qiao Y (2015) Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159"},{"key":"183_CR306","doi-asserted-by":"crossref","first-page":"498","DOI":"10.1109\/THMS.2015.2504550","volume":"46","author":"P Wang","year":"2016","unstructured":"Wang P, Li W, Gao Z, Zhang J, Tang C, Ogunbona PO (2016) Action recognition from depth maps using deep convolutional neural networks. Trans Hum Mach Syst 46:498\u2013509","journal-title":"Trans Hum Mach Syst"},{"key":"183_CR307","doi-asserted-by":"crossref","first-page":"1310","DOI":"10.1109\/TPAMI.2010.214","volume":"33","author":"Y Wang","year":"2011","unstructured":"Wang Y, Mori G (2011) Hidden part models for human action recognition: probabilistic versus max margin. Trans Pattern Anal Mach Intell 33:1310\u20131323","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR308","doi-asserted-by":"crossref","first-page":"1697","DOI":"10.1177\/0278364916669237","volume":"35","author":"T Whelan","year":"2016","unstructured":"Whelan T, Salas-Moreno RF, Glocker B, Davison AJ, Leutenegger S (2016) Elasticfusion: real-time dense SLAM and light source estimation. Int J Robot Res 35:1697\u20131716","journal-title":"Int J Robot Res"},{"key":"183_CR309","doi-asserted-by":"crossref","unstructured":"Willems G, Becker JH, Tuytelaars T, Van\u00a0Gool LJ (2009) Exemplar-based action recognition in video. In: BMVC. 
BMVA Press, p 3","DOI":"10.5244\/C.23.90"},{"key":"183_CR310","doi-asserted-by":"crossref","unstructured":"Willems G, Tuytelaars T, Van\u00a0Gool L (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. In: Proceedings of the ECCV. Springer, pp 650\u2013663","DOI":"10.1007\/978-3-540-88688-4_48"},{"key":"183_CR311","unstructured":"Wong SF, Cipolla R (2007) Extracting spatiotemporal interest points using global information. In: ICCV. IEEE, pp 1\u20138"},{"key":"183_CR312","unstructured":"Wu J, Zhang C, Xue T, Freeman B, Tenenbaum J (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: Advances in neural information processing systems, vol 29. Curran Associates, Inc., pp 82\u201390"},{"key":"183_CR313","doi-asserted-by":"crossref","unstructured":"Wu Y, He K (2018) Group normalization. In: Proceedings of the ECCV. Springer, pp 3\u201319","DOI":"10.1007\/978-3-030-01261-8_1"},{"key":"183_CR314","unstructured":"Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, Xiao J (2015) 3D shapenets: A deep representation for volumetric shapes. In: Proceedings of the CVPR. IEEE, pp 1912\u20131920"},{"key":"183_CR315","doi-asserted-by":"crossref","unstructured":"Xia L, Aggarwal J (2013) Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: Proceedings of the CVPR. IEEE, pp 2834\u20132841","DOI":"10.1109\/CVPR.2013.365"},{"key":"183_CR316","doi-asserted-by":"crossref","unstructured":"Xiao J, Owens A, Torralba A (2013) Sun3d: A database of big spaces reconstructed using sfm and object labels. In: ICCV. IEEE, pp 1625\u20131632","DOI":"10.1109\/ICCV.2013.458"},{"key":"183_CR317","unstructured":"Xu H, He K, Sigal L, Sclaroff S, Saenko K (2018) Text-to-clip video retrieval with early fusion and re-captioning. 
arXiv preprint arXiv:1804.05113"},{"key":"183_CR318","doi-asserted-by":"crossref","unstructured":"Yamato J, Ohya J, Ishii K (1992) Recognizing human action in time-sequential images using hidden markov model. In: Proceedings of the CVPR. IEEE, pp 379\u2013385","DOI":"10.1109\/CVPR.1992.223161"},{"key":"183_CR319","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1016\/j.ins.2016.01.095","volume":"346","author":"J Yang","year":"2016","unstructured":"Yang J, Cao Z, Zhang Q (2016) A fast and robust local descriptor for 3D point cloud registration. Information Sciences 346:163\u2013179","journal-title":"Information Sciences"},{"key":"183_CR320","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1016\/j.patcog.2016.11.019","volume":"65","author":"J Yang","year":"2017","unstructured":"Yang J, Zhang Q, Xiao Y, Cao Z (2017) Toldi: an effective and robust approach for 3D local shape description. Pattern Recognit 65:175\u2013187","journal-title":"Pattern Recognit"},{"key":"183_CR321","doi-asserted-by":"crossref","unstructured":"Yang X, Tian Y (2014) Super normal vector for activity recognition using depth sequences. In: Proceedings of the CVPR. IEEE, pp 804\u2013811","DOI":"10.1109\/CVPR.2014.108"},{"key":"183_CR322","unstructured":"Yang X, Tian YL (2012) Eigenjoints-based action recognition using naive-bayes-nearest-neighbor. In: Proceedings of the CVPR. IEEE, pp 14\u201319"},{"key":"183_CR323","doi-asserted-by":"crossref","unstructured":"Yeffet L, Wolf L (2009) Local trinary patterns for human action recognition. In: ICCV. IEEE, pp 492\u2013497","DOI":"10.1109\/ICCV.2009.5459201"},{"key":"183_CR324","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1016\/j.neucom.2018.03.037","volume":"304","author":"H Yu","year":"2018","unstructured":"Yu H, Yang Z, Tan L, Wang Y, Sun W, Sun M, Tang Y (2018) Methods and datasets on semantic segmentation: a review. 
Neurocomputing 304:82\u2013103","journal-title":"Neurocomputing"},{"key":"183_CR325","doi-asserted-by":"crossref","unstructured":"Yu TH, Kim TK, Cipolla R (2010) Real-time action recognition by spatiotemporal semantic and structural forests. In: BMVC. BMVA Press, p\u00a06","DOI":"10.5244\/C.24.52"},{"key":"183_CR326","unstructured":"Yu W, Yang K, Bai Y, Yao H, Rui Y (2014) Visualizing and comparing convolutional neural networks. arXiv preprint arXiv:1412.6631"},{"key":"183_CR327","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1145\/2766908","volume":"34","author":"ME Yumer","year":"2015","unstructured":"Yumer ME, Chaudhuri S, Hodgins JK, Kara LB (2015) Semantic shape editing using deformation handles. ACM Trans Graph 34:86","journal-title":"ACM Trans Graph"},{"key":"183_CR328","unstructured":"Yumer ME, Mitra NJ (2016) Learning semantic deformation flows with 3D convolutional networks. In: Proceedings of the ECCV. Springer, pp 294\u2013311"},{"key":"183_CR329","doi-asserted-by":"crossref","unstructured":"Zaharescu A, Boyer E, Varanasi K, Horaud R (2009) Surface feature detection and description with applications to mesh matching. In: Proceedings of the CVPR. IEEE, pp 373\u2013380","DOI":"10.1109\/CVPR.2009.5206748"},{"key":"183_CR330","unstructured":"Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329"},{"key":"183_CR331","unstructured":"Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Proceedings of the ECCV. Springer, pp 818\u2013833"},{"key":"183_CR332","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1109\/MMUL.2012.24","volume":"19","author":"Z Zhang","year":"2012","unstructured":"Zhang Z (2012) Microsoft kinect sensor and its effect. 
IEEE Multimed 19:4\u201310","journal-title":"IEEE Multimed"},{"key":"183_CR333","doi-asserted-by":"crossref","unstructured":"Zhao R, Ali H, Van\u00a0der Smagt P (2017) Two-stream RNN\/CNN for action recognition in 3D videos. In: IROS. IEEE, pp 4260\u20134267","DOI":"10.1109\/IROS.2017.8206288"},{"issue":"5","key":"183_CR334","doi-asserted-by":"crossref","first-page":"1224","DOI":"10.1109\/TPAMI.2017.2709749","volume":"40","author":"L Zheng","year":"2017","unstructured":"Zheng L, Yang Y, Tian Q (2017) SIFT meets CNN: a decade survey of instance retrieval. Trans Pattern Anal Mach Intell 40(5):1224\u20131244","journal-title":"Trans Pattern Anal Mach Intell"},{"key":"183_CR335","doi-asserted-by":"crossref","unstructured":"Zhong Y (2009) Intrinsic shape signatures: a shape descriptor for 3D object recognition. In: ICCVW. IEEE, pp 689\u2013696","DOI":"10.1109\/ICCVW.2009.5457637"},{"key":"183_CR336","doi-asserted-by":"crossref","first-page":"522","DOI":"10.1016\/j.patcog.2017.11.029","volume":"76","author":"Y Zou","year":"2018","unstructured":"Zou Y, Wang X, Zhang T, Liang B, Song J, Liu H (2018) BRoPH: an efficient and compact binary descriptor for 3D point clouds. 
Pattern Recognit 76:522\u2013536","journal-title":"Pattern Recognit"}],"container-title":["International Journal of Multimedia Information Retrieval"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s13735-019-00183-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s13735-019-00183-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s13735-019-00183-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,10,6]],"date-time":"2022-10-06T17:53:35Z","timestamp":1665078815000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s13735-019-00183-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,22]]},"references-count":336,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,9]]}},"alternative-id":["183"],"URL":"https:\/\/doi.org\/10.1007\/s13735-019-00183-w","relation":{},"ISSN":["2192-6611","2192-662X"],"issn-type":[{"value":"2192-6611","type":"print"},{"value":"2192-662X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,11,22]]},"assertion":[{"value":"10 July 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 October 2019","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 October 2019","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 November 2019","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}