{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T04:02:13Z","timestamp":1773806533285,"version":"3.50.1"},"reference-count":92,"publisher":"World Scientific Pub Co Pte Ltd","issue":"04","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62071421"],"award-info":[{"award-number":["62071421"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2023YFE0204200"],"award-info":[{"award-number":["2023YFE0204200"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Int. J. Wavelets Multiresolut Inf. Process."],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:p> Spatio-temporal action detection (STAD) aims to classify the actions present in a video and localize them in space and time. It has become a particularly active area of research in computer vision because of its explosively emerging real-world applications, such as autonomous driving, visual surveillance and entertainment. Many efforts have been devoted in recent years to build a robust and effective framework for STAD. This paper provides a comprehensive review of the state-of-the-art deep learning-based methods for STAD. First, a taxonomy is developed to organize these methods. Next, the linking algorithms, which aim to associate the frame- or clip-level detection results together to form action tubes, are reviewed. Then, the commonly used benchmark datasets and evaluation metrics are introduced, and the performance of state-of-the-art models is compared. At last, this paper is concluded, and a set of potential research directions of STAD are discussed. <\/jats:p>","DOI":"10.1142\/s0219691323500662","type":"journal-article","created":{"date-parts":[[2024,1,18]],"date-time":"2024-01-18T14:27:25Z","timestamp":1705588045000},"source":"Crossref","is-referenced-by-count":13,"title":["A survey on deep learning-based spatio-temporal action detection"],"prefix":"10.1142","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0016-6273","authenticated-orcid":false,"given":"Peng","family":"Wang","sequence":"first","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang 310007, P. R. China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-7872-8610","authenticated-orcid":false,"given":"Fanwei","family":"Zeng","sequence":"additional","affiliation":[{"name":"Ant Group, Hangzhou, Zhejiang 310007, P. R. China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7418-5891","authenticated-orcid":false,"given":"Yuntao","family":"Qian","sequence":"additional","affiliation":[{"name":"College of Computer Science, Zhejiang University, Hangzhou, Zhejiang 310007, P. R. China"}]}],"member":"219","published-online":{"date-parts":[[2024,2,9]]},"reference":[{"key":"S0219691323500662BIB001","doi-asserted-by":"publisher","DOI":"10.1007\/s12652-021-03323-5"},{"key":"S0219691323500662BIB002","doi-asserted-by":"publisher","DOI":"10.3390\/electronics12051165"},{"key":"S0219691323500662BIB003","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2018.2887283"},{"key":"S0219691323500662BIB006","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2005.28"},{"key":"S0219691323500662BIB007","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"S0219691323500662BIB008","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2010.5539875"},{"key":"S0219691323500662BIB009","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"S0219691323500662BIB010","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"S0219691323500662BIB012","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00807"},{"key":"S0219691323500662BIB014","doi-asserted-by":"publisher","DOI":"10.1109\/WACVW54805.2022.00018"},{"key":"S0219691323500662BIB015","doi-asserted-by":"publisher","DOI":"10.1109\/ITSC.2017.8317865"},{"key":"S0219691323500662BIB017","volume-title":"Advances in Neural Information Processing Systems","author":"Duarte K.","year":"2018"},{"key":"S0219691323500662BIB018","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00334"},{"key":"S0219691323500662BIB019","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00028"},{"key":"S0219691323500662BIB020","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"S0219691323500662BIB022","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00033"},{"key":"S0219691323500662BIB023","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"S0219691323500662BIB024","first-page":"759","volume-title":"Proc. IEEE Conf. Computer Vision Pattern Recognition","author":"Gkioxari G.","year":"2014"},{"key":"S0219691323500662BIB026","first-page":"5842","volume-title":"Proc. IEEE Int. Conf. Computer Vision","author":"Goyal R.","year":"2017"},{"key":"S0219691323500662BIB027","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00633"},{"key":"S0219691323500662BIB028","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2018.00044"},{"key":"S0219691323500662BIB029","first-page":"6840","volume-title":"Advances in Neural Information Processing Systems","author":"Ho J.","year":"2020"},{"key":"S0219691323500662BIB030","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.620"},{"key":"S0219691323500662BIB032","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2016.10.018"},{"key":"S0219691323500662BIB033","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.100"},{"key":"S0219691323500662BIB034","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.396"},{"key":"S0219691323500662BIB035","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.59"},{"key":"S0219691323500662BIB036","volume-title":"Proc. IEEE Conf. Computer Vision Pattern Recognition Workshop","author":"Jiang J.","year":"2018"},{"key":"S0219691323500662BIB038","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.472"},{"key":"S0219691323500662BIB040","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2007.4409011"},{"key":"S0219691323500662BIB041","doi-asserted-by":"publisher","DOI":"10.3390\/app9224963"},{"key":"S0219691323500662BIB042","doi-asserted-by":"publisher","DOI":"10.1109\/CAI54212.2023.00061"},{"key":"S0219691323500662BIB044","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"S0219691323500662BIB045","first-page":"2003","volume-title":"Proc. IEEE Int. Conf. Computer Vision","author":"Lan T.","year":"2011"},{"key":"S0219691323500662BIB046","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"S0219691323500662BIB047","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01328"},{"key":"S0219691323500662BIB048","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58517-4_30"},{"key":"S0219691323500662BIB049","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6811"},{"key":"S0219691323500662BIB050","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_19"},{"key":"S0219691323500662BIB051","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-03243-2_63-1"},{"key":"S0219691323500662BIB052","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350978"},{"key":"S0219691323500662BIB054","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"S0219691323500662BIB056","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2021.02.001"},{"key":"S0219691323500662BIB057","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2023.103879"},{"key":"S0219691323500662BIB059","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN52387.2021.9533300"},{"key":"S0219691323500662BIB061","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475503"},{"key":"S0219691323500662BIB063","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00053"},{"key":"S0219691323500662BIB064","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_45"},{"key":"S0219691323500662BIB065","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00015"},{"key":"S0219691323500662BIB066","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.33"},{"key":"S0219691323500662BIB067","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.91"},{"key":"S0219691323500662BIB068","first-page":"1","volume-title":"Advances in Neural Information Processing Systems","author":"Ren S.","year":"2015"},{"key":"S0219691323500662BIB069","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2008.4587727"},{"key":"S0219691323500662BIB070","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.473"},{"key":"S0219691323500662BIB071","doi-asserted-by":"publisher","DOI":"10.5244\/C.30.58"},{"key":"S0219691323500662BIB072","volume-title":"Proc. Int. Conf. Learning Representation","author":"Simonyan K.","year":"2015"},{"key":"S0219691323500662BIB074","first-page":"1","volume-title":"Proc. Asian Conf. Computer Vision","author":"Singh G.","year":"2018"},{"key":"S0219691323500662BIB075","first-page":"3657","volume-title":"Proc. IEEE Int. Conf. Computer Vision","author":"Singh G.","year":"2016"},{"key":"S0219691323500662BIB076","doi-asserted-by":"publisher","DOI":"10.5244\/C.25.65"},{"key":"S0219691323500662BIB077","volume-title":"Advances in Neural Information Processing Systems","author":"Song Y.","year":"2019"},{"key":"S0219691323500662BIB078","volume-title":"Proc. Int. Conf. Learning and Representation","author":"Song J.","year":"2021"},{"key":"S0219691323500662BIB079","volume-title":"Proc. Int. Conf. Learning Representation","author":"Song Y.","year":"2021"},{"key":"S0219691323500662BIB080","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01226"},{"key":"S0219691323500662BIB082","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01229"},{"key":"S0219691323500662BIB084","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_20"},{"key":"S0219691323500662BIB085","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58555-6_5"},{"key":"S0219691323500662BIB086","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.341"},{"key":"S0219691323500662BIB087","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00972"},{"key":"S0219691323500662BIB089","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2021.103187"},{"key":"S0219691323500662BIB090","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093617"},{"key":"S0219691323500662BIB091","first-page":"1","volume":"1","author":"Vahdani E.","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"S0219691323500662BIB092","first-page":"9662","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani A.","year":"2017"},{"key":"S0219691323500662BIB093","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.296"},{"key":"S0219691323500662BIB094","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"S0219691323500662BIB095","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.362"},{"key":"S0219691323500662BIB096","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00037"},{"key":"S0219691323500662BIB097","first-page":"440","volume-title":"Proc. European Conf. Computer Vision","author":"Wu J.","year":"2020"},{"key":"S0219691323500662BIB098","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"S0219691323500662BIB099","volume-title":"Proc. European Conf. Computer Vision","author":"Xie S.","year":"2018"},{"key":"S0219691323500662BIB100","first-page":"568","volume-title":"Proc. IEEE Int. Conf. Multimedia Expo","author":"Xu Q."},{"key":"S0219691323500662BIB102","doi-asserted-by":"publisher","DOI":"10.5244\/C.31.95"},{"key":"S0219691323500662BIB103","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00035"},{"key":"S0219691323500662BIB104","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-017-1013-y"},{"key":"S0219691323500662BIB105","first-page":"2403","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognition","author":"Yu F.","year":"2017"},{"key":"S0219691323500662BIB106","first-page":"2442","volume-title":"Proc. IEEE Conf. Computer Vision Pattern Recognition","author":"Yuan J.","year":"2009"},{"key":"S0219691323500662BIB107","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2011.38"},{"key":"S0219691323500662BIB108","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01021"},{"key":"S0219691323500662BIB109","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547980"},{"key":"S0219691323500662BIB110","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01017"},{"key":"S0219691323500662BIB111","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01323"},{"key":"S0219691323500662BIB113","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_26"},{"key":"S0219691323500662BIB114","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.316"}],"container-title":["International Journal of Wavelets, Multiresolution and Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S0219691323500662","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T04:29:55Z","timestamp":1721708995000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/10.1142\/S0219691323500662"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,9]]},"references-count":92,"journal-issue":{"issue":"04","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["10.1142\/S0219691323500662"],"URL":"https:\/\/doi.org\/10.1142\/s0219691323500662","relation":{},"ISSN":["0219-6913","1793-690X"],"issn-type":[{"value":"0219-6913","type":"print"},{"value":"1793-690X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,9]]},"article-number":"2350066"}}