{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,3]],"date-time":"2026-07-03T18:32:01Z","timestamp":1783103521321,"version":"3.54.6"},"reference-count":48,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2025,1,19]],"date-time":"2025-01-19T00:00:00Z","timestamp":1737244800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Integrated Computer-Aided Engineering"],"published-print":{"date-parts":[[2025,11]]},"abstract":"<jats:p>Detecting actions as they occur is essential for applications like video surveillance, autonomous driving, and human-robot interaction. Known as online action detection, this task requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state-of-the-art, yet the potential of recent advancements in computer vision, particularly vision-language models (VLMs), remains largely untapped for this problem, partly due to high computational costs. In this paper, we introduce TOAD: A Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD leverages CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. 
Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.<\/jats:p>","DOI":"10.1177\/10692509241308069","type":"journal-article","created":{"date-parts":[[2025,10,20]],"date-time":"2025-10-20T11:14:37Z","timestamp":1760958877000},"page":"415-423","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":1,"title":["Text-driven online action detection"],"prefix":"10.1177","volume":"32","author":[{"given":"Manuel","family":"Benavent-Lledo","sequence":"first","affiliation":[{"name":"Department of Computer Technology, University of Alicante, Alicante, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"David","family":"Mulero-P\u00e9rez","sequence":"additional","affiliation":[{"name":"Department of Computer Technology, University of Alicante, Alicante, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"David","family":"Ortiz-Perez","sequence":"additional","affiliation":[{"name":"Department of Computer Technology, University of Alicante, Alicante, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jose","family":"Garcia-Rodriguez","sequence":"additional","affiliation":[{"name":"Department of Computer Technology, University of Alicante, Alicante, Spain"},{"name":"ValgrAI - Valencian Graduate School and Research Network of Artificial Intelligence, Valencia, Spain"},{"name":"Institute of Informatics Research, University of Alicante, Alicante, 
Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"179","published-online":{"date-parts":[[2025,1,19]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.dcan.2020.05.004"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2011.02.007"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.3233\/ICA-230706"},{"key":"e_1_3_3_5_2","doi-asserted-by":"crossref","unstructured":"Kim J Misu T Chen YT et\u00a0al. Grounding human-to-vehicle advice for self-driving vehicles. In: CVPR 2019.","DOI":"10.1109\/CVPR.2019.01084"},{"key":"e_1_3_3_6_2","doi-asserted-by":"crossref","unstructured":"Ramanishka V Chen YT et\u00a0al. Toward driving scene understanding: a dataset for learning driver behavior and causal reasoning. In: CVPR 2018.","DOI":"10.1109\/CVPR.2018.00803"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1111\/mice.12995"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.3233\/ICA-220694"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSEN.2022.3148431"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2017.05.027"},{"key":"e_1_3_3_11_2","doi-asserted-by":"crossref","unstructured":"De Geest R Gavves E Ghodrati A et\u00a0al. Online action detection. In: Computer vision\u2013ECCV 2016: 14th european conference amsterdam the netherlands October 11-14 2016 proceedings Part V 14 2016 pp.269\u2013284. Springer.","DOI":"10.1007\/978-3-319-46454-1_17"},{"key":"e_1_3_3_12_2","doi-asserted-by":"crossref","unstructured":"Eun H Moon J Park J et\u00a0al. Learning to discriminate information for online action detection. In: CVPR 2020.","DOI":"10.1109\/CVPR42600.2020.00089"},{"key":"e_1_3_3_13_2","doi-asserted-by":"crossref","unstructured":"Gao J Yang Z Nevatia R. Red: Reinforced encoder-decoder networks for action anticipation. 
arXiv preprint arXiv:1707.04818 2017.","DOI":"10.5244\/C.31.92"},{"key":"e_1_3_3_14_2","doi-asserted-by":"crossref","unstructured":"Xu M Gao M Chen YT et\u00a0al. Temporal recurrent networks for online action detection. In: ICCV 2019.","DOI":"10.1109\/ICCV.2019.00563"},{"key":"e_1_3_3_15_2","doi-asserted-by":"crossref","unstructured":"An J Kang H Han SH et\u00a0al. Miniroad: Minimal rnn framework for online action detection. In: ICCV 2023 pp.10341\u201310350.","DOI":"10.1109\/ICCV51070.2023.00949"},{"key":"e_1_3_3_16_2","unstructured":"Zhao WX Zhao K Li J et al. A survey of large language models. arXiv preprint\u00a0arXiv:2303.18223 2023."},{"key":"e_1_3_3_17_2","unstructured":"Dosovitskiy A Beyer L Kolesnikov A et al. An image is worth 16\u00d716 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2021."},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1142\/S0129065723500351"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1111\/mice.12954"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1111\/mice.13181"},{"key":"e_1_3_3_21_2","unstructured":"Arnab A Dehghani M Heigold G et al. arXiv preprint arXiv:2103.15691 2021."},{"key":"e_1_3_3_22_2","doi-asserted-by":"crossref","unstructured":"Piergiovanni A Kuo W Angelova A. Rethinking video vits: sparse video tubes for joint image and video learning. arXiv preprint arXiv:2212.03229 2022.","DOI":"10.1109\/CVPR52729.2023.00220"},{"key":"e_1_3_3_23_2","doi-asserted-by":"crossref","unstructured":"Wang X Zhang S Qing Z et\u00a0al. Oadtr: online action detection with transformers. In: ICCV 2021 pp.7565\u20137575.","DOI":"10.1109\/ICCV48922.2021.00747"},{"key":"e_1_3_3_24_2","doi-asserted-by":"crossref","unstructured":"Li R Yan L Peng Y et\u00a0al. Lighter transformer for online action detection. ICIGP \u201923 Association for Computing Machinery 2023 p.161\u2013167. 
ISBN 9781450398572.","DOI":"10.1145\/3582649.3582656"},{"key":"e_1_3_3_25_2","unstructured":"Xu M Xiong Y Chen H et\u00a0al. Long short-term transformer for online action detection. In: NeurIPS 2021."},{"key":"e_1_3_3_26_2","doi-asserted-by":"crossref","unstructured":"Zhao Y Kr\u00e4henb\u00fchl P. Real-time online video detection with temporal smoothing transformers. In: European conference on computer vision (ECCV) 2022.","DOI":"10.1007\/978-3-031-19830-4_28"},{"key":"e_1_3_3_27_2","unstructured":"Wang M Xing J Liu Y. Actionclip: a new paradigm for video action recognition. arXiv preprint arXiv:2109.08472 2021."},{"key":"e_1_3_3_28_2","doi-asserted-by":"crossref","unstructured":"Wu W Wang X Luo H et\u00a0al. Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition 2023.","DOI":"10.1109\/CVPR52729.2023.00640"},{"key":"e_1_3_3_29_2","unstructured":"Radford A Kim JW Hallacy C et al. Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR 2021 pp.8748\u20138763."},{"key":"e_1_3_3_30_2","doi-asserted-by":"crossref","unstructured":"Cheng F Wang X Lei J et al. Vindlu: a recipe for effective video-and-language pretraining. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 2023 pp.10739\u201310750.","DOI":"10.1109\/CVPR52729.2023.01034"},{"key":"e_1_3_3_31_2","doi-asserted-by":"crossref","unstructured":"Ju C Han T Zheng K et al. Prompting visual-language models for efficient video understanding. In: European Conference on Computer Vision. Cham: Springer Nature Switzerland 2022 pp.105\u2013124.","DOI":"10.1007\/978-3-031-19833-5_7"},{"key":"e_1_3_3_32_2","doi-asserted-by":"crossref","unstructured":"Papalampidi P Koppula S Pathak S et al. A simple recipe for contrastively pre-training video-first encoders beyond 16 frames. 
In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 2024 pp.14386\u201314397.","DOI":"10.1109\/CVPR52733.2024.01364"},{"key":"e_1_3_3_33_2","unstructured":"Li Z et al. A strong baseline for temporal video-text alignment. arXiv preprint arXiv:2312.14055 2023."},{"key":"e_1_3_3_34_2","doi-asserted-by":"crossref","unstructured":"Wu W Sun Z Ouyang W. Revisiting classifier: transferring vision-language models for video recognition. In: AAAI Conf. volume 37 2023 pp.2847\u20132855.","DOI":"10.1609\/aaai.v37i3.25386"},{"key":"e_1_3_3_35_2","doi-asserted-by":"crossref","unstructured":"Benavent-Lledo M Mulero-P\u00e9rez D Ortiz-Perez D et\u00a0al. Exploring text-driven approaches for online action detection. In: Ferr\u00e1ndez Vicente JM Val Calvo M and Adeli H (eds.) Bioinspired Systems for Translational Applications: From Robotics to Social Engineering. Cham: Springer Nature Switzerland 2024 pp.55\u201364. ISBN 978-3-031-61137-7.","DOI":"10.1007\/978-3-031-61137-7_6"},{"key":"e_1_3_3_36_2","unstructured":"Brown TB Mann B Ryder N et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 2020."},{"key":"e_1_3_3_37_2","unstructured":"Touvron H Martin L Stone K et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 2023."},{"key":"e_1_3_3_38_2","unstructured":"Devlin J Chang MW Lee K et al. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint\u00a0arXiv:1810.04805 2018."},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TETCI.2019.2892755"},{"key":"e_1_3_3_40_2","doi-asserted-by":"crossref","unstructured":"Xu H Ghosh G Huang PY et al. Videoclip: contrastive pre-training for zero-shot video-text understanding. 
arXiv preprint arXiv:2109.14084 2021.","DOI":"10.18653\/v1\/2021.emnlp-main.544"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.03.069"},{"key":"e_1_3_3_42_2","doi-asserted-by":"crossref","unstructured":"Yang L Han J Zhang D. Colar: Effective and efficient online action detection by consulting exemplars. In: CVPR 2022.","DOI":"10.1109\/CVPR52688.2022.00316"},{"key":"e_1_3_3_43_2","doi-asserted-by":"crossref","unstructured":"Gao M Zhou Y Xu R et\u00a0al. Woad: Weakly supervised online action detection in untrimmed videos. In: CVPR 2021 pp.1915\u20131923.","DOI":"10.1109\/CVPR46437.2021.00195"},{"key":"e_1_3_3_44_2","unstructured":"Jiang YG Liu J et\u00a0al. Thumos challenge: action recognition with a large number of classes 2014."},{"key":"e_1_3_3_45_2","unstructured":"Kay W Carreira J Simonyan K et\u00a0al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 2017."},{"key":"e_1_3_3_46_2","doi-asserted-by":"crossref","unstructured":"He K Zhang X Ren S et\u00a0al. Deep residual learning for image recognition. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition 2016 pp.770\u2013778.","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2022.3190448"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-019-04359-7"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2023.101945"}],"container-title":["Integrated Computer-Aided Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10692509241308069","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10692509241308069","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10692509241308069","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T09:14:59Z","timestamp":1777454099000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10692509241308069"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,19]]},"references-count":48,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,11]]}},"alternative-id":["10.1177\/10692509241308069"],"URL":"https:\/\/doi.org\/10.1177\/10692509241308069","relation":{},"ISSN":["1069-2509","1875-8835"],"issn-type":[{"value":"1069-2509","type":"print"},{"value":"1875-8835","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,19]]}}}