{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T13:02:32Z","timestamp":1776085352087,"version":"3.50.1"},"reference-count":81,"publisher":"Springer Science and Business Media LLC","issue":"11","license":[{"start":{"date-parts":[[2025,8,11]],"date-time":"2025-08-11T00:00:00Z","timestamp":1754870400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,11]],"date-time":"2025-08-11T00:00:00Z","timestamp":1754870400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100002860","name":"China Sponsorship Council","doi-asserted-by":"publisher","award":["202109210007"],"award-info":[{"award-number":["202109210007"]}],"id":[{"id":"10.13039\/501100002860","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,11]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Determining when people are struggling allows for a finer-grained understanding of actions that complements conventional action classification and error detection. Struggle detection, as defined in this paper, is a distinct and important task that can be identified without explicit step or activity knowledge. We introduce the first struggle dataset with three real-world problem-solving activities that are labelled by both expert and crowd-source annotators. Video segments were scored w.r.t. their level of struggle using a forced choice 4-point scale. This dataset contains 5.1 hours of video from 73 participants. We conducted a series of experiments to identify the most suitable modelling approaches for struggle determination. Additionally, we compared various deep learning models, establishing baseline results for struggle classification, struggle regression, and struggle label distribution learning. Our results indicate that struggle detection in video can achieve up to\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$88.24\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mrow>\n                            <mml:mn>88.24<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    accuracy in binary classification, while detecting the level of struggle in a four-way classification setting performs lower, with an overall accuracy of\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$52.45\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mrow>\n                            <mml:mn>52.45<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    . 
Our work is motivated by the goal of a more comprehensive understanding of action in video, and by the potential to improve assistive systems that analyse struggle and can better support users during manual activities.<\/jats:p>","DOI":"10.1007\/s11263-025-02559-4","type":"journal-article","created":{"date-parts":[[2025,8,11]],"date-time":"2025-08-11T15:14:48Z","timestamp":1754925288000},"page":"7817-7854","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos"],"prefix":"10.1007","volume":"133","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8115-8795","authenticated-orcid":false,"given":"Shijia","family":"Feng","sequence":"first","affiliation":[]},{"given":"Michael","family":"Wray","sequence":"additional","affiliation":[]},{"given":"Brian","family":"Sullivan","sequence":"additional","affiliation":[]},{"given":"Youngkyoon","family":"Jang","sequence":"additional","affiliation":[]},{"given":"Casimir","family":"Ludwig","sequence":"additional","affiliation":[]},{"given":"Iain","family":"Gilchrist","sequence":"additional","affiliation":[]},{"given":"Walterio","family":"Mayol-Cuevas","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,8,11]]},"reference":[{"key":"2559_CR1","doi-asserted-by":"crossref","unstructured":"Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., & Schmid, C. (2021). Vivit: A video vision transformer. arXiv. https:\/\/arxiv.org\/abs\/2103.15691","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"2559_CR2","unstructured":"Athiwaratkun, B., Finzi, M., Izmailov, P., & Wilson, A.G. (2019). There are many consistent explanations of unlabeled data: Why you should average."},{"key":"2559_CR3","unstructured":"Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? CoRR, abs\/2102.05095, https:\/\/arxiv.org\/abs\/2102.05095"},{"key":"2559_CR4","doi-asserted-by":"crossref","unstructured":"Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. proceedings of the ieee conference on computer vision and pattern recognition (pp. 6299\u20136308).","DOI":"10.1109\/CVPR.2017.502"},{"key":"2559_CR5","doi-asserted-by":"crossref","unstructured":"Chattopadhyay, A., Sarkar, A., Howlader, P., & Balasubramanian, V.N. (2017). Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. CoRR, abs\/1710.11063, http:\/\/arxiv.org\/abs\/1710.11063"},{"key":"2559_CR6","doi-asserted-by":"crossref","unstructured":"Chen, S., Sun, P., Xie, E., Ge, C., Wu, J., Ma, L., & Luo, P. (2021). Watch only once: An end-to-end video action detection framework. Proceedings of the ieee\/cvf international conference on computer vision (iccv) (p.8178-8187).","DOI":"10.1109\/ICCV48922.2021.00807"},{"key":"2559_CR7","doi-asserted-by":"crossref","unstructured":"Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., & Wray, M. (2018). Scaling egocentric vision: The epic-kitchens dataset. European conference on computer vision (eccv).","DOI":"10.1007\/978-3-030-01225-0_44"},{"key":"2559_CR8","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv. 
https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"2559_CR9","volume-title":"Who\u2019s Better? Who\u2019s Best?","author":"H Doughty","year":"2018","unstructured":"Doughty, H., Damen, D., & Mayol-Cuevas, W. (2018). Who\u2019s Better? Who\u2019s Best? Pairwise Deep Ranking for Skill Determination. The ieee conference on computer vision and pattern recognition (cvpr)."},{"key":"2559_CR10","doi-asserted-by":"crossref","unstructured":"Doughty, H., Mayol-Cuevas, W., & Damen, D. (2019). The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos. The ieee conference on computer vision and pattern recognition (cvpr).","DOI":"10.1109\/CVPR.2019.00805"},{"key":"2559_CR11","doi-asserted-by":"crossref","unstructured":"Duka, E., Kukleva, A., & Schiele, B. (2022). Leveraging self-supervised training for unintentional action recognition. arXiv. https:\/\/arxiv.org\/abs\/2209.11870","DOI":"10.1007\/978-3-031-25069-9_5"},{"key":"2559_CR12","doi-asserted-by":"crossref","unstructured":"Caba\u00a0Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J.C. (2015). Activitynet: A large-scale video benchmark for human activity understanding. Proceedings of the ieee conference on computer vision and pattern recognition (pp. 961\u2013970).","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"2559_CR13","unstructured":"Fan, H., Li, Y., Xiong, B., Lo, W.Y., & Feichtenhofer, C. (2020). Pyslowfast. (https:\/\/github.com\/facebookresearch\/slowfast. Last accessed on 24-05-2023)"},{"key":"2559_CR14","doi-asserted-by":"crossref","unstructured":"Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale vision transformers. Proceedings of the ieee\/cvf international conference on computer vision (iccv) (p.6824-6835).","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"2559_CR15","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C. (2020). X3D: expanding architectures for efficient video recognition. CoRR, abs\/2004.04730, https:\/\/arxiv.org\/abs\/2004.04730","DOI":"10.1109\/CVPR42600.2020.00028"},{"key":"2559_CR16","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. Proceedings of the ieee\/cvf international conference on computer vision (pp. 6202\u20136211).","DOI":"10.1109\/ICCV.2019.00630"},{"key":"2559_CR17","doi-asserted-by":"crossref","unstructured":"Flaborea, A., di Melendugno, G.M.D., Plini, L., Scofano, L., De\u00a0Matteis, E., Furnari, A., & Galasso, F. (2024). Prego: Online mistake detection in procedural egocentric videos. Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (cvpr) (p.18483-18492).","DOI":"10.1109\/CVPR52733.2024.01749"},{"key":"2559_CR18","doi-asserted-by":"crossref","unstructured":"Gabeur, V., Sun, C., Alahari, K., & Schmid, C. (2020). Multi-modal transformer for video retrieval. Computer vision\u2013eccv 2020: 16th european conference, glasgow, uk, august 23\u201328, 2020, proceedings, part iv 16 (pp. 214\u2013229).","DOI":"10.1007\/978-3-030-58548-8_13"},{"key":"2559_CR19","unstructured":"Gao, Y., Vedula, S.S., Reiley, C.E., Ahmidi, N., Varadarajan, B., Lin, H.C., & others (2014). Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. Miccai workshop: M2cai (Vol.\u00a03, p.3)."},{"key":"2559_CR20","doi-asserted-by":"crossref","unstructured":"Ghoddoosian, R., Dwivedi, I., Agarwal, N., & Dariush, B. (2023). 
Weakly-supervised action segmentation and unseen error detection in anomalous instructional videos. Proceedings of the ieee\/cvf international conference on computer vision (iccv) (p.10128-10138).","DOI":"10.1109\/ICCV51070.2023.00929"},{"key":"2559_CR21","doi-asserted-by":"crossref","unstructured":"Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., & Memisevic, R. (2017). The \"something something\" video database for learning and evaluating visual common sense. arXiv. https:\/\/arxiv.org\/abs\/1706.04261","DOI":"10.1109\/ICCV.2017.622"},{"key":"2559_CR22","doi-asserted-by":"crossref","unstructured":"Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., & Malik, J. (2022). Ego4d: Around the World in 3,000 Hours of Egocentric Video. Ieee\/cvf computer vision and pattern recognition (cvpr).","DOI":"10.1109\/CVPR52688.2022.01842"},{"key":"2559_CR23","doi-asserted-by":"crossref","unstructured":"Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., & Wray, M. (2024). Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. https:\/\/arxiv.org\/abs\/2311.18259","DOI":"10.1109\/CVPR52733.2024.01834"},{"key":"2559_CR24","doi-asserted-by":"crossref","unstructured":"Hipiny, I., Ujir, H., Alias, A.A., Shanat, M., & Ishak, M.K. (2023). Who danced better? ranked tiktok dance video dataset and pairwise action quality assessment method. International Journal of Advances in Intelligent Informatics, 9(1), 96-107.","DOI":"10.26555\/ijain.v9i1.919"},{"issue":"8","key":"2559_CR25","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735\u20131780.","journal-title":"Neural Computation"},{"key":"2559_CR26","doi-asserted-by":"crossref","unstructured":"Jang, Y., Sullivan, B., Ludwig, C., Gilchrist, I.D., Damen, D., & Mayol-Cuevas, W. (2019). Epic-tent: An egocentric video dataset for camping tent assembly. 2019 ieee\/cvf international conference on computer vision workshop (iccvw) (p.4461-4469). Los Alamitos, CA, USA: IEEE Computer Society.","DOI":"10.1109\/ICCVW.2019.00547"},{"key":"2559_CR27","doi-asserted-by":"crossref","unstructured":"Kalogeiton, V., Weinzaepfel, P., Ferrari, V., & Schmid, C. (2017). Action tubelet detector for spatio-temporal action localization. Proceedings of the ieee international conference on computer vision (pp. 4405\u20134413).","DOI":"10.1109\/ICCV.2017.472"},{"key":"2559_CR28","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., & Zisserman, A. (2017). The kinetics human action video dataset. arXiv. https:\/\/arxiv.org\/abs\/1705.06950"},{"key":"2559_CR29","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: a large video database for human motion recognition. Proceedings of the international conference on computer vision (iccv).","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"2559_CR30","doi-asserted-by":"crossref","unstructured":"Li, Y., Li, Y., & Vasconcelos, N. 
(2018). Resound: Towards action recognition without representation bias. Proceedings of the european conference on computer vision (eccv) (pp. 513\u2013528).","DOI":"10.1007\/978-3-030-01231-1_32"},{"key":"2559_CR31","doi-asserted-by":"crossref","unstructured":"Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., & Feichtenhofer, C. (2022). Mvitv2: Improved multiscale vision transformers for classification and detection.","DOI":"10.1109\/CVPR52688.2022.00476"},{"key":"2559_CR32","doi-asserted-by":"crossref","unstructured":"Li, Z., Huang, Y., Cai, M., & Sato, Y. (2019). Manipulation-skill assessment from videos with spatial attention network. Proceedings of the ieee\/cvf international conference on computer vision workshops.","DOI":"10.1109\/ICCVW.2019.00539"},{"key":"2559_CR33","doi-asserted-by":"crossref","unstructured":"Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., & Yuan, L. (2023). Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122.","DOI":"10.18653\/v1\/2024.emnlp-main.342"},{"key":"2559_CR34","unstructured":"Liu, H., Dai, Z., So, D.R., & Le, Q.V. (2021). Pay attention to mlps. CoRR, abs\/2105.08050, https:\/\/arxiv.org\/abs\/2105.08050"},{"key":"2559_CR35","doi-asserted-by":"crossref","unstructured":"Liu, W., Luo, W., Lian, D., & Gao, S. (2018). Future frame prediction for anomaly detection \u2013 a new baseline. 2018 ieee conference on computer vision and pattern recognition (cvpr).","DOI":"10.1109\/CVPR.2018.00684"},{"key":"2559_CR36","unstructured":"Liu, Y., Albanie, S., Nagrani, A., & Zisserman, A. (2019). Use what you have: Video retrieval using representations from collaborative experts. British machine vision conference."},{"key":"2559_CR37","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"2559_CR38","unstructured":"lucidrains (2021). gmlp - pytorch. (https:\/\/github.com\/lucidrains\/g-mlp-pytorch\/. Last accessed on 24-05-2023)"},{"key":"2559_CR39","unstructured":"lucidrains (2023). Vision transformer - pytorch. (https:\/\/github.com\/lucidrains\/vit-pytorch. Last accessed on 24-05-2023)"},{"key":"2559_CR40","doi-asserted-by":"crossref","unstructured":"Ma, F., Zhu, L., Yang, Y., Zha, S., Kundu, G., Feiszli, M., & Shou, Z. (2020). Sf-net: Single-frame supervision for temporal action localization. https:\/\/arxiv.org\/abs\/2003.06845","DOI":"10.1007\/978-3-030-58548-8_25"},{"key":"2559_CR41","doi-asserted-by":"crossref","unstructured":"Mahadevan, V., Li, W., Bhalodia, V., & Vasconcelos, N. (2010). Anomaly detection in crowded scenes. 2010 ieee computer society conference on computer vision and pattern recognition (pp. 1975\u20131981).","DOI":"10.1109\/CVPR.2010.5539872"},{"key":"2559_CR42","doi-asserted-by":"crossref","unstructured":"Moltisanti, D., Wray, M., Mayol-Cuevas, W., & Damen, D. (2017). Trespassing the boundaries: Labeling temporal bounds for object interactions in egocentric video. Proceedings of the ieee international conference on computer vision (pp. 2886\u20132894).","DOI":"10.1109\/ICCV.2017.314"},{"key":"2559_CR43","doi-asserted-by":"crossref","unstructured":"Neimark, D., Bar, O., Zohar, M., & Asselmann, D. (2021). Video transformer network. 
Proceedings of the ieee\/cvf international conference on computer vision (iccv) workshops (p.3163-3172).","DOI":"10.1109\/ICCVW54120.2021.00355"},{"issue":"1","key":"2559_CR44","doi-asserted-by":"publisher","first-page":"213","DOI":"10.1146\/annurev.ps.42.020191.001241","volume":"42","author":"K Newell","year":"1991","unstructured":"Newell, K. (1991). Motor skill acquisition. Annual review of psychology, 42(1), 213\u2013237.","journal-title":"Annual review of psychology"},{"issue":"6","key":"2559_CR45","doi-asserted-by":"publisher","first-page":"1039","DOI":"10.1007\/s11548-022-02581-8","volume":"17","author":"BB O\u011ful","year":"2022","unstructured":"O\u011ful, B. B., Gilgien, M., & \u00d6zdemir, S. (2022). Ranking surgical skills using an attention-enhanced siamese network with piecewise aggregated kinematic data. International Journal of Computer Assisted Radiology and Surgery, 17(6), 1039\u20131048. https:\/\/doi.org\/10.1007\/s11548-022-02581-8","journal-title":"International Journal of Computer Assisted Radiology and Surgery"},{"key":"2559_CR46","doi-asserted-by":"crossref","unstructured":"O\u011ful, B.B., Gilgien, M.F., & \u015eahin, P.D. (2019). Ranking robot-assisted surgery skills using kinematic sensors. I.\u00a0Chatzigiannakis, B.\u00a0De\u00a0Ruyter, and I.\u00a0Mavrommati (Eds.), Ambient intelligence (pp. 330\u2013336). Cham: Springer International Publishing.","DOI":"10.1007\/978-3-030-34255-5_24"},{"key":"2559_CR47","doi-asserted-by":"crossref","unstructured":"Parmar, P., & Morris, B. (2019). Action quality assessment across multiple actions. 2019 ieee winter conference on applications of computer vision (wacv) (pp. 1468\u20131476).","DOI":"10.1109\/WACV.2019.00161"},{"key":"2559_CR48","doi-asserted-by":"crossref","unstructured":"Parmar, P., & Morris, B.T. (2016). Learning to score olympic events. https:\/\/arxiv.org\/abs\/1611.05125","DOI":"10.1109\/CVPRW.2017.16"},{"key":"2559_CR49","doi-asserted-by":"crossref","unstructured":"Parmar, P., & Tran\u00a0Morris, B. (2019). What and how well you performed? a multitask learning approach to action quality assessment. Proceedings of the ieee conference on computer vision and pattern recognition (pp. 304\u2013313).","DOI":"10.1109\/CVPR.2019.00039"},{"key":"2559_CR50","doi-asserted-by":"crossref","unstructured":"Pirsiavash, H., Vondrick, C., & Torralba, A. (2014). Assessing the quality of actions. European conference on computer vision (pp. 556\u2013571).","DOI":"10.1007\/978-3-319-10599-4_36"},{"key":"2559_CR51","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2023.103764","author":"F Ragusa","year":"2023","unstructured":"Ragusa, F., Furnari, A., & Farinella, G. M. (2023). Meccano: A multimodal egocentric dataset for humans behavior understanding in the industrial-like domain. Computer Vision and Image Understanding (CVIU). https:\/\/doi.org\/10.1016\/j.cviu.2023.103764. https:\/\/iplab.dmi.unict.it\/MECCANO\/.","journal-title":"Computer Vision and Image Understanding (CVIU)"},{"key":"2559_CR52","doi-asserted-by":"crossref","unstructured":"Ragusa, F., Furnari, A., Livatino, S., & Farinella, G.M. (2021). The meccano dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. 
Proceedings of the ieee\/cvf winter conference on applications of computer vision (wacv) (p.1569-1578).","DOI":"10.1109\/WACV48630.2021.00161"},{"issue":"3","key":"2559_CR53","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3), 211\u2013252. https:\/\/doi.org\/10.1007\/s11263-015-0816-y","journal-title":"International Journal of Computer Vision (IJCV)"},{"key":"2559_CR54","doi-asserted-by":"crossref","unstructured":"Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., & Yao, A. (2022). Assembly101: A large-scale multi-view video dataset for understanding procedural activities. https:\/\/arxiv.org\/abs\/2203.14712","DOI":"10.1109\/CVPR52688.2022.02042"},{"key":"2559_CR55","doi-asserted-by":"crossref","unstructured":"Shao, D., Zhao, Y., Dai, B., & Lin, D. (2020). Finegym: A hierarchical video dataset for fine-grained action understanding. Ieee conference on computer vision and pattern recognition (cvpr).","DOI":"10.1109\/CVPR42600.2020.00269"},{"key":"2559_CR56","unstructured":"Song, Y., Byrne, E., Nagarajan, T., Wang, H., Martin, M., & Torresani, L. (2023). Ego4d goal-step: Toward hierarchical understanding of procedural activities. A.\u00a0Oh, T.\u00a0Naumann, A.\u00a0Globerson, K.\u00a0Saenko, M.\u00a0Hardt, and S.\u00a0Levine (Eds.), Advances in neural information processing systems (Vol.\u00a036, pp. 38863\u201338886). Curran Associates, Inc."},{"key":"2559_CR57","unstructured":"Soomro, K., Zamir, A.R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402."},{"issue":"3","key":"2559_CR58","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1167\/jov.21.3.13","volume":"21","author":"B Sullivan","year":"2021","unstructured":"Sullivan, B., Ludwig, C. J., Damen, D., Mayol-Cuevas, W., & Gilchrist, I. D. (2021). Look-ahead fixations during visuomotor behavior: Evidence from assembling a camping tent. Journal of vision, 21(3), 13\u201313.","journal-title":"Journal of vision"},{"key":"2559_CR59","doi-asserted-by":"crossref","unstructured":"Sultani, W., Chen, C., & Shah, M. (2018a). Real-world anomaly detection in surveillance videos. Proceedings of the ieee conference on computer vision and pattern recognition (pp. 6479\u20136488).","DOI":"10.1109\/CVPR.2018.00678"},{"key":"2559_CR60","doi-asserted-by":"crossref","unstructured":"Sultani, W., Chen, C., & Shah, M. (2018b). Real-world anomaly detection in surveillance videos. Proceedings of the ieee conference on computer vision and pattern recognition (cvpr).","DOI":"10.1109\/CVPR.2018.00678"},{"key":"2559_CR61","doi-asserted-by":"crossref","unstructured":"Tang, Y., Ni, Z., Zhou, J., Zhang, D., Lu, J., Wu, Y., & Zhou, J. (2020). Uncertainty-aware score distribution learning for action quality assessment. Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (pp. 9839\u20139848).","DOI":"10.1109\/CVPR42600.2020.00986"},{"key":"2559_CR62","unstructured":"The National Archives. (2023). Non Commercial Government Licence \u2014 nationalarchives.gov.uk. (https:\/\/www.nationalarchives.gov.uk\/doc\/non-commercial-government-licence\/version\/2\/. 
Last accessed on 06-Jun-2023)"},{"key":"2559_CR63","doi-asserted-by":"crossref","unstructured":"Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., & Carneiro, G. (2021). Weakly-supervised video anomaly detection with robust temporal feature magnitude learning.","DOI":"10.1109\/ICCV48922.2021.00493"},{"key":"2559_CR64","unstructured":"Tolstikhin, I.O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., & Dosovitskiy, A. (2021). Mlp-mixer: An all-mlp architecture for vision. CoRR, abs\/2105.01601. https:\/\/arxiv.org\/abs\/2105.01601"},{"key":"2559_CR65","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the ieee international conference on computer vision (pp. 4489\u20134497).","DOI":"10.1109\/ICCV.2015.510"},{"key":"2559_CR66","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., & Feiszli, M. (2019). Video classification with channel-separated convolutional networks. Proceedings of the ieee\/cvf international conference on computer vision (pp. 5552\u20135561).","DOI":"10.1109\/ICCV.2019.00565"},{"key":"2559_CR67","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. Proceedings of the ieee conference on computer vision and pattern recognition (pp. 6450\u20136459).","DOI":"10.1109\/CVPR.2018.00675"},{"key":"2559_CR68","doi-asserted-by":"crossref","unstructured":"Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Gool, L.V. (2016). Temporal segment networks: Towards good practices for deep action recognition. European conference on computer vision (pp. 20\u201336).","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"2559_CR69","doi-asserted-by":"crossref","unstructured":"Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., & Pollefeys, M. (2023). Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. Proceedings of the ieee\/cvf international conference on computer vision (iccv) (p.20270-20281).","DOI":"10.1109\/ICCV51070.2023.01854"},{"issue":"5765","key":"2559_CR70","doi-asserted-by":"publisher","first-page":"1301","DOI":"10.1126\/science.1121448","volume":"311","author":"F Warneken","year":"2006","unstructured":"Warneken, F., & Tomasello, M. (2006). Altruistic helping in human infants and young chimpanzees. Science, 311(5765), 1301\u20131303.","journal-title":"Science"},{"key":"2559_CR71","unstructured":"Warneken, F., & Tomasello, M. (2006b). Helping in infants and chimpanzees. (https:\/\/sites.lsa.umich.edu\/warneken\/study-videos\/. Last accessed on 01\/10\/2024)"},{"key":"2559_CR72","doi-asserted-by":"crossref","unstructured":"Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. Proceedings of the european conference on computer vision (eccv) (pp. 305\u2013321).","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"2559_CR73","doi-asserted-by":"crossref","unstructured":"Xu, J., Rao, Y., Yu, X., Chen, G., Zhou, J., & Lu, J. (2022). Finediving: A fine-grained dataset for procedure-aware action quality assessment. https:\/\/arxiv.org\/abs\/2204.03646","DOI":"10.1109\/CVPR52688.2022.00296"},{"key":"2559_CR74","unstructured":"yjxiong, & Line290. (2019). Tsn-pytorch. (https:\/\/github.com\/yjxiong\/tsn-pytorch. 
Last accessed on 24-05-2023)"},{"key":"2559_CR75","doi-asserted-by":"crossref","unstructured":"Zatsarynna, O., Farha, Y.A., & Gall, J. (2022). Self-supervised learning for unintentional action prediction.","DOI":"10.1007\/978-3-031-16788-1_26"},{"key":"2559_CR76","unstructured":"Zhang, B., Chen, J., Xu, Y., Zhang, H., Yang, X., & Geng, X. (2021). Auto-encoding score distribution regression for action quality assessment. arXiv preprint arXiv:2111.11029"},{"key":"2559_CR77","doi-asserted-by":"crossref","unstructured":"Zhang, C., Gupta, A., & Zisserman, A. (2021). Temporal query networks for fine-grained video understanding. Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (cvpr) (p.4486-4496).","DOI":"10.1109\/CVPR46437.2021.00446"},{"key":"2559_CR78","doi-asserted-by":"crossref","unstructured":"Zhao, J., Zhang, Y., Li, X., Chen, H., Shuai, B., Xu, M., & Tighe, J. (2022). Tuber: Tubelet transformer for video action detection. Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (cvpr) (p.13598-13607).","DOI":"10.1109\/CVPR52688.2022.01323"},{"key":"2559_CR79","unstructured":"Zhou, C., & Huang, Y. (2022). Uncertainty-driven action quality assessment. https:\/\/arxiv.org\/abs\/2207.14513"},{"key":"2559_CR80","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Bao, W., & Yu, Q. (2022). Towards open set video anomaly detection. https:\/\/arxiv.org\/abs\/2208.11113","DOI":"10.1007\/978-3-031-19830-4_23"},{"key":"2559_CR81","unstructured":"Zhu, Y., & Newsam, S.D. (2019). Motion-aware feature for improved video anomaly detection. CoRR, abs\/1907.10211, http:\/\/arxiv.org\/abs\/1907.10211"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02559-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02559-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02559-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T06:30:04Z","timestamp":1762929004000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02559-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,11]]},"references-count":81,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,11]]}},"alternative-id":["2559"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02559-4","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,11]]},"assertion":[{"value":"16 February 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 July 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 August 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}