{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T16:18:21Z","timestamp":1771604301898,"version":"3.50.1"},"reference-count":75,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,12,28]],"date-time":"2025-12-28T00:00:00Z","timestamp":1766880000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,12,28]],"date-time":"2025-12-28T00:00:00Z","timestamp":1766880000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2026,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    This work proposes UIL-AQA for long-term Action Quality Assessment (AQA), designed to be clip-level interpretable and uncertainty-aware. AQA evaluates the execution quality of actions in videos. However, the complexity and diversity of actions, especially in long videos, increase the difficulty of AQA. Existing AQA methods solve this by limiting themselves generally to short-term videos. These approaches lack detailed semantic interpretation for individual clips and fail to account for the impact of human biases and subjectivity in the data during model training. Moreover, although query-based Transformer networks demonstrate strong capabilities in long-term modelling, their interpretability in AQA remains insufficient. This is primarily due to a phenomenon we identified, termed\n                    <jats:italic>Temporal Skipping<\/jats:italic>\n                    , where the model skips self-attention layers to prevent output degradation. 
We introduce an Attention Loss function and a Query Initialization Module to enhance the modelling capability of query-based Transformer networks. Additionally, we incorporate a Gaussian Noise Injection Module to simulate biases in human scoring, mitigating the influence of uncertainty and improving model reliability. Furthermore, we propose a Difficulty-Quality Regression Module, which decomposes each clip\u2019s action score into independent difficulty and quality components, enabling a more fine-grained and interpretable evaluation. Our extensive quantitative and qualitative analysis demonstrates that our proposed method achieves state-of-the-art performance on three long-term real-world AQA datasets. Our code is available at:\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/dx199771\/Interpretability-AQA\" ext-link-type=\"uri\">https:\/\/github.com\/dx199771\/Interpretability-AQA<\/jats:ext-link>\n                  <\/jats:p>","DOI":"10.1007\/s11263-025-02638-6","type":"journal-article","created":{"date-parts":[[2025,12,28]],"date-time":"2025-12-28T02:32:30Z","timestamp":1766889150000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["UIL-AQA: Uncertainty-Aware Clip-Level Interpretable Action Quality 
Assessment"],"prefix":"10.1007","volume":"134","author":[{"given":"Xu","family":"Dong","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xinran","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wanqing","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anthony","family":"Adeyemi-Ejeye","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrew","family":"Gilbert","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,12,28]]},"reference":[{"key":"2638_CR1","doi-asserted-by":"crossref","unstructured":"Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lu\u010di\u0107, M., & Schmid, C. (2021). Vivit: A video vision transformer. In Proceedings of the IEEE\/CVF international conference on computer vision (ICCV), pages 6836\u20136846.","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"2638_CR2","doi-asserted-by":"crossref","unstructured":"Ashutosh, K., Nagarajan, T., Pavlakos, G., Kitani, K., & Grauman, K. (2025). Expertaf: Expert actionable feedback from video.","DOI":"10.1109\/CVPR52734.2025.01268"},{"key":"2638_CR3","doi-asserted-by":"crossref","unstructured":"Bai, Y., Zhou, D., Zhang, S., Wang, J., Ding, E., Guan, Y., Long, Y., & Wang, J. (2022). Action quality assessment with temporal parsing transformer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 422\u2013438.","DOI":"10.1007\/978-3-031-19772-7_25"},{"key":"2638_CR4","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. 
In Proceedings of the European Conference on Computer Vision (ECCV), pages 213\u2013229.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"2638_CR5","doi-asserted-by":"crossref","unstructured":"Carreira, J. & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724\u20134733.","DOI":"10.1109\/CVPR.2017.502"},{"key":"2638_CR6","doi-asserted-by":"crossref","unstructured":"Chen, Z., Sun, W., Tian, Y., Jia, J., Zhang, Z., Wang, J., Huang, R., Min, X., Zhai, G., & Zhang, W. (2024). Gaia: Rethinking action quality assessment for ai-generated videos. In Advances in Neural Information Processing Systems (NeurIPS).","DOI":"10.52202\/079017-1267"},{"key":"2638_CR7","unstructured":"Dong, X., Liu, X., Li, W., Adeyemi-Ejeye, A., & Gilbert, A. (2024). Interpretable long-term action quality assessment. In Proceedings of the British Machine Vision Conference (BMVC)."},{"key":"2638_CR8","doi-asserted-by":"crossref","unstructured":"Doughty, H., Damen, D., & Mayol-Cuevas, W. (2018). Who\u2019s better? who\u2019s best? pairwise deep ranking for skill determination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR.2018.00634"},{"key":"2638_CR9","doi-asserted-by":"crossref","unstructured":"Doughty, H., Mayol-Cuevas, W., & Damen, D. (2019). The pros and cons: Rank-aware temporal attention for skill determination in long videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR.2019.00805"},{"key":"2638_CR10","doi-asserted-by":"crossref","unstructured":"Du, Z., He, D., Wang, X., & Wang, Q. (2023). Learning semantics-guided representations for scoring figure skating. IEEE Transactions on Multimedia.","DOI":"10.1109\/TMM.2023.3328180"},{"key":"2638_CR11","doi-asserted-by":"crossref","unstructured":"Du, Z., He, D., Wang, X., & Wang, Q. 
(2024). Learning semantics-guided representations for scoring figure skating. In IEEE Transactions on Multimedia (TMM), pages 4987\u20134997.","DOI":"10.1109\/TMM.2023.3328180"},{"key":"2638_CR12","doi-asserted-by":"crossref","unstructured":"Farabi, S., Himel, H., Gazzali, F., Hasan, M.\u00a0B., Kabir, M.\u00a0H., & Farazi, M. (2022). Improving action quality assessment using weighted aggregation. In Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), pages 576\u2013587.","DOI":"10.1007\/978-3-031-04881-4_46"},{"key":"2638_CR13","doi-asserted-by":"publisher","first-page":"1217","DOI":"10.1007\/s11548-019-01995-1","volume":"14","author":"I Funke","year":"2019","unstructured":"Funke, I., Mees, S. T., Weitz, J., & Speidel, S. (2019). Video-based surgical skill assessment using 3d convolutional neural networks. Int. J. Comput. Assist. Radiol. Surg.,14, 1217\u20131225.","journal-title":"Int. J. Comput. Assist. Radiol. Surg."},{"key":"2638_CR14","doi-asserted-by":"crossref","unstructured":"Gao, J., Sun, C., Yang, Z., & Nevatia, R. (2017). Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision (ICCV), pages 5267\u20135275.","DOI":"10.1109\/ICCV.2017.563"},{"key":"2638_CR15","unstructured":"Gao, Y., Vedula, S.\u00a0S., Reiley, C.\u00a0E., Ahmidi, N., Varadarajan, B., Lin, H.\u00a0C., Tao, L., Zappella, L., B\u00e9jar, B., Yuh, D.\u00a0D., et\u00a0al. (2014). Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In MICCAI workshop, volume\u00a03, page\u00a03."},{"key":"2638_CR16","doi-asserted-by":"crossref","unstructured":"Han, R., Zhou, K., Atapour-Abarghouei, A., Liang, X., & Shum, H. P.\u00a0H. (2025). Finecausal: A causal-based framework for interpretable fine-grained action quality assessment. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).","DOI":"10.1109\/CVPRW67362.2025.00599"},{"key":"2638_CR17","doi-asserted-by":"crossref","unstructured":"Ji, Y., Ye, L., Huang, H., Mao, L., Zhou, Y., & Gao, L. (2023). Localization-assisted uncertainty score disentanglement network for action quality assessment. In Proceedings of ACM International Conference on Multimedia (ACM MM), pages 8590\u20138597.","DOI":"10.1145\/3581783.3613795"},{"key":"2638_CR18","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2024.120347","volume":"664","author":"X Ke","year":"2024","unstructured":"Ke, X., Xu, H., Lin, X., & Guo, W. (2024). Two-path target-aware contrastive regression for action quality assessment. Inf. Sci.,664, Article 120347.","journal-title":"Inf. Sci."},{"key":"2638_CR19","unstructured":"Kendall, A. & Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of Neural Information Processing Systems (NIPS)."},{"key":"2638_CR20","doi-asserted-by":"crossref","unstructured":"Kim, J., Lee, M., & Heo, J.-P. (2023). Self-feedback detr for temporal action detection. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), pages 10286\u201310296.","DOI":"10.1109\/ICCV51070.2023.00944"},{"key":"2638_CR21","doi-asserted-by":"crossref","unstructured":"Li, J., Xue, J., Cao, R., Du, X., Mo, S., Ran, K., & Zhang, Z. (2024). Finerehab: A multi-modality and multi-task dataset for rehabilitation analysis. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3184\u20133193.","DOI":"10.1109\/CVPRW63382.2024.00324"},{"key":"2638_CR22","first-page":"16191","volume":"53","author":"W Li","year":"2023","unstructured":"Li, W., Li, X., Li, B., Wang, S., Ma, L., Liu, Y., & Shi, Z. (2023). Label-reconstruction-based pseudo-subscore learning for action quality assessment in sporting events. Appl. 
Intell.,53, 16191\u201316207.","journal-title":"Appl. Intell."},{"key":"2638_CR23","doi-asserted-by":"publisher","first-page":"8658","DOI":"10.1609\/aaai.v33i01.33018658","volume":"33","author":"X Li","year":"2019","unstructured":"Li, X., Song, J., Gao, L., Liu, X., Huang, W., He, X., & Gan, C. (2019). Beyond rnns: Positional self-attention with co-attention for video question answering. In Proceedings of the AAAI conference on artificial intelligence,33, 8658\u20138665.","journal-title":"In Proceedings of the AAAI conference on artificial intelligence"},{"key":"2638_CR24","unstructured":"Li, Y.-M., Wang, A.-L., Lin, K.-Y., Tang, Y.-M., Zeng, L.-A., Hu, J.-F., & Zheng, W.-S. (2025). Techcoach: Towards technical-point-aware descriptive action coaching."},{"key":"2638_CR25","doi-asserted-by":"crossref","unstructured":"Li, Z., Huang, Y., Cai, M., & Sato, Y. (2019b). Manipulation-skill assessment from videos with spatial attention network. In Proceedings of the IEEE\/CVF international conference on computer vision workshops (ICCVW), pages 4385\u20134395.","DOI":"10.1109\/ICCVW.2019.00539"},{"key":"2638_CR26","doi-asserted-by":"publisher","first-page":"5427","DOI":"10.1109\/TIP.2022.3195321","volume":"31","author":"X Liu","year":"2022","unstructured":"Liu, X., Wang, Q., Hu, Y., Tang, X., Zhang, S., Bai, S., & Bai, X. (2022). End-to-end temporal action detection with transformer. IEEE Trans. Image Process.,31, 5427\u20135441.","journal-title":"IEEE Trans. Image Process."},{"key":"2638_CR27","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE\/CVF International Conference on Computer Vision (ICCV), pages 9992\u201310002.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"2638_CR28","doi-asserted-by":"crossref","unstructured":"Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022b). 
Video swin transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202\u20133211.","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"2638_CR29","doi-asserted-by":"crossref","unstructured":"Locatello, F., Bauer, S., Lucic, M., R\u00e4tsch, G., Gelly, S., Sch\u00f6lkopf, B., & Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations.","DOI":"10.1609\/aaai.v34i09.7120"},{"key":"2638_CR30","doi-asserted-by":"crossref","unstructured":"Moon, W., Hyun, S., Park, S., Park, D., & Heo, J.-P. (2023). Query-dependent video representation for moment retrieval and highlight detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23023\u201323033.","DOI":"10.1109\/CVPR52729.2023.02205"},{"key":"2638_CR31","doi-asserted-by":"crossref","unstructured":"Okamoto, L. & Parmar, P. (2024). Hierarchical neurosymbolic approach for comprehensive and explainable action quality assessment.","DOI":"10.1109\/CVPRW63382.2024.00326"},{"key":"2638_CR32","doi-asserted-by":"crossref","unstructured":"Pan, J.-H., Gao, J., & Zheng, W.-S. (2019). Action assessment by joint relation graphs. In Proceedings of the IEEE\/CVF international conference on computer vision (ICCV), pages 6331\u20136340.","DOI":"10.1109\/ICCV.2019.00643"},{"key":"2638_CR33","doi-asserted-by":"crossref","unstructured":"Parmar, P., Gharat, A., & Rhodin, H. (2022). Domain knowledge-informed self-supervised representations for workout form assessment. In Proceedings of the European Conference on Computer Vision (ECCV), pages 105\u2013123. Springer.","DOI":"10.1007\/978-3-031-19839-7_7"},{"key":"2638_CR34","doi-asserted-by":"crossref","unstructured":"Parmar, P. & Morris, B. (2017). Learning to score olympic events. 
In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 76\u201384.","DOI":"10.1109\/CVPRW.2017.16"},{"key":"2638_CR35","doi-asserted-by":"crossref","unstructured":"Parmar, P. & Morris, B. (2019). Action quality assessment across multiple actions. In Proceedings of the IEEE winter conference on applications of computer vision (WACV), pages 1468\u20131476.","DOI":"10.1109\/WACV.2019.00161"},{"key":"2638_CR36","doi-asserted-by":"crossref","unstructured":"Parmar, P. & Tran\u00a0Morris, B. (2019). What and how well you performed? a multitask learning approach to action quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 304\u2013313.","DOI":"10.1109\/CVPR.2019.00039"},{"key":"2638_CR37","doi-asserted-by":"crossref","unstructured":"Pirsiavash, H., Vondrick, C., & Torralba, A. (2014). Assessing the quality of actions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 556\u2013571.","DOI":"10.1007\/978-3-319-10599-4_36"},{"key":"2638_CR38","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., & Mei, T. (2017). Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5533\u20135541.","DOI":"10.1109\/ICCV.2017.590"},{"key":"2638_CR39","doi-asserted-by":"crossref","unstructured":"Ren, S., Yao, L., Li, S., Sun, X., & Hou, L. (2023). Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), volume abs\/2312.02051.","DOI":"10.1109\/CVPR52733.2024.01357"},{"key":"2638_CR40","doi-asserted-by":"crossref","unstructured":"Roditakis, K., Makris, A., & Argyros, A. (2021). Towards improved and interpretable action quality assessment with self-supervised alignment. 
In Proceedings of the PErvasive Technologies Related to Assistive Environments Conference, pages 507\u2013513.","DOI":"10.1145\/3453892.3461624"},{"key":"2638_CR41","unstructured":"Sharma, Y., Bettadapura, V., Hammerla, N., Mellor, S., McNaney, R., Olivier, P., Deshmukh, S., McCaskie, A., Essa, I., et\u00a0al. (2014). Video based assessment of osats using sequential motion textures. In Workshop on Modeling and Monitoring of Computer Assisted Interventions 2014. Springer."},{"key":"2638_CR42","doi-asserted-by":"crossref","unstructured":"Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Chi, H., Guo, X., Ye, T., Zhang, Y., Lu, Y., Hwang, J.-N., & Wang, G. (2024). Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR52733.2024.01725"},{"key":"2638_CR43","unstructured":"Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2016). Unsupervised learning of video representations using lstms. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume\u00a037, pages 843\u2013852."},{"key":"2638_CR44","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"2638_CR45","doi-asserted-by":"crossref","unstructured":"Tang, Y., Ni, Z., Zhou, J., Zhang, D., Lu, J., Wu, Y., & Zhou, J. (2020). Uncertainty-aware score distribution learning for action quality assessment. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pages 9839\u20139848.","DOI":"10.1109\/CVPR42600.2020.00986"},{"key":"2638_CR46","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (ICCV), pages 4489\u20134497.","DOI":"10.1109\/ICCV.2015.510"},{"key":"2638_CR47","first-page":"1","volume":"67","author":"V Venkataraman","year":"2015","unstructured":"Venkataraman, V., Vlachos, I., & Turaga, P. K. (2015). Dynamical regularity for action analysis. In Proceedings of the British Machine Vision Conference (BMVC),67, 1\u201312.","journal-title":"In Proceedings of the British Machine Vision Conference (BMVC)"},{"key":"2638_CR48","doi-asserted-by":"crossref","unstructured":"Wang, S., Yang, D., Zhai, P., Chen, C., & Zhang, L. (2021a). Tsa-net: Tube self-attention network for action quality assessment. In Proceedings of the 29th ACM international conference on multimedia (ACM MM), pages 4902\u20134910.","DOI":"10.1145\/3474085.3475438"},{"key":"2638_CR49","doi-asserted-by":"crossref","unstructured":"Wang, T., Wang, Y., & Li, M. (2020). Towards accurate and interpretable surgical skill assessment: A video-based method incorporating recognized surgical gestures and skill levels. In Proceedings of Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 668\u2013678. Springer.","DOI":"10.1007\/978-3-030-59716-0_64"},{"key":"2638_CR50","doi-asserted-by":"crossref","unstructured":"Wang, T., Zhang, R., Lu, Z., Zheng, F., Cheng, R., & Luo, P. (2021b). End-to-end dense video captioning with parallel decoding. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), pages 6847\u20136857.","DOI":"10.1109\/ICCV48922.2021.00677"},{"key":"2638_CR51","doi-asserted-by":"crossref","unstructured":"Wang, T., Zhang, R., Lu, Z., Zheng, F., Cheng, R., & Luo, P. (2021c). End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), pages 6847\u20136857.","DOI":"10.1109\/ICCV48922.2021.00677"},{"key":"2638_CR52","doi-asserted-by":"crossref","unstructured":"Wnuk, K. & Soatto, S. (2010). Analyzing diving: A dataset for judging action quality. In Asian conference on computer vision (ACCV), pages 266\u2013276. Springer.","DOI":"10.1007\/978-3-642-22822-3_27"},{"key":"2638_CR53","doi-asserted-by":"crossref","unstructured":"Wu, C.-Y., Li, Y., Mangalam, K., Fan, H., Xiong, B., Malik, J., & Feichtenhofer, C. (2022). Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13587\u201313597.","DOI":"10.1109\/CVPR52688.2022.01322"},{"key":"2638_CR54","doi-asserted-by":"crossref","unstructured":"Xia, J., Zhuge, M., Geng, T., Fan, S., Wei, Y., He, Z., & Zheng, F. (2023). Skating-mixer: Long-term sport audio-visual modeling with mlps. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 5140\u20135148.","DOI":"10.1609\/aaai.v37i3.25392"},{"key":"2638_CR55","doi-asserted-by":"crossref","unstructured":"Xiang, X., Tian, Y., Reiter, A., Hager, G.\u00a0D., & Tran, T.\u00a0D. (2018). S3d: Stacking segmental p3d for action quality assessment. In 25th IEEE international conference on image processing (ICIP), pages 928\u2013932. IEEE.","DOI":"10.1109\/ICIP.2018.8451364"},{"key":"2638_CR56","doi-asserted-by":"crossref","unstructured":"Xu, A., Zeng, L.-A., & Zheng, W.-S. (2022a). 
Likert scoring with grade decoupling for long-term action assessment. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3232\u20133241.","DOI":"10.1109\/CVPR52688.2022.00323"},{"issue":"12","key":"2638_CR57","doi-asserted-by":"publisher","first-page":"4578","DOI":"10.1109\/TCSVT.2019.2927118","volume":"30","author":"C Xu","year":"2019","unstructured":"Xu, C., Fu, Y., Zhang, B., Chen, Z., Jiang, Y.-G., & Xue, X. (2019). Learning to score figure skating sport videos. IEEE Trans. Circuits Syst. Video Technol.,30(12), 4578\u20134590.","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"2638_CR58","doi-asserted-by":"crossref","unstructured":"Xu, H., Ke, X., Wu, H., Xu, R., Li, Y., & Guo, W. (2025a). Language-guided audio-visual learning for long-term sports assessment. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23967\u201323977.","DOI":"10.1109\/CVPR52734.2025.02232"},{"key":"2638_CR59","doi-asserted-by":"crossref","unstructured":"Xu, H., Ke, X., Wu, H., Xu, R., Li, Y., Xu, P., & Guo, W. (2025b). Dancefix: An exploration in group dance neatness assessment through fixing abnormal challenges of human pose. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 8869\u20138877.","DOI":"10.1609\/aaai.v39i8.32959"},{"key":"2638_CR60","doi-asserted-by":"crossref","unstructured":"Xu, H., Wu, H., Ke, X., Li, Y., Xu, R., & Guo, W. (2025c). Quality-guided vision-language learning for long-term action quality assessment. In IEEE Transactions on Multimedia (TMM), pages 1\u201313.","DOI":"10.1109\/TMM.2025.3599078"},{"key":"2638_CR61","doi-asserted-by":"crossref","unstructured":"Xu, J., Rao, Y., Yu, X., Chen, G., Zhou, J., & Lu, J. (2022b). Finediving: A fine-grained dataset for procedure-aware action quality assessment. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pages 2949\u20132958.","DOI":"10.1109\/CVPR52688.2022.00296"},{"key":"2638_CR62","doi-asserted-by":"crossref","unstructured":"Xu, J., Yin, S., Zhao, G., Wang, Z., & Peng, Y. (2024). Fineparser: A fine-grained spatio-temporal action parser for human-centric action quality assessment. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14628\u201314637.","DOI":"10.1109\/CVPR52733.2024.01386"},{"key":"2638_CR63","doi-asserted-by":"crossref","unstructured":"Yang, A., Nagrani, A., Seo, P.\u00a0H., Miech, A., Pont-Tuset, J., Laptev, I., Sivic, J., & Schmid, C. (2023). Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10714\u201310726.","DOI":"10.1109\/CVPR52729.2023.01032"},{"key":"2638_CR64","doi-asserted-by":"crossref","unstructured":"Yu, X., Rao, Y., Zhao, W., Lu, J., & Zhou, J. (2021). Group-aware contrastive regression for action quality assessment. In Proceedings of the IEEE\/CVF international conference on computer vision (ICCV), pages 7919\u20137928.","DOI":"10.1109\/ICCV48922.2021.00782"},{"key":"2638_CR65","doi-asserted-by":"crossref","unstructured":"Zeng, L.-A., Hong, F.-T., Zheng, W.-S., Yu, Q.-Z., Zeng, W., Wang, Y.-W., & Lai, J.-H. (2020). Hybrid dynamic-static context-aware attention network for action assessment in long videos. In Proceedings of ACM International Conference on Multimedia (ACM MM).","DOI":"10.1145\/3394171.3413560"},{"key":"2638_CR66","doi-asserted-by":"crossref","unstructured":"Zeng, L.-A. & Zheng, W.-S. (2024). Multimodal action quality assessment. 
In IEEE Transactions on Image Processing (TIP).","DOI":"10.1109\/TIP.2024.3362135"},{"issue":"2","key":"2638_CR67","doi-asserted-by":"publisher","first-page":"929","DOI":"10.1007\/s00521-023-09068-w","volume":"36","author":"B Zhang","year":"2024","unstructured":"Zhang, B., Chen, J., Xu, Y., Zhang, H., Yang, X., & Geng, X. (2024). Auto-encoding score distribution regression for action quality assessment. Neural Comput. Appl.,36(2), 929\u2013942.","journal-title":"Neural Comput. Appl."},{"key":"2638_CR68","doi-asserted-by":"crossref","unstructured":"Zhang, C., Gupta, A., & Zisserman, A. (2021). Temporal query networks for fine-grained video understanding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4486\u20134496.","DOI":"10.1109\/CVPR46437.2021.00446"},{"key":"2638_CR69","first-page":"492","volume":"13664","author":"C Zhang","year":"2022","unstructured":"Zhang, C., Wu, J., & Li, Y. (2022). Actionformer: Localizing moments of actions with transformers. In Proceedings of the European Conference on Computer Vision (ECCV),13664, 492\u2013510.","journal-title":"In Proceedings of the European Conference on Computer Vision (ECCV)"},{"key":"2638_CR70","doi-asserted-by":"crossref","unstructured":"Zhang, S., Bai, S., Chen, G., Chen, L., Lu, J., Wang, J., & Tang, Y. (2024b). Narrative action evaluation with prompt-guided multimodal interaction. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR52733.2024.01744"},{"key":"2638_CR71","doi-asserted-by":"crossref","unstructured":"Zhang, S., Dai, W., Wang, S., Shen, X., Lu, J., Zhou, J., & Tang, Y. (2023). Logo: A long-form video dataset for group action quality assessment. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2405\u20132414.","DOI":"10.1109\/CVPR52729.2023.00238"},{"key":"2638_CR72","unstructured":"Zhou, C., Huang, Y., & Ling, H. (2022). 
Uncertainty-driven action quality assessment. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"2638_CR73","unstructured":"Zhou, K., Li, J., Cai, R., Wang, L., Zhang, X., & Liang, X. (2024). Cofinal: Enhancing action quality assessment with coarse-to-fine instruction alignment. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), pages 1771\u20131779. International Joint Conferences on Artificial Intelligence Organization."},{"key":"2638_CR74","doi-asserted-by":"crossref","unstructured":"Zhou, K., Shum, H. P.\u00a0H., Li, F. W.\u00a0B., Zhang, X., & Liang, X. (2025). Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction. pages 3718\u20133732.","DOI":"10.1109\/TIP.2025.3574938"},{"key":"2638_CR75","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: deformable transformers for end-to-end object detection. 
In 9th International Conference on Learning Representations (ICLR)."}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02638-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02638-6","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02638-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T15:43:50Z","timestamp":1771602230000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02638-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,28]]},"references-count":75,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1]]}},"alternative-id":["2638"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02638-6","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,28]]},"assertion":[{"value":"28 February 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 October 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 December 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"24"}}