{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T07:03:31Z","timestamp":1763795011431,"version":"3.45.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>This study introduces an automated system for fine-grained stroke recognition in broadcast table tennis videos, designed to address challenges in manual annotation and tactical analysis during international competitions. The proposed framework integrates an Adaptive Temporal Difference Model with a Transformer Encoder (ATDT), leveraging a combination of Temporal Difference Networks (TDN) and Temporal Adaptive Modules (TAM) to enhance spatial and temporal feature extraction. To enhance feature discriminability, we employ supervised contrastive learning, which promotes better representation learning for fine-grained action recognition. The system is divided into two primary modules: the Action Segmentation Module (ASM) and the Action Recognition Module (ARM). ASM precisely identifies the start and end times of each stroke action by incorporating ball trajectory analysis to identify precise hit timings and placements. The precise segmentation facilitates the subsequent ARM to implement a three-stage recognition process: forehand and backhand classification, group-based classification, and intra-group action classification. This hierarchical approach improves the system\u2019s ability to differentiate between subtle stroke variations, even under the constraints of low-resolution broadcast footage. To validate the framework, the MISTT dataset was collected, comprising 3,618 stroke action clips from 18 international matches, with professional player annotations. The proposed ATDT model outperformed existing methods, achieving a top-1 accuracy improvement of 18% for forehand strokes and 25.58% for backhand strokes compared to baseline models. Moreover, our automatic annotation system takes only 1\/30 of the time compared to the manual annotation process, demonstrating its efficiency.<\/jats:p>","DOI":"10.1145\/3769299","type":"journal-article","created":{"date-parts":[[2025,9,30]],"date-time":"2025-09-30T15:05:51Z","timestamp":1759244751000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Fine-grained Stroke Recognition in Broadcast Table Tennis Videos with ATDT"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9529-9070","authenticated-orcid":false,"given":"Tang-Chen","family":"Chang","sequence":"first","affiliation":[{"name":"Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-8590-2034","authenticated-orcid":false,"given":"Duen-Chian","family":"Jheng","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0311-0061","authenticated-orcid":false,"given":"Hsuan-Ya","family":"Liang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5716-6350","authenticated-orcid":false,"given":"Bill Louis","family":"Harchan","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5339-4700","authenticated-orcid":false,"given":"Pu","family":"Ching","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1446-4634","authenticated-orcid":false,"given":"Tsung-Hsun","family":"Tsai","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-8569-2890","authenticated-orcid":false,"given":"Chih-Yi","family":"Chang","sequence":"additional","affiliation":[{"name":"Taiwan Institute of Sports Science, Kaohsiung, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8174-7898","authenticated-orcid":false,"given":"Te-Cheng","family":"Wu","sequence":"additional","affiliation":[{"name":"Physical Education Office, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0475-3689","authenticated-orcid":false,"given":"Yung-Hui","family":"Li","sequence":"additional","affiliation":[{"name":"AI Research Center, Hon Hai Research Institute, Taipei City, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8570-1575","authenticated-orcid":false,"given":"Tse-Yu","family":"Pan","sequence":"additional","affiliation":[{"name":"National Taiwan University of Science and Technology, Taipei City, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7153-4411","authenticated-orcid":false,"given":"Hung-Kuo","family":"Chu","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1917-2155","authenticated-orcid":false,"given":"Min-Chun","family":"Hu","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"e_1_3_2_3_2","unstructured":"Gedas Bertasius Heng Wang and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning Vol. 2 4."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3633516"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_6_2","unstructured":"Brandon Castellano. 2024. Home - PySceneDetect. 2024. Retrieved from https:\/\/www.scenedetect.com"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01311"},{"key":"e_1_3_2_8_2","unstructured":"Dartfish. 2024. Dartfish: Video Analysis Solutions to Improve Teams\u2019 and Athletes\u2019 Performance. Retrieved from https:\/\/www.dartfish.com"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00298"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.392"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i8.28692"},{"key":"e_1_3_2_13_2","unstructured":"Glenn Jocher Ayush Chaurasia and Jing Qiu. 2023. Ultralytics YOLO. Retrieved from https:\/\/github.com\/ultralytics\/ultralytics"},{"key":"e_1_3_2_14_2","first-page":"18661","article-title":"Supervised contrastive learning","volume":"33","author":"Khosla Prannay","year":"2020","unstructured":"Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. In Advances in Neural Information Processing Systems, Vol. 33, 18661\u201318673.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_15_2","doi-asserted-by":"crossref","unstructured":"Kaustubh Milind Kulkarni Rohan S. Jamadagni Jeffrey Aaron Paul and Sucheth Shenoy. 2023. Table tennis stroke detection and recognition using ball trajectory data. arXiv:2302.09657. Retrieved from https:\/\/arxiv.org\/abs\/2302.09657","DOI":"10.2139\/ssrn.4159539"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW53098.2021.00515"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.113"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00958"},{"key":"e_1_3_2_19_2","unstructured":"Sangyoun Lee Juho Jung Changdae Oh and Sunghee Yun. 2024. Enhancing temporal action localization: Advanced S6 modeling with recurrent mechanism. arXiv:2407.13078. Retrieved from https:\/\/arxiv.org\/abs\/2407.13078"},{"key":"e_1_3_2_20_2","unstructured":"Kunchang Li Yali Wang Yinan He Yizhuo Li Yi Wang Limin Wang and Yu Qiao. 2022. Uniformerv2: Spatiotemporal learning by arming image VITS with video uniformer. arXiv:2211.09552. Retrieved from https:\/\/arxiv.org\/abs\/2211.09552"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00099"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00718"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00399"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_1"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3271811"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3195321"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01345"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-60639-8_40"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2019.8803382"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CBMI.2018.8516488"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR48806.2021.9412742"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3475722.3482793"},{"key":"e_1_3_2_34_2","unstructured":"Pierre-Etienne Martin Jordan Calandre Boris Mansencal Jenny Benois-Pineau Renaud P\u00e9teri Laurent Mascarilla and Julien Morlier. 2021. Sports video: Fine-grained action detection and classification of table tennis strokes from videos for mediaeval 2021. arXiv:2112.11384. Retrieved from https:\/\/arxiv.org\/abs\/2112.11384"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00055"},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","unstructured":"Zheng Shou Dongang Wang and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR.2016.119"},{"key":"e_1_3_2_37_2","unstructured":"Tom\u00e1\u0161 Sou\u010dek and Jakub Loko\u010d. 2020. Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv:2008.04838. Retrieved from https:\/\/arxiv.org\/abs\/2008.04838"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_2_40_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00450"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.3390\/s23167190"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00193"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01432"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00716"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01340"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.317"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769299","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T07:00:10Z","timestamp":1763794810000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769299"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,21]]},"references-count":48,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3769299"],"URL":"https:\/\/doi.org\/10.1145\/3769299","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,11,21]]},"assertion":[{"value":"2024-12-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}