{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T19:50:38Z","timestamp":1772481038992,"version":"3.50.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"7","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,7,31]]},"abstract":"<jats:p>\n                    To maximize the utility of lecture videos, in today\u2019s fast-paced society with dwindling attention spans, various e-learning technologies are introduced, e.g., non-linear learning, bite-sized learning, and personalized lecture video fragment recommendation. In this article, we conduct a detailed performance study on a key enabler for aforementioned technologies:\n                    <jats:italic toggle=\"yes\">Lecture Video Fragmentation<\/jats:italic>\n                    by\n                    <jats:italic toggle=\"yes\">Key Point Identification<\/jats:italic>\n                    in lecture videos. We begin with a taxonomy of existing methods, which are classified into two categories: boundary-based methods, where the fragmentation is achieved using specific methods depending on the modality, and representation-based methods, where the fragmentation task is formulated as a boundary prediction task based on representations of smaller video chunks. Various configurations of these methods are also examined in detail. To conduct an extensive, comprehensive, and objective comparison study, we address the limitations of existing datasets by introducing a new lecture video fragmentation dataset, MITFLD, without any synthetic videos. We also propose a unified framework\n                    <jats:monospace>kpi<\/jats:monospace>\n                    , which includes the implementation of datasets, metrics, and compared methods to facilitate the experiments and future research on lecture video fragmentation. The experiments cover different configurations of existing methods on two large datasets (AVLecture and MITFLD). Further experiments are also conducted for ablation studies, such as the effect of feature combinations and the influence of lecture modes. Through the experiments, the representation-based method BiLSTM with self-supervised learning representations is found to exhibit promising performance. Key insights and potential future directions are also discussed.\n                  <\/jats:p>","DOI":"10.1145\/3746640","type":"journal-article","created":{"date-parts":[[2025,7,3]],"date-time":"2025-07-03T11:45:45Z","timestamp":1751543145000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Towards Key Point Identification (KPI) for Lecture Videos: Approaches and Performance Evaluation"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-8847-9194","authenticated-orcid":false,"given":"Jiaqi","family":"Wang","sequence":"first","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0727-4376","authenticated-orcid":false,"given":"Ricky Y.-K.","family":"Kwok","sequence":"additional","affiliation":[{"name":"Hong Kong Metropolitan University, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3454-8731","authenticated-orcid":false,"given":"Edith C. H.","family":"Ngai","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong SAR, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,7,20]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3112535"},{"key":"e_1_3_2_3_2","unstructured":"Felipe Almeida and Geraldo Xex\u00e9o. 2019. Word embeddings: A survey. arXiv:1901.09069. Retrieved from https:\/\/arxiv.org\/abs\/1901.09069"},{"key":"e_1_3_2_4_2","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, Vol. 33, 12449\u201312460.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1187\/cbe.16-03-0125"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/584792.584829"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICICT50521.2020.00034"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2008.2008924"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/2502081.2508115"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00947"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01265"},{"key":"e_1_3_2_12_2","first-page":"447","volume-title":"European Conference on Artificial Intelligence","author":"Dimitsas Markos","year":"2023","unstructured":"Markos Dimitsas and Jochen L. Leidner. 2023. Topic segmentation of educational video lectures using audio and text. In European Conference on Artificial Intelligence. Springer, 447\u2013458."},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","unstructured":"Guodong Ding Fadime Sener and Angela Yao. 2023. Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence.","DOI":"10.1109\/TPAMI.2023.3327284"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3073596"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-05716-9_21"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2843948"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/2556325.2566239"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00520"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.5555\/972684.972687"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1137\/0202019"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_41"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/2428955.2429011"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.4236\/jss.2024.124015"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.4018\/jthi.2005040102"},{"key":"e_1_3_2_27_2","first-page":"9","volume-title":"Proceedings of the 37th Annual Hawaii International Conference on System Sciences","author":"Lin Ming","year":"2004","unstructured":"Ming Lin, Jay F. Nunamaker, Michael Chau, and Hsinchun Chen. 2004. Segmentation of lecture videos based on text: A method combining multiple linguistic features. In Proceedings of the 37th Annual Hawaii International Conference on System Sciences. IEEE, 1\u20139."},{"key":"e_1_3_2_28_2","unstructured":"Ben Mann N. Ryder M. Subbiah J. Kaplan P. Dhariwal A. Neelakantan P. Shyam G. Sastry A. Askell S. Agarwal et al. 2020. Language models are few-shot learners. arXiv:2005.14165. Retrieved from https:\/\/arxiv.org\/abs\/2005.14165"},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","unstructured":"Guilherme de A. P. Marques Jos\u00e9 Matheus C. Boaro Antonio Jos\u00e9 G. Busson Alan L. V. Guedes Julio Cesar Duarte and S\u00e9rgio Colcher. 2024. Action segmentation through self-supervised video features and positional-encoded embeddings. ACM Transactions on Multimedia Computing Communications and Applications 20 9 (2024) 1\u201323.","DOI":"10.1145\/3649465"},{"key":"e_1_3_2_30_2","first-page":"1","volume-title":"ACM Computing Surveys (CSUR)","author":"Thi Tuyet","year":"2021","unstructured":"Tuyet Thi, Nguyen Hai, Jatowt Adam, Coustaty Mickael, and Antoine Doucet. 2021. Survey of post-OCR processing approaches. ACM Computing Surveys (CSUR) 54, 6 (2021), 1\u201337."},{"key":"e_1_3_2_31_2","first-page":"28492","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 28492\u201328518."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01016"},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","unstructured":"N. Reimers. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv:1908.10084. Retrieved from https:\/\/arxiv.org\/abs\/1908.10084","DOI":"10.18653\/v1\/D19-1410"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01107"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00914"},{"key":"e_1_3_2_36_2","unstructured":"Sander Schulhoff Michael Ilie Nishant Balepur Konstantine Kahadze Amanda Liu Chenglei Si Yinheng Li Aayush Gupta HyoJung Han Sevien Schulhoff et al. 2024. The prompt report: A systematic survey of prompting techniques. arXiv:2406.06608. Retrieved from https:\/\/arxiv.org\/abs\/2406.06608"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2656407"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISM.2015.18"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664815"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3323503.3349548"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3654671"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3630257"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3654669"},{"key":"e_1_3_2_44_2","first-page":"5998","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397481.3450672"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3700645"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3232034"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01363"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-023-17654-2"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3595923"},{"key":"e_1_3_2_52_2","doi-asserted-by":"crossref","unstructured":"Duzhen Zhang Yahan Yu Chenxing Li Jiahua Dong Su Dan Chu and Chenhui Dong Yu. 2024. MM-LLMS: Recent advances in multimodal large language models. arXiv:2401.13601. Retrieved from https:\/\/arxiv.org\/abs\/2401.13601","DOI":"10.18653\/v1\/2024.findings-acl.738"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13042-010-0001-0"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746640","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T18:43:21Z","timestamp":1772477001000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746640"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,20]]},"references-count":52,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,7,31]]}},"alternative-id":["10.1145\/3746640"],"URL":"https:\/\/doi.org\/10.1145\/3746640","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,20]]},"assertion":[{"value":"2024-11-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-17","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}