{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T01:18:58Z","timestamp":1779326338216,"version":"3.51.4"},"reference-count":38,"publisher":"MDPI AG","issue":"21","license":[{"start":{"date-parts":[[2022,10,28]],"date-time":"2022-10-28T00:00:00Z","timestamp":1666915200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Project","award":["2018YFB1800304"],"award-info":[{"award-number":["2018YFB1800304"]}]},{"name":"National Key Research and Development Project","award":["61472316"],"award-info":[{"award-number":["61472316"]}]},{"name":"National Key Research and Development Project","award":["xzy012020112"],"award-info":[{"award-number":["xzy012020112"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["2018YFB1800304"],"award-info":[{"award-number":["2018YFB1800304"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61472316"],"award-info":[{"award-number":["61472316"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["xzy012020112"],"award-info":[{"award-number":["xzy012020112"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Fundamental Research Funds for the Central Universities","award":["2018YFB1800304"],"award-info":[{"award-number":["2018YFB1800304"]}]},{"name":"Fundamental Research Funds for the Central Universities","award":["61472316"],"award-info":[{"award-number":["61472316"]}]},{"name":"Fundamental Research Funds for the Central Universities","award":["xzy012020112"],"award-info":[{"award-number":["xzy012020112"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Video summarization (VS) is a widely used technique for facilitating the effective reading, fast comprehension, and effective retrieval of video content. Certain properties of the new video data, such as a lack of prominent emphasis and a fuzzy theme development border, disturb the original thinking mode based on video feature information. Moreover, it introduces new challenges to the extraction of video depth and breadth features. In addition, the diversity of user requirements creates additional complications for more accurate keyframe screening issues. To overcome these challenges, this paper proposes a hierarchical spatial\u2013temporal cross-attention scheme for video summarization based on comparative learning. Graph attention networks (GAT) and the multi-head convolutional attention cell are used to extract local and depth features, while the GAT-adjusted bidirection ConvLSTM (DB-ConvLSTM) is used to extract global and breadth features. Furthermore, a spatial\u2013temporal cross-attention-based ConvLSTM is developed for merging hierarchical characteristics and achieving more accurate screening in similar keyframes clusters. Verification experiments and comparative analysis demonstrate that our method outperforms state-of-the-art methods.<\/jats:p>","DOI":"10.3390\/s22218275","type":"journal-article","created":{"date-parts":[[2022,10,30]],"date-time":"2022-10-30T10:47:57Z","timestamp":1667126877000},"page":"8275","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["A Hierarchical Spatial\u2013Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9789-7541","authenticated-orcid":false,"given":"Xiaoyu","family":"Teng","sequence":"first","affiliation":[{"name":"Department of Faculty of Electronic and Information Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"},{"name":"Shaanxi Province Key Laboratory of Computer Network, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaolin","family":"Gui","sequence":"additional","affiliation":[{"name":"Department of Faculty of Electronic and Information Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"},{"name":"Shaanxi Province Key Laboratory of Computer Network, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pan","family":"Xu","sequence":"additional","affiliation":[{"name":"Department of Faculty of Electronic and Information Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"},{"name":"Shaanxi Province Key Laboratory of Computer Network, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jianglei","family":"Tong","sequence":"additional","affiliation":[{"name":"Department of Faculty of Electronic and Information Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"},{"name":"Shaanxi Province Key Laboratory of Computer Network, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jian","family":"An","sequence":"additional","affiliation":[{"name":"Department of Faculty of Electronic and Information Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"},{"name":"Shaanxi Province Key Laboratory of Computer Network, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yang","family":"Liu","sequence":"additional","affiliation":[{"name":"Medical College, Northwest Minzu University, Lanzhou 730030, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huilan","family":"Jiang","sequence":"additional","affiliation":[{"name":"ONYCOM Co., Ltd., Seoul 04519, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,10,28]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"663","DOI":"10.1109\/LSP.2021.3066349","article-title":"Graph attention networks adjusted bi-LSTM for video summarization","volume":"28","author":"Zhong","year":"2021","journal-title":"IEEE Signal Proc. Lett."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Yoon, U.-N., Hong, M.-D., and Jo, G.-S. (2021). Interp-SUM: Unsupervised Video Summarization with Piecewise Linear Interpolation. Sensors, 21.","DOI":"10.3390\/s21134562"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1573","DOI":"10.1109\/TIP.2022.3143699","article-title":"Video summarization through reinforcement learning with a 3D spatio-temporal u-net","volume":"31","author":"Liu","year":"2022","journal-title":"IEEE Trans. Image Proc."},{"key":"ref_4","first-page":"1","article-title":"From coarse to fine: Hierarchical structure-aware video summarization","volume":"18","author":"Li","year":"2022","journal-title":"ACM Trans. Mult. Comput. Commun. Appl. TOMM"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhang, K., Chao, W.-L., Sha, F., and Grauman, K. (2016, January 11\u201314). Video summarization with long short-term memory. Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46478-7_47"},{"key":"ref_6","first-page":"2793","article-title":"Reconstructive sequence-graph network for video summarization","volume":"44","author":"Zhao","year":"2021","journal-title":"IEEE Trans. Patt. Anal. Mach. Intell."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"406","DOI":"10.1016\/j.neucom.2022.07.077","article-title":"A Multi-Flexible Video Summarization Scheme Using Property-Constraint Decision Tree","volume":"506","author":"Teng","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"406","DOI":"10.1016\/j.neucom.2018.12.038","article-title":"Multi-video summarization with query-dependent weighted archetypal analysis","volume":"332","author":"Ji","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Rafiq, M., Rafiq, G., Agyeman, R., Choi, G.S., and Jin, S.-I. (2020). Scene classification for sports video summarization using transfer learning. Sensors, 20.","DOI":"10.3390\/s20061702"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"3017","DOI":"10.1109\/TIP.2022.3163855","article-title":"Relational Reasoning Over Spatial-Temporal Graphs for Video Summarization","volume":"31","author":"Zhu","year":"2022","journal-title":"IEEE Trans. Image Proc."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.patrec.2010.08.004","article-title":"VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method","volume":"32","author":"Lopes","year":"2011","journal-title":"Patt. Recognit. Lett."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhao, B., Li, X., and Lu, X. (2017, January 23\u201327). Hierarchical recurrent neural network for video summarization. Proceedings of the 25th ACM International Conference on Multimedia, New York, NY, USA.","DOI":"10.1145\/3123266.3123328"},{"key":"ref_13","unstructured":"An, Y., and Zhao, S. (2021). A Video Summarization Method Using Temporal Interest Detection and Key Frame Prediction. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1016\/j.patrec.2021.03.013","article-title":"First person video summarization using different graph representations","volume":"146","author":"Sahu","year":"2021","journal-title":"Patt. Recognit. Lett."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1016\/j.patrec.2020.12.016","article-title":"Self-attention binary neural tree for video summarization","volume":"143","author":"Fu","year":"2021","journal-title":"Patt. Recognit. Lett."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1765","DOI":"10.1109\/TNNLS.2020.2991083","article-title":"Deep attentive video summarization with distribution consistency learning","volume":"32","author":"Ji","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_17","unstructured":"K\u00f6pr\u00fc, B., and Erzin, E. (2021). Use of Affective Visual Information for Summarization of Human-Centric Videos. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Mi, L., and Chen, Z. (2020, January 13\u201319). Hierarchical Graph Attention Network for Visual Relationship Detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01390"},{"key":"ref_19","unstructured":"Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., and Liu, W. (November, January 27). Ccnet: Criss-cross attention for semantic segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_20","unstructured":"Lin, W., Deng, Y., Gao, Y., Wang, N., Zhou, J., Liu, L., Zhang, L., and Wang, P. (2021). CAT: Cross-Attention Transformer for One-Shot Object Detection. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sanabria, M., Precioso, F., and Menguy, T. (2021, January 10\u201315). Hierarchical multimodal attention for deep video summarization. Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy.","DOI":"10.1109\/ICPR48806.2021.9413097"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Petit, O., Thome, N., Rambour, C., Themyr, L., Collins, T., and Soler, L. (2021, January 27). U-net transformer: Self and cross attention for medical image segmentation. Proceedings of the International Workshop on Machine Learning in Medical Imaging, Strasbourg, France.","DOI":"10.1007\/978-3-030-87589-3_28"},{"key":"ref_23","unstructured":"Veli\u010dkovi\u0107, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph attention networks. arXiv."},{"key":"ref_24","unstructured":"Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-c. (2015, January 7\u201312). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal ON, Canada."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Song, H., Wang, W., Zhao, S., Shen, J., and Lam, K.-M. (2018, January 8\u201314). Pyramid dilated deeper convlstm for video salient object detection. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01252-6_44"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Gao, T., Yao, X., and Chen, D. (2021). Simcse: Simple contrastive learning of sentence embeddings. arXiv.","DOI":"10.18653\/v1\/2021.emnlp-main.552"},{"key":"ref_27","unstructured":"Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016, January 2\u20134). Continuous control with deep reinforcement learning. Proceedings of the International Conference on Learning Representations 2016, San Juan, Puerto Rico."},{"key":"ref_28","unstructured":"Song, Y., Vallmitjana, J., Stent, A., and Jaimes, A. (2018, January 18\u201323). Tvsum: Summarizing web videos using titles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Gygli, M., Grabner, H., Riemenschneider, H., and Van Gool, L. (2014). Creating Summaries from User Videos, Springer. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-319-10584-0_33"},{"key":"ref_30","unstructured":"(2022, September 22). Open Video Project. Available online: https:\/\/open-video.org\/."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Otani, M., Nakashima, Y., Rahtu, E., and Heikkila, J. (2019, January 15\u201320). Rethinking the evaluation of video summaries. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00778"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhao, B., Li, X., and Lu, X. (2018, January 18\u201323). Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00773"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"107618","DOI":"10.1016\/j.compeleceng.2021.107618","article-title":"Deep hierarchical LSTM networks with attention for video summarization","volume":"97","author":"Lin","year":"2022","journal-title":"Comput. Electr. Eng."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.neucom.2021.09.015","article-title":"Video summarization with a dual-path attentive network","volume":"467","author":"Liang","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1016\/j.patcog.2021.108312","article-title":"Learning multiscale hierarchical attention for video summarization","volume":"122","author":"Zhu","year":"2022","journal-title":"Patt. Recognit."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"1709","DOI":"10.1109\/TCSVT.2019.2904996","article-title":"Video summarization with attention-based encoder\u2013decoder networks","volume":"30","author":"Ji","year":"2019","journal-title":"IEEE Trans. Circ. Syst. Video Technol."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"902","DOI":"10.1631\/FITEE.2000429","article-title":"Video summarization with a graph convolutional attention network","volume":"22","author":"Li","year":"2021","journal-title":"Front. Inform. Technol. Electr. Eng."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Park, J., Lee, J., Kim, I.-J., and Sohn, K. (2020, January 23\u201328). Sumgraph: Video summarization via recursive graph modeling. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58595-2_39"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/21\/8275\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:05:03Z","timestamp":1760144703000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/21\/8275"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,28]]},"references-count":38,"journal-issue":{"issue":"21","published-online":{"date-parts":[[2022,11]]}},"alternative-id":["s22218275"],"URL":"https:\/\/doi.org\/10.3390\/s22218275","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,10,28]]}}}