{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T04:38:34Z","timestamp":1764995914936,"version":"3.46.0"},"reference-count":99,"publisher":"Springer Science and Business Media LLC","issue":"12","license":[{"start":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T00:00:00Z","timestamp":1756944000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T00:00:00Z","timestamp":1756944000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called\n                    <jats:italic>VideoGraph<\/jats:italic>\n                    , which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. 
The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/park-jungin\/videograph\" ext-link-type=\"uri\">https:\/\/github.com\/park-jungin\/videograph<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1007\/s11263-025-02577-2","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T17:40:11Z","timestamp":1757007611000},"page":"8617-8641","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization"],"prefix":"10.1007","volume":"133","author":[{"given":"Jungin","family":"Park","sequence":"first","affiliation":[]},{"given":"Jiyoung","family":"Lee","sequence":"additional","affiliation":[]},{"given":"Kwanghoon","family":"Sohn","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"2577_CR1","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. IEEE Conf. Comput. Vis. 
Pattern Recog., 6077\u20136086.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"2577_CR2","doi-asserted-by":"crossref","unstructured":"Bhattacharya, U., Mittal, T., Chandra, R., Randhavane, T., Bera, A., & Manocha, D. (2020). Step: Spatial temporal graph convolutional networks for emotion perception from gaits. AAAI Conf. Art. Intell., 1342\u20131350.","DOI":"10.1609\/aaai.v34i02.5490"},{"key":"2577_CR3","doi-asserted-by":"crossref","unstructured":"Carreira, J., Agrawal, P., Fragkiadaki, K., & Malik, J. (2016). Human pose estimation with iterative error feedback. IEEE Conf. Comput. Vis. Pattern Recog., 4733\u20134742.","DOI":"10.1109\/CVPR.2016.512"},{"issue":"4","key":"2577_CR4","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","volume":"40","author":"L-C Chen","year":"2018","unstructured":"Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4), 834\u2013848.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"issue":"1","key":"2577_CR5","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1016\/j.patrec.2010.08.004","volume":"32","author":"SEF De Avila","year":"2011","unstructured":"De Avila, S. E. F., Lopes, A. P. B., Luz, A., Jr., & Albuquerque Ara\u00fajo, A. (2011). Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method. Patt. Rec. Letters, 32(1), 56\u201368.","journal-title":"Patt. Rec. Letters"},{"key":"2577_CR6","doi-asserted-by":"crossref","unstructured":"Gao, K., Chen, L., Niu, Y., Shao, J., & Xiao, J. (2022). Classification-then-grounding: Reformulating video scene graphs as temporal bipartite graphs. IEEE Conf. Comput. Vis. Pattern Recog., 19497\u201319506.","DOI":"10.1109\/CVPR52688.2022.01889"},{"key":"2577_CR7","doi-asserted-by":"crossref","unstructured":"Gao, J., Zhang, T., & Xu, C. 
(2019). Graph convolutional tracking. IEEE Conf. Comput. Vis. Pattern Recog., 4649\u20134659.","DOI":"10.1109\/CVPR.2019.00478"},{"key":"2577_CR8","unstructured":"Gong, B., Chao, W.-L., Grauman, K., & Sha, F. (2014). Diverse sequential subset selection for supervised video summarization. Adv. Neural Inform. Process. Syst., 2069\u20132077."},{"key":"2577_CR9","unstructured":"Grandvalet, Y., & Bengio, Y. (2004). Semi-supervised learning by entropy minimization. Adv. Neural Inform. Process. Syst., 529\u2013536."},{"key":"2577_CR10","doi-asserted-by":"crossref","unstructured":"Gygli, M., Grabner, H., & Gool, L. V. (2015). Video summarization by learning submodular mixtures of objectives. IEEE Conf. Comput. Vis. Pattern Recog., 3090\u20133098.","DOI":"10.1109\/CVPR.2015.7298928"},{"key":"2577_CR11","doi-asserted-by":"crossref","unstructured":"Gygli, M., Grabner, H., Riemenschneider, H., & Gool, L. V. (2014). Creating summaries from user videos. Eur. Conf. Comput. Vis.","DOI":"10.1007\/978-3-319-10584-0_33"},{"key":"2577_CR12","doi-asserted-by":"crossref","unstructured":"He, X., Hua, Y., Song, T., Zhang, Z., Xue, Z., Ma, R., Robertson, N., & Guan, H. (2019). Unsupervised video summarization with attentive conditional generative adversarial networks. ACM Int. Conf. Multimedia, 2296\u20132304.","DOI":"10.1145\/3343031.3351056"},{"key":"2577_CR13","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. IEEE Conf. Comput. Vis. Pattern Recog., 770\u2013778.","DOI":"10.1109\/CVPR.2016.90"},{"key":"2577_CR14","doi-asserted-by":"crossref","unstructured":"Hosang, J., Benenson, R., & Schiele, B. (2017). Learning non-maximum suppression. IEEE Conf. Comput. Vis. Pattern Recog., 4507\u20134515.","DOI":"10.1109\/CVPR.2017.685"},{"key":"2577_CR15","doi-asserted-by":"crossref","unstructured":"Iashin, V., & Rahtu, E. (2020). A better use of audio-visual cues: Dense video captioning with bi-modal transformer. 
Brit. Mach. Vis. Conf.","DOI":"10.5244\/C.34.29"},{"key":"2577_CR16","doi-asserted-by":"crossref","unstructured":"Jain, A., Zamir, A. R., Savarese, S., & Saxena, A. (2016). Structural-rnn: Deep learning on spatio-temporal graphs. IEEE Conf. Comput. Vis. Pattern Recog., 5308\u20135317.","DOI":"10.1109\/CVPR.2016.573"},{"key":"2577_CR17","doi-asserted-by":"crossref","unstructured":"Jiang, H., & Mu, Y. (2022). Joint video summarization and moment localization by cross-task sample transfer. IEEE Conf. Comput. Vis. Pattern Recog., 16388\u201316398.","DOI":"10.1109\/CVPR52688.2022.01590"},{"key":"2577_CR18","doi-asserted-by":"crossref","unstructured":"Jiang, B., Zhang, Z., Lin, D., Tang, J., & Luo, B. (2019). Semi-supervised learning with graph learning-convolutional networks. IEEE Conf. Comput. Vis. Pattern Recog., 11313\u201311320.","DOI":"10.1109\/CVPR.2019.01157"},{"issue":"6","key":"2577_CR19","doi-asserted-by":"publisher","first-page":"1709","DOI":"10.1109\/TCSVT.2019.2904996","volume":"30","author":"Z Ji","year":"2020","unstructured":"Ji, Z., Xiong, K., Pang, Y., & Li, X. (2020). Video summarization with attention-based encoder-decoder networks. IEEE Trans. Circuit Syst. Vid. Tech., 30(6), 1709\u20131717.","journal-title":"IEEE Trans. Circuit Syst. Vid. Tech."},{"issue":"4","key":"2577_CR20","doi-asserted-by":"publisher","first-page":"1765","DOI":"10.1109\/TNNLS.2020.2991083","volume":"32","author":"Z Ji","year":"2021","unstructured":"Ji, Z., Zhao, Y., Pang, Y., Li, X., & Han, J. (2021). Deep attentive video summarization with distribution consistency learning. IEEE Trans. Neural Net. and Learn. Syst., 32(4), 1765\u20131775.","journal-title":"IEEE Trans. Neural Net. and Learn. Syst."},{"issue":"4","key":"2577_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2766954","volume":"34","author":"N Joshi","year":"2015","unstructured":"Joshi, N., Kienzle, W., Toelle, M., Uyttendaele, M., & Cohen, M. F. (2015). 
Real-time hyperlapse creation via optimal frame selection. ACM Trans. on Graphics, 34(4), 1\u20139.","journal-title":"ACM Trans. on Graphics"},{"key":"2577_CR22","unstructured":"Kang, H.-W., Matsushita, Y., Tang, X., & Chen, X.-Q. (2006). Space-time video montage. IEEE Conf. Comput. Vis. Pattern Recog., 1331\u20131338."},{"key":"2577_CR23","doi-asserted-by":"crossref","unstructured":"Kaseris, M., Mademlis, I., & Pitas, I. (2022). Exploiting caption diversity for unsupervised video summarization. ICASSP, 6519\u20136527.","DOI":"10.1109\/ICASSP43922.2022.9747592"},{"issue":"3","key":"2577_CR24","doi-asserted-by":"publisher","first-page":"239","DOI":"10.1093\/biomet\/33.3.239","volume":"33","author":"MG Kendall","year":"1945","unstructured":"Kendall, M. G. (1945). The treatment of ties in ranking problems. Biometrika, 33(3), 239\u2013251.","journal-title":"Biometrika"},{"key":"2577_CR25","unstructured":"Kingma, D.P., & Ba, J. (2015). Adam: A method for stochastic optimization. Int. Conf. Learn. Represent."},{"key":"2577_CR26","unstructured":"Kipf, T.N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. Int. Conf. Learn. Represent."},{"key":"2577_CR27","unstructured":"Kr\u00e4henb\u00fchl, P., & Koltun, V. (2011). Efficient inference in fully connected crfs with gaussian edge potentials. Adv. Neural Inform. Process. Syst., 109\u2013117."},{"key":"2577_CR28","doi-asserted-by":"crossref","unstructured":"Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M. S., & Fei-Fei, L. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 32\u201373.","DOI":"10.1007\/s11263-016-0981-7"},{"key":"2577_CR29","unstructured":"Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Int. Conf. Mach. 
Learn., 282\u2013289."},{"key":"2577_CR30","unstructured":"Lee, Y.J., Ghosh, J., & Grauman, K. (2012). Discovering important people and objects for egocentric video summarization. IEEE Conf. Comput. Vis. Pattern Recog."},{"key":"2577_CR31","doi-asserted-by":"crossref","unstructured":"Li, J., Gao, X., & Jiang, T. (2020). Graph networks for multiple object tracking. Winter Applications Comput. Vis., 719\u2013728.","DOI":"10.1109\/WACV45572.2020.9093347"},{"key":"2577_CR32","doi-asserted-by":"crossref","unstructured":"Li, H., Ke, Q., Gong, M., & Drummond, T. (2023). Progressive video summarization via multimodal self-supervised learning. Winter Applications Comput. Vis., 5584\u20135593.","DOI":"10.1109\/WACV56688.2023.00554"},{"key":"2577_CR33","doi-asserted-by":"crossref","unstructured":"Li, H., Ke, Q., Gong, M., & Zhang, R. (2022). Video joint modelling based on hierarchical transformer for co-summarization. IEEE Trans. Pattern Anal. Mach. Intell., 1\u201314.","DOI":"10.1109\/TPAMI.2022.3186506"},{"key":"2577_CR34","doi-asserted-by":"crossref","unstructured":"Li, M., Wang, H., Zhang, W., Miao, J., Zhao, Z., Zhang, S., Ji, W., & Wu, F. (2023). Winner: Weakly-supervised hierarchical decomposition and alignment for spatio-temporal video grounding. IEEE Conf. Comput. Vis. Pattern Recog., 23090\u201323099.","DOI":"10.1109\/CVPR52729.2023.02211"},{"key":"2577_CR35","doi-asserted-by":"crossref","unstructured":"Liu, T., & Kender, J. R. (2002). Optimization algorithms for the selection of key frame sequences of variable length. Eur. Conf. Comput. Vis., 403\u2013417.","DOI":"10.1007\/3-540-47979-1_27"},{"key":"2577_CR36","doi-asserted-by":"crossref","unstructured":"Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., & Zhang, Y. (2020). Graph structured network for image-text matching. IEEE Conf. Comput. Vis. Pattern Recog., 10921\u201310930.","DOI":"10.1109\/CVPR42600.2020.01093"},{"key":"2577_CR37","doi-asserted-by":"crossref","unstructured":"Lu, Z., & Grauman, K. (2013). 
Story-driven summarization for egocentric video. IEEE Conf. Comput. Vis. Pattern Recog.","DOI":"10.1109\/CVPR.2013.350"},{"key":"2577_CR38","doi-asserted-by":"crossref","unstructured":"Mahasseni, B., Lam, M., & Todorovic, S. (2017). Unsupervised video summarization with adversarial lstm networks. IEEE Conf. Comput. Vis. Pattern Recog., 202\u2013211.","DOI":"10.1109\/CVPR.2017.318"},{"key":"2577_CR39","first-page":"313","volume":"89","author":"B Mukhoty","year":"2019","unstructured":"Mukhoty, B., Gopakumar, G., Jain, P., & Kar, P. (2019). Globally-convergent iteratively reweighted least squares for robust regression problems. Proceedings of Machine Learning Research, 89, 313\u2013322.","journal-title":"Proceedings of Machine Learning Research"},{"key":"2577_CR40","unstructured":"Narasimhan, M., Rohrbach, A., & Darrell, T. (2021). Clip-it! language-guided video summarization. Adv. Neural Inform. Process. Syst., 13988\u201314000."},{"key":"2577_CR41","doi-asserted-by":"crossref","unstructured":"Nassar, A.S., D'Aronco, S., Lef\u00e8vre, S., & Wegner, J.D. (2020). Geograph: Graph-based multi-view object detection with geometric cues end-to-end. Eur. Conf. Comput. Vis., 488\u2013504.","DOI":"10.1007\/978-3-030-58571-6_29"},{"key":"2577_CR42","unstructured":"Ngo, C.-W., Ma, Y.-F., & Zhang, H.-J. (2003). Automatic video summarization by graph modeling. IEEE Int. Conf. Comput. Vis."},{"key":"2577_CR43","unstructured":"Open video project. https:\/\/open-video.org\/."},{"key":"2577_CR44","doi-asserted-by":"crossref","unstructured":"Otani, M., Nakashima, Y., Rahtu, E., & Heikkil\u00e4, J. (2019). Rethinking the evaluation of video summaries. IEEE Conf. Comput. Vis. Pattern Recog., 7596\u20137604.","DOI":"10.1109\/CVPR.2019.00778"},{"key":"2577_CR45","doi-asserted-by":"crossref","unstructured":"Pan, B., Cai, H., Huang, D.-A., Lee, K.-H., Gaidon, A., Adeli, E., & Niebles, J. C. (2020). Spatio-temporal graph for video captioning with knowledge distillation. IEEE Conf. Comput. 
Vis. Pattern Recog., 10870\u201310879.","DOI":"10.1109\/CVPR42600.2020.01088"},{"key":"2577_CR46","unstructured":"Park, J., Kwoun, K., Lee, C., & Lim, H. (2022). Multimodal frame-scoring transformer for video summarization. arXiv preprint arXiv:2207.01814."},{"key":"2577_CR47","doi-asserted-by":"crossref","unstructured":"Park, J., Lee, J., & Sohn, K. (2021). Bridge to answer: Structure-aware graph interaction network for video question answering. IEEE Conf. Comput. Vis. Pattern Recog., 15526\u201315535.","DOI":"10.1109\/CVPR46437.2021.01527"},{"key":"2577_CR48","doi-asserted-by":"crossref","unstructured":"Park, J., Lee, J., Jeon, S., & Sohn, K. (2019). Video summarization by learning relationships between action and scene. IEEE Int. Conf. Comput. Vis. Worksh.","DOI":"10.1109\/ICCVW.2019.00193"},{"key":"2577_CR49","doi-asserted-by":"crossref","unstructured":"Park, J., Lee, J., Jeon, S., Kim, S., & Sohn, K. (2019). Graph regularization network with semantic affinity for weakly-supervised temporal action localization. IEEE Int. Conf. Image Process.","DOI":"10.1109\/ICIP.2019.8803589"},{"key":"2577_CR50","doi-asserted-by":"crossref","unstructured":"Park, J., Lee, J., Kim, I.-J., & Sohn, K. (2020). Sumgraph: Video summarization via recursive graph modeling. Eur. Conf. Comput. Vis., 647\u2013663.","DOI":"10.1007\/978-3-030-58595-2_39"},{"key":"2577_CR51","doi-asserted-by":"crossref","unstructured":"Poleg, Y., Halperin, T., Arora, C., & Peleg, S. (2015). Egosampling: Fast-forward and stereo for egocentric videos. IEEE Conf. Comput. Vis. Pattern Recog., 4768\u20134776.","DOI":"10.1109\/CVPR.2015.7299109"},{"key":"2577_CR52","doi-asserted-by":"crossref","unstructured":"Potapov, D., Douze, M., Harchaoui, Z., & Schmid, C. (2014). Category-specific video summarization. Eur. Conf. Comput. Vis., 540\u2013555.","DOI":"10.1007\/978-3-319-10599-4_35"},{"key":"2577_CR53","doi-asserted-by":"crossref","unstructured":"Pritch, Y., Rav-Acha, A., Gutman, A., & Peleg, S. (2007). 
Webcam synopsis: Peeking around the world. IEEE Int. Conf. Comput. Vis.","DOI":"10.1109\/ICCV.2007.4408934"},{"key":"2577_CR54","doi-asserted-by":"crossref","unstructured":"Qian, X., Zhuang, Y., Li, Y., Xiao, S., Pu, S., & Xiao, J. (2020). Video relation detection with spatio-temporal graph. ACM Int. Conf. Multimedia, 84\u201393.","DOI":"10.1145\/3343031.3351058"},{"key":"2577_CR55","unstructured":"Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Int. Conf. Mach. Learn., 8748\u20138763."},{"key":"2577_CR56","unstructured":"Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst., 91\u201399."},{"key":"2577_CR57","doi-asserted-by":"crossref","unstructured":"Rochan, M., & Wang, Y. (2019). Video summarization by learning from unpaired data. IEEE Conf. Comput. Vis. Pattern Recog., 7902\u20137911.","DOI":"10.1109\/CVPR.2019.00809"},{"key":"2577_CR58","doi-asserted-by":"crossref","unstructured":"Rochan, M., Ye, L., & Wang, Y. (2018). Video summarization using fully convolutional sequence networks. Eur. Conf. Comput. Vis., 347\u2013363.","DOI":"10.1007\/978-3-030-01258-8_22"},{"key":"2577_CR59","unstructured":"Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., & Komatsuzaki, A. (2021). Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. NeurIPS Workshop."},{"key":"2577_CR60","doi-asserted-by":"crossref","unstructured":"Sharghi, A., Gong, B., & Shah, M. (2016). Query-focused extractive video summarization. Eur. Conf. Comput. Vis., 3\u201319.","DOI":"10.1007\/978-3-319-46484-8_1"},{"key":"2577_CR61","doi-asserted-by":"crossref","unstructured":"Sharghi, A., Laurel, J. S., & Gong, B. (2017). 
Query-focused video summarization: Dataset, evaluation, and a memory network based approach. IEEE Conf. Comput. Vis. Pattern Recog., 4788\u20134797.","DOI":"10.1109\/CVPR.2017.229"},{"key":"2577_CR62","doi-asserted-by":"crossref","unstructured":"Shi, W., & Rajkumar, R. R. (2020). Point-gnn: Graph neural network for 3d object detection in a point cloud. IEEE Conf. Comput. Vis. Pattern Recog., 1711\u20131719.","DOI":"10.1109\/CVPR42600.2020.00178"},{"key":"2577_CR63","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Skeleton-based action recognition with directed graph neural networks. IEEE Conf. Comput. Vis. Pattern Recog., 7912\u20137921.","DOI":"10.1109\/CVPR.2019.00810"},{"key":"2577_CR64","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2019). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. IEEE Conf. Comput. Vis. Pattern Recog., 12026\u201312035.","DOI":"10.1109\/CVPR.2019.01230"},{"key":"2577_CR65","doi-asserted-by":"crossref","unstructured":"Song, Y., Vallmitjana, J., Stent, A., & Jaimes, A. (2015). Tvsum: Summarizing web videos using titles. IEEE Conf. Comput. Vis. Pattern Recog., 5179\u20135187.","DOI":"10.1109\/CVPR.2015.7299154"},{"key":"2577_CR66","doi-asserted-by":"crossref","unstructured":"Sun, M., Farhadi, A., Taskar, B., & Seitz, S. (2014). Salient montages from unconstrained videos. Eur. Conf. Comput. Vis., 472\u2013488.","DOI":"10.1007\/978-3-319-10584-0_31"},{"key":"2577_CR67","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. IEEE Conf. Comput. Vis. Pattern Recog.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"2577_CR68","doi-asserted-by":"crossref","unstructured":"Teng, Y., Wang, L., Li, Z., & Wu, G. (2021). Target adaptive context aggregation for video scene graph generation. IEEE Int. 
Conf. Comput. Vis., 13688\u201313697.","DOI":"10.1109\/ICCV48922.2021.01343"},{"key":"2577_CR69","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Adv. Neural Inform. Process. Syst., 5998\u20136008."},{"key":"2577_CR70","doi-asserted-by":"crossref","unstructured":"Wang, X., & Gupta, A. (2018). Videos as space-time region graphs. Eur. Conf. Comput. Vis., 413\u2013431.","DOI":"10.1007\/978-3-030-01228-1_25"},{"key":"2577_CR71","doi-asserted-by":"crossref","unstructured":"Wang, S., Gao, L., Lyu, X., Guo, Y., Zeng, P., & Song, J. (2022). Dynamic scene graph generation via temporal prior inference. ACM Int. Conf. Multimedia, 5793\u20135801.","DOI":"10.1145\/3503161.3548324"},{"key":"2577_CR72","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. IEEE Conf. Comput. Vis. Pattern Recog., 7794\u20137803.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"2577_CR73","doi-asserted-by":"crossref","unstructured":"Wu, G., Lin, J., & Silva, C. T. (2021). Era: Entity-relationship aware video summarization with wasserstein gan. Brit. Mach. Vis. Conf., 1\u201314.","DOI":"10.5244\/C.35.415"},{"key":"2577_CR74","doi-asserted-by":"crossref","unstructured":"Wu, G., Lin, J., & Silva, C. T. (2022). Intentvizor: Towards generic query guided interactive video summarization. IEEE Conf. Comput. Vis. Pattern Recog., 10503\u201310512.","DOI":"10.1109\/CVPR52688.2022.01025"},{"issue":"11","key":"2577_CR75","doi-asserted-by":"publisher","first-page":"3347","DOI":"10.1109\/TNNLS.2019.2891244","volume":"30","author":"L Wu","year":"2019","unstructured":"Wu, L., Wang, Y., Shao, L., & Wang, M. (2019). 3-d personvlad: Learning deep global representations for video-based person reidentification. IEEE Trans. Neural Net. and Learn. Syst., 30(11), 3347\u20133359.","journal-title":"IEEE Trans. Neural Net. and Learn. 
Syst."},{"key":"2577_CR76","doi-asserted-by":"crossref","unstructured":"Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., & Feichtenhofer, C. (2021). Videoclip: Contrastive pre-training for zero-shot video-text understanding. ACL Conf. Emp. Meth. Natural Lang. Process., 6787\u20136800.","DOI":"10.18653\/v1\/2021.emnlp-main.544"},{"key":"2577_CR77","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. AAAI Conf. Art. Intell., 7444\u20137452.","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"2577_CR78","doi-asserted-by":"crossref","unstructured":"Yang, J., Zheng, W.-S., Yang, Q., Chen, Y., & Tian, Q. (2020). Spatial-temporal graph convolutional network for video-based person re-identification. IEEE Conf. Comput. Vis. Pattern Recog., 3289\u20133299.","DOI":"10.1109\/CVPR42600.2020.00335"},{"key":"2577_CR79","doi-asserted-by":"crossref","unstructured":"Ye, F., Pu, S., Zhong, Q., Li, C., Xie, D., & Tang, H. (2020). Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. ACM Int. Conf. Multimedia, 55\u201363.","DOI":"10.1145\/3394171.3413941"},{"key":"2577_CR80","unstructured":"Yeung, S., Fathi, A., & Fei-Fei, L. (2014). Videoset: Video summary evaluation through text. arXiv preprint arXiv:1406.5824."},{"key":"2577_CR81","doi-asserted-by":"crossref","unstructured":"Yu, B., Yin, H., & Zhu, Z. (2018). Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. Int. Joint. Conf. Art. Intell., 3634\u20133640.","DOI":"10.24963\/ijcai.2018\/505"},{"key":"2577_CR82","doi-asserted-by":"crossref","unstructured":"Yuan, Y., Liang, X., Wang, X., Yeung, D.-Y., & Gupta, A. (2017). Temporal dynamic graph lstm for action-driven video object detection. IEEE Int. Conf. Comput. 
Vis., 1801\u20131810.","DOI":"10.1109\/ICCV.2017.200"},{"key":"2577_CR83","doi-asserted-by":"crossref","unstructured":"Yuan, L., Tay, F. E., Li, P., Zhou, L., & Feng, J. (2019). Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. AAAI Conf. Art. Intell., 9143\u20139150.","DOI":"10.1609\/aaai.v33i01.33019143"},{"key":"2577_CR84","doi-asserted-by":"crossref","unstructured":"Zeng, R., Huang, W., Tan, M., Rong, Y., Zhao, P., Huang, J., & Gan, C. (2019). Graph convolutional networks for temporal action localization. IEEE Int. Conf. Comput. Vis., 7094\u20137103.","DOI":"10.1109\/ICCV.2019.00719"},{"issue":"10","key":"2577_CR85","doi-asserted-by":"publisher","first-page":"6209","DOI":"10.1109\/TPAMI.2021.3090167","volume":"44","author":"R Zeng","year":"2021","unstructured":"Zeng, R., Huang, W., Tan, M., Rong, Y., Zhao, P., Huang, J., & Gan, C. (2021). Graph convolutional module for temporal action localization in videos. IEEE Trans. Pattern Anal. Mach. Intell., 44(10), 6209\u20136223.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"2577_CR86","doi-asserted-by":"crossref","unstructured":"Zhang, K., Chao, W.-L., Sha, F., & Grauman, K. (2016). Summary transfer: Exemplar-based subset selection for video summarization. IEEE Conf. Comput. Vis. Pattern Recog., 1059\u20131067.","DOI":"10.1109\/CVPR.2016.120"},{"key":"2577_CR87","doi-asserted-by":"crossref","unstructured":"Zhang, K., Chao, W.-L., Sha, F., & Grauman, K. (2016). Video summarization with long short-term memory. Eur. Conf. Comput. Vis.","DOI":"10.1007\/978-3-319-46478-7_47"},{"key":"2577_CR88","doi-asserted-by":"crossref","unstructured":"Zhang, K., Grauman, K., & Sha, F. (2018). Retrospective encoders for video summarization. Eur. Conf. Comput. Vis., 391\u2013408.","DOI":"10.1007\/978-3-030-01237-3_24"},{"key":"2577_CR89","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Shi, Y., Yuan, C., Li, B., Wang, P., Hu, W., & Zha, Z. (2020). 
Object relational graph with teacher-recommended learning for video captioning. IEEE Conf. Comput. Vis. Pattern Recog., 13278\u201313288.","DOI":"10.1109\/CVPR42600.2020.01329"},{"key":"2577_CR90","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Zhao, Z., Zhao, Y., Wang, Q., Liu, H., & Gao, L. (2020). Where does it exist: Spatio-temporal video grounding for multi-form sentences. IEEE Conf. Comput. Vis. Pattern Recog., 10668\u201310677.","DOI":"10.1109\/CVPR42600.2020.01068"},{"issue":"7","key":"2577_CR91","doi-asserted-by":"publisher","first-page":"1954","DOI":"10.1109\/TNNLS.2018.2875347","volume":"30","author":"X Zhang","year":"2019","unstructured":"Zhang, X., Zhu, Z., Zhao, Y., Chang, D., & Liu, J. (2019). Seeing all from a few: $$\\ell _1$$-norm-induced discriminative prototype selection. IEEE Trans. Neural Net. and Learn. Syst., 30(7), 1954\u20131966.","journal-title":"IEEE Trans. Neural Net. and Learn. Syst."},{"key":"2577_CR92","doi-asserted-by":"crossref","unstructured":"Zhao, B., & Xing, E.P. (2014). Quasi real-time summarization for consumer videos. IEEE Conf. Comput. Vis. Pattern Recog.","DOI":"10.1109\/CVPR.2014.322"},{"key":"2577_CR93","doi-asserted-by":"crossref","unstructured":"Zhao, B., Li, X., & Lu, X. (2017). Hierarchical recurrent neural network for video summarization. ACM Int. Conf. Multimedia, 863\u2013871.","DOI":"10.1145\/3123266.3123328"},{"key":"2577_CR94","doi-asserted-by":"crossref","unstructured":"Zhao, B., Li, X., & Lu, X. (2018). Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. IEEE Conf. Comput. Vis. Pattern Recog., 7405\u20137414.","DOI":"10.1109\/CVPR.2018.00773"},{"issue":"5","key":"2577_CR95","first-page":"2793","volume":"44","author":"B Zhao","year":"2021","unstructured":"Zhao, B., Li, H., Lu, X., & Li, X. (2021). Reconstructive sequence-graph network for video summarization. IEEE Trans. Pattern Anal. Mach. Intell., 44(5), 2793\u20132801.","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"2577_CR96","doi-asserted-by":"crossref","unstructured":"Zhou, K., Qiao, Y., & Xiang, T. (2018). Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. AAAI Conf. Art. Intell., 7582\u20137589.","DOI":"10.1609\/aaai.v32i1.12255"},{"key":"2577_CR97","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Tang, W., Wang, L., Zheng, N., & Hua, G. (2021). Enriching local and global contexts for temporal action localization. IEEE Int. Conf. Comput. Vis., 13516\u201313525.","DOI":"10.1109\/ICCV48922.2021.01326"},{"key":"2577_CR98","doi-asserted-by":"publisher","first-page":"948","DOI":"10.1109\/TIP.2020.3039886","volume":"30","author":"W Zhu","year":"2020","unstructured":"Zhu, W., Lu, J., Li, J., & Zhou, J. (2020). Dsnet: A flexible detect-to-summarize network for video summarization. IEEE Trans. Image Process., 30, 948\u2013962.","journal-title":"IEEE Trans. Image Process."},{"key":"2577_CR99","doi-asserted-by":"crossref","unstructured":"Zwillinger, D., & Kokoska, S. (1999). Crc standard probability and statistics tables and formulae. 
Boca Raton: CRC Press.","DOI":"10.1201\/9780367802417"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02577-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02577-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02577-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T04:03:48Z","timestamp":1764993828000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02577-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,4]]},"references-count":99,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["2577"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02577-2","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"type":"print","value":"0920-5691"},{"type":"electronic","value":"1573-1405"}],"subject":[],"published":{"date-parts":[[2025,9,4]]},"assertion":[{"value":"6 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 August 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 September 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}