{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,18]],"date-time":"2026-02-18T23:22:07Z","timestamp":1771456927194,"version":"3.50.1"},"reference-count":51,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2023,9,15]],"date-time":"2023-09-15T00:00:00Z","timestamp":1694736000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>During the last few years, several technological advances have led to an increase in the creation and consumption of audiovisual multimedia content. Users are overexposed to videos via several social media or video sharing websites and mobile phone applications. For efficient browsing, searching, and navigation across several multimedia collections and repositories, e.g., for finding videos that are relevant to a particular topic or interest, this ever-increasing content should be efficiently described by informative yet concise content representations. A common solution to this problem is the construction of a brief summary of a video, which could be presented to the user, instead of the full video, so that she\/he could then decide whether to watch or ignore the whole video. Such summaries are ideally more expressive than other alternatives, such as brief textual descriptions or keywords. In this work, the video summarization problem is approached as a supervised classification task, which relies on feature fusion of audio and visual data. Specifically, the goal of this work is to generate dynamic video summaries, i.e., compositions of parts of the original video, which include its most essential video segments, while preserving the original temporal sequence. 
This work relies on annotated datasets on a per-frame basis, wherein parts of videos are annotated as being \u201cinformative\u201d or \u201cnoninformative\u201d, with the latter being excluded from the produced summary. The novelties of the proposed approach are, (a) prior to classification, a transfer learning strategy to use deep features from pretrained models is employed. These models have been used as input to the classifiers, making them more intuitive and robust to objectiveness, and (b) the training dataset was augmented by using other publicly available datasets. The proposed approach is evaluated using three datasets of user-generated videos, and it is demonstrated that deep features and data augmentation are able to improve the accuracy of video summaries based on human annotations. Moreover, it is domain independent, could be used on any video, and could be extended to rely on richer feature representations or include other data modalities.<\/jats:p>","DOI":"10.3390\/computers12090186","type":"journal-article","created":{"date-parts":[[2023,9,17]],"date-time":"2023-09-17T06:36:45Z","timestamp":1694932605000},"page":"186","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Video Summarization Based on Feature Fusion and Data Augmentation"],"prefix":"10.3390","volume":"12","author":[{"given":"Theodoros","family":"Psallidas","sequence":"first","affiliation":[{"name":"Department of Informatics & Telecommunications, University of Thessaly, 35100 Lamia, Greece"},{"name":"Institute of Informatics & Telecommunications, National Center for Scientific Research\u2013\u201cDemokritos\u201d, 15310 Athens, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Evaggelos","family":"Spyrou","sequence":"additional","affiliation":[{"name":"Department of Informatics & Telecommunications, University of Thessaly, 35100 Lamia, 
Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,9,15]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Psallidas, T., Koromilas, P., Giannakopoulos, T., and Spyrou, E. (2021). Multimodal summarization of user-generated videos. Appl. Sci., 11.","DOI":"10.3390\/app11115260"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1016\/j.jvcir.2007.04.002","article-title":"Video summarisation: A conceptual framework and survey of the state of the art","volume":"19","author":"Money","year":"2008","journal-title":"J. Vis. Commun. Image Represent."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Chen, B.C., Chen, Y.Y., and Chen, F. (2017, January 4\u20137). Video to Text Summary: Joint Video Summarization and Captioning with Recurrent Neural Networks. Proceedings of the BMVC, London, UK.","DOI":"10.5244\/C.31.118"},{"key":"ref_4","unstructured":"Li, Y., Merialdo, B., Rouvier, M., and Linares, G. (December, January 28). Static and dynamic video summaries. Proceedings of the 19th ACM International Conference on Multimedia, Scottsdale, AZ, USA."},{"key":"ref_5","unstructured":"Lienhart, R., Pfeiffer, S., and Effelsberg, W. (1996, January 17\u201323). The MoCA workbench: Support for creativity in movie content analysis. Proceedings of the Third IEEE International Conference on Multimedia Computing and Systems, Hiroshima, Japan."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1007\/s11042-008-0237-9","article-title":"Concept detection and keyframe extraction using a visual thesaurus","volume":"41","author":"Spyrou","year":"2009","journal-title":"Multimed. Tools Appl."},{"key":"ref_7","unstructured":"Li, Y., Zhang, T., and Tretter, D. (2001). An Overview of Video Abstraction Techniques, Hewlett-Packard Company. 
Technical Report HP-2001-191."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1838","DOI":"10.1109\/JPROC.2021.3117472","article-title":"Video summarization using deep neural networks: A survey","volume":"109","author":"Apostolidis","year":"2021","journal-title":"Proc. IEEE"},{"key":"ref_9","unstructured":"Sen, D., and Raman, B. (2019). Video skimming: Taxonomy and comprehensive survey. arXiv."},{"key":"ref_10","unstructured":"Smith, M.A., and Kanade, T. (1995). Video Skimming for Quick Browsing Based on Audio and Image Characterization, School of Computer Science, Carnegie Mellon University."},{"key":"ref_11","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_12","unstructured":"Song, Y., Vallmitjana, J., Stent, A., and Jaimes, A. (2015, January 7\u201312). Tvsum: Summarizing web videos using titles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Gygli, M., Grabner, H., Riemenschneider, H., and Van Gool, L. (2014, January 6\u201312). Creating summaries from user videos. Proceedings of the Computer Vision\u2014ECCV 2014: 13th European Conference, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10584-0_33"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Wei, H., Ni, B., Yan, Y., Yu, H., Yang, X., and Yao, C. (2018, January 2\u20133). Video summarization via semantic attended networks. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11297"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Ma, Y.F., Lu, L., Zhang, H.J., and Li, M. (2002, January 1\u20136). A user attention model for video summarization. 
Proceedings of the Tenth ACM International Conference on Multimedia, Juan-les-Pins, France.","DOI":"10.1145\/641007.641116"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Mahasseni, B., Lam, M., and Todorovic, S. (2017, January 21\u201326). Unsupervised video summarization with adversarial lstm networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.318"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Otani, M., Nakashima, Y., Rahtu, E., Heikkil\u00e4, J., and Yokoya, N. (2016, January 20\u201324). Video summarization using deep semantic features. Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan.","DOI":"10.1007\/978-3-319-54193-8_23"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/s10844-016-0441-4","article-title":"A video summarization approach based on the emulation of bottom-up mechanisms of visual attention","volume":"49","author":"Jacob","year":"2017","journal-title":"J. Intell. Inf. Syst."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"200","DOI":"10.1016\/j.neucom.2020.04.132","article-title":"Deep attentive and semantic preserving video summarization","volume":"405","author":"Ji","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhang, K., Chao, W.L., Sha, F., and Grauman, K. (2016, January 11\u201314). Video summarization with long short-term memory. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46478-7_47"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zhou, K., Qiao, Y., and Xiang, T. (2018, January 2\u20133). Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. 
Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12255"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"857","DOI":"10.1007\/s11042-016-4300-7","article-title":"VISCOM: A robust video summarization approach using color co-occurrence matrices","volume":"77","author":"Pedrini","year":"2018","journal-title":"Multimed. Tools Appl."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Fajtl, J., Sokeh, H.S., Argyriou, V., Monekosso, D., and Remagnino, P. (2018, January 2\u20136). Summarizing videos with attention. Proceedings of the Computer Vision\u2014ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia. Revised Selected Papers 14.","DOI":"10.1007\/978-3-030-21074-8_4"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Apostolidis, E., Balaouras, G., Mezaris, V., and Patras, I. (December, January 29). Combining global and local attention with positional encoding for video summarization. Proceedings of the 2021 IEEE International Symposium on Multimedia (ISM), Naples, Italy.","DOI":"10.1109\/ISM52913.2021.00045"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Hertel, L., Barth, E., K\u00e4ster, T., and Martinetz, T. (2015, January 12\u201317). Deep convolutional neural networks as generic feature extractors. Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland.","DOI":"10.1109\/IJCNN.2015.7280683"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Giannakopoulos, T. (2015). pyaudioanalysis: An open-source python library for audio signal analysis. PLoS ONE, 10.","DOI":"10.1371\/journal.pone.0144610"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Psallidas, T., Vasilakakis, M.D., Spyrou, E., and Iakovidis, D.K. (2022, January 26\u201329). Multimodal video summarization based on fuzzy similarity features. 
Proceedings of the 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), Nafplio, Greece.","DOI":"10.1109\/IVMSP54334.2022.9816266"},{"key":"ref_28","unstructured":"Viola, P., and Jones, M. (2001, January 8\u201314). Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA."},{"key":"ref_29","unstructured":"Lucas, B.D., and Kanade, T. (1981, January 24\u201328). An iterative image registration technique with an application to stereo vision. Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI \u201981), Vancouver, BC, Canada."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016, January 11\u201314). SSD: Single shot multibox detector. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_31","first-page":"143","article-title":"Transfer learning using vgg-16 with deep convolutional neural network for classifying images","volume":"9","author":"Tammina","year":"2019","journal-title":"Int. J. Sci. Res. Publ. (IJSRP)"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1016\/j.neucom.2018.05.083","article-title":"Deep visual domain adaptation: A survey","volume":"312","author":"Wang","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_33","unstructured":"Yu, W., Yang, K., Bai, Y., Xiao, T., Yao, H., and Rui, Y. (2016, January 20\u201322). Visualizing and comparing AlexNet and VGG using deconvolutional layers. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA."},{"key":"ref_34","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. 
Mach. Learn. Res."},{"key":"ref_35","first-page":"559","article-title":"Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning","volume":"18","author":"Nogueira","year":"2017","journal-title":"J. Mach. Learn. Res."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Chen, T., and Guestrin, C. (2016, January 13\u201317). Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA.","DOI":"10.1145\/2939672.2939785"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Elfeki, M., and Borji, A. (2019, January 7\u201311). Video summarization via actionness ranking. Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA.","DOI":"10.1109\/WACV.2019.00085"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Lebron Casas, L., and Koblents, E. (2018, January 5\u20137). Video summarization with LSTM and deep attention models. Proceedings of the International Conference on Multimedia Modeling, Bangkok, Thailand.","DOI":"10.1007\/978-3-030-05716-9_6"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Zhao, B., Li, X., and Lu, X. (2017, January 23\u201327). Hierarchical recurrent neural network for video summarization. Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123328"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"1709","DOI":"10.1109\/TCSVT.2019.2904996","article-title":"Video summarization with attention-based encoder\u2013decoder networks","volume":"30","author":"Ji","year":"2019","journal-title":"IEEE Trans. Circuits Syst. 
Video Technol."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"577","DOI":"10.1109\/TCSVT.2019.2890899","article-title":"A novel key-frames selection framework for comprehensive video summarization","volume":"30","author":"Huang","year":"2019","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Rochan, M., Ye, L., and Wang, Y. (2018, January 8\u201314). Video summarization using fully convolutional sequence networks. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01258-8_22"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhao, B., Li, X., and Lu, X. (2018, January 18\u201323). Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00773"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"64676","DOI":"10.1109\/ACCESS.2019.2916989","article-title":"Spatiotemporal modeling for video summarization using convolutional recurrent neural network","volume":"7","author":"Yuan","year":"2019","journal-title":"IEEE Access"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Feng, L., Li, Z., Kuang, Z., and Zhang, W. (2018, January 22\u201326). Extractive video summarizer with memory augmented neural networks. Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea.","DOI":"10.1145\/3240508.3240651"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"3629","DOI":"10.1109\/TIE.2020.2979573","article-title":"TTH-RNN: Tensor-train hierarchical recurrent neural network for video summarization","volume":"68","author":"Zhao","year":"2020","journal-title":"IEEE Trans. Ind. 
Electron."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"1765","DOI":"10.1109\/TNNLS.2020.2991083","article-title":"Deep attentive video summarization with distribution consistency learning","volume":"32","author":"Ji","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"107677","DOI":"10.1016\/j.patcog.2020.107677","article-title":"Exploring global diverse attention via pairwise temporal relation for video summarization","volume":"111","author":"Li","year":"2021","journal-title":"Pattern Recognit."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Chu, W.T., and Liu, Y.H. (2019, January 27\u201329). Spatiotemporal modeling and label distribution learning for video summarization. Proceedings of the 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), Kuala Lumpur, Malaysia.","DOI":"10.1109\/MMSP.2019.8901741"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Liu, Y.T., Li, Y.J., Yang, F.E., Chen, S.F., and Wang, Y.C.F. (2019, January 22\u201325). Learning hierarchical self-attention for video summarization. Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan.","DOI":"10.1109\/ICIP.2019.8803639"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Wang, J., Wang, W., Wang, Z., Wang, L., Feng, D., and Tan, T. (2019, January 21\u201325). Stacked memory network for video summarization. 
Proceedings of the 27th ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3350992"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/12\/9\/186\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:51:43Z","timestamp":1760129503000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/12\/9\/186"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,15]]},"references-count":51,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2023,9]]}},"alternative-id":["computers12090186"],"URL":"https:\/\/doi.org\/10.3390\/computers12090186","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,15]]}}}