{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T18:07:39Z","timestamp":1775326059682,"version":"3.50.1"},"reference-count":45,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2023,2,24]],"date-time":"2023-02-24T00:00:00Z","timestamp":1677196800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,2,24]],"date-time":"2023-02-24T00:00:00Z","timestamp":1677196800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Major Research plan of the National Social Science Foundation of China","award":["20 &ZD130"],"award-info":[{"award-number":["20 &ZD130"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2023,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Dense video captioning (DVC) aims at generating description for each scene in a video. Despite attractive progress for this task, previous works usually only concentrate on exploiting visual features while neglecting audio information in the video, resulting in inaccurate scene event location. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. Besides, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of generated captions with both heterogeneous prior knowledge and entities\u2019 association reasoning achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on ActivityNet Captions dataset, the results demonstrate that our model achieves better performance than state-of-the-art methods. 
To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of different modules.<\/jats:p>","DOI":"10.1007\/s40747-023-00998-5","type":"journal-article","created":{"date-parts":[[2023,2,24]],"date-time":"2023-02-24T03:03:56Z","timestamp":1677207836000},"page":"4995-5012","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":27,"title":["Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph"],"prefix":"10.1007","volume":"9","author":[{"given":"Shixing","family":"Han","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7249-698X","authenticated-orcid":false,"given":"Jin","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Jinyingming","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Peizhu","family":"Gong","sequence":"additional","affiliation":[]},{"given":"Xiliang","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Huihua","family":"He","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,2,24]]},"reference":[{"key":"998_CR1","doi-asserted-by":"publisher","unstructured":"Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015) Sequence to sequence-video to text. In: Proceedings of the IEEE international conference on computer vision. pp 4534\u20134542. https:\/\/doi.org\/10.1109\/ICCV.2015.515","DOI":"10.1109\/ICCV.2015.515"},{"key":"998_CR2","doi-asserted-by":"publisher","unstructured":"Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K (2014) Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. https:\/\/doi.org\/10.3115\/v1\/N15-1173","DOI":"10.3115\/v1\/N15-1173"},{"key":"998_CR3","doi-asserted-by":"publisher","unstructured":"Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A (2015) Describing videos by exploiting temporal structure. In: Proceedings of the IEEE international conference on computer vision. pp 4507\u20134515. https:\/\/doi.org\/10.1109\/ICCV.2015.512","DOI":"10.1109\/ICCV.2015.512"},{"key":"998_CR4","doi-asserted-by":"publisher","unstructured":"Yu H, Wang J, Huang Z, Yang Y, Xu W (2016) Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 4584\u20134593. https:\/\/doi.org\/10.1109\/CVPR.2016.496","DOI":"10.1109\/CVPR.2016.496"},{"key":"998_CR5","doi-asserted-by":"publisher","unstructured":"Krishna R, Hata K, Ren F, Fei-Fei L, Niebles JC (2017) Dense-captioning events in videos. In: Proceedings of the IEEE international conference on computer vision. pp 706\u2013715. https:\/\/doi.org\/10.1109\/ICCV.2017.83","DOI":"10.1109\/ICCV.2017.83"},{"key":"998_CR6","doi-asserted-by":"crossref","unstructured":"Escorcia V, Caba Heilbron F, Niebles JC, Ghanem B (2016) Daps: deep action proposals for action understanding. In: European conference on computer vision. Springer, pp 768\u2013784","DOI":"10.1007\/978-3-319-46487-9_47"},{"key":"998_CR7","doi-asserted-by":"publisher","unstructured":"Li Y, Yao T, Pan Y, Chao H, Mei T (2018) Jointly localizing and describing events for dense video captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 7492\u20137500.
https:\/\/doi.org\/10.1109\/CVPR.2018.00782","DOI":"10.1109\/CVPR.2018.00782"},{"key":"998_CR8","doi-asserted-by":"publisher","unstructured":"Lin T, Zhao X, Shou Z (2017) Single shot temporal action detection. In: Proceedings of the 25th ACM international conference on multimedia. pp 988\u2013996. https:\/\/doi.org\/10.1145\/3123266.3123343","DOI":"10.1145\/3123266.3123343"},{"key":"998_CR9","doi-asserted-by":"publisher","unstructured":"Wang J, Jiang W, Ma L, Liu W, Xu Y (2018) Bidirectional attentive fusion with context gating for dense video captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 7190\u20137198. https:\/\/doi.org\/10.1109\/CVPR.2018.00751","DOI":"10.1109\/CVPR.2018.00751"},{"key":"998_CR10","unstructured":"Duan X, Huang W, Gan C, Wang J, Zhu W, Huang J (2018) Weakly supervised dense event captioning in videos. arXiv preprint arXiv:1812.03849"},{"key":"998_CR11","doi-asserted-by":"publisher","unstructured":"Aafaq N, Akhtar N, Liu W, Gilani SZ, Mian A (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 12487\u201312496. https:\/\/doi.org\/10.1109\/CVPR.2019.01277","DOI":"10.1109\/CVPR.2019.01277"},{"key":"998_CR12","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30"},{"key":"998_CR13","doi-asserted-by":"publisher","unstructured":"Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B et al (2017) Cnn architectures for large-scale audio classification. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 131\u2013135. https:\/\/doi.org\/10.1109\/ICASSP.2017.7952132","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"998_CR14","doi-asserted-by":"publisher","unstructured":"Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6299\u20136308. https:\/\/doi.org\/10.1109\/CVPR.2017.502","DOI":"10.1109\/CVPR.2017.502"},{"key":"998_CR15","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1016\/j.cviu.2017.12.004","volume":"173","author":"S Aditya","year":"2018","unstructured":"Aditya S, Yang Y, Baral C, Aloimonos Y, Ferm\u00fcller C (2018) Image understanding using vision and reasoning through scene description graph. Comput Vis Image Underst 173:33\u201345. https:\/\/doi.org\/10.1016\/j.cviu.2017.12.004","journal-title":"Comput Vis Image Underst"},{"key":"998_CR16","doi-asserted-by":"publisher","unstructured":"Zhou Y, Sun Y, Honavar V (2019) Improving image captioning by leveraging knowledge graphs. In: 2019 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 283\u2013293. https:\/\/doi.org\/10.1109\/WACV.2019.00036","DOI":"10.1109\/WACV.2019.00036"},{"key":"998_CR17","unstructured":"Luo H, Ji L, Shi B, Huang H, Duan N, Li T, Li J, Bharti T, Zhou M (2020) Univl: a unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353"},{"key":"998_CR18","doi-asserted-by":"publisher","unstructured":"Hou J, Wu X, Zhang X, Qi Y, Jia Y, Luo J (2020) Joint commonsense and relation reasoning for image and video captioning. 
In: Proceedings of the AAAI conference on artificial intelligence, vol 34. pp 10973\u201310980. https:\/\/doi.org\/10.1109\/ICSP48669.2020.9321009","DOI":"10.1109\/ICSP48669.2020.9321009"},{"key":"998_CR19","unstructured":"Pearl J, Mackenzie D (2019) The book of why: the new science of cause and effect. Basic Books, New York"},{"key":"998_CR20","doi-asserted-by":"publisher","unstructured":"Iashin V, Rahtu E (2020) Multi-modal dense video captioning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition workshops. pp 958\u2013959. https:\/\/doi.org\/10.1109\/CVPRW50498.2020.00487","DOI":"10.1109\/CVPRW50498.2020.00487"},{"key":"998_CR21","doi-asserted-by":"crossref","unstructured":"Buch S, Escorcia V, Shen C, Ghanem B, Niebles JC (2017) Sst: single-stream temporal action proposals. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 2911\u20132920","DOI":"10.1109\/CVPR.2017.675"},{"issue":"1","key":"998_CR22","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1007\/s00779-007-0165-0","volume":"13","author":"I Maglogiannis","year":"2009","unstructured":"Maglogiannis I, Vouyioukas D, Aggelopoulos C (2009) Face detection and recognition of natural human emotion using Markov random fields. Pers Ubiquitous Comput 13(1):95\u2013101. https:\/\/doi.org\/10.1007\/s00779-007-0165-0","journal-title":"Pers Ubiquitous Comput"},{"issue":"1","key":"998_CR23","doi-asserted-by":"publisher","first-page":"37","DOI":"10.32604\/csse.2021.017230","volume":"39","author":"Z Tang","year":"2021","unstructured":"Tang Z, Liu J, Yu C, Wang K (2021) Cyclic autoencoder for multimodal data alignment using custom datasets. Comput Syst Sci Eng 39(1):37\u201354","journal-title":"Comput Syst Sci Eng"},{"key":"998_CR24","doi-asserted-by":"publisher","unstructured":"Rahman T, Xu B, Sigal L (2019) Watch, listen and tell: multi-modal weakly supervised dense event captioning. In: Proceedings of the IEEE\/CVF international conference on computer vision. pp 8908\u20138917. https:\/\/doi.org\/10.1109\/ICCV.2019.00900","DOI":"10.1109\/ICCV.2019.00900"},{"key":"998_CR25","doi-asserted-by":"crossref","unstructured":"Hessel J, Pang B, Zhu Z, Soricut R (2019) A case study on combining ASR and visual features for generating instructional video captions. arXiv preprint arXiv:1910.02930","DOI":"10.18653\/v1\/K19-1039"},{"key":"998_CR26","doi-asserted-by":"publisher","unstructured":"Ben-Younes H, Cadene R, Cord M, Thome N (2017) Mutan: multimodal tucker fusion for visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp 2612\u20132620. https:\/\/doi.org\/10.1109\/ICCV.2017.285","DOI":"10.1109\/ICCV.2017.285"},{"key":"998_CR27","doi-asserted-by":"publisher","unstructured":"Mun J, Yang L, Ren Z, Xu N, Han B (2019) Streamlined dense video captioning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 6588\u20136597. https:\/\/doi.org\/10.1109\/CVPR.2019.00675","DOI":"10.1109\/CVPR.2019.00675"},{"key":"998_CR28","unstructured":"Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555"},{"key":"998_CR29","doi-asserted-by":"publisher","unstructured":"Wang T, Huang J, Zhang H, Sun Q (2020) Visual commonsense r-cnn. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 10760\u201310770.
https:\/\/doi.org\/10.1109\/CVPR42600.2020.01077","DOI":"10.1109\/CVPR42600.2020.01077"},{"key":"998_CR30","doi-asserted-by":"publisher","first-page":"124688","DOI":"10.1109\/ACCESS.2019.2937353","volume":"7","author":"J Liu","year":"2019","unstructured":"Liu J, Zhang X, Li Y, Wang J, Kim H-J (2019) Deep learning-based reasoning with multi-ontology for iot applications. IEEE Access 7:124688\u2013124701","journal-title":"IEEE Access"},{"key":"998_CR31","doi-asserted-by":"publisher","unstructured":"Zhou H, Young T, Huang M, Zhao H, Xu J, Zhu X (2018) Commonsense knowledge aware conversation generation with graph attention. In: IJCAI. pp 4623\u20134629. https:\/\/doi.org\/10.24963\/ijcai.2018\/643","DOI":"10.24963\/ijcai.2018\/643"},{"key":"998_CR32","doi-asserted-by":"publisher","unstructured":"Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp 1532\u20131543. https:\/\/doi.org\/10.3115\/v1\/D14-1162","DOI":"10.3115\/v1\/D14-1162"},{"key":"998_CR33","doi-asserted-by":"publisher","unstructured":"He K, Gkioxari G, Doll\u00e1r P, Girshick R (2020) Mask r-cnn. IEEE Trans Pattern Anal Mach Intell 42:386\u2013397. https:\/\/doi.org\/10.1109\/TPAMI.2018.2844175","DOI":"10.1109\/TPAMI.2018.2844175"},{"key":"998_CR34","doi-asserted-by":"publisher","unstructured":"Zellers R, Yatskar M, Thomson S, Choi Y (2018) Neural motifs: scene graph parsing with global context. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5831\u20135840. https:\/\/doi.org\/10.1109\/CVPR.2018.00611","DOI":"10.1109\/CVPR.2018.00611"},{"key":"998_CR35","doi-asserted-by":"publisher","unstructured":"Tang K, Niu Y, Huang J, Shi J, Zhang H (2020) Unbiased scene graph generation from biased training. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp 3716\u20133725. https:\/\/doi.org\/10.1109\/CVPR42600.2020.00377","DOI":"10.1109\/CVPR42600.2020.00377"},{"key":"998_CR36","doi-asserted-by":"publisher","first-page":"282","DOI":"10.1016\/j.neucom.2020.04.056","volume":"403","author":"J Liu","year":"2020","unstructured":"Liu J, Yang Y, He H (2020) Multi-level semantic representation enhancement network for relationship extraction. Neurocomputing 403:282\u2013293","journal-title":"Neurocomputing"},{"key":"998_CR37","doi-asserted-by":"publisher","unstructured":"Anderson P, Fernando B, Johnson M, Gould S (2016) Spice: semantic propositional image caption evaluation. In: European conference on computer vision. Springer, pp 382\u2013398. https:\/\/doi.org\/10.1007\/978-3-319-46454-1_24","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"998_CR38","doi-asserted-by":"crossref","unstructured":"Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation. pp 376\u2013380","DOI":"10.3115\/v1\/W14-3348"},{"key":"998_CR39","doi-asserted-by":"crossref","unstructured":"Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics. pp 311\u2013318","DOI":"10.3115\/1073083.1073135"},{"key":"998_CR40","doi-asserted-by":"publisher","unstructured":"Li Y, Yao T, Pan Y, Chao H, Mei T (2018) Jointly localizing and describing events for dense video captioning.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 7492\u20137500. https:\/\/doi.org\/10.1109\/CVPR.2018.00782","DOI":"10.1109\/CVPR.2018.00782"},{"key":"998_CR41","unstructured":"Chadha A, Arora G, Kaloty N (2020) iperceive: applying common-sense reasoning to multi-modal dense video captioning and video question answering. arXiv preprint arXiv:2011.07735"},{"key":"998_CR42","doi-asserted-by":"publisher","unstructured":"Zhou L, Zhou Y, Corso JJ, Socher R, Xiong C (2018) End-to-end dense video captioning with masked transformer. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 8739\u20138748. https:\/\/doi.org\/10.1109\/CVPR.2018.00911","DOI":"10.1109\/CVPR.2018.00911"},{"key":"998_CR43","doi-asserted-by":"crossref","unstructured":"Iashin V, Rahtu E (2020) A better use of audio-visual cues: dense video captioning with bi-modal transformer. arXiv preprint arXiv:2005.08271","DOI":"10.1109\/CVPRW50498.2020.00487"},{"issue":"6","key":"998_CR44","doi-asserted-by":"publisher","first-page":"4554","DOI":"10.1109\/JIOT.2021.3104289","volume":"9","author":"C-H Lu","year":"2021","unstructured":"Lu C-H, Fan G-Y (2021) Environment-aware dense video captioning for iot-enabled edge cameras. IEEE Internet Things J 9(6):4554\u20134564","journal-title":"IEEE Internet Things J"},{"issue":"3","key":"998_CR45","doi-asserted-by":"publisher","first-page":"400","DOI":"10.1214\/aoms\/1177729586","volume":"22","author":"H Robbins","year":"1951","unstructured":"Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(3):400\u2013407. https:\/\/doi.org\/10.1214\/aoms\/1177729586","journal-title":"Ann Math Stat"}],"container-title":["Complex & Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-00998-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-023-00998-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-00998-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,22]],"date-time":"2023-09-22T17:13:53Z","timestamp":1695402833000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-023-00998-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,24]]},"references-count":45,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2023,10]]}},"alternative-id":["998"],"URL":"https:\/\/doi.org\/10.1007\/s40747-023-00998-5","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,24]]},"assertion":[{"value":"6 October 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 February 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 February 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"On behalf of all authors, the
corresponding author states that there is no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}