{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,17]],"date-time":"2026-06-17T05:21:00Z","timestamp":1781673660182,"version":"3.54.5"},"reference-count":52,"publisher":"World Scientific Pub Co Pte Ltd","issue":"12","funder":[{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["FRF-BD-20-11A"],"award-info":[{"award-number":["FRF-BD-20-11A"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Int. J. Patt. Recogn. Artif. Intell."],"published-print":{"date-parts":[[2022,9,30]]},"abstract":"<jats:p> Multi-modal dense video captioning is a task using multiple information to detect all meaningful events and generate a textual description for each event. The existing works mainly rely on single visual or dual audio-visual modals in dense video captioning, while completely ignoring the text modal (subtitle). The text modal has a similar data structure as the video captions, which provides immediate semantic information to the content description for a video. In this paper, we propose a novel framework, called Two-Stage Cross-Modal Encoding Transformer Network (TS-CMETN), to realize the multi-modal dense video captioning task by fusing multiple features, including audio, visual, and text. First, we design a two-stage feature fusion encoder that hierarchically achieves the intra- and inter-modal information interaction. Second, we propose an anchor-free temporal event proposal module, which efficiently generates event proposals at each time step without the complex anchor calculation. Extensive experiments on the ActivityNet Captions dataset show that our proposed framework achieves high performance. Moreover, our approach can adaptively handle cases of the missing text modal. Our code and data are available at https:\/\/github.com\/xieyulai\/TM-CMETN . <\/jats:p>","DOI":"10.1142\/s021800142255014x","type":"journal-article","created":{"date-parts":[[2022,6,21]],"date-time":"2022-06-21T08:43:37Z","timestamp":1655801017000},"source":"Crossref","is-referenced-by-count":4,"title":["Tri-Modal Dense Video Captioning Based on Fine-Grained Aligned Text and Anchor-Free Event Proposals Generator"],"prefix":"10.1142","volume":"36","author":[{"given":"Jingjing","family":"Niu","sequence":"first","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, P. R. China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yulai","family":"Xie","sequence":"additional","affiliation":[{"name":"Hitachi China Research Laboratory, Beijing, P. R. China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Hitachi China Research Laboratory, Beijing, P. R. China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jinyu","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, P. R. China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yanfei","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, P. R. China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiao","family":"Lei","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, P. R. China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2251-9220","authenticated-orcid":false,"given":"Fang","family":"Ren","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, P. R. China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"219","published-online":{"date-parts":[[2022,8,15]]},"reference":[{"key":"S021800142255014XBIB001","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01277"},{"key":"S021800142255014XBIB002","author":"Ahmed K.","year":"2017","journal-title":"Artif. Intell."},{"key":"S021800142255014XBIB003","volume-title":"Computer Vision and Pattern Recognition (CVPR)","author":"Aytar Y.","year":"2017"},{"key":"S021800142255014XBIB004","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.675"},{"key":"S021800142255014XBIB005","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"S021800142255014XBIB006","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3348"},{"key":"S021800142255014XBIB007","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-896"},{"key":"S021800142255014XBIB008","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2599174"},{"key":"S021800142255014XBIB009","first-page":"3059","volume-title":"Proc. Advances in Neural Information Processing Systems","author":"Duan X.","year":"2018"},{"key":"S021800142255014XBIB010","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46487-9_47"},{"key":"S021800142255014XBIB011","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_13"},{"key":"S021800142255014XBIB012","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2729019"},{"key":"S021800142255014XBIB013","first-page":"22605","volume-title":"Neural Information Processing Systems (NeurIPS)","author":"Ging S.","year":"2020"},{"key":"S021800142255014XBIB014","first-page":"2672","volume-title":"Proc. Int. Conf. Neural Information Processing Systems","author":"Goodfellow I. J.","year":"2014"},{"key":"S021800142255014XBIB015","doi-asserted-by":"publisher","DOI":"10.1109\/ICARM.2019.8834066"},{"key":"S021800142255014XBIB016","volume-title":"Proc. IEEE Conf. Computer Vision and Pattern Recognition: IEEE","author":"Hao W.","year":"2017"},{"key":"S021800142255014XBIB017","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"S021800142255014XBIB018","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"S021800142255014XBIB019","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2017.8268968"},{"key":"S021800142255014XBIB020","volume-title":"Proc. IEEE Conf. Computer Vision and Pattern Recognition: IEEE","author":"Iashin V.","year":"2020"},{"key":"S021800142255014XBIB021","first-page":"4117","volume-title":"Proc. IEEE\/CVF Conf. Computer Vision and Pattern Recognition Workshops","author":"Iashin V.","year":"2020"},{"key":"S021800142255014XBIB022","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2879642"},{"key":"S021800142255014XBIB023","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00675"},{"key":"S021800142255014XBIB024","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"S021800142255014XBIB025","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01088"},{"key":"S021800142255014XBIB026","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.117"},{"key":"S021800142255014XBIB027","first-page":"311","volume-title":"Proc. Annual Meeting on Association for Computational Linguistics","author":"Papineni K.","year":"2002"},{"key":"S021800142255014XBIB028","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00676"},{"key":"S021800142255014XBIB029","first-page":"8026","volume-title":"Advances in Neural Information Processing Systems","author":"Paszke A.","year":"2019"},{"key":"S021800142255014XBIB030","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00854"},{"key":"S021800142255014XBIB031","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"S021800142255014XBIB032","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00900"},{"key":"S021800142255014XBIB033","first-page":"1","volume-title":"Proc. IEEE Conf. Computer Vision and Pattern Recognition","author":"Redmon J.","year":"2018"},{"key":"S021800142255014XBIB034","first-page":"184","volume-title":"German Conf. Pattern Recognition","author":"Senina A.","year":"2014"},{"key":"S021800142255014XBIB035","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1641"},{"key":"S021800142255014XBIB036","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00931"},{"key":"S021800142255014XBIB037","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"S021800142255014XBIB038","volume-title":"Proc. IEEE Conf. Computer Vision and Pattern Recognition: IEEE","author":"Tian Y.","year":"2018"},{"key":"S021800142255014XBIB039","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00972"},{"key":"S021800142255014XBIB040","first-page":"6000","volume-title":"Proc. 31st Int. Conf. Neural Information Processing Systems","author":"Vaswani A.","year":"2017"},{"key":"S021800142255014XBIB041","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"S021800142255014XBIB042","volume-title":"Proc. IEEE Conf. Computer Vision and Pattern Recognition: IEEE","author":"Venugopalan S.","year":"2014"},{"key":"S021800142255014XBIB043","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00751"},{"key":"S021800142255014XBIB044","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00443"},{"key":"S021800142255014XBIB045","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2125"},{"key":"S021800142255014XBIB046","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_29"},{"key":"S021800142255014XBIB047","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123448"},{"key":"S021800142255014XBIB048","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3002669"},{"key":"S021800142255014XBIB049","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2924576"},{"key":"S021800142255014XBIB050","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.496"},{"key":"S021800142255014XBIB051","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_29"},{"key":"S021800142255014XBIB052","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00911"}],"container-title":["International Journal of Pattern Recognition and Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S021800142255014X","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,10,17]],"date-time":"2022-10-17T10:19:24Z","timestamp":1666001964000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/10.1142\/S021800142255014X"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,15]]},"references-count":52,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2022,9,30]]}},"alternative-id":["10.1142\/S021800142255014X"],"URL":"https:\/\/doi.org\/10.1142\/s021800142255014x","relation":{},"ISSN":["0218-0014","1793-6381"],"issn-type":[{"value":"0218-0014","type":"print"},{"value":"1793-6381","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,8,15]]},"article-number":"2255014"}}