{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T15:19:09Z","timestamp":1772119149111,"version":"3.50.1"},"reference-count":48,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2024,5,9]],"date-time":"2024-05-09T00:00:00Z","timestamp":1715212800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,9]],"date-time":"2024-05-09T00:00:00Z","timestamp":1715212800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003385","name":"Georg-August-Universit\u00e4t G\u00f6ttingen","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100003385","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Machine Vision and Applications"],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Automatic video description necessitates generating natural language statements that encapsulate the actions, events, and objects within a video. An essential human capability in describing videos is to vary the level of detail, a feature that existing automatic video description methods, which typically generate single, fixed-level detail sentences, often overlook. This work delves into video descriptions of manipulation actions, where varying levels of detail are crucial to conveying information about the hierarchical structure of actions, also pertinent to contemporary robot learning techniques. We initially propose two frameworks: a hybrid statistical model and an end-to-end approach. The hybrid method, requiring significantly less data, statistically models uncertainties within video clips. Conversely, the end-to-end method, more data-intensive, establishes a direct link between the visual encoder and the language decoder, bypassing any statistical processing. Furthermore, we introduce an Integrated Method, aiming to amalgamate the benefits of both the hybrid statistical and end-to-end approaches, enhancing the adaptability and depth of video descriptions across different data availability scenarios. All three frameworks utilize LSTM stacks to facilitate description granularity, allowing videos to be depicted through either succinct single sentences or elaborate multi-sentence narratives. 
Quantitative results demonstrate that these methods produce more realistic descriptions than other competing approaches.<\/jats:p>","DOI":"10.1007\/s00138-024-01547-x","type":"journal-article","created":{"date-parts":[[2024,5,9]],"date-time":"2024-05-09T14:04:58Z","timestamp":1715263498000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Multi sentence description of complex manipulation action videos"],"prefix":"10.1007","volume":"35","author":[{"given":"Fatemeh","family":"Ziaeetabar","sequence":"first","affiliation":[]},{"given":"Reza","family":"Safabakhsh","sequence":"additional","affiliation":[]},{"given":"Saeedeh","family":"Momtazi","sequence":"additional","affiliation":[]},{"given":"Minija","family":"Tamosiunaite","sequence":"additional","affiliation":[]},{"given":"Florentin","family":"W\u00f6rg\u00f6tter","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,9]]},"reference":[{"issue":"6","key":"1547_CR1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3355390","volume":"52","author":"N Aafaq","year":"2019","unstructured":"Aafaq, N., Mian, A., Liu, W., Gilani, S.Z., Shah, M.: Video description: a survey of methods, datasets, and evaluation metrics. ACM Comput. Surv. 52(6), 1\u201337 (2019)","journal-title":"ACM Comput. Surv."},{"key":"1547_CR2","unstructured":"Banerjee, S., Lavie, A.: Meteor: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization, pp. 65\u201372 (2005)"},{"key":"1547_CR3","first-page":"993","volume":"3","author":"DM Blei","year":"2003","unstructured":"Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993\u20131022 (2003)","journal-title":"J. Mach. Learn. Res."},{"key":"1547_CR4","doi-asserted-by":"crossref","unstructured":"Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291\u20137299 (2017)","DOI":"10.1109\/CVPR.2017.143"},{"key":"1547_CR5","doi-asserted-by":"crossref","unstructured":"Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S: 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In: Computer Vision\u2014ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11\u201314, 2016, Proceedings, Part VIII 14, pages 628\u2013644. Springer (2016)","DOI":"10.1007\/978-3-319-46484-8_38"},{"issue":"11","key":"1547_CR6","doi-asserted-by":"publisher","first-page":"4125","DOI":"10.1109\/TPAMI.2020.2991965","volume":"43","author":"D Damen","year":"2020","unstructured":"Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., et al.: The epic-kitchens dataset: collection, challenges and baselines. IEEE Trans. Pattern Anal. Mach. Intell. 43(11), 4125\u20134141 (2020)","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"1547_CR7","unstructured":"Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 
4171\u20134186 (2019)"},{"issue":"1","key":"1547_CR8","doi-asserted-by":"publisher","first-page":"187","DOI":"10.1109\/LRA.2019.2949221","volume":"5","author":"CRG Dreher","year":"2020","unstructured":"Dreher, C.R.G., W\u00e4chter, M., Asfour, T.: Learning object-action relations from bimanual human demonstration using graph networks. IEEE Robot. Autom. Lett. 5(1), 187\u2013194 (2020)","journal-title":"IEEE Robot. Autom. Lett."},{"key":"1547_CR9","doi-asserted-by":"crossref","unstructured":"Dyer, C., Muresan, S., Resnik, P.: Generalizing word lattice translation. Technical Report, Maryland Univ College Park Inst for Advanced Computer Studies (2008)","DOI":"10.21236\/ADA482158"},{"key":"1547_CR10","unstructured":"Fu, T.J., Li, L., Gan, Z., Lin, K., Wang, W.Y., Wang, L., Liu, Z.: Violet: end-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681 (2021)"},{"key":"1547_CR11","doi-asserted-by":"crossref","unstructured":"Fu, T.J., Li, L., Gan, Z., Lin, K., Wang, W.Y., Wang, L., Liu, Z.: An empirical study of end-to-end video-language transformers with masked visual modeling. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 22898\u201322909 (2023)","DOI":"10.1109\/CVPR52729.2023.02193"},{"key":"1547_CR12","doi-asserted-by":"crossref","unstructured":"Gu, X., Chen, G., Wang, Y., Zhang, L., Luo, T., Wen, L.: Text with knowledge graph augmented transformer for video captioning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 18941\u201318951 (2023)","DOI":"10.1109\/CVPR52729.2023.01816"},{"key":"1547_CR13","doi-asserted-by":"crossref","unstructured":"Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2712\u20132719 (2013)","DOI":"10.1109\/ICCV.2013.337"},{"key":"1547_CR14","doi-asserted-by":"crossref","unstructured":"Hanckmann, P., Schutte, K., Burghouts, G.J.: Automated textual descriptions for a wide range of video events with 48 human actions. In: European Conference on Computer Vision, pp. 372\u2013380. Springer (2012)","DOI":"10.1007\/978-3-642-33863-2_37"},{"key":"1547_CR15","doi-asserted-by":"crossref","unstructured":"Ji, J., Krishna, R., Fei-Fei, L., Niebles, J.C.: Action genome: actions as compositions of spatio-temporal scene graphs. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 10236\u201310247 (2020)","DOI":"10.1109\/CVPR42600.2020.01025"},{"key":"1547_CR16","unstructured":"Kim, W., Son, B., Kim, I.: Vilt: Vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583\u20135594. PMLR (2021)"},{"key":"1547_CR17","doi-asserted-by":"crossref","unstructured":"Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et\u00a0al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pp. 
177\u2013180 (2007)","DOI":"10.3115\/1557769.1557821"},{"issue":"2","key":"1547_CR18","doi-asserted-by":"publisher","first-page":"171","DOI":"10.1023\/A:1020346032608","volume":"50","author":"A Kojima","year":"2002","unstructured":"Kojima, A., Tamura, T., Fukunaga, K.: Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. 50(2), 171\u2013184 (2002)","journal-title":"Int. J. Comput. Vis."},{"key":"1547_CR19","doi-asserted-by":"crossref","unstructured":"Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for generating descriptive image paragraphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 317\u2013325 (2017)","DOI":"10.1109\/CVPR.2017.356"},{"key":"1547_CR20","doi-asserted-by":"crossref","unstructured":"Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Carlos Niebles, J.: Dense-captioning events in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 706\u2013715 (2017)","DOI":"10.1109\/ICCV.2017.83"},{"key":"1547_CR21","doi-asserted-by":"crossref","unstructured":"Lin, K., Li, L., Lin, C.C., Ahmed, F., Gan, Z., Liu, Z., Lu, Y., Wang, L.: Swinbert: end-to-end transformers with sparse attention for video captioning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 17949\u201317958 (2022)","DOI":"10.1109\/CVPR52688.2022.01742"},{"key":"1547_CR22","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1016\/j.neucom.2022.07.028","volume":"508","author":"H Luo","year":"2022","unstructured":"Luo, H., et al.: Clip4clip: an empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508, 293\u2013304 (2022)","journal-title":"Neurocomputing"},{"key":"1547_CR23","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2019.102840","volume":"190","author":"M Nabati","year":"2020","unstructured":"Nabati, M., Behrad, A.: Video captioning using boosted and parallel long short-term memory networks. Comput. Vis. Image Underst. 190, 102840 (2020)","journal-title":"Comput. Vis. Image Underst."},{"key":"1547_CR24","doi-asserted-by":"crossref","unstructured":"Nguyen, A., Kanoulas, D., Muratore, L., Caldwell, D.G., Tsagarakis, N.G.: Translating videos to commands for robotic manipulation with deep recurrent neural networks. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3782\u20133788. IEEE (2018)","DOI":"10.1109\/ICRA.2018.8460857"},{"key":"1547_CR25","doi-asserted-by":"publisher","first-page":"126","DOI":"10.1016\/j.cviu.2017.06.012","volume":"163","author":"F Nian","year":"2017","unstructured":"Nian, F., Li, T., Wang, Y., Xinyu, W., Ni, B., Changsheng, X.: Learning explicit video attributes from mid-level representation for video captioning. Comput. Vis. Image Understand. 163, 126\u2013138 (2017)","journal-title":"Comput. Vis. Image Understand."},{"key":"1547_CR26","doi-asserted-by":"crossref","unstructured":"Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1029\u20131038 (2016)","DOI":"10.1109\/CVPR.2016.117"},{"key":"1547_CR27","doi-asserted-by":"crossref","unstructured":"Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y.: Jointly modeling embedding and translation to bridge video and language. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4594\u20134602 (2016)","DOI":"10.1109\/CVPR.2016.497"},{"key":"1547_CR28","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311\u2013318 (2002)","DOI":"10.3115\/1073083.1073135"},{"key":"1547_CR29","doi-asserted-by":"crossref","unstructured":"Park, J.S., Rohrbach, M., Darrell, T., Rohrbach, A.: Adversarial inference for multi-sentence video description. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 6598\u20136608 (2019)","DOI":"10.1109\/CVPR.2019.00676"},{"key":"1547_CR30","unstructured":"Redmon, J., Farhadi, A.: Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)"},{"key":"1547_CR31","doi-asserted-by":"crossref","unstructured":"Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., Schiele, B.: Coherent multi-sentence video description with variable level of detail. In: German Conference on Pattern Recognition, pp. 184\u2013195. Springer (2014)","DOI":"10.1007\/978-3-319-11752-2_15"},{"key":"1547_CR32","doi-asserted-by":"crossref","unstructured":"Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B.: Translating video content to natural language descriptions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 433\u2013440 (2013)","DOI":"10.1109\/ICCV.2013.61"},{"key":"1547_CR33","doi-asserted-by":"crossref","unstructured":"Seo, P.H., et\u00a0al.: End-to-end generative pretraining for multimodal video captioning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (2022)","DOI":"10.1109\/CVPR52688.2022.01743"},{"key":"1547_CR34","unstructured":"Singh, A., Meetei, L.S., Singh, S.M., Singh, T.D., Bandyopadhyay, S.: An efficient keyframes selection based framework for video captioning. In: Proceedings of the 18th International Conference on Natural Language Processing (ICON), pp. 240\u2013250 (2021)"},{"issue":"2","key":"1547_CR35","doi-asserted-by":"publisher","first-page":"391","DOI":"10.1016\/S0022-0000(05)80056-X","volume":"49","author":"K Sugihara","year":"1994","unstructured":"Sugihara, K.: Robust gift wrapping for the three-dimensional convex hull. J. Comput. Syst. Sci. 49(2), 391\u2013407 (1994)","journal-title":"J. Comput. Syst. Sci."},{"key":"1547_CR36","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566\u20134575 (2015)","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"1547_CR37","doi-asserted-by":"crossref","unstructured":"Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4534\u20134542 (2015)","DOI":"10.1109\/ICCV.2015.515"},{"key":"1547_CR38","doi-asserted-by":"crossref","unstructured":"Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. 
arXiv preprint arXiv:1412.4729, (2014)","DOI":"10.3115\/v1\/N15-1173"},{"key":"1547_CR39","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156\u20133164 (2015)","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"1547_CR40","doi-asserted-by":"crossref","unstructured":"Wang, B., Ma, L., Zhang, W., Liu, W.: Reconstruction network for video captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7622\u20137631 (2018)","DOI":"10.1109\/CVPR.2018.00795"},{"key":"1547_CR41","doi-asserted-by":"crossref","unstructured":"Xu, J., Mei, T., Yao, T., Rui, Y.: Msr-vtt: a large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5288\u20135296 (2016)","DOI":"10.1109\/CVPR.2016.571"},{"key":"1547_CR42","doi-asserted-by":"crossref","unstructured":"Yang, A., et\u00a0al.: Vid2seq: large-scale pretraining of a visual language model for dense video captioning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (2023)","DOI":"10.1109\/CVPR52729.2023.01032"},{"key":"1547_CR43","first-page":"67","volume":"3","author":"Y Yang","year":"2014","unstructured":"Yang, Y., Guha, A., Fermuller, C., Aloimonos, Y.: A cognitive system for understanding human manipulation actions. Adv. Cognit. Syst. 3, 67\u201386 (2014)","journal-title":"Adv. Cognit. Syst."},{"key":"1547_CR44","doi-asserted-by":"crossref","unstructured":"Yu, H., Wang, J., Huang, Z., Yang, Y., Xu, W.: Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4584\u20134593 (2016)","DOI":"10.1109\/CVPR.2016.496"},{"key":"1547_CR45","doi-asserted-by":"crossref","unstructured":"Zhang, W., Wang, X.E., Tang, S., Shi, H., Shi, H., Xiao, J., Zhuang, Y., Wang, W.Y.: Relational graph learning for grounded video description generation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 3807\u20133828 (2020)","DOI":"10.1145\/3394171.3413746"},{"key":"1547_CR46","doi-asserted-by":"crossref","unstructured":"Zhou, L., Xu, C., Corso, J.J., Wei, D.: Towards automatic learning of procedures from web instructional videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3692\u20133701 (2018)","DOI":"10.1609\/aaai.v32i1.12342"},{"key":"1547_CR47","doi-asserted-by":"crossref","unstructured":"Ziaeetabar, F., Aksoy, E.E., W\u00f6rg\u00f6tter, F., Tamosiunaite, M.: Semantic analysis of manipulation actions using spatial relations. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4612\u20134619. IEEE (2017)","DOI":"10.1109\/ICRA.2017.7989536"},{"key":"1547_CR48","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1016\/j.robot.2018.10.005","volume":"110","author":"F Ziaeetabar","year":"2018","unstructured":"Ziaeetabar, F., Kulvicius, T., Tamosiunaite, M., W\u00f6rg\u00f6tter, F.: Recognition and prediction of manipulation actions using enriched semantic event chains. Robot. Autonom. Syst. 110, 173\u2013188 (2018)","journal-title":"Robot. Autonom. 
Syst."}],"container-title":["Machine Vision and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00138-024-01547-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00138-024-01547-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00138-024-01547-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,18]],"date-time":"2024-11-18T08:16:00Z","timestamp":1731917760000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00138-024-01547-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,9]]},"references-count":48,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["1547"],"URL":"https:\/\/doi.org\/10.1007\/s00138-024-01547-x","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-3604976\/v1","asserted-by":"object"}]},"ISSN":["0932-8092","1432-1769"],"issn-type":[{"value":"0932-8092","type":"print"},{"value":"1432-1769","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,9]]},"assertion":[{"value":"13 November 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 March 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 April 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 May 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"64"}}
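The end-to-end pipeline sketched in the abstract, a visual encoder feeding stacked LSTM decoders that emit either a terse single sentence or a detailed multi-sentence narrative, can be illustrated in code. The sketch below is an assumption-laden illustration, not the authors' implementation: the module names, feature and hidden sizes, the two-level granularity switch, and the choice of PyTorch are all hypothetical.

    # Minimal sketch (illustrative only) of an end-to-end video describer:
    # a visual encoder linked directly to an LSTM-stack language decoder,
    # with one decoder stack per description-granularity level.
    import torch
    import torch.nn as nn

    class VideoDescriber(nn.Module):
        def __init__(self, feat_dim=2048, hidden=512, vocab=10000, levels=2):
            super().__init__()
            # Visual encoder: project precomputed per-frame features,
            # then mean-pool over time into a single context vector.
            self.proj = nn.Linear(feat_dim, hidden)
            # One 2-layer LSTM stack per granularity level.
            self.decoders = nn.ModuleList(
                [nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
                 for _ in range(levels)]
            )
            self.embed = nn.Embedding(vocab, hidden)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, frames, tokens, level=0):
            # frames: (B, T, feat_dim) frame features; tokens: (B, L) word ids
            # given as teacher-forcing input during training.
            ctx = self.proj(frames).mean(dim=1)        # (B, hidden)
            h0 = ctx.unsqueeze(0).repeat(2, 1, 1)      # init both LSTM layers
            c0 = torch.zeros_like(h0)
            emb = self.embed(tokens)                   # (B, L, hidden)
            dec_out, _ = self.decoders[level](emb, (h0, c0))
            return self.out(dec_out)                   # (B, L, vocab) logits

    # Usage: level=0 -> succinct single sentence, level=1 -> detailed
    # multi-sentence narrative (decoded sentence by sentence at inference).
    model = VideoDescriber()
    logits = model(torch.randn(4, 16, 2048),
                   torch.randint(0, 10000, (4, 12)), level=1)
    print(logits.shape)  # torch.Size([4, 12, 10000])

Under these assumptions, the granularity switch simply selects which LSTM stack decodes the shared visual context; the hybrid and integrated methods of the paper would replace or augment the direct encoder-decoder link with statistical processing of the video clips.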