{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T17:22:55Z","timestamp":1772644975560,"version":"3.50.1"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2021,11,30]],"date-time":"2021-11-30T00:00:00Z","timestamp":1638230400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,11,30]],"date-time":"2021-11-30T00:00:00Z","timestamp":1638230400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["No. 2018R1C1B6007230"],"award-info":[{"award-number":["No. 2018R1C1B6007230"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J of Soc Robotics"],"published-print":{"date-parts":[[2023,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Robot vision provides the most important information to robots so that they can read the context and interact with human partners successfully. Moreover, to allow humans recognize the robot\u2019s visual understanding during human-robot interaction (HRI), the best way is for the robot to provide an explanation of its understanding in natural language. In this paper, we propose a new approach by which to interpret robot vision from an egocentric standpoint and generate descriptions to explain egocentric videos particularly for HRI. Because robot vision equals to egocentric video on the robot\u2019s side, it contains as much egocentric view information as exocentric view information. Thus, we propose a new dataset, referred to as the global, action, and interaction (GAI) dataset, which consists of egocentric video clips and GAI descriptions in natural language to represent both egocentric and exocentric information. The encoder-decoder based deep learning model is trained based on the GAI dataset and its performance on description generation assessments is evaluated. We also conduct experiments in actual environments to verify whether the GAI dataset and the trained deep learning model can improve a robot vision system\n<\/jats:p>","DOI":"10.1007\/s12369-021-00842-1","type":"journal-article","created":{"date-parts":[[2021,11,30]],"date-time":"2021-11-30T19:23:45Z","timestamp":1638300225000},"page":"631-641","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":20,"title":["Video Captioning Based on Both Egocentric and Exocentric Views of Robot Vision for Human-Robot Interaction"],"prefix":"10.1007","volume":"15","author":[{"given":"Soo-Han","family":"Kang","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8391-6898","authenticated-orcid":false,"given":"Ji-Hyeong","family":"Han","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,11,30]]},"reference":[{"key":"842_CR1","unstructured":"Kong Yu, Fu Yun (2018) Human action recognition and prediction: A survey. 
arXiv preprint arXiv:1806.11230"},{"issue":"1","key":"842_CR2","doi-asserted-by":"publisher","first-page":"101","DOI":"10.1007\/s10846-015-0259-2","volume":"82","author":"D McColl","year":"2016","unstructured":"McColl D, Hong A, Hatakeyama N, Nejat G, Benhabib B (2016) A survey of autonomous human affect detection methods for social robots engaged in natural hri. J Intell Robot Syst 82(1):101\u2013133","journal-title":"J Intell Robot Syst"},{"key":"842_CR3","doi-asserted-by":"crossref","unstructured":"Ji Yanli, Yang Yang, Shen Fumin (2019) Heng Tao Shen, and Xuelong Li. A survey of human action analysis in hri applications, IEEE Transactions on Circuits and Systems for Video Technology","DOI":"10.1109\/TCSVT.2019.2912988"},{"key":"842_CR4","doi-asserted-by":"crossref","unstructured":"Lunghi Giacomo, Marin Raul, Di\u00a0Castro Mario, Masi Alessandro, Sanz Pedro\u00a0J |(2019) Multimodal human-robot interface for accessible remote robotic interventions in hazardous environments. IEEE Access, 7:127290\u2013127319","DOI":"10.1109\/ACCESS.2019.2939493"},{"key":"842_CR5","doi-asserted-by":"crossref","unstructured":"Ruiz Ariel Y\u00a0Ramos, Rivera Luis J\u00a0Figueroa, Chandrasekaran Balasubramaniyan (2019) A sensor fusion based robotic system architecture using human interaction for motion control. In: 2019 IEEE 9th annual computing and communication workshop and conference (CCWC), pages 0095\u20130100. IEEE","DOI":"10.1109\/CCWC.2019.8666526"},{"key":"842_CR6","doi-asserted-by":"crossref","unstructured":"Vasquez Dizan, Stein Proc\u00f3pio, Rios-Martinez Jorge, Escobedo Arturo, Spalanzani Anne, Laugier Christian (2013) Human aware navigation for assistive robotics. In: experimental robotics, pages 449\u2013462. Springer","DOI":"10.1007\/978-3-319-00065-7_31"},{"key":"842_CR7","doi-asserted-by":"crossref","unstructured":"Marques Francisco, Gon\u00e7alves Duarte, Barata Jos\u00e9, Santana Pedro (2017) Human-aware navigation for autonomous mobile robots for intra-factory logistics. In: international workshop on symbiotic interaction, pages 79\u201385. Springer","DOI":"10.1007\/978-3-319-91593-7_9"},{"key":"842_CR8","doi-asserted-by":"crossref","unstructured":"Moghadas M, Moradi, H (2018) Analyzing human-robot interaction using machine vision for autism screening. In: 2018 6th RSI international conference on robotics and mechatronics (IcRoM), pages 572\u2013576. IEEE","DOI":"10.1109\/ICRoM.2018.8657569"},{"key":"842_CR9","doi-asserted-by":"crossref","unstructured":"Liu Miao, Tang Siyu, Li Yin, Rehg James\u00a0M (2020) Forecasting human-object interaction: joint prediction of motor attention and actions in first person video. In: European conference on computer vision, pages 704\u2013721. Springer","DOI":"10.1007\/978-3-030-58452-8_41"},{"key":"842_CR10","doi-asserted-by":"crossref","unstructured":"Nguyen Anh, Kanoulas Dimitrios, Muratore Luca, Caldwell Darwin\u00a0G, Tsagarakis Nikos\u00a0G (2018) Translating videos to commands for robotic manipulation with deep recurrent neural networks. In: 2018 IEEE international conference on robotics and automation (ICRA), pages 1\u20139. IEEE","DOI":"10.1109\/ICRA.2018.8460857"},{"issue":"2","key":"842_CR11","doi-asserted-by":"publisher","first-page":"841","DOI":"10.1109\/LRA.2018.2793345","volume":"3","author":"Cascianelli Silvia","year":"2018","unstructured":"Silvia Cascianelli, Gabriele Costante, Ciarfuglia Thomas A, Paolo Valigi, Fravolini Mario L (2018) Full-gru natural language video description for service robotics applications. 
IEEE Robot Autom Lett 3(2):841\u2013848","journal-title":"IEEE Robot Autom Lett"},{"key":"842_CR12","doi-asserted-by":"crossref","unstructured":"Venugopalan Subhashini, Rohrbach Marcus, Donahue Jeffrey, Mooney Raymond, Darrell Trevor, Saenko Kate (2015)Sequence to sequence-video to text. In: proceedings of the IEEE international conference on computer vision, pages 4534\u20134542","DOI":"10.1109\/ICCV.2015.515"},{"key":"842_CR13","unstructured":"Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan\u00a0N, Kaiser \u0141ukasz, Polosukhin Illia (2017) Attention is all you need. In: advances in neural information processing systems, pages 5998\u20136008"},{"key":"842_CR14","doi-asserted-by":"crossref","unstructured":"Papineni Kishore, Roukos Salim, Ward Todd, Zhu Wei-Jing (2002) Bleu: a method for automatic evaluation of machine translation. In: proceedings of the 40th annual meeting of the association for computational linguistics, pages 311\u2013318","DOI":"10.3115\/1073083.1073135"},{"issue":"7","key":"842_CR15","first-page":"2631","volume":"49","author":"B Yi","year":"2018","unstructured":"Yi B, Yang Y, Fumin S, Ning X, Tao SH, Xuelong L (2018) Describing video with attention-based bidirectional lstm. IEEE Trans Cybernet 49(7):2631\u20132641","journal-title":"IEEE Trans Cybernet"},{"key":"842_CR16","doi-asserted-by":"crossref","unstructured":"Li Xuelong, Zhao Bin, Lu Xiaoqiang, et\u00a0al (2017) Mam-rnn: Multi-level attention model based rnn for video captioning. In: IJCAI, p 2208\u20132214","DOI":"10.24963\/ijcai.2017\/307"},{"key":"842_CR17","doi-asserted-by":"crossref","unstructured":"Bin Yi, Yang Yang, Shen Fumin, Xu Xing, Shen Heng\u00a0Tao (2016) Bidirectional long-short term memory for video description. In: proceedings of the 24th ACM international conference on Multimedia, p 436\u2013440","DOI":"10.1145\/2964284.2967258"},{"key":"842_CR18","doi-asserted-by":"crossref","unstructured":"Fang K, Zhou L, Jin C, Zhang Y, Weng K, Zhang T, Fan W (2019) Fully convolutional video captioning with coarse-to-fine and inherited attention. In: proceedings of the AAAI conference on artificial intelligence 33:8271\u20138278","DOI":"10.1609\/aaai.v33i01.33018271"},{"key":"842_CR19","doi-asserted-by":"crossref","unstructured":"Liu Sheng, Ren Zhou, Yuan Junsong. (2020)Sibnet: Sibling convolutional encoder for video captioning. IEEE Trans Pattern Analy Mach Intell","DOI":"10.1109\/TPAMI.2019.2940007"},{"key":"842_CR20","doi-asserted-by":"crossref","unstructured":"Fan Chenyou, Crandall David\u00a0J (2016) Deepdiary: Automatically captioning lifelogging image streams. In: European conference on computer vision, pp 459\u2013473. Springer","DOI":"10.1007\/978-3-319-46604-0_33"},{"key":"842_CR21","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1016\/j.jvcir.2017.11.022","volume":"50","author":"M Bola\u00f1os","year":"2018","unstructured":"Bola\u00f1os M, Peris \u00c1, Casacuberta F, Soler S, Radeva P (2018) Egocentric video description based on temporally-linked sequences. J Vis Commun Image Represent 50:205\u2013216","journal-title":"J Vis Commun Image Represent"},{"key":"842_CR22","doi-asserted-by":"crossref","unstructured":"Tran Du, Wang Heng, Torresani Lorenzo, Ray Jamie, LeCun Yann, Paluri Manohar (2018) A closer look at spatiotemporal convolutions for action recognition. 
In: proceedings of the IEEE conference on computer vision and pattern recognition, p 6450\u20136459","DOI":"10.1109\/CVPR.2018.00675"},{"key":"842_CR23","doi-asserted-by":"crossref","unstructured":"Lin Ji, Gan Chuang, Han Song (2019) Tsm: Temporal shift module for efficient video understanding. In: proceedings of the IEEE international conference on computer vision, p 7083\u20137093","DOI":"10.1109\/ICCV.2019.00718"},{"key":"842_CR24","doi-asserted-by":"crossref","unstructured":"Wang Bairui, Ma Lin, Zhang Wei, Liu Wei (2018) Reconstruction network for video captioning. In: proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), June","DOI":"10.1109\/CVPR.2018.00795"},{"issue":"9","key":"842_CR25","doi-asserted-by":"publisher","first-page":"2045","DOI":"10.1109\/TMM.2017.2729019","volume":"19","author":"G Lianli","year":"2017","unstructured":"Lianli G, Zhao G, Zhang Hanwang X, Shen HT (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Trans Multimedia 19(9):2045\u20132055","journal-title":"IEEE Trans Multimedia"},{"issue":"11","key":"842_CR26","doi-asserted-by":"publisher","first-page":"5600","DOI":"10.1109\/TIP.2018.2855422","volume":"27","author":"Y Yang","year":"2018","unstructured":"Yang Y, Jie Z, Jiangbo A, Yi B, Alan H, Tao SH, Yanli J (2018) Video captioning by adversarial lstm. IEEE Trans Image Process 27(11):5600\u20135611","journal-title":"IEEE Trans Image Process"},{"key":"842_CR27","doi-asserted-by":"crossref","unstructured":"Ryoo MS, Fuchs Thomas\u00a0J, Xia Lu, Aggarwal Jake\u00a0K, Matthies Larry (2015) Robot-centric activity prediction from first-person videos: What will they do to me? In: 2015 10th ACM\/IEEE international conference on human-robot interaction (HRI), p 295\u2013302. IEEE","DOI":"10.1145\/2696454.2696462"},{"issue":"1","key":"842_CR28","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1109\/TPAMI.2015.2430335","volume":"38","author":"S Koppula Hema","year":"2015","unstructured":"Koppula Hema S, Ashutosh S (2015) Anticipating human activities using object affordances for reactive robotic response. IEEE Trans Pattern Anal Mach Intell 38(1):14\u201329","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"842_CR29","doi-asserted-by":"crossref","unstructured":"Lee Yong\u00a0Jae, Ghosh Joydeep, Grauman Kristen (2012) Discovering important people and objects for egocentric video summarization. In: 2012 IEEE conference on computer vision and pattern recognition. p 1346\u20131353. IEEE","DOI":"10.1109\/CVPR.2012.6247820"},{"key":"842_CR30","doi-asserted-by":"crossref","unstructured":"Lu Zheng, Grauman Kristen (2013) Story-driven summarization for egocentric video. In: proceedings of the IEEE conference on computer vision and pattern recognition, p 2714\u20132721","DOI":"10.1109\/CVPR.2013.350"},{"key":"842_CR31","doi-asserted-by":"crossref","unstructured":"Fathi Alireza, Ren Xiaofeng, Rehg James\u00a0M (2011) Learning to recognize objects in egocentric activities. In: CVPR 2011, p 3281\u20133288. IEEE","DOI":"10.1109\/CVPR.2011.5995444"},{"key":"842_CR32","doi-asserted-by":"crossref","unstructured":"Li Yin, Ye Zhefan, Rehg James\u00a0M (2015)Delving into egocentric actions. In: proceedings of the IEEE conference on computer vision and pattern recognition, pages 287\u2013295","DOI":"10.1109\/CVPR.2015.7298625"},{"key":"842_CR33","doi-asserted-by":"crossref","unstructured":"Fathi Alireza, Li Yin, Rehg James\u00a0M (2012) Learning to recognize daily actions using gaze. 
In: European conference on computer vision, p 314\u2013327. Springer","DOI":"10.1007\/978-3-642-33718-5_23"},{"key":"842_CR34","unstructured":"Torre Fernando De\u00a0la, Hodgins Jessica, Bargteil Adam, Martin Xavier, Macey Justin, Collado Alex, Beltran Pep (2008) Guide to the carnegie mellon university multimodal activity (cmu-mmac) database"},{"key":"842_CR35","doi-asserted-by":"crossref","unstructured":"Ryoo Michael\u00a0S, Matthies Larry (2013) First-person activity recognition: What are they doing to me? In: proceedings of the IEEE conference on computer vision and pattern recognition, p 2730\u20132737","DOI":"10.1109\/CVPR.2013.352"},{"key":"842_CR36","doi-asserted-by":"crossref","unstructured":"Alletto Stefano, Serra Giuseppe, Calderara Simone, Solera Francesco, Cucchiara Rita (2014) From ego to nos-vision: detecting social relationships in first-person views. In: proceedings of the IEEE conference on computer vision and pattern recognition workshops, p 580\u2013585","DOI":"10.1109\/CVPRW.2014.91"},{"key":"842_CR37","doi-asserted-by":"crossref","unstructured":"Song Sibo, Chandrasekhar Vijay, Cheung Ngai-Man, Narayan Sanath, Li Liyuan, Lim Joo-Hwee (2014) Activity recognition in egocentric life-logging videos. In: Asian conference on computer vision, p 445\u2013458. Springer","DOI":"10.1007\/978-3-319-16634-6_33"},{"key":"842_CR38","doi-asserted-by":"crossref","unstructured":"Damen Dima, Doughty Hazel, Farinella Giovanni\u00a0Maria, Fidler Sanja, Furnari Antonino, Kazakos Evangelos (2018) Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In: European conference on computer vision (ECCV)","DOI":"10.1007\/978-3-030-01225-0_44"}],"container-title":["International Journal of Social Robotics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s12369-021-00842-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s12369-021-00842-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s12369-021-00842-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,4,6]],"date-time":"2023-04-06T12:59:02Z","timestamp":1680785942000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s12369-021-00842-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,30]]},"references-count":38,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,4]]}},"alternative-id":["842"],"URL":"https:\/\/doi.org\/10.1007\/s12369-021-00842-1","relation":{},"ISSN":["1875-4791","1875-4805"],"issn-type":[{"value":"1875-4791","type":"print"},{"value":"1875-4805","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,11,30]]},"assertion":[{"value":"15 October 2021","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 November 2021","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflicts of interest to report regarding the present 
study.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of interest"}}]}}