{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T16:40:53Z","timestamp":1777567253464,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":58,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,10,15]],"date-time":"2019-10-15T00:00:00Z","timestamp":1571097600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Natural Science Foundation of China","award":["U1509206"],"award-info":[{"award-number":["U1509206"]}]},{"name":"National Key Research and Development Program of China","award":["2018YFB1004300"],"award-info":[{"award-number":["2018YFB1004300"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,10,15]]},"DOI":"10.1145\/3343031.3351015","type":"proceedings-article","created":{"date-parts":[[2019,10,21]],"date-time":"2019-10-21T16:32:26Z","timestamp":1571675546000},"page":"411-419","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":32,"title":["Embodied One-Shot Video Recognition"],"prefix":"10.1145","author":[{"given":"Yuqian","family":"Fu","sequence":"first","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chengrong","family":"Wang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yanwei","family":"Fu","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yu-Xiong","family":"Wang","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Cong","family":"Bai","sequence":"additional","affiliation":[{"name":"Zhejiang University of Technology (ZJUT), Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiangyang","family":"Xue","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yu-Gang","family":"Jiang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,10,15]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Alexey Dosovitskiy, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir.","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson , Angel Chang , Devendra Singh Chaplot , Alexey Dosovitskiy, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. 2018 . On evaluation of embodied navigation agents. In ECCV . Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. 2018. On evaluation of embodied navigation agents. In ECCV ."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis action recognition? a new model and the kinetics dataset. In CVPR .  Joao Carreira and Andrew Zisserman. 2017. Quo vadis action recognition? a new model and the kinetics dataset. In CVPR .","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"crossref","unstructured":"Angel Chang Angela Dai Thomas Funkhouser Maciej Halber Matthias Niessner Manolis Savva Shuran Song Andy Zeng and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV .  
Angel Chang Angela Dai Thomas Funkhouser Maciej Halber Matthias Niessner Manolis Savva Shuran Song Andy Zeng and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV .","DOI":"10.1109\/3DV.2017.00081"},{"key":"e_1_3_2_1_4_1","unstructured":"Xiaojun Chang Yi Yang Alexander G. Hauptmann Eric P. Xing and Yao-Liang Yu. 2015. Semantic concept discovery for large-scale zero-shot event detection. In IJCAI .  Xiaojun Chang Yi Yang Alexander G. Hauptmann Eric P. Xing and Yao-Liang Yu. 2015. Semantic concept discovery for large-scale zero-shot event detection. In IJCAI ."},{"key":"e_1_3_2_1_5_1","volume-title":"Hauptmann","author":"Chang Xiaojun","year":"2016","unstructured":"Xiaojun Chang , Yi Yang , Guodong Long , Chengqi Zhang , and Alexander G . Hauptmann . 2016 . Dynamic concept composition for zero-example event detection. In AAAI . Xiaojun Chang, Yi Yang, Guodong Long, Chengqi Zhang, and Alexander G. Hauptmann. 2016. Dynamic concept composition for zero-example event detection. In AAAI ."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Xinlei Chen and C Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In CVPR .  Xinlei Chen and C Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In CVPR .","DOI":"10.1109\/CVPR.2015.7298856"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Zitian Chen Yanwei Fu Yu-Xiong Wang Lin Ma Wei Liu and Martial Hebert. 2018. Image deformation meta-network for one-shot learning. In CVPR .  Zitian Chen Yanwei Fu Yu-Xiong Wang Lin Ma Wei Liu and Martial Hebert. 2018. Image deformation meta-network for one-shot learning. In CVPR .","DOI":"10.1109\/CVPR.2019.00888"},{"key":"e_1_3_2_1_8_1","volume-title":"Multi-level semantic feature augmentation for one-shot learning. 
TIP","author":"Chen Zitian","year":"2019","unstructured":"Zitian Chen , Yanwei Fu , Yinda Zhang , Yu-Gang Jiang , Xiangyang Xue , and Leonid Sigal . 2019. Multi-level semantic feature augmentation for one-shot learning. TIP ( 2019 ). Zitian Chen, Yanwei Fu, Yinda Zhang, Yu-Gang Jiang, Xiangyang Xue, and Leonid Sigal. 2019. Multi-level semantic feature augmentation for one-shot learning. TIP (2019)."},{"key":"e_1_3_2_1_9_1","volume-title":"Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell.","author":"Donahue Jeffrey","year":"2015","unstructured":"Jeffrey Donahue , Lisa Anne Hendricks , Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015 . Long-term recurrent convolutional networks for visual recognition and description. In CVPR . Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In CVPR ."},{"key":"e_1_3_2_1_10_1","volume-title":"Daniel Cremers, and Thomas Brox.","author":"Dosovitskiy Alexey","year":"2015","unstructured":"Alexey Dosovitskiy , Philipp Fischer , Eddy Ilg , Philip Hausser , Caner Hazirbas , Vladimir Golkov , Patrick Van Der Smagt , Daniel Cremers, and Thomas Brox. 2015 . Flownet : Learning optical flow with convolutional networks. In ICCV . Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. 2015. Flownet: Learning optical flow with convolutional networks. In ICCV ."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Li Fei-Fei Rob Fergus and Pietro Perona. 2003. A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV .  Li Fei-Fei Rob Fergus and Pietro Perona. 2003. A Bayesian approach to unsupervised one-shot learning of object categories. 
In ICCV .","DOI":"10.1109\/ICCV.2003.1238476"},{"key":"e_1_3_2_1_12_1","volume-title":"One-shot learning of object categories. PAMI","author":"Fei-Fei Li","year":"2006","unstructured":"Li Fei-Fei , Rob Fergus , and Pietro Perona . 2006. One-shot learning of object categories. PAMI ( 2006 ). Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. PAMI (2006)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Christoph Feichtenhofer Axel Pinz and Andrew Zisserman. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR .  Christoph Feichtenhofer Axel Pinz and Andrew Zisserman. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR .","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Basura Fernando Efstratios Gavves Jose M Oramas Amir Ghodrati and Tinne Tuytelaars. 2015. Modeling video evolution for action recognition. In CVPR .  Basura Fernando Efstratios Gavves Jose M Oramas Amir Ghodrati and Tinne Tuytelaars. 2015. Modeling video evolution for action recognition. In CVPR .","DOI":"10.1109\/CVPR.2015.7299176"},{"key":"e_1_3_2_1_15_1","unstructured":"Chelsea Finn Pieter Abbeel and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML .  Chelsea Finn Pieter Abbeel and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML ."},{"key":"e_1_3_2_1_16_1","volume-title":"Learning multimodal latent attributes. PAMI","author":"Fu Yanwei","year":"2014","unstructured":"Yanwei Fu , Timothy M Hospedales , Tao Xiang , and Shaogang Gong . 2014. Learning multimodal latent attributes. PAMI ( 2014 ). Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. 2014. Learning multimodal latent attributes. 
PAMI (2014)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Ross Girshick Jeff Donahue Trevor Darrell and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR .  Ross Girshick Jeff Donahue Trevor Darrell and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR .","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Amir Habibian Thomas Mensink and Cees Snoek. 2014. VideoStory: A new multimedia embedding for few-example recognition and translation of events. In ACM MM .  Amir Habibian Thomas Mensink and Cees Snoek. 2014. VideoStory: A new multimedia embedding for few-example recognition and translation of events. In ACM MM .","DOI":"10.1145\/2647868.2654913"},{"key":"e_1_3_2_1_19_1","volume-title":"Proc. TRECvid .","author":"Inoue Nakamasa","year":"2009","unstructured":"Nakamasa Inoue , Shanshan Hao , Tatsuhiko Saito , and Koichi Shinoda . 2009 . Titgt at TRECVID 2009 workshop . In Proc. TRECvid . Nakamasa Inoue, Shanshan Hao, Tatsuhiko Saito, and Koichi Shinoda. 2009. Titgt at TRECVID 2009 workshop. In Proc. TRECvid ."},{"key":"e_1_3_2_1_20_1","volume-title":"3D convolutional neural networks for human action recognition. PAMI","author":"Ji Shuiwang","year":"2013","unstructured":"Shuiwang Ji , Wei Xu , Ming Yang , and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. PAMI ( 2013 ). Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. PAMI (2013)."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2823900"},{"key":"e_1_3_2_1_22_1","volume-title":"Beyond Vicary's fantasies: The impact of subliminal priming and brand choice. JESP","author":"Karremans Johan C","year":"2006","unstructured":"Johan C Karremans , Wolfgang Stroebe , and Jasper Claus . 2006. 
Beyond Vicary's fantasies: The impact of subliminal priming and brand choice. JESP ( 2006 ). Johan C Karremans, Wolfgang Stroebe, and Jasper Claus. 2006. Beyond Vicary's fantasies: The impact of subliminal priming and brand choice. JESP (2006)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Alexander Klaser Marcin Marsza\u0142ek and Cordelia Schmid. 2008. A spatio-temporal descriptor based on 3d-gradients. In BMVC .  Alexander Klaser Marcin Marsza\u0142ek and Cordelia Schmid. 2008. A spatio-temporal descriptor based on 3d-gradients. In BMVC .","DOI":"10.5244\/C.22.99"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24471-1_3"},{"key":"e_1_3_2_1_25_1","unstructured":"Gregory Koch Richard Zemel and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML .  Gregory Koch Richard Zemel and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML ."},{"key":"e_1_3_2_1_26_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NeurIPS .  Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NeurIPS ."},{"key":"e_1_3_2_1_27_1","volume-title":"Tenenbaum","author":"Lake Brenden M.","year":"2011","unstructured":"Brenden M. Lake , Ruslan Salakhutdinov , Jason Gross , and Joshua B . Tenenbaum . 2011 . One shot learning of simple visual concepts. In CogSci . Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. 2011. One shot learning of simple visual concepts. In CogSci ."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Ivan Laptev. 2005. On space-time interest points. In ICCV .  Ivan Laptev. 2005. On space-time interest points. 
In ICCV .","DOI":"10.1007\/s11263-005-1838-7"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Ivan Laptev Marcin Marszalek Cordelia Schmid and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In CVPR .  Ivan Laptev Marcin Marszalek Cordelia Schmid and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In CVPR .","DOI":"10.1109\/CVPR.2008.4587756"},{"key":"e_1_3_2_1_30_1","volume-title":"SSD: Single shot multibox detector. In ECCV .","author":"Liu Wei","year":"2016","unstructured":"Wei Liu , Dragomir Anguelov , Dumitru Erhan , Christian Szegedy , Scott Reed , Cheng-Yang Fu , and Alexander C Berg . 2016 . SSD: Single shot multibox detector. In ECCV . Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In ECCV ."},{"key":"e_1_3_2_1_31_1","volume-title":"Visualizing data using t-SNE . JMLR","author":"van der Maaten Laurens","year":"2008","unstructured":"Laurens van der Maaten and Geoffrey Hinton . 2008. Visualizing data using t-SNE . JMLR ( 2008 ). Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE . JMLR (2008)."},{"key":"e_1_3_2_1_32_1","volume-title":"Viola","author":"Miller Erik G.","year":"2000","unstructured":"Erik G. Miller , Nicholas E. Matsakis , and Paul A . Viola . 2000 . Learning from one example through shared densities on transforms. In CVPR . Erik G. Miller, Nicholas E. Matsakis, and Paul A. Viola. 2000. Learning from one example through shared densities on transforms. In CVPR ."},{"key":"e_1_3_2_1_33_1","volume-title":"M Shiva Krishna Reddy, S Arulkumar, Piyush Rai, and Anurag Mittal.","author":"Mishra Ashish","year":"2018","unstructured":"Ashish Mishra , Vinay Kumar Verma , M Shiva Krishna Reddy, S Arulkumar, Piyush Rai, and Anurag Mittal. 2018 . A generative approach to zero-shot and few-shot action recognition. In WACV . 
Ashish Mishra, Vinay Kumar Verma, M Shiva Krishna Reddy, S Arulkumar, Piyush Rai, and Anurag Mittal. 2018. A generative approach to zero-shot and few-shot action recognition. In WACV ."},{"key":"e_1_3_2_1_34_1","volume-title":"Subliminal advertising: What you see is what you get. Journal of marketing","author":"Moore Timothy E","year":"1982","unstructured":"Timothy E Moore . 1982. Subliminal advertising: What you see is what you get. Journal of marketing ( 1982 ). Timothy E Moore. 1982. Subliminal advertising: What you see is what you get. Journal of marketing (1982)."},{"key":"e_1_3_2_1_35_1","volume-title":"TRECVID 2011 -- An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2011 .","author":"Over Paul","unstructured":"Paul Over , George Awad , Martial Michel , Jon Fiscus , Wessel Kraaij , and Alan F. Smeaton . 2011 . TRECVID 2011 -- An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2011 . Paul Over, George Awad, Martial Michel, Jon Fiscus, Wessel Kraaij, and Alan F. Smeaton. 2011. TRECVID 2011 -- An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2011 ."},{"key":"e_1_3_2_1_36_1","volume-title":"Unrealcv: Connecting computer vision to unreal engine. In ECCV .","author":"Qiu Weichao","year":"2016","unstructured":"Weichao Qiu and Alan Yuille . 2016 . Unrealcv: Connecting computer vision to unreal engine. In ECCV . Weichao Qiu and Alan Yuille. 2016. Unrealcv: Connecting computer vision to unreal engine. In ECCV ."},{"key":"e_1_3_2_1_37_1","unstructured":"Zhaofan Qiu Ting Yao and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV .  Zhaofan Qiu Ting Yao and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV ."},{"key":"e_1_3_2_1_38_1","unstructured":"Craig Quiter and Maik Ernst. 2018. Deepdrive\/deepdrive: 2.0.  
Craig Quiter and Maik Ernst. 2018. Deepdrive\/deepdrive: 2.0."},{"key":"e_1_3_2_1_39_1","volume-title":"Introspection and subliminal perception. Phenomenology and the cognitive sciences","author":"Rams\u00f8y Thomas Zo\u00ebga","year":"2004","unstructured":"Thomas Zo\u00ebga Rams\u00f8y and Morten Overgaard . 2004. Introspection and subliminal perception. Phenomenology and the cognitive sciences ( 2004 ). Thomas Zo\u00ebga Rams\u00f8y and Morten Overgaard. 2004. Introspection and subliminal perception. Phenomenology and the cognitive sciences (2004)."},{"key":"e_1_3_2_1_40_1","unstructured":"Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In ICLR .  Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In ICLR ."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Joseph Redmon Santosh Divvala Ross Girshick and Ali Farhadi. 2016. You only look once: Unified real-time object detection. In CVPR .  Joseph Redmon Santosh Divvala Ross Girshick and Ali Farhadi. 2016. You only look once: Unified real-time object detection. In CVPR .","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Stephan R Richter Vibhav Vineet Stefan Roth and Vladlen Koltun. 2016. Playing for data: Ground truth from computer games. In ECCV .  Stephan R Richter Vibhav Vineet Stefan Roth and Vladlen Koltun. 2016. Playing for data: Ground truth from computer games. In ECCV .","DOI":"10.1007\/978-3-319-46475-6_7"},{"key":"e_1_3_2_1_43_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In NeurIPS .  Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In NeurIPS ."},{"key":"e_1_3_2_1_44_1","volume-title":"Zemel","author":"Snell Jake","year":"2017","unstructured":"Jake Snell , Kevin Swersky , and Richard S . Zemel . 2017 . 
Prototypical networks for few-shot learning. In NeurIPS . Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. In NeurIPS ."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"crossref","unstructured":"Shuran Song Fisher Yu Andy Zeng Angel X. Chang Manolis Savva and Thomas Funkhouser. 2017. Semantic scene completion from a single depth image. In CVPR .  Shuran Song Fisher Yu Andy Zeng Angel X. Chang Manolis Savva and Thomas Funkhouser. 2017. Semantic scene completion from a single depth image. In CVPR .","DOI":"10.1109\/CVPR.2017.28"},{"key":"e_1_3_2_1_46_1","volume-title":"Amir Roshan Zamir, and Mubarak Shah","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro , Amir Roshan Zamir, and Mubarak Shah . 2012 . UCF101: A dataset of 101 human action classes from videos in the wild. CRCV ( 2012). Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV (2012)."},{"key":"e_1_3_2_1_47_1","volume-title":"The new data and new challenges in multimedia research. Commun. ACM","author":"Thomee Bart","year":"2016","unstructured":"Bart Thomee , David A Shamma , Gerald Friedland , Benjamin Elizalde , Karl Ni , Douglas Poland , Damian Borth , and Li-Jia Li. 2016. The new data and new challenges in multimedia research. Commun. ACM ( 2016 ). Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. The new data and new challenges in multimedia research. Commun. ACM (2016)."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresani and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In ICCV .  Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresani and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. 
In ICCV .","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_1_49_1","unstructured":"Oriol Vinyals Charles Blundell Timothy Lillicrap Koray Kavukcuoglu and Daan Wierstra. 2016. Matching networks for one shot learning. In NeurIPS .  Oriol Vinyals Charles Blundell Timothy Lillicrap Koray Kavukcuoglu and Daan Wierstra. 2016. Matching networks for one shot learning. In NeurIPS ."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Oriol Vinyals Alexander Toshev Samy Bengio and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR .  Oriol Vinyals Alexander Toshev Samy Bengio and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR .","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"crossref","unstructured":"Yu-Xiong Wang Ross Girshick Martial Hebert and Bharath Hariharan. 2018. Low-shot learning from imaginary data. In CVPR .  Yu-Xiong Wang Ross Girshick Martial Hebert and Bharath Hariharan. 2018. Low-shot learning from imaginary data. In CVPR .","DOI":"10.1109\/CVPR.2018.00760"},{"key":"e_1_3_2_1_52_1","unstructured":"Yu-Xiong Wang and Martial Hebert. 2016a. Learning from small sample sets by combining unsupervised meta-training with CNNs. In NeurIPS .  Yu-Xiong Wang and Martial Hebert. 2016a. Learning from small sample sets by combining unsupervised meta-training with CNNs. In NeurIPS ."},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"crossref","unstructured":"Yu-Xiong Wang and Martial Hebert. 2016b. Learning to learn: Model regression networks for easy small sample learning. In ECCV .  Yu-Xiong Wang and Martial Hebert. 2016b. Learning to learn: Model regression networks for easy small sample learning. In ECCV .","DOI":"10.1007\/978-3-319-46466-4_37"},{"key":"e_1_3_2_1_54_1","unstructured":"Yu-Xiong Wang Deva Ramanan and Martial Hebert. 2017. Learning to model the tail. In NeurIPS .  Yu-Xiong Wang Deva Ramanan and Martial Hebert. 2017. Learning to model the tail. 
In NeurIPS ."},{"key":"e_1_3_2_1_55_1","unstructured":"Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron Courville Ruslan Salakhudinov Rich Zemel and Yoshua Bengio. 2015. Show attend and tell: Neural image caption generation with visual attention. In ICML .  Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron Courville Ruslan Salakhudinov Rich Zemel and Yoshua Bengio. 2015. Show attend and tell: Neural image caption generation with visual attention. In ICML ."},{"key":"e_1_3_2_1_56_1","volume-title":"Proc TRECvid .","author":"Chen Ming","year":"2009","unstructured":"Ming yu Chen , Huan Li , and Alexander Hauptmann . 2009 . Informedia @ TRECVID 2009: Analyzing video motions . In Proc TRECvid . Ming yu Chen, Huan Li, and Alexander Hauptmann. 2009. Informedia @ TRECVID 2009: Analyzing video motions. In Proc TRECvid ."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"crossref","unstructured":"Joe Yue-Hei Ng Matthew Hausknecht Sudheendra Vijayanarasimhan Oriol Vinyals Rajat Monga and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In CVPR .  Joe Yue-Hei Ng Matthew Hausknecht Sudheendra Vijayanarasimhan Oriol Vinyals Rajat Monga and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In CVPR .","DOI":"10.1109\/CVPR.2015.7299101"},{"key":"e_1_3_2_1_58_1","unstructured":"Linchao Zhu and Yi Yang. 2018. Compound memory networks for few-shot video classification. In ECCV .  Linchao Zhu and Yi Yang. 2018. Compound memory networks for few-shot video classification. 
In ECCV ."}],"event":{"name":"MM '19: The 27th ACM International Conference on Multimedia","location":"Nice France","acronym":"MM '19","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 27th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3343031.3351015","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3343031.3351015","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:13:11Z","timestamp":1750201991000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3343031.3351015"}},"subtitle":["Learning from Actions of a Virtual Embodied Agent"],"short-title":[],"issued":{"date-parts":[[2019,10,15]]},"references-count":58,"alternative-id":["10.1145\/3343031.3351015","10.1145\/3343031"],"URL":"https:\/\/doi.org\/10.1145\/3343031.3351015","relation":{},"subject":[],"published":{"date-parts":[[2019,10,15]]},"assertion":[{"value":"2019-10-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}