{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,24]],"date-time":"2025-08-24T01:33:27Z","timestamp":1755999207409,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,7,15]],"date-time":"2021-07-15T00:00:00Z","timestamp":1626307200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"W911NF17-2-0196W911NF17-2-0196"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,7,15]]},"DOI":"10.1145\/3458305.3459598","type":"proceedings-article","created":{"date-parts":[[2021,7,16]],"date-time":"2021-07-16T00:49:47Z","timestamp":1626396587000},"page":"146-158","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["EScALation"],"prefix":"10.1145","author":[{"given":"Bo","family":"Chen","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Klara","family":"Nahrstedt","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,7,15]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Scaling video analytics on constrained edge nodes. arXiv preprint arXiv:1905.13536","author":"Canel Christopher","year":"2019","unstructured":"Christopher Canel , Thomas Kim , Giulio Zhou , Conglong Li , Hyeontaek Lim , David G Andersen , Michael Kaminsky , and Subramanya R Dulloor . 2019. Scaling video analytics on constrained edge nodes. arXiv preprint arXiv:1905.13536 ( 2019 ). Christopher Canel, Thomas Kim, Giulio Zhou, Conglong Li, Hyeontaek Lim, David G Andersen, Michael Kaminsky, and Subramanya R Dulloor. 2019. Scaling video analytics on constrained edge nodes. arXiv preprint arXiv:1905.13536 (2019)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISM.2020.00018"},{"key":"e_1_3_2_1_4_1","volume-title":"Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. 155--168","author":"Yu-Han Chen Tiffany","year":"2015","unstructured":"Tiffany Yu-Han Chen , Lenin Ravindranath , Shuo Deng , Paramvir Bahl , and Hari Balakrishnan . 2015 . Glimpse: Continuous, real-time object recognition on mobile devices . In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. 155--168 . Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. 2015. Glimpse: Continuous, real-time object recognition on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. 155--168."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_3_2_1_6_1","unstructured":"M. Everingham L. Van Gool C. K. I. Williams J. Winn and A. Zisserman. [n. d.]. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http:\/\/www.pascal-network.org\/challenges\/VOC\/voc2012\/workshop\/index.html.  M. Everingham L. Van Gool C. K. I. Williams J. Winn and A. Zisserman. [n. d.]. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http:\/\/www.pascal-network.org\/challenges\/VOC\/voc2012\/workshop\/index.html."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298676"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.620"},{"key":"e_1_3_2_1_13_1","volume-title":"Focus: Querying large video datasets with low latency and low cost. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 269--286.","author":"Hsieh Kevin","year":"2018","unstructured":"Kevin Hsieh , Ganesh Ananthanarayanan , Peter Bodik , Shivaram Venkataraman , Paramvir Bahl , Matthai Philipose , Phillip B Gibbons , and Onur Mutlu . 2018 . Focus: Querying large video datasets with low latency and low cost. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 269--286. Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B Gibbons, and Onur Mutlu. 2018. Focus: Querying large video datasets with low latency and low cost. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 269--286."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.396"},{"key":"e_1_3_2_1_15_1","volume-title":"3D convolutional neural networks for human action recognition","author":"Ji Shuiwang","year":"2012","unstructured":"Shuiwang Ji , Wei Xu , Ming Yang , and Kai Yu. 2012. 3D convolutional neural networks for human action recognition . IEEE transactions on pattern analysis and machine intelligence 35, 1 ( 2012 ), 221--231. Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2012. 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35, 1 (2012), 221--231."},{"key":"e_1_3_2_1_16_1","volume-title":"Noscope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529","author":"Kang Daniel","year":"2017","unstructured":"Daniel Kang , John Emmons , Firas Abuzaid , Peter Bailis , and Matei Zaharia . 2017. Noscope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529 ( 2017 ). Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. Noscope: optimizing neural network queries over video at scale. arXiv preprint arXiv:1703.02529 (2017)."},{"key":"e_1_3_2_1_17_1","volume-title":"You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization. arXiv preprint arXiv:1911.06644","author":"K\u00f6p\u00fckl\u00fc Okan","year":"2019","unstructured":"Okan K\u00f6p\u00fckl\u00fc , Xiangyu Wei , and Gerhard Rigoll . 2019. You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization. arXiv preprint arXiv:1911.06644 ( 2019 ). Okan K\u00f6p\u00fckl\u00fc, Xiangyu Wei, and Gerhard Rigoll. 2019. You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization. arXiv preprint arXiv:1911.06644 (2019)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46466-4_50"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3405874"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.10.011"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"e_1_3_2_1_24_1","unstructured":"NVIDIA. 2016. NVIDIA TITAN X Graphics Card for VR Gaming. https:\/\/www.nvidia.com\/en-us\/geforce\/products\/10series\/titan-x-pascal\/  NVIDIA. 2016. NVIDIA TITAN X Graphics Card for VR Gaming. https:\/\/www.nvidia.com\/en-us\/geforce\/products\/10series\/titan-x-pascal\/"},{"key":"e_1_3_2_1_25_1","unstructured":"NVIDIA. 2017. TITAN Xp Graphics Card with Pascal Architecture. https:\/\/www.nvidia.com\/en-us\/titan\/titan-xp\/  NVIDIA. 2017. TITAN Xp Graphics Card with Pascal Architecture. https:\/\/www.nvidia.com\/en-us\/titan\/titan-xp\/"},{"key":"e_1_3_2_1_26_1","unstructured":"NVIDIA. 2020. Jetson Nano Developer Kit. https:\/\/developer.nvidia.com\/embedded\/jetson-nano-developer-kit  NVIDIA. 2020. Jetson Nano Developer Kit. https:\/\/developer.nvidia.com\/embedded\/jetson-nano-developer-kit"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_45"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.690"},{"key":"e_1_3_2_1_30_1","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99.  Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99."},{"key":"e_1_3_2_1_31_1","volume-title":"Philip HS Torr, and Fabio Cuzzolin","author":"Saha Suman","year":"2016","unstructured":"Suman Saha , Gurkirt Singh , Michael Sapienza , Philip HS Torr, and Fabio Cuzzolin . 2016 . Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529 (2016). Suman Saha, Gurkirt Singh, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. 2016. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529 (2016)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0662-8"},{"key":"e_1_3_2_1_33_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems. 568--576.  Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems. 568--576."},{"key":"e_1_3_2_1_34_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 ( 2014 ). Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.393"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.290"},{"volume-title":"Amir Roshan Zamir, and M Shah","year":"2012","key":"e_1_3_2_1_37_1","unstructured":"KhurramSoomro , Amir Roshan Zamir, and M Shah . 2012 . A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2 (2012). KhurramSoomro, Amir Roshan Zamir, and M Shah. 2012. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2 (2012)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.122"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_1_40_1","volume-title":"Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038","author":"Tran Du","year":"2017","unstructured":"Du Tran , Jamie Ray , Zheng Shou , Shih-Fu Chang , and Manohar Paluri . 2017. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 ( 2017 ). Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. 2017. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.29.177"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.362"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01216-8_43"}],"event":{"name":"MMSys '21: 12th ACM Multimedia Systems Conference","sponsor":["SIGMM ACM Special Interest Group on Multimedia","SIGCOMM ACM Special Interest Group on Data Communication","SIGMOBILE ACM Special Interest Group on Mobility of Systems, Users, Data and Computing"],"location":"Istanbul Turkey","acronym":"MMSys '21"},"container-title":["Proceedings of the 12th ACM Multimedia Systems Conference"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458305.3459598","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3458305.3459598","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:39Z","timestamp":1750195479000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458305.3459598"}},"subtitle":["a framework for efficient and scalable spatio-temporal action localization"],"short-title":[],"issued":{"date-parts":[[2021,7,15]]},"references-count":44,"alternative-id":["10.1145\/3458305.3459598","10.1145\/3458305"],"URL":"https:\/\/doi.org\/10.1145\/3458305.3459598","relation":{},"subject":[],"published":{"date-parts":[[2021,7,15]]},"assertion":[{"value":"2021-07-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}