{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T20:29:24Z","timestamp":1776889764793,"version":"3.51.2"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2020,11,30]],"date-time":"2020-11-30T00:00:00Z","timestamp":1606694400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"PRIN 2017 project \u201cPREVUE - PRediction of activities and Events by Vision in an Urban Environment.\u201d"},{"name":"NVIDIA Corporation with the donation of the Titan XP GPU"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2020,11,30]]},"abstract":"<jats:p>\n            In this article, we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task, since it can be valuable for a wide range of interaction applications. To this end, we introduce a novel approach, named ProgressNet, capable of predicting\n            <jats:italic>when<\/jats:italic>\n            an action takes place in a video,\n            <jats:italic>where<\/jats:italic>\n            it is located within the frames, and\n            <jats:italic>how far<\/jats:italic>\n            it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. 
Motivated by the recent success obtained from the interaction of Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make framewise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.\n          <\/jats:p>","DOI":"10.1145\/3402447","type":"journal-article","created":{"date-parts":[[2020,12,17]],"date-time":"2020-12-17T17:49:26Z","timestamp":1608227366000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Am I Done? Predicting Action Progress in Videos"],"prefix":"10.1145","volume":"16","author":[{"given":"Federico","family":"Becattini","sequence":"first","affiliation":[{"name":"University of Florence, Florence, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tiberio","family":"Uricchio","sequence":"additional","affiliation":[{"name":"University of Florence, Florence, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4816-0268","authenticated-orcid":false,"given":"Lorenzo","family":"Seidenari","sequence":"additional","affiliation":[{"name":"University of Florence, Florence, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lamberto","family":"Ballan","sequence":"additional","affiliation":[{"name":"University of Padova, Padova, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alberto Del","family":"Bimbo","sequence":"additional","affiliation":[{"name":"University of Florence, Florence, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,12,17]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","unstructured":"J. K. 
Aggarwal and M. S. Ryoo. 2011. Human activity analysis: A review. Comput. Surveys 43 3 (2011) 16:1--16:43.","DOI":"10.1145\/1922649.1922653"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.39"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0010-9452(08)70388-5"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.675"},{"key":"e_1_2_1_5_1","volume-title":"Aspect: An Introduction to the Study of Verbal Aspect and Related Problems.","author":"Comrie Bernard","year":"1976","unstructured":"Bernard Comrie. 1976. Aspect: An Introduction to the Study of Verbal Aspect and Related Problems. Vol. 2. Cambridge University Press."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_17"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00190"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2019.102886"},{"key":"e_1_2_1_9_1","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition. 961--970","author":"Fabian Caba Heilbron Bernard Ghanem","year":"2015","unstructured":"Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition.
961--970."},{"key":"e_1_2_1_10_1","volume-title":"International Journal of Computer Vision","author":"Ferm\u00fcller Cornelia","year":"2016","unstructured":"Cornelia Ferm\u00fcller, Fang Wang, Yezhou Yang, Konstantinos Zampogiannis, Yi Zhang, Francisco Barranco, and Michael Pfeiffer. 2016. Prediction of manipulation actions. International Journal of Computer Vision (2016), 1--17."},{"key":"e_1_2_1_11_1","volume-title":"Johansson","author":"Randall Flanagan J.","year":"2003","unstructured":"J. Randall Flanagan and Roland S. Johansson. 2003. Action plans used in action observation. Nature 424, 6950 (2003), 769."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.65"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.392"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298676"},{"key":"e_1_2_1_15_1","volume-title":"Artificial Intelligence and Statistics Conference. 249--256","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics Conference. 249--256."},{"key":"e_1_2_1_16_1","volume-title":"Human action forecasting by learning task grammars. arXiv preprint arXiv:1412.6980","author":"Han Tengda","year":"2017","unstructured":"Tengda Han, Jue Wang, Anoop Cherian, and Stephen Gould. 2017.
Human action forecasting by learning task grammars. arXiv preprint arXiv:1412.6980 (2017)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10578-9_23"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.30.142"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the British Machine Vision Conference (BMVC).","author":"Heidarivincheh Farnoosh","year":"2018","unstructured":"Farnoosh Heidarivincheh, Majid Mirmehdi, and Dima Damen. 2018. Action completion: A temporal model for moment detection. In Proceedings of the British Machine Vision Conference (BMVC)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00150"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.211"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0683-3"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.620"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2016.10.018"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 3192--3199","author":"Jhuang Hueihan","unstructured":"Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision.
3192--3199."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.472"},{"key":"e_1_2_1_27_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik","year":"2014","unstructured":"Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.390"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.79"},{"key":"e_1_2_1_30_1","volume-title":"Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 3, Article 73 (Sept.","author":"Li Xinyu","year":"2017","unstructured":"Xinyu Li, Yanyi Zhang, Jianyu Zhang, Moliang Zhou, Shuhong Chen, Yue Gu, Yueyang Chen, Ivan Marsic, Richard A. Farneth, and Randall S. Burd. 2017. Progress estimation and phase detection for sequential processes. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 3, Article 73 (Sept. 2017), 20 pages. DOI:https:\/\/doi.org\/10.1145\/3130936"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00587"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00139"},{"key":"e_1_2_1_33_1","volume-title":"Berg","author":"Liu Wei","year":"2016","unstructured":"Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016.
Ssd: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21--37."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.478"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00043"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.214"},{"key":"e_1_2_1_37_1","first-page":"2579","article-title":"Visualizing data using t-SNE","author":"van der Maaten Laurens","year":"2008","unstructured":"Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579--2605.","journal-title":"Journal of Machine Learning Research 9"},{"key":"e_1_2_1_38_1","volume-title":"Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440","author":"Mathieu Michael","year":"2015","unstructured":"Michael Mathieu, Camille Couprie, and Yann LeCun. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015)."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.314"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2019.00354"},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 5502--5511","author":"Nguyen Phuc Xuan","unstructured":"Phuc Xuan Nguyen, Deva Ramanan, and Charless C. Fowlkes. 2019.
Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision. 5502--5511."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1098\/rstb.2011.0123"},{"key":"e_1_2_1_43_1","unstructured":"A. Patra and J. A. Noble. 2018. Sequential anatomy localization in fetal echocardiography videos. arXiv preprint arXiv:1810.11868 (2018)."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_45"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2009.11.014"},{"key":"e_1_2_1_46_1","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91--99."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.30.58"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.155"},{"key":"e_1_2_1_49_1","volume-title":"IEEE International Conference on Computer Vision. 2137--2146","author":"Sigurdsson G. A.","unstructured":"G. A. Sigurdsson, O. Russakovsky, and A. Gupta. 2017.
What actions are needed for understanding human actions in videos? In IEEE International Conference on Computer Vision. 2137--2146."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.393"},{"key":"e_1_2_1_51_1","volume-title":"UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir R. Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.530"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1098\/rstb.2011.0295"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMI.2018.2878055"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.2307\/2182371"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.18"},{"key":"e_1_2_1_57_1","unstructured":"Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems. 613--621."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.291"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.362"},{"key":"e_1_2_1_60_1","volume-title":"A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716","author":"Xiong Yuanjun","year":"2017","unstructured":"Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, and Xiaoou Tang. 2017.
A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716 (2017)."},{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 5532--5541","author":"Xu Mingze","unstructured":"Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S. Davis, and David J. Crandall. 2019. Temporal recurrent networks for online action detection. In Proceedings of the IEEE International Conference on Computer Vision. 5532--5541."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.293"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298735"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.342"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.317"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.619"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3402447","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3402447","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:41:34Z","timestamp":1750200094000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3402447"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,11,30]]},"references-count":66,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,11,30]]}},"alternative-id":["10.1145\/3402447"],"URL":"https:\/\/doi.org\/10.1145\/3402447","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,11,30]]},"assertion":[{"value":"2020-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-12-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}