{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,23]],"date-time":"2026-06-23T01:22:11Z","timestamp":1782177731730,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":59,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,5,6]],"date-time":"2021-05-06T00:00:00Z","timestamp":1620259200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF CAREER Award","award":["1942531"],"award-info":[{"award-number":["1942531"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,5,6]]},"DOI":"10.1145\/3411764.3445347","type":"proceedings-article","created":{"date-parts":[[2021,5,8]],"date-time":"2021-05-08T05:53:19Z","timestamp":1620453199000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":84,"title":["Toward Automatic Audio Description Generation for Accessible Videos"],"prefix":"10.1145","author":[{"given":"Yujia","family":"Wang","sequence":"first","affiliation":[{"name":"Computer Science Beijing Institute of Technology, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wei","family":"Liang","sequence":"additional","affiliation":[{"name":"School of Computer Science Beijing Institute of Technology, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Haikun","family":"Huang","sequence":"additional","affiliation":[{"name":"Computer Science Department George Mason University, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yongqi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Computer Science Department University of Massachusetts Boston, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dingzeyu","family":"Li","sequence":"additional","affiliation":[{"name":"Adobe Research, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lap-Fai","family":"Yu","sequence":"additional","affiliation":[{"name":"Computer Science George Mason University, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,5,7]]},"reference":[{"key":"e_1_3_2_2_1_1","unstructured":"2019. Katna: Tool for automating common vide keyframe extraction and Image Autocrop tasks. https:\/\/katna.readthedocs.io\/.  2019. Katna: Tool for automating common vide keyframe extraction and Image Autocrop tasks. https:\/\/katna.readthedocs.io\/."},{"key":"e_1_3_2_2_2_1","unstructured":"2020. Guidelines for Audio Describers. http:\/\/www.acb.org\/adp\/guidelines.html.  2020. Guidelines for Audio Describers. http:\/\/www.acb.org\/adp\/guidelines.html."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3355390"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"crossref","unstructured":"Relja Arandjelovic and Andrew Zisserman. 2017. Look listen and learn. In ICCV. 609\u2013617.  Relja Arandjelovic and Andrew Zisserman. 2017. Look listen and learn. In ICCV. 609\u2013617.","DOI":"10.1109\/ICCV.2017.73"},{"key":"e_1_3_2_2_5_1","unstructured":"Sanjeev Arora Yingyu Liang and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. (2016).  Sanjeev Arora Yingyu Liang and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings. (2016)."},{"key":"e_1_3_2_2_6_1","volume-title":"Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems. 892\u2013900.","author":"Aytar Yusuf","year":"2016","unstructured":"Yusuf Aytar , Carl Vondrick , and Antonio Torralba . 2016 . Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems. 892\u2013900. Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems. 892\u2013900."},{"key":"e_1_3_2_2_7_1","first-page":"1137","article-title":"A neural probabilistic language model","author":"Bengio Yoshua","year":"2003","unstructured":"Yoshua Bengio , R\u00e9jean Ducharme , Pascal Vincent , and Christian Jauvin . 2003 . A neural probabilistic language model . Journal of machine learning research 3 , Feb (2003), 1137 \u2013 1155 . Yoshua Bengio, R\u00e9jean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research 3, Feb (2003), 1137\u20131155.","journal-title":"Journal of machine learning research 3"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1177\/0145482X9108500307"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"crossref","unstructured":"Sabine Braun. 2011. Creating coherence in audio description. Meta: Journal des traducteurs\/Meta: Translators\u2019 Journal 56 3(2011) 645\u2013662.  Sabine Braun. 2011. Creating coherence in audio description. Meta: Journal des traducteurs\/Meta: Translators\u2019 Journal 56 3(2011) 645\u2013662.","DOI":"10.7202\/1008338ar"},{"key":"e_1_3_2_2_10_1","volume-title":"Consciousness inside and out: Phenomenology, neuroscience, and the nature of experience","author":"Brown Richard","unstructured":"Richard Brown . 2013. Consciousness inside and out: Phenomenology, neuroscience, and the nature of experience . Springer Science & Business Media . Richard Brown. 2013. Consciousness inside and out: Phenomenology, neuroscience, and the nature of experience. Springer Science & Business Media."},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_2_12_1","unstructured":"Xuguang Duan Wenbing Huang Chuang Gan Jingdong Wang Wenwu Zhu and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems. 3059\u20133069.  Xuguang Duan Wenbing Huang Chuang Gan Jingdong Wang Wenwu Zhu and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems. 3059\u20133069."},{"key":"e_1_3_2_2_13_1","first-page":"3","article-title":"Netflix closed captions offer an accessible model for the streaming video industry, but what about audio description?Communication","volume":"47","author":"Ellis Katie","year":"2015","unstructured":"Katie Ellis 2015 . Netflix closed captions offer an accessible model for the streaming video industry, but what about audio description?Communication , Politics & Culture 47 , 3 (2015), 3 . Katie Ellis 2015. Netflix closed captions offer an accessible model for the streaming video industry, but what about audio description?Communication, Politics & Culture 47, 3 (2015), 3.","journal-title":"Politics & Culture"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1177\/0145482X0610000507"},{"key":"e_1_3_2_2_15_1","volume-title":"Seeing with sound: A journey into sight. Retrieved September 21(2002)","author":"Fletcher Pat","year":"2015","unstructured":"Pat Fletcher . 2002. Seeing with sound: A journey into sight. Retrieved September 21(2002) , 2015 . Pat Fletcher. 2002. Seeing with sound: A journey into sight. Retrieved September 21(2002), 2015."},{"key":"e_1_3_2_2_16_1","volume-title":"An introduction to audio description: A practical guide","author":"Fryer Louise","unstructured":"Louise Fryer . 2016. An introduction to audio description: A practical guide . Routledge . Louise Fryer. 2016. An introduction to audio description: A practical guide. Routledge."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1080\/0907676X.2012.693108"},{"key":"e_1_3_2_2_18_1","volume-title":"What in the world do we hear?: An ecological approach to auditory event perception. Ecological psychology 5, 1","author":"Gaver W","year":"1993","unstructured":"William\u00a0 W Gaver . 1993. What in the world do we hear?: An ecological approach to auditory event perception. Ecological psychology 5, 1 ( 1993 ), 1\u201329. William\u00a0W Gaver. 1993. What in the world do we hear?: An ecological approach to auditory event perception. Ecological psychology 5, 1 (1993), 1\u201329."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"crossref","unstructured":"Nicholas\u00a0A Giudice and Gordon\u00a0E Legge. 2008. Blind navigation and the role of technology. The engineering handbook of smart technology for aging disability and independence 8(2008) 479\u2013500.  Nicholas\u00a0A Giudice and Gordon\u00a0E Legge. 2008. Blind navigation and the role of technology. The engineering handbook of smart technology for aging disability and independence 8(2008) 479\u2013500.","DOI":"10.1002\/9780470379424.ch25"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"crossref","unstructured":"Cole Gleason Amy Pavel Himalini Gururaj Kris\u00a0M Kitani and Jefrey\u00a0P Bigham. 2020. Making GIFs Accessible. (2020).  Cole Gleason Amy Pavel Himalini Gururaj Kris\u00a0M Kitani and Jefrey\u00a0P Bigham. 2020. Making GIFs Accessible. (2020).","DOI":"10.1145\/3373625.3417027"},{"key":"e_1_3_2_2_21_1","volume-title":"Making Memes Accessible. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility. 367\u2013376","author":"Gleason Cole","year":"2019","unstructured":"Cole Gleason , Amy Pavel , Xingyu Liu , Patrick Carrington , Lydia\u00a0 B Chilton , and Jeffrey\u00a0 P Bigham . 2019 . Making Memes Accessible. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility. 367\u2013376 . Cole Gleason, Amy Pavel, Xingyu Liu, Patrick Carrington, Lydia\u00a0B Chilton, and Jeffrey\u00a0P Bigham. 2019. Making Memes Accessible. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility. 367\u2013376."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376728"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3174092"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00947"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300851"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_2_2_27_1","volume-title":"The Accessible Netflix Project Advocates Taking Steps to Ensure Netflix Accessibility for Everyone. The Accessible Netflix Project 26","author":"Kingett R","year":"2014","unstructured":"R Kingett . 2014. The Accessible Netflix Project Advocates Taking Steps to Ensure Netflix Accessibility for Everyone. The Accessible Netflix Project 26 ( 2014 ). R Kingett. 2014. The Accessible Netflix Project Advocates Taking Steps to Ensure Netflix Accessibility for Everyone. The Accessible Netflix Project 26 (2014)."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00782"},{"key":"e_1_3_2_2_30_1","volume-title":"An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations. 1\u201316","author":"Logeswaran Lajanugen","year":"2018","unstructured":"Lajanugen Logeswaran and Honglak Lee . 2018 . An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations. 1\u201316 . Lajanugen Logeswaran and Honglak Lee. 2018. An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations. 1\u201316."},{"key":"e_1_3_2_2_31_1","volume-title":"Audio description\u2013seeing theater with your ears. Information Technology and Disabilities 2, 2","author":"Miers John","year":"1995","unstructured":"John Miers . 1995. Audio description\u2013seeing theater with your ears. Information Technology and Disabilities 2, 2 ( 1995 ). John Miers. 1995. Audio description\u2013seeing theater with your ears. Information Technology and Disabilities 2, 2 (1995)."},{"key":"e_1_3_2_2_32_1","unstructured":"Tomas Mikolov Ilya Sutskever Kai Chen Greg\u00a0S Corrado and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111\u20133119.  Tomas Mikolov Ilya Sutskever Kai Chen Greg\u00a0S Corrado and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111\u20133119."},{"key":"e_1_3_2_2_33_1","unstructured":"Chris Mikul. 2010. Audio description background paper. Ultimo NSW: Media Access Australia(2010).  Chris Mikul. 2010. Audio description background paper. Ultimo NSW: Media Access Australia(2010)."},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00772"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1177\/0145482X1510900204"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1049"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.497"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.111"},{"key":"e_1_3_2_2_39_1","volume-title":"Video captions for online courses: Do YouTube\u2019s auto-generated captions meet deaf students","author":"Parton Becky","year":"2016","unstructured":"Becky Parton . 2016. Video captions for online courses: Do YouTube\u2019s auto-generated captions meet deaf students \u2019 needs?Journal of Open, Flexible , and Distance Learning 20, 1 ( 2016 ), 8\u201318. Becky Parton. 2016. Video captions for online courses: Do YouTube\u2019s auto-generated captions meet deaf students\u2019 needs?Journal of Open, Flexible, and Distance Learning 20, 1 (2016), 8\u201318."},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1075\/target.28.3.04per"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2019.2899494"},{"key":"e_1_3_2_2_42_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.  Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training."},{"key":"e_1_3_2_2_43_1","volume-title":"Artesis Hogeschool","author":"Remael Aline","year":"2011","unstructured":"Aline Remael and Gert Vercauteren . 2011. Basisprincipes voor audiobeschrijving voor televisie en film [Basics of audio description for television and film]. Antwerp: Departement Vertalers and Tolken , Artesis Hogeschool ( 2011 ). Aline Remael and Gert Vercauteren. 2011. Basisprincipes voor audiobeschrijving voor televisie en film [Basics of audio description for television and film]. Antwerp: Departement Vertalers and Tolken, Artesis Hogeschool (2011)."},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0987-1"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1177\/0145482X1310700405"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Elliot Salisbury Ece Kamar and Meredith\u00a0Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind.. In HCOMP. 147\u2013156.  Elliot Salisbury Ece Kamar and Meredith\u00a0Ringel Morris. 2017. Toward Scalable Social Alt Text: Conversational Crowdsourcing as a Tool for Refining Vision-to-Language Technology for the Blind.. In HCOMP. 147\u2013156.","DOI":"10.1609\/hcomp.v5i1.13301"},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/1857920.1857924"},{"key":"e_1_3_2_2_48_1","volume-title":"Adding audio description: Does it make a difference?Journal of Visual Impairment & Blindness 95, 4","author":"Schmeidler Emilie","year":"2001","unstructured":"Emilie Schmeidler and Corinne Kirchner . 2001. Adding audio description: Does it make a difference?Journal of Visual Impairment & Blindness 95, 4 ( 2001 ), 197\u2013212. Emilie Schmeidler and Corinne Kirchner. 2001. Adding audio description: Does it make a difference?Journal of Visual Impairment & Blindness 95, 4 (2001), 197\u2013212."},{"key":"e_1_3_2_2_50_1","volume-title":"The visual made verbal: A comprehensive training manual and guide to the history and applications of audio description","author":"Snyder Joel","unstructured":"Joel Snyder . 2014. The visual made verbal: A comprehensive training manual and guide to the history and applications of audio description . American Council of the Blind, Incorporated. Joel Snyder. 2014. The visual made verbal: A comprehensive training manual and guide to the history and applications of audio description. American Council of the Blind, Incorporated."},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376404"},{"key":"e_1_3_2_2_52_1","unstructured":"Sandeep Subramanian Adam Trischler Yoshua Bengio and Christopher\u00a0J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. (2018). arXiv:1804.00079  Sandeep Subramanian Adam Trischler Yoshua Bengio and Christopher\u00a0J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. (2018). arXiv:1804.00079"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1177\/0264619616661603"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00751"},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3355089.3356487","article-title":"Comic-guided speech synthesis","volume":"38","author":"Wang Yujia","year":"2019","unstructured":"Yujia Wang , Wenguan Wang , Wei Liang , and Lap-Fai Yu . 2019 . Comic-guided speech synthesis . ACM Transactions on Graphics (TOG) 38 , 6 (2019), 1 \u2013 14 . Yujia Wang, Wenguan Wang, Wei Liang, and Lap-Fai Yu. 2019. Comic-guided speech synthesis. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1\u201314.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"crossref","unstructured":"Yujia Wang Liang Wei Li Wanwan Li Dingzeyu and Lap-Fai Yu. 2020. Scene-Aware Background Music Synthesis. In ACM Multimedia Vol.\u00a038.  Yujia Wang Liang Wei Li Wanwan Li Dingzeyu and Lap-Fai Yu. 2020. Scene-Aware Background Music Synthesis. In ACM Multimedia Vol.\u00a038.","DOI":"10.1145\/3394171.3413894"},{"key":"e_1_3_2_2_59_1","volume-title":"Transformers: State-of-the-art natural language processing.","author":"Wolf Thomas","year":"2019","unstructured":"Thomas Wolf , Lysandre Debut , Victor Sanh , Julien Chaumond , Clement Delangue , Anthony Moi , Pierric Cistac , Tim Rault , R\u00e9mi Louf , Morgan Funtowicz , 2019 . Transformers: State-of-the-art natural language processing. (2019). arXiv:1910.03771 Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R\u00e9mi Louf, Morgan Funtowicz, 2019. Transformers: State-of-the-art natural language processing. (2019). arXiv:1910.03771"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00911"}],"event":{"name":"CHI '21: CHI Conference on Human Factors in Computing Systems","location":"Yokohama Japan","acronym":"CHI '21","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"]},"container-title":["Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3411764.3445347","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3411764.3445347","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:28:34Z","timestamp":1750195714000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3411764.3445347"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,6]]},"references-count":59,"alternative-id":["10.1145\/3411764.3445347","10.1145\/3411764"],"URL":"https:\/\/doi.org\/10.1145\/3411764.3445347","relation":{},"subject":[],"published":{"date-parts":[[2021,5,6]]},"assertion":[{"value":"2021-05-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}