{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T03:12:26Z","timestamp":1778555546168,"version":"3.51.4"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"CSCW2","license":[{"start":{"date-parts":[[2021,10,15]],"date-time":"2021-10-15T00:00:00Z","timestamp":1634256000000},"content-version":"vor","delay-in-days":366,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation","award":["IIS-1755593"],"award-info":[{"award-number":["IIS-1755593"]}]},{"name":"Microsoft"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Hum.-Comput. Interact."],"published-print":{"date-parts":[[2020,10,14]]},"abstract":"<jats:p>The task of answering questions about images has garnered attention as a practical service for assisting populations with visual impairments as well as a visual Turing test for the artificial intelligence community. Our first aim is to identify the common vision skills needed for both scenarios. To do so, we analyze the need for four vision skills--object recognition, text recognition, color recognition, and counting--on over 27,000 visual questions from two datasets representing both scenarios. We next quantify the difficulty of these skills for both humans and computers on both datasets. Finally, we propose a novel task of predicting what vision skills are needed to answer a question about an image. Our results reveal (mis)matches between aims of real users of such services and the focus of the AI community. 
We conclude with a discussion about future directions for addressing the visual question answering task.<\/jats:p>","DOI":"10.1145\/3415220","type":"journal-article","created":{"date-parts":[[2020,10,15]],"date-time":"2020-10-15T18:27:49Z","timestamp":1602786469000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Vision Skills Needed to Answer Visual Questions"],"prefix":"10.1145","volume":"4","author":[{"given":"Xiaoyu","family":"Zeng","sequence":"first","affiliation":[{"name":"The University of Texas at Austin, Austin, TX, USA"}]},{"given":"Yanan","family":"Wang","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, TX, USA"}]},{"given":"Tai-Yin","family":"Chiu","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, TX, USA"}]},{"given":"Nilavra","family":"Bhattacharya","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, TX, USA"}]},{"given":"Danna","family":"Gurari","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, TX, USA"}]}],"member":"320","published-online":{"date-parts":[[2020,10,15]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Aishwarya Aishwarya and Jiasen Liu. 2017. Python API and Evaluation Code for v2.0 and v1.0 releases of the VQA dataset. https:\/\/github.com\/GT-Vision-Lab\/VQA."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10593-2_27"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3274291"},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the ACM on Human-Computer Interaction","volume":"3","author":"Baldwin Mark S.","year":"2019","unstructured":"Mark S. Baldwin, Sen H. 
Hirano, Jennifer Mankoff, and Gillian R. Hayes. 2019. Design in the Public Square: Supporting Assistive Technology Design Through Public Mixed-Ability Cooperation. Proceedings of the ACM on Human-Computer Interaction, Vol. 3, CSCW (2019), 155."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00437"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1866029.1866080"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2010.5543821"},{"key":"e_1_2_1_10_1","unstructured":"Inc. Blindsight. 2019. blindsight. https:\/\/blindsight.app\/."},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2117--2126","author":"Brady Erin","unstructured":"Erin Brady, Meredith Ringel Morris, Yu Zhong, Samuel White, and Jeffrey P. Bigham. 2013. Visual Challenges in the Everyday Lives of Blind People. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2117--2126."},{"key":"e_1_2_1_12_1","unstructured":"Envision Technologies B.V. 2019. EnvisionAI. https:\/\/itunes.apple.com\/us\/app\/envision-ai\/id1268632314?mt=8."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5716--5725","author":"Chao Wei-Lun","year":"2018","unstructured":"Wei-Lun Chao, Hexiang Hu, and Fei Sha. 2018. Cross-Dataset Adaptation for Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5716--5725."},{"key":"e_1_2_1_14_1","volume-title":"Cascade: Crowdsourcing Taxonomy Creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM","author":"Chilton Lydia B.","unstructured":"Lydia B. Chilton, Greg Little, Darren Edge, Daniel S. Weld, and James A. Landay. 2013. Cascade: Crowdsourcing Taxonomy Creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 
ACM, 1999--2008."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00370"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1002\/pra2.7"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359293"},{"key":"e_1_2_1_18_1","unstructured":"Be My Eyes. 2019. Be My Eyes. https:\/\/www.bemyeyes.com\/."},{"key":"e_1_2_1_19_1","volume-title":"Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach.","author":"Fukui Akira","year":"2016","unstructured":"Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1422953112"},{"key":"e_1_2_1_21_1","volume-title":"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Goyal Yash","year":"2017","unstructured":"Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_2_1_22_1","volume-title":"Towards transparent ai systems: Interpreting visual question answering models. arXiv preprint arXiv:1608.08974","author":"Goyal Yash","year":"2016","unstructured":"Yash Goyal, Akrit Mohapatra, Devi Parikh, and Dhruv Batra. 2016. Towards transparent ai systems: Interpreting visual question answering models. arXiv preprint arXiv:1608.08974 (2016)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3174092"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 29th Annual Symposium on User Interface Software and Technology. 
ACM, 651--664","author":"Guo Anhong","unstructured":"Anhong Guo, Xiang `Anthony' Chen, Haoran Qi, Samuel White, Suman Ghosh, Chieko Asakawa, and Jeffrey P Bigham. 2016. VizLens: A robust and interactive screen reader for interfaces in the real world. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 651--664."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3025781"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-018-1065-7"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 939--948","author":"Gurari Danna","unstructured":"Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, and Jeffrey P. Bigham. 2019. VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 939--948."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3608--3617","author":"Gurari Danna","unstructured":"Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018b. VizWiz Grand Challenge: Answering Visual Questions from Blind People. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3608--3617."},{"key":"e_1_2_1_29_1","volume-title":"Captioning Images Taken by People Who Are Blind. arXiv preprint arXiv:2002.08565","author":"Gurari Danna","year":"2020","unstructured":"Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. 2020. Captioning Images Taken by People Who Are Blind. 
arXiv preprint arXiv:2002.08565 (2020)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359318"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_44"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359300"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3338243"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3025899"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.538"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.06.005"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_2_1_39_1","unstructured":"Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems. 289--297."},{"key":"e_1_2_1_40_1","unstructured":"Mateusz Malinowski and Mario Fritz. 2014. A Multi-World Approach to Question Answering about Real-World Scenes Based on Uncertain Input. In Advances in Neural Information Processing Systems. 1682--1690."},{"key":"e_1_2_1_41_1","unstructured":"Microsoft. 2017. Seeing AI. https:\/\/www.microsoft.com\/en-us\/seeing-ai."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the ACM on Human-Computer Interaction","volume":"2","author":"Passi Samir","year":"2018","unstructured":"Samir Passi and Steven J. Jackson. 2018. Trust in Data Science: Collaboration, Translation, and Accountability in Corporate Data Science Projects. Proceedings of the ACM on Human-Computer Interaction, Vol. 2, CSCW (2018), 136."},{"key":"e_1_2_1_43_1","volume-title":"YOLOv3: An Incremental Improvement. arXiv","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. 
arXiv (2018)."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1609\/hcomp.v5i1.13301"},{"key":"e_1_2_1_45_1","volume-title":"Brubaker","author":"Scheuerman Morgan Klaus","year":"2019","unstructured":"Morgan Klaus Scheuerman, Jacob M. Paul, and Jed R. Brubaker. 2019. How Computers See Gender: An Evaluation of Gender Classification in Commercial Facial Analysis Services. Proceedings of the ACM on Human-Computer Interaction, Vol. 3, CSCW (2019), 144."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359233"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376404"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3234695.3236337"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2010.5543725"},{"key":"e_1_2_1_50_1","unstructured":"TapTapSee. 2019. TapTapSee. https:\/\/taptapseeapp.com\/."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359130"},{"key":"e_1_2_1_52_1","volume-title":"Rifat Sabbir Mansur, and Kurt Luther","author":"Venkatagiri Sukrit","year":"2019","unstructured":"Sukrit Venkatagiri, Jacob Thebault-Spieker, Rachel Kohler, John Purviance, Rifat Sabbir Mansur, and Kurt Luther. 2019. GroundTruth: Augmenting Expert Image Geolocation with Crowdsourcing and Shared Representations. Proceedings of the ACM on Human-Computer Interaction, Vol. 3, CSCW (2019), 107."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2008.172"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.05.001"},{"key":"e_1_2_1_55_1","unstructured":"xkcd Palette. 2019. xkcd Palette. https:\/\/xkcd.com\/color\/rgb\/."},{"key":"e_1_2_1_56_1","doi-asserted-by":"crossref","unstructured":"Chun-Ju Yang, Kristen Grauman, and Danna Gurari. 2018. Visual Question Answer Diversity. In HCOMP. 
184--192.","DOI":"10.1609\/hcomp.v6i1.13341"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.10"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/1459359.1459412"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359158"},{"key":"e_1_2_1_60_1","volume-title":"Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2353--2362","author":"Zhong Yu","unstructured":"Yu Zhong, Walter S. Lasecki, Erin Brady, and Jeffrey P. Bigham. 2015. Regionspeak: Quick Comprehensive Spatial Descriptions of Complex Images for Blind Users. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2353--2362."}],"container-title":["Proceedings of the ACM on Human-Computer Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3415220","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3415220","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3415220","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T02:33:39Z","timestamp":1778553219000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3415220"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,14]]},"references-count":60,"journal-issue":{"issue":"CSCW2","published-print":{"date-parts":[[2020,10,14]]}},"alternative-id":["10.1145\/3415220"],"URL":"https:\/\/doi.org\/10.1145\/3415220","relation":{},"ISSN":["2573-0142"],"issn-type":[{"value":"2573-0142","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,10,14]]},"assertion":[{"value":"2020-10-15","order":3,"name":"published","label":"Published","group":{"name
":"publication_history","label":"Publication History"}}]}}