{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T01:52:57Z","timestamp":1768009977789,"version":"3.49.0"},"reference-count":79,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,8,22]],"date-time":"2024-08-22T00:00:00Z","timestamp":1724284800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2024,8,22]]},"abstract":"<jats:p>Accessible image exploration systems can help people with visual impairments (PVI) understand image content by providing different types of interaction. With the development of computer vision technologies, image exploration systems now support more fine-grained image content processing, including image segmentation, description and object recognition. However, in developing countries like China, it is still rare for PVI to rely widely on such accessible systems. To better understand how accessible image exploration systems are used in China and to improve image understanding among PVI there, we developed AI-Vision, an Android-based hierarchical accessible image exploration system that generates general image descriptions, local object descriptions and metadata information. 
Our 7-day diary study with 10 PVI verified the usability of AI-Vision and also revealed a series of design implications for improving accessible image exploration systems similar to AI-Vision.<\/jats:p>","DOI":"10.1145\/3678537","type":"journal-article","created":{"date-parts":[[2024,9,9]],"date-time":"2024-09-09T14:36:21Z","timestamp":1725892581000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["AI-Vision: A Three-Layer Accessible Image Exploration System for People with Visual Impairments in China"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0000-084X","authenticated-orcid":false,"given":"Kaixing","family":"Zhao","sequence":"first","affiliation":[{"name":"School of Software, Northwestern Polytechnical University, Xi'an, Shaanxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9872-8625","authenticated-orcid":false,"given":"Rui","family":"Lai","sequence":"additional","affiliation":[{"name":"School of Software, Northwestern Polytechnical University, Xi'an, Shaanxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6097-2467","authenticated-orcid":false,"given":"Bin","family":"Guo","sequence":"additional","affiliation":[{"name":"School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9758-6620","authenticated-orcid":false,"given":"Le","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9973-356X","authenticated-orcid":false,"given":"Liang","family":"He","sequence":"additional","affiliation":[{"name":"School of 
Software, Northwestern Polytechnical University, Xi'an, Shaanxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3686-695X","authenticated-orcid":false,"given":"Yuhang","family":"Zhao","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Wisconsin-Madison, Madison, Wisconsin, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,9,9]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449871"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2010.161"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2644615"},{"key":"e_1_2_2_5_1","volume-title":"Proc. Int. Conf. of Auditory Display","author":"Banf Michael","year":"2012","unstructured":"Michael Banf and Volker Blanz. 2012. A modular computer vision sonification model for the visually impaired. In Proc. Int. Conf. of Auditory Display. 
Georgia Tech, USA, 1--8."},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2459236.2459264"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1080\/17483107.2019.1673834"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1866029.1866080"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168987.1169018"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.777378"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3432196"},{"key":"e_1_2_2_12_1","volume-title":"Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang.","author":"Bubeck S\u00e9bastien","year":"2023","unstructured":"S\u00e9bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712 [cs.CL]"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-60639-2_4"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.3390\/electronics10030297"},{"key":"e_1_2_2_15_1","unstructured":"W3 Consortium. 2018. Web Content Accessibility Guidelines (WCAG) 2.1. https:\/\/www.w3.org\/TR\/WCAG21\/."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01101"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-54446-5_17"},{"key":"e_1_2_2_18_1","unstructured":"Be My Eyes. 2015. The story about Be My Eyes. https:\/\/www.bemyeyes.com\/about."},{"key":"e_1_2_2_19_1","unstructured":"Facebook. 2021. How Facebook is using AI to improve photo descriptions for people who are blind or visually impaired. 
https:\/\/tech.facebook.com\/artificial-intelligence\/2021\/1\/how-facebook-is-using-ai-to-improve-photo-descriptions-for-people-who-are-blind-or-visually-impaired\/."},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1068\/p250967"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.81"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2384916.2384935"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23774-4_5"},{"key":"e_1_2_2_24_1","volume-title":"Mobile assistive technologies for the visually impaired. Survey of ophthalmology 58, 6","author":"Hakobyan Lilit","year":"2013","unstructured":"Lilit Hakobyan, Jo Lumsden, Dympna O'Sullivan, and Hannah Bartlett. 2013. Mobile assistive technologies for the visually impaired. Survey of ophthalmology 58, 6 (2013), 513--528."},{"key":"e_1_2_2_25_1","volume-title":"Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, USA, 2961--2969","author":"He Kaiming","year":"2017","unstructured":"Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, USA, 2961--2969."},{"key":"e_1_2_2_26_1","volume-title":"Image captioning: Transforming objects into words. Advances in neural information processing systems 32","author":"Herdade Simao","year":"2019","unstructured":"Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. 
Advances in neural information processing systems 32 (2019), 1--11."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1177\/0145482X9308700805"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/FUZZ45933.2021.9494549"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2814940.2814952"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.494"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3025899"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_2_2_34_1","volume-title":"Stages of manual exploration in haptic object identification. Perception & psychophysics 52, 6","author":"Klatzky Roberta L","year":"1992","unstructured":"Roberta L Klatzky and Susan J Lederman. 1992. Stages of manual exploration in haptic object identification. Perception & psychophysics 52, 6 (1992), 661--670."},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/IEMBS.2010.5626038"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3501966"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3441852.3476548"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445321"},{"key":"e_1_2_2_39_1","unstructured":"Junnan Li Dongxu Li Silvio Savarese and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. 
arXiv:2301.12597 [cs.CV]"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1002\/asi.23292"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3586076"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.23919\/ELECO47770.2019.8990630"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-5010"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3289485"},{"key":"e_1_2_2_45_1","unstructured":"Microsoft. 2021. Seeing AI. https:\/\/www.microsoft.com\/en-us\/ai\/seeing-ai."},{"key":"e_1_2_2_46_1","unstructured":"Microsoft. 2023. What is Image Analysis? https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/computer-vision\/overview-image-analysis?tabs=4-0."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173633"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581302"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1177\/016264341102600204"},{"key":"e_1_2_2_50_1","first-page":"1","article-title":"Describing images on the web: a survey of current practice and prospects for the future","volume":"71","author":"Petrie Helen","year":"2005","unstructured":"Helen Petrie, Chandra Harrison, and Sundeep Dev. 2005. Describing images on the web: a survey of current practice and prospects for the future. Proceedings of Human Computer Interaction International (HCII) 71, 2 (2005), 1--10.","journal-title":"Proceedings of Human Computer Interaction International (HCII)"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00160"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_2_2_53_1","volume-title":"Models of object recognition. Nature neuroscience 3, 11","author":"Riesenhuber Maximilian","year":"2000","unstructured":"Maximilian Riesenhuber and Tomaso Poggio. 2000. Models of object recognition. 
Nature neuroscience 3, 11 (2000), 1199--1204."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1609\/hcomp.v5i1.13301"},{"key":"e_1_2_2_55_1","volume-title":"Colour envisioned: Concepts of colour in the blind and sighted. Visual cognition 26, 5","author":"Saysani Armin","year":"2018","unstructured":"Armin Saysani, Michael C Corballis, and Paul M Corballis. 2018. Colour envisioned: Concepts of colour in the blind and sighted. Visual cognition 26, 5 (2018), 382--392."},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2022.3152990"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397309"},{"key":"e_1_2_2_58_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs.CV]"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1037\/1076-8998.6.3.196"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1177\/0145482X9208600508"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_4"},{"key":"e_1_2_2_62_1","first-page":"17","article-title":"Inferring the scale and content of a map using deep learning. The International Archives of the Photogrammetry","volume":"43","author":"Touya Guillaume","year":"2020","unstructured":"Guillaume Touya, F Brisebard, F Quinton, and Azelle Courtial. 2020. Inferring the scale and content of a map using deep learning. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 43 (2020), 17--24.","journal-title":"Remote Sensing and Spatial Information Sciences"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2016.61"},{"key":"e_1_2_2_64_1","first-page":"n1","article-title":"Everyday information behaviour of the visually impaired in China","volume":"22","author":"Wang Sufang","year":"2017","unstructured":"Sufang Wang and Jieli Yu. 2017. 
Everyday information behaviour of the visually impaired in China. Information Research: An International Electronic Journal 22, 1 (2017), n1.","journal-title":"Information Research: An International Electronic Journal"},{"key":"e_1_2_2_65_1","volume-title":"Caption Anything: Interactive Image Description with Diverse Multimodal Controls. arXiv:2305.02677 [cs.CV]","author":"Wang Teng","year":"2023","unstructured":"Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, and Shanshan Zhao. 2023. Caption Anything: Interactive Image Description with Diverse Multimodal Controls. arXiv:2305.02677 [cs.CV]"},{"key":"e_1_2_2_66_1","volume-title":"Guided search 2.0 a revised model of visual search. Psychonomic bulletin & review 1","author":"Wolfe Jeremy M","year":"1994","unstructured":"Jeremy M Wolfe. 1994. Guided search 2.0 a revised model of visual search. Psychonomic bulletin & review 1 (1994), 202--238."},{"key":"e_1_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/2998181.2998364"},{"key":"e_1_2_2_68_1","first-page":"2287","article-title":"Salient region extraction based on global contrast enhancement and saliency cut for image information recognition of the visually impaired","volume":"12","author":"Yoon Hongchan","year":"2018","unstructured":"Hongchan Yoon, Baek-Hyun Kim, Mukhiddinov Mukhriddin, and Jinsoo Cho. 2018. Salient region extraction based on global contrast enhancement and saliency cut for image information recognition of the visually impaired. KSII Transactions on Internet and Information Systems (TIIS) 12, 5 (2018), 2287--2312.","journal-title":"KSII Transactions on Internet and Information Systems (TIIS)"},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3481623"},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3597638.3608388"},{"key":"e_1_2_2_71_1","volume-title":"Recognize Anything: A Strong Image Tagging Model. 
arXiv:2306.03514 [cs.CV]","author":"Zhang Youcai","year":"2023","unstructured":"Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, and Lei Zhang. 2023. Recognize Anything: A Strong Image Tagging Model. arXiv:2306.03514 [cs.CV]"},{"key":"e_1_2_2_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445578"},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/3567733"},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3427335"},{"key":"e_1_2_2_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3134756"},{"key":"e_1_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173789"},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/2702123.2702437"},{"key":"e_1_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/2702123.2702437"},{"key":"e_1_2_2_79_1","volume-title":"Simple Baseline for Visual Question Answering. arXiv preprint arXiv:1512.02167 1, 1","author":"Zhou B","year":"2015","unstructured":"B Zhou, Y Tian, S Sukhbaatar, A Szlam, and R Fergus. 2015. Simple Baseline for Visual Question Answering. 
arXiv preprint arXiv:1512.02167 1, 1 (2015), 1--7."}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3678537","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3678537","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:42:53Z","timestamp":1755787373000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3678537"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,22]]},"references-count":79,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,8,22]]}},"alternative-id":["10.1145\/3678537"],"URL":"https:\/\/doi.org\/10.1145\/3678537","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,8,22]]},"assertion":[{"value":"2024-09-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}