{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,13]],"date-time":"2026-06-13T07:20:40Z","timestamp":1781335240325,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":89,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,10,27]],"date-time":"2024-10-27T00:00:00Z","timestamp":1729987200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,10,27]]},"DOI":"10.1145\/3663548.3675599","type":"proceedings-article","created":{"date-parts":[[2024,10,20]],"date-time":"2024-10-20T18:37:25Z","timestamp":1729449445000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["EditScribe: Non-Visual Image Editing with Natural Language Verification Loops"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7545-4136","authenticated-orcid":false,"given":"Ruei-Che","family":"Chang","sequence":"first","affiliation":[{"name":"Computer Science and Engineering, University of Michigan, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-5023-1426","authenticated-orcid":false,"given":"Yuxuan","family":"Liu","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, University of Michigan, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6315-9970","authenticated-orcid":false,"given":"Lotus","family":"Zhang","sequence":"additional","affiliation":[{"name":"Human Centered Design and Engineering, University of Washington, United 
States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4447-7818","authenticated-orcid":false,"given":"Anhong","family":"Guo","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, University of Michigan, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2015. Specific Guidelines: Art Photos & Cartoons. http:\/\/diagramcenter.org\/specific-guidelines-final-draft.html"},{"key":"e_1_3_2_1_2_1","unstructured":"2018. How to Write Alt Text and Image Descriptions for the visually impaired. https:\/\/www.perkins.org\/resource\/how-write-alt-text-and-image-descriptions-visually-impaired\/"},{"key":"e_1_3_2_1_3_1","unstructured":"2018. Web Content Accessibility Guidelines (WCAG) Overview. https:\/\/www.w3.org\/WAI\/standards-guidelines\/wcag\/"},{"key":"e_1_3_2_1_4_1","unstructured":"2022. Auto Color. https:\/\/helpx.adobe.com\/ca\/premiere-pro\/using\/auto-color.html"},{"key":"e_1_3_2_1_5_1","unstructured":"2022. Text to Color Grade. https:\/\/runwayml.com\/ai-tools\/text-to-color-grade\/"},{"key":"e_1_3_2_1_6_1","unstructured":"2024. Aira. https:\/\/aira.io\/"},{"key":"e_1_3_2_1_7_1","unstructured":"2024. BeMyEyes. https:\/\/www.bemyeyes.com\/"},{"key":"e_1_3_2_1_8_1","unstructured":"2024. ChatGPT. https:\/\/chat.openai.com\/"},{"key":"e_1_3_2_1_9_1","unstructured":"2024. GPT-4 Vision. https:\/\/platform.openai.com\/docs\/guides\/vision"},{"key":"e_1_3_2_1_10_1","unstructured":"2024. Gradio. https:\/\/www.gradio.app\/"},{"key":"e_1_3_2_1_11_1","unstructured":"2024. How to use Text Analyzer in JAWS to proofread documents. https:\/\/www.perkins.org\/resource\/how-to-use-text-analyzer-in-jaws-to-proofread-documents\/"},{"key":"e_1_3_2_1_12_1","unstructured":"2024. 
Introducing Be My AI (formerly Virtual Volunteer) for People who are Blind or Have Low Vision Powered by OpenAI\u2019s GPT-4. https:\/\/www.bemyeyes.com\/blog\/introducing-be-my-eyes-virtual-volunteer"},{"key":"e_1_3_2_1_13_1","unstructured":"2024. Midjourney. https:\/\/www.midjourney.com\/home"},{"key":"e_1_3_2_1_14_1","unstructured":"2024. OpenCV. https:\/\/opencv.org\/"},{"key":"e_1_3_2_1_15_1","unstructured":"2024. SeeingAI. https:\/\/www.seeingai.com\/"},{"key":"e_1_3_2_1_16_1","unstructured":"2024. Tap into the power of AI photo editing. https:\/\/www.adobe.com\/products\/photoshop\/ai.html"},{"key":"e_1_3_2_1_17_1","unstructured":"2024. Use VoiceOver for images and videos on iPhone. https:\/\/support.apple.com\/en-ca\/guide\/iphone\/iph37e6b3844\/ios"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2504335.2504360"},{"key":"e_1_3_2_1_19_1","volume-title":"Twelfth Symposium on Usable Privacy and Security (SOUPS 2016","author":"Ahmed Tousif","year":"2016","unstructured":"Tousif Ahmed, Patrick Shaffer, Kay Connelly, David Crandall, and Apu Kapadia. 2016. Addressing Physical Safety, Security, and Privacy for People with Visual Impairments. In Twelfth Symposium on Usable Privacy and Security (SOUPS 2016). USENIX Association, Denver, CO, 341\u2013354. https:\/\/www.usenix.org\/conference\/soups2016\/technical-sessions\/presentation\/ahmed"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3555570"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300233"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173650"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1866029.1866080"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","unstructured":"Tim Brooks Aleksander Holynski and Alexei\u00a0A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. 
arxiv:2211.09800\u00a0[cs.CV]","DOI":"10.1109\/CVPR52729.2023.01764"},{"key":"e_1_3_2_1_25_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems 33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared\u00a0D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877\u20131901."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3501918"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3517790"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/IVCNZ51579.2020.9290737"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3517428.3550372"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3517428.3550372"},{"key":"e_1_3_2_1_31_1","unstructured":"Ian\u00a0J. Goodfellow Jean Pouget-Abadie Mehdi Mirza Bing Xu David Warde-Farley Sherjil Ozair Aaron Courville and Yoshua Bengio. 2014. Generative Adversarial Networks. arxiv:1406.2661\u00a0[stat.ML]"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2470654.2481292"},{"key":"e_1_3_2_1_33_1","unstructured":"R. Hartson and P.S. Pyla. 2012. The UX Book: Process and Guidelines for Ensuring a Quality User Experience. Elsevier Science. https:\/\/books.google.ca\/books?id=w4I3Y64SWLoC"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581249"},{"key":"e_1_3_2_1_35_1","unstructured":"Amir Hertz Ron Mokady Jay Tenenbaum Kfir Aberman Yael Pritch and Daniel Cohen-Or. 2022. Prompt-to-Prompt Image Editing with Cross Attention Control. 
arxiv:2208.01626\u00a0[cs.CV]"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3597638.3608422"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3586183.3606735"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581494"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3581494"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3476038"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3532106.3533522"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_3_2_1_43_1","volume-title":"Style Vectors for Steering Generative Large Language Model. arXiv preprint arXiv:2402.01618","author":"Konen Kai","year":"2024","unstructured":"Kai Konen, Sophie Jentzsch, Diaoul\u00e9 Diallo, Peer Sch\u00fctt, Oliver Bensch, Roxanne\u00a0El Baff, Dominik Opitz, and Tobias Hecking. 2024. Style Vectors for Steering Generative Large Language Model. arXiv preprint arXiv:2402.01618 (2024)."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3517635"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3501966"},{"key":"e_1_3_2_1_46_1","volume-title":"Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767","author":"Li Feng","year":"2023","unstructured":"Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. 2023. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767 (2023)."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300436"},{"key":"e_1_3_2_1_48_1","unstructured":"Yaron Lipman Ricky T.\u00a0Q. Chen Heli Ben-Hamu Maximilian Nickel and Matt Le. 2023. Flow Matching for Generative Modeling. 
arxiv:2210.02747\u00a0[cs.LG]"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3025814"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2858036.2858116"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1609\/hcomp.v7i1.5284"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373625.3417082"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588432.3591513"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3526113.3545637"},{"key":"e_1_3_2_1_55_1","volume-title":"Proceedings of Human Computer Interaction International (HCII) 71","author":"Petrie Helen","year":"2005","unstructured":"Helen Petrie, Chandra Harrison, and Sundeep Dev. 2005. Describing images on the web: a survey of current practice and prospects for the future. Proceedings of Human Computer Interaction International (HCII) 71, 2 (2005)."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445040"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3517428.3544812"},{"key":"e_1_3_2_1_58_1","unstructured":"Alec Radford Jong\u00a0Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arxiv:2103.00020\u00a0[cs.CV]"},{"key":"e_1_3_2_1_59_1","unstructured":"Aditya Ramesh Prafulla Dhariwal Alex Nichol Casey Chu and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arxiv:2204.06125\u00a0[cs.CV]"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"crossref","unstructured":"Robin Rombach Andreas Blattmann Dominik Lorenz Patrick Esser and Bj\u00f6rn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. 
arxiv:2112.10752\u00a0[cs.CV]","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3532106.3533514"},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/1015706.1015720"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3441852.3476521"},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373625.3416993"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445242"},{"key":"e_1_3_2_1_66_1","volume-title":"Emu Edit: Precise Image Editing via Recognition and Generation Tasks. arxiv:2311.10089\u00a0[cs.CV]","author":"Sheynin Shelly","year":"2023","unstructured":"Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. 2023. Emu Edit: Precise Image Editing via Recognition and Generation Tasks. arxiv:2311.10089\u00a0[cs.CV]"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3517678"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3441852.3471233"},{"key":"e_1_3_2_1_69_1","volume-title":"Resolution-robust Large Mask Inpainting with Fourier Convolutions. arXiv preprint arXiv:2109.07161","author":"Suvorov Roman","year":"2021","unstructured":"Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. 2021. Resolution-robust Large Mask Inpainting with Fourier Convolutions. arXiv preprint arXiv:2109.07161 (2021)."},{"key":"e_1_3_2_1_70_1","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar Aurelien Rodriguez Armand Joulin Edouard Grave and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. 
arxiv:2302.13971\u00a0[cs.CL]"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3592451"},{"key":"e_1_3_2_1_72_1","volume-title":"Making Short-Form Videos Accessible with Hierarchical Video Summaries. arXiv preprint arXiv:2402.10382","author":"Van\u00a0Daele Tess","year":"2024","unstructured":"Tess Van\u00a0Daele, Akhil Iyer, Yuning Zhang, Jalyn\u00a0C Derry, Mina Huh, and Amy Pavel. 2024. Making Short-Form Videos Accessible with Hierarchical Video Summaries. arXiv preprint arXiv:2402.10382 (2024)."},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818048.2820013"},{"key":"e_1_3_2_1_74_1","unstructured":"World Wide Web\u00a0Consortium (W3C). 2022. W3C Image Concepts. https:\/\/www.w3.org\/WAI\/tutorials\/images\/"},{"key":"e_1_3_2_1_75_1","unstructured":"Chien-Yao Wang I-Hau Yeh and Hong-Yuan\u00a0Mark Liao. 2024. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arxiv:2402.13616\u00a0[cs.CV]"},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/2998181.2998364"},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3677846.3677861"},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617695.3617701"},{"key":"e_1_3_2_1_79_1","volume-title":"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441","author":"Yang Jianwei","year":"2023","unstructured":"Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023)."},{"key":"e_1_3_2_1_80_1","unstructured":"Ahmet\u00a0Burak Yildirim Vedat Baday Erkut Erdem Aykut Erdem and Aysegul Dundar. 2023. Inst-Inpaint: Instructing to Remove Objects with Diffusion Models. arxiv:2304.03246\u00a0[cs.CV]"},{"key":"e_1_3_2_1_81_1","volume-title":"Inpaint Anything: Segment Anything Meets Image Inpainting. 
arxiv:2304.06790\u00a0[cs.CV]","author":"Yu Tao","year":"2023","unstructured":"Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. 2023. Inpaint Anything: Segment Anything Meets Image Inpainting. arxiv:2304.06790\u00a0[cs.CV]"},{"key":"e_1_3_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/3490099.3511105"},{"key":"e_1_3_2_1_83_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 23465\u201323476","author":"Zeng Zequn","year":"2023","unstructured":"Zequn Zeng, Hao Zhang, Ruiying Lu, Dongsheng Wang, Bo Chen, and Zhengjue Wang. 2023. Conzic: Controllable zero-shot image captioning by sampling-based polishing. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 23465\u201323476."},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1145\/3613904.3642713"},{"key":"e_1_3_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1145\/3597638.3608387"},{"key":"e_1_3_2_1_86_1","volume-title":"Nineteenth Symposium on Usable Privacy and Security (SOUPS","author":"Zhang Zhuohao\u00a0Jerry","year":"2023","unstructured":"Zhuohao\u00a0Jerry Zhang, Smirity Kaushik, JooYoung Seo, Haolin Yuan, Sauvik Das, Leah Findlater, Danna Gurari, Abigale Stangl, and Yang Wang. 2023. ImageAlly: A Human-AI Hybrid Approach to Support Blind People in Detecting and Redacting Private Image Content. In Nineteenth Symposium on Usable Privacy and Security (SOUPS 2023). 417\u2013436."},{"key":"e_1_3_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544548.3580655"},{"key":"e_1_3_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1145\/3134756"},{"key":"e_1_3_2_1_89_1","volume-title":"Segment everything everywhere all at once. Advances in Neural Information Processing Systems 36","author":"Zou Xueyan","year":"2024","unstructured":"Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong\u00a0Jae Lee. 2024. 
Segment everything everywhere all at once. Advances in Neural Information Processing Systems 36 (2024)."}],"event":{"name":"ASSETS '24: The 26th International ACM SIGACCESS Conference on Computers and Accessibility","location":"St. John's NL Canada","acronym":"ASSETS '24","sponsor":["SIGACCESS ACM Special Interest Group on Accessible Computing"]},"container-title":["The 26th International ACM SIGACCESS Conference on Computers and Accessibility"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663548.3675599","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:57:16Z","timestamp":1750291036000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663548.3675599"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,27]]},"references-count":89,"alternative-id":["10.1145\/3663548.3675599","10.1145\/3663548"],"URL":"https:\/\/doi.org\/10.1145\/3663548.3675599","relation":{},"subject":[],"published":{"date-parts":[[2024,10,27]]},"assertion":[{"value":"2024-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}