{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:39:10Z","timestamp":1772725150023,"version":"3.50.1"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,7,4]],"date-time":"2022-07-04T00:00:00Z","timestamp":1656892800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Interact. Mob. Wearable Ubiquitous Technol."],"published-print":{"date-parts":[[2022,7,4]]},"abstract":"<jats:p>Visual Question Answering (VQA) is a relatively new task where a user can ask a natural question about an image and obtain an answer. VQA is useful for many applications and is widely popular for users with visual impairments. Our goal is to design a VQA application that works efficiently on mobile devices without requiring cloud support. Such a system will allow users to ask visual questions privately, without having to send their questions to the cloud, while also reduce cloud communication costs. However, existing VQA applications use deep learning models that significantly improve accuracy, but is computationally heavy. Unfortunately, existing techniques that optimize deep learning for mobile devices cannot be applied for VQA because the VQA task is multi-modal---it requires both processing vision and text data. Existing mobile optimizations that work for vision-only or text-only neural networks cannot be applied here because of the dependencies between the two modes. Instead, we design MobiVQA, a set of optimizations that leverage the multi-modal nature of VQA. We show using extensive evaluation on two VQA testbeds and two mobile platforms, that MobiVQA significantly improves latency and energy with minimal accuracy loss compared to state-of-the-art VQA models. For instance, MobiVQA can answer a visual question in 163 milliseconds on the phone, compared to over 20-second latency incurred by the most accurate state-of-the-art model, while incurring less than 1 point reduction in accuracy.<\/jats:p>","DOI":"10.1145\/3534619","type":"journal-article","created":{"date-parts":[[2022,7,7]],"date-time":"2022-07-07T18:50:18Z","timestamp":1657219818000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["MobiVQA"],"prefix":"10.1145","volume":"6","author":[{"given":"Qingqing","family":"Cao","sequence":"first","affiliation":[{"name":"Stony Brook University, Stony Brook, NY, USA"}]},{"given":"Prerna","family":"Khanna","sequence":"additional","affiliation":[{"name":"Stony Brook University, Stony Brook, NY, USA"}]},{"given":"Nicholas D.","family":"Lane","sequence":"additional","affiliation":[{"name":"University of Cambridge &amp; Samsung AI, Cambridge, United Kingdom"}]},{"given":"Aruna","family":"Balasubramanian","sequence":"additional","affiliation":[{"name":"Stony Brook University, Stony Brook, NY, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,7,7]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. Bing delivers its largest improvement in search experience using Azure GPUs. https:\/\/azure.microsoft.com\/en-us\/blog\/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus\/"},{"key":"e_1_2_1_2_1","unstructured":"[n.d.]. DumpSys. ([n.d.]). 
https:\/\/developer.android.com\/studio\/command-line\/dumpsys.html"},{"key":"e_1_2_1_3_1","unstructured":"[n.d.]. Name that tune: Brain takes just 100 to 300 milliseconds to recognize familiar music. https:\/\/www.sciencedaily.com\/releases\/2019\/10\/191030073312.htm"},{"key":"e_1_2_1_4_1","unstructured":"2021. Pixel 3 XL. https:\/\/en.wikipedia.org\/w\/index.php?title=Pixel_3&oldid=1062029816 Page Version ID: 1062029816."},{"key":"e_1_2_1_5_1","first-page":"71","article-title":"World blindness and visual impairment: despite many successes, the problem is growing","volume":"30","author":"Ackland Peter","year":"2017","unstructured":"Peter Ackland, Serge Resnikoff, and Rupert Bourne. 2017. World blindness and visual impairment: despite many successes, the problem is growing. Community Eye Health 30, 100 (2017), 71--73. https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC5820628\/","journal-title":"Community Eye Health"},{"key":"e_1_2_1_6_1","volume-title":"VQA: Visual Question Answering. arXiv:1505.00468 [cs] (Oct.","author":"Agrawal Aishwarya","year":"2016","unstructured":"Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. 2016. VQA: Visual Question Answering. arXiv:1505.00468 [cs] (Oct. 2016). http:\/\/arxiv.org\/abs\/1505.00468"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/VLSI-SoC.2018.8644937"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2470654.2481291"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00408"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3089801.3089804"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3469116.3470011"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307334.3326071"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Tai-Yin Chiu Yinan Zhao and Danna Gurari. 2020. Assessing Image Quality Issues for Real-World Problems. 3646--3656. https:\/\/openaccess.thecvf.com\/content_CVPR_2020\/html\/Chiu_Assessing_Image_Quality_Issues_for_Real-World_Problems_CVPR_2020_paper.html","DOI":"10.1109\/CVPR42600.2020.00370"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.707"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1117\/12.2518469"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_2_1_19_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","unstructured":"William Falcon and The PyTorch Lightning team. 2019. PyTorch Lightning. https:\/\/doi.org\/10.5281\/zenodo.3828935","DOI":"10.5281\/zenodo.3828935"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Zhiyuan Fang Jianfeng Wang Xiaowei Hu Lijuan Wang Yezhou Yang and Zicheng Liu. 2021. Compressing Visual-Linguistic Model via Knowledge Distillation. 1428--1438. 
https:\/\/openaccess.thecvf.com\/content\/ICCV2021\/html\/Fang_Compressing_Visual-Linguistic_Model_via_Knowledge_Distillation_ICCV_2021_paper.html","DOI":"10.1109\/ICCV48922.2021.00146"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.775"},{"key":"e_1_2_1_23_1","first-page":"2640","volume-title":"Proceedings of the 37th International Conference on Machine Learning. PMLR, 3690--3699","author":"Goyal Saurabh","year":"2020","unstructured":"Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma. 2020. PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 3690--3699. https:\/\/proceedings.mlr.press\/v119\/goyal20a.html ISSN: 2640-3498."},{"key":"e_1_2_1_24_1","doi-asserted-by":"crossref","unstructured":"Yash Goyal Tejas Khot Douglas Summers-Stay Dhruv Batra and Devi Parikh. 2017. Making the v in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. 6904--6913. https:\/\/openaccess.thecvf.com\/content_cvpr_2017\/html\/Goyal_Making_the_v_CVPR_2017_paper.html","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_2_1_25_1","volume-title":"Bigham","author":"Gurari Danna","year":"2019","unstructured":"Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, and Jeffrey P. Bigham. 2019. VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People. 939--948. https:\/\/openaccess.thecvf.com\/content_CVPR_2019\/html\/Gurari_VizWiz-Priv_A_Dataset_for_Recognizing_the_Presence_and_Purpose_of_CVPR_2019_paper.html"},{"key":"e_1_2_1_26_1","volume-title":"Bigham","author":"Gurari Danna","year":"2018","unstructured":"Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz Grand Challenge: Answering Visual Questions From Blind People. 3608--3617. https:\/\/openaccess.thecvf.com\/content_cvpr_2018\/html\/Gurari_VizWiz_Grand_Challenge_CVPR_2018_paper.html"},{"key":"e_1_2_1_27_1","unstructured":"S. Han H. Mao and W. J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning trained quantization and huffman coding. ArXiv e-prints (Oct. 2015). arXiv: 1510.00149 [cs.CV] tex.adsnote: Provided by the SAO\/NASA Astrophysics Data System tex.adsurl: http:\/\/adsabs.harvard.edu\/abs\/2015arXiv151000149H."},{"key":"e_1_2_1_28_1","unstructured":"Gao Huang Danlu Chen Tianhong Li Felix Wu Laurens van der Maaten and Kilian Weinberger. 2018. Multi-Scale Dense Networks for Resource Efficient Image Classification. https:\/\/openreview.net\/forum?id=Hk2aImxAb"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3081333.3081360"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2010.57"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01028"},{"key":"e_1_2_1_32_1","volume-title":"Shallow-Deep Networks: Understanding and Mitigating Network Over-thinking. arXiv:1810.07052 [cs, stat] (May","author":"Kaya Yigitcan","year":"2019","unstructured":"Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-Deep Networks: Understanding and Mitigating Network Over-thinking. arXiv:1810.07052 [cs, stat] (May 2019). http:\/\/arxiv.org\/abs\/1810.07052 arXiv: 1810.07052."},{"key":"e_1_2_1_33_1","volume-title":"Learned Token Pruning for Transformers. 
arXiv:2107.00910 [cs] (July","author":"Kim Sehoon","year":"2021","unstructured":"Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Joseph Hassoun, and Kurt Keutzer. 2021. Learned Token Pruning for Transformers. arXiv:2107.00910 [cs] (July 2021). http:\/\/arxiv.org\/abs\/2107.00910 arXiv: 2107.00910."},{"key":"e_1_2_1_34_1","volume-title":"International Conference on Machine Learning. PMLR, 5583--5594","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning. PMLR, 5583--5594."},{"key":"e_1_2_1_35_1","first-page":"2640","volume-title":"Proceedings of the 38th International Conference on Machine Learning. PMLR, 5583--5594","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 5583--5594. https:\/\/proceedings.mlr.press\/v139\/kim21k.html ISSN: 2640-3498."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPSN.2016.7460664"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2750858.2804262"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372224.3419194"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3300061.3345455"},{"key":"e_1_2_1_41_1","volume-title":"VisualBERT: A Simple and Performant Baseline for Vision and Language. (Aug","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. (Aug. 2019). https:\/\/arxiv.org\/abs\/1908.03557v1"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458864.3467884"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3210240.3210337"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.537"},{"key":"e_1_2_1_46_1","volume-title":"Split Computing and Early Exiting for Deep Learning Applications: Survey and Research Challenges. arXiv:2103.04505 [cs, eess] (March","author":"Matsubara Yoshitomo","year":"2021","unstructured":"Yoshitomo Matsubara, Marco Levorato, and Francesco Restuccia. 2021. Split Computing and Early Exiting for Deep Learning Applications: Survey and Research Challenges. arXiv:2103.04505 [cs, eess] (March 2021). http:\/\/arxiv.org\/abs\/2103.04505"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2380116.2380174"},{"key":"e_1_2_1_48_1","unstructured":"PyTorch. 2018. PyTorch. https:\/\/pytorch.org\/."},{"key":"e_1_2_1_49_1","volume-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs] (Jan","author":"Ren Shaoqing","year":"2016","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs] (Jan. 2016). 
http:\/\/arxiv.org\/abs\/1506.01497"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.593"},{"key":"e_1_2_1_51_1","volume-title":"MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. arXiv:2004.02984 [cs] (April","author":"Sun Zhiqing","year":"2020","unstructured":"Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. arXiv:2004.02984 [cs] (April 2020). http:\/\/arxiv.org\/abs\/2004.02984"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","unstructured":"Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics Hong Kong China 5100--5111. https:\/\/doi.org\/10.18653\/v1\/D19-1514","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2016.7900006"},{"key":"e_1_2_1_54_1","volume-title":"https:\/\/devblogs.nvidia.com\/jetson-tx2-delivers-twice-intelligence-edge\/","author":"Nvidia","year":"2018","unstructured":"Nvidia TX2. 2018. Nvidia TX2. (2018). https:\/\/devblogs.nvidia.com\/jetson-tx2-delivers-twice-intelligence-edge\/"},{"key":"e_1_2_1_55_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008."},{"key":"e_1_2_1_56_1","volume-title":"SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. arXiv:2012.09852 [cs] (Jan","author":"Wang Hanrui","year":"2021","unstructured":"Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. arXiv:2012.09852 [cs] (Jan. 2021). http:\/\/arxiv.org\/abs\/2012.09852 arXiv: 2012.09852."},{"key":"e_1_2_1_57_1","volume-title":"MiniVLM: A Smaller and Faster Vision-Language Model. arXiv:2012.06946 [cs] (Dec","author":"Wang Jianfeng","year":"2020","unstructured":"Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. 2020. MiniVLM: A Smaller and Faster Vision-Language Model. arXiv:2012.06946 [cs] (Dec. 2020). http:\/\/arxiv.org\/abs\/2012.06946 arXiv: 2012.06946."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.521"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.521"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.sustainlp-1.11"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.204"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308561.3353806"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3241539.3241563"},{"key":"e_1_2_1_64_1","volume-title":"Visual Question Answer Diversity. In Sixth AAAI Conference on Human Computation and Crowdsourcing. https:\/\/www.aaai.org\/ocs\/index.php\/HCOMP\/HCOMP18\/paper\/view\/17936","author":"Yang Chun-Ju","year":"2018","unstructured":"Chun-Ju Yang, Kristen Grauman, and Danna Gurari. 2018. Visual Question Answer Diversity. In Sixth AAAI Conference on Human Computation and Crowdsourcing. 
https:\/\/www.aaai.org\/ocs\/index.php\/HCOMP\/HCOMP18\/paper\/view\/17936"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/cvpr.2017.643"},{"key":"e_1_2_1_66_1","doi-asserted-by":"crossref","unstructured":"Pengchuan Zhang Xiujun Li Xiaowei Hu Jianwei Yang Lei Zhang Lijuan Wang Yejin Choi and Jianfeng Gao. 2021. VinVL: Revisiting Visual Representations in Vision-Language Models. 5579--5588. https:\/\/openaccess.thecvf.com\/content\/CVPR2021\/html\/Zhang_VinVL_Revisiting_Visual_Representations_in_Vision-Language_Models_CVPR_2021_paper.html","DOI":"10.1109\/CVPR46437.2021.00553"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447993.3448628"},{"key":"e_1_2_1_68_1","volume-title":"Lin (Eds.)","volume":"33","author":"Zhou Wangchunshu","year":"2020","unstructured":"Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020. BERT loses patience: Fast and robust inference with early exit. In Advances in neural information processing systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 18330--18341. https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/d4dd111a4fd973394238aca5c05bebe3-Paper.pdf"}],"container-title":["Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3534619","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3534619","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T04:31:09Z","timestamp":1752467469000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3534619"}},"subtitle":["Efficient On-Device Visual Question Answering"],"short-title":[],"issued":{"date-parts":[[2022,7,4]]},"references-count":68,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,7,4]]}},"alternative-id":["10.1145\/3534619"],"URL":"https:\/\/doi.org\/10.1145\/3534619","relation":{},"ISSN":["2474-9567"],"issn-type":[{"value":"2474-9567","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,7,4]]},"assertion":[{"value":"2022-07-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}