{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T18:47:53Z","timestamp":1761677273477,"version":"3.41.0"},"reference-count":58,"publisher":"Association for Computing Machinery (ACM)","issue":"7","license":[{"start":{"date-parts":[[2024,4,25]],"date-time":"2024-04-25T00:00:00Z","timestamp":1714003200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"SUT"},{"name":"National Collaborative Research Infrastructure Strategy"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,7,31]]},"abstract":"<jats:p>Video-see-through (VST) augmented reality (AR) is widely used to present novel augmentative visual experiences by processing video frames for viewers. Among VST AR systems, assistive vision displays aim to compensate for low vision or blindness, presenting enhanced visual information to support activities of daily living for the vision impaired\/deprived. Despite progress, current assistive displays suffer from a visual information bottleneck, limiting their functional outcomes compared to healthy vision. This motivates the exploration of methods to selectively enhance and augment salient visual information. Traditionally, vision processing pipelines for assistive displays rely on hand-crafted, single-modality filters, lacking adaptability to time-varying and environment-dependent needs. This article proposes the use of Deep Reinforcement Learning (DRL) and Self-attention (SA) networks as a means to learn vision processing pipelines for assistive displays. SA networks selectively attend to task-relevant features, offering a more parameter- and compute-efficient approach to RL-based task learning. 
We assess the feasibility of using SA networks in a simulation-trained model to generate relevant representations of real-world states for navigation with prosthetic vision displays. We explore two prosthetic vision applications, vision-to-auditory encoding and retinal prostheses, using simulated phosphene visualisations. This article introduces SA-px, a general-purpose vision processing pipeline using self-attention networks, and SA-phos, a display-specific formulation targeting low-resolution assistive displays. We present novel scene visualisations derived from SA image patch importance rankings to support mobility with prosthetic vision devices. To the best of our knowledge, this is the first application of self-attention networks to the task of learning vision processing pipelines for prosthetic vision or assistive displays.<\/jats:p>","DOI":"10.1145\/3650111","type":"journal-article","created":{"date-parts":[[2024,3,2]],"date-time":"2024-03-02T11:22:26Z","timestamp":1709378546000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Learning Scene Representations for Human-assistive Displays Using Self-attention Networks"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0220-3253","authenticated-orcid":false,"given":"Jaime","family":"Ruiz-Serra","sequence":"first","affiliation":[{"name":"Swinburne University of Technology, Hawthorn, Australia and The University of Sydney, Darlington, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1273-1786","authenticated-orcid":false,"given":"Jack","family":"White","sequence":"additional","affiliation":[{"name":"Swinburne University of Technology, Hawthorn, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8773-516X","authenticated-orcid":false,"given":"Stephen","family":"Petrie","sequence":"additional","affiliation":[{"name":"Swinburne University of Technology, Hawthorn, 
Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0081-8569","authenticated-orcid":false,"given":"Tatiana","family":"Kameneva","sequence":"additional","affiliation":[{"name":"Swinburne University of Technology, Hawthorn, Australia and The University of Melbourne, Parkville, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3848-1631","authenticated-orcid":false,"given":"Chris","family":"McCarthy","sequence":"additional","affiliation":[{"name":"Swinburne University of Technology, Hawthorn, Australia"}]}],"member":"320","published-online":{"date-parts":[[2024,4,25]]},"reference":[{"key":"e_1_3_3_2_2","first-page":"19","volume-title":"Proceedings of the IASTED International Conference on Signal Processing & Communication","author":"Agaian Sos S.","year":"2000","unstructured":"Sos S. Agaian, Karen Panetta, and Artyom M. Grigoryan. 2000. A new measure of image enhancement. In Proceedings of the IASTED International Conference on Signal Processing & Communication. Citeseer, 19\u201322."},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.clinph.2019.11.029"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2013.6738315"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1088\/1741-2560\/13\/3\/036013"},{"issue":"7","key":"e_1_3_3_6_2","first-page":"755","article-title":"Enhancing object contrast using augmented depth improves mobility in patients implanted with a retinal prosthesis","volume":"56","author":"Barnes Nick M.","year":"2015","unstructured":"Nick M. Barnes, Adele F. Scott, Ashley Stacey, Chris McCarthy, David Feng, Matthew A. Petoe, Lauren N. Ayton, Rebecca Dengate, Robyn H. Guymer, and Janine Walker. 2015. Enhancing object contrast using augmented depth improves mobility in patients implanted with a retinal prosthesis. 
Investigative Ophthalmology & Visual Science 56, 7 (2015), 755\u2013755.","journal-title":"Investigative Ophthalmology & Visual Science"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.3389\/fnsys.2016.00041"},{"key":"e_1_3_3_8_2","unstructured":"Charles Beattie Joel Z. Leibo Denis Teplyashin Tom Ward Marcus Wainwright Heinrich K\u00fcttler Andrew Lefrancq Simon Green V\u00edctor Vald\u00e9s Amir Sadik Julian Schrittwieser Keith Anderson Sarah York Max Cant Adam Cain Adrian Bolton Stephen Gaffney Helen King Demis Hassabis Shane Legg and Stig Petersen. 2016. DeepMind lab. (2016) 1\u201311."},{"key":"e_1_3_3_9_2","first-page":"427","volume-title":"Lecture Notes in Computer Science","author":"Bermudez-Cameo Jesus","year":"2017","unstructured":"Jesus Bermudez-Cameo, Alberto Badias-Herbera, Manuel Guerrero-Viu, Gonzalo Lopez-Nicolas, and Jose J. Guerrero. 2017. RGB-D computer vision techniques for simulated prosthetic vision. Lecture Notes in Computer Science, Vol. 10255 LNCS. Springer International Publishing, Cham, 427\u2013436."},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICMLA51294.2020.00193"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1177\/2515841418817501"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1068\/p6952"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3465055"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISMAR-Adjunct.2017.89"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1111\/opo.12646"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1142\/S0219635205000914"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1167\/jov.22.2.20"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ajo.2016.12.021"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.257"},{"key":"e_1_3_3_20_2","unstructured":"Alec Gorjestani Lee Alexander Bryan Newstrom Pi-Ming 
Cheng Mike Sergi Craig Shankwitz and Max Donath. 2003. Driver assistive systems for snowplows. (March2003)."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1163\/22134808-000S0084"},{"key":"e_1_3_3_22_2","unstructured":"Nikolaus Hansen. 2016. The CMA evolution strategy: A tutorial. (2016)."},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2016.02.015"},{"key":"e_1_3_3_24_2","first-page":"3379","volume-title":"Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society","volume":"2015","author":"Horne Lachlan","year":"2015","unstructured":"Lachlan Horne, Jose M. Alvarez, Chris McCarthy, and Nick Barnes. 2015. Semantic labelling to aid navigation in prosthetic vision. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vol. 2015-Novem. Institute of Electrical and Electronics Engineers Inc., 3379\u20133382."},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.visres.2017.06.002"},{"issue":"8","key":"e_1_3_3_26_2","first-page":"3168","article-title":"A 44 channel suprachoroidal retinal prosthesis: Laboratory based visual function and functional vision outcomes.","volume":"62","author":"Kolic Maria","year":"2021","unstructured":"Maria Kolic, Elizabeth K. Baglin, Samuel A. Titchener, Jessica Kvansakul, Carla J. Abbott, Nick Barnes, Myra McGuinness, William G. Kentler, Kiera Young, Janine Walker, Jonathan Yeoh, David A. X. Nayagam, Chi D. Luu, Lauren N. Ayton, Matthew A. Petoe, and Penelope J. Allen. 2021. A 44 channel suprachoroidal retinal prosthesis: Laboratory based visual function and functional vision outcomes. 
Investigative Ophthalmology & Visual Science 62, 8 (June2021), 3168.","journal-title":"Investigative Ophthalmology & Visual Science"},{"issue":"2","key":"e_1_3_3_27_2","first-page":"101","article-title":"Depth matters: Influence of depth cues on visual saliency","volume":"7573","author":"Lang Congyan","year":"2012","unstructured":"Congyan Lang, Tam V. Nguyen, Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Shuicheng Yan. 2012. Depth matters: Influence of depth cues on visual saliency. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7573 LNCS, PART 2 (2012), 101\u2013115.","journal-title":"Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2016.03.019"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2016.09.001"},{"key":"e_1_3_3_30_2","article-title":"igibson 2.0: Object-centric simulation for robot learning of everyday household tasks","author":"Li Chengshu","year":"2021","unstructured":"Chengshu Li, Fei Xia, Roberto Mart\u00edn-Mart\u00edn, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, Andrey Kurenkov, C. Karen Liu, Hyowon Gweon, Jiajun Wu, Li Fei-Fei, and Silvio Savarese. 2021. igibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv:2108.03272 . 
Retrieved from https:\/\/arxiv.org\/abs\/2108.03272","journal-title":"arXiv:2108.03272"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/EMBC.2012.6346585"},{"key":"e_1_3_3_32_2","first-page":"8017","article-title":"Substituting depth for intensity and real-time phosphene rendering: Visual navigation under low vision conditions","author":"Lieby Paulette","year":"2011","unstructured":"Paulette Lieby, Nick Barnes, Chris McCarthy, Nianjun Liu, Hugh Dennett, Janine G. Walker, Viorica Botea, and Adele F. Scott. 2011. Substituting depth for intensity and real-time phosphene rendering: Visual navigation under low vision conditions. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2011), 8017\u20138020.","journal-title":"In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society"},{"key":"e_1_3_3_33_2","article-title":"TranSalNet: Towards perceptually relevant visual saliency prediction","author":"Lou Jianxun","year":"2022","unstructured":"Jianxun Lou, Hanhe Lin, David Marshall, Dietmar Saupe, and Hantao Liu. 2022. TranSalNet: Towards perceptually relevant visual saliency prediction. arXiv:2110.03593. Retrieved from https:\/\/arxiv.org\/abs\/2110.03593","journal-title":"arXiv:2110.03593"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.preteyeres.2015.09.003"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1111\/aor.12476"},{"key":"e_1_3_3_36_2","article-title":"Time-to-contact maps for navigation with a low resolution visual prosthesis","author":"McCarthy Chris","year":"2012","unstructured":"Chris McCarthy and Nick Barnes. 2012. Time-to-contact maps for navigation with a low resolution visual prosthesis. 
In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2012).","journal-title":"In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISMAR.2014.6948408"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1088\/1741-2560\/12\/1\/016003"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/10.121642"},{"key":"e_1_3_3_40_2","first-page":"1928","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Mnih Volodymyr","year":"2016","unstructured":"Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, 1928\u20131937."},{"issue":"8","key":"e_1_3_3_41_2","first-page":"4616","article-title":"Navigational outcomes with a depth-based vision processing method in a second generation suprachoroidal retinal prosthesis","volume":"64","author":"Moussallem Lauren","year":"2023","unstructured":"Lauren Moussallem, Lisa Lombardi, Matthew A. Petoe, Rui Jin, Maria Kolic, Elizabeth K. Baglin, Carla J. Abbott, Janine G. Walker, Nick Barnes, and Penelope J. Allen. 2023. Navigational outcomes with a depth-based vision processing method in a second generation suprachoroidal retinal prosthesis. 
Investigative Ophthalmology & Visual Science 64, 8 (2023), 4616\u20134616.","journal-title":"Investigative Ophthalmology & Visual Science"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocn.2020.05.041"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.179"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00266"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0227677"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ajo.2019.10.005"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0893-6080(96)00127-X"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1068\/p281059"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298655"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1098\/rspb.2013.0077"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1586\/17434440.2014.862494"},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377930.3389847"},{"key":"e_1_3_3_53_2","first-page":"5999","article-title":"Attention is all you need","volume":"2017","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 2017-December, Nips (2017), 5999\u20136009.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_54_2","article-title":"Generative neuroevolution for deep learning","author":"Verbancsics Phillip","year":"2013","unstructured":"Phillip Verbancsics and Josh Harguess. 2013. Generative neuroevolution for deep learning. arXiv:1312.5355 . 
Retrieved from https:\/\/arxiv.org\/abs\/1312.5355","journal-title":"arXiv:1312.5355"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1111\/aor.12868"},{"key":"e_1_3_3_56_2","first-page":"2809","volume-title":"Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society","author":"White Jack","year":"2019","unstructured":"Jack White, Tatiana Kameneva, and Chris McCarthy. 2019. Deep reinforcement learning for task-based feature learning in prosthetic vision. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 2809\u20132812."},{"key":"e_1_3_3_57_2","first-page":"1","article-title":"Vision processing for assistive vision: A deep reinforcement learning approach","author":"White Jack","year":"2021","unstructured":"Jack White, Tatiana Kameneva, and Chris McCarthy. 2021. Vision processing for assistive vision: A deep reinforcement learning approach. IEEE Transactions on Human-Machine Systems (2021), 1\u201311.","journal-title":"IEEE Transactions on Human-Machine Systems"},{"key":"e_1_3_3_58_2","volume-title":"Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society","author":"White Jack.","year":"2023","unstructured":"Jack. White, Jaime. Ruiz-Serra, Stephen Petrie, Tatiana Kameneva, and Chris McCarthy. 2023. Self-attention based vision processing for prosthetic vision. 
In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society."},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1111\/aor.12504"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3650111","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3650111","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:31Z","timestamp":1750291411000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3650111"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,25]]},"references-count":58,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2024,7,31]]}},"alternative-id":["10.1145\/3650111"],"URL":"https:\/\/doi.org\/10.1145\/3650111","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,4,25]]},"assertion":[{"value":"2023-08-04","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-02-25","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}