{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:16:41Z","timestamp":1750220201746,"version":"3.41.0"},"reference-count":79,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2022,7,1]],"date-time":"2022-07-01T00:00:00Z","timestamp":1656633600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100026024","name":"Adobe Research","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100026024","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2022,7]]},"abstract":"<jats:p>We present ASSET, a neural architecture for automatically modifying an input high-resolution image according to a user's edits on its semantic segmentation map. Our architecture is based on a transformer with a novel attention mechanism. Our key idea is to sparsify the transformer's attention matrix at high resolutions, guided by dense attention extracted at lower image resolutions. While previous attention mechanisms are computationally too expensive for handling high-resolution images or are overly constrained within specific image regions hampering long-range interactions, our novel attention mechanism is both computationally efficient and effective. Our sparsified attention mechanism is able to capture long-range interactions and context, leading to synthesizing interesting phenomena in scenes, such as reflections of landscapes onto water or flora consistent with the rest of the landscape, that were not possible to generate reliably with previous convnets and transformer approaches. We present qualitative and quantitative results, along with user studies, demonstrating the effectiveness of our method. 
Our code and dataset are available at our project page: https:\/\/github.com\/DifanLiu\/ASSET<\/jats:p>","DOI":"10.1145\/3528223.3530172","type":"journal-article","created":{"date-parts":[[2022,7,22]],"date-time":"2022-07-22T21:06:27Z","timestamp":1658523987000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["ASSET"],"prefix":"10.1145","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5971-2748","authenticated-orcid":false,"given":"Difan","family":"Liu","sequence":"first","affiliation":[{"name":"UMass Amherst and Adobe Research"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0331-4809","authenticated-orcid":false,"given":"Sandesh","family":"Shetty","sequence":"additional","affiliation":[{"name":"UMass Amherst"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1354-1562","authenticated-orcid":false,"given":"Tobias","family":"Hinz","sequence":"additional","affiliation":[{"name":"Adobe Research"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8908-3417","authenticated-orcid":false,"given":"Matthew","family":"Fisher","sequence":"additional","affiliation":[{"name":"Adobe Research"}]},{"given":"Richard","family":"Zhang","sequence":"additional","affiliation":[{"name":"Adobe Research"}]},{"given":"Taesung","family":"Park","sequence":"additional","affiliation":[{"name":"Adobe Research"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5867-5735","authenticated-orcid":false,"given":"Evangelos","family":"Kalogerakis","sequence":"additional","affiliation":[{"name":"UMass Amherst"}]}],"member":"320","published-online":{"date-parts":[[2022,7,22]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.385"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3323023"},{"key":"e_1_2_2_3_1","volume-title":"Longformer: The long-document transformer. 
arXiv:2004.05150.","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150."},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00132"},{"key":"e_1_2_2_5_1","volume-title":"Proc. NeurIPS.","author":"Cao Chenjie","year":"2021","unstructured":"Chenjie Cao, Yuxin Hong, Xiang Li, Chengrong Wang, Chengming Xu, XiangYang Xue, and Yanwei Fu. 2021. The Image Local Autoregressive Transformer. In Proc. NeurIPS."},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2980179.2982423"},{"key":"e_1_2_2_7_1","article-title":"Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs","volume":"40","author":"Chen Liang-Chieh","year":"2017","unstructured":"Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017).","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_2_2_8_1","volume-title":"Proc. ICML.","author":"Chen Mark","year":"2020","unstructured":"Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels. In Proc. ICML."},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.168"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413551"},{"key":"e_1_2_2_11_1","unstructured":"Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv:1904.10509."},{"key":"e_1_2_2_12_1","volume-title":"Proc. 
NeurIPS.","author":"Chu Xiangxiang","year":"2021","unstructured":"Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021a. Twins: Revisiting the design of spatial attention in vision transformers. In Proc. NeurIPS."},{"key":"e_1_2_2_13_1","unstructured":"Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. 2021b. Conditional positional encodings for vision transformers. arXiv:2102.10882."},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00526"},{"key":"e_1_2_2_15_1","volume-title":"Proc. ICLR.","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. ICLR."},{"key":"e_1_2_2_16_1","volume-title":"Proc. NeurIPS.","author":"Esser Patrick","year":"2021","unstructured":"Patrick Esser, Robin Rombach, Andreas Blattmann, and Bj\u00f6rn Ommer. 2021b. Image-BART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis. In Proc. NeurIPS."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01268"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.127"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00355"},{"key":"e_1_2_2_20_1","volume-title":"Proc. NeurIPS.","author":"Heusel Martin","year":"2017","unstructured":"Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proc. NeurIPS."},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00134"},{"key":"e_1_2_2_22_1","volume-title":"Proc. 
ICLR.","author":"Hinz Tobias","year":"2019","unstructured":"Tobias Hinz, Stefan Heinrich, and Stefan Wermter. 2019. Generating multiple objects at spatially distinct locations. In Proc. ICLR."},{"key":"e_1_2_2_23_1","volume-title":"Proc. ICLR.","author":"Holtzman Ari","year":"2019","unstructured":"Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In Proc. ICLR."},{"key":"e_1_2_2_24_1","volume-title":"Proc. NeurIPS.","author":"Hong Seunghoon","year":"2018","unstructured":"Seunghoon Hong, Xinchen Yan, Thomas Huang, and Honglak Lee. 2018. Learning hierarchical semantic image manipulation through structured representations. In Proc. NeurIPS."},{"key":"e_1_2_2_25_1","volume-title":"Proc. ECCV.","author":"Hui Tak-Wai","year":"2016","unstructured":"Tak-Wai Hui, Chen Change Loy, and Xiaoou Tang. 2016. Depth Map Super-Resolution by Deep Multi-Scale Guidance. In Proc. ECCV."},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.632"},{"key":"e_1_2_2_27_1","volume-title":"Proc. NeurIPS.","author":"Jiang Yifan","year":"2021","unstructured":"Yifan Jiang, Shiyu Chang, and Zhangyang Wang. 2021. TransGAN: Two Transformers Can Make One Strong GAN. In Proc. NeurIPS."},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00183"},{"key":"e_1_2_2_29_1","volume-title":"Proc. ICLR.","author":"Kitaev Nikita","year":"2020","unstructured":"Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proc. ICLR."},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1276377.1276497"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00559"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"e_1_2_2_33_1","volume-title":"Proc. 
NeurIPS.","author":"Ling Huan","year":"2021","unstructured":"Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, and Sanja Fidler. 2021. EditGAN: High-Precision Semantic Image Editing. In Proc. NeurIPS."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_6"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00925"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01062"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.29"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00892"},{"key":"e_1_2_2_40_1","volume-title":"Proc. NeurIPS.","author":"Nam Seonghyeon","year":"2018","unstructured":"Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. 2018. Text-adaptive generative adversarial networks: manipulating images with natural language. In Proc. NeurIPS."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58542-6_24"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126423"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00244"},{"key":"e_1_2_2_44_1","volume-title":"Proc. ICML.","author":"Parmar Niki","year":"2018","unstructured":"Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In Proc. ICML."},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00209"},{"key":"e_1_2_2_46_1","volume-title":"Proc. ICML.","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proc. 
ICML."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00467"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01464"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00329"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475326"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00323"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00787"},{"key":"e_1_2_2_53_1","unstructured":"Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv:2009.06732."},{"key":"e_1_2_2_54_1","volume-title":"Proc. ICML.","author":"Tulsiani Shubham","year":"2021","unstructured":"Shubham Tulsiani and Abhinav Gupta. 2021. PixelTransformer: Sample Conditioned Signal Generation. In Proc. ICML."},{"key":"e_1_2_2_55_1","volume-title":"Proc. NeurIPS.","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS."},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01351"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00465"},{"key":"e_1_2_2_58_1","volume-title":"Linformer: Self-attention with linear complexity. arXiv:2006.04768.","author":"Wang Sinong","year":"2020","unstructured":"Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv:2006.04768."},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2003.819861"},{"key":"e_1_2_2_62_1","volume-title":"Proc. 
CVPR.","author":"Yang Chao","year":"2017","unstructured":"Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. 2017. High-resolution image inpainting using multi-scale neural patch synthesis. In Proc. CVPR."},{"key":"e_1_2_2_63_1","volume-title":"Proc. NeurIPS.","author":"Yang Jianwei","year":"2021","unstructured":"Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. 2021. Focal Self-attention for Local-Global Interactions in Vision Transformers. In Proc. NeurIPS."},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2014.2329776"},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2007.383211"},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00577"},{"key":"e_1_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00457"},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6967"},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475436"},{"key":"e_1_2_2_70_1","volume-title":"Proc. NeurIPS.","author":"Zaheer Manzil","year":"2020","unstructured":"Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for Longer Sequences. In Proc. NeurIPS."},{"key":"e_1_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00299"},{"key":"e_1_2_2_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00519"},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00068"},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00153"},{"key":"e_1_2_2_75_1","unstructured":"Haitian Zheng, Zhe Lin, Jingwan Lu, Scott Cohen, Jianming Zhang, Ning Xu, and Jiebo Luo. 2021. Semantic Layout Manipulation with High-Resolution Sparse Attention. 
arXiv:2012.07288."},{"key":"e_1_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.544"},{"key":"e_1_2_2_77_1","unstructured":"Xingran Zhou Bo Zhang Ting Zhang Pan Zhang Jianmin Bao Dong Chen Zhongfei"},{"key":"e_1_2_2_78_1","volume-title":"Proc. CVPR.","author":"Wen Fang","year":"2021","unstructured":"Zhang, and Fang Wen. 2021. CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation. In Proc. CVPR."},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00515"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3528223.3530172","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3528223.3530172","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:49Z","timestamp":1750186969000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3528223.3530172"}},"subtitle":["autoregressive semantic scene editing with transformers at high resolutions"],"short-title":[],"issued":{"date-parts":[[2022,7]]},"references-count":79,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,7]]}},"alternative-id":["10.1145\/3528223.3530172"],"URL":"https:\/\/doi.org\/10.1145\/3528223.3530172","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"type":"print","value":"0730-0301"},{"type":"electronic","value":"1557-7368"}],"subject":[],"published":{"date-parts":[[2022,7]]},"assertion":[{"value":"2022-07-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}