{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,18]],"date-time":"2025-12-18T20:05:46Z","timestamp":1766088346341,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":18,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,10,28]],"date-time":"2024-10-28T00:00:00Z","timestamp":1730073600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,10,28]]},"DOI":"10.1145\/3689094.3689470","type":"proceedings-article","created":{"date-parts":[[2024,10,8]],"date-time":"2024-10-08T18:25:52Z","timestamp":1728411952000},"page":"3-4","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["From Pixels to Preservation: The Power of Large Vision Models in Heritage Content Understanding"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6595-7661","authenticated-orcid":false,"given":"Jing","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Computer Science, The University of Sydney & School of Computer Science, Wuhan University, Sydney, NSW, Australia"}]}],"member":"320","published-online":{"date-parts":[[2024,10,28]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Xinlei Chen Saining Xie and Kaiming He. 2021. An empirical study of training self-supervised vision transformers. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00950"},{"key":"e_1_3_2_1_2_1","volume-title":"Words: Transformers for Image Recognition at Scale. In ICLR.","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al. 2021. 
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR."},{"key":"e_1_3_2_1_3_1","unstructured":"Kaiming He Xinlei Chen Saining Xie Yanghao Li Piotr Doll\u00e1r and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In CVPR."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Alexander Kirillov Eric Mintun Nikhila Ravi Hanzi Mao Chloe Rolland Laura Gustafson Tete Xiao Spencer Whitehead Alexander C Berg Wan-Yen Lo et al. 2023. Segment anything. In ICCV.","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_3_2_1_5_1","unstructured":"Shilong Liu Zhaoyang Zeng Tianhe Ren Feng Li Hao Zhang Jie Yang Chunyuan Li Jianwei Yang Hang Su Jun Zhu et al. 2023. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Ze Liu Yutong Lin Yue Cao Han Hu Yixuan Wei Zheng Zhang Stephen Lin and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_1_7_1","volume-title":"Handrefiner: Refining malformed hands in generated images by diffusion-based conditional inpainting. In ACM Multimedia.","author":"Lu Wenquan","year":"2024","unstructured":"Wenquan Lu, Yufei Xu, Jing Zhang, Chaoyue Wang, and Dacheng Tao. 2024. Handrefiner: Refining malformed hands in generated images by diffusion-based conditional inpainting. In ACM Multimedia."},{"key":"e_1_3_2_1_8_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, et al.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, et al. 2021. Learning transferable visual models from natural language supervision. 
In ICML."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Robin Rombach Andreas Blattmann Dominik Lorenz Patrick Esser et al. 2022. High-resolution image synthesis with latent diffusion models. In CVPR.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_1_10_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones et al. 2017. Attention is All you Need. In NeurIPS."},{"key":"e_1_3_2_1_11_1","first-page":"1","article-title":"Advancing plain vision transformer toward remote sensing foundation model","volume":"61","author":"Wang Di","year":"2022","unstructured":"Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. 2022. Advancing plain vision transformer toward remote sensing foundation model. IEEE Transactions on Geoscience and Remote Sensing 61 (2022), 1--15.","journal-title":"IEEE Transactions on Geoscience and Remote Sensing"},{"key":"e_1_3_2_1_12_1","volume-title":"Vitpose: Simple vision transformer baselines for human pose estimation. In NeurIPS.","author":"Xu Yufei","year":"2022","unstructured":"Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. 2022. Vitpose: Simple vision transformer baselines for human pose estimation. In NeurIPS."},{"key":"e_1_3_2_1_13_1","volume-title":"Vitae: Vision transformer advanced by exploring intrinsic inductive bias. In NeurIPS.","author":"Xu Yufei","year":"2021","unstructured":"Yufei Xu, Qiming Zhang, Jing Zhang, and Dacheng Tao. 2021. Vitae: Vision transformer advanced by exploring intrinsic inductive bias. In NeurIPS."},{"key":"e_1_3_2_1_14_1","volume-title":"Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation. arXiv preprint arXiv:2401.17904","author":"Ye Maoyuan","year":"2024","unstructured":"Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, and Dacheng Tao. 2024. Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation. 
arXiv preprint arXiv:2401.17904 (2024)."},{"key":"e_1_3_2_1_15_1","volume-title":"Deepsolo: Let transformer decoder with explicit points solo for text spotting. In CVPR.","author":"Ye Maoyuan","year":"2023","unstructured":"Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, and Dacheng Tao. 2023. Deepsolo: Let transformer decoder with explicit points solo for text spotting. In CVPR."},{"key":"e_1_3_2_1_16_1","volume-title":"Vsa: Learning varied-size window attention in vision transformers. In ECCV.","author":"Zhang Qiming","year":"2022","unstructured":"Qiming Zhang, Yufei Xu, Jing Zhang, and Dacheng Tao. 2022. Vsa: Learning varied-size window attention in vision transformers. In ECCV."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-022-01739-w"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3347693"}],"event":{"name":"MM '24: The 32nd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Melbourne VIC Australia","acronym":"MM '24"},"container-title":["Proceedings of the 6th workshop on the analySis, Understanding and proMotion of heritAge 
Contents"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689094.3689470","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3689094.3689470","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T18:37:36Z","timestamp":1755974256000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689094.3689470"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,28]]},"references-count":18,"alternative-id":["10.1145\/3689094.3689470","10.1145\/3689094"],"URL":"https:\/\/doi.org\/10.1145\/3689094.3689470","relation":{},"subject":[],"published":{"date-parts":[[2024,10,28]]},"assertion":[{"value":"2024-10-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}