{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T08:13:33Z","timestamp":1765008813294,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":46,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,12,9]]},"DOI":"10.1145\/3743093.3770999","type":"proceedings-article","created":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T08:08:11Z","timestamp":1765008491000},"page":"1-7","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Unifying Generative Self-Supervised Paradigms with Diffusion Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8762-2424","authenticated-orcid":false,"given":"Luping","family":"Zhou","sequence":"first","affiliation":[{"name":"The University of Sydney, Sydney, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7023-0264","authenticated-orcid":false,"given":"Xiaoyu","family":"Yue","sequence":"additional","affiliation":[{"name":"The University of Sydney, Sydney, Australia"}]}],"member":"320","published-online":{"date-parts":[[2025,12,6]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"Jason Antic. 2019. jantic\/deoldify: A deep learning based project for colorizing and restoring old images (and video!). https:\/\/github.com\/jantic\/DeOldify"},{"key":"e_1_3_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19821-2_26"},{"key":"e_1_3_3_1_4_2","unstructured":"Hangbo Bao Li Dong Songhao Piao and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2106.08254 (2021)."},{"key":"e_1_3_3_1_5_2","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared\u00a0D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et\u00a0al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877\u20131901."},{"key":"e_1_3_3_1_6_2","unstructured":"Mathilde Caron Ishan Misra Julien Mairal Priya Goyal Piotr Bojanowski and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems 33 (2020) 9912\u20139924."},{"key":"e_1_3_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"e_1_3_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"e_1_3_3_1_9_2","first-page":"1691","volume-title":"International conference on machine learning","author":"Chen Mark","year":"2020","unstructured":"Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels. In International conference on machine learning. PMLR, 1691\u20131703."},{"key":"e_1_3_3_1_10_2","first-page":"1597","volume-title":"International conference on machine learning","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597\u20131607."},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"crossref","unstructured":"Xiaokang Chen Mingyu Ding Xiaodi Wang Ying Xin Shentong Mo Yunhao Wang Shumin Han Ping Luo Gang Zeng and Jingdong Wang. 2024. Context autoencoder for self-supervised representation learning. International Journal of Computer Vision 132 1 (2024) 208\u2013223.","DOI":"10.1007\/s11263-023-01852-4"},{"key":"e_1_3_3_1_12_2","unstructured":"Xinlei Chen Zhuang Liu Saining Xie and Kaiming He. 2024. Deconstructing denoising diffusion models for self-supervised learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2401.14404 (2024)."},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00950"},{"key":"e_1_3_3_1_14_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1810.04805 (2018)."},{"key":"e_1_3_3_1_15_2","unstructured":"Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34 (2021) 8780\u20138794."},{"key":"e_1_3_3_1_16_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et\u00a0al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2010.11929 (2020)."},{"key":"e_1_3_3_1_17_2","unstructured":"Jean-Bastien Grill Florian Strub Florent Altch\u00e9 Corentin Tallec Pierre Richemond Elena Buchatskaya Carl Doersch Bernardo Avila\u00a0Pires Zhaohan Guo Mohammad Gheshlaghi\u00a0Azar et\u00a0al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33 (2020) 21271\u201321284."},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1117\/12.477378"},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"e_1_3_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_3_1_21_2","unstructured":"Martin Heusel Hubert Ramsauer Thomas Unterthiner Bernhard Nessler and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"crossref","unstructured":"Geoffrey Hinton. 2023. How to represent part-whole hierarchies in a neural network. Neural Computation 35 3 (2023) 413\u2013452.","DOI":"10.1162\/neco_a_01557"},{"key":"e_1_3_3_1_23_2","unstructured":"Jonathan Ho Ajay Jain and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020) 6840\u20136851."},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_39"},{"key":"e_1_3_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19787-1_2"},{"key":"e_1_3_3_1_26_2","unstructured":"Tero Karras Miika Aittala Timo Aila and Samuli Laine. 2022. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35 (2022) 26565\u201326577."},{"key":"e_1_3_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20071-7_21"},{"key":"e_1_3_3_1_28_2","unstructured":"Diederik\u00a0P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1312.6114 (2013)."},{"key":"e_1_3_3_1_29_2","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey\u00a0E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)."},{"key":"e_1_3_3_1_30_2","unstructured":"Manoj Kumar Dirk Weissenborn and Nal Kalchbrenner. 2021. Colorization transformer. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2102.04432 (2021)."},{"key":"e_1_3_3_1_31_2","unstructured":"Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1711.05101 (2017)."},{"key":"e_1_3_3_1_32_2","unstructured":"Alex Nichol Prafulla Dhariwal Aditya Ramesh Pranav Shyam Pamela Mishkin Bob McGrew Ilya Sutskever and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2112.10741 (2021)."},{"key":"e_1_3_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"e_1_3_3_1_34_2","unstructured":"Aditya Ramesh Prafulla Dhariwal Alex Nichol Casey Chu and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2204.06125 1 2 (2022) 3."},{"key":"e_1_3_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3528233.3530757"},{"key":"e_1_3_3_1_37_2","unstructured":"Chitwan Saharia William Chan Saurabh Saxena Lala Li Jay Whang Emily\u00a0L Denton Kamyar Ghasemipour Raphael Gontijo\u00a0Lopes Burcu Karagol\u00a0Ayan Tim Salimans et\u00a0al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35 (2022) 36479\u201336494."},{"key":"e_1_3_3_1_38_2","first-page":"2256","volume-title":"International conference on machine learning","author":"Sohl-Dickstein Jascha","year":"2015","unstructured":"Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning. PMLR, 2256\u20132265."},{"key":"e_1_3_3_1_39_2","unstructured":"Yang Song Jascha Sohl-Dickstein Diederik\u00a0P Kingma Abhishek Kumar Stefano Ermon and Ben Poole. 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2011.13456 (2020)."},{"key":"e_1_3_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00799"},{"key":"e_1_3_3_1_41_2","unstructured":"Shuyang Sun Xiaoyu Yue Song Bai and Philip Torr. 2021. Visual parser: Representing part-whole hierarchies with transformers. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2107.05790 (2021)."},{"key":"e_1_3_3_1_42_2","first-page":"10347","volume-title":"International conference on machine learning","author":"Touvron Hugo","year":"2021","unstructured":"Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv\u00e9 J\u00e9gou. 2021. Training data-efficient image transformers & distillation through attention. In International conference on machine learning. PMLR, 10347\u201310357."},{"key":"e_1_3_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01492"},{"key":"e_1_3_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01448"},{"key":"e_1_3_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00943"},{"key":"e_1_3_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46487-9_40"},{"key":"e_1_3_3_1_47_2","unstructured":"Jinghao Zhou Chen Wei Huiyu Wang Wei Shen Cihang Xie Alan Yuille and Tao Kong. 2021. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2111.07832 (2021)."}],"event":{"name":"MMAsia '25: ACM Multimedia Asia","location":"Kuala Lumpur Malaysia","acronym":"MMAsia '25","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 7th ACM International Conference on Multimedia in Asia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3743093.3770999","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T08:08:34Z","timestamp":1765008514000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3743093.3770999"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,6]]},"references-count":46,"alternative-id":["10.1145\/3743093.3770999","10.1145\/3743093"],"URL":"https:\/\/doi.org\/10.1145\/3743093.3770999","relation":{},"subject":[],"published":{"date-parts":[[2025,12,6]]},"assertion":[{"value":"2025-12-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}