{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T11:12:15Z","timestamp":1770981135188,"version":"3.50.1"},"reference-count":48,"publisher":"MDPI AG","issue":"19","license":[{"start":{"date-parts":[[2022,9,29]],"date-time":"2022-09-29T00:00:00Z","timestamp":1664409600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["NRF-2020R1A2C1008753"],"award-info":[{"award-number":["NRF-2020R1A2C1008753"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["GCU-202110020001"],"award-info":[{"award-number":["GCU-202110020001"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Gachon University research fund of 2022","award":["NRF-2020R1A2C1008753"],"award-info":[{"award-number":["NRF-2020R1A2C1008753"]}]},{"name":"Gachon University research fund of 2022","award":["GCU-202110020001"],"award-info":[{"award-number":["GCU-202110020001"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Hint-based image colorization is an image-to-image translation task that aims at creating a full-color image from an input luminance image when a small set of color values for some pixels are given as hints. Though traditional deep-learning-based methods have been proposed in the literature, they are based on convolution neural networks (CNNs) that have strong spatial locality due to the convolution operations. This often causes non-trivial visual artifacts in the colorization results, such as false color and color bleeding artifacts. To overcome this limitation, this study proposes a vision transformer-based colorization network. The proposed hint-based colorization network has a hierarchical vision transformer architecture in the form of an encoder-decoder structure based on transformer blocks. As the proposed method uses the transformer blocks that can learn rich long-range dependency, it can achieve visually plausible colorization results, even with a small number of color hints. Through the verification experiments, the results reveal that the proposed transformer model outperforms the conventional CNN-based models. 
In addition, we qualitatively analyze the effect of the long-range dependency of the transformer model on hint-based image colorization.<\/jats:p>","DOI":"10.3390\/s22197419","type":"journal-article","created":{"date-parts":[[2022,9,29]],"date-time":"2022-09-29T23:09:29Z","timestamp":1664492969000},"page":"7419","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Hint-Based Image Colorization Based on Hierarchical Vision Transformer"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4259-8747","authenticated-orcid":false,"given":"Subin","family":"Lee","sequence":"first","affiliation":[{"name":"School of Computing, Gachon University, Seongnam 13120, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6173-0857","authenticated-orcid":false,"given":"Yong Ju","family":"Jung","sequence":"additional","affiliation":[{"name":"School of Computing, Gachon University, Seongnam 13120, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2022,9,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2897824.2925974","article-title":"Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification","volume":"35","author":"Iizuka","year":"2016","journal-title":"ACM Trans. Graph. (ToG)"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Larsson, G., Maire, M., and Shakhnarovich, G. (2016). Learning representations for automatic colorization. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46493-0_35"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Zhang, R., Isola, P., and Efros, A.A. (2016). Colorful image colorization. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46487-9_40"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Vitoria, P., Raad, L., and Ballester, C. (2020, January 1\u20135). Chromagan: Adversarial picture colorization with semantic class distribution. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093389"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Su, J.W., Chu, H.K., and Huang, J.B. (2020, January 13\u201319). Instance-aware image colorization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00799"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Treneska, S., Zdravevski, E., Pires, I.M., Lameski, P., and Gievska, S. (2022). GAN-Based image colorization for self-supervised visual feature learning. Sensors, 22.","DOI":"10.3390\/s22041599"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhang, R., Zhu, J.Y., Isola, P., Geng, X., Lin, A.S., Yu, T., and Efros, A.A. (2017). Real-time user-guided image colorization with learned deep priors. arXiv.","DOI":"10.1145\/3072959.3073703"},{"key":"ref_8","unstructured":"Ci, Y., Ma, X., Wang, Z., Li, H., and Luo, Z. (2021, January 20\u201324). User-guided deep anime line art colorization with conditional adversarial networks. Proceedings of the 26th ACM International Conference on Multimedia, Virtual Event."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Jang, H.W., and Jung, Y.J. (2020). Deep color transfer for color-plus-mono dual cameras. 
Sensors, 20.","DOI":"10.3390\/s20092743"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"23661","DOI":"10.1364\/OE.27.023661","article-title":"Deep color reconstruction for a sparse color sensor","volume":"27","author":"Sharif","year":"2019","journal-title":"Opt. Express"},{"key":"ref_11","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_12","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. (2021, January 11\u201317). Swinir: Image restoration using swin transformer. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00210"},{"key":"ref_15","unstructured":"Jiang, Y., Chang, S., and Wang, Z. (2021). Transgan: Two transformers can make one strong gan. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_18","unstructured":"Yu, F., and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1111\/cgf.13659","article-title":"Example-Based Colourization Via Dense Encoding Pyramids","volume":"Volume 39","author":"Xiao","year":"2020","journal-title":"Computer Graphics Forum"},{"key":"ref_20","first-page":"12077","article-title":"SegFormer: Simple and efficient design for semantic segmentation with transformers","volume":"34","author":"Xie","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., and Girdhar, R. (2022, January 19\u201324). Masked-attention mask transformer for universal image segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Yuan, Y., Chen, X., Chen, X., and Wang, J. (2019). 
Segmentation transformer: Object-contextual representations for semantic segmentation. arXiv.","DOI":"10.1007\/978-3-030-58539-6_11"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., and Torr, P.H. (2021, January 19\u201325). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_25","first-page":"26183","article-title":"You only look at one sequence: Rethinking transformer in vision through object detection","volume":"34","author":"Fang","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Li, Y., Mao, H., Girshick, R., and He, K. (2022). Exploring plain vision transformer backbones for object detection. arXiv.","DOI":"10.1007\/978-3-031-20077-9_17"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"196","DOI":"10.1016\/j.isprsjprs.2022.06.008","article-title":"UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery","volume":"190","author":"Wang","year":"2022","journal-title":"ISPRS J. Photogramm. Remote. Sens."},{"key":"ref_28","unstructured":"Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., and Wang, M. (2021). Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv."},{"key":"ref_29","unstructured":"Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014, January 8\u201313). Generative adversarial nets. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, USA."},{"key":"ref_30","unstructured":"Kumar, M., Weissenborn, D., and Kalchbrenner, N. (2021). Colorization transformer. arXiv."},{"key":"ref_31","unstructured":"Ho, J., Kalchbrenner, N., Weissenborn, D., and Salimans, T. (2019). Axial attention in multidimensional transformers. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Guadarrama, S., Dahl, R., Bieber, D., Norouzi, M., Shlens, J., and Murphy, K. (2017). Pixcolor: Pixel recursive colorization. arXiv.","DOI":"10.5244\/C.31.112"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"866","DOI":"10.1364\/JOSA.66.000866","article-title":"Proposed extension of the CIE recommendation on \u201cUniform color spaces, color difference equations, and metric color terms\u201d","volume":"66","author":"Pauli","year":"1976","journal-title":"J. Opt. Soc. Am."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"353","DOI":"10.1088\/0031-9112\/18\/10\/010","article-title":"Color science, concepts and methods. Quantitative data and formulas","volume":"18","author":"Wright","year":"1967","journal-title":"Phys. Bull."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1145\/31336.31338","article-title":"An experimental comparison of RGB, YIQ, LAB, HSV, and opponent color models","volume":"6","author":"Schwarz","year":"1987","journal-title":"ACM Trans. 
Graph."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Uddin, S.N., and Jung, Y.J. (2022). SIFNet: Free-form image inpainting using color split-inpaint-fuse approach. Computer Vision and Image Understanding, Elsevier.","DOI":"10.1016\/j.cviu.2022.103446"},{"key":"ref_37","unstructured":"Zhao, Y., Wang, G., Tang, C., Luo, C., Zeng, W., and Zha, Z.J. (2021). A battle of network structures: An empirical study of cnn, transformer, and mlp. arXiv."},{"key":"ref_38","unstructured":"Chu, X., Tian, Z., Zhang, B., Wang, X., Wei, X., Xia, H., and Shen, C. (2021). Conditional positional encodings for vision transformers. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., and Zhang, L. (2021, January 11\u201317). Cvt: Introducing convolutions to vision transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"ref_40","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20138). Imagenet classification with deep convolutional neural networks. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Shi, W., Caballero, J., Husz\u00e1r, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., and Wang, Z. (2016, January 27\u201330). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.207"},{"key":"ref_42","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019, January 8\u201314). Pytorch: An imperative style, high-performance deep learning library. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, USA."},{"key":"ref_43","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"600","DOI":"10.1109\/TIP.2003.819861","article-title":"Image quality assessment: From error visibility to structural similarity","volume":"13","author":"Wang","year":"2004","journal-title":"IEEE Trans. Image Process."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Zhang, R., Isola, P., Efros, A.A., Shechtman, E., and Wang, O. (2018, January 18\u201322). The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00068"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Zhang, B., Gu, S., Zhang, B., Bao, J., Chen, D., Wen, F., Wang, Y., and Guo, B. (2022, January 19\u201324). Styleswin: Transformer-based gan for high-resolution image generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01102"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., and Li, H. (2022, January 19\u201324). Uformer: A general u-shaped transformer for image restoration. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01716"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Uddin, S.N., and Jung, Y.J. (2020). Global and local attention-based free-form image inpainting. Sensors, 20.","DOI":"10.3390\/s20113204"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/19\/7419\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:41:59Z","timestamp":1760143319000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/19\/7419"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,29]]},"references-count":48,"journal-issue":{"issue":"19","published-online":{"date-parts":[[2022,10]]}},"alternative-id":["s22197419"],"URL":"https:\/\/doi.org\/10.3390\/s22197419","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,9,29]]}}}
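A note on the task described in the abstract above: hint-based colorization feeds the network a luminance image plus a sparse set of per-pixel color values and predicts the full chrominance. Below is a minimal, illustrative PyTorch sketch of how such an input is commonly assembled in the CIELAB color space, following the general user-guided formulation of ref_7 (Zhang et al.) rather than this paper's exact pipeline; every function and variable name here is hypothetical.

import torch

def make_colorization_input(lab_image: torch.Tensor, num_hints: int = 10) -> torch.Tensor:
    """Assemble a hint-based colorization input from a CIELAB image.

    lab_image: float tensor of shape (3, H, W) holding the L, a, b channels.
    Returns a (4, H, W) tensor: luminance, sparse ab hints, and a hint mask.
    """
    _, h, w = lab_image.shape
    luminance = lab_image[:1]   # (1, H, W): the grayscale network input
    ab = lab_image[1:]          # (2, H, W): ground-truth chrominance

    # Sample a small set of random pixel locations to act as color hints.
    ys = torch.randint(0, h, (num_hints,))
    xs = torch.randint(0, w, (num_hints,))

    mask = torch.zeros(1, h, w)   # 1 where a hint is given, 0 elsewhere
    hints = torch.zeros(2, h, w)  # ab values revealed only at hint pixels
    mask[0, ys, xs] = 1.0
    hints[:, ys, xs] = ab[:, ys, xs]

    # The colorization network maps this 4-channel input to the full ab
    # channels; the predicted ab is then recombined with L for display.
    return torch.cat([luminance, hints, mask], dim=0)

# Example: a random 256x256 Lab image with 10 hint pixels.
x = make_colorization_input(torch.rand(3, 256, 256), num_hints=10)
print(x.shape)  # torch.Size([4, 256, 256])

Training on randomly sampled hints, as sketched here, is the usual way such models are made robust to arbitrary user clicks at test time; the abstract's claim is that transformer blocks, with their long-range dependencies, propagate these sparse hints across the image more faithfully than the spatially local convolutions of CNNs.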