{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T23:47:32Z","timestamp":1773186452735,"version":"3.50.1"},"reference-count":51,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2021,7,23]],"date-time":"2021-07-23T00:00:00Z","timestamp":1626998400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>To automatically generate accurate and meaningful textual descriptions of images is an ongoing research challenge. Recently, a lot of progress has been made by adopting multimodal deep learning approaches for integrating vision and language. However, the task of developing image captioning models is most commonly addressed using datasets of natural images, while not many contributions have been made in the domain of artwork images. One of the main reasons for that is the lack of large-scale art datasets of adequate image-text pairs. Another reason is the fact that generating accurate descriptions of artwork images is particularly challenging because descriptions of artworks are more complex and can include multiple levels of interpretation. It is therefore also especially difficult to effectively evaluate generated captions of artwork images. The aim of this work is to address some of those challenges by utilizing a large-scale dataset of artwork images annotated with concepts from the Iconclass classification system. Using this dataset, a captioning model is developed by fine-tuning a transformer-based vision-language pretrained model. Due to the complex relations between image and text pairs in the domain of artwork images, the generated captions are evaluated using several quantitative and qualitative approaches. 
The performance is assessed using standard image captioning metrics and a recently introduced reference-free metric. The quality of the generated captions and the model\u2019s capacity to generalize to new data are explored by applying the model to another art dataset to compare the relation between commonly generated captions and the genre of artworks. The overall results suggest that the model can generate meaningful captions that indicate a stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.<\/jats:p>","DOI":"10.3390\/jimaging7080123","type":"journal-article","created":{"date-parts":[[2021,7,25]],"date-time":"2021-07-25T22:06:21Z","timestamp":1627250781000},"page":"123","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":24,"title":["Towards Generating and Evaluating Iconographic Image Captions of Artworks"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5330-1259","authenticated-orcid":false,"given":"Eva","family":"Cetinic","sequence":"first","affiliation":[{"name":"Rudjer Boskovic Institute, Bijenicka Cesta 54, 10000 Zagreb, Croatia"},{"name":"Department of Computer Science, Durham University, Durham DH1 3LE, UK"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,23]]},"reference":[{"key":"ref_1","first-page":"740","article-title":"Microsoft coco: Common objects in context","volume":"Volume 8693","author":"Lin","year":"2014","journal-title":"Computer Vision\u2014ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6\u201312 September 2014"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1162\/tacl_a_00166","article-title":"From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions","volume":"2","author":"Young","year":"2014","journal-title":"Trans. 
Assoc. Comput. Linguist."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual genome: Connecting language and vision using crowdsourced dense image annotations","volume":"123","author":"Krishna","year":"2017","journal-title":"Int. J. Comput. Vis."},{"key":"ref_4","unstructured":"Panofsky, E. (1972). Studies in Iconology. Humanistic Themes in the Art of the Renaissance, New York, Harper and Row."},{"key":"ref_5","unstructured":"Posthumus, E. (2021, July 20). Brill Iconclass AI Test Set. Available online: https:\/\/labs.brill.com\/ictestset\/."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1017\/S0307472200003436","article-title":"Iconclass: An iconographic classification system","volume":"8","author":"Couprie","year":"1983","journal-title":"Art Libr. J."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J.J., and Gao, J. (2020, January 7\u201312). Unified Vision-Language Pre-Training for Image Captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA. No. 07.","DOI":"10.1609\/aaai.v34i07.7005"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Cetinic, E. (2021, January 10\u201315). Iconographic Image Captioning for Artworks. Proceedings of the ICPR International Workshops and Challenges, Virtual Event, Milan, Italy.","DOI":"10.1007\/978-3-030-68796-0_36"},{"key":"ref_9","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., and Choi, Y. (2021). CLIPScore: A Reference-free Evaluation Metric for Image Captioning. 
arXiv.","DOI":"10.18653\/v1\/2021.emnlp-main.595"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1016\/j.eswa.2018.07.026","article-title":"Fine-tuning convolutional neural networks for fine art classification","volume":"114","author":"Cetinic","year":"2018","journal-title":"Expert Syst. Appl."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"41770","DOI":"10.1109\/ACCESS.2019.2907986","article-title":"Two-stage deep learning approach to the classification of fine-art paintings","volume":"7","author":"Sandoval","year":"2019","journal-title":"IEEE Access"},{"key":"ref_13","unstructured":"Milani, F., and Fraternali, P. (2020). A Data Set and a Convolutional Model for Iconography Classification in Paintings. arXiv."},{"key":"ref_14","first-page":"753","article-title":"Visual link retrieval in a database of paintings","volume":"Volume 9913","author":"Seguin","year":"2016","journal-title":"Proceedings of the Computer Vision (ECCV) 2016"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Mao, H., Cheung, M., and She, J. (2017, January 23\u201327). Deepart: Learning joint representations of visual arts. Proceedings of the 25th ACM international conference on Multimedia, Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123405"},{"key":"ref_16","first-page":"105","article-title":"Towards a tool for visual link retrieval and knowledge discovery in painting datasets","volume":"Volume 1177","author":"Castellano","year":"2020","journal-title":"Digital Libraries: The Era of Big Data and Data Science, Proceedings of the 16th Italian Research Conference on Digital Libraries (IRCDL) 2020, Bari, Italy, 30\u201331 January 2020"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Crowley, E.J., and Zisserman, A. (2014, January 6\u201312). In search of art. Proceedings of the Computer Vision (ECCV) 2014 Workshops, Zurich, Switzerland. 
Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-319-16178-5_4"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3273022","article-title":"Omniart: A large-scale artistic benchmark","volume":"14","author":"Strezoski","year":"2018","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl. (TOMM)"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Madhu, P., Kosti, R., M\u00fchrenberg, L., Bell, P., Maier, A., and Christlein, V. (2019, January 21\u201325). Recognizing Characters in Art History Using Deep Learning. Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia heritAge Contents, Nice, France.","DOI":"10.1145\/3347317.3357242"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Jenicek, T., and Chum, O. (2019, January 20\u201325). Linking Art through Human Poses. Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia.","DOI":"10.1109\/ICDAR.2019.00216"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Shen, X., Efros, A.A., and Aubry, M. (2019, January 16\u201320). Discovering visual patterns in art collections with spatially-consistent feature learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00950"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Deng, Y., Tang, F., Dong, W., Ma, C., Huang, F., Deussen, O., and Xu, C. (2020). Exploring the Representativity of Art Paintings. IEEE Trans. Multimed.","DOI":"10.1109\/TMM.2020.3016887"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.patrec.2019.11.008","article-title":"Learning the Principles of Art History with convolutional neural networks","volume":"129","author":"Cetinic","year":"2020","journal-title":"Pattern Recognit. 
Lett."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Elgammal, A., Liu, B., Kim, D., Elhoseiny, M., and Mazzone, M. (2018, January 2\u20137). The shape of art history in the eyes of the machine. Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11894"},{"key":"ref_25","first-page":"2041669517715474","article-title":"Subjective ratings of beauty and aesthetics: Correlations with statistical image properties in western oil paintings","volume":"8","author":"Lehmann","year":"2017","journal-title":"i-Perception"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"73694","DOI":"10.1109\/ACCESS.2019.2921101","article-title":"A deep learning perspective on beauty, sentiment, and remembrance of art","volume":"7","author":"Cetinic","year":"2019","journal-title":"IEEE Access"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"283","DOI":"10.3390\/heritage3020017","article-title":"Aesthetical Issues of Leonardo Da Vinci\u2019s and Pablo Picasso\u2019s Paintings with Stochastic Evaluation","volume":"3","author":"Sargentis","year":"2020","journal-title":"Heritage"},{"key":"ref_28","unstructured":"Cetinic, E., and She, J. (2021). Understanding and Creating Art with AI: Review and Outlook. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Castellano, G., and Vessio, G. (2021). Deep learning approaches to pattern extraction and recognition in paintings and drawings: An overview. Neural Comput. Appl., 1\u201320.","DOI":"10.1007\/978-3-030-68796-0_35"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1016\/j.patrec.2020.06.018","article-title":"Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage","volume":"138","author":"Fontanella","year":"2020","journal-title":"Pattern Recognit. Lett."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Garcia, N., and Vogiatzis, G. 
(2018, January 8\u201314). How to read paintings: Semantic art understanding with multi-modal retrieval. Proceedings of the European Conference on Computer Vision (ECCV) 2018 Workshops, Munich, Germany. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-030-11012-3_52"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Baraldi, L., Cornia, M., Grana, C., and Cucchiara, R. (2018, January 20\u201324). Aligning text and document illustrations: Towards visually explainable digital humanities. Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China.","DOI":"10.1109\/ICPR.2018.8545064"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Stefanini, M., Cornia, M., Baraldi, L., Corsini, M., and Cucchiara, R. (2019, January 9\u201313). Artpedia: A new visual-semantic dataset with visual and contextual sentences in the artistic domain. Proceedings of the Image Analysis and Processing (ICIAP) 2019, 20th International Conference, Trento, Italy. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-030-30645-8_66"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"166","DOI":"10.1016\/j.patrec.2019.11.018","article-title":"Explaining digital humanities by aligning images and textual descriptions","volume":"129","author":"Cornia","year":"2020","journal-title":"Pattern Recognit. Lett."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Banar, N., Daelemans, W., and Kestemont, M. (2021, January 4\u20136). Multi-modal Label Retrieval for the Visual Arts: The Case of Iconclass. Proceedings of the 13th International Conference on Agents and Artificial Intelligence, (ICAART) 2021, Online Streaming.","DOI":"10.5220\/0010390606220629"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Bongini, P., Becattini, F., Bagdanov, A.D., and Del Bimbo, A. (2020). Visual Question Answering for Cultural Heritage. 
arXiv.","DOI":"10.1088\/1757-899X\/949\/1\/012074"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Garcia, N., Ye, C., Liu, Z., Hu, Q., Otani, M., Chu, C., Nakashima, Y., and Mitamura, T. (2020). A Dataset and Baselines for Visual Question Answering on Art. arXiv.","DOI":"10.1007\/978-3-030-66096-3_8"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Sheng, S., and Moens, M.F. (2019, January 21\u201325). Generating Captions for Images of Ancient Artworks. Proceedings of the 27th ACM International Conference on Multimedia, (MM) 2019, Nice, France.","DOI":"10.1145\/3343031.3350972"},{"key":"ref_39","unstructured":"Gupta, J., Madhu, P., Kosti, R., Bell, P., Maier, A., and Christlein, V. (2020, January 21\u201325). Towards Image Caption Generation for Art Historical Data. Proceedings of the AI Methods for Digital Heritage, Workshop at KI2020 43rd German Conference on Artificial Intelligence, Bamberg, Germany."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_41","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Tan, H., and Bansal, M. (2019). Lxmert: Learning cross-modality encoder representations from transformers. arXiv.","DOI":"10.18653\/v1\/D19-1514"},{"key":"ref_43","unstructured":"Lu, J., Batra, D., Parikh, D., and Lee, S. (2019, January 8\u201314). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Chen, Y.C., Li, L., Yu, L., Kholy, A.E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. (2019). Uniter: Learning universal image-text representations. arXiv.","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"ref_45","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2015, January 7\u201312). Faster r-cnn: Towards real-time object detection with region proposal networks. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018, January 15\u201320). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1238"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 6\u201312). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Denkowski, M., and Lavie, A. (2014, January 26\u201327). Meteor universal: Language specific translation evaluation for any target language. Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA.","DOI":"10.3115\/v1\/W14-3348"},{"key":"ref_49","unstructured":"Lin, C.Y. (2004, January 25\u201326). Rouge: A package for automatic evaluation of summaries. 
Proceedings of the Text Summarization Branches Out, Barcelona, Spain."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015, January 7\u201312). Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Xia, Q., Huang, H., Duan, N., Zhang, D., Ji, L., Sui, Z., Cui, E., Bharti, T., and Zhou, M. (2020). Xgpt: Cross-modal generative pre-training for image captioning. arXiv.","DOI":"10.1007\/978-3-030-88480-2_63"}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/7\/8\/123\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:34:08Z","timestamp":1760164448000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/7\/8\/123"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,23]]},"references-count":51,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2021,8]]}},"alternative-id":["jimaging7080123"],"URL":"https:\/\/doi.org\/10.3390\/jimaging7080123","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,23]]}}}