{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T00:54:50Z","timestamp":1773968090804,"version":"3.50.1"},"reference-count":53,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2024,6,4]],"date-time":"2024-06-04T00:00:00Z","timestamp":1717459200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"JST CREST Grant","award":["JPMJCR20D3"],"award-info":[{"award-number":["JPMJCR20D3"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>We introduce an emotional stimuli detection task that targets extracting emotional regions that evoke people\u2019s emotions (i.e., emotional stimuli) in artworks. This task offers new challenges to the community because of the diversity of artwork styles and the subjectivity of emotions, which can be a suitable testbed for benchmarking the capability of the current neural networks to deal with human emotion. For this task, we construct a dataset called APOLO for quantifying emotional stimuli detection performance in artworks by crowd-sourcing pixel-level annotation of emotional stimuli. APOLO contains 6781 emotional stimuli in 4718 artworks for validation and testing. 
We also evaluate eight baseline methods, including a dedicated one, to show the difficulties of the task and the limitations of the current techniques through qualitative and quantitative experiments.<\/jats:p>","DOI":"10.3390\/jimaging10060136","type":"journal-article","created":{"date-parts":[[2024,6,4]],"date-time":"2024-06-04T11:48:50Z","timestamp":1717501730000},"page":"136","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Exploring Emotional Stimuli Detection in Artworks: A Benchmark Dataset and Baselines Evaluation"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2544-7744","authenticated-orcid":false,"given":"Tianwei","family":"Chen","sequence":"first","affiliation":[{"name":"Intelligence and Sensing Lab, Osaka University, Suita, Osaka 565-0871, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9200-6359","authenticated-orcid":false,"given":"Noa","family":"Garcia","sequence":"additional","affiliation":[{"name":"Intelligence and Sensing Lab, Osaka University, Suita, Osaka 565-0871, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8879-5957","authenticated-orcid":false,"given":"Liangzhi","family":"Li","sequence":"additional","affiliation":[{"name":"Computer Science Department, Qufu Normal University, Qufu 273165, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8000-3567","authenticated-orcid":false,"given":"Yuta","family":"Nakashima","sequence":"additional","affiliation":[{"name":"Intelligence and Sensing Lab, Osaka University, Suita, Osaka 565-0871, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2024,6,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Wilber, M.J., Fang, C., Jin, H., Hertzmann, A., Collomosse, J.P., and Belongie, S.J. (2017, January 22\u201329). BAM! The Behance Artistic Media Dataset for Recognition Beyond Photography. 
Proceedings of the ICCV, Venice, Italy.","DOI":"10.1109\/ICCV.2017.136"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Bai, Z., Nakashima, Y., and Garc\u00eda, N. (2021, January 11\u201317). Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation. Proceedings of the ICCV, Virtual Event.","DOI":"10.1109\/ICCV48922.2021.00537"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Crowley, E.J., and Zisserman, A. (2014, January 1\u20135). The State of the Art: Object Retrieval in Paintings using Discriminative Regions. Proceedings of the BMVC, Nottingham, UK.","DOI":"10.5244\/C.28.38"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Gonthier, N., Gousseau, Y., Ladjal, S., and Bonfait, O. (2018, January 8\u201314). Weakly supervised object detection in artworks. Proceedings of the ECCV Workshops, Munich, Germany.","DOI":"10.1007\/978-3-030-11012-3_53"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Mensink, T., and van Gemert, J.C. (2014, January 2). The Rijksmuseum Challenge: Museum-Centered Visual Recognition. Proceedings of the ICMR, Dallas, TX, USA.","DOI":"10.1145\/2578726.2578791"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"88:1","DOI":"10.1145\/3273022","article-title":"OmniArt: A Large-scale Artistic Benchmark","volume":"14","author":"Strezoski","year":"2018","journal-title":"ACM Trans. Multim. Comput. Commun. Appl."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Tonkes, V., and Sabatelli, M. (2022, January 23\u201327). How Well Do Vision Transformers (VTs) Transfer to the Non-natural Image Domain? An Empirical Study Involving Art Classification. Proceedings of the ECCV Workshop, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-25056-9_16"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Reshetnikov, A., Marinescu, M.V., and L\u00f3pez, J.M. (2022, January 23\u201327). DEArt: Dataset of European Art. 
Proceedings of the ECCV Workshop, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-25056-9_15"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Garcia, N., and Vogiatzis, G. (2018, January 8\u201314). How to Read Paintings: Semantic Art Understanding with Multi-modal Retrieval. Proceedings of the ECCV Workshops, Munich, Germany.","DOI":"10.1007\/978-3-030-11012-3_52"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Garcia, N., Ye, C., Liu, Z., Hu, Q., Otani, M., Chu, C., Nakashima, Y., and Mitamura, T. (2020, January 23\u201328). A Dataset and Baselines for Visual Question Answering on Art. Proceedings of the ECCV Workshops, Glasgow, UK.","DOI":"10.1007\/978-3-030-66096-3_8"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M., and Guibas, L.J. (2021, January 20\u201325). ArtEmis: Affective Language for Visual Art. Proceedings of the CVPR, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01140"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Mohamed, Y., Khan, F.F., Haydarov, K., and Elhoseiny, M. (2022, January 20\u201325). It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection. Proceedings of the CVPR, Nashville, TN, USA.","DOI":"10.1109\/CVPR52688.2022.02058"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"342","DOI":"10.1037\/1089-2680.9.4.342","article-title":"Emotional Responses to Art: From Collation and Arousal to Cognition and Emotion","volume":"9","author":"Silvia","year":"2005","journal-title":"Rev. Gen. Psychol."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"109","DOI":"10.2190\/EM.27.1.f","article-title":"Opposing Art: Rejection as an Action Tendency of Hostile Aesthetic Emotions","volume":"27","author":"Cooper","year":"2009","journal-title":"Empir. Stud. 
Arts"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"212","DOI":"10.1016\/j.newideapsych.2011.09.003","article-title":"The functional role of emotions in aesthetic judgment","volume":"30","author":"Xenakis","year":"2012","journal-title":"New Ideas Psychol."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1016\/j.newideapsych.2010.04.001","article-title":"A model of art perception, evaluation and emotion in transformative aesthetic experience","volume":"29","author":"Pelowski","year":"2011","journal-title":"New Ideas Psychol."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"7432","DOI":"10.1109\/TIP.2021.3106813","article-title":"Stimuli-Aware Visual Emotion Analysis","volume":"30","author":"Yang","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"8686","DOI":"10.1109\/TIP.2021.3118983","article-title":"SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network","volume":"30","author":"Yang","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Yang, J., She, D., Lai, Y.K., Rosin, P.L., and Yang, M.H. (2018, January 18\u201323). Weakly Supervised Coupled Networks for Visual Sentiment Analysis. Proceedings of the CVPR, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00791"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Xu, L., Wang, Z., Wu, B., and Lui, S.S.Y. (2022, January 18\u201324). MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis. Proceedings of the CVPR, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00926"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yang, J., Huang, Q., Ding, T., Lischinski, D., Cohen-Or, D., and Huang, H. (2023, January 2\u20136). EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes. 
Proceedings of the ICCV, Paris, France.","DOI":"10.1109\/ICCV51070.2023.01864"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022, January 18\u201324). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the CVPR, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"ref_23","unstructured":"Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Shen, X., Efros, A.A., and Aubry, M. (2019, January 16\u201320). Discovering Visual Patterns in Art Collections with Spatially-Consistent Feature Learning. Proceedings of the CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00950"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lin, H., Jia, J., Guo, Q., Xue, Y., Huang, J., Cai, L., and Feng, L. (2014, January 14\u201318). Psychological stress detection from cross-media microblog data using Deep Sparse Neural Network. Proceedings of the ICME, Chengdu, China.","DOI":"10.1109\/ICME.2014.6890213"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wang, X., Zhang, H., Cao, L., and Feng, L. (2020, January 12\u201316). Leverage Social Media for Personalized Stress Detection. Proceedings of the ACM MM, Seattle, WA, USA.","DOI":"10.1145\/3394171.3413596"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Truong, Q.T., and Lauw, H.W. (2017, January 23\u201327). Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN. Proceedings of the ACM MM, Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123374"},{"key":"ref_28","first-page":"305","article-title":"VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis","volume":"33","author":"Truong","year":"2019","journal-title":"Proc. AAAI Conf. Artif. 
Intell."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"6729","DOI":"10.1109\/TPAMI.2021.3094362","article-title":"Affective Image Content Analysis: Two Decades Review and New Perspectives","volume":"44","author":"Zhao","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Peng, K.C., Sadovnik, A., Gallagher, A.C., and Chen, T. (2016, January 25\u201328). Where do emotions come from? Predicting the Emotion Stimuli Map. Proceedings of the ICIP, Phoenix, AZ, USA.","DOI":"10.1109\/ICIP.2016.7532430"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Fan, S., Shen, Z., Jiang, M., Koenig, B.L., Xu, J., Kankanhalli, M., and Zhao, Q. (2018, January 18\u201323). Emotional Attention: A Study of Image Sentiment and Visual Attention. Proceedings of the CVPR, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00785"},{"key":"ref_32","unstructured":"Liu, G., Yan, Y., Ricci, E., Yang, Y., Han, Y., Winkler, S., and Sebe, N. (2015, January 25\u201331). Inferring Painting Style with Multi-Task Dictionary Learning. Proceedings of the IJCAI, Buenos Aires, Argentina."},{"key":"ref_33","unstructured":"Ypsilantis, N.A., Garcia, N., Han, G., Ibrahimi, S., Van Noord, N., and Tolias, G. (2021, January 8\u201314). The met dataset: Instance-level recognition for artworks. Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), San Diego, CA, USA."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Ruta, D., Gilbert, A., Aggarwal, P., Marri, N., Kale, A., Briggs, J., Speed, C., Jin, H., Faieta, B., and Filipkowski, A. (2022, January 23\u201327). StyleBabel: Artistic Style Tagging and Captioning. Proceedings of the ECCV, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-20074-8_13"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Machajdik, J., and Hanbury, A. (2010, January 25\u201329). 
Affective image classification using features inspired by psychology and art theory. Proceedings of the ACM MM, Firenze, Italy.","DOI":"10.1145\/1873951.1873965"},{"key":"ref_36","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the ICML, Virtual."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Kazemzadeh, S., Ordonez, V., Matten, M.A., and Berg, T.L. (2014, January 25\u201329). ReferItGame: Referring to Objects in Photographs of Natural Scenes. Proceedings of the EMNLP, Doha, Qatar.","DOI":"10.3115\/v1\/D14-1086"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., and Murphy, K.P. (2016, January 27\u201330). Generation and Comprehension of Unambiguous Object Descriptions. Proceedings of the CVPR, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.9"},{"key":"ref_39","unstructured":"Lu, J., Batra, D., Parikh, D., and Lee, S. (2019, January 27\u201330). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Proceedings of the NeurIPS, Las Vegas, NV, USA."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Lu, J., Goswami, V., Rohrbach, M., Parikh, D., and Lee, S. (2020, January 13\u201319). 12-in-1: Multi-Task Vision and Language Representation Learning. Proceedings of the CVPR, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01045"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018, January 15\u201320). Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. 
Proceedings of the ACL, Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1238"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. (2021, January 20\u201325). VinVL: Revisiting Visual Representations in Vision-Language Models. Proceedings of the CVPR, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00553"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201323). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proceedings of the CVPR, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"1682","DOI":"10.1109\/TPAMI.2022.3169234","article-title":"Emotional Attention: From Eye Tracking to Computational Modeling","volume":"45","author":"Fan","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs","volume":"40","author":"Chen","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015, January 7\u201312). Deep Residual Learning for Image Recognition. 
Proceedings of the CVPR, Boston, MA, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., and Torr, P.H.S. (2022, January 18\u201324). LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. Proceedings of the CVPR, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01762"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Xu, L., Huang, M.H., Shang, X., Yuan, Z., Sun, Y., and Liu, J. (2023, January 18\u201322). Meta Compositional Referring Expression Segmentation. Proceedings of the CVPR, Vancouver, Canada.","DOI":"10.1109\/CVPR52729.2023.01866"},{"key":"ref_50","unstructured":"Loshchilov, I., and Hutter, F. (2017, January 24\u201326). Decoupled Weight Decay Regularization. Proceedings of the ICLR, Toulon, France."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Wu, Y., Nakashima, Y., and Garcia, N. (2023, January 12\u201315). Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis. Proceedings of the ICMR, Thessaloniki, Greece.","DOI":"10.1145\/3591106.3592262"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Brooks, T., Holynski, A., and Efros, A.A. (2023, January 18\u201322). InstructPix2Pix: Learning to Follow Image Editing Instructions. Proceedings of the CVPR, Vancouver, Canada.","DOI":"10.1109\/CVPR52729.2023.01764"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Tang, R., Liu, L., Pandey, A., Jiang, Z., Yang, G., Kumar, K., Stenetorp, P., Lin, J., and Ture, F. (2023, January 9\u201314). What the DAAM: Interpreting Stable Diffusion Using Cross Attention. 
Proceedings of the ACL, Toronto, ON, Canada.","DOI":"10.18653\/v1\/2023.acl-long.310"}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/10\/6\/136\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:53:39Z","timestamp":1760108019000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/10\/6\/136"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,4]]},"references-count":53,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2024,6]]}},"alternative-id":["jimaging10060136"],"URL":"https:\/\/doi.org\/10.3390\/jimaging10060136","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,4]]}}}