{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:10:41Z","timestamp":1760123441991,"version":"build-2065373602"},"reference-count":51,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2023,2,9]],"date-time":"2023-02-09T00:00:00Z","timestamp":1675900800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Image captioning is the multi-modal task of automatically describing a digital image based on its contents and their semantic relationship. This research area has gained increasing popularity over the past few years; however, most of the previous studies have been focused on purely objective content-based descriptions of the image scenes. In this study, efforts have been made to generate more engaging captions by leveraging human-like emotional responses. To achieve this task, a mean teacher learning-based method has been applied to the recently introduced ArtEmis dataset. ArtEmis is the first large-scale dataset for emotion-centric image captioning, containing 455K emotional descriptions of 80K artworks from WikiArt. This method includes a self-distillation relationship between memory-augmented language models with meshed connectivity. These language models are trained in a cross-entropy phase and then fine-tuned in a self-critical sequence training phase. 
According to various popular natural language processing metrics, such as BLEU, METEOR, ROUGE-L, and CIDEr, our proposed model has obtained a new state of the art on ArtEmis.<\/jats:p>","DOI":"10.3390\/a16020097","type":"journal-article","created":{"date-parts":[[2023,2,10]],"date-time":"2023-02-10T05:51:06Z","timestamp":1676008266000},"page":"97","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Nemesis: Neural Mean Teacher Learning-Based Emotion-Centric Speaker"],"prefix":"10.3390","volume":"16","author":[{"given":"Aryan","family":"Yousefi","sequence":"first","affiliation":[{"name":"School of Engineering and Computer Science, Laurentian University, Sudbury, ON P3E 2C6, Canada"}]},{"given":"Kalpdrum","family":"Passi","sequence":"additional","affiliation":[{"name":"School of Engineering and Computer Science, Laurentian University, Sudbury, ON P3E 2C6, Canada"}]}],"member":"1968","published-online":{"date-parts":[[2023,2,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"539","DOI":"10.1109\/TPAMI.2022.3148210","article-title":"From show to tell: A survey on deep learning-based image captioning","volume":"45","author":"Stefanini","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","unstructured":"Jia-Yu, P., Yang, H.-J., Duygulu, P., and Faloutsos, C. (2004, January 27\u201330). Automatic image captioning. Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No. 04TH8763), Taipei, Taiwan."},{"key":"ref_3","unstructured":"Ordonez, V., Kulkarni, G., and Berg, T. (2011). Im2text: Describing images using 1 million captioned photographs. Adv. Neural Inf. Process. Syst., 24."},{"key":"ref_4","unstructured":"Yang, Y., Teo, C., Daume, H., and Aloimonos, Y. (2011, January 27\u201331). Corpus-guided sentence generation of natural images. 
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Gupta, A., Verma, Y., and Jawahar, C. (2012, January 22\u201326). Choosing linguistics over vision to describe images. Proceedings of the AAAI Conference on Artificial Intelligence, Toronto, ON, Canada.","DOI":"10.1609\/aaai.v26i1.8205"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1485","DOI":"10.1109\/JPROC.2010.2050411","article-title":"I2t: Image parsing to text description","volume":"98","author":"Yao","year":"2010","journal-title":"Proc. IEEE"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1109\/TPAMI.2016.2587640","article-title":"Show and tell: Lessons learned from the 2015 mscoco image captioning challenge","volume":"39","author":"Vinyals","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_9","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 6\u201311). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, PMLR, Lille, France."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Karpathy, A., and Fei-Fei, L. (2015, January 7\u201312). Deep visual-semantic alignments for generating image descriptions. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"ref_11","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Gan, C., Gan, Z., He, X., Gao, J., and Deng, L. (2017, January 21\u201326). Stylenet: Generating attractive visual captions with styles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.108"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Mathews, A., Xie, L., and He, X. (2018, January 18\u201323). Semstyle: Learning to generate stylised image captions using unaligned text. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00896"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Guo, L., Liu, J., Yao, P., Li, J., and Lu, H. (2019, January 15\u201320). Mscap: Multi-style image captioning with unpaired stylized text. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00433"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhao, W., Wu, X., and Zhang, X. (2020, January 7\u201312). Memcap: Memorizing style knowledge for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6998"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Shuster, K., Humeau, S., Hu, H., Bordes, A., and Weston, J. (2019, January 15\u201320). Engaging image captioning via personality. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01280"},{"key":"ref_17","first-page":"1985","article-title":"Sound active attention framework for remote sensing image captioning","volume":"58","author":"Lu","year":"2019","journal-title":"IEEE"},{"key":"ref_18","unstructured":"Wang, B., Dong, G., Zhao, Y., Li, R., Cao, Q., and Chao, Y. (2022). International Conference on Multimedia Modeling, Springer."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Achlioptas, P., Ovsjanikov, M., Haydarov, K., Elhoseiny, M., and Guibas, L.J. (2021, January 20\u201325). Artemis: Affective language for visual art. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01140"},{"key":"ref_20","unstructured":"Laine, S., and Aila, T. (2016). Temporal ensembling for semi-supervised learning. arXiv."},{"key":"ref_21","unstructured":"Tarvainen, A., and Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst., 30."},{"key":"ref_22","unstructured":"Xu, G., Liu, Z., Li, X., and Loy, C.C. (2020). European Conference on Computer Vision, Springer."},{"key":"ref_23","unstructured":"Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv, 2."},{"key":"ref_24","unstructured":"Ba, J., and Caruana, R. (2014, January 8\u201313). Do deep nets really need to be deep? Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1007\/BF00992696","article-title":"Simple statistical gradient-following algorithms for connectionist reinforcement learning","volume":"8","author":"Williams","year":"1992","journal-title":"Mach. 
Learn."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Liu, S., Zhu, Z., Ye, N., Guadarrama, S., and Murphy, K. (2017, January 22\u201329). Improved image captioning via policy gradient optimization of spider. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.100"},{"key":"ref_27","unstructured":"Sutton, R.S., McAllester, D., Singh, S., and Mansour, Y. (December, January 29). Policy gradient methods for reinforcement learning with function approximation. Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA."},{"key":"ref_28","unstructured":"Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015). Sequence level training with recurrent neural networks. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017, January 21\u201326). Self-critical sequence training for image captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.131"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C.L., and Parikh, D. (2015, January 7\u201312). CIDEr: Consensus-based image description evaluation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_31","unstructured":"Yang, L., Shang, S., Liu, Y., Peng, Y., and He, L. (2022). Variational transformer: A framework beyond the trade-off between accuracy and diversity for image captioning. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Yao, T., Pan, Y., Li, Y., Qiu, Z., and Mei, T. (2017, January 22\u201329). Boosting image captioning with attributes. 
Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.524"},{"key":"ref_33","unstructured":"Shen, S., Li, L.H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.-W., Yao, Z., and Keutzer, K. (2022, January 25\u201329). How much can CLIP benefit vision-and-language tasks? In Proceedings of the International Conference on Learning Representations. Virtual."},{"key":"ref_34","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 13\u201314). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_35","unstructured":"Li, J., Li, D., Xiong, C., and Hoi, S. (2022). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv."},{"key":"ref_36","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. 
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_39","unstructured":"Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., and Brendel, W. (2019, January 6\u20139). Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Barraco, M., Stefanini, M., Cornia, M., Cascianelli, S., Baraldi, L., and Cucchiara, R. (2022, January 21\u201325). CaMEL: Mean Teacher Learning for Image Captioning. Proceedings of the International Conference on Pattern Recognition, Montr\u00e9al, QC, Canada.","DOI":"10.1109\/ICPR56361.2022.9955644"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Wang, Y., Albrecht, C.M., and Zhu, X.X. (2022, January 17\u201322). Self-Supervised Vision Transformers for Joint SAR-Optical Representation Learning. Proceedings of the IGARSS 2022\u20132022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia.","DOI":"10.1109\/IGARSS46834.2022.9883983"},{"key":"ref_42","unstructured":"Stefanini, M., Baraldi, L., and Cucchiara, R. (2020, January 13\u201319). Meshed-memory transformer for image captioning. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 7\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Lavie, A., and Agarwal, A. (2007, January 23). 
Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic.","DOI":"10.3115\/1626355.1626389"},{"key":"ref_45","unstructured":"Lin, C.-Y. (2004, January 25\u201326). Rouge: A package for automatic evaluation of summaries. Proceedings of the Text Summarization Branches Out, Barcelona, Spain."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2016, January 7\u201312). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1162"},{"key":"ref_47","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA, USA."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201323). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual genome: Connecting language and vision using crowdsourced dense image annotations","volume":"123","author":"Krishna","year":"2017","journal-title":"Int. J. Comput. Vision"},{"key":"ref_50","unstructured":"Tan, M., and Le, Q.V. (2019, January 9\u201315). Efficientnet: Rethinking model scaling for convolutional neural networks. 
Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA."},{"key":"ref_51","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021, January 3\u20137). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. Proceedings of the International Conference on Learning Representations (ICLR), Virtual."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/2\/97\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:28:23Z","timestamp":1760120903000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/2\/97"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,9]]},"references-count":51,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,2]]}},"alternative-id":["a16020097"],"URL":"https:\/\/doi.org\/10.3390\/a16020097","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2023,2,9]]}}}