{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T21:26:46Z","timestamp":1775251606484,"version":"3.50.1"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"12","license":[{"start":{"date-parts":[[2023,6,1]],"date-time":"2023-06-01T00:00:00Z","timestamp":1685577600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,6,1]],"date-time":"2023-06-01T00:00:00Z","timestamp":1685577600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Kafr El Shiekh University"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Multimed Tools Appl"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The interpretation of medical images into a natural language is a developing field of artificial intelligence (AI) called image captioning. This field integrates two branches of artificial intelligence which are computer vision and natural language processing. This is a challenging topic that goes beyond object recognition, segmentation, and classification since it demands an understanding of the relationships between various components in an image and how these objects function as visual representations. The content-based image retrieval (CBIR) uses an image captioning model to generate captions for the user query image. The common architecture of medical image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem. We aim in this paper to build an optimized model for histopathological captions of stomach adenocarcinoma endoscopic biopsy specimens. For the image feature extraction subsystem, we did two evaluations; first, we tested 5 different vision models (VGG, ResNet, PVT, SWIN-Large, and ConvNEXT-Large) using (LSTM, RNN, and bidirectional-RNN) and then compare the vision models with (LSTM-without augmentation, LSTM-with augmentation and BioLinkBERT-Large as an embedding layer-with augmentation) to find the accurate one. Second, we tested 3 different concatenations of pairs of vision models (SWIN-Large, PVT_v2_b5, and ConvNEXT-Large) to get among them the most expressive extracted feature vector of the image. For the caption generation lingual subsystem, we tested a pre-trained language embedding model which is BioLinkBERT-Large compared to LSTM in both evaluations, to select from them the most accurate model. 
Our experiments showed that building a captioning system that uses a concatenation of the two models ConvNEXT-Large and PVT_v2_b5 as an image feature extractor, combined with the BioLinkBERT-Large language embedding model produces the best results among the other combinations.<\/jats:p>","DOI":"10.1007\/s11042-023-15884-y","type":"journal-article","created":{"date-parts":[[2023,6,1]],"date-time":"2023-06-01T11:01:55Z","timestamp":1685617315000},"page":"36645-36664","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Enhanced descriptive captioning model for histopathological patches"],"prefix":"10.1007","volume":"83","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2187-0174","authenticated-orcid":false,"given":"Samar","family":"Elbedwehy","sequence":"first","affiliation":[]},{"given":"T.","family":"Medhat","sequence":"additional","affiliation":[]},{"given":"Taher","family":"Hamza","sequence":"additional","affiliation":[]},{"given":"Mohammed F.","family":"Alrahmawy","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,6,1]]},"reference":[{"key":"15884_CR1","doi-asserted-by":"crossref","unstructured":"Atliha V, \u0160e\u0161ok D (2021) Pretrained word embeddings for image captioning. In: 2021 IEEE Open Conference of Electrical, Electronic and Information Sciences (eStream). IEEE, pp 1\u20134","DOI":"10.1109\/eStream53087.2021.9431465"},{"key":"15884_CR2","unstructured":"Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR, pp 1597\u20131607"},{"key":"15884_CR3","doi-asserted-by":"crossref","unstructured":"Chen B, Li P, Chen X, Wang B, Zhang L, Hua X-S (2022) Dense learning based semi-supervised object detection. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 4815\u20134824","DOI":"10.1109\/CVPR52688.2022.00477"},{"key":"15884_CR4","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"15884_CR5","doi-asserted-by":"crossref","unstructured":"He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 9729\u20139738","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"15884_CR6","unstructured":"Kiros R, Salakhutdinov R, Zemel R (2014b) Unifying visual-semantic embeddings with multi-modal neural language models. ArXiv: 1411.2539"},{"issue":"12","key":"15884_CR7","doi-asserted-by":"publisher","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","volume":"35","author":"G Kulkarni","year":"2013","unstructured":"Kulkarni G, Premraj V, Ordonez V, Dhar S, Li S, Choi Y, Berg AC, Berg T (2013) Babytalk: understanding and generating simple image descriptions. IEEE Trans Pattern Anal Mach Intell 35(12):2891\u20132903","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"15884_CR8","unstructured":"Kuznetsova P, Ordonez V, Berg AC, Berg T, Choi Y (2012) Collective generation of natural image descriptions. In: ACL, vol 1. 
ACL, pp 359\u2013368"},{"key":"15884_CR9","unstructured":"Li S, Kulkarni G, Berg TL, Berg AC, Choi Y (2011) Composing simple image descriptions using web-scale n-grams. In: CoNLL. ACL, pp 220\u2013228"},{"key":"15884_CR10","unstructured":"Lin M, Chen Q, Yan S (2014) Network in network. In: 2nd Int. Conf. Learn. Represent. ICLR 2014 - Conf. Track Proc., pp 1\u201310"},{"key":"15884_CR11","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H et al (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 10012\u201310022","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"15884_CR12","doi-asserted-by":"crossref","unstructured":"Liu Z, Mao H, Wu C-Y, Feichtenhofer C, Darrell T, Xie S (2022) A convnet for the 2020s. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 11976\u201311986","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"15884_CR13","unstructured":"Ma E (2019) NLP augmentation. https:\/\/github.com\/makcedward\/nlpaug"},{"key":"15884_CR14","unstructured":"Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A (2014) Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632"},{"key":"15884_CR15","unstructured":"Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781"},{"key":"15884_CR16","unstructured":"Netron: a visualizer for neural network, deep learning and machine learning models. Retrieved from https:\/\/netron.app\/"},{"key":"15884_CR17","doi-asserted-by":"crossref","unstructured":"Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311\u2013318","DOI":"10.3115\/1073083.1073135"},{"key":"15884_CR18","doi-asserted-by":"crossref","unstructured":"Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532\u20131543","DOI":"10.3115\/v1\/D14-1162"},{"key":"15884_CR19","doi-asserted-by":"crossref","unstructured":"Saad W, Shalaby WA, Shokair M, Abd El-Samie F, Dessouky M, Abdellatef E (2021) COVID-19 classification using deep feature concatenation technique. J Ambient Intell Humaniz Comput:1\u201319","DOI":"10.1007\/s12652-021-02967-7"},{"key":"15884_CR20","doi-asserted-by":"crossref","unstructured":"Shah A, Chavan P, Jadhav D (2022) Convolutional neural network-based image segmentation techniques. In: Soft Computing and Signal Processing: Proceedings of 3rd ICSCSP 2020, Volume 2. Springer Singapore, pp 553\u2013561","DOI":"10.1007\/978-981-16-1249-7_52"},{"key":"15884_CR21","unstructured":"Shin X, Su H, Xing F, Liang Y, Qu G (2016) Interleaved text\/image deep mining on a large-scale radiology database for automated image interpretation. J Mach Learn Res 17:1\u201331. http:\/\/www.jmlr.org\/papers\/volume17\/15-176\/15-176.pdf"},{"key":"15884_CR22","unstructured":"Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556"},{"issue":"2014","key":"15884_CR23","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1162\/tacl_a_00177","volume":"2","author":"R Socher","year":"2014","unstructured":"Socher R, Karpathy A, Le QV, Manning CD, Ng AY (2014) Grounded compositional semantics for finding and describing images with sentences. Trans Assoc Comput Linguist 2(2014):207\u2013218","journal-title":"Trans Assoc Comput Linguist"},{"issue":"14","key":"15884_CR24","doi-asserted-by":"publisher","first-page":"22732","DOI":"10.1364\/OE.430508","volume":"29","author":"J Song","year":"2021","unstructured":"Song J, Zheng Y, Wang J, Ullah MZ, Jiao W (2021) Multicolor image classification using the multimodal information bottleneck network (MMIB-Net) for detecting diabetic retinopathy. Opt Express 29(14):22732\u201322748","journal-title":"Opt Express"},{"key":"15884_CR25","doi-asserted-by":"publisher","unstructured":"Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp 1\u20139. https:\/\/doi.org\/10.1109\/CVPR.2015.7298594","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"15884_CR26","doi-asserted-by":"crossref","unstructured":"Tan M, Pang R, Le QV (2020) Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 10781\u201310790","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"15884_CR27","doi-asserted-by":"crossref","unstructured":"Tarj\u00e1n B, Szasz\u00e1k G, Fegy\u00f3 T, Mihajlik P (2019) Investigation on N-gram approximated RNNLMs for recognition of morphologically rich speech. In: International conference on statistical language and speech processing. Springer, Cham, pp 223\u2013234","DOI":"10.1007\/978-3-030-31372-2_19"},{"key":"15884_CR28","unstructured":"Tsuneki M, Kanavati F (2022) Inference of captions from histopathological patches. arXiv preprint arXiv: 2202.03432"},{"issue":"4","key":"15884_CR29","doi-asserted-by":"publisher","first-page":"652","DOI":"10.1109\/TPAMI.2016.2587640","volume":"39","author":"O Vinyals","year":"2016","unstructured":"Vinyals O, Toshev A, Bengio S, Erhan D (2016) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans Pattern Anal Mach Intell 39(4):652\u2013663","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"15884_CR30","doi-asserted-by":"publisher","first-page":"66680","DOI":"10.1109\/ACCESS.2019.2917979","volume":"7","author":"S Wang","year":"2019","unstructured":"Wang S, Lan L, Zhang X, Dong G, Luo Z (2019) Cascade semantic fusion for image captioning. IEEE Access 7:66680\u201366688","journal-title":"IEEE Access"},{"issue":"3","key":"15884_CR31","doi-asserted-by":"publisher","first-page":"415","DOI":"10.1007\/s41095-022-0274-8","volume":"8","author":"W Wang","year":"2022","unstructured":"Wang W, Xie E, Li X, Fan D-P, Song K, Liang D, Tong Lu, Luo P, Shao L (2022) Pvt v2: improved baselines with pyramid vision transformer. Comput Vis Media 8(3):415\u2013424","journal-title":"Comput Vis Media"},{"key":"15884_CR32","doi-asserted-by":"publisher","unstructured":"Wu L, Wan C, Wu Y, Liu J (2018) Generative caption for diabetic retinopathy images. In: 2017 Int. Conf. Secur. Pattern Anal. Cybern. SPAC 2017, pp 515\u2013519.
https:\/\/doi.org\/10.1109\/SPAC.2017.8304332","DOI":"10.1109\/SPAC.2017.8304332"},{"issue":"11","key":"15884_CR33","doi-asserted-by":"publisher","first-page":"2942","DOI":"10.1109\/TMM.2019.2915033","volume":"21","author":"X Xiao","year":"2019","unstructured":"Xiao X, Wang L, Ding K, Xiang S, Pan C (2019) Deep hierarchical encoder-decoder network for image captioning. IEEE Trans Multimedia 21(11):2942\u20132956","journal-title":"IEEE Trans Multimedia"},{"key":"15884_CR34","unstructured":"Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp 2048\u20132057"},{"key":"15884_CR35","doi-asserted-by":"crossref","unstructured":"Yasunaga M, Leskovec J, Liang P (2022) LinkBERT: pretraining language models with document links. arXiv preprint arXiv: 2203.15827","DOI":"10.18653\/v1\/2022.acl-long.551"},{"key":"15884_CR36","doi-asserted-by":"crossref","unstructured":"You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: CVPR, pp 4651\u20134659","DOI":"10.1109\/CVPR.2016.503"},{"key":"15884_CR37","unstructured":"Yu F, Wang D, Chen Y, Karianakis N, Shen T, Yu P, Lymberopoulos D, Lu S, Shi W, Chen X (2019) Unsupervised domain adaptation for object detection via cross-domain semi-supervised learning. arXiv preprint arXiv:1911.07158"},{"key":"15884_CR38","doi-asserted-by":"publisher","first-page":"2608","DOI":"10.1109\/ACCESS.2019.2962195","volume":"8","author":"Z Yuan","year":"2020","unstructured":"Yuan Z, Li X, Wang Q (2020) Exploring multi-level attention and semantic relationship for remote sensing image captioning. IEEE Access 8:2608\u20132620","journal-title":"IEEE Access"},{"key":"15884_CR39","doi-asserted-by":"publisher","first-page":"18772","DOI":"10.1109\/ACCESS.2019.2896713","volume":"7","author":"J Zakraoui","year":"2019","unstructured":"Zakraoui J, Elloumi S, Alja\u2019am JM, Ben Yahia S (2019) Improving Arabic text to image mapping using a robust machine learning technique. 
IEEE Access 7:18772\u201318782","journal-title":"IEEE Access"}],"container-title":["Multimedia Tools and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-023-15884-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11042-023-15884-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-023-15884-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,2]],"date-time":"2024-04-02T13:11:17Z","timestamp":1712063477000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11042-023-15884-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,1]]},"references-count":39,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["15884"],"URL":"https:\/\/doi.org\/10.1007\/s11042-023-15884-y","relation":{},"ISSN":["1573-7721"],"issn-type":[{"value":"1573-7721","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,1]]},"assertion":[{"value":"6 July 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 April 2023","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 May 2023","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 June 2023","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflicts of interest to report regarding the present study.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}
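Illustrative note (not part of the Crossref record above): the abstract describes fusing two pretrained vision backbones by concatenating their extracted feature vectors and pairing the result with BioLinkBERT-Large word embeddings for caption generation. The Python sketch below shows only this fusion idea, not the authors' implementation. The timm model names "convnext_large" and "pvt_v2_b5", the Hugging Face checkpoint id "michiyasunaga/BioLinkBERT-large", and the use of globally pooled features are assumptions for illustration; the caption decoder itself is omitted.

import torch
import timm
from transformers import AutoModel, AutoTokenizer

# Two vision backbones; num_classes=0 makes timm return pooled feature
# vectors instead of classification logits. Set pretrained=True to load
# ImageNet weights (downloads checkpoints). Model names are assumed to
# exist in the installed timm version.
convnext = timm.create_model("convnext_large", pretrained=False, num_classes=0).eval()
pvt = timm.create_model("pvt_v2_b5", pretrained=False, num_classes=0).eval()

patches = torch.randn(2, 3, 224, 224)  # dummy batch standing in for histopathology patches
with torch.no_grad():
    # Concatenate the two feature vectors into one fused image descriptor,
    # e.g. 1536 (ConvNeXt-Large) + 512 (PVT v2 b5) = 2048 dimensions.
    fused = torch.cat([convnext(patches), pvt(patches)], dim=-1)

# BioLinkBERT-Large as the language embedding side (assumed checkpoint id).
tok = AutoTokenizer.from_pretrained("michiyasunaga/BioLinkBERT-large")
lm = AutoModel.from_pretrained("michiyasunaga/BioLinkBERT-large").eval()
batch = tok(["adenocarcinoma, moderately differentiated"], return_tensors="pt")
with torch.no_grad():
    word_embeddings = lm(**batch).last_hidden_state  # contextual token embeddings

print(fused.shape, word_embeddings.shape)  # fused image vector; embeddings for a decoder

A caption decoder (the paper compares LSTM variants against the BioLinkBERT-Large embedding layer) would then condition on the fused image vector and consume the word embeddings step by step.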