{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T05:36:10Z","timestamp":1706765770029},"reference-count":28,"publisher":"Institute of Electronics, Information and Communications Engineers (IEICE)","issue":"7","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IEICE Trans. Inf. & Syst."],"published-print":{"date-parts":[[2021,7,1]]},"DOI":"10.1587\/transinf.2020edp7227","type":"journal-article","created":{"date-parts":[[2021,6,30]],"date-time":"2021-06-30T22:20:57Z","timestamp":1625091657000},"page":"941-947","source":"Crossref","is-referenced-by-count":4,"title":["Image Captioning Algorithm Based on Multi-Branch CNN and Bi-LSTM"],"prefix":"10.1587","volume":"E104.D","author":[{"given":"Shan","family":"HE","sequence":"first","affiliation":[{"name":"School of Information Science and Technology, North China University of Technology"}]},{"given":"Yuanyao","family":"LU","sequence":"additional","affiliation":[{"name":"School of Information Science and Technology, North China University of Technology"}]},{"given":"Shengnan","family":"CHEN","sequence":"additional","affiliation":[{"name":"School of Information Science and Technology, North China University of Technology"}]}],"member":"532","reference":[{"key":"1","doi-asserted-by":"crossref","unstructured":"[1] C.G. Harris and M. Stephens, \u201cA combined corner and edge detector,\u201d Proc. Alvey Vision Conference 1988, pp.23.1-23.6, 1988. 10.5244\/c.2.23","DOI":"10.5244\/C.2.23"},{"key":"2","doi-asserted-by":"publisher","unstructured":"[2] N.M. Oliver, B. Rosario, and A.P. Pentland, \u201cA Bayesian computer vision system for modeling human interactions,\u201d IEEE Trans. Pattern Anal. Mach. Intell., vol.22, no.8, pp.831-843, 2000. 10.1109\/34.868684","DOI":"10.1109\/34.868684"},{"key":"3","doi-asserted-by":"crossref","unstructured":"[3] J. Dai, K. He, and J. 
Sun, \u201cInstance-aware semantic segmentation via multi-task network cascades,\u201d 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3150-3158, 2016. 10.1109\/cvpr.2016.343","DOI":"10.1109\/CVPR.2016.343"},{"key":"4","doi-asserted-by":"publisher","unstructured":"[4] R. Fergus, P. Perona, and A. Zisserman, \u201cWeakly supervised scale-invariant learning of models for visual recognition,\u201d Int. J. Comput. Vision., vol.71, no.3, pp.273-303, 2007. 10.1007\/s11263-006-8707-x","DOI":"10.1007\/s11263-006-8707-x"},{"key":"5","doi-asserted-by":"publisher","unstructured":"[5] Y. LeCun, Y. Bengio, and G. Hinton, \u201cDeep learning,\u201d Nature., vol.521, no.7553, pp.436-444, 2015. 10.1038\/nature14539","DOI":"10.1038\/nature14539"},{"key":"6","unstructured":"[6] I. Sutskever, O. Vinyals, and Q.V. Le, \u201cSequence to sequence learning with neural networks,\u201d Advances in neural information processing systems, pp.3104-3112, 2014."},{"key":"7","doi-asserted-by":"crossref","unstructured":"[7] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, \u201cShow and tell: A neural image caption generator,\u201d 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3156-3164, 2015. 10.1109\/cvpr.2015.7298935","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"8","doi-asserted-by":"publisher","unstructured":"[8] E. Cambria and B. White, \u201cJumping NLP curves: A review of natural language processing research,\u201d IEEE Comput. Intell. Mag., vol.9, no.2, pp.48-57, 2014. 10.1109\/mci.2014.2307227","DOI":"10.1109\/MCI.2014.2307227"},{"key":"9","doi-asserted-by":"publisher","unstructured":"[9] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, \u201cEvery Picture Tells a Story: Generating Sentences from Images,\u201d Computer Vision-ECCV 2010, Lecture Notes in Computer Science, vol.6314, pp.15-29, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. 
10.1007\/978-3-642-15561-1_2","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"10","unstructured":"[10] V. Ordonez, G. Kulkarni, and T.L. Berg, \u201cIm2text: Describing images using 1 million captioned photographs,\u201d Advances in neural information processing systems, pp.1143-1151, 2011."},{"key":"11","unstructured":"[11] M. Mitchell, X. Han, J. Dodge, et al., \u201cMidge: Generating image descriptions from computer vision detections,\u201d Proc. 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, pp.747-756, 2012."},{"key":"12","unstructured":"[12] Y. Yang, C.L. Teo, H. Daum\u00e9 III, et al., \u201cCorpus-guided sentence generation of natural images,\u201d Proc. Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp.444-454, 2011. 10.18653\/v1\/d18-1536"},{"key":"13","doi-asserted-by":"crossref","unstructured":"[13] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, \u201cLearning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,\u201d Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.1724-1734, 2014. 10.3115\/v1\/d14-1179","DOI":"10.3115\/v1\/D14-1179"},{"key":"14","unstructured":"[14] R. Kiros, R. Salakhutdinov, and R.S. Zemel, \u201cUnifying visual-semantic embeddings with multimodal neural language models,\u201d NIPS 2014, arXiv:1411.2539, 2014."},{"key":"15","doi-asserted-by":"publisher","unstructured":"[15] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, \u201cShow and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge,\u201d IEEE Trans. Pattern Anal. Mach. Intell., vol.39, no.4, pp.652-663, 2017. 10.1109\/tpami.2016.2587640","DOI":"10.1109\/TPAMI.2016.2587640"},{"key":"16","doi-asserted-by":"crossref","unstructured":"[16] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. 
Venugopalan, T. Darrell, and K. Saenko, \u201cLong-term recurrent convolutional networks for visual recognition and description,\u201d 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.2625-2634, 2015. 10.1109\/cvpr.2015.7298878","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"17","doi-asserted-by":"crossref","unstructured":"[17] A. Karpathy and L. Fei-Fei, \u201cDeep visual-semantic alignments for generating image descriptions,\u201d 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3128-3137, 2015. 10.1109\/cvpr.2015.7298932","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"18","unstructured":"[18] A. Karpathy, A. Joulin, and L.F. Fei-Fei, \u201cDeep fragment embeddings for bidirectional image sentence mapping,\u201d Advances in neural information processing systems, pp.1889-1897, 2014."},{"key":"19","unstructured":"[19] K. Xu, J. Ba, R. Kiros, et al., \u201cShow, attend and tell: Neural image caption generation with visual attention,\u201d International conference on machine learning, pp.2048-2057, 2015."},{"key":"20","doi-asserted-by":"crossref","unstructured":"[20] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, \u201cBLEU: a method for automatic evaluation of machine translation,\u201d Proc. 40th Annual Meeting on Association for Computational Linguistics-ACL '02, pp.311-318, 2001. 10.3115\/1073083.1073135","DOI":"10.3115\/1073083.1073135"},{"key":"21","unstructured":"[21] S. Banerjee and A. Lavie, \u201cMETEOR: An automatic metric for MT evaluation with improved correlation with human judgments,\u201d Proc. ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization, pp.65-72, 2005."},{"key":"22","doi-asserted-by":"crossref","unstructured":"[22] R. Vedantam, C.L. Zitnick, and D. Parikh, \u201cCIDEr: Consensus-based image description evaluation,\u201d 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.4566-4575, 2015. 
10.1109\/cvpr.2015.7299087","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"23","unstructured":"[23] C. Rashtchian, P. Young, M. Hodosh, et al., \u201cCollecting image annotations using Amazon's Mechanical Turk,\u201d Proc. NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010."},{"key":"24","doi-asserted-by":"publisher","unstructured":"[24] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, \u201cFrom image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,\u201d Transactions of the Association for Computational Linguistics, vol.2, pp.67-78, 2014. 10.1162\/tacl_a_00166","DOI":"10.1162\/tacl_a_00166"},{"key":"25","doi-asserted-by":"crossref","unstructured":"[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C.L. Zitnick, \u201cMicrosoft COCO: Common objects in context,\u201d Computer Vision-ECCV 2014, Lecture Notes in Computer Science, vol.8693, pp.740-755, Springer International Publishing, Cham, 2014. 10.1007\/978-3-319-10602-1_48","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"26","unstructured":"[26] J. Mao, W. Xu, Y. Yang, et al., \u201cDeep captioning with multimodal recurrent neural networks (m-rnn),\u201d ICLR 2015, arXiv:1412.6632, 2015."},{"key":"27","doi-asserted-by":"crossref","unstructured":"[27] J. Lu, C. Xiong, D. Parikh, and R. Socher, \u201cKnowing when to look: Adaptive attention via a visual sentinel for image captioning,\u201d 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3242-3250, 2017. 10.1109\/cvpr.2017.345","DOI":"10.1109\/CVPR.2017.345"},{"key":"28","doi-asserted-by":"crossref","unstructured":"[28] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, \u201cSCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,\u201d 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.6298-6306, 2017. 
10.1109\/cvpr.2017.667","DOI":"10.1109\/CVPR.2017.667"}],"container-title":["IEICE Transactions on Information and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E104.D\/7\/E104.D_2020EDP7227\/_pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,7,3]],"date-time":"2021-07-03T06:44:30Z","timestamp":1625294670000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E104.D\/7\/E104.D_2020EDP7227\/_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,1]]},"references-count":28,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2021]]}},"URL":"https:\/\/doi.org\/10.1587\/transinf.2020edp7227","relation":{},"ISSN":["0916-8532","1745-1361"],"issn-type":[{"value":"0916-8532","type":"print"},{"value":"1745-1361","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,1]]},"article-number":"2020EDP7227"}}