{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T15:05:40Z","timestamp":1771599940117,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":57,"publisher":"ACM","license":[{"start":{"date-parts":[[2018,10,15]],"date-time":"2018-10-15T00:00:00Z","timestamp":1539561600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Snap"},{"name":"State University of New York at Buffalo"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2018,10,15]]},"DOI":"10.1145\/3240508.3240667","type":"proceedings-article","created":{"date-parts":[[2018,10,18]],"date-time":"2018-10-18T17:52:08Z","timestamp":1539885128000},"page":"1425-1434","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":60,"title":["SibNet"],"prefix":"10.1145","author":[{"given":"Sheng","family":"Liu","sequence":"first","affiliation":[{"name":"State University of New York at Buffalo, Buffalo, NY, USA"}]},{"given":"Zhou","family":"Ren","sequence":"additional","affiliation":[{"name":"Snap Research, Los Angeles, CA, USA"}]},{"given":"Junsong","family":"Yuan","sequence":"additional","affiliation":[{"name":"State University of New York at Buffalo, Buffalo, NY, USA"}]}],"member":"320","published-online":{"date-parts":[[2018,10,15]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR) .","author":"Bahdanau Dzmitry","year":"2015","unstructured":"Dzmitry Bahdanau , Kyunghyun Cho , and Yoshua Bengio . 2015 . Neural machine translation by jointly learning to align and translate . In Proceedings of the International Conference on Learning Representations (ICLR) . 
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_2_1","volume-title":"Proceedings of International Conference on Learning Representations (ICLR) .","author":"Ballas Nicolas","year":"2015","unstructured":"Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. 2015. Delving deeper into convolutional networks for learning video representations. In Proceedings of International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_3_1","volume-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 190--200","author":"Chen David L","year":"2011","unstructured":"David L Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 190--200."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123275"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123420"},{"key":"e_1_3_2_1_6_1","volume-title":"Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325","author":"Chen Xinlei","year":"2015","unstructured":"Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll\u00e1r, and C Lawrence Zitnick. 2015. 
Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1001\/jama.297.12.1344"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984064"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.127"},{"key":"e_1_3_2_1_10_1","volume-title":"Proceedings of the International Conference on Artificial Intelligence and Statistics . 249--256","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics. 249--256."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2967242"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Chiori Hori, Takaaki Hori, and Teng-Yok Lee. 2017. Attention-based multimodal fusion for video description. (2017).","DOI":"10.1109\/ICCV.2017.450"},{"key":"e_1_3_2_1_14_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) .","author":"Huang Gao","unstructured":"Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. 2017. 
Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984065"},{"key":"e_1_3_2_1_16_1","volume-title":"International Conference on Learning Representations (ICLR)","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (2014)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123366"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806314"},{"key":"e_1_3_2_1_19_1","volume-title":"Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004)."},{"key":"e_1_3_2_1_20_1","volume-title":"International Conference on Learning Representations (ICLR) .","author":"Lin Zhouhan","year":"2017","unstructured":"Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. 
A structured self-attentive sentence embedding. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2967298"},{"key":"e_1_3_2_1_22_1","first-page":"1","article-title":"Mel frequency cepstral coefficients for music modeling","volume":"270","author":"Beth Logan","year":"2000","unstructured":"Beth Logan et al. 2000. Mel frequency cepstral coefficients for music modeling. In ISMIR, Vol. 270. 1--11.","journal-title":"ISMIR"},{"key":"e_1_3_2_1_23_1","unstructured":"Tao Mei, Yong Rui, Xinmei Tian, and Ting Yao. 2017. MSR-VTT Challenge. http:\/\/ms-multimedia-challenge.com\/2017\/challenge (2017)."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/3104322.3104425"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.117"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.497"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.111"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1117"},{"key":"e_1_3_2_1_30_1","volume-title":"Two-stream Collaborative Learning with Spatial-Temporal Attention for Video Classification","author":"Peng Yuxin","year":"2018","unstructured":"Yuxin Peng, Yunzhen Zhao, and Junchao Zhang. 2018. 
Two-stream Collaborative Learning with Spatial-Temporal Attention for Video Classification. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) (2018)."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984066"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2967212"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5244\/C.31.89"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.128"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.548"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984062"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2670313"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123354"},{"key":"e_1_3_2_1_42_1","volume-title":"Proceedings of Association for Computational Linguistics (ACL). ACL, 384--394","author":"Turian Joseph","year":"2010","unstructured":"Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of Association for Computational Linguistics (ACL). 
ACL, 384--394."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1204"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"crossref","unstructured":"Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2014. Translating videos to natural language using deep recurrent neural networks. North American Chapter of the Association for Computational Linguistics (NAACL).","DOI":"10.3115\/v1\/N15-1173"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00795"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964299"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123448"},{"key":"e_1_3_2_1_52_1","volume-title":"International Conference on Machine Learning (ICML). 2048--2057","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. 
In International Conference on Machine Learning (ICML). 2048--2057."},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.496"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.347"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.147"}],"event":{"name":"MM '18: ACM Multimedia Conference","location":"Seoul Republic of Korea","acronym":"MM '18","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 26th ACM international conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3240508.3240667","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3240508.3240667","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:43:31Z","timestamp":1750207411000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3240508.3240667"}},"subtitle":["Sibling Convolutional Encoder for Video Captioning"],"short-title":[],"issued":{"date-parts":[[2018,10,15]]},"references-count":57,"alternative-id":["10.1145\/3240508.3240667","10.1145\/3240508"],"URL":"https:\/\/doi.org\/10.1145\/3240508.3240667","relation":{},"subject":[],"published":{"date-parts":[[2018,10,15]]},"assertion":[{"value":"2018-10-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}