{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,20]],"date-time":"2026-07-20T11:56:35Z","timestamp":1784548595195,"version":"3.55.0"},"publisher-location":"New York, NY, USA","reference-count":55,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Purple Mountain Laboratories"},{"name":"Jiangsu Provincial Key Laboratory of Network and Information Security","award":["No.BM2003201"],"award-info":[{"award-number":["No.BM2003201"]}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["No.BK20191258"],"award-info":[{"award-number":["No.BK20191258"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Jiangsu Provincial Key Laboratory of Computer Networking Technology"},{"name":"National Key R&D Project of China","award":["No.2021QY2102"],"award-info":[{"award-number":["No.2021QY2102"]}]},{"name":"Key Laboratory of Computer Network and Information Integration of Ministry of Education of China","award":["No.93K-9"],"award-info":[{"award-number":["No.93K-9"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No.62172089, No.61972087, No.62172090, No.62106045"],"award-info":[{"award-number":["No.62172089, No.61972087, No.62172090, 
No.62106045"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548343","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"3598-3607","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":38,"title":["Two-Stream Transformer for Multi-Label Image Classification"],"prefix":"10.1145","author":[{"given":"Xuelin","family":"Zhu","sequence":"first","affiliation":[{"name":"Southeast University, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jiuxin","family":"Cao","sequence":"additional","affiliation":[{"name":"Southeast University & Purple Mountain Laboratories, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jiawei","family":"Ge","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Weijia","family":"Liu","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bo","family":"Liu","sequence":"additional","affiliation":[{"name":"Southeast University & Purple Mountain Laboratories, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. 
arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12230"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12281"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00061"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2019.00113"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00532"},{"key":"e_1_3_2_2_9_1","volume-title":"MlTr: Multi-label Classification with Transformer. arXiv preprint arXiv:2106.06195","author":"Cheng Xing","year":"2021","unstructured":"Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Nian Shi, and Honglin Liu. 2021. MlTr: Multi-label Classification with Transformer. arXiv preprint arXiv:2106.06195 (2021)."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1646396.1646452"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00359"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_2_13_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. 
Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_14_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_2_15_1","volume-title":"Christopher KI Williams, John Winn, and Andrew Zisserman.","author":"Everingham Mark","year":"2010","unstructured":"Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88, 2 (2010), 303--338."},{"key":"e_1_3_2_2_16_1","volume-title":"Multi-label image recognition with multiclass attentional regions. arXiv e-prints","author":"Gao Bin-Bin","year":"2020","unstructured":"Bin-Bin Gao and Hong-Yu Zhou. 2020. Multi-label image recognition with multiclass attentional regions. arXiv e-prints (2020), arXiv--2007."},{"key":"e_1_3_2_2_17_1","unstructured":"Yunchao Gong Yangqing Jia Thomas Leung Alexander Toshev and Sergey Ioffe. 2014. Deep Convolutional Ranking for Multilabel Image Annotation. 
arXiv:1312.4894 [cs.CV]"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11770"},{"key":"e_1_3_2_2_20_1","volume-title":"Long short-term memory. Neural computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780."},{"key":"e_1_3_2_2_21_1","volume-title":"International Conference on Machine Learning. PMLR, 5583--5594","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning. PMLR, 5583--5594."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01621"},{"key":"e_1_3_2_2_23_1","first-page":"1","article-title":"Multi-label Image Classification with A Probabilistic Label Enhancement Model","volume":"1","author":"Li Xin","year":"2014","unstructured":"Xin Li, Feipeng Zhao, and Yuhong Guo. 2014. Multi-label Image Classification with A Probabilistic Label Enhancement Model. In UAI, Vol. 1. 
1--10.","journal-title":"UAI"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46466-4_41"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_2_26_1","volume-title":"Query2label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834","author":"Liu Shilong","year":"2021","unstructured":"Shilong Liu, Lei Zhang, Xiao Yang, Hang Su, and Jun Zhu. 2021. Query2label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834 (2021)."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240567"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_2_29_1","volume-title":"Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101","author":"Loshchilov Ilya","year":"2017","unstructured":"Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)."},{"key":"e_1_3_2_2_30_1","volume-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_2_32_1","volume-title":"Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)."},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00015"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00144"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299097"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00717"},{"key":"e_1_3_2_2_37_1","volume-title":"Theo Gevers, and Arnold WM Smeulders.","author":"Uijlings Jasper RR","year":"2013","unstructured":"Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. 2013. Selective search for object recognition. International journal of computer vision 104, 2 (2013), 154--171."},{"key":"e_1_3_2_2_38_1","unstructured":"
Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.251"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2612829"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6909"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.58"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2913513"},{"key":"e_1_3_2_2_44_1","volume-title":"HCP: A flexible CNN framework for multi-label image classification","author":"Wei Yunchao","year":"2015","unstructured":"Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. 2015. HCP: A flexible CNN framework for multi-label image classification. IEEE transactions on pattern analysis and machine intelligence 38, 9 (2015), 1901--1907."},{"key":"e_1_3_2_2_45_1","unstructured":"Yanan Wu He Liu Songhe Feng Yi Jin Gengyu Lyu and Zizhang Wu. 2021. GM-MLIC: Graph Matching based Multi-Label Image Classification. 
arXiv:2104.14762 [cs.CV]"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3002185"},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.37"},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806375"},{"key":"e_1_3_2_2_49_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 13440--13449","author":"Yazici Vacit Oguz","unstructured":"Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. 2020. Orderless recurrent models for multi-label classification. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
13440--13449."},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58589-1_39"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6964"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2812605"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475191"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.219"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00025"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548343","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548343","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:43Z","timestamp":1750186843000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548343"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":55,"alternative-id":["10.1145\/3503161.3548343","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548343","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}