{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,4]],"date-time":"2026-06-04T23:58:28Z","timestamp":1780617508682,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":81,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Natural Science Foundation of China","award":["61825601"],"award-info":[{"award-number":["61825601"]}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["BK20192004B"],"award-info":[{"award-number":["BK20192004B"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475292","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T17:45:27Z","timestamp":1634579127000},"page":"1553-1561","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":163,"title":["Former-DFER: Dynamic Facial Expression Recognition Transformer"],"prefix":"10.1145","author":[{"given":"Zengqun","family":"Zhao","sequence":"first","affiliation":[{"name":"Nanjing University of Information Science & Technology, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qingshan","family":"Liu","sequence":"additional","affiliation":[{"name":"Nanjing University of Information Science & Technology, Nanjing, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Masih Aminbeidokhti Marco Pedersoli Patrick Cardinal and Eric Granger. 2019. Emotion recognition with spatial attention and temporal softmax pooling. In ICIAR. 323--331.","DOI":"10.1007\/978-3-030-27202-9_29"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10639-019-10004-6"},{"key":"e_1_3_2_1_3_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Wissam J Baddar and Yong Man Ro. 2019. Mode variational lstm robust to unseen modes of variation: Application to facial expression recognition. In AAAI. 3215--3223.","DOI":"10.1609\/aaai.v33i01.33013215"},{"key":"e_1_3_2_1_5_1","volume-title":"Zhiyuan Li, James O'Reilly, Shizhong Han, Ping Liu, Min Chen, and Yan Tong.","author":"Cai Jie","year":"2019"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Nicolas Carion Francisco Massa Gabriel Synnaeve Nicolas Usunier Alexander Kirillov and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV. 213--229.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis action recognition? 
a new model and the kinetics dataset. In CVPR. 6299--6308.","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Joyati Chattopadhyay Souvik Kundu Arpita Chakraborty and Jyoti Sekhar Banerjee. 2018. Facial expression recognition for human computer interaction. In ICCVBIC. 1181--1192.","DOI":"10.1007\/978-3-030-41862-5_119"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2663204.2666277"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2016.2593719"},{"key":"e_1_3_2_1_11_1","volume-title":"Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555","author":"Chung Junyoung","year":"2014"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2515606"},{"key":"e_1_3_2_1_13_1","volume-title":"The expression of the emotions in man and animals","author":"Darwin Charles"},{"key":"e_1_3_2_1_14_1","volume-title":"Retinaface: Single-shot multi-level face localisation in the wild. In CVPR. 5203--5212.","author":"Deng Jiankang","year":"2020"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3340555.3355710"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2522848.2531739"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2012.26"},{"key":"e_1_3_2_1_18_1","unstructured":"
Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2002.801449"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818346.2830596"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242969.3264978"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2993148.2997632"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Rohit Girdhar Joao Carreira Carl Doersch and Andrew Zisserman. 2019. Video action transformer network. In CVPR. 244--253.","DOI":"10.1109\/CVPR.2019.00033"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","unstructured":"Kensho Hara Hirokatsu Kataoka and Yutaka Satoh. 2018. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In CVPR. 6546--6555.","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_3_2_1_25_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 
770--778."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_1_27_1","volume-title":"RFAU: A Database for Facial Action Unit Analysis in Real Classrooms","author":"Hu Qiaoping","year":"2020"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2663204.2666278"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Ramin Irani Kamal Nasrollahi Marc O Simon Ciprian A Corneanu Sergio Escalera Chris Bahnsen Dennis H Lundtoft Thomas B Moeslund Tanja L Pedersen Maria-Louise Klitgaard et al. 2015. Spatiotemporal analysis of RGB-DT facial images for multimodal pain level recognition. In CVPRW. 88--95.","DOI":"10.1109\/CVPRW.2015.7301341"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413620"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3051462"},{"key":"e_1_3_2_1_32_1","volume-title":"Fahad Shahbaz Khan, and Mubarak Shah.","author":"Khan Salman","year":"2021"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"crossref","unstructured":"Jean Kossaifi Antoine Toisoul Adrian Bulat Yannis Panagakis Timothy M Hospedales and Maja Pantic. 2020. Factorized higher-order CNNs with an application to spatio-temporal emotion estimation. In CVPR. 6060--6069.","DOI":"10.1109\/CVPR42600.2020.00610"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Felix Kuhnke Lars Rumberg and J\u00f6rn Ostermann. 2020. 
Two-Stream Aural-Visual Affect Analysis in the Wild. In FG. 366--371.","DOI":"10.1109\/FG47880.2020.00056"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Vikas Kumar Shivansh Rao and Li Yu. 2020. Noisy Student Training using Body Language Dataset Improves Facial Expression Recognition. In ECCV. 756--773.","DOI":"10.1007\/978-3-030-66415-2_53"},{"key":"e_1_3_2_1_36_1","unstructured":"Jiyoung Lee Seungryong Kim Sunok Kim Jungin Park and Kwanghoon Sohn. 2019. Context-aware emotion recognition networks. In ICCV. 10143--10152."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2996086"},{"key":"e_1_3_2_1_38_1","volume-title":"Dae Ha Kim, and Byung Cheol Song.","author":"Lee Min Kyu","year":"2019"},{"key":"e_1_3_2_1_39_1","unstructured":"Beibin Li Sachin Mehta Deepali Aneja Claire Foster Pamela Ventola Frederick Shic and Linda Shapiro. 2019. A facial affect analysis system for autism spectrum disorder. In ICIP. 4549--4553."},{"key":"e_1_3_2_1_40_1","volume-title":"Deep facial expression recognition: A survey","author":"Li Shan","year":"2020"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3340555.3355719"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2886767"},{"key":"e_1_3_2_1_43_1","unstructured":"Daizong Liu Hongting Zhang and Pan Zhou. 2020. 
Video-based Facial Expression Recognition using Graph Convolutional Networks. In ICPR."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.226"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2663204.2666274"},{"key":"e_1_3_2_1_46_1","volume-title":"Graph-based Facial Affect Analysis: A Review of Methods, Applications and Challenges. arXiv preprint arXiv:2103.15599","author":"Liu Yang","year":"2021"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242969.3264992"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Patrick Lucey Jeffrey F Cohn Takeo Kanade Jason Saragih Zara Ambadar and Iain Matthews. 2010. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In CVPRW. 94--101.","DOI":"10.1109\/CVPRW.2010.5543262"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"crossref","unstructured":"Debin Meng Xiaojiang Peng Kai Wang and Yu Qiao. 2019. Frame attention networks for facial expression recognition in videos. In ICIP. 3866--3870.","DOI":"10.1109\/ICIP.2019.8803603"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Trisha Mittal Uttaran Bhattacharya Rohan Chandra Aniket Bera and Dinesh Manocha. 2020. 
M3er: Multiplicative multimodal emotion recognition using facial textual and speech cues. In AAAI. 1359--1367.","DOI":"10.1609\/aaai.v34i02.5492"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2017.2713783"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3136755.3143012"},{"key":"e_1_3_2_1_53_1","unstructured":"Maja Pantic Michel Valstar Ron Rademaker and Ludo Maat. 2005. Web-based database for facial expression analysis. In ICME."},{"key":"e_1_3_2_1_54_1","volume-title":"BAM: Bottleneck Attention Module. In BMVC.","author":"Park Jongchan","year":"2018"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3455008"},{"key":"e_1_3_2_1_56_1","unstructured":"Zhaofan Qiu Ting Yao and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV. 5533--5541."},{"key":"e_1_3_2_1_57_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014"},{"key":"e_1_3_2_1_58_1","volume-title":"Carl Vondrick, Kevin Murphy, and Cordelia Schmid.","author":"Sun Chen","year":"2019"},{"key":"e_1_3_2_1_59_1","volume-title":"Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877","author":"Touvron Hugo","year":"2020"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"crossref","unstructured":"
Du Tran Heng Wang Lorenzo Torresani Jamie Ray Yann LeCun and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR. 6450--6459.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_2_1_62_1","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"der Maaten Laurens Van","year":"2008","journal-title":"JMLR"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3136755.3143011"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2020.3007531"},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3340555.3355720"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.439"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"crossref","unstructured":"Yandong Wen Kaipeng Zhang Zhifeng Li and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In ECCV. 499--515.","DOI":"10.1007\/978-3-319-46478-7_31"},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"crossref","unstructured":"Torsten Wilhelm. 2019. Towards facial expression analysis in a driver assistance system. In FG. 1--4.","DOI":"10.1109\/FG.2019.8756565"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"crossref","unstructured":"Fuzhi Yang Huan Yang Jianlong Fu Hongtao Lu and Baining Guo. 2020. Learning texture transformer network for image super-resolution. In CVPR. 
5791--5800.","DOI":"10.1109\/CVPR42600.2020.00583"},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"crossref","unstructured":"Peng Yang Qingshan Liu Xinyi Cui and Dimitris N Metaxas. 2008. Facial expression recognition using encoded dynamic features. In CVPR. 1--8.","DOI":"10.1109\/CVPR.2008.4587717"},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2008.03.014"},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2018.07.028"},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"crossref","unstructured":"Stefanos Zafeiriou Dimitrios Kollias Mihalis A Nicolaou Athanasios Papaioannou Guoying Zhao and Irene Kotsia. 2017. Aff-Wild: valence and arousal 'in-the-wild' challenge. In CVPRW. 34--41.","DOI":"10.1109\/CVPRW.2017.248"},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"crossref","unstructured":"Yuan-Hang Zhang Rulin Huang Jiabei Zeng and Shiguang Shan. 2020. M3F: Multi-Modal Continuous Valence-Arousal Estimation in the Wild. In FG. 
617--621.","DOI":"10.1109\/FG47880.2020.00098"},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2011.07.002"},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3093397"},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"crossref","first-page":"3510","DOI":"10.1609\/aaai.v35i4.16465","article-title":"Robust Lightweight Facial Expression Recognition Network with Label Distribution Training","volume":"35","author":"Zhao Zengqun","year":"2021","journal-title":"AAAI"},{"key":"e_1_3_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2013.173"},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2011.5995345"},{"key":"e_1_3_2_1_81_1","unstructured":"Xizhou Zhu Weijie Su Lewei Lu Bin Li Xiaogang Wang and Jifeng Dai. 2021. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In ICLR. 
1--16."}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475292","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475292","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:49:17Z","timestamp":1750193357000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475292"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":81,"alternative-id":["10.1145\/3474085.3475292","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475292","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}