{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T04:23:39Z","timestamp":1768883019425,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":31,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,14]],"date-time":"2021-08-14T00:00:00Z","timestamp":1628899200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Natural Science Foundation of China","award":["U20B2053, 61872022 and 61421003"],"award-info":[{"award-number":["U20B2053, 61872022 and 61421003"]}]},{"DOI":"10.13039\/501100011347","name":"State Key Laboratory of Software Development Environment","doi-asserted-by":"publisher","award":["SKLSDE-2020ZX-12"],"award-info":[{"award-number":["SKLSDE-2020ZX-12"]}],"id":[{"id":"10.13039\/501100011347","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,14]]},"DOI":"10.1145\/3447548.3467241","type":"proceedings-article","created":{"date-parts":[[2021,8,12]],"date-time":"2021-08-12T06:12:08Z","timestamp":1628748728000},"page":"2378-2388","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Triplet Attention"],"prefix":"10.1145","author":[{"given":"Haoyi","family":"Zhou","sequence":"first","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"given":"Jianxin","family":"Li","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"given":"Jieqi","family":"Peng","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"given":"Shuai","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"given":"Shanghang","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of California, Berkeley, California, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2021,8,14]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"crossref","unstructured":"George B Arfken and Hans J Weber. 1999. Mathematical methods for physicists. George B Arfken and Hans J Weber. 1999. Mathematical methods for physicists.","DOI":"10.1119\/1.19217"},{"key":"e_1_3_2_2_2_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E. Hinton","author":"Ba Lei Jimmy","year":"2016","unstructured":"Lei Jimmy Ba , Jamie Ryan Kiros, and Geoffrey E. Hinton . 2016 . Layer Normalization. CoRR , Vol. abs\/ 1607 .06450 (2016). Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR, Vol. abs\/1607.06450 (2016)."},{"key":"e_1_3_2_2_3_1","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NIPS. Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NIPS."},{"key":"e_1_3_2_2_4_1","volume-title":"David Belanger, Lucy Colwell, and Adrian Weller.","author":"Choromanski Krzysztof","year":"2020","unstructured":"Krzysztof Choromanski , Valerii Likhosherstov , David Dohan , Xingyou Song , Jared Davis , Tam\u00e1 s Sarl\u00f3s , David Belanger, Lucy Colwell, and Adrian Weller. 2020 . Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. CoRR , Vol. abs\/ 2006 .03555 (2020). Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tam\u00e1 s Sarl\u00f3s, David Belanger, Lucy Colwell, and Adrian Weller. 2020. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. CoRR, Vol. abs\/2006.03555 (2020)."},{"key":"e_1_3_2_2_5_1","volume-title":"Manning","author":"Clark Kevin","year":"2019","unstructured":"Kevin Clark , Urvashi Khandelwal , Omer Levy , and Christopher D . Manning . 2019 . What Does BERT Look At? An Analysis of BERT's Attention. CoRR , Vol. abs\/ 1906 .04341 (2019). Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look At? An Analysis of BERT's Attention. CoRR, Vol. abs\/1906.04341 (2019)."},{"key":"e_1_3_2_2_6_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL. 4171--4186.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL. 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL. 4171--4186."},{"key":"e_1_3_2_2_7_1","volume-title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. CoRR","author":"Fedus William","year":"2021","unstructured":"William Fedus , Barret Zoph , and Noam Shazeer . 2021 . Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. CoRR , Vol. abs\/ 2101 .03961 (2021). William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. CoRR, Vol. abs\/2101.03961 (2021)."},{"key":"e_1_3_2_2_8_1","volume-title":"Hauptmann","author":"Huang Po-Yao","year":"2019","unstructured":"Po-Yao Huang , Xiaojun Chang , and Alexander G . Hauptmann . 2019 . Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations. In EMNLP. 1461--1467. Po-Yao Huang, Xiaojun Chang, and Alexander G. Hauptmann. 2019. Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations. In EMNLP. 1461--1467."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"crossref","unstructured":"Vidur Joshi Matthew E. Peters and Mark Hopkins. 2018. Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples. In ACL. 1190--1199. Vidur Joshi Matthew E. Peters and Mark Hopkins. 2018. Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples. In ACL. 1190--1199.","DOI":"10.18653\/v1\/P18-1110"},{"key":"e_1_3_2_2_10_1","volume-title":"ICML","volume":"119","author":"Katharopoulos Angelos","unstructured":"Angelos Katharopoulos , Apoorv Vyas , Nikolaos Pappas , and Francc ois Fleuret . [n.d.]. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention . In ICML , Vol. 119 . 5156--5165. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Francc ois Fleuret. [n.d.]. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In ICML, Vol. 119. 5156--5165."},{"key":"e_1_3_2_2_11_1","volume-title":"Reformer: The Efficient Transformer. In ICLR.","author":"Kitaev Nikita","year":"2020","unstructured":"Nikita Kitaev , Lukasz Kaiser , and Anselm Levskaya . 2020 . Reformer: The Efficient Transformer. In ICLR. Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In ICLR."},{"key":"e_1_3_2_2_12_1","unstructured":"Matth\u00e4us Kleindessner and Ulrike von Luxburg. 2017. Kernel functions based on triplet comparisons. In NIPS. 6807--6817. Matth\u00e4us Kleindessner and Ulrike von Luxburg. 2017. Kernel functions based on triplet comparisons. In NIPS. 6807--6817."},{"key":"e_1_3_2_2_13_1","unstructured":"Jian Li Zhaopeng Tu Baosong Yang Michael R. Lyu and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP. 2897--2903. Jian Li Zhaopeng Tu Baosong Yang Michael R. Lyu and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP. 2897--2903."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"crossref","unstructured":"Jianxin Li Haoyi Zhou Pengtao Xie and Yingchun Zhang. 2017. Improving the Generalization Performance of Multi-class SVM via Angular Regularization. In IJCAI. 2131--2137. Jianxin Li Haoyi Zhou Pengtao Xie and Yingchun Zhang. 2017. Improving the Generalization Performance of Multi-class SVM via Angular Regularization. In IJCAI. 2131--2137.","DOI":"10.24963\/ijcai.2017\/296"},{"key":"e_1_3_2_2_15_1","volume-title":"DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks. CoRR","author":"Lin Zehui","year":"2019","unstructured":"Zehui Lin , Pengfei Liu , Luyao Huang , Junkun Chen , Xipeng Qiu , and Xuanjing Huang . 2019. DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks. CoRR , Vol. abs\/ 1907 .11065 ( 2019 ). Zehui Lin, Pengfei Liu, Luyao Huang, Junkun Chen, Xipeng Qiu, and Xuanjing Huang. 2019. DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks. CoRR, Vol. abs\/1907.11065 (2019)."},{"key":"e_1_3_2_2_16_1","first-page":"726","article-title":"Multilingual denoising pre-training for neural machine translation","volume":"8","author":"Liu Yinhan","year":"2020","unstructured":"Yinhan Liu , Jiatao Gu , Naman Goyal , Xian Li , Sergey Edunov , Marjan Ghazvininejad , Mike Lewis , and Luke Zettlemoyer . 2020 . Multilingual denoising pre-training for neural machine translation . ACL , Vol. 8 (2020), 726 -- 742 . Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. ACL, Vol. 8 (2020), 726--742.","journal-title":"ACL"},{"key":"e_1_3_2_2_17_1","volume-title":"Image Transformer. In ICML","volume":"80","author":"Parmar Niki","year":"2018","unstructured":"Niki Parmar , Ashish Vaswani , Jakob Uszkoreit , Lukasz Kaiser , Noam Shazeer , Alexander Ku , and Dustin Tran . 2018 . Image Transformer. In ICML 2018, Vol. 80 . 4052--4061. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In ICML 2018, Vol. 80. 4052--4061."},{"key":"e_1_3_2_2_18_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018). Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018)."},{"key":"e_1_3_2_2_19_1","volume-title":"Lillicrap","author":"Rae Jack W.","year":"2020","unstructured":"Jack W. Rae , Anna Potapenko , Siddhant M. Jayakumar , Chloe Hillier , and Timothy P . Lillicrap . 2020 . Compressive Transformers for Long-Range Sequence Modelling. In ICLR. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive Transformers for Long-Range Sequence Modelling. In ICLR."},{"key":"e_1_3_2_2_20_1","unstructured":"Ali Rahimi and Benjamin Recht. 2007. Random Features for Large-Scale Kernel Machines. In NIPS. 1177--1184. Ali Rahimi and Benjamin Recht. 2007. Random Features for Large-Scale Kernel Machines. In NIPS. 1177--1184."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Pranav Rajpurkar Jian Zhang Konstantin Lopyrev and Percy Liang. 2016. SQuAD: 100 000 Questions for Machine Comprehension of Text. In EMNLP. 2383--2392. Pranav Rajpurkar Jian Zhang Konstantin Lopyrev and Percy Liang. 2016. SQuAD: 100 000 Questions for Machine Comprehension of Text. In EMNLP. 2383--2392.","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_2_2_22_1","volume-title":"a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh , Lysandre Debut , Julien Chaumond , and Thomas Wolf . 2019. DistilBERT , a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR , Vol. abs\/ 1910 .01108 ( 2019 ). Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, Vol. abs\/1910.01108 (2019)."},{"key":"e_1_3_2_2_23_1","volume-title":"Gordon DA Brown, and Nick Chater","author":"Stewart Neil","year":"2005","unstructured":"Neil Stewart , Gordon DA Brown, and Nick Chater . 2005 . Absolute identification by relative judgment. Psychological review, Vol. 112 , 4 (2005), 881. Neil Stewart, Gordon DA Brown, and Nick Chater. 2005. Absolute identification by relative judgment. Psychological review, Vol. 112, 4 (2005), 881."},{"key":"e_1_3_2_2_24_1","volume-title":"Efficient Transformers: A Survey. CoRR","author":"Tay Yi","year":"2020","unstructured":"Yi Tay , Mostafa Dehghani , Dara Bahri , and Donald Metzler . 2020 . Efficient Transformers: A Survey. CoRR , Vol. abs\/ 2009 .06732 (2020). Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient Transformers: A Survey. CoRR, Vol. abs\/2009.06732 (2020)."},{"key":"e_1_3_2_2_25_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998--6008. Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998--6008."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-3007"},{"key":"e_1_3_2_2_27_1","volume-title":"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In EMNLP. 353--355.","author":"Wang Alex","year":"2018","unstructured":"Alex Wang , Amanpreet Singh , Julian Michael , Felix Hill , Omer Levy , and Samuel Bowman . 2018 . GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In EMNLP. 353--355. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In EMNLP. 353--355."},{"key":"e_1_3_2_2_28_1","volume-title":"Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush.","author":"Wolf Thomas","year":"2020","unstructured":"Thomas Wolf , Lysandre Debut , Victor Sanh , Julien Chaumond , Clement Delangue , Anthony Moi , Pierric Cistac , Tim Rault , R\u00e9mi Louf , Morgan Funtowicz , Joe Davison , Sam Shleifer , Patrick von Platen , Clara Ma , Yacine Jernite , Julien Plu , Canwen Xu , Teven Le Scao , Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020 . Transformers : State-of-the-Art Natural Language Processing. In EMNLP. 38--45. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R\u00e9mi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP. 38--45."},{"key":"e_1_3_2_2_29_1","first-page":"3811","article-title":"Uncorrelation and Evenness: a New Diversity-Promoting Regularizer","volume":"70","author":"Xie Pengtao","year":"2017","unstructured":"Pengtao Xie , Aarti Singh , and Eric P. Xing . 2017 . Uncorrelation and Evenness: a New Diversity-Promoting Regularizer . In ICML , Vol. 70. 3811 -- 3820 . Pengtao Xie, Aarti Singh, and Eric P. Xing. 2017. Uncorrelation and Evenness: a New Diversity-Promoting Regularizer. In ICML, Vol. 70. 3811--3820.","journal-title":"ICML"},{"key":"e_1_3_2_2_30_1","volume-title":"Jiashi Feng, and Shuicheng Yan.","author":"Yuan Li","year":"2021","unstructured":"Li Yuan , Yunpeng Chen , Tao Wang , Weihao Yu , Yujun Shi , Francis EH Tay , Jiashi Feng, and Shuicheng Yan. 2021 . Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. CoRR , Vol. abs\/ 2101 .11986 (2021). Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. CoRR, Vol. abs\/2101.11986 (2021)."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i12.17325"}],"event":{"name":"KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","location":"Virtual Event Singapore","acronym":"KDD '21","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data"]},"container-title":["Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447548.3467241","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3447548.3467241","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:18:28Z","timestamp":1750191508000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447548.3467241"}},"subtitle":["Rethinking the Similarity in Transformers"],"short-title":[],"issued":{"date-parts":[[2021,8,14]]},"references-count":31,"alternative-id":["10.1145\/3447548.3467241","10.1145\/3447548"],"URL":"https:\/\/doi.org\/10.1145\/3447548.3467241","relation":{},"subject":[],"published":{"date-parts":[[2021,8,14]]},"assertion":[{"value":"2021-08-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}