{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:22:03Z","timestamp":1750220523249,"version":"3.41.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2021,3,22]],"date-time":"2021-03-22T00:00:00Z","timestamp":1616371200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Commun. ACM"],"published-print":{"date-parts":[[2021,4]]},"abstract":"<jats:p>Attention, particularly self-attention, is a standard in current NLP literature, but to achieve meaningful models, attention is not enough.<\/jats:p>","DOI":"10.1145\/3430937","type":"journal-article","created":{"date-parts":[[2021,3,22]],"date-time":"2021-03-22T14:36:31Z","timestamp":1616423791000},"page":"154-163","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Transformers aftermath"],"prefix":"10.1145","volume":"64","author":[{"given":"Eduardo Souza Dos","family":"Reis","sequence":"first","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Cristiano Andr\u00e9 Da","family":"Costa","sequence":"additional","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Di\u00f3rgenes Eug\u00eanio Da","family":"Silveira","sequence":"additional","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rodrigo Simon","family":"Bavaresco","sequence":"additional","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rodrigo Da 
Rosa","family":"Righi","sequence":"additional","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jorge Luis Vict\u00f3ria","family":"Barbosa","sequence":"additional","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rodolfo Stoffel","family":"Antunes","sequence":"additional","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"M\u00e1rcio Miguel","family":"Gomes","sequence":"additional","affiliation":[{"name":"Softwarelab, Unisinos, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gustavo","family":"Federizzi","sequence":"additional","affiliation":[{"name":"Dell Inc., Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,3,22]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"NAACL-HLT","author":"Annervaz K.M.","year":"2018","unstructured":"Annervaz , K.M. , Chowdhury , S.B.R. and Dukkipati , A . Learning beyond datasets: Knowledge graph augmented neural networks for natural language processing . NAACL-HLT , 2018 . Annervaz, K.M., Chowdhury, S.B.R. and Dukkipati, A. Learning beyond datasets: Knowledge graph augmented neural networks for natural language processing. NAACL-HLT, 2018."},{"key":"e_1_2_1_2_1","volume-title":"Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs\/1409.0473","author":"Bahdanau D.","year":"2014","unstructured":"Bahdanau , D. , Cho , K. and Bengio Y . Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs\/1409.0473 , 2014 . Bahdanau, D., Cho, K. and Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs\/1409.0473, 2014."},{"key":"e_1_2_1_3_1","volume-title":"et al. Language models are few-shot learners. 
2020","author":"Brown T.B.B.","year":"2020","unstructured":"Brown , T.B.B. et al. Language models are few-shot learners. 2020 ; arXiv: 2005 .14165 (2020). Brown, T.B.B. et al. Language models are few-shot learners. 2020; arXiv:2005.14165 (2020)."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-4012"},{"key":"e_1_2_1_5_1","volume-title":"et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation","author":"Cho K.","year":"2014","unstructured":"Cho , K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation ; arXiv:1406.1078 ( 2014 ). Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation; arXiv:1406.1078 (2014)."},{"key":"e_1_2_1_6_1","volume-title":"Transformer-XL: Attentive language models beyond a fixed-length context. ACL","author":"Dai Z.","year":"2019","unstructured":"Dai , Z. , Yang , Z. , Yang , Y. , Carbonell , J.G. , Le , Q.V. , and Salakhutdinov , R . Transformer-XL: Attentive language models beyond a fixed-length context. ACL ( 2019 ). Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. ACL (2019)."},{"key":"e_1_2_1_7_1","volume-title":"Building dynamic knowledge graphs from text using machine reading comprehension","author":"Das R.","year":"2018","unstructured":"Das , R. , Munkhdalai , T. , Yuan , X. , Trischler , A. and McCallum , A. Building dynamic knowledge graphs from text using machine reading comprehension ; arXiv: 1810 .05682 (2018). Das, R., Munkhdalai, T., Yuan, X., Trischler, A. and McCallum, A. Building dynamic knowledge graphs from text using machine reading comprehension; arXiv:1810.05682 (2018)."},{"key":"e_1_2_1_8_1","volume-title":"Universal transformers","author":"Dehghani M.","year":"2018","unstructured":"Dehghani , M. , Gouws , S. , Vinyals , O. , Uszkoreit , J. 
and Kaiser , L . Universal transformers ; arXiv: 1807 .03819 (2018). Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J. and Kaiser, L. Universal transformers; arXiv:1807.03819 (2018)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305510"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3236009"},{"volume-title":"Proceedings of the 2015 NIPS Deep Learning and Representation Learning Workshop.","author":"Hinton G.","key":"e_1_2_1_12_1","unstructured":"Hinton , G. , Vinyals , O. and Dean , J . Distilling the knowledge in a neural network . In Proceedings of the 2015 NIPS Deep Learning and Representation Learning Workshop. Hinton, G., Vinyals, O. and Dean, J. Distilling the knowledge in a neural network. In Proceedings of the 2015 NIPS Deep Learning and Representation Learning Workshop."},{"key":"e_1_2_1_13_1","volume-title":"NAACL-HLT","author":"Jain S.","year":"2019","unstructured":"Jain , S. and Wallace , B.C . Attention is not explanation . NAACL-HLT , 2019 . Jain, S. and Wallace, B.C. Attention is not explanation. NAACL-HLT, 2019."},{"volume-title":"Proceeding of the Intern. Conf. Learning Representations. (2020)","author":"Lan Z.","key":"e_1_2_1_15_1","unstructured":"Lan , Z. , Chen , M. , Goodman , S. , Gimpel , K. , Sharma , P. , and Soricut , R . ALBERT: A lite BERT for self-supervised learning of language representations . In Proceeding of the Intern. Conf. Learning Representations. (2020) Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceeding of the Intern. Conf. Learning Representations. (2020)"},{"key":"e_1_2_1_16_1","unstructured":"Liang Y. et al. XGLUE: A new benchmark dataset for cross-lingual pre-training understanding and generation. To be published; https:\/\/bit.ly\/3m1OLW7  Liang Y. et al. 
XGLUE: A new benchmark dataset for cross-lingual pre-training understanding and generation. To be published ; https:\/\/bit.ly\/3m1OLW7"},{"key":"e_1_2_1_17_1","volume-title":"Improving multi-task deep neural networks via knowledge distillation for natural language understanding","author":"Liu X.","year":"2019","unstructured":"Liu , X. , He , P. , Chen , W. and Gao , J . Improving multi-task deep neural networks via knowledge distillation for natural language understanding ; arXiv: 1904 .09482 (2019). Liu, X., He, P., Chen, W. and Gao, J. Improving multi-task deep neural networks via knowledge distillation for natural language understanding; arXiv:1904.09482 (2019)."},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Liu X. He P. Chen W. and Gao J. Multi-task deep neural networks for natural language understanding. ACL.2019.  Liu X. He P. Chen W. and Gao J. Multi-task deep neural networks for natural language understanding. ACL .2019.","DOI":"10.18653\/v1\/P19-1441"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1129"},{"key":"e_1_2_1_20_1","volume-title":"et al. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs\/1907.11692","author":"Liu Y.","year":"2019","unstructured":"Liu , Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs\/1907.11692 ( 2019 ). Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs\/1907.11692 (2019)."},{"key":"e_1_2_1_21_1","unstructured":"Lu J. Batra D. Parikh D. and Lee S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks 2019; arXiv:cs.CV\/1908.02265  Lu J. Batra D. Parikh D. and Lee S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks 2019; arXiv:cs.CV\/1908.02265"},{"key":"e_1_2_1_22_1","volume-title":"Efficient estimation of word representations in vector space. 
CoRR abs\/1301.3781","author":"Mikolov T.","year":"2013","unstructured":"Mikolov , T. , Chen , K. , Corrado , G.S. and Dean , J . Efficient estimation of word representations in vector space. CoRR abs\/1301.3781 ( 2013 ). Mikolov, T., Chen, K., Corrado, G.S. and Dean, J. Efficient estimation of word representations in vector space. CoRR abs\/1301.3781 (2013)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1202"},{"key":"e_1_2_1_24_1","volume-title":"Improving language understanding by generative pre-training","author":"Radford A.","year":"2018","unstructured":"Radford , A. , Narasimhan , K. , Salimans , T. and Sutskever , I . Improving language understanding by generative pre-training , 2018 ; https:\/\/s3-us-west-2.amazonaws.com\/openai-assets\/researchcovers\/languageunsupervised\/languageunderstandingpaper (2018). Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. Improving language understanding by generative pre-training, 2018; https:\/\/s3-us-west-2.amazonaws.com\/openai-assets\/researchcovers\/languageunsupervised\/languageunderstandingpaper (2018)."},{"key":"e_1_2_1_25_1","first-page":"8","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford A.","year":"2019","unstructured":"Radford , A. , Wu , J. , Child , R. , Luan , D. , Amodei , D. , and Sutskever , I . 2019. Language models are unsupervised multitask learners . OpenAI Blog 1 , 8 ( 2019 ). Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).","journal-title":"OpenAI Blog"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1487"},{"key":"e_1_2_1_27_1","unstructured":"Sanh V. Debut L. Chaumond J. and Wolf T. 2019. DistilBERT a distilled version of BERT: smaller faster cheaper and lighter; arXiv:1910.01108 (2019).  Sanh V. Debut L. Chaumond J. and Wolf T. 2019. 
DistilBERT a distilled version of BERT: smaller faster cheaper and lighter; arXiv:1910.01108 (2019)."},{"key":"e_1_2_1_28_1","volume-title":"CTRL: A conditional transformer language model for controllable generation","author":"Keskar N.S.","year":"2019","unstructured":"Keskar , N.S. , McCann , B. , Varshney , L.R. , Xiong , C. and Socher , R . CTRL: A conditional transformer language model for controllable generation , 2019 , arXiv:1909.05858. Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C. and Socher, R. CTRL: A conditional transformer language model for controllable generation, 2019, arXiv:1909.05858."},{"key":"e_1_2_1_29_1","volume-title":"Megatron-LM: Training multi-billion parameter language models using model parallelism","author":"Shoeybi M.","year":"2019","unstructured":"Shoeybi , M. , Patwary , M. , Puri , R. , LeGresley , P. , Casper , J. and Catanzaro , B . Megatron-LM: Training multi-billion parameter language models using model parallelism , 2019 ; arXiv:cs.CL\/1909.08053 Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J. and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019; arXiv:cs.CL\/1909.08053"},{"volume-title":"ICML","author":"Song K.","key":"e_1_2_1_30_1","unstructured":"Song , K. , Tan , X. , Qin , T. , Lu , J. and Liu , T-Y. MASS : Masked sequence to sequence pre-training for language generation . ICML , 2019; https:\/\/bit.ly\/3j90xMN Song, K., Tan, X., Qin, T., Lu, J. and Liu, T-Y. MASS: Masked sequence to sequence pre-training for language generation. ICML, 2019; https:\/\/bit.ly\/3j90xMN"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969173"},{"key":"e_1_2_1_32_1","volume-title":"Distilling task-specific knowledge from BERT into simple neural networks. arXiv:1903.12136","author":"Tang T.","year":"2019","unstructured":"Tang , T. , Lu , Y. , Liu , L. , Mou , L. , Vechtomova , O. and Lin , J . 
Distilling task-specific knowledge from BERT into simple neural networks. arXiv:1903.12136 ( 2019 ). Tang, T., Lu, Y., Liu, L., Mou, L., Vechtomova, O. and Lin, J. Distilling task-specific knowledge from BERT into simple neural networks. arXiv:1903.12136 (2019)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1580"},{"key":"e_1_2_1_35_1","volume-title":"et al . SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR abs\/1905.00537","author":"Wang A.","year":"2019","unstructured":"Wang , A. et al . SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR abs\/1905.00537 ( 2019 ). arXiv:1905.00537 http:\/\/arxiv.org\/abs\/1905.00537 Wang, A. et al . SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR abs\/1905.00537 (2019). arXiv:1905.00537 http:\/\/arxiv.org\/abs\/1905.00537"},{"key":"e_1_2_1_36_1","volume-title":"GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR abs\/1804.07461","author":"Wang A.","year":"2018","unstructured":"Wang , A. , Singh , A. , Michael , J. , Hill , F. Levy , O. and Bowman , S.R . GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR abs\/1804.07461 ( 2018 ). Wang, A., Singh, A., Michael, J., Hill, F. Levy, O. and Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR abs\/1804.07461 (2018)."},{"key":"e_1_2_1_37_1","volume-title":"Y et al. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs\/1609.08144","author":"Wu","year":"2016","unstructured":"Wu , Y et al. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs\/1609.08144 ( 2016 ). Wu, Y et al. 
Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs\/1609.08144 (2016)."},{"key":"e_1_2_1_38_1","volume-title":"NeurIPS","author":"Yang Z.","year":"2019","unstructured":"Yang , Z. , Dai , Z. , Yang , Y. , Carbonell , J.G. , Salakhutdinov , R. and Le , Q.V . XLNet: Generalized autoregressive pretraining for language understanding . NeurIPS , 2019 . Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R. and Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. NeurIPS, 2019."},{"volume-title":"Proceedings of the 2019 Intern. Conf. Learning Representations.","author":"You Y.","key":"e_1_2_1_39_1","unstructured":"You , Y. et al. Large batch optimization for deep learning: Training BERT in 76 minutes . In Proceedings of the 2019 Intern. Conf. Learning Representations. You, Y. et al. Large batch optimization for deep learning: Training BERT in 76 minutes. In Proceedings of the 2019 Intern. Conf. Learning Representations."},{"key":"e_1_2_1_40_1","volume-title":"et al. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv:1804.09541","author":"Yu A.W.","year":"2018","unstructured":"Yu , A.W. et al. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv:1804.09541 ( 2018 ). Yu, A.W. et al. Qanet: Combining local convolution with global self-attention for reading comprehension. 
arXiv:1804.09541 (2018)."}],"container-title":["Communications of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3430937","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3430937","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:43Z","timestamp":1750195483000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3430937"}},"subtitle":["current research and rising trends"],"short-title":[],"issued":{"date-parts":[[2021,3,22]]},"references-count":39,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2021,4]]}},"alternative-id":["10.1145\/3430937"],"URL":"https:\/\/doi.org\/10.1145\/3430937","relation":{},"ISSN":["0001-0782","1557-7317"],"issn-type":[{"type":"print","value":"0001-0782"},{"type":"electronic","value":"1557-7317"}],"subject":[],"published":{"date-parts":[[2021,3,22]]},"assertion":[{"value":"2021-03-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}