{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T02:38:30Z","timestamp":1774579110424,"version":"3.50.1"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,3,15]],"date-time":"2024-03-15T00:00:00Z","timestamp":1710460800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Research Foundation, Singapore, under its Industry Alignment Fund\u2013Pre-positioning (IAF-PP) Funding Initiative"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers\u2019 interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks (i.e., tag recommendation, relatedness prediction, and API recommendation). The results show that Post2Vec cannot further improve the SOTA techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, and GPT2) and (2) language models built with software engineering related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the \u201cNo Silver Bullet\u201d concept, as none of the models consistently wins against all the others. Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the SOTA performance significantly for all the downstream tasks.<\/jats:p>","DOI":"10.1145\/3635711","type":"journal-article","created":{"date-parts":[[2023,12,7]],"date-time":"2023-12-07T11:56:24Z","timestamp":1701950184000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Representation Learning for Stack Overflow Posts: How Far Are We?"],"prefix":"10.1145","volume":"33","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3370-8585","authenticated-orcid":false,"given":"Junda","family":"He","sequence":"first","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4558-0622","authenticated-orcid":false,"given":"Xin","family":"Zhou","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1006-8493","authenticated-orcid":false,"given":"Bowen","family":"Xu","sequence":"additional","affiliation":[{"name":"North Carolina State University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6001-1372","authenticated-orcid":false,"given":"Ting","family":"Zhang","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4462-6916","authenticated-orcid":false,"given":"Kisub","family":"Kim","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5938-1918","authenticated-orcid":false,"given":"Zhou","family":"Yang","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5566-3819","authenticated-orcid":false,"given":"Ferdian","family":"Thung","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6350-2700","authenticated-orcid":false,"given":"Ivana Clairine","family":"Irsan","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4367-7201","authenticated-orcid":false,"given":"David","family":"Lo","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]}],"member":"320","published-online":{"date-parts":[[2024,3,15]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/SANER.2018.8330213"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","unstructured":"Wasi Uddin Ahmad Saikat Chakraborty Baishakhi Ray and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the NorthAmerican Chapter of the Association for Computational Linguistics: Human Language Technologies Kristina Toutanova Anna Rumshisky Luke Zettlemoyer Dilek Hakkani-T\u00fcr Iz Beltagy Steven Bethard Ryan Cotterell Tanmoy Chakraborty and Yichao Zhou (Eds.). Association for Computational Linguistics 2655\u20132668. DOI:10.18653\/v1\/2021.naacl-main.211","DOI":"10.18653\/v1\/2021.naacl-main.211"},{"key":"e_1_3_2_4_2","volume-title":"Proceedings of the AAAI Reasoning for Complex Question Answering Workshop","author":"Shirani Amirreza","year":"2019","unstructured":"Amirreza Shirani, X. Bowen, L. David, Thamar Solorio, and Amin Alipour. 2019. Question relatedness on Stack Overflow: The task, dataset, and corpus-inspired models. In Proceedings of the AAAI Reasoning for Complex Question Answering Workshop."},{"key":"e_1_3_2_5_2","article-title":"Longformer: The long-document transformer","volume":"2004","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR abs\/2004.05150 (2020). https:\/\/arxiv.org\/abs\/2004.05150","journal-title":"CoRR"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3196321.3196333"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3341174"},{"key":"e_1_3_2_8_2","article-title":"Evaluating large language models trained on code","volume":"2107","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond\u00e9 de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR abs\/2107.03374 (2021). https:\/\/arxiv.org\/abs\/2107.03374","journal-title":"CoRR"},{"key":"e_1_3_2_9_2","first-page":"2552","volume-title":"Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018","author":"Chen Xinyun","year":"2018","unstructured":"Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol\u00f2 Cesa-Bianchi, and Roman Garnett (Eds.). Curran Associates, 2552\u20132562. https:\/\/proceedings.neurips.cc\/paper\/2018\/hash\/d759175de8ea5b1d9a2660e45554894f-Abstract.html"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2021.3128234"},{"key":"e_1_3_2_11_2","volume-title":"Proceedings of the 8th International Conference on Learning Representations (ICLR\u201920)","author":"Clark Kevin","year":"2020","unstructured":"Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR\u201920). https:\/\/openreview.net\/forum?id=r1xMH1BtvB"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.4324\/9781315806730"},{"key":"e_1_3_2_13_2","volume-title":"Practical Nonparametric Statistics","author":"Conover William Jay","year":"1999","unstructured":"William Jay Conover. 1999. Practical Nonparametric Statistics. Vol. 350. John Wiley & Sons."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n19-1423"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","unstructured":"Zhangyin Feng Daya Guo Duyu Tang Nan Duan Xiaocheng Feng Ming Gong Linjun Shou Bing Qin Ting Liu Daxin Jiang and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020 Trevor Cohn Yulan He and Yang Liu (Eds.). Vol. EMNLP 2020. Association for Computational Linguistics 1536\u20131547. DOI:10.18653\/v1\/2020.findings-emnlp.139","DOI":"10.18653\/v1\/2020.findings-emnlp.139"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2950290.2950334"},{"key":"e_1_3_2_17_2","volume-title":"Proceedings of the 9th International Conference on Learning Representations (ICLR\u201921)","author":"Guo Daya","year":"2021","unstructured":"Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training code representations with data flow. In Proceedings of the 9th International Conference on Learning Representations (ICLR\u201921). https:\/\/openreview.net\/forum?id=jLoC4ez43PZ"},{"key":"e_1_3_2_18_2","article-title":"Don\u2019t stop pretraining: Adapt language models to domains and tasks","author":"Gururangan Suchin","year":"2020","unstructured":"Suchin Gururangan, Ana Marasovi\u0107, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don\u2019t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020).","journal-title":"arXiv preprint arXiv:2004.10964"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2203.10965"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_21_2","article-title":"API method recommendation without worrying about the task-API knowledge gap","author":"Huang Qiao","year":"2018","unstructured":"Qiao Huang, Xin Xia, Zhenchang Xing, D. Lo, and Xinyu Wang. 2018. API method recommendation without worrying about the task-API knowledge gap. In Proceedings of the 2018 33rd IEEE\/ACM International Conference on Automated Software Engineering (ASE\u201918). 293\u2013304.","journal-title":"Proceedings of the 2018 33rd IEEE\/ACM International Conference on Automated Software Engineering (ASE\u201918)."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3238147.3238191"},{"key":"e_1_3_2_23_2","article-title":"CodeSearchNet challenge: Evaluating the state of semantic code search","volume":"1909","author":"Husain Hamel","year":"2019","unstructured":"Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. CoRR abs\/1909.09436 (2019). http:\/\/arxiv.org\/abs\/1909.09436","journal-title":"CoRR"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.482"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14539"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2020.110783"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00040"},{"key":"e_1_3_2_29_2","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","volume":"1907","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs\/1907.11692 (2019). http:\/\/arxiv.org\/abs\/1907.11692","journal-title":"CoRR"},{"key":"e_1_3_2_30_2","article-title":"CodeXGLUE: A machine learning benchmark dataset for code understanding and generation","author":"Lu Shuai","year":"2021","unstructured":"Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).","journal-title":"arXiv preprint arXiv:2102.04664"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3511561"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00041"},{"key":"e_1_3_2_33_2","unstructured":"Erik Nijkamp Bo Pang Hiroaki Hayashi Lifu Tu Huan Wang Yingbo Zhou Silvio Savarese and Caiming Xiong. 2023. CodeGen: An open large language model for code with multi-turn program synthesis. In Proceedings of the 11th International Conference on Learning Representations (ICLR\u201923)."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSR52588.2021.00023"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n18-1202"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSME.2014.90"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331341"},{"key":"e_1_3_2_38_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Preprint."},{"issue":"8","key":"e_1_3_2_39_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.5555\/3455716.3455856"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2019.00024"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2020.106367"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n19-1035"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.443"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1374"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340544"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510621"},{"key":"e_1_3_2_49_2","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). Curran Associates, 5998\u20136008. https:\/\/proceedings.neurips.cc\/paper\/2017\/hash\/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2022.3178469"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.685"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2017\/423"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510159"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-017-9514-4"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2021.3093761"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3239235.3240503"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2017.8115681"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/SANER53432.2022.00054"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSME46990.2020.00017"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2019.01.002"}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3635711","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3635711","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:56:59Z","timestamp":1750291019000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3635711"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,15]]},"references-count":59,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3635711"],"URL":"https:\/\/doi.org\/10.1145\/3635711","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,15]]},"assertion":[{"value":"2023-03-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-10-30","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}