{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T03:31:42Z","timestamp":1762918302028,"version":"build-2065373602"},"reference-count":53,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2022,2,3]],"date-time":"2022-02-03T00:00:00Z","timestamp":1643846400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Context: In a social coding platform such as GitHub, a pull-request mechanism is frequently used by contributors to submit their code changes to reviewers of a given repository. In general, these code changes are either to add a new feature or to fix an existing bug. However, this mechanism is distributed and allows different contributors to submit unintentionally similar pull-requests that perform similar development activities. Similar pull-requests may be submitted to review in parallel time by different reviewers. This will cause redundant reviewing time and efforts. Moreover, it will complicate the collaboration process. Objective: Therefore, it is useful to assign similar pull-requests to the same reviewer to be able to decide which pull-request to choose in effective time and effort. In this article, we propose to group similar pull-requests together into clusters so that each cluster is assigned to the same reviewer or the same reviewing team. This proposal allows saving reviewing efforts and time. Method: To do so, we first extract descriptive textual information from pull-requests content to link similar pull-requests together. Then, we employ the extracted information to find similarities among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomeration hierarchical clustering algorithms) are used to group similar pull-requests together. 
Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results according to the well-known metrics in this field: precision and recall. Furthermore, it helps to save reviewers' time and effort. Conclusion: According to the obtained results, the K-Means algorithm achieves 94% and 91% average precision and recall values over all considered repositories, respectively, while agglomeration hierarchical clustering achieves 93% and 98% average precision and recall values over all considered repositories, respectively. Moreover, the proposed approach saves, on average, between 67% and 91% of reviewing time and effort with the K-Means algorithm and between 67% and 83% with the agglomeration hierarchical clustering algorithm.<\/jats:p>","DOI":"10.3390\/info13020073","type":"journal-article","created":{"date-parts":[[2022,2,4]],"date-time":"2022-02-04T11:35:17Z","timestamp":1643974517000},"page":"73","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Automatic Identification of Similar Pull-Requests in GitHub\u2019s Repositories Using Machine Learning"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3258-7304","authenticated-orcid":false,"given":"Hamzeh","family":"Eyal Salman","sequence":"first","affiliation":[{"name":"Software Engineering Department, IT Faculty, Mutah University, Al-Karak 61710, Jordan"}]},{"given":"Zakarea","family":"Alshara","sequence":"additional","affiliation":[{"name":"Software Engineering Department, IT Faculty, Jordan University of Science and Technology, Irbid 22110, Jordan"}]},{"given":"Abdelhak-Djamel","family":"Seriai","sequence":"additional","affiliation":[{"name":"LIRMM Lab, University of Montpellier, 34000 Montpellier, 
France"}]}],"member":"1968","published-online":{"date-parts":[[2022,2,3]]},"reference":[{"key":"ref_1","unstructured":"Li, Z., Yu, Y., Zhou, M., Wang, T., Yin, G., Lan, L., and Wang, H. (2020). Redundancy, Context, and Preference: An Empirical Study of Duplicate Pull Requests in OSS Projects. IEEE Trans. Softw. Eng., 1\u201328."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Rahman, M.M., and Roy, C.K. (June, January 31). An Insight into the Pull Requests of GitHub. Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014), Hyderabad, India.","DOI":"10.1145\/2597073.2597121"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1142\/S0218194015400045","article-title":"Feature-Level Change Impact Analysis Using Formal Concept Analysis","volume":"25","author":"Salman","year":"2015","journal-title":"Int. J. Softw. Eng. Knowl. Eng."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Eyal Salman, H., Seriai, A.D., and Dony, C. (2013, January 4\u20136). Feature-to-Code Traceability in Legacy Software Variants. Proceedings of the 2013 39th Euromicro Conference on Software Engineering and Advanced Applications, Santander, Spain.","DOI":"10.1109\/SEAA.2013.65"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wang, Q., Xu, B., Xia, X., Wang, T., and Li, S. (2019, January 28\u201329). Duplicate Pull Request Detection: When Time Matters. Proceedings of the 11th Asia-Pacific Symposium on Internetware (Internetware \u201919), Fukuoka, Japan.","DOI":"10.1145\/3361242.3361254"},{"key":"ref_6","unstructured":"Zhou, S., St\u0103nciulescu, c., Le\u00dfenich, O., Xiong, Y., W\u0105sowski, A., and K\u00e4stner, C. (June, January 27). Identifying Features in Forks. 
Proceedings of the 40th International Conference on Software Engineering (ICSE \u201918), Gothenburg, Sweden."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"5683","DOI":"10.1007\/s00500-019-04217-7","article-title":"Core-reviewer recommendation based on Pull Request topic model and collaborator social network","volume":"24","author":"Liao","year":"2020","journal-title":"Soft Comput."},{"key":"ref_8","unstructured":"Wang, X., Lo, D., and Shihab, E. (2019, January 24\u201327). Identifying Redundancies in Fork-based Development. Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2019, Hangzhou, China."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Li, Z., Yin, G., Yu, Y., Wang, T., and Wang, H. (2017, January 23). Detecting Duplicate Pull-Requests in GitHub. Proceedings of the 9th Asia-Pacific Symposium on Internetware (Internetware\u201917), Shanghai, China.","DOI":"10.1145\/3131704.3131725"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"204","DOI":"10.1016\/j.infsof.2016.01.004","article-title":"Reviewer recommendation for pull-requests in GitHub: What can we learn from code review and bug assignment?","volume":"74","author":"Yu","year":"2016","journal-title":"Inf. Softw. Technol."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Thongtanunam, P., Kula, R.G., Cruz, A.E.C., Yoshida, N., and Iida, H. (2014, January 2\u20133). Improving Code Review Effectiveness through Reviewer Recommendations. Proceedings of the 7th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE 2014), Hyderabad, India.","DOI":"10.1145\/2593702.2593705"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Xia, Z., Sun, H., Jiang, J., Wang, X., and Liu, X. (2017, January 3). A hybrid approach to code reviewer recommendation with collaborative filtering. 
Proceedings of the 2017 6th International Workshop on Software Mining (SoftwareMining), Urbana, IL, USA.","DOI":"10.1109\/SOFTWAREMINING.2017.8100850"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chueshev, A., Lawall, J., Bendraou, R., and Ziadi, T. (October, January 28). Expanding the Number of Reviewers in Open-Source Projects by Recommending Appropriate Developers. Proceedings of the ICSME 2020\u2014International Conference on Software Maintenance and Evolution, Adelaide, Australia.","DOI":"10.1109\/ICSME46990.2020.00054"},{"key":"ref_14","unstructured":"Jain, A.K., and Dubes, R.C. (1988). Algorithms for Clustering Data, Prentice-Hall, Inc."},{"key":"ref_15","unstructured":"Zhao, H., and Qi, Z. (2010, January 9\u201310). Hierarchical Agglomerative Clustering with Ordering Constraints. Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining, Phuket, Thailand."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1145\/1060710.1060712","article-title":"Challenges of Migrating to Agile Methodologies","volume":"48","author":"Nerur","year":"2005","journal-title":"Commun. ACM"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Dabbish, L., Stuart, C., Tsay, J., and Herbsleb, J. (2012, January 11\u201315). Social Coding in GitHub: Transparency and Collaboration in an Open Software Repository. Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW \u201912), Seattle, WA, USA.","DOI":"10.1145\/2145204.2145396"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Yu, S., Xu, L., Zhang, Y., Wu, J., Liao, Z., and Li, Y. (2018, January 20\u201324). NBSL: A Supervised Classification Model of Pull Request in Github. 
Proceedings of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA.","DOI":"10.1109\/ICC.2018.8422103"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1016\/j.infsof.2016.10.006","article-title":"Who should comment on this pull request? Analyzing attributes for more accurate commenter recommendation in pull-based development","volume":"84","author":"Jiang","year":"2017","journal-title":"Inf. Softw. Technol."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Yu, Y., Wang, H., Filkov, V., Devanbu, P., and Vasilescu, B. (2015, January 16\u201317). Wait for It: Determinants of Pull Request Evaluation Latency on GitHub. Proceedings of the 2015 IEEE\/ACM 12th Working Conference on Mining Software Repositories, Florence, Italy.","DOI":"10.1109\/MSR.2015.42"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1007\/s11390-020-9935-1","article-title":"Detecting Duplicate Contributions in Pull-Based Model Combining Textual and Change Similarities","volume":"36","author":"Li","year":"2021","journal-title":"J. Comput. Sci. Technol."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1006\/jcss.1997.1504","article-title":"A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting","volume":"55","author":"Freund","year":"1997","journal-title":"J. Comput. Syst. Sci."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Runeson, P., Alexandersson, M., and Nyholm, O. (2007, January 20\u201326). Detection of Duplicate Defect Reports Using Natural Language Processing. Proceedings of the 29th International Conference on Software Engineering (ICSE\u201907), Minneapolis, MN, USA.","DOI":"10.1109\/ICSE.2007.32"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Wang, X., Zhang, L., Xie, T., Anvik, J., and Sun, J. (2008, January 10\u201318). 
An approach to detecting duplicate bug reports using natural language and execution information. Proceedings of the 2008 ACM\/IEEE 30th International Conference on Software Engineering, Leipzig, Germany.","DOI":"10.1145\/1368088.1368151"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Sun, C., Lo, D., Khoo, S.C., and Jiang, J. (2011, January 6\u201310). Towards more accurate retrieval of duplicate bug reports. Proceedings of the 2011 26th IEEE\/ACM International Conference on Automated Software Engineering (ASE 2011), Lawrence, KS, USA.","DOI":"10.1109\/ASE.2011.6100061"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"He, J., Xu, L., Yan, M., Xia, X., and Lei, Y. (2020, January 13\u201315). Duplicate Bug Report Detection Using Dual-Channel Convolutional Neural Networks. Proceedings of the 28th International Conference on Program Comprehension (ICPC \u201920), Seoul, Korea.","DOI":"10.1145\/3387904.3389263"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Lipcak, J., and Rossi, B. (2018, January 29\u201331). A Large-Scale Study on Source Code Reviewer Recommendation. Proceedings of the 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Prague, Czech Republic.","DOI":"10.1109\/SEAA.2018.00068"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Balachandran, V. (2013, January 18\u201326). Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation. Proceedings of the 2013 35th International Conference on Software Engineering (ICSE), San Francisco, CA, USA.","DOI":"10.1109\/ICSE.2013.6606642"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Thongtanunam, P., Tantithamthavorn, C., Kula, R.G., Yoshida, N., Iida, H., and Matsumoto, K. (2015, January 2\u20136). Who should review my code? A file location-based code-reviewer recommendation approach for Modern Code Review. 
Proceedings of the 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Montreal, QC, Canada.","DOI":"10.1109\/SANER.2015.7081824"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Xia, X., Lo, D., Wang, X., and Yang, X. (October, January 29). Who should review this change?: Putting text and file location analyses together for more accurate recommendations. Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), Bremen, Germany.","DOI":"10.1109\/ICSM.2015.7332472"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"530","DOI":"10.1109\/TSE.2015.2500238","article-title":"Automatically Recommending Peer Reviewers in Modern Code Review","volume":"42","author":"Zanjani","year":"2016","journal-title":"IEEE Trans. Softw. Eng."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Hannebauer, C., Patalas, M., St\u00fcnkelt, S., and Gruhn, V. (2016, January 3\u20137). Automatically recommending code reviewers based on their expertise: An empirical comparison. Proceedings of the 2016 31st IEEE\/ACM International Conference on Automated Software Engineering (ASE), Singapore.","DOI":"10.1145\/2970276.2970306"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Rahman, M.M., Roy, C.K., and Collins, J.A. (2016, January 14\u201322). CORRECT: Code Reviewer Recommendation in GitHub Based on Cross-Project and Technology Experience. Proceedings of the 2016 IEEE\/ACM 38th International Conference on Software Engineering Companion (ICSE-C), Austin, TX, USA.","DOI":"10.1145\/2889160.2889244"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Mirsaeedi, E., and Rigby, P.C. (2020, January 6\u201311). Mitigating Turnover with Code Review Recommendation: Balancing Expertise, Workload, and Knowledge Distribution. 
Proceedings of the 2020 IEEE\/ACM 42nd International Conference on Software Engineering (ICSE), Seoul, Korea.","DOI":"10.1145\/3377811.3380335"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Yu, Y., Wang, H., Yin, G., and Ling, C.X. (2014, January 1\u20134). Who Should Review this Pull-Request: Reviewer Recommendation to Expedite Crowd Collaboration. Proceedings of the 2014 21st Asia-Pacific Software Engineering Conference, Jeju, Korea.","DOI":"10.1109\/APSEC.2014.57"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1016\/j.jss.2017.05.039","article-title":"Identification multi-level frequent usage patterns from apis","volume":"130","author":"Salman","year":"2017","journal-title":"J. Syst. Softw."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Tarawneh, A.S., Hassanat, A.B., Chetverikov, D., Lendak, I., and Verma, C. (2019, January 9\u201311). Invoice classification using deep features and machine learning techniques. Proceedings of the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan.","DOI":"10.1109\/JEEIT.2019.8717504"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Hassanat, A.B. (2018). Two-point-based binary search trees for accelerating big data classification using KNN. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0207772"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Tarawneh, A.S., Chetverikov, D., Verma, C., and Hassanat, A.B. (2018, January 17\u201319). Stability and reduction of statistical features for image classification and retrieval: Preliminary results. Proceedings of the 2018 9th International Conference on Information and Communication Systems (ICICS), Jeju Island, Korea.","DOI":"10.1109\/IACS.2018.8355452"},{"key":"ref_40","first-page":"347","article-title":"Classification and gender recognition from veiled-faces","volume":"9","author":"Hassanat","year":"2017","journal-title":"Int. J. 
Biom."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"59069","DOI":"10.1109\/ACCESS.2020.2983003","article-title":"Smotefuna: Synthetic minority over-sampling technique based on furthest neighbour algorithm","volume":"8","author":"Tarawneh","year":"2020","journal-title":"IEEE Access"},{"key":"ref_42","unstructured":"Jeong, G., Kim, S., Zimmermann, T., and Yi, K. (2009). Improving Code Review by Predicting Reviewers and Acceptance of Patches. Research on Software Analysis for Error-free Computing Center Tech-Memo (ROSAEC MEMO 2009-006), RSAEC Center."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"998","DOI":"10.1007\/s11390-015-1577-3","article-title":"CoreDevRec: Automatic Core Member Recommendation for Contribution Evaluation","volume":"30","author":"Jiang","year":"2015","journal-title":"J. Comput. Sci. Technol."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1129","DOI":"10.1007\/s11771-018-3812-x","article-title":"RevRec: A two-layer reviewer recommendation algorithm in pull-based development model","volume":"25","author":"Yang","year":"2018","journal-title":"J. Cent. South Univ."},{"key":"ref_45","unstructured":"Manning, C.D., and Sch\u00fctze, H. (1999). Foundations of Statistical Natural Language Processing, MIT Press."},{"key":"ref_46","unstructured":"Porter, M.F. (1997). An Algorithm for Suffix Stripping. Readings in Information Retrieval, Morgan Kaufmann Publishers Inc."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-Weighting Approaches in Automatic Text Retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Process. Manag."},{"key":"ref_48","unstructured":"Rahman, M.M., Chakraborty, S., Kaiser, G.E., and Ray, B. (2018). A Case Study on the Impact of Similarity Measure on Information Retrieval based Software Engineering Tasks. 
arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Eyal Salman, H., Hammad, M., Seriai, A.D., and Al-Sbou, A. (2018). Semantic Clustering of Functional Requirements Using Agglomerative Hierarchical Clustering. Information, 9.","DOI":"10.3390\/info9090222"},{"key":"ref_50","first-page":"39","article-title":"Comparison between Standard K-Mean Clustering and Improved K-Mean Clustering","volume":"146","author":"Pandey","year":"2016","journal-title":"Int. J. Comput. Appl."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1089\/big.2018.0175","article-title":"Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review","volume":"7","author":"Alfeilat","year":"2019","journal-title":"Big Data"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Manning, C.D., Raghavan, P., and Sch\u00fctze, H. (2008). Introduction to Information Retrieval, Cambridge University Press.","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Yu, Y., Li, Z., Yin, G., Wang, T., and Wang, H. (2018). 
A Dataset of Duplicate Pull-Requests in Github, Association for Computing Machinery.","DOI":"10.1145\/3196398.3196455"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/13\/2\/73\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:13:45Z","timestamp":1760134425000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/13\/2\/73"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,3]]},"references-count":53,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2022,2]]}},"alternative-id":["info13020073"],"URL":"https:\/\/doi.org\/10.3390\/info13020073","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2022,2,3]]}}}