{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,6]],"date-time":"2025-11-06T20:16:05Z","timestamp":1762460165624,"version":"build-2065373602"},"reference-count":45,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2023,6,14]],"date-time":"2023-06-14T00:00:00Z","timestamp":1686700800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>This paper presents a dataset containing automatically collected source codes solving unique programming exercises of different types. The programming exercises were automatically generated by the Digital Teaching Assistant (DTA) system that automates a massive Python programming course at MIREA\u2014Russian Technological University (RTU MIREA). Source codes of the small programs grouped by the type of the solved task can be used for benchmarking source code classification and clustering algorithms. Moreover, the data can be used for training intelligent program synthesizers or benchmarking mutation testing frameworks, and more applications are yet to be discovered. We describe the architecture of the DTA system, aiming to provide detailed insight regarding how and why the dataset was collected. In addition, we describe the algorithms responsible for source code analysis in the DTA system. These algorithms use vector representations of programs based on Markov chains, compute pairwise Jensen\u2013Shannon divergences of programs, and apply hierarchical clustering algorithms in order to automatically discover high-level concepts used by students while solving unique tasks. The proposed approach can be incorporated into massive programming courses when there is a need to identify approaches implemented by students.<\/jats:p>","DOI":"10.3390\/data8060109","type":"journal-article","created":{"date-parts":[[2023,6,15]],"date-time":"2023-06-15T01:32:57Z","timestamp":1686792777000},"page":"109","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4516-3746","authenticated-orcid":false,"given":"Liliya A.","family":"Demidova","sequence":"first","affiliation":[{"name":"Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA\u2014Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6418-6797","authenticated-orcid":false,"given":"Elena G.","family":"Andrianova","sequence":"additional","affiliation":[{"name":"Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA\u2014Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1039-2429","authenticated-orcid":false,"given":"Peter N.","family":"Sovietov","sequence":"additional","affiliation":[{"name":"Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA\u2014Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1977-8165","authenticated-orcid":false,"given":"Artyom V.","family":"Gorchakov","sequence":"additional","affiliation":[{"name":"Institute of Information Technologies, Federal State Budget Educational Institution of Higher Education, MIREA\u2014Russian Technological University, 78, Vernadsky Avenue, 119454 Moscow, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2023,6,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1016\/j.entcs.2008.06.039","article-title":"A Comparative Study of Industrial Static Analysis Tools","volume":"217","author":"Emanuelsson","year":"2008","journal-title":"Electron. Notes Theor. Comput. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1109\/MS.2008.130","article-title":"Using Static Analysis to Find Bugs","volume":"25","author":"Ayewah","year":"2008","journal-title":"IEEE Softw."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Jiang, H., Yang, H., Qin, S., Su, Z., Zhang, J., and Yan, J. (2017, January 13\u201317). Detecting Energy Bugs in Android Apps Using Static Analysis. Proceedings of the Formal Methods and Software Engineering: 19th International Conference on Formal Engineering Methods, ICFEM 2017, Xi\u2019an, China.","DOI":"10.1007\/978-3-319-68690-5_12"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"McPeak, S., Gros, C.H., and Ramanathan, M.K. (2013, January 18\u201326). Scalable and Incremental Software Bug Detection. Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, Saint Petersburg, Russia.","DOI":"10.1145\/2491411.2501854"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1109\/MS.2016.147","article-title":"Cyclomatic complexity","volume":"33","author":"Ebert","year":"2016","journal-title":"IEEE Softw."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Campbell, G.A. (2018, January 27\u201328). Cognitive complexity: An overview and evaluation. Proceedings of the 2018 International Conference on Technical Debt, Gothenburg, Sweden.","DOI":"10.1145\/3194164.3194186"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Bruch, M., Monperrus, M., and Mezini, M. (2009, January 24\u201328). Learning from Examples to Improve Code Completion Systems. Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Amsterdam, The Netherlands.","DOI":"10.1145\/1595696.1595728"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Svyatkovskiy, A., Zhao, Y., Fu, S., and Sundaresan, N. (2019, January 3\u20137). Pythia: Ai-assisted Code Completion System. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA.","DOI":"10.1145\/3292500.3330699"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Terada, K., and Watanobe, Y. (2019, January 9\u201310). Code Completion for Programming Education Based on Recurrent Neural Network. Proceedings of the 2019 IEEE 11th International Workshop on Computational Intelligence and Applications (IWCIA), Hiroshima, Japan.","DOI":"10.1109\/IWCIA47330.2019.8955090"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Alon, U., Zilberstein, M., Levy, O., and Yahav, E. (2019, January 22\u201326). code2vec: Learning Distributed Representations of Code. Proceedings of the ACM on Programming Languages, Providence, RI, USA.","DOI":"10.1145\/3290353"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Li, Y., Wang, S., and Nguyen, T. (2021, January 22\u201330). A Context-based Automated Approach for Method Name Consistency Checking and Suggestion. Proceedings of the 2021 IEEE\/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain.","DOI":"10.1109\/ICSE43902.2021.00060"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Lacomis, J., Yin, P., Schwarts, E., Allamanis, M., Goues, C., Neubig, G., and Vasilescu, B. (2019, January 11\u201315). Dire: A Neural Approach to Decompiled Identifier Naming. Proceedings of the 2019 34th IEEE\/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA.","DOI":"10.1109\/ASE.2019.00064"},{"key":"ref_13","unstructured":"Marcus, A., and Maletic, J.I. (2001, January 26\u201329). Identification of High-level Concept Clones in Source Code. Proceedings of the 16th Annual International Conference on Automated Software Engineering (ASE 2001), San Diego, CA, USA."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"651","DOI":"10.1093\/comjnl\/bxh119","article-title":"PDetect: A Clustering Approach for Detecting Plagiarism in Source Code Datasets","volume":"48","author":"Moussiades","year":"2005","journal-title":"Comput. J."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Sovietov, P.N., and Gorchakov, A.V. (2022, January 26\u201327). Digital Teaching Assistant for the Python Programming Course. Proceedings of the 2022 2nd International Conference on Technology Enhanced Learning in Higher Education (TELE), Lipetsk, Russia.","DOI":"10.1109\/TELE55498.2022.9801060"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"7","DOI":"10.32362\/2500-316X-2022-10-3-7-23","article-title":"Pedagogical Design of a Digital Teaching Assistant in Massive Professional Training for the Digital Economy","volume":"10","author":"Andrianova","year":"2022","journal-title":"Russ. Technol. J."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"81154","DOI":"10.1109\/ACCESS.2020.2990980","article-title":"Building a Comprehensive Automated Programming Assessment System","volume":"8","year":"2020","journal-title":"IEEE Access"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Queir\u00f3s, R.A.P., and Leal, J.P. (2012, January 3\u20135). PETCHA: A Programming Exercises Teaching Assistant. Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education, Haifa, Israel.","DOI":"10.1145\/2325296.2325344"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3","DOI":"10.3390\/software1010002","article-title":"Automated Code Assessment for Education: Review, Classification and Perspectives on Techniques and Tools","volume":"1","year":"2022","journal-title":"Software"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Jiang, L., Misherghi, G., Su, Z., and Glondu, S. (2007, January 20\u201326). Deckard: Scalable and Accurate Tree-Based Detection of Code Clones. Proceedings of the 29-th International Conference on Software Engineering (ICSE\u201907), Minneapolis, MN, USA.","DOI":"10.1109\/ICSE.2007.30"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Kustanto, C., and Liem, I. (2009, January 27\u201329). Automatic Source Code Plagiarism Detection. Proceedings of the 2009 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel\/Distributed Computing, Daegu, Republic of Korea.","DOI":"10.1109\/SNPD.2009.62"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Yasaswi, J., Kailash, S., Chilupuri, A., Purini, S., and Jawahar, C.V. (2017, January 5\u20137). Unsupervised Learning-Based Approach for Plagiarism Detection in Programming Assignments. Proceedings of the 10th Innovations in Software Engineering Conference, Jaipur, India.","DOI":"10.1145\/3021460.3021473"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Sovietov, P. (2021, January 7\u20139). Automatic Generation of Programming Exercises. Proceedings of the 2021 1st International Conference on Technology Enhanced Learning in Higher Education (TELE), Lipetsk, Russia.","DOI":"10.1109\/TELE52840.2021.9482762"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"51","DOI":"10.21667\/1995-4565-2022-81-51-64","article-title":"Clustering of Program Source Text Representations Based on Markov Chains","volume":"81","author":"Demidova","year":"2022","journal-title":"Vestn. Ryazan State Radio Eng. Univ."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Demidova, L.A., and Gorchakov, A.V. (2022). Classification of Program Texts Represented as Markov Chains with Biology-Inspired Algorithms-Enhanced Extreme Learning Machines. Algorithms, 15.","DOI":"10.3390\/a15090329"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Allamanis, M., and Sutton, C. (2014, January 16\u201321). Mining Idioms from Source Code. Proceedings of the 22nd ACM Sigsoft International Symposium on Foundations of Software Engineering, Hong Kong, China.","DOI":"10.1145\/2635868.2635901"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Pham, H.S., Nijssen, S., Mens, K., Nucci, D.D., Molderez, T., Roover, C.D., Fabry, J., and Zaytsev, V. (2019, January 28\u201330). Mining Patterns in Source Code using Tree Mining Algorithms. Proceedings of the Discovery Science: 22nd International Conference, DS 2019, Split, Croatia.","DOI":"10.1007\/978-3-030-33778-0_35"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1109\/18.61115","article-title":"Divergence Measures Based on the Shannon Entropy","volume":"37","author":"Lin","year":"1991","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Nielsen, F. (2019). On the Jensen\u2013Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21.","DOI":"10.3390\/e21050485"},{"key":"ref_30","first-page":"130","article-title":"A Statistical Method for Evaluating Systematic Relationships","volume":"11","author":"Sokal","year":"1957","journal-title":"Evolution"},{"key":"ref_31","unstructured":"Peveler, M., Maicus, E., and Cutler, B. (March, January 27). Comparing Jailed Sandboxes vs Containers Within an Autograding System. Proceedings of the 50th ACM Technical Symposium on Computer Science Education, Minneapolis, MN, USA."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1497","DOI":"10.1007\/s10586-021-03517-8","article-title":"Performance and Isolation Analysis of RunC, gVisor and Kata Containers Runtimes","volume":"25","author":"Wang","year":"2022","journal-title":"Clust. Comput."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"557","DOI":"10.1016\/S0377-2217(98)00364-6","article-title":"Constraint Satisfaction Problems: Algorithms and Applications","volume":"119","author":"Brailsford","year":"1999","journal-title":"Eur. J. Oper. Res."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Mailund, T. (2019). Introducing Markdown and Pandoc: Using Markup Language and Document Converter, Apress.","DOI":"10.1007\/978-1-4842-5149-2"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1203","DOI":"10.1002\/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N","article-title":"An Open Graph Visualization System and its Applications to Software Engineering","volume":"30","author":"Gansner","year":"2000","journal-title":"Softw. Pract. Exp."},{"key":"ref_36","unstructured":"Fowler, M., Rice, D., Foemmel, M., Hieatt, E., Mee, R., and Stafford, R. (2002). Patterns of Enterprise Application Architecture, Addison-Wesley Professional. Chapter 14."},{"key":"ref_37","first-page":"20","article-title":"SQLAlchemy","volume":"2","author":"Bayer","year":"2012","journal-title":"Archit. Open-Source Appl."},{"key":"ref_38","unstructured":"Python Software Foundation (2023, March 28). AST\u2014Abstract Syntax Trees. Available online: https:\/\/docs.python.org\/3\/library\/ast.html."},{"key":"ref_39","first-page":"9129","article-title":"Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMAP, and PaCMAP for Data Visualization","volume":"22","author":"Wang","year":"2021","journal-title":"J. Mach. Learn. Res."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Demidova, L.A., and Gorchakov, A.V. (2022). Fuzzy Information Discrimination Measures and Their Application to Low Dimensional Embedding Construction in the UMAP Algorithm. J. Imaging, 8.","DOI":"10.3390\/jimaging8040113"},{"key":"ref_41","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Shahapure, K.R., and Nicholas, C. (2020, January 6\u20139). Cluster Quality Analysis Using Silhouette Score. Proceedings of the 2020 IEEE 7th international conference on data science and advanced analytics (DSAA), Sydney, Australia.","DOI":"10.1109\/DSAA49011.2020.00096"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Xing, Z., Xia, X., Xu, X., and Zhu, L. (2022, January 14\u201316). Making Python code idiomatic by automatic refactoring non-idiomatic Python code with pythonic idioms. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, University Town, Singapore.","DOI":"10.1145\/3540250.3549143"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Russell, R.L., Kim, L., Hamilton, L.H., Lazovich, T., Harer, J.A., Ozdemir, O., Ellingwood, P.M., and McConley, M.W. (2018, January 17\u201320). Automated vulnerability detection in source code using deep representation learning. Proceedings of the 17th IEEE international conference on machine learning and applications (ICMLA), Orlando, FL, USA.","DOI":"10.1109\/ICMLA.2018.00120"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Bogomolov, E., Kovalenko, V., Rebryk, Y., Baccheli, A., and Bryksin, T. (2021, January 23\u201328). Authorship attribution of source code: A language-agnostic approach and applicability in software engineering. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece.","DOI":"10.1145\/3468264.3468606"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/6\/109\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:55:08Z","timestamp":1760126108000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/6\/109"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,14]]},"references-count":45,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2023,6]]}},"alternative-id":["data8060109"],"URL":"https:\/\/doi.org\/10.3390\/data8060109","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2023,6,14]]}}}