{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,7]],"date-time":"2025-10-07T14:41:21Z","timestamp":1759848081054,"version":"3.37.3"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2019,12,27]],"date-time":"2019-12-27T00:00:00Z","timestamp":1577404800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2019,12,27]],"date-time":"2019-12-27T00:00:00Z","timestamp":1577404800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Appl Intell"],"published-print":{"date-parts":[[2020,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In this paper we propose a novel method for learning how algorithms perform. Classically, algorithms are compared on a finite number of existing (or newly simulated) benchmark datasets based on some fixed metrics. The algorithm(s) with the smallest value of this metric are chosen to be the \u2018best performing\u2019. We offer a new approach to flip this paradigm. We instead aim to gain a richer picture of the performance of an algorithm by generating artificial data through genetic evolution, the purpose of which is to create populations of datasets for which a particular algorithm performs well on a given metric. These datasets can be studied so as to learn what attributes lead to a particular progression of a given algorithm. Following a detailed description of the algorithm as well as a brief description of an open source implementation, a case study in clustering is presented. This case study demonstrates the performance and nuances of the method which we call Evolutionary Dataset Optimisation. In this study, a number of known properties about preferable datasets for the clustering algorithms known as<jats:italic>k<\/jats:italic>-means and DBSCAN are realised in the generated datasets.<\/jats:p>","DOI":"10.1007\/s10489-019-01592-4","type":"journal-article","created":{"date-parts":[[2019,12,27]],"date-time":"2019-12-27T05:47:32Z","timestamp":1577425652000},"page":"1172-1191","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Evolutionary dataset optimisation: learning algorithm quality through evolution"],"prefix":"10.1007","volume":"50","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3788-7691","authenticated-orcid":false,"given":"Henry","family":"Wilde","sequence":"first","affiliation":[]},{"given":"Vincent","family":"Knight","sequence":"additional","affiliation":[]},{"given":"Jonathan","family":"Gillard","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2019,12,27]]},"reference":[{"issue":"Int J Comput Sc","key":"1592_CR1","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1016\/j.engappai.2018.05.003","volume":"73","author":"LM Abualigah","year":"2018","unstructured":"Abualigah LM, Khader AT, Hanandeh ES (2018) A combination of objective functions and hybrid krill herd algorithm for text document clustering analysis. Eng Appl Artif Intel 73(Int J Comput Sci Eng Appl 5 1 2015):111\u2013125. https:\/\/doi.org\/10.1016\/j.engappai.2018.05.003","journal-title":"Eng Appl Artif Intel"},{"issue":"11","key":"1592_CR2","doi-asserted-by":"publisher","first-page":"4047","DOI":"10.1007\/s10489-018-1190-6","volume":"48","author":"LM Abualigah","year":"2018","unstructured":"Abualigah LM, Khader AT, Hanandeh ES (2018) Hybrid clustering analysis using improved krill herd algorithm. Appl Intell 48(11):4047\u20134071. https:\/\/doi.org\/10.1007\/s10489-018-1190-6","journal-title":"Appl Intell"},{"key":"1592_CR3","doi-asserted-by":"publisher","first-page":"570","DOI":"10.1016\/j.procs.2016.09.444","volume":"102","author":"A Amirjanov","year":"2016","unstructured":"Amirjanov A (2016) Modeling the dynamics of a changing range genetic algorithm. Procedia Comput Sci 102:570\u2013577. https:\/\/doi.org\/10.1016\/j.procs.2016.09.444","journal-title":"Procedia Comput Sci"},{"key":"1592_CR4","doi-asserted-by":"publisher","unstructured":"B\u00e4ck T (1994) Selective pressure in evolutionary algorithms: a characterization of selection mechanisms. In: Proceedings of the first IEEE conference on evolutionary computation. IEEE World congress on computational intelligence, pp 57\u201362, DOI https:\/\/doi.org\/10.1109\/ICEC.1994.350042","DOI":"10.1109\/ICEC.1994.350042"},{"issue":"4","key":"1592_CR5","doi-asserted-by":"publisher","first-page":"891","DOI":"10.1007\/s10618-015-0444-8","volume":"30","author":"G Campos","year":"2016","unstructured":"Campos G, Zimek A, Sander J, Campello R, Micenkov\u00e1 B, Schubert E, Assent I, Houle M (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Disc 30(4):891\u2013927. https:\/\/doi.org\/10.1007\/s10618-015-0444-8","journal-title":"Data Min Knowl Disc"},{"key":"1592_CR6","doi-asserted-by":"crossref","unstructured":"Chen Y, Elliot M, Sakshaug J (2016) A genetic algorithm approach to synthetic data production. In: PrAISe@ECAI","DOI":"10.1145\/2970030.2970034"},{"key":"1592_CR7","doi-asserted-by":"publisher","first-page":"160","DOI":"10.1007\/978-3-319-99771-1_11","volume-title":"Privacy in Statistical Databases","author":"Yingrui Chen","year":"2018","unstructured":"Chen Y, Elliot M, Smith D (2018) The application of genetic algorithms to data synthesis: a comparison of three crossover methods. In: Privacy in statistical databases. Springer International Publishing, pp 160\u2013171, DOI https:\/\/doi.org\/10.1007\/978-3-319-99771-1_11"},{"key":"1592_CR8","doi-asserted-by":"crossref","unstructured":"Dau HA, Keogh E, Kamgar K, Yeh CCM, Zhu Y, Gharghabi S, Ratanamahatana CA, Yanping HB, Begum N, Bagnall A, Mueen A, Batista G (2018) Hexagon-ML: the UCR time series classification archive. https:\/\/www.cs.ucr.edu\/eamonn\/time_series_data_2018\/","DOI":"10.1109\/JAS.2019.1911747"},{"issue":"1","key":"1592_CR9","doi-asserted-by":"publisher","first-page":"102","DOI":"10.1137\/040617364","volume":"44","author":"Q Du","year":"2006","unstructured":"Du Q, Emelianenko M, Ju L (2006) Convergence of the lloyd algorithm for computing centroidal voronoi tessellations. SIAM J Numer Anal 44(1):102\u2013119. https:\/\/doi.org\/10.1137\/040617364","journal-title":"SIAM J Numer Anal"},{"issue":"4","key":"1592_CR10","doi-asserted-by":"publisher","first-page":"551","DOI":"10.1109\/TIT.1983.1056714","volume":"29","author":"H Edelsbrunner","year":"1983","unstructured":"Edelsbrunner H, Kirkpatrick D, Seidel R (1983) On the shape of a set of points in the plane. IEEE Trans Inf Theory 29(4):551\u2013559. https:\/\/doi.org\/10.1109\/TIT.1983.1056714","journal-title":"IEEE Trans Inf Theory"},{"key":"1592_CR11","unstructured":"Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD"},{"key":"1592_CR12","unstructured":"Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems. http:\/\/papers.nips.cc\/paper\/5423-generative-adversarial-nets.pdf, vol 27. Curran Associates, Inc., pp 2672\u20132680"},{"issue":"3","key":"1592_CR13","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1023\/A:1009769707641","volume":"2","author":"Z Huang","year":"1998","unstructured":"Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowl Discov 2(3):283\u2013304. https:\/\/doi.org\/10.1023\/A:1009769707641","journal-title":"Data Mining Knowl Discov"},{"key":"1592_CR14","doi-asserted-by":"publisher","first-page":"333","DOI":"10.14355\/ijes.2013.0305.05","volume":"3","author":"K Jebari","year":"2013","unstructured":"Jebari K (2013) Selection methods for genetic algorithms. Int J Emerg Sci 3:333\u2013344","journal-title":"Int J Emerg Sci"},{"key":"1592_CR15","doi-asserted-by":"publisher","first-page":"876","DOI":"10.12688\/f1000research.11407.1","volume":"6","author":"Rafael C. Jim\u00e9nez","year":"2017","unstructured":"Jim\u00e9nez RC, Kuzak M, Alhamdoosh M, Barker M, Batut B, Borg M, Capella-Gutierrez S, Chue Hong N, Cook M, Corpas M, Flannery M, Garcia L, Gelp\u00ed JL, Gladman S, Goble C, Gonz\u00e1lez Ferreiro M, Gonzalez-Beltran A, Griffin PC, Gr\u00fcning B, Hagberg J, Holub P, Hooft R, Ison J, Katz DS, Lesko\u0161ek B, L\u00f3pez G\u00f3mez F, Oliveira LJ, Mellor D, Mosbergen R, Mulder N, Perez-Riverol Y, Pergl R, Pichler H, Pope B, Sanz F, Schneider MV, Stodden V, Suchecki R, Svobodov\u00e1Va\u0159ekov\u00e1 R, Talvik HA, Todorov I, Treloar A, Tyagi S, van Gompel M, Vaughan D, Via A, Wang X, Watson-Haigh NS, Crouch S (2017) Four simple recommendations to encourage best practices in research software. F1000Research 6 ELIXIR\u2013876. https:\/\/doi.org\/10.12688\/f1000research.11407.1","journal-title":"F1000Research"},{"key":"1592_CR16","doi-asserted-by":"crossref","unstructured":"Koleejan C, Xue B, Zhang M (2015) Code coverage optimisation in genetic algorithms and particle swarm optimisation for automatic software test data generation. 2015 IEEE Congress on Evolutionary Computation (CEC), pp 1204\u20131211","DOI":"10.1109\/CEC.2015.7257026"},{"key":"1592_CR17","doi-asserted-by":"publisher","first-page":"31","DOI":"10.5120\/12636-9343","volume":"72","author":"M Kuehn","year":"2013","unstructured":"Kuehn M, Severin T, Salzwedel H (2013) Variable mutation rate at genetic algorithms: introduction of chromosome fitness in connection with multi-chromosome representation. Int J Comput Appl 72:31\u201338. https:\/\/doi.org\/10.5120\/12636-9343","journal-title":"Int J Comput Appl"},{"issue":"1","key":"1592_CR18","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1109\/4235.910464","volume":"5","author":"YW Leung","year":"2001","unstructured":"Leung YW, Wang Y (2001) An orthogonal genetic algorithm with quantization for global numerical optimization. IEEE Trans Evol Comput 5(1):41\u201353. https:\/\/doi.org\/10.1109\/4235.910464","journal-title":"IEEE Trans Evol Comput"},{"key":"1592_CR19","doi-asserted-by":"crossref","unstructured":"Liu L, Cheng L, Liu Y, Jia Y, Rosenblum D (2016) Recognizing complex activities by a probabilistic interval-based model","DOI":"10.1609\/aaai.v30i1.10155"},{"key":"1592_CR20","doi-asserted-by":"publisher","first-page":"700","DOI":"10.1016\/j.ins.2016.07.025","volume":"367-368","author":"L Mart\u00ed","year":"2016","unstructured":"Mart\u00ed L, Garc\u00eda J, Berlanga A, Molina JM (2016) A stopping criterion for multi-objective optimization evolutionary algorithms. Inform Sci 367-368:700\u2013718. https:\/\/doi.org\/10.1016\/j.ins.2016.07.025","journal-title":"Inform Sci"},{"key":"1592_CR21","doi-asserted-by":"publisher","unstructured":"Matejka J, Fitzmaurice G (2017) Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing. In: Proceedings of the 2017 CHI conference on human factors in computing systems, CHI \u201917. ACM, pp 1290\u20131294, DOI https:\/\/doi.org\/10.1145\/3025453.3025912","DOI":"10.1145\/3025453.3025912"},{"key":"1592_CR22","doi-asserted-by":"crossref","unstructured":"McKinney W Data structures for statistical computing in python. In: Proceedings of the 9th Python in science conference (2010\u2013). https:\/\/pandas.pydata.org\/. [Online; accessed 2019-03-01]","DOI":"10.25080\/Majora-92bf1922-00a"},{"key":"1592_CR23","doi-asserted-by":"publisher","first-page":"1085","DOI":"10.1109\/32.988709","volume":"27","author":"CC Michael","year":"2001","unstructured":"Michael CC, McGraw G, Schatz M (2001) Generating software test data by evolution. IEEE Trans Softw Eng 27:1085\u20131110","journal-title":"IEEE Trans Softw Eng"},{"issue":"4","key":"1592_CR24","doi-asserted-by":"publisher","first-page":"397","DOI":"10.1162\/106365602760972776","volume":"10","author":"T Motoki","year":"2002","unstructured":"Motoki T (2002) Calculating the expected loss of diversity of selection schemes. Evol Comput 10(4):397\u2013422. https:\/\/doi.org\/10.1162\/106365602760972776","journal-title":"Evol Comput"},{"key":"1592_CR25","unstructured":"Oliphant T NumPy: a guide to NumPy. USA: Trelgol Publishing (2006\u2013). http:\/\/www.numpy.org\/. [Online; accessed 2019-03-01]"},{"issue":"1","key":"1592_CR26","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1186\/s13040-017-0154-4","volume":"10","author":"RS Olson","year":"2017","unstructured":"Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH (2017) PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10(1):36. https:\/\/doi.org\/10.1186\/s13040-017-0154-4","journal-title":"BioData Mining"},{"key":"1592_CR27","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"key":"1592_CR28","first-page":"414","volume-title":"Lecture Notes in Computer Science","author":"Eugene Semenkin","year":"2012","unstructured":"Semenkin E, Semenkina M (2012) Self-configuring genetic algorithm with modified uniform crossover operator. In: Advances in swarm intelligence, pp 414\u2013421"},{"key":"1592_CR29","doi-asserted-by":"publisher","first-page":"76","DOI":"10.1016\/j.swevo.2017.12.009","volume":"40","author":"H Sharifipour","year":"2018","unstructured":"Sharifipour H, Shakeri M, Haghighi H (2018) Structural test data generation using a memetic ant colony optimization based on evolution strategies. Swarm Evol Comput 40:76\u201391. https:\/\/doi.org\/10.1016\/j.swevo.2017.12.009","journal-title":"Swarm Evol Comput"},{"key":"1592_CR30","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4899-3216-7","volume-title":"Image processing, analysis and machine vision","author":"M Sonka","year":"1993","unstructured":"Sonka M, Hlavac V, Boyle R (1993) Image processing, analysis and machine vision. Springer, US"},{"key":"1592_CR31","doi-asserted-by":"crossref","unstructured":"Suganuma M, Shirakawa S, Nagao T (2017) A genetic programming approach to designing convolutional neural network architectures. In: GECCO","DOI":"10.1145\/3071178.3071229"},{"key":"1592_CR32","unstructured":"Sun Y, Xue B, Zhang M, Yen GG (2018) Automatically designing CNN architectures using genetic algorithm for image classification CoRR abs\/1808.03818"},{"key":"1592_CR33","doi-asserted-by":"publisher","unstructured":"The EDO library developers (2019) EDO: v0.2.1. https:\/\/doi.org\/10.5281\/zenodo.2651075","DOI":"10.5281\/zenodo.2651075"},{"key":"1592_CR34","doi-asserted-by":"publisher","unstructured":"Torralba A, Efros AA (2011) Unbiased look at dataset bias. In: Proceedings of the 2011 IEEE conference on computer vision and pattern recognition. https:\/\/doi.org\/10.1109\/CVPR.2011.5995347","DOI":"10.1109\/CVPR.2011.5995347"},{"key":"1592_CR35","doi-asserted-by":"publisher","unstructured":"Vikhar PA (2016) Evolutionary algorithms: a critical review and its future prospects. In: 2016 International conference on global trends in signal processing, information computing and communication (ICGTSPICC), pp 261\u2013265, DOI https:\/\/doi.org\/10.1109\/ICGTSPICC.2016.7955308","DOI":"10.1109\/ICGTSPICC.2016.7955308"},{"key":"1592_CR36","doi-asserted-by":"publisher","unstructured":"Wang N, Shi J, Yeung DY, Jia J (2015) Understanding and diagnosing visual tracking systems. https:\/\/doi.org\/10.1109\/ICCV.2015.355","DOI":"10.1109\/ICCV.2015.355"},{"key":"1592_CR37","doi-asserted-by":"crossref","unstructured":"Wu X, Wu X, Kumar V (2009) The top ten algorithms in data mining CRC","DOI":"10.1201\/9781420089653"},{"key":"1592_CR38","doi-asserted-by":"publisher","first-page":"674","DOI":"10.1007\/978-3-642-10665-1_71","volume-title":"Lecture Notes in Computer Science","author":"Weizhong Zhao","year":"2009","unstructured":"Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on MapReduce. In: Cloud computing. Springer, Berlin, pp 674\u2013679, DOI https:\/\/doi.org\/10.1007\/978-3-642-10665-1_71"}],"container-title":["Applied Intelligence"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10489-019-01592-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s10489-019-01592-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10489-019-01592-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,10,9]],"date-time":"2022-10-09T11:36:06Z","timestamp":1665315366000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s10489-019-01592-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,12,27]]},"references-count":38,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,4]]}},"alternative-id":["1592"],"URL":"https:\/\/doi.org\/10.1007\/s10489-019-01592-4","relation":{},"ISSN":["0924-669X","1573-7497"],"issn-type":[{"type":"print","value":"0924-669X"},{"type":"electronic","value":"1573-7497"}],"subject":[],"published":{"date-parts":[[2019,12,27]]},"assertion":[{"value":"27 December 2019","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Compliance with Ethical Standards"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"<!--Emphasis Type='Bold' removed-->Conflict of interests"}}]}}