{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T09:08:55Z","timestamp":1768813735439,"version":"3.49.0"},"reference-count":26,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T00:00:00Z","timestamp":1756944000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Bioinform."],"abstract":"<jats:p>How can we be sure that there is sufficient data for our model, such that the predictions remain reliable on unseen data and the conclusions drawn from the fitted model would not vary significantly when using a different sample of the same size? We answer these and related questions through a systematic approach that examines the data size and the corresponding gains in accuracy. Assuming the sample data are drawn from a data pool with no data drift, the law of large numbers ensures that a model converges to its ground truth accuracy. Our approach provides a heuristic method for investigating the speed of convergence with respect to the size of the data sample. This relationship is estimated using sampling methods, which introduces a variation in the convergence speed results across different runs. To stabilize results\u2014so that conclusions do not depend on the run\u2014and extract the most reliable information encoded in the available data regarding convergence speed, the presented method automatically determines a sufficient number of repetitions to reduce sampling deviations below a predefined threshold, thereby ensuring the reliability of conclusions about the required amount of data.<\/jats:p>","DOI":"10.3389\/fbinf.2025.1528515","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T07:48:11Z","timestamp":1756972091000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Adaptive sampling methods facilitate the determination of reliable dataset sizes for evidence-based modeling"],"prefix":"10.3389","volume":"5","author":[{"given":"Tim","family":"Breitenbach","sequence":"first","affiliation":[]},{"given":"Thomas","family":"Dandekar","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"B1","volume-title":"Convergence of probability measures","author":"Billingsley","year":"2013"},{"key":"B2","doi-asserted-by":"publisher","first-page":"111222","DOI":"10.1016\/j.jtbi.2022.111222","article-title":"A modular systems biological modelling framework studies cyclic nucleotide signaling in platelets","volume":"550","author":"Breitenbach","year":"2022","journal-title":"J. Theor. Biol."},{"key":"B3","doi-asserted-by":"publisher","first-page":"16165","DOI":"10.1038\/s41598-021-95391-y","article-title":"An effective model of endogenous clocks and external stimuli determining circadian rhythms","volume":"11","author":"Breitenbach","year":"2021","journal-title":"Sci. Rep."},{"key":"B4","doi-asserted-by":"publisher","first-page":"e1007075","DOI":"10.1371\/journal.pcbi.1007075","article-title":"Analyzing pharmacological intervention points: a method to calculate external stimuli to switch between steady states in regulatory networks","volume":"15","author":"Breitenbach","year":"2019","journal-title":"PLoS Comput. Biol."},{"key":"B5","doi-asserted-by":"publisher","first-page":"20170387","DOI":"10.1098\/rsif.2017.0387","article-title":"Opportunities and obstacles for deep learning in biology and medicine","volume":"15","author":"Ching","year":"2018","journal-title":"J. R. Soc. interface"},{"key":"B6","article-title":"How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?","author":"Cho","year":"2015","journal-title":"arXiv preprint"},{"key":"B7","article-title":"Learning curves: asymptotic values and rate of convergence","volume":"6","author":"Cortes","year":"1993","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B8","doi-asserted-by":"publisher","first-page":"1755","DOI":"10.1016\/j.csbj.2024.04.010","article-title":"DataXflow: synergizing data-driven modeling with best parameter fit and optimal control\u2013An efficient data analysis for cancer research","volume":"23","author":"Crouch","year":"2024","journal-title":"Comput. Struct. Biotechnol. J."},{"key":"B9","volume-title":"Real analysis and probability","author":"Dudley","year":"2018"},{"key":"B10","doi-asserted-by":"crossref","DOI":"10.1017\/9781108591034","volume-title":"Probability: theory and examples","author":"Durrett","year":"2019"},{"key":"B11","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1186\/1472-6947-12-8","article-title":"Predicting sample size required for classification performance","volume":"12","author":"Figueroa","year":"2012","journal-title":"BMC Med. Inf. Decis. Mak."},{"key":"B12","article-title":"MINDE: mutual information neural diffusion estimation","volume-title":"The twelfth international conference on learning representations","author":"Franzese","year":"2024"},{"key":"B13","doi-asserted-by":"crossref","DOI":"10.2172\/1476219","volume-title":"A practical approach to sizing neural networks","author":"Friedland","year":"2018"},{"key":"B14","article-title":"Training compute-optimal large language models","author":"Hoffmann","year":"2022","journal-title":"arXiv preprint"},{"key":"B15","doi-asserted-by":"publisher","first-page":"1293","DOI":"10.1038\/s12276-024-01243-w","article-title":"Big data and deep learning for RNA biology","volume":"56","author":"Hwang","year":"2024","journal-title":"Exp. and Mol. Med."},{"key":"B16","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1186\/1742-4682-3-13","article-title":"A method for the generation of standardized qualitative dynamical systems of regulatory networks","volume":"3","author":"Mendoza","year":"2006","journal-title":"Theor. Biol. Med. Model."},{"key":"B17","first-page":"77","article-title":"Sample size and modeling accuracy of decision tree based data mining tools","volume":"6","author":"Morgan","year":"2003","journal-title":"Acad. Inf. Manag. Sci. J."},{"key":"B18","doi-asserted-by":"publisher","first-page":"119","DOI":"10.1089\/106652703321825928","article-title":"Estimating dataset size requirements for classifying DNA microarray data","volume":"10","author":"Mukherjee","year":"2003","journal-title":"J. Comput. Biol."},{"key":"B19","doi-asserted-by":"publisher","first-page":"e74335","DOI":"10.1371\/journal.pone.0074335","article-title":"Lessons learned from quantitative dynamical modeling in systems biology","volume":"8","author":"Raue","year":"2013","journal-title":"PloS one"},{"key":"B20","doi-asserted-by":"publisher","first-page":"3558","DOI":"10.1093\/bioinformatics\/btv405","article-title":"Data2Dynamics: a modeling environment tailored to parameter estimation in dynamical systems","volume":"31","author":"Raue","year":"2015","journal-title":"Bioinformatics"},{"key":"B21","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1016\/j.ddtec.2020.07.001","article-title":"The good, the bad, and the ugly in chemical and biological data for machine learning","volume":"32","author":"Rodrigues","year":"2019","journal-title":"Drug Discov. Today Technol."},{"key":"B22","doi-asserted-by":"publisher","first-page":"1728","DOI":"10.1038\/s41467-022-29268-7","article-title":"Current progress and open challenges for applying deep learning across the biosciences","volume":"13","author":"Sapoval","year":"2022","journal-title":"Nat. Commun."},{"key":"B23","doi-asserted-by":"publisher","first-page":"525","DOI":"10.1016\/j.cels.2023.05.007","article-title":"BioAutoMATED: an end-to-end automated machine learning tool for explanation and design of biological sequences","volume":"14","author":"Valeri","year":"2023","journal-title":"Cell. Syst."},{"key":"B24","volume-title":"Asymptotic statistics","author":"Van der Vaart","year":"2000"},{"key":"B25","doi-asserted-by":"publisher","first-page":"e0299811","DOI":"10.1371\/journal.pone.0299811","article-title":"Analysis of learning curves in predictive modeling using exponential curve fitting with an asymptotic approach","volume":"19","author":"Vianna","year":"2024","journal-title":"Plos one"},{"key":"B26","article-title":"Towards measuring predictability: to which extent data-driven approaches can extract deterministic relations from data exemplified with time series prediction and classification","author":"Zadeh","year":"2025","journal-title":"Trans. Mach. Learn. Res"}],"container-title":["Frontiers in Bioinformatics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1528515\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T07:48:13Z","timestamp":1756972093000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1528515\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,4]]},"references-count":26,"alternative-id":["10.3389\/fbinf.2025.1528515"],"URL":"https:\/\/doi.org\/10.3389\/fbinf.2025.1528515","relation":{},"ISSN":["2673-7647"],"issn-type":[{"value":"2673-7647","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,4]]},"article-number":"1528515"}}