{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T14:21:35Z","timestamp":1774621295988,"version":"3.50.1"},"reference-count":41,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2024,9,24]],"date-time":"2024-09-24T00:00:00Z","timestamp":1727136000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,9,24]],"date-time":"2024-09-24T00:00:00Z","timestamp":1727136000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2024,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The 2023 Soccer Prediction Challenge invited the machine learning community to develop innovative methods to predict the outcomes of 736 future soccer matches. The Challenge included two tasks. Task\u00a01 was to forecast the exact match <jats:italic>score<\/jats:italic>, i.e., the number of goals scored by each team. Task\u00a02 was to predict the match outcome as probability vector over the three possible <jats:italic>result<\/jats:italic> categories: victory of the home team, draw, and victory of the away team. Here, we present a new data- and knowledge-driven framework for building machine learning models from readily available data to predict soccer match outcomes. A key component of this framework is an innovative approach to modeling interdependent time series data of competing entities. Using this framework, we developed various predictive models based on <jats:italic>k<\/jats:italic>-nearest neighbors, artificial neural networks, naive Bayes, and ordinal forests, which we applied to the two tasks of the 2023 Soccer Prediction Challenge. 
Among all submissions to the Challenge, our machine learning models based on <jats:italic>k<\/jats:italic>-nearest neighbors and neural networks achieved top performances. Our main insights from the Challenge are that relatively simple learning algorithms perform remarkably well compared to more complex algorithms, and that the key to successful predictions lies in how well soccer domain knowledge can be incorporated in the modeling process.<\/jats:p>","DOI":"10.1007\/s10994-024-06625-9","type":"journal-article","created":{"date-parts":[[2024,9,24]],"date-time":"2024-09-24T15:03:34Z","timestamp":1727190214000},"page":"8165-8204","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["A data- and knowledge-driven framework for developing machine learning models to predict soccer match outcomes"],"prefix":"10.1007","volume":"113","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7038-2601","authenticated-orcid":false,"given":"Daniel","family":"Berrar","sequence":"first","affiliation":[]},{"given":"Philippe","family":"Lopes","sequence":"additional","affiliation":[]},{"given":"Werner","family":"Dubitzky","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,9,24]]},"reference":[{"issue":"7","key":"6625_CR1","doi-asserted-by":"publisher","first-page":"795","DOI":"10.1002\/for.2471","volume":"36","author":"G Angelini","year":"2017","unstructured":"Angelini, G., & De Angelis, L. (2017). PARX model for football match predictions. Journal of Forecasting, 36(7), 795\u2013807.","journal-title":"Journal of Forecasting"},{"key":"6625_CR2","first-page":"403","volume-title":"Encyclopedia of bioinformatics and computational biology","author":"D Berrar","year":"2018","unstructured":"Berrar, D. (2018). Bayes\u2019 theorem and naive Bayes classifier. In S. Ranganathan, K. Nakai, C. Sch\u00f6nbach, & M. 
Gribskov (Eds.), Encyclopedia of bioinformatics and computational biology (pp. 403\u2013412). Elsevier."},{"issue":"73","key":"6625_CR3","first-page":"1","volume":"7","author":"D Berrar","year":"2006","unstructured":"Berrar, D., Bradbury, I., & Dubitzky, W. (2006). Instance-based concept learning from multiclass DNA microarray data. BMC Bioinformatics, 7(73), 1\u201312.","journal-title":"BMC Bioinformatics"},{"key":"6625_CR4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s10994-018-5763-8","volume":"108","author":"D Berrar","year":"2019","unstructured":"Berrar, D., Lopes, P., Davis, J., & Dubitzky, W. (2019). Guest editorial: Special issue on machine learning for soccer. Machine Learning, 108, 1\u20137.","journal-title":"Machine Learning"},{"issue":"1","key":"6625_CR5","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1007\/s10994-018-5747-8","volume":"108","author":"D Berrar","year":"2019","unstructured":"Berrar, D., Lopes, P., & Dubitzky, W. (2019). Incorporating domain knowledge in machine learning for soccer outcome prediction. Machine Learning, 108(1), 97\u2013126.","journal-title":"Machine Learning"},{"key":"6625_CR6","unstructured":"Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., & Li, S. (2024). FNN: Fast nearest neighbor search algorithms and applications. R package version 1.1.4. https:\/\/CRAN.R-project.org\/package=FNN"},{"issue":"2","key":"6625_CR7","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1007\/BF00058655","volume":"24","author":"L Breiman","year":"1996","unstructured":"Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123\u2013140.","journal-title":"Machine Learning"},{"issue":"3","key":"6625_CR8","doi-asserted-by":"publisher","first-page":"199","DOI":"10.1214\/ss\/1009213726","volume":"16","author":"L Breiman","year":"2001","unstructured":"Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). 
Statistical Science, 16(3), 199\u2013231.","journal-title":"Statistical Science"},{"key":"6625_CR9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1515\/1559-0410.1418","volume":"8","author":"A Constantinou","year":"2012","unstructured":"Constantinou, A., & Fenton, N. (2012). Solving the problem of inadequate scoring rules for assessing probabilistic football forecast models. Journal of Quantitative Analysis in Sports, 8, 1\u201314.","journal-title":"Journal of Quantitative Analysis in Sports"},{"issue":"1","key":"6625_CR10","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1109\/TIT.1967.1053964","volume":"13","author":"T Cover","year":"1967","unstructured":"Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21\u201327.","journal-title":"IEEE Transactions on Information Theory"},{"issue":"2","key":"6625_CR11","first-page":"265","volume":"46","author":"M Dixon","year":"1997","unstructured":"Dixon, M., & Coles, S. (1997). Modelling association football scores and inefficiencies in the football betting market. Applied Statistics, 46(2), 265\u2013280.","journal-title":"Applied Statistics"},{"issue":"1","key":"6625_CR12","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1007\/s10994-018-5726-0","volume":"108","author":"W Dubitzky","year":"2019","unstructured":"Dubitzky, W., Lopes, P., Davis, J., & Berrar, D. (2019). The Open International Soccer Database for machine learning. Machine Learning, 108(1), 9\u201328.","journal-title":"Machine Learning"},{"key":"6625_CR13","volume-title":"Pattern classification","author":"R Duda","year":"2001","unstructured":"Duda, R., Hart, P., & Stork, D. (2001). Pattern classification (2nd ed.). USA: John Wiley & Sons.","edition":"2"},{"key":"6625_CR14","first-page":"132","volume-title":"A practical approach to microarray data analysis","author":"S Dudoit","year":"2002","unstructured":"Dudoit, S., & Fridlyand, J. (2002). 
Introduction to classification in microarray experiments. In D. Berrar, M. Granzow, & W. Dubitzky (Eds.), A practical approach to microarray data analysis (pp. 132\u2013149). Springer."},{"key":"6625_CR15","doi-asserted-by":"publisher","first-page":"301","DOI":"10.1007\/978-3-540-87479-9_38","volume-title":"Machine learning and knowledge discovery in databases","author":"W Duivesteijn","year":"2008","unstructured":"Duivesteijn, W., & Feelders, A. (2008). Nearest neighbour classification with monotonicity constraints. In W. Daelemans, B. Goethals, & K. Morik (Eds.), Machine learning and knowledge discovery in databases (pp. 301\u2013316). Berlin Heidelberg: Springer."},{"issue":"6","key":"6625_CR16","doi-asserted-by":"publisher","first-page":"985","DOI":"10.1175\/1520-0450(1969)008<0985:ASSFPF>2.0.CO;2","volume":"8","author":"ES Epstein","year":"1969","unstructured":"Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8(6), 985\u2013987.","journal-title":"Journal of Applied Meteorology"},{"issue":"113","key":"6625_CR17","first-page":"556","volume":"150","author":"A Gosiewska","year":"2021","unstructured":"Gosiewska, A., Kozak, A., & Biecek, P. (2021). Simpler is better: Lifting interpretability-performance trade-off via automated feature engineering. Decision Support Systems, 150(113), 556.","journal-title":"Decision Support Systems"},{"key":"6625_CR18","unstructured":"Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? In Proceedings of the 36th conference on neural information processing systems (NeurIPS 2022) Track on datasets and benchmarks (pp. 1\u201348)."},{"issue":"1","key":"6625_CR19","first-page":"1","volume":"21","author":"D Hand","year":"2006","unstructured":"Hand, D. (2006). Classifier technology and the illusion of progress. 
Statistical Science, 21(1), 1\u201315.","journal-title":"Statistical Science"},{"issue":"2","key":"6625_CR20","doi-asserted-by":"publisher","first-page":"203","DOI":"10.2307\/2347001","volume":"23","author":"I Hill","year":"1974","unstructured":"Hill, I. (1974). Association football and statistical inference. Applied Statistics, 23(2), 203\u2013208.","journal-title":"Applied Statistics"},{"key":"6625_CR21","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1007\/s00357-018-9302-x","volume":"37","author":"R Hornung","year":"2020","unstructured":"Hornung, R. (2020). Ordinal forests. Journal of Classification, 37, 4\u201317.","journal-title":"Journal of Classification"},{"issue":"1","key":"6625_CR22","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1007\/s10994-018-5704-6","volume":"108","author":"O Hub\u00e1\u010dek","year":"2019","unstructured":"Hub\u00e1\u010dek, O., \u0160ourek, G., & \u017delezn\u00fd, F. (2019). Learning to predict soccer results from relational data with gradient boosted trees. Machine Learning, 108(1), 29\u201347.","journal-title":"Machine Learning"},{"issue":"106","key":"6625_CR23","first-page":"997","volume":"222","author":"R Ievoli","year":"2021","unstructured":"Ievoli, R., Palazzo, L., & Ragozini, G. (2021). On the use of passing network indicators to predict football outcomes. Knowledge-Based Systems, 222(106), 997.","journal-title":"Knowledge-Based Systems"},{"issue":"1","key":"6625_CR24","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1089\/big.2018.0076","volume":"7","author":"G Jurman","year":"2020","unstructured":"Jurman, G. (2020). Seasonal linear predictivity in national football championships. Big Data, 7(1), 21\u201334.","journal-title":"Big Data"},{"key":"6625_CR25","doi-asserted-by":"crossref","unstructured":"Kundu, T., Roy, A., & Rai, C. (2021). Predicting English premier league matches using classification and regression. 
In Proceedings of international conference on communication and computational technologies (pp. 555\u2013568). Springer.","DOI":"10.1007\/978-981-15-5077-5_50"},{"issue":"3","key":"6625_CR26","doi-asserted-by":"publisher","first-page":"109","DOI":"10.1111\/j.1467-9574.1982.tb00782.x","volume":"36","author":"M Maher","year":"1982","unstructured":"Maher, M. (1982). Modelling association football scores. Statistica Neerlandica, 36(3), 109\u2013118.","journal-title":"Statistica Neerlandica"},{"issue":"133","key":"6625_CR27","first-page":"1","volume":"11","author":"MC Malamatinos","year":"2022","unstructured":"Malamatinos, M. C., Vrochidou, E., & Papakostas, G. (2022). On predicting soccer outcomes in the Greek league using machine learning. Computers, 11(133), 1\u201324.","journal-title":"Computers"},{"issue":"4","key":"6625_CR28","doi-asserted-by":"publisher","first-page":"221","DOI":"10.2165\/00007256-199928040-00001","volume":"28","author":"A Nevill","year":"1999","unstructured":"Nevill, A., & Holder, R. (1999). Home advantage in sport: An overview of studies on the advantage of playing at home. Sports Medicine, 28(4), 221\u2013236.","journal-title":"Sports Medicine"},{"issue":"6","key":"6625_CR29","first-page":"513","volume":"22","author":"P O\u2019Donoghue","year":"2004","unstructured":"O\u2019Donoghue, P., Dubitzky, W., Lopes, P., Berrar, D., Lagan, K., Hassan, D., Bairner, A., & Darby, P. (2004). An evaluation of quantitative and qualitative methods of predicting the 2002 FIFA World Cup. Journal of Sports Sciences, 22(6), 513\u2013514.","journal-title":"Journal of Sports Sciences"},{"key":"6625_CR30","doi-asserted-by":"crossref","unstructured":"Razali, N., Mustapha, A., Arbaiy, N., & Lin, P.C. (2022). Deep learning for football outcomes prediction based on football rating system. In 10th international conference on applied science and technology, pp. 
1\u20137","DOI":"10.1063\/5.0104587"},{"issue":"4","key":"6625_CR31","doi-asserted-by":"publisher","first-page":"581","DOI":"10.2307\/2343726","volume":"131","author":"C Reep","year":"1968","unstructured":"Reep, C., & Benjamin, B. (1968). Skill and chance in association football. Journal of the Royal Statistical Society: Series A (General), 131(4), 581\u2013585.","journal-title":"Journal of the Royal Statistical Society: Series A (General)"},{"key":"6625_CR32","unstructured":"Ren, Y., & Susnjak, T. (2022). Predicting football match outcomes with eXplainable machine learning and the Kelly index. Preprint retrieved from https:\/\/arxiv.org\/abs\/2211.15734, 2211.15734"},{"key":"6625_CR33","doi-asserted-by":"publisher","first-page":"206","DOI":"10.1038\/s42256-019-0048-x","volume":"1","author":"C Rudin","year":"2019","unstructured":"Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206\u2013215.","journal-title":"Nature Machine Intelligence"},{"key":"6625_CR34","doi-asserted-by":"publisher","first-page":"533","DOI":"10.1038\/323533a0","volume":"323","author":"D Rumelhart","year":"1986","unstructured":"Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533\u2013536.","journal-title":"Nature"},{"issue":"1","key":"6625_CR35","doi-asserted-by":"publisher","first-page":"46","DOI":"10.3390\/app10010046","volume":"10","author":"J St\u00fcbinger","year":"2020","unstructured":"St\u00fcbinger, J., Mangold, B., & Knoll, J. (2020). Machine learning in football betting: Prediction of match results based on player characteristics. Applied Sciences, 10(1), 46.","journal-title":"Applied Sciences"},{"issue":"3","key":"6625_CR36","first-page":"823","volume":"76","author":"N Thei\u00dfen","year":"2020","unstructured":"Thei\u00dfen, N., Schmid, M., & Boulesteix, A. (2020). 
Ordinal forests: Prediction and variable ranking with ordinal target variables. Biometrics, 76(3), 823\u2013833.","journal-title":"Biometrics"},{"issue":"3","key":"6625_CR37","doi-asserted-by":"publisher","first-page":"682","DOI":"10.3390\/math11030682","volume":"11","author":"Y Tian","year":"2023","unstructured":"Tian, Y., Zhang, Y., & Zhang, H. (2023). Recent advances in stochastic gradient descent in deep learning. Mathematics, 11(3), 682.","journal-title":"Mathematics"},{"key":"6625_CR38","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1007\/s10994-005-4258-6","volume":"58","author":"G Webb","year":"2005","unstructured":"Webb, G., Boughton, J., & Wang, Z. (2005). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58, 5\u201324.","journal-title":"Machine Learning"},{"key":"6625_CR39","unstructured":"Wortsman, M., Ilharco, G., Gadre, S., Roelofs, R., Gontijo-Lopes, R., Morcos, A., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., & Schmidt, L. (2022). Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, S. Sabato (Eds.) Proceedings of the 39th international conference on machine learning, proceedings of machine learning research, vol 162 (pp. 23965\u201323998)"},{"issue":"1","key":"6625_CR40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s10115-007-0114-2","volume":"14","author":"X Wu","year":"2008","unstructured":"Wu, X., Kumar, V., Quinlan, J., Ghosh, J., Yang, Q., & Hea, M. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1\u201337.","journal-title":"Knowledge and Information Systems"},{"issue":"3","key":"6625_CR41","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0248590","volume":"16","author":"F Wunderlich","year":"2021","unstructured":"Wunderlich, F., Weigelt, M., Rein, R., & Memmert, D. (2021). 
How does spectator presence affect football? Home advantage remains in European top-class football matches played without spectators during the COVID-19 pandemic. PLoS ONE, 16(3), e0248590.","journal-title":"PLoS ONE"}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06625-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-024-06625-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06625-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,17]],"date-time":"2024-10-17T21:13:11Z","timestamp":1729199591000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-024-06625-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,24]]},"references-count":41,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,10]]}},"alternative-id":["6625"],"URL":"https:\/\/doi.org\/10.1007\/s10994-024-06625-9","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,24]]},"assertion":[{"value":"23 October 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 July 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 September 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 September 2024","order":4,"name":"first_online","label":"First 
Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no conflict of interest to declare.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"Not applicable.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}]}}