{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,6]],"date-time":"2025-11-06T13:15:35Z","timestamp":1762434935425,"version":"build-2065373602"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"9","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["12131001"],"award-info":[{"award-number":["12131001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Fundamental Research Funds for the Central Universities, LPMC, and KLMDASR"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2025,11,30]]},"abstract":"<jats:p>The presence of covariate shift between training and test datasets, coupled with model misspecification, can lead to instability in regression predictions across diverse datasets. Meanwhile, training complex models with massive data imposes significant computational burden. In this article, we present a novel model-free subsampling algorithm for stable prediction, which employs uniform design and confounder balancing methods. Our subsampling algorithm aims to find the nearest neighbor subsampling points of uniform design with the goal of minimizing global stability loss, thereby reducing the data volume while achieving stable predictions. Theoretic analyses show that the uniform measure minimizes the maximum integrated mean square error (MIMSE) and the global stability loss evaluates the independence among variables in each candidate MIMSE-optimal subsampled sets. Simulation studies conducted on synthetic datasets, as well as applications on real datasets, demonstrate the superiority of our proposed method under model misspecification and covariate shift.<\/jats:p>","DOI":"10.1145\/3769077","type":"journal-article","created":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T13:21:23Z","timestamp":1758633683000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Stable Subsampling under Model Misspecification and Covariate Shift"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-0627-685X","authenticated-orcid":false,"given":"Jinjing","family":"Yang","sequence":"first","affiliation":[{"name":"NITFID, School of Statistics and Data Science, Nankai University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0793-9391","authenticated-orcid":false,"given":"Shaohua","family":"Xu","sequence":"additional","affiliation":[{"name":"NITFID, School of Statistics and Data Science, Nankai University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5683-7502","authenticated-orcid":false,"given":"Zebin","family":"Yang","sequence":"additional","affiliation":[{"name":"Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9729-9018","authenticated-orcid":false,"given":"Aijun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3805-7021","authenticated-orcid":false,"given":"Yongdao","family":"Zhou","sequence":"additional","affiliation":[{"name":"NITFID, School of Statistics and Data Science, Nankai University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,11,6]]},"reference":[{"key":"e_1_3_4_2_2","unstructured":"Martin Arjovsky L\u00e9on Bottou Ishaan Gulrajani and David Lopez-Paz. 2019. Invariant risk minimization. arXiv:1907.02893. Retrieved from https:\/\/arxiv.org\/abs\/1907.02893."},{"key":"e_1_3_4_3_2","first-page":"253","volume-title":"Proceedings of the 34th International Conference on Machine LearningVol. 70","author":"Avron Haim","year":"2017","unstructured":"Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Veling Ker, and Amir Zandieh. 2017. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning. Doina Precup and Yee Whye Teh (Eds.), Vol. 70, PMLR, 253\u2013262."},{"key":"e_1_3_4_4_2","doi-asserted-by":"publisher","DOI":"10.1214\/21-AOS2073"},{"key":"e_1_3_4_5_2","first-page":"2137","article-title":"Discriminative learning under covariate shift","volume":"10","author":"Bickel Steffen","year":"2009","unstructured":"Steffen Bickel and Tobias Scheffer. 2009. Discriminative learning under covariate shift. Journal of Machine Learning Research 10 (2009), 2137\u20132155.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_4_6_2","volume-title":"Classification and Regression Trees","author":"Breiman Leo","year":"1984","unstructured":"Leo Breiman, Jerome Friedman, R. A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. CRC Press."},{"issue":"3","key":"e_1_3_4_7_2","first-page":"404","article-title":"Invariance, causality and robustness","volume":"35","author":"B\u00fchlmann Peter","year":"2020","unstructured":"Peter B\u00fchlmann. 2020. Invariance, causality and robustness. Statistical Science 35, 3 (2020), 404\u2013426.","journal-title":"Statistical Science"},{"key":"e_1_3_4_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/3546258.3546519"},{"issue":"3","key":"e_1_3_4_9_2","first-page":"1047","article-title":"Optimal designs for free knot least squares splines","volume":"18","author":"Dette Holger","year":"2008","unstructured":"Holger Dette, Viatcheslav Melas, and Andrey Pepelyshev. 2008. Optimal designs for free knot least squares splines. Statistica Sinica 18, 3 (2008), 1047\u20131062.","journal-title":"Statistica Sinica"},{"key":"e_1_3_4_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00211-010-0331-6"},{"key":"e_1_3_4_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-13-2041-5"},{"key":"e_1_3_4_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4899-3095-8"},{"key":"e_1_3_4_13_2","doi-asserted-by":"publisher","DOI":"10.1214\/aos\/1176347963"},{"key":"e_1_3_4_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.293"},{"issue":"1","key":"e_1_3_4_15_2","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1214\/aoms\/1177706730","article-title":"The spacing of observations in polynomial regression","volume":"29","author":"Guest P. G.","year":"1958","unstructured":"P. G. Guest. 1958. The spacing of observations in polynomial regression. The Annals of Mathematical Statistics 29, 1 (1958), 294\u2013299.","journal-title":"The Annals of Mathematical Statistics"},{"key":"e_1_3_4_16_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107383"},{"key":"e_1_3_4_17_2","doi-asserted-by":"publisher","DOI":"10.1111\/rssb.12027"},{"key":"e_1_3_4_18_2","volume-title":"Advances in Neural Information Processing Systems","author":"Jacot Arthur","year":"2018","unstructured":"Arthur Jacot, Franck Gabriel, and Clement Hongler. 2018. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, Curran Associates, Inc."},{"issue":"2","key":"e_1_3_4_19_2","doi-asserted-by":"crossref","first-page":"166","DOI":"10.1080\/00401706.2021.1921037","article-title":"Split: An optimal method for data splitting","volume":"64","author":"Joseph Roshan V.","year":"2022","unstructured":"Roshan V. Joseph and Akhil Vakayil. 2022. Split: An optimal method for data splitting. Technometrics 64, 2 (2022), 166\u2013176.","journal-title":"Technometrics"},{"key":"e_1_3_4_20_2","doi-asserted-by":"publisher","DOI":"10.1214\/16-SS116"},{"issue":"3","key":"e_1_3_4_21_2","doi-asserted-by":"crossref","first-page":"324","DOI":"10.1093\/pan\/mpw012","article-title":"Why experimenters might not always want to randomize, and what they could do instead","volume":"24","author":"Kasy Maximilian","year":"2016","unstructured":"Maximilian Kasy. 2016. Why experimenters might not always want to randomize, and what they could do instead. Political Analysis 24, 3 (2016), 324\u2013338.","journal-title":"Political Analysis"},{"key":"e_1_3_4_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3220082"},{"key":"e_1_3_4_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477052"},{"key":"e_1_3_4_24_2","unstructured":"Shubham Kulkarni. 2020. US Top 10 Cities - Electricity and Weather Data. Retrieved from https:\/\/www.kaggle.com\/datasets\/shubhamkulkarni01\/us-top-10-cities-electricity-and-weather-data\/data"},{"key":"e_1_3_4_25_2","volume-title":"Proceedings of 1st International Workshop on Learning over Multiple Contexts (LMCE) at ECML-PKDD","volume":"5","author":"Kull Meelis","year":"2014","unstructured":"Meelis Kull and Peter Flach. 2014. Patterns of dataset shift. In Proceedings of 1st International Workshop on Learning over Multiple Contexts (LMCE) at ECML-PKDD, Vol. 5."},{"key":"e_1_3_4_26_2","first-page":"292","volume-title":"Proceedings of 18th SIGBioMed Workshop on Biomedical Natural Language Processing (BioNLP \u201919)","author":"Kyriakakis Manolis","year":"2019","unstructured":"Manolis Kyriakakis, Ion Androutsopoulos, Joan Gin\u00e9s I. Ametll\u00e9, and Artur Saudabayev. 2019. Transfer learning for causal sentence detection. In Proceedings of 18th SIGBioMed Workshop on Biomedical Natural Language Processing (BioNLP \u201919), 292\u2013297."},{"key":"e_1_3_4_27_2","first-page":"37","volume-title":"Proceedings of the 27th International Conference on Neural Information Processing Systems","volume":"1","author":"Liu Anqi","year":"2014","unstructured":"Anqi Liu and Brian D. Ziebart. 2014. Robust classification under sample selection bias. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 1, 37\u201345."},{"issue":"3","key":"e_1_3_4_28_2","doi-asserted-by":"crossref","first-page":"694","DOI":"10.1080\/10618600.2020.1844215","article-title":"LowCon: A design-based subsampling approach in a misspecified linear model","volume":"30","author":"Meng Cheng","year":"2021","unstructured":"Cheng Meng, Rui Xie, Abhyuday Mandal, Xinlian Zhang, Wenxuan Zhong, and Ping Ma. 2021. LowCon: A design-based subsampling approach in a misspecified linear model. Journal of Computational and Graphical Statistics 30, 3 (2021), 694\u2013708.","journal-title":"Journal of Computational and Graphical Statistics"},{"key":"e_1_3_4_29_2","first-page":"10","volume-title":"Proceedings of International Conference on Machine Learning","author":"Muandet Krikamol","year":"2013","unstructured":"Krikamol Muandet, David Balduzzi, and Bernhard Sch\u00f6lkopf. 2013. Domain generalization via invariant feature representation. In Proceedings of International Conference on Machine Learning, 10\u201318."},{"key":"e_1_3_4_30_2","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.2018.1491403"},{"key":"e_1_3_4_31_2","volume-title":"Dataset Shift in Machine Learning","author":"Qui\u00f1onero-Candela Joaquin","year":"2022","unstructured":"Joaquin Qui\u00f1onero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. 2022. Dataset Shift in Machine Learning. MIT Press."},{"key":"e_1_3_4_32_2","doi-asserted-by":"publisher","DOI":"10.5555\/3291125.3291161"},{"key":"e_1_3_4_33_2","unstructured":"Shiori Sagawa Pang Wei Koh Tatsunori B. Hashimoto and Percy Liang. 2019. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv:1911.08731. Retrieved from https:\/\/arxiv.org\/abs\/1911.08731"},{"key":"e_1_3_4_34_2","first-page":"3076","volume-title":"Proceedings of International Conference on Machine Learning","author":"Shalit Uri","year":"2017","unstructured":"Uri Shalit, Fredrik D. Johansson, and David Sontag. 2017. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of International Conference on Machine Learning. PMLR, 3076\u20133085."},{"issue":"2","key":"e_1_3_4_35_2","first-page":"851","article-title":"On Azadkia\u2013Chatterjee\u2019s conditional dependence coefficient","volume":"30","author":"Shi Hongjian","year":"2024","unstructured":"Hongjian Shi, Mathias Drton, and Fang Han. 2024. On Azadkia\u2013Chatterjee\u2019s conditional dependence coefficient. Bernoulli 30, 2, (2024), 851\u2013877.","journal-title":"Bernoulli"},{"key":"e_1_3_4_36_2","doi-asserted-by":"publisher","DOI":"10.2307\/3029337"},{"key":"e_1_3_4_37_2","doi-asserted-by":"publisher","DOI":"10.1002\/9781118162934"},{"issue":"525","key":"e_1_3_4_38_2","doi-asserted-by":"crossref","first-page":"393","DOI":"10.1080\/01621459.2017.1408468","article-title":"Information-based optimal subdata selection for big data linear regression","volume":"114","author":"Wang HaiYing","year":"2019","unstructured":"HaiYing Wang, Min Yang, and John Stufken. 2019. Information-based optimal subdata selection for big data linear regression. Journal of American Statistical Association 114, 525 (2019), 393\u2013405.","journal-title":"Journal of American Statistical Association"},{"key":"e_1_3_4_39_2","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.2017.1292914"},{"key":"e_1_3_4_40_2","first-page":"719","volume-title":"Handbook of Design and Analysis of Experiments","author":"Wiens Douglas P.","year":"2015","unstructured":"Douglas P. Wiens. 2015. Robustness of design. In Handbook of Design and Analysis of Experiments. Angela M. Dean, Max Morris, John Stufken, and Derek Bingham (Eds.), CRC Press Taylor & Francis Group, 719\u2013753."},{"issue":"1","key":"e_1_3_4_41_2","doi-asserted-by":"crossref","first-page":"101","DOI":"10.1016\/S0378-3758(99)00089-0","article-title":"Admissibility and minimaxity of the uniform design measure in nonparametric regression model","volume":"83","author":"Xie Min-Yu","year":"2000","unstructured":"Min-Yu Xie and Kai-Tai Fang. 2000. Admissibility and minimaxity of the uniform design measure in nonparametric regression model. Journal of Statistical Planning and Inference 83, 1 (2000), 101\u2013111.","journal-title":"Journal of Statistical Planning and Inference"},{"issue":"3","key":"e_1_3_4_42_2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1016\/j.jco.2012.11.006","article-title":"Mixture discrepancy for quasi-random point sets","volume":"29","author":"Zhou Yong-Dao","year":"2013","unstructured":"Yong-Dao Zhou, Kai-Tai Fang, and Jian-Hui Ning. 2013. Mixture discrepancy for quasi-random point sets. Journal of Complexity 29, 3\u20134 (2013), 283\u2013301.","journal-title":"Journal of Complexity"},{"issue":"2","key":"e_1_3_4_43_2","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1080\/00401706.2023.2271091","article-title":"Efficient model-free subsampling method for massive data","volume":"66","author":"Zhou Zheng","year":"2024","unstructured":"Zheng Zhou, Zebin Yang, Aijun Zhang, and Yongdao Zhou. 2024. Efficient model-free subsampling method for massive data. Technometrics 66, 2 (2024), 240\u2013252.","journal-title":"Technometrics"},{"key":"e_1_3_4_44_2","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.2015.1023805"}],"container-title":["ACM Transactions on Knowledge Discovery from Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769077","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,6]],"date-time":"2025-11-06T13:11:30Z","timestamp":1762434690000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769077"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,6]]},"references-count":43,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,11,30]]}},"alternative-id":["10.1145\/3769077"],"URL":"https:\/\/doi.org\/10.1145\/3769077","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"type":"print","value":"1556-4681"},{"type":"electronic","value":"1556-472X"}],"subject":[],"published":{"date-parts":[[2025,11,6]]},"assertion":[{"value":"2024-09-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}