{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,28]],"date-time":"2025-08-28T00:06:13Z","timestamp":1756339573739,"version":"3.44.0"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"5","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:p>Due to a variety of reasons, such as privacy, data in the wild often misses the grouping information required for identifying minorities. On the other hand, it is known that machine learning models are only as good as the data they are trained on and, hence, may underperform for the under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists who find themselves in an unknown-unknown situation, where not only do they not have access to the grouping attributes but do not also know what groups to consider.<\/jats:p>\n          <jats:p>This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, where we find vectors in the attribute space that reveal potential groups that are under-represented and under-performing. Technically speaking, we propose a geometric transformation of data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing our solution to the higher dimensions is cursed by dimensionality. Therefore, we propose a solution based on smart exploration of the search space for such cases. We conduct comprehensive experiments using real-world and synthetic datasets alongside the theoretical analysis. 
Our experiment results demonstrate the effectiveness of our proposed solutions in mining the unknown, under-represented, and under-performing minorities.<\/jats:p>","DOI":"10.14778\/3718057.3718072","type":"journal-article","created":{"date-parts":[[2025,8,27]],"date-time":"2025-08-27T18:11:49Z","timestamp":1756318309000},"page":"1453-1480","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Mining the Minoria: Unknown, Under-Represented, and Under-Performing Minority Groups"],"prefix":"10.14778","volume":"18","author":[{"given":"Mohsen","family":"Dehghankar","sequence":"first","affiliation":[{"name":"University of Illinois Chicago"}]},{"given":"Abolfazl","family":"Asudeh","sequence":"additional","affiliation":[{"name":"University of Illinois Chicago"}]}],"member":"320","published-online":{"date-parts":[[2025,8,27]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Chicago crimes - 2001 to present. https:\/\/data.cityofchicago.org\/Public-Safety\/Crimes-2001-to-Present\/ijzp-q8t2. Accessed: 2024-05-20."},{"key":"e_1_2_1_2_1","unstructured":"City of chicago open data portal. https:\/\/data.cityofchicago.org."},{"key":"e_1_2_1_3_1","unstructured":"College admission data. https:\/\/www.kaggle.com\/datasets\/eswarchandt\/admission. Accessed: 2024-05-20."},{"key":"e_1_2_1_4_1","first-page":"75","volume-title":"Proceedings of the tenth annual symposium on Computational geometry","author":"Agarwal Pankaj K","year":"1994","unstructured":"Pankaj K Agarwal, Mark De Berg, Ji\u0159\u00ed Matou\u0161ek, and Otfried Schwarzkopf. Constructing levels in arrangements and higher order voronoi diagrams. 
In Proceedings of the tenth annual symposium on Computational geometry, pages 67\u201375, 1994."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/0097-3165(86)90122-6"},{"key":"e_1_2_1_6_1","first-page":"134","volume-title":"2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS)","author":"Anari Nima","unstructured":"Nima Anari, Yang P Liu, and Thuy-Duong Vuong. Optimal sublinear sampling of spanning trees and determinantal point processes via average-case entropic independence. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 123\u2013134. IEEE, 2022."},{"key":"e_1_2_1_7_1","volume-title":"Propublica compas dataset","author":"Angwin Julia","year":"2016","unstructured":"Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Propublica compas dataset, 2016."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/3291264.3291269"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300080"},{"key":"e_1_2_1_10_1","first-page":"19","volume-title":"Proceedings of the eleventh annual symposium on Computational geometry","author":"Chan Timothy M","year":"1995","unstructured":"Timothy M Chan. Output-sensitive results on convex hulls, extreme points, and related problems. In Proceedings of the eleventh annual symposium on Computational geometry, pages 10\u201319, 1995."},{"key":"e_1_2_1_11_1","volume-title":"Remarks on k-level algorithms in the plane","author":"Chan Timothy M","year":"1999","unstructured":"Timothy M Chan. Remarks on k-level algorithms in the plane, 1999."},{"key":"e_1_2_1_12_1","volume-title":"Why is my classifier discriminatory? Advances in neural information processing systems, 31","author":"Chen Irene","year":"2018","unstructured":"Irene Chen, Fredrik D Johansson, and David Sontag. Why is my classifier discriminatory? 
Advances in neural information processing systems, 31, 2018."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732269.2732275"},{"key":"e_1_2_1_14_1","first-page":"1553","volume-title":"2019 IEEE 35th International Conference on Data Engineering (ICDE)","author":"Chung Yeounoh","unstructured":"Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, and Steven Euijong Whang. Slice finder: Automated data slicing for model validation. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1550\u20131553. IEEE, 2019."},{"key":"e_1_2_1_15_1","volume-title":"race, and predictive policing: A critical race theory analysis of the strategic subject list. American journal of community psychology, 73(1\u20132):91\u2013103","author":"DaViera Andrea L","year":"2024","unstructured":"Andrea L DaViera, Marbella Uriostegui, Aaron Gottlieb, and Ogechi Onyeka. Risk, race, and predictive policing: A critical race theory analysis of the strategic subject list. American journal of community psychology, 73(1\u20132):91\u2013103, 2024."},{"key":"e_1_2_1_16_1","volume-title":"Mining the minoria: Unknown, under-represented, and under-performing minority groups. arXiv preprint arXiv:2411.04761","author":"Dehghankar Mohsen","year":"2024","unstructured":"Mohsen Dehghankar and Abolfazl Asudeh. Mining the minoria: Unknown, under-represented, and under-performing minority groups. arXiv preprint arXiv:2411.04761, 2024."},{"key":"e_1_2_1_17_1","first-page":"161","volume-title":"Proceedings 38th Annual Symposium on Foundations of Computer Science","author":"Dey Tamal K","unstructured":"Tamal K Dey. Improved bounds on planar k-sets and k-levels. In Proceedings 38th Annual Symposium on Foundations of Computer Science, pages 156\u2013161. IEEE, 1997."},{"key":"e_1_2_1_18_1","volume-title":"The demographic statistical atlas of the united states - statistical atlas. 
statisticalatlas.com\/place\/Illinois\/Chicago\/Race-and-Ethnicity#data-map\/tract, [visited on","author":"Diebel James","year":"2019","unstructured":"James Diebel, Jacob Norda, and Orna Kretchmer. The demographic statistical atlas of the united states - statistical atlas. statisticalatlas.com\/place\/Illinois\/Chicago\/Race-and-Ethnicity#data-map\/tract, [visited on June 2019]."},{"key":"e_1_2_1_19_1","volume-title":"Measuring skewness: a forgotten statistic? Journal of statistics education, 19(2)","author":"Doane David P","year":"2011","unstructured":"David P Doane and Lori E Seward. Measuring skewness: a forgotten statistic? Journal of statistics education, 19(2), 2011."},{"key":"e_1_2_1_20_1","volume-title":"Uci machine learning repository: Adult data set","author":"Dua Dheeru","year":"2019","unstructured":"Dheeru Dua and Casey Graff. Uci machine learning repository: Adult data set, 2019. Accessed: 2024-05-20."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/28905"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-3975(02)00738-7"},{"key":"e_1_2_1_23_1","volume-title":"Clustering by passing messages between data points. science, 315(5814):972\u2013976","author":"Frey Brendan J","year":"2007","unstructured":"Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. science, 315(5814):972\u2013976, 2007."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-012-0266-x"},{"key":"e_1_2_1_25_1","volume-title":"Strategies for rare population detection and sampling: A methodological approach in liguria. arXiv preprint arXiv:2405.01342","author":"Lancia G","year":"2024","unstructured":"G Lancia and E Riccomagno. Strategies for rare population detection and sampling: A methodological approach in liguria. 
arXiv preprint arXiv:2405.01342, 2024."},{"key":"e_1_2_1_26_1","first-page":"571","volume-title":"Big holes in big data: A monte carlo algorithm for detecting large hyper-rectangles in high dimensional data. In 2016 IEEE 40th annual computer software and applications conference (COMPSAC)","author":"Lemley Joseph","unstructured":"Joseph Lemley, Filip Jagodzinski, and Razvan Andonie. Big holes in big data: A monte carlo algorithm for detecting large hyper-rectangles in high dimensional data. In 2016 IEEE 40th annual computer software and applications conference (COMPSAC), volume 1, pages 563\u2013571. IEEE, 2016."},{"key":"e_1_2_1_27_1","first-page":"935","volume-title":"IJCAI (2)","author":"Liu Bing","year":"1997","unstructured":"Bing Liu, Liang-Ping Ku, and Wynne Hsu. Discovering interesting holes in data. In IJCAI (2), pages 930\u2013935, 1997."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability\/University of California Press","author":"Macqueen J","year":"1967","unstructured":"J Macqueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability\/University of California Press, 1967."},{"key":"e_1_2_1_29_1","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1007\/s10115-009-0226-y","article-title":"Subspace and projected clustering: experimental evaluation and analysis","volume":"21","author":"Moise Gabriela","year":"2009","unstructured":"Gabriela Moise, Arthur Zimek, Peer Kr\u00f6ger, Hans-Peter Kriegel, and J\u00f6rg Sander. Subspace and projected clustering: experimental evaluation and analysis. Knowledge and Information Systems, 21:299\u2013326, 2009.","journal-title":"Knowledge and Information Systems"},{"key":"e_1_2_1_30_1","volume-title":"Politics and City Life","author":"Moser Whet","year":"2017","unstructured":"Whet Moser. How redlining segregated chicago, and america. 
Chicago Magazine, Politics and City Life, 2017."},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the International Conference on Extending Database Technology (EDBT)","author":"Mousavi Melika","year":"2024","unstructured":"Melika Mousavi, Nima Shahbazi, and Abolfazl Asudeh. Data coverage for detecting representation bias in image data sets: A crowdsourcing approach. In Proceedings of the International Conference on Extending Database Technology (EDBT), 2024."},{"key":"e_1_2_1_32_1","volume-title":"Subspace clustering for high dimensional data: a review. Acm sigkdd explorations newsletter, 6(1):90\u2013105","author":"Parsons Lance","year":"2004","unstructured":"Lance Parsons, Ehtesham Haque, and Huan Liu. Subspace clustering for high dimensional data: a review. Acm sigkdd explorations newsletter, 6(1):90\u2013105, 2004."},{"key":"e_1_2_1_33_1","first-page":"1412","volume-title":"Proceedings of the 2021 International Conference on Management of Data","author":"Pastor Eliana","year":"2021","unstructured":"Eliana Pastor, Luca De Alfaro, and Elena Baralis. Looking for trouble: Analyzing classifier behavior via pattern divergence. In Proceedings of the 2021 International Conference on Management of Data, pages 1400\u20131412, 2021."},{"key":"e_1_2_1_34_1","volume-title":"Clustering by fast search and find of density peaks. science, 344(6191):1492\u20131496","author":"Rodriguez Alex","year":"2014","unstructured":"Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. science, 344(6191):1492\u20131496, 2014."},{"key":"e_1_2_1_35_1","first-page":"2299","volume-title":"Proceedings of the 2021 International Conference on Management of Data","author":"Sagadeeva Svetlana","year":"2021","unstructured":"Svetlana Sagadeeva and Matthias Boehm. Sliceline: Fast, linear-algebra-based slice finding for ml model debugging. 
In Proceedings of the 2021 International Conference on Management of Data, pages 2290\u20132299, 2021."},{"key":"e_1_2_1_36_1","first-page":"28","volume-title":"The VLDB Journal","author":"Shahbazi Nima","year":"2024","unstructured":"Nima Shahbazi and Abolfazl Asudeh. Reliability evaluation of individual predictions: a data-centric approach. The VLDB Journal, pages 1\u201328, 2024."},{"issue":"1","key":"e_1_2_1_37_1","first-page":"3","article-title":"Coverage-based data-centric approaches for responsible and trustworthy ai","volume":"47","author":"Shahbazi Nima","year":"2024","unstructured":"Nima Shahbazi, Mahdi Erfanian, and Abolfazl Asudeh. Coverage-based data-centric approaches for responsible and trustworthy ai. IEEE Data Eng. Bull., 47(1):3\u201317, 2024.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588433"},{"key":"e_1_2_1_39_1","first-page":"49","volume-title":"Proceedings of the sixteenth annual symposium on Computational geometry","author":"Sharir Micha","year":"2000","unstructured":"Micha Sharir, Shakhar Smorodinsky, and G\u00e1bor Tardos. An improved bound for k-sets in three dimensions. In Proceedings of the sixteenth annual symposium on Computational geometry, pages 43\u201349, 2000."},{"key":"e_1_2_1_40_1","volume-title":"Uci machine learning repository: Pima indians diabetes dataset","author":"Smith J.W.","year":"1988","unstructured":"J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, and R.S. Johannes. Uci machine learning repository: Pima indians diabetes dataset, 1988."},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM)","author":"Sun Chenkai","year":"2019","unstructured":"Chenkai Sun, Abolfazl Asudeh, HV Jagadish, Bill Howe, and Julia Stoyanovich. Mithralabel: Flexible dataset nutritional labels for responsible data science. 
In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), 2019."},{"key":"e_1_2_1_42_1","volume-title":"Reinforcement learning: An introduction","author":"Sutton Richard S","year":"2018","unstructured":"Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018."},{"key":"e_1_2_1_43_1","first-page":"1783","volume-title":"Proceedings of the 2021 International Conference on Management of Data","author":"Tae Ki Hyun","year":"2021","unstructured":"Ki Hyun Tae and Steven Euijong Whang. Slice tuner: A selective data acquisition framework for accurate and fair machine learning models. In Proceedings of the 2021 International Conference on Management of Data, pages 1771\u20131783, 2021."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3718057.3718072","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,27]],"date-time":"2025-08-27T18:12:35Z","timestamp":1756318355000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3718057.3718072"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1]]},"references-count":43,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["10.14778\/3718057.3718072"],"URL":"https:\/\/doi.org\/10.14778\/3718057.3718072","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,1]]},"assertion":[{"value":"2025-08-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}