{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T14:19:11Z","timestamp":1754144351315,"version":"3.41.2"},"reference-count":17,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T00:00:00Z","timestamp":1752537600000},"content-version":"vor","delay-in-days":14,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000054","name":"National Cancer Institute","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000054","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>In the era of precision medicine, performing comparative analysis over diverse patient populations is a fundamental step toward tailoring healthcare interventions. However, the aspect of fairly selecting molecular features across multiple patients is often overlooked.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>To address this challenge, we introduce FALAFL (FAir muLti-sAmple Feature seLection), an algorithmic approach based on combinatorial optimization. FALAFL is designed to perform feature selection in sequencing data which ensures a balanced selection of features from all patient samples in a cohort. We have applied FALAFL to the problem of selecting lineage-informative CpG sites within a cohort of colorectal cancer patients subjected to low-coverage single-cell methylation sequencing. 
Our results demonstrate that FALAFL can rapidly and robustly determine the optimal set of CpG sites, which are each well covered by cells across the vast majority of the patients, while ensuring that in each patient, a large proportion of these sites have high read coverage. An analysis of the FALAFL-selected sites reveals that their tumor lineage-informativeness exhibits a strong correlation across a spectrum of diverse patient profiles. Furthermore, these universally lineage-informative sites are highly enriched in the inter-CpG island regions. We hope that FALAFL will aid in designing panels for diagnostic and prognostic purposes and help propel fair data science practices in the exploration of complex diseases.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The source code is available at: https:\/\/github.com\/algo-cancer\/FALAFL.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf237","type":"journal-article","created":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:02:06Z","timestamp":1752584526000},"page":"i150-i159","source":"Crossref","is-referenced-by-count":0,"title":["Fair molecular feature selection unveils universally tumor lineage-informative methylation sites in colorectal cancer"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4774-8230","authenticated-orcid":false,"given":"Xuan Cindy","family":"Li","sequence":"first","affiliation":[{"name":"Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health , Bethesda, MD, 20892,","place":["United States"]},{"name":"Program in Computational Biology, Bioinformatics, and Genomics, University of Maryland , College Park, MD, 20740,","place":["United 
States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1051-7236","authenticated-orcid":false,"given":"Yuelin","family":"Liu","sequence":"additional","affiliation":[{"name":"Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health , Bethesda, MD, 20892,","place":["United States"]},{"name":"Department of Computer Science, University of Maryland , College Park, MD, 20740,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2147-8033","authenticated-orcid":false,"given":"Alejandro A","family":"Sch\u00e4ffer","sequence":"additional","affiliation":[{"name":"Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health , Bethesda, MD, 20892,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2748-8205","authenticated-orcid":false,"given":"Stephen M","family":"Mount","sequence":"additional","affiliation":[{"name":"Program in Computational Biology, Bioinformatics, and Genomics, University of Maryland , College Park, MD, 20740,","place":["United States"]},{"name":"Department of Cell Biology and Molecular Genetics, University of Maryland , College Park, MD, 20740,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2170-2808","authenticated-orcid":false,"given":"S Cenk","family":"Sahinalp","sequence":"additional","affiliation":[{"name":"Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health , Bethesda, MD, 20892,","place":["United 
States"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2025,7,15]]},"reference":[{"key":"2025071509015857300_btaf237-B1","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1021\/acs.jcim.0c00908","article-title":"Effective feature selection method for class-imbalance datasets applied to chemical toxicity prediction","volume":"61","author":"Antelo-Collado","year":"2021","journal-title":"J Chem Inf Model"},{"key":"2025071509015857300_btaf237-B2","doi-asserted-by":"crossref","first-page":"507","DOI":"10.1038\/nrg.2016.86","article-title":"Towards precision medicine","volume":"17","author":"Ashley","year":"2016","journal-title":"Nat Rev Genet"},{"key":"2025071509015857300_btaf237-B3","doi-asserted-by":"publisher","first-page":"1060","DOI":"10.1126\/science.aao3791","article-title":"Single-cell multiomics sequencing and analyses of human colorectal cancer","volume":"362","author":"Bian","year":"2018","journal-title":"Science"},{"key":"2025071509015857300_btaf237-B4","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1016\/0959-437X(95)80044-1","article-title":"CpG islands and genes","volume":"5","author":"Cross","year":"1995","journal-title":"Curr Opin Genet Dev"},{"key":"2025071509015857300_btaf237-B5","doi-asserted-by":"publisher","first-page":"576","DOI":"10.1038\/s41586-019-1198-z","article-title":"Epigenetic evolution and lineage histories of chronic lymphocytic leukaemia","volume":"569","author":"Gaiti","year":"2019","journal-title":"Nature"},{"volume-title":"Computers and Intractability: A Guide to the Theory of NP-Completeness","year":"1979","author":"Garey","key":"2025071509015857300_btaf237-B6"},{"year":"2022","author":"Gurobi Optimization, LLC. 
Gurobi Optimizer Reference Manual","key":"2025071509015857300_btaf237-B7"},{"key":"2025071509015857300_btaf237-B8","doi-asserted-by":"crossref","first-page":"178","DOI":"10.1038\/ng.298","article-title":"The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores","volume":"41","author":"Irizarry","year":"2009","journal-title":"Nat Genet"},{"key":"2025071509015857300_btaf237-B9","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1007\/978-1-4684-2001-2_9","volume-title":"Complexity of Computer Computations","author":"Karp","year":"1972"},{"key":"2025071509015857300_btaf237-B10","doi-asserted-by":"crossref","first-page":"e0288173","DOI":"10.1371\/journal.pone.0288173","article-title":"Improving prediction of drug\u2013target interactions based on fusing multiple features with data balancing and feature selection techniques","volume":"18","author":"Khojasteh","year":"2023","journal-title":"PLoS One"},{"key":"2025071509015857300_btaf237-B11","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1109\/18.61115","article-title":"Divergence measures based on the Shannon entropy","volume":"37","author":"Lin","year":"1991","journal-title":"IEEE Trans Inform Theory"},{"key":"2025071509015857300_btaf237-B12","first-page":"1089","article-title":"Single-cell methylation sequencing data reveal succinct metastatic migration histories and tumor progression models","volume":"33","author":"Liu","year":"2023","journal-title":"Genome Res"},{"key":"2025071509015857300_btaf237-B13","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1186\/1471-2156-10-39","article-title":"An ancestry informative marker set for determining continental origin: validation and extension using human genome diversity panels","volume":"10","author":"Nassir","year":"2009","journal-title":"BMC 
Genetics"},{"first-page":"301","year":"2005","author":"Phuong","key":"2025071509015857300_btaf237-B14"},{"first-page":"260","year":"2018","author":"Rios","key":"2025071509015857300_btaf237-B15"},{"first-page":"443","year":"2023","author":"Sharma","key":"2025071509015857300_btaf237-B16"},{"key":"2025071509015857300_btaf237-B17","doi-asserted-by":"publisher","first-page":"599","DOI":"10.1109\/TPAMI.1985.4767707","article-title":"Entropy and distance of random graphs with application to structural pattern recognition","volume":"7","author":"Wong","year":"1985","journal-title":"IEEE Trans Pattern Anal Mach Intell"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i150\/63745761\/btaf237.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i150\/63745761\/btaf237.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:02:08Z","timestamp":1752584528000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/41\/Supplement_1\/i150\/8199416"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":17,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf237","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,7,1]]}}}