{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,14]],"date-time":"2025-12-14T00:05:06Z","timestamp":1765670706350,"version":"3.48.0"},"reference-count":13,"publisher":"Oxford University Press (OUP)","issue":"12","license":[{"start":{"date-parts":[[2025,12,1]],"date-time":"2025-12-01T00:00:00Z","timestamp":1764547200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Genomic studies very often rely on computationally intensive analyses of relationships between features, which are typically represented as intervals along a 1D coordinate system (such as positions on a chromosome). In this context, the Python programming language is extensively used for manipulating and analyzing data stored in a tabular form of rows and columns, called a DataFrame. Pandas is the most widely used Python DataFrame package and has been criticized for inefficiencies and scalability issues, which its modern alternative\u2014Polars\u2014aims to address with a native backend written in the Rust programming language.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>polars-bio is a Python library that enables fast, parallel and out-of-core operations on large genomic interval datasets. Its main components are implemented in Rust, using the Apache DataFusion query engine and Apache Arrow for efficient data representation. It is compatible with Polars and Pandas DataFrame formats. 
In a real-world comparison (10^7 versus 1.2\u00d710^6 intervals), our library runs overlap queries 6.5\u00d7, nearest queries 15.5\u00d7, count_overlaps queries 38\u00d7, and coverage queries 15\u00d7 faster than Bioframe. On equally sized synthetic sets (10^7 versus 10^7), the corresponding speedups are 1.6\u00d7, 5.5\u00d7, 6\u00d7, and 6\u00d7. In streaming mode, on real and synthetic interval pairs, our implementation uses 90\u00d7 and 15\u00d7 less memory for overlap, 4.5\u00d7 and 6.5\u00d7 less for nearest, 60\u00d7 and 12\u00d7 less for count_overlaps, and 34\u00d7 and 7\u00d7 less for coverage than Bioframe. Multi-threaded benchmarks show good scalability characteristics. To the best of our knowledge, polars-bio is the most efficient single-node library for genomic interval DataFrames in Python.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>polars-bio is an open-source Python package distributed under the Apache License available for major platforms, including Linux, macOS, and Windows in the PyPI registry. The online documentation is https:\/\/biodatageeks.org\/polars-bio\/ and the source code is available on GitHub: https:\/\/github.com\/biodatageeks\/polars-bio and Zenodo: https:\/\/doi.org\/10.5281\/zenodo.16374290. 
are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf640","type":"journal-article","created":{"date-parts":[[2025,11,28]],"date-time":"2025-11-28T13:12:59Z","timestamp":1764335579000},"source":"Crossref","is-referenced-by-count":0,"title":["polars-bio\u2014fast, scalable, and out-of-core operations on large genomic interval datasets"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8724-5646","authenticated-orcid":false,"given":"Marek","family":"Wiewi\u00f3rka","sequence":"first","affiliation":[{"name":"Institute of Computer Science, Warsaw University of Technology , 00-665 Warsaw,","place":["Poland"]}]},{"given":"Pavel","family":"Khamutou","sequence":"additional","affiliation":[{"name":"Institute of Computer Science, Warsaw University of Technology , 00-665 Warsaw,","place":["Poland"]}]},{"given":"Marek","family":"Zbysi\u0144ski","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Oxford , OX1 3QD Oxford,","place":["United Kingdom"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0941-4571","authenticated-orcid":false,"given":"Tomasz","family":"Gambin","sequence":"additional","affiliation":[{"name":"Institute of Computer Science, Warsaw University of Technology , 00-665 Warsaw,","place":["Poland"]}]}],"member":"286","published-online":{"date-parts":[[2025,12,1]]},"reference":[{"key":"2025121319024114500_btaf640-B1","doi-asserted-by":"crossref","first-page":"btae088","DOI":"10.1093\/bioinformatics\/btae088","article-title":"Bioframe: operations on genomic intervals in Pandas dataframes","volume":"40","author":"Abdennur","year":"2024","journal-title":"Bioinformatics"},{"key":"2025121319024114500_btaf640-B2","doi-asserted-by":"crossref","first-page":"1386","DOI":"10.1093\/bioinformatics\/btl647","article-title":"Nested containment list (NCList): a new algorithm for accelerating interval query of genome alignment and interval 
databases","volume":"23","author":"Alekseyenko","year":"2007","journal-title":"Bioinformatics"},{"key":"2025121319024114500_btaf640-B3","doi-asserted-by":"crossref","first-page":"3423","DOI":"10.1093\/bioinformatics\/btr539","article-title":"Pybedtools: a flexible Python library for manipulating genomic datasets and annotations","volume":"27","author":"Dale","year":"2011","journal-title":"Bioinformatics"},{"key":"2025121319024114500_btaf640-B4","doi-asserted-by":"crossref","first-page":"4907","DOI":"10.1093\/bioinformatics\/btz407","article-title":"Augmented interval list: a novel data structure for efficient genomic interval search","volume":"35","author":"Feng","year":"2019","journal-title":"Bioinformatics"},{"volume-title":"Speed up Your Python with Rust: Optimize Python Performance by Creating Python Pip Modules in Rust","year":"2022","author":"Flitton","key":"2025121319024114500_btaf640-B5"},{"key":"2025121319024114500_btaf640-B6","doi-asserted-by":"crossref","first-page":"434","DOI":"10.1038\/s41586-020-2308-7","article-title":"The mutational constraint spectrum quantified from variation in 141,456 humans","volume":"581","author":"Karczewski","year":"2020","journal-title":"Nature"},{"first-page":"5","year":"2024","author":"Lamb","key":"2025121319024114500_btaf640-B7"},{"key":"2025121319024114500_btaf640-B8","doi-asserted-by":"crossref","first-page":"e1003118","DOI":"10.1371\/journal.pcbi.1003118","article-title":"Software for computing and annotating genomic ranges","volume":"9","author":"Lawrence","year":"2013","journal-title":"PLoS Comput Biol"},{"key":"2025121319024114500_btaf640-B9","doi-asserted-by":"crossref","first-page":"1315","DOI":"10.1093\/bioinformatics\/btaa827","article-title":"Bedtk: finding interval overlap with implicit interval 
tree","volume":"37","author":"Li","year":"2021","journal-title":"Bioinformatics"},{"key":"2025121319024114500_btaf640-B10","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1186\/s13059-016-0974-4","article-title":"The ensembl variant effect predictor","volume":"17","author":"McLaren","year":"2016","journal-title":"Genome Biol"},{"key":"2025121319024114500_btaf640-B11","unstructured":"Oketunji AF. \u00a0Exploratory data analysis with polars. 2024. https:\/\/zenodo.org\/doi\/10.5281\/zenodo.14211160"},{"key":"2025121319024114500_btaf640-B12","doi-asserted-by":"crossref","first-page":"2679","DOI":"10.14778\/3603581.3603604","article-title":"The composable data management system manifesto","volume":"16","author":"Pedreira","year":"2023","journal-title":"Proc VLDB Endow"},{"key":"2025121319024114500_btaf640-B13","doi-asserted-by":"crossref","first-page":"918","DOI":"10.1093\/bioinformatics\/btz615","article-title":"PyRanges: efficient comparison of genomic intervals in Python","volume":"36","author":"Stovner","year":"2020","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf640\/65667510\/btaf640.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/12\/btaf640\/65667510\/btaf640.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/12\/btaf640\/65667510\/btaf640.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,14]],"date-time":"2025-12-14T00:02:49Z","timestamp":1765670569000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1
093\/bioinformatics\/btaf640\/8362264"}},"subtitle":[],"editor":[{"given":"Peter","family":"Robinson","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2025,12,1]]},"references-count":13,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf640","relation":{},"ISSN":["1367-4811"],"issn-type":[{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2025,12]]},"published":{"date-parts":[[2025,12,1]]},"article-number":"btaf640"}}