{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T09:50:29Z","timestamp":1773481829161,"version":"3.50.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2018,7]]},"abstract":"<jats:p>\n            Exploratory big data applications often run on raw unstructured or semi-structured data formats, such as JSON files or text logs. These applications can spend 80--90% of their execution time parsing the data. In this paper, we propose a new approach for reducing this overhead: apply filters on the data's raw bytestream\n            <jats:italic>before<\/jats:italic>\n            parsing. This technique, which we call raw filtering, leverages the features of modern hardware and the high selectivity of queries found in many exploratory applications. With raw filtering, a user-specified query predicate is compiled into a set of filtering primitives called raw filters (RFs). RFs are fast, SIMD-based operators that occasionally yield false positives, but never false negatives. We combine multiple RFs into an RF cascade to decrease the false positive rate and maximize parsing throughput. Because the best RF cascade is data-dependent, we propose an optimizer that dynamically selects the combination of RFs with the best expected throughput, achieving within 10% of the global optimum cascade while adding less than 1.2% overhead. We implement these techniques in a system called Sparser, which automatically manages a parsing cascade given a data stream in a supported format (e.g., JSON, Avro, Parquet) and a user query. We show that many real-world applications are highly selective and benefit from Sparser. 
Across diverse workloads, Sparser accelerates state-of-the-art parsers such as Mison by up to 22 \u00d7 and improves end-to-end application performance by up to 9 \u00d7.\n          <\/jats:p>","DOI":"10.14778\/3236187.3236207","type":"journal-article","created":{"date-parts":[[2018,9,10]],"date-time":"2018-09-10T12:12:28Z","timestamp":1536581548000},"page":"1576-1589","source":"Crossref","is-referenced-by-count":47,"title":["Filter before you parse"],"prefix":"10.14778","volume":"11","author":[{"given":"Shoumik","family":"Palkar","sequence":"first","affiliation":[{"name":"Stanford InfoLab"}]},{"given":"Firas","family":"Abuzaid","sequence":"additional","affiliation":[{"name":"Stanford InfoLab"}]},{"given":"Peter","family":"Bailis","sequence":"additional","affiliation":[{"name":"Stanford InfoLab"}]},{"given":"Matei","family":"Zaharia","sequence":"additional","affiliation":[{"name":"Databricks Inc."}]}],"member":"320","published-online":{"date-parts":[[2018,7]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2452376.2452377"},{"key":"e_1_2_1_2_1","first-page":"337","volume-title":"NSDI","author":"Agarwal Rachit","year":"2015","unstructured":"Agarwal , Rachit and Khandelwal , Anurag and Stoica , Ion . Succinct : Enabling Queries on Compressed Data . In NSDI , pages 337 -- 350 , 2015 . Agarwal, Rachit and Khandelwal, Anurag and Stoica, Ion. Succinct: Enabling Queries on Compressed Data. In NSDI, pages 337--350, 2015."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213864"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610502"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742797"},{"key":"e_1_2_1_6_1","unstructured":"Apache Avro. https:\/\/avro.apache.org.  Apache Avro. https:\/\/avro.apache.org."},{"key":"e_1_2_1_7_1","unstructured":"Apache Avro 1.8.1 Specification. https:\/\/avro.apache.org\/docs\/1.8.1\/spec.html.  
Apache Avro 1.8.1 Specification. https:\/\/avro.apache.org\/docs\/1.8.1\/spec.html."},{"key":"e_1_2_1_8_1","unstructured":"Intel AVX2. https:\/\/software.intel.com\/en-us\/node\/523876.  Intel AVX2. https:\/\/software.intel.com\/en-us\/node\/523876."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007615"},{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","DOI":"10.17487\/RFC8259","volume-title":"RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format","author":"Bray Tim","year":"2017","unstructured":"Bray , Tim . RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format . 2017 . Bray, Tim. RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format. 2017."},{"key":"e_1_2_1_11_1","unstructured":"Bro. https:\/\/www.bro.org\/.  Bro. https:\/\/www.bro.org\/."},{"key":"e_1_2_1_12_1","unstructured":"Bro Exchange 2013 Malware Analysis. https:\/\/github.com\/LiamRandall\/BroMalware-Exercise.  Bro Exchange 2013 Malware Analysis. https:\/\/github.com\/LiamRandall\/BroMalware-Exercise."},{"key":"e_1_2_1_13_1","volume-title":"http:\/\/matthias.vallentin.net\/slides\/bro-nf.pdf","author":"Bro","year":"2011","unstructured":"Network Forensics with Bro . http:\/\/matthias.vallentin.net\/slides\/bro-nf.pdf , 2011 . Network Forensics with Bro. http:\/\/matthias.vallentin.net\/slides\/bro-nf.pdf, 2011."},{"key":"e_1_2_1_14_1","unstructured":"Understanding and Examining Bro Logs. https:\/\/www.bro.org\/bro-workshop-2011\/solutions\/logs\/index.html.  Understanding and Examining Bro Logs. https:\/\/www.bro.org\/bro-workshop-2011\/solutions\/logs\/index.html."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1463788.1463811"},{"key":"e_1_2_1_16_1","first-page":"559","volume-title":"Divyakant. AFilter: Adaptable XML Filtering with Prefix-caching Suffix-clustering. 
In Proceedings of the 32nd VLDB","author":"Candan K Sel\u00e7uk","year":"2006","unstructured":"Candan , K Sel\u00e7uk and Hsiung , Wang-Pin and Chen , Songting and Tatemura , Junichi and Agrawal , Divyakant. AFilter: Adaptable XML Filtering with Prefix-caching Suffix-clustering. In Proceedings of the 32nd VLDB , pages 559 -- 570 . VLDB Endowment , 2006 . Candan, K Sel\u00e7uk and Hsiung, Wang-Pin and Chen, Songting and Tatemura, Junichi and Agrawal, Divyakant. AFilter: Adaptable XML Filtering with Prefix-caching Suffix-clustering. In Proceedings of the 32nd VLDB, pages 559--570. VLDB Endowment, 2006."},{"key":"e_1_2_1_17_1","volume-title":"Research Access to Censys Data. https:\/\/support.censys.io\/getting-started\/research-access-to-censys-data","author":"Censys","year":"2017","unstructured":"Censys . Research Access to Censys Data. https:\/\/support.censys.io\/getting-started\/research-access-to-censys-data , 2017 . Censys. Research Access to Censys Data. https:\/\/support.censys.io\/getting-started\/research-access-to-censys-data, 2017."},{"key":"e_1_2_1_18_1","volume-title":"Really Fast JSON Parser. https:\/\/chadaustin.me\/2017\/05\/writing-a-really-really-fast-json-parser\/","author":"Really","year":"2017","unstructured":"Writing a Really , Really Fast JSON Parser. https:\/\/chadaustin.me\/2017\/05\/writing-a-really-really-fast-json-parser\/ , 2017 . Writing a Really, Really Fast JSON Parser. https:\/\/chadaustin.me\/2017\/05\/writing-a-really-really-fast-json-parser\/, 2017."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2593673"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818181"},{"key":"e_1_2_1_21_1","first-page":"551","volume-title":"NSDI","author":"Choi Byungkwon","year":"2016","unstructured":"Choi , Byungkwon and Chae , Jongwook and Jamshed , Muhammad and Park , Kyoungsoo and Han , Dongsu . DFC : Accelerating String Pattern Matching for Network Applications . In NSDI , pages 551 -- 565 , 2016 . 
Choi, Byungkwon and Chae, Jongwook and Jamshed, Muhammad and Park, Kyoungsoo and Han, Dongsu. DFC: Accelerating String Pattern Matching for Network Applications. In NSDI, pages 551--565, 2016."},{"key":"e_1_2_1_22_1","unstructured":"Wireshark Filters. http:\/\/www.lovemytool.com\/blog\/2010\/04\/top-10-wireshark-filters-by-chris-greer.html.  Wireshark Filters. http:\/\/www.lovemytool.com\/blog\/2010\/04\/top-10-wireshark-filters-by-chris-greer.html."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/958942.958947"},{"issue":"1","key":"e_1_2_1_24_1","first-page":"41","article-title":"High-performance XML Filtering","volume":"26","author":"Diao Yanlei","year":"2003","unstructured":"Diao , Yanlei and Franklin , Michael J . High-performance XML Filtering : An Overview of YFilter. IEEE Data Eng. Bull. , 26 ( 1 ): 41 -- 48 , 2003 . Diao, Yanlei and Franklin, Michael J. High-performance XML Filtering: An Overview of YFilter. IEEE Data Eng. Bull., 26(1):41--48, 2003.","journal-title":"An Overview of YFilter. IEEE Data Eng. Bull."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2810103.2813703"},{"key":"e_1_2_1_26_1","unstructured":"Gallant Andrew. ripgrep is faster than grep ag git grep ucg pt sift. https:\/\/blog.burntsushi.net\/ripgrep.  Gallant Andrew. ripgrep is faster than grep ag git grep ucg pt sift. https:\/\/blog.burntsushi.net\/ripgrep."},{"key":"e_1_2_1_27_1","unstructured":"TShark Tutorial and Filter Examples. https:\/\/hackertarget.com\/tshark-tutorial-and-filter-examples\/.  TShark Tutorial and Filter Examples. https:\/\/hackertarget.com\/tshark-tutorial-and-filter-examples\/."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2006.184"},{"key":"e_1_2_1_29_1","unstructured":"Analyze HTTP Requests with TShark. http:\/\/kvz.io\/blog\/2010\/05\/15\/analyze-http-requests-with-tshark\/.  Analyze HTTP Requests with TShark. 
http:\/\/kvz.io\/blog\/2010\/05\/15\/analyze-http-requests-with-tshark\/."},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of 5th Biennial Conference on Innovative Data Systems Research, number EPFL-CONF-161489","author":"Idreos Stratos","year":"2011","unstructured":"Idreos , Stratos and Alagiannis , Ioannis and Johnson , Ryan and Ailamaki , Anastasia . Here are my data files. Here are my queries. Where are my results? In Proceedings of 5th Biennial Conference on Innovative Data Systems Research, number EPFL-CONF-161489 , 2011 . Idreos, Stratos and Alagiannis, Ioannis and Johnson, Ryan and Ailamaki, Anastasia. Here are my data files. Here are my queries. Where are my results? In Proceedings of 5th Biennial Conference on Innovative Data Systems Research, number EPFL-CONF-161489, 2011."},{"key":"e_1_2_1_31_1","unstructured":"Jackson. https:\/\/github.com\/FasterXML\/jackson.  Jackson. https:\/\/github.com\/FasterXML\/jackson."},{"key":"e_1_2_1_32_1","unstructured":"nativejson-benchmark. https:\/\/github.com\/miloyip\/nativejson-benchmark.  nativejson-benchmark. https:\/\/github.com\/miloyip\/nativejson-benchmark."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994516"},{"key":"e_1_2_1_34_1","volume-title":"Anastasia. Just-in-time Data Virtualization: Lightweight Data Management with ViDa. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR)","author":"Karpathiotakis Manos","year":"2015","unstructured":"Karpathiotakis , Manos and Alagiannis , Ioannis and Heinis , Thomas and Branco , Miguel and Ailamaki , Anastasia. Just-in-time Data Virtualization: Lightweight Data Management with ViDa. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR) , 2015 . Karpathiotakis, Manos and Alagiannis, Ioannis and Heinis, Thomas and Branco, Miguel and Ailamaki, Anastasia. Just-in-time Data Virtualization: Lightweight Data Management with ViDa. 
In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR), 2015."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732977.2732986"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115416"},{"key":"e_1_2_1_37_1","unstructured":"libpcap. http:\/\/www.tcpdump.org.  libpcap. http:\/\/www.tcpdump.org."},{"key":"e_1_2_1_38_1","first-page":"728","article-title":"Techniques for Ordering Predicates","author":"Ma Lu","year":"2014","unstructured":"Ma , Lu and Au , Grace Kwan-On . Techniques for Ordering Predicates in Column Partitioned Databases for Query Optimization , July 3 2014 . US Patent App. 13\/ 728 ,345. Ma, Lu and Au, Grace Kwan-On. Techniques for Ordering Predicates in Column Partitioned Databases for Query Optimization, July 3 2014. US Patent App. 13\/728,345.","journal-title":"Column Partitioned Databases for Query Optimization"},{"key":"e_1_2_1_39_1","first-page":"9","volume-title":"ADMS@ VLDB","author":"Moussalli Roger","year":"2011","unstructured":"Moussalli , Roger and Halstead , Robert J and Salloum , Mariam and Najjar , Walid A and Tsotras , Vassilis J . Efficient XML Path Filtering Using GPUs . In ADMS@ VLDB , pages 9 -- 18 . Citeseer , 2011 . Moussalli, Roger and Halstead, Robert J and Salloum, Mariam and Najjar, Walid A and Tsotras, Vassilis J. Efficient XML Path Filtering Using GPUs. In ADMS@ VLDB, pages 9--18. Citeseer, 2011."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767899"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/2556549.2556555"},{"key":"e_1_2_1_42_1","unstructured":"ARM NEON. https:\/\/developer.arm.com\/technologies\/neon.  ARM NEON. https:\/\/developer.arm.com\/technologies\/neon."},{"key":"e_1_2_1_43_1","volume-title":"Inc.","author":"Norton Marc","year":"2004","unstructured":"Norton , Marc . Optimizing Pattern Matching for Intrusion Detection. Sourcefire , Inc. , Columbia, MD , 2004 . 
Norton, Marc. Optimizing Pattern Matching for Intrusion Detection. Sourcefire, Inc., Columbia, MD, 2004."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115415"},{"key":"e_1_2_1_45_1","unstructured":"Apache Parquet. https:\/\/parquet.apache.org.  Apache Parquet. https:\/\/parquet.apache.org."},{"key":"e_1_2_1_46_1","unstructured":"apache\/parquet-format. https:\/\/github.com\/apache\/parquet-format.  apache\/parquet-format. https:\/\/github.com\/apache\/parquet-format."},{"key":"e_1_2_1_47_1","unstructured":"Development\/LibpcapFileFormat - The Wireshark Wiki. https:\/\/wiki.wireshark.org\/Development\/LibpcapFileFormat.  Development\/LibpcapFileFormat - The Wireshark Wiki. https:\/\/wiki.wireshark.org\/Development\/LibpcapFileFormat."},{"key":"e_1_2_1_48_1","unstructured":"Libpcap File Format. https:\/\/wiki.wireshark.org\/Development\/LibpcapFileFormat.  Libpcap File Format. https:\/\/wiki.wireshark.org\/Development\/LibpcapFileFormat."},{"key":"e_1_2_1_49_1","unstructured":"RapidJSON. https:\/\/rapidjson.org.  RapidJSON. https:\/\/rapidjson.org."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465292"},{"key":"e_1_2_1_51_1","volume-title":"RWTH","author":"Scheufele Wolfgang","year":"1996","unstructured":"Scheufele , Wolfgang and Moerkotte , Guido . Optimal Ordering of Selections and Joins in Acyclic Queries with Expensive Predicates . RWTH , Fachgruppe Informatik , 1996 . Scheufele, Wolfgang and Moerkotte, Guido. Optimal Ordering of Selections and Joins in Acyclic Queries with Expensive Predicates. RWTH, Fachgruppe Informatik, 1996."},{"key":"e_1_2_1_52_1","unstructured":"Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform. https:\/\/databricks.com\/blog\/2015\/01\/09\/.  Spark SQL Data Sources API: Unified Data Access for the Apache Spark Platform. 
https:\/\/databricks.com\/blog\/2015\/01\/09\/."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.56"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2612183"},{"key":"e_1_2_1_55_1","unstructured":"tcpdump. http:\/\/www.tcpdump.org.  tcpdump. http:\/\/www.tcpdump.org."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2536800"},{"key":"e_1_2_1_57_1","volume-title":"https:\/\/spark.apache.org\/docs\/latest\/sql-programming-guide.html#json-datasets","author":"The Apache Foundation JSON","year":"2015","unstructured":"The Apache Foundation . JSON Datasets . https:\/\/spark.apache.org\/docs\/latest\/sql-programming-guide.html#json-datasets , 2015 . The Apache Foundation. JSON Datasets. https:\/\/spark.apache.org\/docs\/latest\/sql-programming-guide.html#json-datasets, 2015."},{"key":"e_1_2_1_58_1","unstructured":"TShark. https:\/\/www.wireshark.org\/docs\/man-pages\/tshark.html.  TShark. https:\/\/www.wireshark.org\/docs\/man-pages\/tshark.html."},{"key":"e_1_2_1_59_1","first-page":"2628","volume-title":"George. Deterministic Memory-efficient String Matching Algorithms for Intrusion Detection. In INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies","volume":"4","author":"Tuck Nathan","year":"2004","unstructured":"Tuck , Nathan and Sherwood , Timothy and Calder , Brad and Varghese , George. Deterministic Memory-efficient String Matching Algorithms for Intrusion Detection. In INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies , volume 4 , pages 2628 -- 2639 . IEEE, 2004 . Tuck, Nathan and Sherwood, Timothy and Calder, Brad and Varghese, George. Deterministic Memory-efficient String Matching Algorithms for Intrusion Detection. In INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 4, pages 2628--2639. 
IEEE, 2004."},{"key":"e_1_2_1_60_1","unstructured":"Introduction to Twitter JSON. https:\/\/developer.twitter.com\/en\/docs\/tweets\/data-dictionary\/overview\/intro-to-tweet-json.  Introduction to Twitter JSON. https:\/\/developer.twitter.com\/en\/docs\/tweets\/data-dictionary\/overview\/intro-to-tweet-json."},{"key":"e_1_2_1_61_1","first-page":"2001","volume-title":"Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on","volume":"1","author":"Viola Paul","unstructured":"Viola , Paul and Jones , Michael . Rapid Object Detection using a Boosted Cascade of Simple Features . In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on , volume 1 , pages I--I. IEEE, 2001 . Viola, Paul and Jones, Michael. Rapid Object Detection using a Boosted Cascade of Simple Features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I--I. IEEE, 2001."},{"key":"e_1_2_1_62_1","first-page":"2","volume-title":"Ion. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation","author":"Zaharia Matei","year":"2012","unstructured":"Zaharia , Matei and Chowdhury , Mosharaf and Das , Tathagata and Dave , Ankur and Ma , Justin and McCauley , Murphy and Franklin , Michael J and Shenker , Scott and Stoica , Ion. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation , pages 2 -- 2 . USENIX Association , 2012 . Zaharia, Matei and Chowdhury, Mosharaf and Das, Tathagata and Dave, Ankur and Ma, Justin and McCauley, Murphy and Franklin, Michael J and Shenker, Scott and Stoica, Ion. 
Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2--2. USENIX Association, 2012."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3236187.3236207","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:47:26Z","timestamp":1672220846000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3236187.3236207"}},"subtitle":["faster analytics on raw data with sparser"],"short-title":[],"issued":{"date-parts":[[2018,7]]},"references-count":62,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2018,7]]}},"alternative-id":["10.14778\/3236187.3236207"],"URL":"https:\/\/doi.org\/10.14778\/3236187.3236207","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2018,7]]}}}