{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T05:16:04Z","timestamp":1767849364323,"version":"3.49.0"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:p>The schemalessness, one of the major advantages of JSON representation format, comes with high penalties in querying and operations by denying various critical functions such as query optimizations, indexing, or data verification. There have been continuous efforts to develop an accurate JSON schema discovery algorithm from a bag of JSON documents. Unfortunately, existing schema discovery techniques, being top-down algorithms, face challenges from the lack of visibility into children nodes of JSON tree. With absence of the information about lower-level JSON elements, top-down algorithms need to employ assumptions and heuristics to decide the schema type of nodes. However, such static decisions are often violated in datasets which causes top-down algorithms to perform poorly. To overcome this, we propose an algorithm, called ReCG, that processes JSON documents in a bottom-up manner. It builds up schemas from leaf elements upward in the JSON document tree and, thus, can make more informed decisions of the schema node types. In addition, we adopt MDL (Minimum Description Length) principles systematically while building up the schemas to choose among candidate schemas the most concise yet accurate one with well-balanced generality. Evaluations show that our technique improves the recall and precision of found schemas by as high as 47%, resulting in 46% better F1 score while also performing 2.11\u00d7 faster on average against the state-of-the-art.<\/jats:p>","DOI":"10.14778\/3681954.3682019","type":"journal-article","created":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T16:23:36Z","timestamp":1725035016000},"page":"3538-3550","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework"],"prefix":"10.14778","volume":"17","author":[{"given":"Joohyung","family":"Yun","sequence":"first","affiliation":[{"name":"POSTECH, Pohang, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Byungchul","family":"Tak","sequence":"additional","affiliation":[{"name":"Kyungpook National University, Daegu, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wook-Shin","family":"Han","sequence":"additional","affiliation":[{"name":"Graduate School of AI, POSTECH, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,8,30]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Technical Report. Retrieved","year":"2024","unstructured":"2024. Technical Report. Retrieved July 15, 2024 from https:\/\/sites.google.com\/dblab.postech.ac.kr\/recg-technical-report"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (","author":"Alrashed Tarfah","unstructured":"Tarfah Alrashed, Jumana Almahmoud, Amy X. Zhang, and David R. Karger. 2020. ScrAPIr: Making Web Data APIs Accessible to End Users. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (, Honolulu, HI, USA,) (CHI '20). ACM, New York, NY, USA, 1--12."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3632891"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the Conference on Extending Database Technology (EDBT). 222--233","author":"Baazizi Mohamed Amine","year":"2017","unstructured":"Mohamed Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2017. Schema Inference for Massive JSON Datasets. In Proceedings of the Conference on Extending Database Technology (EDBT). 222--233."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314032"},{"key":"e_1_2_1_6_1","first-page":"4","article-title":"Parametric Schema Inference for Massive JSON Datasets","volume":"28","author":"Baazizi Mohamed-Amine","year":"2022","unstructured":"Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2022. Parametric Schema Inference for Massive JSON Datasets. The VLDB Journal 28, 4 (mar 2022), 497--521.","journal-title":"The VLDB Journal"},{"key":"e_1_2_1_7_1","first-page":"271","article-title":"Type-Based XML Projection","volume":"6","author":"Benzaken V\u00e9ronique","year":"2006","unstructured":"V\u00e9ronique Benzaken, Giuseppe Castagna, Dario Colazzo, and Kim Nguyen. 2006. Type-Based XML Projection.. In VLDB, Vol. 6. 271--282.","journal-title":"VLDB"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1841909.1841911"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/3402755.3402761"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137782"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3034786.3056120"},{"key":"e_1_2_1_12_1","volume-title":"Retrieved","author":"Cabrera Alvaro","year":"2016","unstructured":"Alvaro Cabrera. 2016. JSON Schema Faker. Retrieved July 15, 2024 from https:\/\/github.com\/json-schema-faker\/json-schema-faker"},{"key":"e_1_2_1_13_1","first-page":"14","article-title":"Enabling JSON Document Stores in Relational Systems","volume":"13","author":"Chasseur Craig","year":"2013","unstructured":"Craig Chasseur, Yinan Li, and Jignesh M Patel. 2013. Enabling JSON Document Stores in Relational Systems.. In WebDB, Vol. 13. 14--15.","journal-title":"WebDB"},{"key":"e_1_2_1_14_1","volume-title":"Flash profile. Novel techniques in sensory characterization and consumer profiling","author":"Delarue Julien","year":"2014","unstructured":"Julien Delarue. 2014. Flash profile. Novel techniques in sensory characterization and consumer profiling (2014), 175--206."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1121995.1122010"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452809"},{"key":"e_1_2_1_17_1","unstructured":"Martin Ester Hans-Peter Kriegel J\u00f6rg Sander Xiaowei Xu et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd Vol. 96. 226--231."},{"key":"e_1_2_1_18_1","volume-title":"2007 IEEE 23rd International Conference on Data Engineering. IEEE, 666--675","author":"Fan Wenfei","year":"2006","unstructured":"Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2006. Rewriting regular XPath queries on XML views. In 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 666--675."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/IRI.2018.00060"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2018.02.007"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335409"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3409719"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517850"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.4018\/JDM.2019070103"},{"key":"e_1_2_1_26_1","volume-title":"The distribution of the flora in the alpine zone. 1. New phytologist 11, 2","author":"Jaccard Paul","year":"1912","unstructured":"Paul Jaccard. 1912. The distribution of the flora in the alpine zone. 1. New phytologist 11, 2 (1912), 37--50."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3436905.3436926"},{"key":"e_1_2_1_28_1","unstructured":"Meike Klettke Uta St\u00f6rl and Stefanie Scherzinger. 2015. Schema extraction and structural outlier detection for JSON-based NoSQL data stores. (2015)."},{"key":"e_1_2_1_29_1","volume-title":"Retrieved","author":"Kristensen Mads","year":"2017","unstructured":"Mads Kristensen. 2017. SchemaStore. Retrieved July 15, 2024 from https:\/\/github.com\/SchemaStore\/schemastore"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487788.2488184"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115416"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2595628"},{"key":"e_1_2_1_33_1","first-page":"315","article-title":"Query optimization for XML","volume":"99","author":"McHugh Jason","year":"1999","unstructured":"Jason McHugh and Jennifer Widom. 1999. Query optimization for XML. In VLDB, Vol. 99. 315--326.","journal-title":"VLDB"},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Felipe Pezoa Juan L. Reutter Fernando Suarez Martin Ugarte and Domagoj Vrgo\u010d. 2016. Foundations of JSON Schema. In Proceedings of the 25th International Conference on World Wide Web (Montr\u00e9al Qu\u00e9bec Canada) (WWW '16). International World Wide Web Conferences Steering Committee Republic and Canton of Geneva CHE 263--273.","DOI":"10.1145\/2872427.2883029"},{"key":"e_1_2_1_35_1","first-page":"3","article-title":"Inferring Decision Trees Using the Minimum Description Length","volume":"80","author":"Quinlan J. R.","year":"1989","unstructured":"J. R. Quinlan and R. L. Rivest. 1989. Inferring Decision Trees Using the Minimum Description Length Principle. Inf. Comput. 80, 3 (mar 1989), 227--248.","journal-title":"Principle. Inf. Comput."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1016\/0005-1098(78)90005-5"},{"key":"e_1_2_1_37_1","volume-title":"Jo\u00e3o da Cunha Costa, V\u00e1lter Ferreira Picas Carvalho, and Jos\u00e9 Carlos Ramalho.","author":"dos Santos Filipa Alves","year":"2021","unstructured":"Filipa Alves dos Santos, Hugo Andr\u00e9 Coelho Cardoso, Jo\u00e3o da Cunha Costa, V\u00e1lter Ferreira Picas Carvalho, and Jos\u00e9 Carlos Ramalho. 2021. DataGen: JSON\/XML Dataset Generator. (2021)."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452801"},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data","author":"Tahara Daniel","unstructured":"Daniel Tahara, Thaddeus Diamond, and Daniel J. Abadi. 2014. Sinew: a SQL system for multi-structured data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). ACM, New York, NY, USA, 815--826."},{"key":"e_1_2_1_40_1","volume-title":"2018 USENIX Annual Technical Conference (USENIX ATC 18)","author":"Trivedi Animesh","year":"2018","unstructured":"Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler. 2018. Albis:{High-Performance} File Format for Big Data Systems. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 615--630."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3355369.3355594"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/2777598.2777601"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3384345.3384350"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","unstructured":"Erik Wilde. 2018. Surfing the API Web: Web Concepts. In Companion Proceedings of the The Web Conference 2018 (Lyon France) (WWW '18). International World Wide Web Conferences Steering Committee Republic and Canton of Geneva CHE 797--803. 10.1145\/3184558.3188743","DOI":"10.1145\/3184558.3188743"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICMLC.2007.4370588"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3567444"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591637"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247541"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3681954.3682019","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T18:29:36Z","timestamp":1725474576000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3681954.3682019"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7]]},"references-count":47,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["10.14778\/3681954.3682019"],"URL":"https:\/\/doi.org\/10.14778\/3681954.3682019","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,7]]},"assertion":[{"value":"2024-08-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}