{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T06:03:23Z","timestamp":1767852203998,"version":"3.49.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2010,4,1]],"date-time":"2010-04-01T00:00:00Z","timestamp":1270080000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"FWO","award":["G.0821.09N"],"award-info":[{"award-number":["G.0821.09N"]}]},{"DOI":"10.13039\/501100004963","name":"Seventh Framework Programme","doi-asserted-by":"publisher","award":["FP7-ICT-233599"],"award-info":[{"award-number":["FP7-ICT-233599"]}],"id":[{"id":"10.13039\/501100004963","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2010,4]]},"abstract":"<jats:p>We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning<jats:italic>concise<\/jats:italic>regular expressions from positive examples strings. We identify two classes of concise regular expressions\u2014the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)\u2014that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small datasets.<\/jats:p>","DOI":"10.1145\/1735886.1735890","type":"journal-article","created":{"date-parts":[[2010,5,4]],"date-time":"2010-05-04T14:14:06Z","timestamp":1272982446000},"page":"1-47","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":70,"title":["Inference of concise regular expressions and DTDs"],"prefix":"10.1145","volume":"35","author":[{"given":"Geert Jan","family":"Bex","sequence":"first","affiliation":[{"name":"Hasselt University and Transnational University of Limburg, Belgium"}]},{"given":"Frank","family":"Neven","sequence":"additional","affiliation":[{"name":"Hasselt University and Transnational University of Limburg, Belgium"}]},{"given":"Thomas","family":"Schwentick","sequence":"additional","affiliation":[{"name":"Dortmund University, Germany"}]},{"given":"Stijn","family":"Vansummeren","sequence":"additional","affiliation":[{"name":"Universit\u00e9 Libre de Bruxelles, Belgium"}]}],"member":"320","published-online":{"date-parts":[[2010,5,3]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"Abiteboul S. Buneman P. and Suciu D. 1999. Data on the Web. Morgan Kaufmann Publishers. Abiteboul S. Buneman P. and Suciu D. 1999. Data on the Web. Morgan Kaufmann Publishers."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/356914.356918"},{"key":"e_1_2_2_4_1","volume-title":"Proceedings of the 5th International Workshop on the Web and Databases (WebDB","author":"Barbosa D.","year":"2002","unstructured":"Barbosa , D. , Mendelzon , A. O. , Keenleyside , J. , and Lyons , K. A . 2002. ToXgene: An extensible template-based data generator for XML . In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002 ). 49--54. Barbosa, D., Mendelzon, A. O., Keenleyside, J., and Lyons, K. A. 2002. ToXgene: An extensible template-based data generator for XML. In Proceedings of the 5th International Workshop on the Web and Databases (WebDB 2002). 49--54."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11280-006-8437-6"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1346330.1346333"},{"key":"e_1_2_2_7_1","volume-title":"Online Proceedings of the 1st Biennal Conference on Innovative Data Systems Research (CIDR'03)","author":"Bernstein P. A.","year":"2003","unstructured":"Bernstein , P. A. 2003 . Applying model management to classical meta data problems . In Online Proceedings of the 1st Biennal Conference on Innovative Data Systems Research (CIDR'03) . Bernstein, P. A. 2003. Applying model management to classical meta data problems. In Online Proceedings of the 1st Biennal Conference on Innovative Data Systems Research (CIDR'03)."},{"key":"e_1_2_2_8_1","unstructured":"Bex G. J. Gelade W. Neven F. and Vansummeren S. Learning deterministic regular expressions for the inference of schemas from XML data. http:\/\/arxiv.org\/abs\/1004.2372. Bex G. J. Gelade W. Neven F. and Vansummeren S. Learning deterministic regular expressions for the inference of schemas from XML data. http:\/\/arxiv.org\/abs\/1004.2372."},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1367497.1367609"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1017074.1017095"},{"key":"e_1_2_2_11_1","volume-title":"Proceedings of the International Conference on Database Theory (VLDB). U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds. ACM, 115--126","author":"Bex G. J.","unstructured":"Bex , G. J. , Neven , F. , Schwentick , T. , and Tuyls , K . 2006. Inference of concise DTDs from XML data . In Proceedings of the International Conference on Database Theory (VLDB). U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds. ACM, 115--126 . Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from XML data. In Proceedings of the International Conference on Database Theory (VLDB). U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, Eds. ACM, 115--126."},{"key":"e_1_2_2_12_1","volume-title":"Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07)","author":"Bex G. J.","unstructured":"Bex , G. J. , Neven , F. , and Vansummeren , S . 2007. Inferring XML schema definitions from XML data . In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07) . 998--1009. Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML schema definitions from XML data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB'07). 998--1009."},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/168304.168340"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/0304-3975(93)90287-4"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1006\/inco.1997.2688"},{"key":"e_1_2_2_16_1","volume-title":"Proceedings of the International Conference on Database Theory (ICDT'97)","volume":"1186","author":"Buneman P.","unstructured":"Buneman , P. , Davidson , S. B. , Fernandez , M. F. , and Suciu , D . 1997. Adding structure to unstructured data . In Proceedings of the International Conference on Database Theory (ICDT'97) . Lecture Notes in Computer Science , vol. 1186 . Springer, 336--350. Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory (ICDT'97). Lecture Notes in Computer Science, vol. 1186. Springer, 336--350."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-3975(97)00296-X"},{"key":"e_1_2_2_18_1","unstructured":"Castor. The Castor project. www.castor.org. Castor. The Castor project. www.castor.org."},{"key":"e_1_2_2_19_1","volume-title":"Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB'01)","volume":"45","author":"Chidlovskii B.","year":"2001","unstructured":"Chidlovskii , B. 2001 . Schema extraction from XML: A grammatical inference approach . In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB'01) . CEUR Workshop Proceedings , vol. 45 . Chidlovskii, B. 2001. Schema extraction from XML: A grammatical inference approach. In Proceedings of the 8th International Workshop on Knowledge Representation meets Databases (KRDB'01). CEUR Workshop Proceedings, vol. 45."},{"key":"e_1_2_2_20_1","unstructured":"Clark J. Trang: Multi-Format schema converter based on RELAX NG. www.thaiopensource.com\/relaxng\/trang.html. Clark J. Trang: Multi-Format schema converter based on RELAX NG. www.thaiopensource.com\/relaxng\/trang.html."},{"key":"e_1_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Cover R. 2003. The Cover Pages. xml.coverpages.org. Cover R. 2003. The Cover Pages. xml.coverpages.org.","DOI":"10.14714\/CP46.480"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-30500-2_31"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304220"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0022-0000(76)80034-7"},{"key":"e_1_2_2_25_1","volume-title":"Proceedings of the 14th International Conference on Data Engineering (ICDE'98)","author":"Fernandez M. F.","unstructured":"Fernandez , M. F. and Suciu , D . 1998. Optimizing regular path expressions using graph schemas . In Proceedings of the 14th International Conference on Data Engineering (ICDE'98) . 14--23. Fernandez, M. F. and Suciu, D. 1998. Optimizing regular path expressions using graph schemas. In Proceedings of the 14th International Conference on Data Engineering (ICDE'98). 14--23."},{"key":"e_1_2_2_26_1","series-title":"Lecture Notes in Artificial Intelligence","volume-title":"Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications","author":"Fernau H.","unstructured":"Fernau , H. 2004. Extracting minimum length document type definitions is NP-hard . In Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications . Lecture Notes in Artificial Intelligence , vol. 3264 . Springer , 277--278. Fernau, H. 2004. Extracting minimum length document type definitions is NP-hard. In Proceedings of the 7th International Colloquium on Grammatical Inference: Algorithms and Applications. Lecture Notes in Artificial Intelligence, vol. 3264. Springer, 277--278."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ic.2008.12.008"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1103822.1103832"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.57687"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1021560618289"},{"key":"e_1_2_2_31_1","first-page":"325","article-title":"Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS'08)","volume":"08001","author":"Gelade W.","year":"2008","unstructured":"Gelade , W. and Neven , F. 2008 . Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS'08) . Dagstuhl Seminar Proceedings , vol. 08001. 325 -- 336 . Gelade, W. and Neven, F. 2008. Succinctness of the complement and intersection of regular expressions. In Proceedings of the 25th Annual Symposium on Theoretical Aspects of Computer Science (STACS'08). Dagstuhl Seminar Proceedings, vol. 08001. 325--336.","journal-title":"Dagstuhl Seminar Proceedings"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0019-9958(67)91165-5"},{"key":"e_1_2_2_33_1","volume-title":"Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97)","author":"Goldman R.","unstructured":"Goldman , R. and Widom , J . 1997. DataGuides: Enabling query formulation and optimization in semistructured databases . In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97) . 436--445. Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97). 436--445."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-70583-3_4"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2006.09.025"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW.2006.166"},{"key":"e_1_2_2_37_1","unstructured":"Hinkelman S. 2005. Business integration\u2014Information conformance statements (BI-ICS). Tech. rep. IBM DeveloperWorks. Hinkelman S. 2005. Business integration\u2014Information conformance statements (BI-ICS). Tech. rep. IBM DeveloperWorks."},{"key":"e_1_2_2_38_1","unstructured":"Hopcroft J. and Ullman J. 1979. Introduction to Automata Theory Languages and computation. Addison-Wesley. Hopcroft J. and Ullman J. 1979. Introduction to Automata Theory Languages and computation. Addison-Wesley."},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/322217.322230"},{"key":"e_1_2_2_40_1","volume-title":"Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04)","author":"Koch C.","unstructured":"Koch , C. , Scherzinger , S. , Schweikardt , N. , and Stegmaier , B . 2004. Schema-Based scheduling of event processors and buffer minimization for queries on structured data streams . In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04) . 228--239. Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-Based scheduling of event processors and buffer minimization for queries on structured data streams. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04). 228--239."},{"key":"e_1_2_2_41_1","volume-title":"Proceedings of 27th International Conference on Very Large Data Bases (VLDB'01)","author":"Manolescu I.","unstructured":"Manolescu , I. , Florescu , D. , and Kossmann , D . 2001. Answering XML queries on heterogeneous data sources . In Proceedings of 27th International Conference on Very Large Data Bases (VLDB'01) . 241--250. Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML queries on heterogeneous data sources. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB'01). 241--250."},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1324185.1324188"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1166074.1166076"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/262762.262770"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775223"},{"key":"e_1_2_2_47_1","unstructured":"Miklau G. 2002. XMLData repository. www.cs.washington.edu\/research\/xmldatasets. Miklau G. 2002. XMLData repository. www.cs.washington.edu\/research\/xmldatasets."},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0020-0190(02)00345-9"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/276304.276331"},{"key":"e_1_2_2_50_1","volume-title":"Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79--90","author":"Nestorov S.","unstructured":"Nestorov , S. , Ullman , J. D. , Wiener , J. L. , and Chawathe , S. S . 1997. Representative objects: Concise representations of semistructured, hierarchial data . In Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79--90 . Nestorov, S., Ullman, J. D., Wiener, J. L., and Chawathe, S. S. 1997. Representative objects: Concise representations of semistructured, hierarchial data. In Proceedings of the 13th International Conference on Data Engineering. IEEE Computer Society, 79--90."},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.2168\/LMCS-2(3:1)2006"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11280-005-0509-5"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10796-006-9016-1"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-3975(00)00209-7"},{"key":"e_1_2_2_55_1","unstructured":"Open Web Application Security Project Consortium. 2004. The top ten most critical web application security vulnerabilities\u20142004 update. www.owasp.org. Open Web Application Security Project Consortium. 2004. The top ten most critical web application security vulnerabilities\u20142004 update. www.owasp.org."},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.5555\/647702.734327"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.18626"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1007\/s007780100057"},{"key":"e_1_2_2_59_1","volume-title":"Proceedings of the 3rd International Workshop on The World Wide Web and Databases, (WebDB'00)","author":"Sahuguet A.","year":"2000","unstructured":"Sahuguet , A. 2000 . Everything you ever wanted to know about DTDs, but were afraid to ask (extended abstract) . In Proceedings of the 3rd International Workshop on The World Wide Web and Databases, (WebDB'00) , Selected Papers. 171--183. Sahuguet, A. 2000. Everything you ever wanted to know about DTDs, but were afraid to ask (extended abstract). In Proceedings of the 3rd International Workshop on The World Wide Web and Databases, (WebDB'00), Selected Papers. 171--183."},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-3975(97)00014-5"},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/502585.502613"},{"key":"e_1_2_2_62_1","unstructured":"Sun. Sun JAXB. java.sun.com\/webservices\/jaxb. Sun. Sun JAXB. java.sun.com\/webservices\/jaxb."},{"key":"e_1_2_2_63_1","unstructured":"Thompson H. S. Beech D. Maloney M. and Mendelsohn N. 2004. XML Schema part 1: Structures 2nd Ed. World Wide Web Consortium Recommendation REC-xmlschema-1-20041028. Thompson H. S. Beech D. Maloney M. and Mendelsohn N. 2004. XML Schema part 1: Structures 2nd Ed. World Wide Web Consortium Recommendation REC-xmlschema-1-20041028."},{"key":"e_1_2_2_64_1","volume-title":"W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language","unstructured":"W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language , 2 nd Ed. W3C. W3C. 2002. XHTML 1.0 The Extensible HyperText Markup Language, 2nd Ed. W3C.","edition":"2"},{"key":"e_1_2_2_65_1","volume-title":"Proceedings of the 7th International Database Engineering and Applications Symposium. 230--235","author":"Wang G.","unstructured":"Wang , G. , Liu , M. , Yu , J. X. , Sun , B. , Yu , G. , Lv , J. , and Lu , H . 2003. Effective schema-based XML query optimization techniques . In Proceedings of the 7th International Database Engineering and Applications Symposium. 230--235 . Wang, G., Liu, M., Yu, J. X., Sun, B., Yu, G., Lv, J., and Lu, H. 2003. Effective schema-based XML query optimization techniques. In Proceedings of the 7th International Database Engineering and Applications Symposium. 230--235."}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1735886.1735890","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1735886.1735890","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T12:45:33Z","timestamp":1750250733000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1735886.1735890"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,4]]},"references-count":63,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2010,4]]}},"alternative-id":["10.1145\/1735886.1735890"],"URL":"https:\/\/doi.org\/10.1145\/1735886.1735890","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"value":"0362-5915","type":"print"},{"value":"1557-4644","type":"electronic"}],"subject":[],"published":{"date-parts":[[2010,4]]},"assertion":[{"value":"2009-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2009-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-05-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}