{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T07:19:26Z","timestamp":1740122366539,"version":"3.37.3"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2020,3,16]],"date-time":"2020-03-16T00:00:00Z","timestamp":1584316800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,3,16]],"date-time":"2020-03-16T00:00:00Z","timestamp":1584316800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["TU\/C\/000018","EP\/N510129\/1"],"award-info":[{"award-number":["TU\/C\/000018","EP\/N510129\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Min Knowl Disc"],"published-print":{"date-parts":[[2020,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contains missing data and anomalies, which are found commonly in real-world data sets. In this paper, we propose <jats:italic>ptype<\/jats:italic>, a probabilistic robust type inference method that allows us to detect such entries, and infer data types. We further show that the proposed method outperforms existing methods.<\/jats:p>","DOI":"10.1007\/s10618-020-00680-1","type":"journal-article","created":{"date-parts":[[2020,3,16]],"date-time":"2020-03-16T07:05:07Z","timestamp":1584342307000},"page":"870-904","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["ptype: probabilistic type inference"],"prefix":"10.1007","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5059-8421","authenticated-orcid":false,"given":"Taha","family":"Ceritli","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christopher K. I.","family":"Williams","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"James","family":"Geddes","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,3,16]]},"reference":[{"key":"680_CR1","doi-asserted-by":"crossref","unstructured":"Bahl L, Brown P, De\u00a0Souza P, Mercer R (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: IEEE ICASSP\u201986, IEEE, vol\u00a011, pp 49\u201352","DOI":"10.1109\/ICASSP.1986.1169179"},{"key":"680_CR2","unstructured":"Brown PF (1987) The acoustic-modeling problem in automatic speech recognition. Ph.D. Dissertation, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA"},{"issue":"3","key":"680_CR3","doi-asserted-by":"publisher","first-page":"15","DOI":"10.1145\/1541880.1541882","volume":"41","author":"V Chandola","year":"2009","unstructured":"Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv (CSUR) 41(3):15","journal-title":"ACM Comput Surv (CSUR)"},{"key":"680_CR4","doi-asserted-by":"crossref","unstructured":"Dasu T, Johnson T (2003) Exploratory data mining and data cleaning: an overview. In: Exploratory data mining and data cleaning, vol 479, 1st edn, John Wiley & Sons, Inc., New York, NY, USA, chap\u00a01, pp 1\u201316","DOI":"10.1002\/0471448354.ch1"},{"issue":"7","key":"680_CR5","doi-asserted-by":"publisher","first-page":"1895","DOI":"10.1162\/089976698300017197","volume":"10","author":"TG Dietterich","year":"1998","unstructured":"Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895\u20131923","journal-title":"Neural Comput"},{"key":"680_CR6","unstructured":"D\u00f6hmen T, M\u00fchleisen H, Boncz P (2017a) hypoparsr. https:\/\/github.com\/tdoehmen\/hypoparsr [Accessed on 29\/06\/2018]"},{"key":"680_CR7","doi-asserted-by":"crossref","unstructured":"D\u00f6hmen T, M\u00fchleisen H, Boncz P (2017b) Multi-hypothesis CSV parsing. In: Proceedings of the 29th SSDBM","DOI":"10.1145\/3085504.3085520"},{"issue":"9","key":"680_CR8","doi-asserted-by":"publisher","first-page":"1349","DOI":"10.1016\/j.patcog.2004.03.020","volume":"38","author":"P Dupont","year":"2005","unstructured":"Dupont P, Denis F, Esposito Y (2005) Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognit 38(9):1349\u20131371","journal-title":"Pattern Recognit"},{"key":"680_CR9","doi-asserted-by":"crossref","unstructured":"Fisher K, Gruber R (2005) PADS: a domain-specific language for processing ad hoc data. In: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation (PLDI\u201905), ACM, vol 40(6), pp 295\u2013304","DOI":"10.1145\/1065010.1065046"},{"key":"680_CR10","doi-asserted-by":"crossref","unstructured":"Fisher K, Walker D, Zhu KQ, White P (2008) From dirt to shovels: fully automatic tool generation from ad hoc data. In: POPL \u201908, ACM, vol 43(1), pp 421\u2013434","DOI":"10.1145\/1328438.1328488"},{"key":"680_CR11","unstructured":"Gill A (1962) The basic model. In: Introduction to the theory of finite-state machines, McGraw-Hill Book Company, pp 1\u201315, 10.2307\/2003459"},{"key":"680_CR12","unstructured":"Greenery (2018) greenery. https:\/\/github.com\/qntm\/greenery\/ Accessed 31 May 2019"},{"key":"680_CR13","doi-asserted-by":"crossref","unstructured":"Guo PJ, Kandel S, Hellerstein JM, Heer J (2011) Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In: UIST \u201911, ACM, pp 65\u201374","DOI":"10.1145\/2047196.2047205"},{"key":"680_CR14","volume-title":"Principles of data mining","author":"DJ Hand","year":"2001","unstructured":"Hand DJ, Mannila H, Smyth P (2001) Principles of data mining. MIT Press, Cambridge"},{"issue":"1","key":"680_CR15","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1145\/568438.568455","volume":"32","author":"JE Hopcroft","year":"2001","unstructured":"Hopcroft JE, Motwani R, Ullman JD (2001) Introduction to automata theory, languages, and computation. ACM SIGACT News 32(1):60\u201365","journal-title":"ACM SIGACT News"},{"issue":"4","key":"680_CR16","doi-asserted-by":"publisher","first-page":"589","DOI":"10.1016\/j.csl.2009.08.002","volume":"24","author":"H Jiang","year":"2010","unstructured":"Jiang H (2010) Discriminative training of HMMs for automatic speech recognition: a survey. Comput Speech Lang 24(4):589\u2013608","journal-title":"Comput Speech Lang"},{"key":"680_CR17","doi-asserted-by":"crossref","unstructured":"Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: CHI \u201911, ACM, pp 3363\u20133372","DOI":"10.1145\/1978942.1979444"},{"key":"680_CR18","doi-asserted-by":"crossref","unstructured":"Limaye G, Sarawagi S, Chakrabarti S (2010) Annotating and searching web tables using entities, types and relationships. VLDB \u201910 3(1\u20132):1338\u20131347","DOI":"10.14778\/1920841.1921005"},{"key":"680_CR19","unstructured":"Lindenberg F (2017) messytables Documentation Release 0.3. https:\/\/media.readthedocs.org\/pdf\/messytables\/latest\/messytables.pdf Accessed 29 June 2018"},{"issue":"9","key":"680_CR20","doi-asserted-by":"publisher","first-page":"1432","DOI":"10.1109\/29.90371","volume":"36","author":"A N\u00e1das","year":"1988","unstructured":"N\u00e1das A, Nahamoo D, Picheny MA (1988) On a model-robust training method for speech recognition. IEEE Trans Acoust Speech Signal Process 36(9):1432\u20131436","journal-title":"IEEE Trans Acoust Speech Signal Process"},{"key":"680_CR21","volume-title":"Introduction to probabilistic automata","author":"A Paz","year":"1971","unstructured":"Paz A (1971) Introduction to probabilistic automata, vol 78. Academic Press Inc, New York"},{"issue":"1","key":"680_CR22","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1145\/1147234.1147247","volume":"8","author":"RK Pearson","year":"2006","unstructured":"Pearson RK (2006) The problem of disguised missing data. ACM SIGKDD Explor 8(1):83\u201392","journal-title":"ACM SIGKDD Explor"},{"key":"680_CR23","doi-asserted-by":"crossref","unstructured":"Petricek T, Guerra G, Syme D (2016) Types from data: making structured data first-class citizens in F#. In: PLDI 2016","DOI":"10.1145\/2908080.2908115"},{"key":"680_CR24","doi-asserted-by":"crossref","unstructured":"Qahtan AA, Elmagarmid A, Castro\u00a0Fernandez R, Ouzzani M, Tang N (2018) FAHES: a robust disguised missing values detector. In: Proceedings of the 24th ACM SIGKDD, ACM, pp 2100\u20132109","DOI":"10.1145\/3219819.3220109"},{"issue":"9","key":"680_CR25","doi-asserted-by":"publisher","first-page":"1537","DOI":"10.1109\/TPAMI.2008.191","volume":"31","author":"JA Quinn","year":"2009","unstructured":"Quinn JA, Williams CKI, McIntosh N (2009) Factorial switching linear dynamical systems applied to physiological condition monitoring. IEEE Trans Pattern Anal Mach Intell 31(9):1537\u20131551","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"3","key":"680_CR26","doi-asserted-by":"publisher","first-page":"230","DOI":"10.1016\/S0019-9958(63)90290-0","volume":"6","author":"MO Rabin","year":"1963","unstructured":"Rabin MO (1963) Probabilistic automata. Inf Control 6(3):230\u2013245","journal-title":"Inf Control"},{"issue":"2","key":"680_CR27","doi-asserted-by":"publisher","first-page":"114","DOI":"10.1147\/rd.32.0114","volume":"3","author":"MO Rabin","year":"1959","unstructured":"Rabin MO, Scott D (1959) Finite automata and their decision problems. IBM J Res Dev 3(2):114\u2013125. https:\/\/doi.org\/10.1147\/rd.32.0114","journal-title":"IBM J Res Dev"},{"key":"680_CR28","unstructured":"Raman V, Hellerstein JM (2001) Potter\u2019s wheel: an interactive data cleaning system. In: VLDB \u201901, Morgan Kaufmann Publishers Inc., pp 381\u2013390"},{"key":"680_CR29","unstructured":"Solutions Stochastic (2018) Test-driven data analysis. https:\/\/tdda.readthedocs.io\/en\/tdda-1.0.23\/constraints.html Accessed 8 April 2019"},{"key":"680_CR30","unstructured":"Trifacta (2018) Trifacta Wrangler. https:\/\/www.trifacta.com\/ Accessed 27 June 2018"},{"key":"680_CR31","unstructured":"Valera I, Ghahramani Z (2017) Automatic discovery of the statistical types of variables in a dataset. In: Proceedings of the 34th ICML, PMLR, vol\u00a070, pp 3521\u20133529"},{"key":"680_CR32","doi-asserted-by":"crossref","unstructured":"Vergari A, Molina A, Peharz R, Ghahramani Z, Kersting K, Velera I (2019) Automatic Bayesian density analysis. In: Proceedings of the 33rd AAAI","DOI":"10.1609\/aaai.v33i01.33015207"},{"issue":"7","key":"680_CR33","doi-asserted-by":"publisher","first-page":"1013","DOI":"10.1109\/TPAMI.2005.147","volume":"27","author":"E Vidal","year":"2005","unstructured":"Vidal E, Thollard F, de la Higuera C, Casacuberta F, Carrasco RC (2005) Probabilistic finite-state machines\u2014Part I. IEEE Trans Pattern Anal Mach Intell 27(7):1013\u20131025","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"680_CR34","unstructured":"Wickham H, Grolemund G (2016) R for data science: import, tidy, transform, visualize, and model data, 1st edn, O\u2019Reilly Media, Inc., chap\u00a08, pp 137\u2013138. http:\/\/r4ds.had.co.nz\/data-import.html Accessed 24 July 2018"},{"key":"680_CR35","unstructured":"Wickham H, Hester J, Francois R, Jyl\u00e4nki J, J\u00f8rgensen M (2017) readr 1.1.1. https:\/\/cran.r-project.org\/web\/packages\/readr\/readr.pdf Accessed 29 June 2018"},{"key":"680_CR36","unstructured":"Williams CKI, Hinton GE (1991) Mean field networks that learn to discriminate temporally distorted strings. In: Proceedings of the 1990 Connectionist Models Summer School, Morgan Kaufmann Publishers, Inc., pp 18\u201322"}],"container-title":["Data Mining and Knowledge Discovery"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-020-00680-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s10618-020-00680-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-020-00680-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,16]],"date-time":"2021-03-16T00:41:50Z","timestamp":1615855310000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s10618-020-00680-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,16]]},"references-count":36,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,5]]}},"alternative-id":["680"],"URL":"https:\/\/doi.org\/10.1007\/s10618-020-00680-1","relation":{},"ISSN":["1384-5810","1573-756X"],"issn-type":[{"type":"print","value":"1384-5810"},{"type":"electronic","value":"1573-756X"}],"subject":[],"published":{"date-parts":[[2020,3,16]]},"assertion":[{"value":"19 January 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 February 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 March 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}