{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,3,14]],"date-time":"2024-03-14T09:20:47Z","timestamp":1710408047979},"reference-count":11,"publisher":"F1000 Research Ltd","license":[{"start":{"date-parts":[[2018,10,19]],"date-time":"2018-10-19T00:00:00Z","timestamp":1539907200000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000139","name":"U.S. Environmental Protection Agency","doi-asserted-by":"publisher"}],"content-domain":{"domain":["f1000research.com"],"crossmark-restriction":false},"short-container-title":["F1000Res"],"abstract":"<ns4:p>Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command-line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. We conclude that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (&gt; 100,000 rows) and few columns (&lt; 14 columns). This puts it in a special position to mine rules from clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. 
The experimental results of applying FDTool to 12 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.<\/ns4:p>","DOI":"10.12688\/f1000research.16483.1","type":"journal-article","created":{"date-parts":[[2018,10,19]],"date-time":"2018-10-19T14:45:20Z","timestamp":1539960320000},"page":"1667","update-policy":"http:\/\/dx.doi.org\/10.12688\/f1000research.crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data"],"prefix":"10.12688","volume":"7","author":[{"ORCID":"http:\/\/orcid.org\/0000-0001-9778-243X","authenticated-orcid":false,"given":"Matt","family":"Buranosky","sequence":"first","affiliation":[]},{"given":"Elmar","family":"Stellnberger","sequence":"additional","affiliation":[]},{"given":"Emily","family":"Pfaff","sequence":"additional","affiliation":[]},{"given":"David","family":"Diaz-Sanchez","sequence":"additional","affiliation":[]},{"given":"Cavin","family":"Ward-Caviness","sequence":"additional","affiliation":[]}],"member":"2560","published-online":{"date-parts":[[2018,10,19]]},"reference":[{"key":"ref-8","first-page":"307-328","article-title":"Fast discovery of association rules.","volume":"12","author":"R Agrawal","year":"1996","journal-title":"Advances in knowledge discovery and data mining."},{"key":"ref-7","article-title":"Automatic discovery of functional dependencies and conditional functional dependencies: A comparative study","author":"N Asghar","year":"2015"},{"key":"ref-11","article-title":"USEPA\/FDTool: FDTool (Version v0.1.7).","author":"M 
Buranosky","year":"2018","journal-title":"Zenodo."},{"key":"ref-9","article-title":"Database Systems: Models, Languages, Design, and Application Programming","author":"R Elmasri","year":"2011"},{"key":"ref-4","doi-asserted-by":"publisher","first-page":"100-111","DOI":"10.1093\/comjnl\/42.2.100","article-title":"Tane: An efficient algorithm for discovering functional and approximate dependencies.","volume":"42","author":"Y Huhtala","year":"1999","journal-title":"Comput J."},{"key":"ref-2","article-title":"Theory of Relational Databases.","author":"D Maier","year":"1983"},{"key":"ref-6","doi-asserted-by":"publisher","first-page":"1082-1093","DOI":"10.14778\/2794367.2794377","article-title":"Functional dependency discovery: An experimental evaluation of seven algorithms.","volume":"8","author":"T Papenbrock","year":"2015","journal-title":"Proc VLDB Endow."},{"key":"ref-5","article-title":"Database Management Systems.","author":"R Ramakrishnan","year":"2000"},{"key":"ref-10","article-title":"Functional dependencies and finding a minimal cover","author":"R Soule","year":"2014"},{"key":"ref-1","doi-asserted-by":"publisher","first-page":"197-219","DOI":"10.1007\/s10618-007-0083-9","article-title":"Mining functional dependencies from data.","volume":"16","author":"H Yao","year":"2008","journal-title":"Data Min Knowl Discov."},{"key":"ref-3","doi-asserted-by":"publisher","article-title":"Fd_mine: Discovering functional dependencies in a database using equivalences","author":"H 
Yao","year":"2002","DOI":"10.1109\/ICDM.2002.1184040"}],"container-title":["F1000Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/f1000research.com\/articles\/7-1667\/v1\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/f1000research.com\/articles\/7-1667\/v1\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/f1000research.com\/articles\/7-1667\/v1\/iparadigms","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,4,29]],"date-time":"2019-04-29T10:00:11Z","timestamp":1556532011000},"score":1,"resource":{"primary":{"URL":"https:\/\/f1000research.com\/articles\/7-1667\/v1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,10,19]]},"references-count":11,"URL":"http:\/\/dx.doi.org\/10.12688\/f1000research.16483.1","relation":{"has-review":[{"id-type":"doi","id":"10.5256\/f1000research.18017.r46574","asserted-by":"subject"},{"id-type":"doi","id":"10.5256\/f1000research.18017.r39685","asserted-by":"subject"}]},"ISSN":["2046-1402"],"issn-type":[{"value":"2046-1402","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,10,19]]},"assertion":[{"value":"Indexed","URL":"https:\/\/f1000research.com\/articles\/7-1667\/v1#article-reports","order":0,"name":"referee-status","label":"Referee status","group":{"name":"current-referee-status","label":"Current Referee Status"}},{"value":"10.5256\/f1000research.18017.r39685, Howard J. 
Hamilton, Shubhashis Shil, Department of Computer Science, University of Regina, Regina, SK, Canada, 10 Dec 2018, version 1, indexed","URL":"https:\/\/f1000research.com\/articles\/7-1667\/v1#referee-response-39685","order":0,"name":"referee-response-39685","label":"Referee Report","group":{"name":"article-reports","label":"Article Reports"}},{"value":"10.5256\/f1000research.18017.r46574, Sayan Mukherjee, Department of Statistical Science, Duke University, Durham, NC, USA, 29 Apr 2019, version 1, indexed","URL":"https:\/\/f1000research.com\/articles\/7-1667\/v1#referee-response-46574","order":1,"name":"referee-response-46574","label":"Referee Report","group":{"name":"article-reports","label":"Article Reports"}},{"value":"This work was funded by the US Environmental Protection Agency. The work presented here does not necessarily reflect the views or policy of the EPA. Any mention of trade names does not constitute endorsement by the EPA. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.","order":2,"name":"grant-information","label":"Grant Information"},{"value":"This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.","order":0,"name":"copyright-info","label":"Copyright"}]}}