Abstract:
Conditional functional dependencies (CFDs) generalize functional dependencies and are widely employed in data quality and data cleaning. Existing CFD discovery methods typically find all CFDs that hold on the data, yet only a small number of them detect the errors users actually care about; this yields a mass of meaningless CFDs and requires an expensive post-processing step to select the relevant ones. In fact, CFD discovery corresponds to structure learning, i.e., solving a sparse regression over a probabilistic graphical model. By transforming the dirty dataset, estimating the inverse covariance matrix of the transformed data, and decomposing it to obtain an autoregression matrix, we can capture the conditional functional dependencies that characterize the distribution of the dataset. Experiments show that our method effectively finds a small number of CFDs usable for error detection, and that it outperforms state-of-the-art CFD discovery methods.
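The pipeline outlined above (transform the data, estimate the inverse covariance, decompose it into an autoregression matrix) can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the paper's actual method: the dummy encoding, the ridge-regularized precision estimate, and the Cholesky-based decomposition are stand-ins for whatever transformation and sparse estimator the paper uses, and all names are hypothetical.

```python
import numpy as np

# Hypothetical toy relation: "zip" functionally depends on "city", while
# "phone" is independent. Attribute names and the encoding are illustrative.
rng = np.random.default_rng(0)
n = 200
city = rng.integers(0, 3, n)
zipc = city                      # zip is a copy of city: a perfect dependency
phone = rng.integers(0, 2, n)    # independent of both

def dummies(col, k):
    # Full-rank dummy coding: drop the last category to avoid singularity.
    return np.eye(k)[col][:, :-1]

# Transformed dataset: columns 0-1 = city, 2-3 = zip, 4 = phone.
X = np.hstack([dummies(city, 3), dummies(zipc, 3), dummies(phone, 2)])
X = X - X.mean(axis=0)           # center columns

# Ridge-regularized inverse-covariance (precision) estimate; a simple
# stand-in for a sparse estimator such as the graphical lasso.
cov = (X.T @ X) / n + 1e-2 * np.eye(X.shape[1])
theta = np.linalg.inv(cov)

# Decompose the precision matrix as theta = T.T @ D^2 @ T with T unit upper
# triangular: row i of T holds the coefficients of regressing column i on the
# later columns, so large off-diagonal entries suggest dependencies.
L = np.linalg.cholesky(theta)    # theta = L @ L.T, L lower triangular
T = (L / np.diag(L)).T           # unit upper triangular autoregression matrix

coupling = np.abs(T[:2, 2:4]).max()    # city -> zip (a real dependency)
independent = np.abs(T[:2, 4:5]).max() # city -> phone (no dependency)
print(coupling, independent)
```

In this toy run the city/zip entries of the autoregression matrix are close to 1 while the city/phone entries stay near 0, which is the kind of signal one would threshold to propose candidate dependencies.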