Conditional Functional Dependency Discovery with Structure Learning in Probabilistic Graphical Models

  • Abstract: Conditional functional dependencies (CFDs) generalize traditional functional dependencies and are widely used in data quality management and data cleaning. Existing discovery methods typically find all CFDs that hold on the relational data, yet only a small fraction of them, those meaningful for error detection, are actually used in data cleaning. The result is a large number of irrelevant CFDs and an expensive post-processing step to select the relevant ones. We instead treat CFD discovery as structure learning in a probabilistic graphical model, solved via sparse regression: the dirty dataset is transformed, the inverse covariance of the transformed dataset is estimated, and that estimate is decomposed to obtain an autoregression matrix, from which we learn the CFDs that characterize the data distribution. Experimental results show that the method effectively discovers a small number of CFDs useful for error detection and is more effective than commonly used CFD discovery methods.
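The pipeline named in the abstract (transform the data, estimate the inverse covariance, decompose it into an autoregression matrix) can be illustrated with a minimal sketch. This is not the authors' implementation, and several choices are assumptions the abstract does not specify: the transformation is taken to be a 0/1 tuple-pair agreement vector, a small ridge term stands in for a sparse estimator such as the graphical lasso, the attribute order is fixed by hand, and all function names and the toy relation are hypothetical. Under a linear model X = BᵀX + ε with B strictly upper triangular, the precision matrix factors as Θ = (I − B) D (I − B)ᵀ, so B can be read off a unit upper-triangular factorization of Θ.

```python
from itertools import combinations

def agreement_vectors(rows):
    """Assumed transformation: one 0/1 vector per tuple pair, 1 where the pair agrees."""
    return [[1.0 if a == b else 0.0 for a, b in zip(r, s)]
            for r, s in combinations(rows, 2)]

def covariance(samples, ridge=1e-3):
    """Empirical covariance; the ridge term is a stand-in for sparse estimation."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(d)]
    cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / n
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge
    return cov

def invert(m):
    """Gauss-Jordan inversion with partial pivoting (fine for small matrices)."""
    d = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(d)]
           for i, row in enumerate(m)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(d):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[d:] for row in aug]

def udu_decompose(theta):
    """Factor theta = U * diag(D) * U^T with U unit upper triangular."""
    d = len(theta)
    U = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    D = [0.0] * d
    for j in reversed(range(d)):
        D[j] = theta[j][j] - sum(U[j][k] ** 2 * D[k] for k in range(j + 1, d))
        for i in range(j):
            s = sum(U[i][k] * U[j][k] * D[k] for k in range(j + 1, d))
            U[i][j] = (theta[i][j] - s) / D[j]
    return U, D

def autoregression_matrix(rows):
    """Return B = I - U, plus the intermediate matrices for inspection."""
    theta = invert(covariance(agreement_vectors(rows)))
    U, D = udu_decompose(theta)
    d = len(U)
    B = [[(1.0 if i == j else 0.0) - U[i][j] for j in range(d)] for i in range(d)]
    return B, theta, U, D

# Hypothetical toy relation (zip, city, name) in which zip determines city.
zip_to_city = {"10001": "NYC", "10002": "NYC", "60601": "CHI", "60602": "CHI"}
rows = [(z, c, n) for z, c in zip_to_city.items() for n in ("ann", "bob", "eve")]

B, theta, U, D = autoregression_matrix(rows)
for attr, weights in zip(("zip", "city", "name"), B):
    print(attr, [round(w, 3) for w in weights])
```

In this toy relation the entry B[zip][city] receives the largest weight, reflecting the dependency zip → city, while B[city][name] stays close to zero. A real pipeline would replace the ridge term with a sparse estimator so that weak entries are driven exactly to zero, and would need a principled way to choose the attribute order.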

     
