Project

Probabilistic Data Cleaning

The goal of this project is to study and develop probabilistic data cleaning techniques. Data cleaning refers to the process of detecting and repairing errors, duplicates and anomalies in data. In response to the large amounts of “dirty” data in today’s digital society, the data quality problem is attracting considerable interest from various disciplines in computer science. For instance, since most data resides in databases, efficient database techniques have been developed to improve the quality of data. These database techniques are mostly non-probabilistic in the sense that data is either clean or dirty, two objects are either the same or different, and repairs of the data are “one-shot”. That is, a single repair of the data is returned to the user, without any information on (a) why this repair is returned; (b) how reliable this repair is; and (c) whether other possible repairs of comparable quality exist. Clearly, such information is of great importance for assessing the quality of these techniques. What is needed is a probabilistic approach to the data quality problem that provides assurances about the decisions made during the cleaning process. The cornerstone of this project is the observation that many problems studied in probabilistic logic have a direct counterpart in research on data quality in databases, and vice versa. In this project we leverage these relationships to provide a solid foundation for data quality in a probabilistic setting.

Date: 1 Jan 2013 → 31 Dec 2016

Keywords: G.0062.13

Disciplines: Applied mathematics in specific fields

See also: Probabilistic data cleaning.
