Probabilistic Data Cleaning
The goal of this project is to study and develop probabilistic data cleaning techniques. Data cleaning refers to the process of detecting and repairing errors, duplicates, and anomalies in data. In response to the large amounts of "dirty" data in today's digital society, the data quality problem has attracted considerable interest across disciplines in computer science. For instance, since most data resides in databases, efficient database techniques have been developed to improve data quality.

These database techniques are mostly non-probabilistic, in the sense that data is either clean or dirty, two objects are either the same or different, and repairs of the data are "one-shot". That is, a single cleaned repair of the data is returned to the user, without any information on (a) why this repair is returned; (b) how reliable this repair is; and (c) whether other possible repairs of comparable quality exist. Such information is clearly of great importance for assessing the quality of these techniques. What is needed is a probabilistic approach to the data quality problem that provides assurances about the decisions made during the cleaning process.

The cornerstone of this project is the observation that many problems studied in probabilistic logic have a direct counterpart in research on data quality in databases, and vice versa. In this project we leverage these relationships to provide a solid foundation for data quality in a probabilistic setting.
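To make the contrast with "one-shot" repairs concrete, the toy sketch below enumerates several candidate repairs of a table violating a functional dependency (zip determines city) and assigns each a probability, rather than silently returning a single cleaned instance. All names (`candidate_repairs`, `repair_probability`, the scoring by fraction of unchanged cells) are illustrative assumptions, not part of any specific system described in this project.

```python
# Toy table with a functional dependency zip -> city; the last row violates it.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},  # likely dirty
]

def candidate_repairs(rows):
    """Enumerate repairs restoring zip -> city by choosing one city for all rows."""
    for city in sorted({r["city"] for r in rows}):
        yield [{"zip": r["zip"], "city": city} for r in rows]

def repair_probability(repair, rows):
    """Naive likelihood: the fraction of cells a repair leaves unchanged."""
    unchanged = sum(1 for a, b in zip(repair, rows) if a["city"] == b["city"])
    return unchanged / len(rows)

scored = [(repair_probability(rep, rows), rep) for rep in candidate_repairs(rows)]
# Normalize the scores into a probability distribution over candidate repairs.
total = sum(p for p, _ in scored)
ranked = sorted(((p / total, rep) for p, rep in scored),
                reverse=True, key=lambda t: t[0])

for prob, rep in ranked:
    print(f"P = {prob:.2f}: all rows get city = {rep[0]['city']!r}")
# → P = 0.67: all rows get city = 'New York'
# → P = 0.33: all rows get city = 'Newark'
```

Unlike a one-shot repair, the output exposes exactly the information items (a)-(c) above: every repair is justified by the constraint it restores, carries a reliability score, and alternative repairs are shown alongside the most likely one.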