Automated Data Cleansing

Short Description

The project analyzes the problem of data cleansing and of automatically identifying potential errors in data sets. Methods for error detection that go beyond integrity analysis are reviewed and presented. The applicable methods include statistical outlier detection, pattern matching, clustering, and data mining techniques. A new extension of Boolean association rules, ordinal association rules, which incorporates ordinal relationships among data items, is introduced. One use of ordinal rules is to identify possible errors in data, and a method that finds these rules and identifies potential errors is proposed.

Project Details

The quality of a large real-world data set depends on a number of issues, but the source of the data is the crucial factor. Data entry and acquisition are inherently prone to errors, both simple and complex. Much effort can be devoted to this front-end process to reduce entry errors, but the fact often remains that errors in a large data set are common. Unless an organization takes extreme measures to avoid data errors, field error rates are typically around 5% or more. To address this problem, some organizations spend millions of dollars per year to detect data errors. Since the manual process of data cleansing is also laborious, time consuming, and itself prone to errors, methods that automatically detect or assist in detecting errors are of great interest.

The process of automated data cleansing is multi-faceted, and a number of problems must be addressed to solve any particular data cleansing problem. A generic framework for automated data cleansing consists of the following elements:

  1. Define and determine the types of errors that can occur in the data.
  2. Search the data and identify instances of these errors.
  3. Correct the errors that are uncovered.

This project focuses on the first two elements of the data cleansing framework. More precisely, we are trying to define methods that help define and identify errors in data when domain knowledge is missing or out of date.

One such method was proposed, based on data mining techniques for uncovering patterns in data. We define an extension of Boolean association rules: ordinal association rules, or simply ordinal rules. The objective is to find ordinal relationships between record attributes that tend to hold over a large percentage of records. If attribute A is less than attribute B most of the time, then a record in which B is less than A may be in error. One flag on B may not mean much, but if a number of ordinal rules involving B are broken, the likelihood of error goes up. The following defines this concept more formally.

Definition. Let R = {r1, r2, …, rn} be a set of records, where each record is a set of k attributes (a1, …, ak). Each attribute ai in a particular record rj has a value f(rj, ai) from a domain D. The value of an attribute may also be empty, so the empty value is included in D. The following relations (partial orderings) are defined over D: less than or equal (<=), equal (=), and greater than or equal (>=), all with their standard meanings.

Then (a1, a2, a3, …, am) --> (a1 o1 a2 o2 a3 … om-1 am), where each oi is one of <=, =, or >=, is an ordinal association rule if:

1) a1, …, am occur together (are all non-empty) in at least s% of the n records, where s is the support of the rule; and 2) there is a subset of records R' included in R such that, for every rj in R', a1, …, am occur together and f(rj, a1) o1 … om-1 f(rj, am) is true. Thus |R'| is the number of records for which the rule holds, and the confidence c of the rule is the percentage of records for which it holds: c = |R'|/|R|.
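To illustrate with invented numbers: suppose attributes a1 and a2 occur together (are both non-empty) in 98 of n = 100 records, so any rule over them has support 98%; if f(rj, a1) <= f(rj, a2) additionally holds in 93 of those records, then |R'| = 93 and the rule (a1, a2) --> (a1 <= a2) has confidence c = 93/100 = 93%.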

As can be seen from the definition, ordinal rules can be defined over two or more attributes. The work here currently focuses on ordinal rules that use only two attributes.
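To make the two-attribute case concrete, the following is a minimal Python sketch of how such rules could be mined. It assumes each record is a dictionary mapping attribute names to comparable values, with None standing in for the empty value; the function name, parameters, and data layout are illustrative, not taken from the project.

    from itertools import combinations

    # The three partial orderings from the definition above.
    COMPARISONS = {"<=": lambda x, y: x <= y,
                   "=":  lambda x, y: x == y,
                   ">=": lambda x, y: x >= y}

    def find_ordinal_rules(records, min_support=0.5, min_confidence=0.9):
        """Return (a, op, b, support, confidence) tuples for two-attribute rules."""
        n = len(records)
        if n == 0:
            return []
        attributes = sorted({a for r in records for a in r})
        rules = []
        for a, b in combinations(attributes, 2):
            # Condition 1: records where both attributes are non-empty.
            both = [r for r in records
                    if r.get(a) is not None and r.get(b) is not None]
            support = len(both) / n
            if support < min_support:
                continue
            for op_name, op in COMPARISONS.items():
                holds = sum(1 for r in both if op(r[a], r[b]))
                # Condition 2: following the definition above, confidence is
                # taken over all n records (c = |R'|/|R|).
                confidence = holds / n
                if confidence >= min_confidence:
                    rules.append((a, op_name, b, support, confidence))
        return rules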

The process to identify potential errors in data sets using ordinal association rules is composed of the following steps:

  1. Prepare the data.
  2. Find ordinal rules with a minimum confidence c.
  3. Identify data attributes that break the rules and can be considered potential errors (see the sketch below).
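Continuing the sketch above (and reusing its COMPARISONS table), step 3 could be implemented by re-checking each discovered rule against every record and flagging an attribute value as a potential error when it participates in several broken rules; the min_violations threshold is an invented parameter for illustration.

    from collections import Counter

    def flag_potential_errors(records, rules, min_violations=2):
        """Return (record index, attribute, broken-rule count) triples."""
        broken = Counter()
        for i, r in enumerate(records):
            for a, op_name, b, _support, _confidence in rules:
                x, y = r.get(a), r.get(b)
                if x is None or y is None:
                    continue  # the rule says nothing about empty values
                if not COMPARISONS[op_name](x, y):
                    # Either attribute could be at fault; flag both and let
                    # repeated violations single out the likely error.
                    broken[(i, a)] += 1
                    broken[(i, b)] += 1
        return [(i, attr, count) for (i, attr), count in broken.items()
                if count >= min_violations]

As in the discussion above, a single broken rule is only weak evidence; requiring several violations concentrates attention on the most suspicious values.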

Here, the role played by the support of a rule differs from the typical data mining problem. In error detection, we are looking for rules that hold for a large number of records. However, a rule may hold for only a small percentage of the overall records and still not indicate an error. For now, we assume that every discovered rule that holds for more than two records represents a valid possible partial ordering. Future work will investigate user-specified minimum support.

The method was applied to personnel data supplied by Navy Personnel Research, Studies, and Technology (NPRST).

Funding

The project was funded by the Office of Naval Research through grant N00014-99-1-0730.

People

Papers