In the previous article, you learned about dirty data and the issues involved in data cleaning. Keeping that in mind, this article introduces the steps in data cleaning.
The data cleaning process involves five steps: data analysis, creating transformation rules, verifying the rules, applying the transformation, and backflow.
Step 1: Data Analysis
Before you clean the data, you must analyze it using the definition of dirty data: data is dirty when it lacks common standards, has missing or duplicate values, is factually incorrect, or contains inconsistencies such as inconsistent associations, semantic inconsistencies, and referential inconsistencies.
Identifying such data is therefore the first step towards cleaning the data set. You must also obtain the metadata, which describes the expected structure, types, and value ranges of the data and drives many of the detection techniques discussed later.
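As an illustration, here is a minimal profiling sketch in Python using pandas that gathers this kind of metadata; the file name customers.csv and its columns are hypothetical placeholders for your own data set.

```python
import pandas as pd

# Load the data set to be analyzed (file name is a placeholder).
df = pd.read_csv("customers.csv")

# Collect basic metadata per column: type, missing values, cardinality,
# and, for numeric columns, the value range and spread.
for col in df.columns:
    print(f"--- {col} ---")
    print("dtype:   ", df[col].dtype)
    print("missing: ", df[col].isna().sum())
    print("distinct:", df[col].nunique())
    if pd.api.types.is_numeric_dtype(df[col]):
        print("min/max: ", df[col].min(), df[col].max())
        print("mean/std:", df[col].mean(), df[col].std())

# Exact duplicate rows are a first hint of duplicate data.
print("duplicate rows:", df.duplicated().sum())
```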
Step 2: Create Transformation Rules
Transformation rules convert data from its “dirty” form into the required “clean” form. The transformation can be applied at either the schema level or the data level.
Schema-level transformation is necessary when the database schemas are poorly designed or the data comes from multiple sources; every database has its own schema structure, and these structures must be reconciled and optimized.
The data set may also have missing or duplicate data and incorrect aggregated values. The transformation rules at both the schema and data level therefore depend on the source of the data: does it come from a single source or from multiple sources?
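To make this concrete, the sketch below expresses two data-level rules as plain Python functions. The rules, column names, and formats are illustrative assumptions, not a fixed recipe.

```python
import pandas as pd

def normalize_phone(value) -> str:
    """Keep digits only, so '(01) 234-567' and '01 234 567' compare equal."""
    return "".join(ch for ch in str(value) if ch.isdigit())

def standardize_date(value):
    """Parse mixed date formats into one type; unparseable values become NaT."""
    return pd.to_datetime(value, errors="coerce")

# Keeping the rules in one mapping makes them easy to review, test, and apply.
TRANSFORMATION_RULES = {
    "phone": normalize_phone,         # data-level rule: normalize formatting
    "signup_date": standardize_date,  # data-level rule: unify date representation
}
```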
Step 3: Rule Verification
Once the rules are in place, you must verify that they work by applying them to test data sets.
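For example, a rule can be verified on a small hand-made test set before it touches real data. The sketch below checks the hypothetical phone-normalization rule from the previous step; in practice such checks would live in a test suite such as pytest.

```python
import pandas as pd

def normalize_phone(value) -> str:
    """Keep digits only, so different spellings of one number compare equal."""
    return "".join(ch for ch in str(value) if ch.isdigit())

# A tiny test data set containing known variants of the same phone number.
test_df = pd.DataFrame({"phone": ["(01) 234-567", "01 234 567", "01.234.567"]})
cleaned = test_df["phone"].map(normalize_phone)

# The rule works if all variants collapse to one canonical value.
assert cleaned.nunique() == 1, "rule failed to unify equivalent values"
print(cleaned.tolist())  # ['01234567', '01234567', '01234567']
```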
Step 4: Transformation
Once you are satisfied with the test results, apply the transformation rules to the actual data set.
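A sketch of this step, reusing the hypothetical rules and customers.csv file from the earlier examples, might look like this.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # the actual data set (placeholder name)

# The verified rules from Step 2, keyed by the column they apply to.
rules = {
    "phone": lambda v: "".join(ch for ch in str(v) if ch.isdigit()),
    "signup_date": lambda v: pd.to_datetime(v, errors="coerce"),
}

for column, rule in rules.items():
    df[column] = df[column].map(rule)

# Normalization often makes duplicates visible; remove exact duplicate rows.
df = df.drop_duplicates()
df.to_csv("customers_clean.csv", index=False)
```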
Step 5: Backflow
Re-populating the data sources with the cleaned data is called backflow.
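A minimal backflow sketch, assuming the source is a relational database reachable through SQLAlchemy (with an appropriate database driver installed); the connection string, file name, and table name are placeholders for your own environment.

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection string and table name are assumptions; adapt them to your source.
engine = create_engine("postgresql://user:password@localhost/warehouse")

cleaned = pd.read_csv("customers_clean.csv")

# Re-populate the source table with the cleaned data.
cleaned.to_sql("customers", engine, if_exists="replace", index=False)
```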
Data Analysis Techniques
Many data analysis techniques can be used to detect dirtiness in data. The table below summarizes common problems and the metadata or techniques used to detect them.
| Problem to Be Detected | Metadata / Technique Used |
| --- | --- |
| Illegal values | (min, max), (mean, deviation), cardinality |
| Spelling mistakes | Hashing, n-gram outliers |
| Lack of standards | Column comparison (compare value sets for a given column across tables) |
| Duplicate and missing values | Compare cardinality with the row count, detect nulls, use rules to predict incorrect or missing values |
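As one example of these techniques, the sketch below compares the value sets of the same logical column across two tables; values that appear in only one table often point to a lack of common standards. The tables and values are made up for illustration.

```python
import pandas as pd

# Two hypothetical tables that should use the same country coding.
orders = pd.DataFrame({"country": ["US", "DE", "United States"]})
customers = pd.DataFrame({"country": ["US", "DE", "FR"]})

only_in_orders = set(orders["country"]) - set(customers["country"])
only_in_customers = set(customers["country"]) - set(orders["country"])

print("Only in orders:   ", only_in_orders)     # {'United States'} -> mixed standards
print("Only in customers:", only_in_customers)  # {'FR'}
```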
In the next article, we will discuss transformation algorithms that remove the dirtiness from data.