Dirty Data And Data Cleaning

The first problem in data mining is that you need proper data, and that is rarely what you get when the data comes from different sources. Data from heterogeneous sources brings several problems with it, and it must be cleaned and transformed into a standard form before it can be mined. In this article, we will talk about dirty data and the issues in cleaning it.

The problems with dirty data are listed here.

  • Lack of standardization
  • Missing, spurious and duplicate data
  • Incorrect or Inconsistent data

Lack Of Standardization

Consider the internet, where data is available in multiple languages, encodings, and locales. There is no common standard for the data.

One user may write the full name, “Mahatma Gandhi Road”, while another writes the abbreviation “M.G. Road”. Both mean the same thing, but they are written in different ways.

Then there is the problem of semantic equivalence: “Chennai” is the same as “Madras”, and “Mumbai” is the same as “Bombay”.

People also use multiple standards of measurement: 1 mile is the same as about 1.6 kilometers.
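The kind of standardization described above can be sketched in Python. This is a minimal illustration, not a complete solution: the lookup tables ABBREVIATIONS and SYNONYMS and the function names are assumptions made up for this example, and a real project would need much larger, domain-specific tables.

```python
import re

# Hypothetical lookup tables; real ones would be far larger.
ABBREVIATIONS = {"m.g. road": "mahatma gandhi road"}
SYNONYMS = {"madras": "chennai", "bombay": "mumbai"}
KM_PER_MILE = 1.609344


def normalize_text(value: str) -> str:
    """Lowercase, collapse whitespace, then map abbreviations and old names."""
    cleaned = re.sub(r"\s+", " ", value.strip().lower())
    cleaned = ABBREVIATIONS.get(cleaned, cleaned)
    return SYNONYMS.get(cleaned, cleaned)


def to_kilometers(amount: float, unit: str) -> float:
    """Convert a distance to kilometers so all records share one unit."""
    unit = unit.strip().lower()
    if unit in ("km", "kilometer", "kilometers"):
        return amount
    if unit in ("mi", "mile", "miles"):
        return amount * KM_PER_MILE
    raise ValueError(f"unknown distance unit: {unit!r}")
```

With these tables, normalize_text("M.G.  Road") and normalize_text("Mahatma Gandhi Road") give the same string, so the two spellings can be compared directly. Values that are not in the tables pass through unchanged, which is why the tables need domain knowledge to build.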

Missing, Spurious, and Duplicate Data

Given a form to fill in, users enter information differently. Some people leave fields such as age blank.

Incorrectly entered (spurious) values are common. In relational databases, duplicate data is also common, even when the database has been normalized.

There is semantic duplication as well. For instance, B.M.Krishna may appear as M.Balakrishna in another data set.
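As a rough sketch, missing fields and simple duplicates can be flagged with a few lines of Python. The record layout (dictionaries with "name" and "age" keys), the REQUIRED_FIELDS tuple, and the function names are assumptions for illustration only.

```python
import re

REQUIRED_FIELDS = ("name", "age")  # assumed schema for this example


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent, None, or empty."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def name_key(name: str) -> str:
    """Drop case, spaces and punctuation: 'B. M. Krishna' -> 'bmkrishna'."""
    return re.sub(r"[^a-z]", "", name.lower())


def find_duplicates(records: list) -> list:
    """Return (first_index, repeat_index) pairs of records sharing a name key."""
    seen = {}
    pairs = []
    for i, rec in enumerate(records):
        key = name_key(rec.get("name") or "")
        if not key:
            continue  # a missing name cannot be matched
        if key in seen:
            pairs.append((seen[key], i))
        else:
            seen[key] = i
    return pairs
```

This catches "B.M.Krishna" against "B. M. Krishna", but not against "M.Balakrishna". A semantic duplicate of that kind needs knowledge about how names are written, which simple string matching does not have.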

Incorrect Or Inconsistent Data

Sometimes users enter codes that are incorrect and have no meaning. For example, using 0/1 instead of M/F to identify gender is incorrect.

Codes that are inconsistent or outdated should not be used. For example, the travel eligibility code ‘C’, which denotes ‘IIIrd class’, is no longer in use.

Inconsistent duplicate data: two records are found to belong to the same person but carry different information, for example different addresses.

Inconsistent association: the sales figures provided by the marketing department do not add up to the total sales figures reported by the retail units.

Semantic inconsistencies: a user writes Feb 31st, but there is no Feb 31st in the calendar.

Referential inconsistencies: sales of Rs. 10 million are reported from a unit that has already been closed.
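Some of these inconsistencies can be detected with simple rule checks. The Python sketch below is only an illustration: the code list VALID_GENDER_CODES, the set CLOSED_UNITS, the unit id "U7", and the record fields are all hypothetical, and the rules would have to come from someone who knows the data.

```python
from datetime import date

VALID_GENDER_CODES = {"M", "F"}  # assumed list of accepted codes
CLOSED_UNITS = {"U7"}            # hypothetical ids of closed units


def is_valid_code(code: str) -> bool:
    """Check a gender code against the accepted list."""
    return code in VALID_GENDER_CODES


def is_real_date(year: int, month: int, day: int) -> bool:
    """Return False for dates that do not exist, such as Feb 31st."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def check_sale(record: dict) -> list:
    """Collect problems in a sale record with 'unit', 'amount' and 'date'."""
    problems = []
    if not is_real_date(*record["date"]):
        problems.append("impossible date")
    if record["unit"] in CLOSED_UNITS and record["amount"] > 0:
        problems.append("sale from closed unit")
    return problems


def totals_agree(department_total, unit_totals, tolerance=0.01) -> bool:
    """Check that the per-unit figures add up to the reported total."""
    return abs(department_total - sum(unit_totals)) <= tolerance
```

Checks like these only flag a record. Deciding which of two conflicting values is the right one still needs a person or an outside source of truth.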

Issues In Data Cleaning

Now that you know what dirty data looks like, we must discuss the issues in cleaning it. The process of data cleaning cannot be fully automated.

The mining process depends heavily on the GIGO (garbage in, garbage out) principle: the quality of the input determines the quality of the output. Therefore, unclean data will not produce good results.

To clean the data, we need considerable knowledge that is not explicit and lies beyond the purview of the warehouse, such as metrics, geography, government policies, etc.

Note that we haven’t discussed the multiple sources of data, which add to the complexity. The complexity also increases with the span of history that is taken up for cleaning.

In the next article, we will discuss the steps taken to clean the data.

