Data mining draws on techniques from multiple disciplines, such as statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, signal processing, and image processing. Beyond the techniques, we also need to understand the kinds of data that data mining works on.
Data Sources
Data mining can be applied to all kinds of data repositories, including data streams and the web. Here is a list of data sources supported by data mining technologies.
- Relational databases
- Data warehouses
- Transactional databases
- Advanced database systems
- Flat files
- Data streams
- World Wide Web
Each data repository requires different techniques. In the rest of the article, we will discuss each kind of database and give an overview of the database systems.
Relational Databases
A relational database is managed by a relational database management system (RDBMS). It is a collection of tables connected via relationships. The tables are known as relations and consist of columns and rows. The columns are attributes or fields, and the rows are tuples or records.
Each tuple is uniquely identified by a key, which is a set of one or more attributes.
The design of a relational database usually starts with a semantic model of the data called an E-R (entity-relationship) diagram. For example, let us create a new relational database for ITStore, a fictitious company that sells computers and accessories.
The first step is to define the schema for the relations in the database.
Customer(ID(key), customer name, address, age, occupation, annual income, credit information, category)
Item(Item_ID(key), name, brand, category, type, price, place made, supplier, cost)
Employee(Emp_ID(key), name, category, group, salary, commission)
Branch(Branch_ID(key), name, address)
Purchase(Trans_ID(key), ID(key), Emp_ID(key), date, time, payment method, amount)
Item_Sold(Trans_ID(key), Item_ID, qty)
Works_At(Emp_ID(key), Branch_ID)
The next step is to create the relations and set up the relationships between them so that we can query the database using SQL and get results.
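As a minimal sketch of this step, the snippet below creates a reduced version of three of the relations above in an in-memory SQLite database. The choice of SQLite, the column types, the trimmed column lists, and the sample rows are all assumptions made for illustration; the article does not prescribe a particular RDBMS.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Reduced versions of Customer, Employee and Purchase (assumed column types).
conn.executescript("""
CREATE TABLE Customer (
    ID            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    age           INTEGER,
    annual_income REAL
);
CREATE TABLE Employee (
    Emp_ID INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
);
CREATE TABLE Purchase (
    Trans_ID INTEGER PRIMARY KEY,
    ID       INTEGER NOT NULL REFERENCES Customer(ID),
    Emp_ID   INTEGER NOT NULL REFERENCES Employee(Emp_ID),
    date     TEXT,
    amount   REAL
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO Customer VALUES (1, 'Ann', 34, 52000.0)")
conn.execute("INSERT INTO Employee VALUES (10, 'Raj')")
conn.execute("INSERT INTO Purchase VALUES (100, 1, 10, '2024-01-15', 899.0)")

# A purchase that points at a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO Purchase VALUES (101, 999, 10, '2024-01-16', 50.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The REFERENCES clauses are what turn separate tables into related ones: the database itself refuses a Purchase row whose customer or employee is unknown.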
A query is transformed into relational operators such as join, selection, and projection, and is then optimized for efficient processing.
Relational queries can also use aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum).
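To make the operators concrete, here is a small sketch of one SQL query that uses a join, a selection, a projection, and the sum aggregate. It again assumes SQLite, and the table contents and the price threshold are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced Item and Item_Sold relations with hypothetical sample data.
conn.executescript("""
CREATE TABLE Item (Item_ID INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE Item_Sold (Trans_ID INTEGER, Item_ID INTEGER, qty INTEGER);
INSERT INTO Item VALUES (1, 'Laptop', 900.0), (2, 'Mouse', 20.0), (3, 'Monitor', 150.0);
INSERT INTO Item_Sold VALUES (100, 1, 1), (100, 2, 2), (101, 2, 1), (102, 3, 2);
""")

rows = conn.execute("""
    SELECT i.name,                              -- projection: keep only some columns
           SUM(s.qty)           AS units,       -- aggregate: total units sold
           SUM(s.qty * i.price) AS revenue      -- aggregate: total revenue
    FROM Item_Sold AS s
    JOIN Item AS i ON i.Item_ID = s.Item_ID     -- join: match sales to items
    WHERE i.price < 500                         -- selection: keep only cheaper items
    GROUP BY i.name
    ORDER BY i.name
""").fetchall()

for name, units, revenue in rows:
    print(name, units, revenue)
```

The database engine, not the author of the query, decides the order in which the join and the selection are executed. That choice is the optimization step mentioned above.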
Data mining on a relational database searches for trends or data patterns. For example, in the relational database above, we can predict the credit risk of new customers based on customer income, age, and credit information.
We can also compare the sales of the current year with the sales of the previous year.
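The year-over-year comparison can be sketched in a few lines of Python. The purchase records and the helper name yearly_change are hypothetical; in practice the (date, amount) pairs would come from the Purchase relation.

```python
from collections import defaultdict

# Hypothetical (date, amount) pairs as they might be read from Purchase.
purchases = [
    ("2023-03-01", 1200.0),
    ("2023-11-20", 800.0),
    ("2024-02-14", 1500.0),
    ("2024-07-04", 1000.0),
]

# Total sales per calendar year.
totals = defaultdict(float)
for date, amount in purchases:
    totals[int(date[:4])] += amount
sales_by_year = dict(totals)  # plain dict: an unknown year raises KeyError


def yearly_change(sales, year):
    """Return the relative change in sales from the previous year to `year`."""
    previous = sales[year - 1]
    return (sales[year] - previous) / previous


change = yearly_change(sales_by_year, 2024)
print(f"Sales changed by {change:.0%} compared with 2023")
```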
Data Warehouses
Consider the previous example of ITStore. Suppose the CEO of the company wants to know the “sales of each item” at “each branch” for the “last 6 months”. Since the data is located at different branches, analyzing it is a difficult task.
However, if ITStore had a data warehouse, the task would be easy. A data warehouse is a single repository that holds data from many sources under a unified schema.
A data warehouse is constructed via a process of
- data cleaning
- data integration
- data transformation
- periodic data refreshing
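The four steps above can be sketched as follows. Everything in the snippet is an assumption made for illustration: two branches export sales records in different formats, and a refresh function loads them into one list that stands in for the warehouse.

```python
# Hypothetical raw exports from two branches with different conventions.
branch_a = [
    {"item": "Laptop ", "qty": "2", "price_usd": "900"},
    {"item": "mouse", "qty": "", "price_usd": "20"},  # missing quantity
]
branch_b = [
    {"product": "LAPTOP", "units": 1, "price_cents": 90000},
]


def from_branch_a(rec):
    """Clean and transform one branch A record into the unified schema."""
    if rec["qty"] == "":
        return None  # data cleaning: discard incomplete records
    return {
        "item": rec["item"].strip().lower(),  # cleaning: normalize names
        "qty": int(rec["qty"]),               # transformation: text to number
        "price": float(rec["price_usd"]),
        "branch": "A",
    }


def from_branch_b(rec):
    """Transform one branch B record into the unified schema."""
    return {
        "item": rec["product"].strip().lower(),
        "qty": int(rec["units"]),
        "price": rec["price_cents"] / 100,    # transformation: cents to dollars
        "branch": "B",
    }


def refresh(warehouse, a_rows, b_rows):
    """One periodic refresh: integrate both sources into the warehouse."""
    for row in map(from_branch_a, a_rows):
        if row is not None:
            warehouse.append(row)
    warehouse.extend(from_branch_b(rec) for rec in b_rows)
    return warehouse


warehouse = refresh([], branch_a, branch_b)
```

After the refresh, both branches describe the same product as "laptop" with a price in dollars, which is what lets later analysis treat them as one data set.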
The data in the data warehouse is organized around major subjects such as customer, sales, items, suppliers, etc.
This data is stored to provide information from a historical perspective (5 to 10 years of data) and is typically summarized.
A data warehouse is modeled by a multidimensional database structure, where each dimension is an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sales amount.
The physical structure of the data warehouse may be a relational data store or a multidimensional data cube.
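A toy version of such a cube can be built with a dictionary keyed by one value per dimension. The fact rows, the three dimensions (item, branch, month), and the roll_up helper are assumptions for illustration, not a description of how a real warehouse engine stores its cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, branch, month, sales_amount).
facts = [
    ("Laptop", "North", "2024-01", 900.0),
    ("Laptop", "South", "2024-01", 1800.0),
    ("Mouse", "North", "2024-01", 40.0),
    ("Laptop", "North", "2024-02", 900.0),
]

# Each cell is keyed by one value per dimension and stores an aggregate measure.
cube = defaultdict(float)
for item, branch, month, amount in facts:
    cube[(item, branch, month)] += amount


def roll_up(cube, keep):
    """Sum the measure over every dimension whose index is not in `keep`."""
    result = defaultdict(float)
    for key, value in cube.items():
        result[tuple(key[i] for i in keep)] += value
    return dict(result)


# Sales of each item at each branch, summed over all months.
sales_by_item_branch = roll_up(cube, keep=(0, 1))
```

Rolling up over the month dimension gives the kind of answer the CEO asked for earlier: the sales of each item at each branch.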
Transactional Databases
A transactional database consists of a file where each record represents a transaction. Each transaction has a transaction ID and a list of the items involved in the transaction.
Data mining on a transactional database can answer queries like “Which items sold well together?”
We can answer the query by looking for sets of items that are frequently sold together.
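As a rough sketch, the snippet below counts how often each pair of items appears in the same transaction and keeps the pairs that reach a minimum count. The transactions, the frequent_pairs helper, and the threshold are hypothetical. This is plain brute-force pair counting, not an optimized frequent-itemset algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: transaction ID -> list of items.
transactions = {
    "T100": ["laptop", "mouse", "laptop bag"],
    "T101": ["laptop", "mouse"],
    "T102": ["monitor", "hdmi cable"],
    "T103": ["laptop", "mouse", "monitor"],
}


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least `min_support` transactions."""
    counts = Counter()
    for items in transactions.values():
        # sorted(set(...)) gives every unordered pair one canonical form
        # and ignores an item listed twice in the same transaction.
        counts.update(combinations(sorted(set(items)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```

Counting every pair is fine for a handful of transactions but grows quickly with the number of items per transaction, which is why dedicated algorithms are used on real data.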
Advanced Data And Information Systems
Advanced data and information systems are required by some modern applications that handle
- Maps (spatial data)
- Integrated circuit design, building design (engineering design data)
- Text, audio, video, images (HTML and other multimedia)
- Historical facts, stock exchange data (time-based data)
- Sensor data, CCTV recordings (stream data)
- Internet data (world wide web)
There are many advanced database systems to handle such data types. They are
- Object-Relational Database
- Spatial Database
- Temporal Database
- Spatial-temporal Database
- Text and Multimedia Database
- Heterogeneous and Legacy systems
- Data stream management systems
At this point we stop discussing the databases. In future articles, we will explore each of them in depth.