DWM Unit 4 Introduction To Data Mining
*Data Mining:
Data mining means searching for knowledge (interesting patterns or useful data) in data.
Data mining refers to the extraction of useful information from large amounts of data.
The data sources can include databases, datawarehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data.
Data mining is used by companies to learn customer preferences, determine the prices of
their products and services, and analyse the market.
Data mining is also known as knowledge discovery in databases (KDD). The KDD process
consists of the following steps:
1. Data cleaning:
Data cleaning removes noise and inconsistent data.
2. Data integration:
Multiple data sources may be combined.
3. Data selection:
The data relevant to the analysis task are retrieved from the database.
4. Data transformation:
The data are transformed and consolidated into forms appropriate for mining by performing
summary or aggregation operations.
PadmashriDr.VitthalraoVikhe Patil Institute of Technology & Engineering (POLYTECHNIC), Loni 0030
DWM 22621
That is, data of varied types from different data sources can be converted into a single
standard format.
5. Data mining:
Data mining is the process in which intelligent methods or algorithms are applied on data to
extract useful data patterns.
6. Pattern evaluation:
This process identifies the truly interesting patterns representing actual knowledge based
on user requirements for analysis.
7. Knowledge presentation:
In this process, visualization and knowledge representation techniques are used to present
mined knowledge to users for analysis.
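The steps above can be sketched as a minimal pipeline. This is an illustrative sketch only: the function names and the sample records are assumptions, not part of any standard library.

```python
# Minimal sketch of the KDD pipeline described above.
# All function names and the sample data are illustrative assumptions.

raw = [
    {"id": 1, "age": 25, "income": 30000},
    {"id": 2, "age": None, "income": 32000},   # incomplete record
    {"id": 3, "age": 40, "income": 90000},
]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: consolidate income into summary categories."""
    return [{**r, "income_level": "high" if r["income"] > 50000 else "low"}
            for r in records]

def mine(records):
    """Data mining: extract a simple pattern (count per income level)."""
    pattern = {}
    for r in records:
        pattern[r["income_level"]] = pattern.get(r["income_level"], 0) + 1
    return pattern

patterns = mine(transform(select(clean(raw), ["id", "age", "income"])))
print(patterns)  # {'low': 1, 'high': 1}
```

Pattern evaluation and knowledge presentation would then rank and visualize `patterns` for the user.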
2. Relational Databases:
A Relational database is defined as the collection of data organized in tables with
rows and columns.
Physical schema in Relational databases is a schema which defines the structure
of tables.
Logical schema in Relational databases is a schema which defines the
relationship among tables.
The standard API of relational databases is SQL.
Application: Data Mining, ROLAP model, etc.
3. DataWarehouse:
A datawarehouse is defined as a collection of data integrated from multiple
sources that supports queries and decision making.
There are three types of datawarehouse: Enterprise datawarehouse, Data
Mart and Virtual Warehouse.
Application: Business decision making, Data mining, etc.
4. Transactional Databases:
Transactional databases are collections of data organized by time stamps, dates,
etc., to represent transactions in databases.
This type of database has the capability to roll back or undo its operation when a
transaction is not completed or committed.
It is a highly flexible system: users can modify information without affecting any
sensitive information.
Transactional databases follow the ACID properties of a DBMS.
Application: Banking, Distributed systems, Object databases, etc.
5. Multimedia Databases:
Multimedia databases consist of audio, video, image and text media.
They can be stored in object-oriented databases.
They are used to store complex information in a pre-specified format.
Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.
6. Spatial Database:
Store geographical information.
Stores data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
7. Time-series Databases:
Time-series databases contain data such as stock exchange data and user-logged activities.
They often require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW:
WWW (World Wide Web) is a collection of documents and resources such as audio,
video and text, which are identified by Uniform Resource Locators (URLs),
linked by HTML pages, and accessed through web browsers via the Internet.
It is the most heterogeneous repository as it collects data from multiple
resources.
It is dynamic in nature, as the volume of data is continuously increasing and
changing.
Application: Online shopping, Job search, Research, studying, etc.
B. Performance issues:
i. Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
ii. Parallel, distributed, and incremental mining algorithms:
The huge size of databases, the wide distribution of data, and the complexity of some
data mining methods all motivate parallel, distributed, and incremental mining.
These factors should be considered during the development of such data
mining algorithms.
iii. Mining information from heterogeneous databases and global information systems:
Since data is fetched from different data sources over Local Area Networks (LAN) and Wide
Area Networks (WAN), the discovery of knowledge from these varied sources of structured,
semi-structured, and unstructured data is a great challenge to data mining.
Attribute:
An attribute is a data field that represents a characteristic or feature of a data object.
For a customer object, attributes can be customer ID, address, etc.
A set of attributes is used to describe an object.
Types of attributes:
1. Qualitative Attributes
2. Quantitative Attributes
1. Qualitative Attributes:
a. Nominal Attributes (N):
These attributes are related to names.
The values of a nominal attribute are names of things or symbols of some kind.
Values of nominal attributes represent some category or state; that is why nominal
attributes are also referred to as categorical attributes. There is no order (rank,
position) among the values of a nominal attribute.
Example:

Attribute         Values
Colors            Black, Red, Green
Categorical Data  Lecturer, Professor

b. Ordinal Attributes (O):
The values of an ordinal attribute have a meaningful order (ranking) among them.
Example:

Attribute  Values
Grade      A, B, C, D, E
Income     low, medium, high
Age        Teenage, young, old
2. Quantitative Attributes:
a. Numeric:
A numeric attribute is quantitative because it is a measurable quantity, represented in
integer or real values.
Example:

Attribute   Values
Salary      2000, 3000
Units sold  10, 20
Age         5, 10, 20, ...
b. Discrete:
Discrete attributes have a finite or countably infinite set of values; they can be
numerical or categorical.
Example:

Attribute   Values
Profession  Teacher, Businessman, Peon
Zip Code    413736, 413713
c. Continuous:
Continuous attributes have an infinite number of possible states and are of float (real)
type. For example, there can be infinitely many values between 2 and 3.
Example:

Attribute  Values
Height     2.3, 3, 6.3, ...
Weight     40, 45.33, ...
*Data Preprocessing:
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format.
Real-world data is often incomplete, inconsistent and contains many errors.
Data preprocessing is a proven method of resolving such issues.
Data preprocessing prepares raw data for further processing.
B. Binning:
Binning smooths a sorted data value by consulting its neighborhood. The sorted values are
distributed into a number of bins; smoothing is then done, for example, by bin means.
Example:
Sorted price data: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-frequency bins of size 3:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
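The price data above can be binned and smoothed in a few lines. This is a minimal sketch of equal-frequency binning with smoothing by bin means; the bin size of 3 matches the example.

```python
# Equal-frequency binning with smoothing by bin means,
# applied to the sorted price data above.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3

# split the sorted values into consecutive bins of equal size
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

smoothed = []
for b in bins:
    mean = round(sum(b) / len(b))      # replace each value by its bin mean
    smoothed.extend([mean] * len(b))

print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```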
C. Regression:
Data can be smoothed by fitting the data into a regression function.
Example:
If we measured the height of child per year, if child grows 3 inches approximately, then the
regression function may be: child growing 3 inches per year
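The height-per-year example can be fitted with a least-squares line. The sample heights below are illustrative assumptions; the point is that the fitted slope recovers the "about 3 inches per year" trend from noisy measurements.

```python
# Least-squares fit of the height-per-year example: the fitted slope
# approximates the growth rate. Sample heights are illustrative.

years   = [1, 2, 3, 4, 5]
heights = [30, 33.2, 35.9, 39.1, 42.0]   # noisy measured heights (inches)

n = len(years)
mean_x = sum(years) / n
mean_y = sum(heights) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, heights)) \
        / sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

print(round(slope, 2))  # close to 3 inches per year
smooth = [intercept + slope * x for x in years]  # smoothed values on the line
```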
D. Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or
"clusters".
Values that fall outside of the set of clusters may be considered outliers. Such outliers
may be ignored during the analysis of data.
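A simple way to sketch this idea is gap-based clustering on one attribute: sorted values that lie close together join the same cluster, and values left in tiny clusters are flagged as outliers. The data and the gap threshold here are illustrative assumptions.

```python
# Gap-based clustering to flag outliers: consecutive sorted values within
# `gap` of each other join the same cluster; singleton clusters far from
# the rest are treated as outliers. Thresholds are illustrative.

values = [21, 22, 22, 23, 24, 80, 25, 23, 21, 24]
gap = 10

clusters = []
for v in sorted(values):
    if clusters and v - clusters[-1][-1] <= gap:
        clusters[-1].append(v)          # close enough: same cluster
    else:
        clusters.append([v])            # start a new cluster

# values in singleton clusters fall outside the main groups -> outliers
outliers = [v for c in clusters if len(c) == 1 for v in c]
print(outliers)  # [80]
```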
Data Integration:
Data integration merges data from multiple data sources into a coherent store.
[Figure: several data sources feed a data warehouse, which provides a unified view.]
These sources may include multiple databases, data cubes, or flat files. One of the most
well-known implementations of data integration is building an enterprise's data warehouse.
A data warehouse enables a business to perform analysis based on the data it contains.
There are mainly 2 major approaches for data integration:
1. Tight Coupling
In tight coupling data is combined from different sources into a single physical location
through the process of ETL - Extraction, Transformation and Loading.
2. Loose Coupling
In loose coupling, the data remains only in the actual source databases. In this approach,
an interface is provided that takes a query from the user and sends it directly to the
source databases to obtain the result.
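The tight-coupling approach can be sketched as a tiny ETL step. Both source schemas, the field names, and the records are illustrative assumptions; the point is that after loading, queries run against one physical store only.

```python
# Tight-coupling sketch: data is extracted from two hypothetical sources,
# transformed to one standard schema, and loaded into a single unified
# store (Extraction, Transformation, Loading).

source_a = [{"cust_id": 1, "amt": 100.0}]          # source 1 schema
source_b = [{"customer": 2, "amount_rs": 250.0}]   # source 2 schema

warehouse = []  # the single physical location

def etl():
    # Extract + Transform: map both schemas to one standard format, then Load
    for r in source_a:
        warehouse.append({"customer_id": r["cust_id"], "amount": r["amt"]})
    for r in source_b:
        warehouse.append({"customer_id": r["customer"], "amount": r["amount_rs"]})

etl()  # after loading, queries run against `warehouse` only
print(len(warehouse))  # 2
```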
b. Aggregation:
Aggregation in data mining is the process of finding, collecting, and presenting data in a
summarized format to perform statistical analysis for business decisions.
Aggregated data, written up as reports, help in finding useful information about a group.
Ex: finding the number of consumers by country. This can help the company increase sales
in countries with many buyers and enhance its marketing in countries with few buyers.
Here again, a group of buyers in a country is considered instead of an individual buyer.
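The consumers-by-country example reduces to a group-and-count. The sample records below are illustrative assumptions.

```python
# Aggregation sketch: counting consumers per country, as in the example above.
from collections import Counter

consumers = [
    {"name": "A", "country": "India"},
    {"name": "B", "country": "India"},
    {"name": "C", "country": "USA"},
]

# group by country and count the members of each group
by_country = Counter(c["country"] for c in consumers)
print(by_country)  # Counter({'India': 2, 'USA': 1})
```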
c. Generalization:
In generalization, low-level data are replaced with high-level data by climbing concept
hierarchies.
Example: the roll-up operation on a data cube.
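Concept-hierarchy climbing can be sketched as replacing each low-level value by its parent in the hierarchy and re-aggregating. The city-to-country hierarchy and the sales figures are illustrative assumptions.

```python
# Generalization sketch: low-level city values are rolled up to the
# higher-level country value using a concept hierarchy (city -> country).
from collections import defaultdict

hierarchy = {"Mumbai": "India", "Pune": "India", "Paris": "France"}
sales = [("Mumbai", 10), ("Pune", 5), ("Paris", 7)]

rolled_up = defaultdict(int)
for city, amount in sales:
    rolled_up[hierarchy[city]] += amount   # climb the concept hierarchy

print(dict(rolled_up))  # {'India': 15, 'France': 7}
```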
b. Dimensionality Reduction:
In dimensionality reduction, redundant attributes are detected and removed, which reduces
the data set size.
Example:

Before reduction (the A1 attribute appears twice):
A1  A2  A1  A3
10  11  10  21

After reduction (the duplicate A1 column is removed):
A1  A2  A3
10  11  21
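Dropping a duplicated attribute column, as in the A1/A2/A1/A3 example, can be sketched directly. The row values are illustrative assumptions.

```python
# Dimensionality-reduction sketch: detect and drop a redundant (duplicate)
# attribute column by keeping only the first occurrence of each name.

header = ["A1", "A2", "A1", "A3"]
rows = [[10, 11, 10, 21],
        [12, 13, 12, 25]]

keep = []
seen = set()
for i, name in enumerate(header):
    if name not in seen:        # first occurrence of each attribute is kept
        seen.add(name)
        keep.append(i)

reduced_header = [header[i] for i in keep]
reduced_rows = [[row[i] for i in keep] for row in rows]
print(reduced_header)  # ['A1', 'A2', 'A3']
```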
c. Discretization process:
(for the concept, refer to the section below)
5. Data Discretization:
Data Discretization techniques can be used to divide the range of continuous attribute into
intervals. (Continuous values can be divided into discrete (finite) values)
That is, a large range of continuous values is divided into a small number of intervals.
Numerous continuous attribute values are replaced by small interval labels.
This leads to a brief, easy-to-use, knowledge-level representation of mining results.
Data mining on a reduced data set means fewer input/output operations and is more
efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
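A common discretization scheme is equal-width intervals: the value range is split into a fixed number of equally wide bins and each value is replaced by its interval label. The bin count and sample ages below are illustrative assumptions.

```python
# Equal-width discretization sketch: a continuous attribute is divided into
# n_bins intervals and each value is replaced by an interval label.

ages = [3, 7, 12, 18, 25, 33, 41, 58]
n_bins = 3

lo, hi = min(ages), max(ages)
width = (hi - lo) / n_bins           # equal-width intervals

def label(v):
    # clamp the maximum value into the last bin
    i = min(int((v - lo) / width), n_bins - 1)
    return f"[{lo + i * width:.1f}, {lo + (i + 1) * width:.1f})"

discretized = [label(a) for a in ages]
print(discretized)
```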
Typical methods for Discretization and Concept Hierarchy Generation for Numerical Data:
a. Binning Method:
(refer to the Binning example given earlier)
b. Cluster Analysis:
Cluster analysis is a popular data discretization method.
In clustering, similar data objects are grouped together; one group forms a cluster of
data. Data sets are divided into different groups in cluster analysis based on the
similarity of the data.
A clustering algorithm can be applied to discretize a numerical attribute A by partitioning
the values of A into clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters,
forming a lower level of the hierarchy.
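Discretization by clustering can be sketched with a tiny one-dimensional k-means: the values of A are partitioned into k groups, and each value can then be replaced by its cluster label. The value of k, the initial centers, and the data are illustrative assumptions.

```python
# Sketch of discretizing a numerical attribute A by clustering:
# a 1-D k-means partitions the values of A into k groups.

A = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.8, 10.1]
k = 3
centers = [A[0], A[3], A[6]]         # naive initial centers

for _ in range(10):                  # a few refinement iterations
    groups = [[] for _ in range(k)]
    for v in A:
        # assign each value to its nearest center
        i = min(range(k), key=lambda c: abs(v - centers[c]))
        groups[i].append(v)
    # move each center to the mean of its group
    centers = [sum(g) / len(g) if g else centers[i]
               for i, g in enumerate(groups)]

print(sorted(centers))  # one center near each natural group of values
```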
Assignment 4