Data Mining Mod1
What Is Data Mining?
[Figure labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
1. Data cleaning (to remove noise and inconsistent data)
[Figure labels: Pattern Evaluation, Data Mining Engine, Knowledge-Base, Database or Data Warehouse Server]
[Figure labels: Increasing potential to support business decisions; Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]
[Figure labels: Data Mining, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, Other Disciplines]
Why Not Traditional Data Analysis?
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Relational Databases
A database system, also called a database management
system (DBMS), consists of a collection of interrelated
data, known as a database, and a set of software programs
to manage and access the data.
A relational database is a collection of tables, each of
which is assigned a unique name.
Each table consists of a set of attributes (columns or
fields) and usually stores a large set of tuples (records or
rows).
A semantic data model, such as an entity-relationship (ER)
data model, is often constructed for relational databases.
An ER data model represents the database as a set of
entities and their relationships.
April 13, 2021 Data Mining: Concepts and Techniques 26
A relational database for AllElectronics. The AllElectronics
company is described by the following relation tables:
customer, item, employee, and branch.
Relational data can be accessed by database queries
written in a relational query language, such as SQL, or
with the assistance of graphical user interfaces.
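As a minimal sketch of querying relational data with SQL, the snippet below builds a small in-memory customer table using Python's standard sqlite3 module. The column names (cust_id, name, city) and the sample rows are assumptions for illustration; they are not taken from the AllElectronics schema.

```python
import sqlite3

# In-memory database; the columns and rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (cust_id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Mumbai"), (2, "Ravi", "Pune"), (3, "Meera", "Mumbai")],
)


def customers_in_city(connection, city):
    """Return the names of customers in the given city, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]


mumbai_customers = customers_in_city(conn, "Mumbai")
```

The query selects a subset of tuples (rows) by a condition on one attribute (column), which is the basic access pattern a relational query language provides.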
enterprise-wide.
Data cube
confidence threshold.
No coupling
Loose coupling
Tight coupling
perishable.
A multi-dimensional
structure called the data cube.
E.g., dice: time = 'Q1' or 'Q2' and location = 'Mumbai' or 'Pune'
C[quarter, city, product] -> C[quarter, city, product]
E.g., roll-up: C[quarter, city, product] -> C[quarter, province, product]
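The dice and roll-up operations can be sketched on a tiny cube stored as a Python dictionary keyed by (quarter, city, product). The cell values and the city-to-province mapping are made-up sample data, and the function names are hypothetical.

```python
from collections import defaultdict

# Cube cells keyed by (quarter, city, product); values are made-up sample data.
cube = {
    ("Q1", "Mumbai", "TV"): 10,
    ("Q1", "Pune", "TV"): 5,
    ("Q2", "Mumbai", "Phone"): 7,
    ("Q3", "Pune", "TV"): 4,
    ("Q1", "Delhi", "Phone"): 3,
}

# Hypothetical city -> province mapping used by the roll-up.
CITY_TO_PROVINCE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Delhi": "Delhi"}


def dice(cells, quarters, cities):
    """Keep only the cells whose quarter and city are in the selected sets."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[1] in cities}


def roll_up_city_to_province(cells):
    """Aggregate C[quarter, city, product] into C[quarter, province, product]."""
    rolled = defaultdict(int)
    for (quarter, city, product), value in cells.items():
        rolled[(quarter, CITY_TO_PROVINCE[city], product)] += value
    return dict(rolled)


diced = dice(cube, {"Q1", "Q2"}, {"Mumbai", "Pune"})
rolled = roll_up_city_to_province(cube)
```

Dice selects a sub-cube without changing the dimensions, while roll-up climbs the location hierarchy and sums the cells that fall into the same province.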
3-Tier Data Warehouse Architecture
Bottom Tier
Middle Tier
Top Tier
Data Sources:
Bottom Tier: Data warehouse server
Backend Tools & Utilities:
Functions performed by backend tools and utilities
are:
Data Extraction
Data Cleaning
Data Transformation
Load
Refresh
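The five backend functions can be illustrated as a small pipeline. This is only a sketch under assumed conventions: the record fields (city, amount), the cleaning rule, and the rebuild-on-refresh strategy are hypothetical choices, not something the slides specify.

```python
def extract(sources):
    """Data extraction: gather raw records from every source."""
    return [record for source in sources for record in source]


def clean(records):
    """Data cleaning: drop records with a missing or negative amount."""
    return [r for r in records if r.get("amount") is not None and r["amount"] >= 0]


def transform(records):
    """Data transformation: convert city names to one unified format."""
    return [{**r, "city": r["city"].strip().upper()} for r in records]


def load(warehouse, records):
    """Load: append the prepared records to the warehouse."""
    warehouse.extend(records)
    return warehouse


def refresh(warehouse, sources):
    """Refresh: propagate source updates by rebuilding the warehouse contents."""
    warehouse.clear()
    return load(warehouse, transform(clean(extract(sources))))


# Made-up sample sources.
source_a = [{"city": " mumbai ", "amount": 100}, {"city": "pune", "amount": None}]
source_b = [{"city": "Pune", "amount": 40}, {"city": "Delhi", "amount": -10}]
warehouse = refresh([], [source_a, source_b])
```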
Bottom Tier Contains:
Data warehouse
Metadata Repository
Data Marts
Data Warehouse:
It is an optimized form of an operational database that contains
only relevant information and provides fast access to
data.
Subject oriented
E.g., data related to all the departments of an organization
Integrated:
[Figure: different views of data (A, B, C) combined into a single unified view in the warehouse]
Time-variant
Nonvolatile
Metadata repository:
It describes what is available in the data warehouse.
It contains:
Structure of data warehouse
Data names and definitions
Source of extracted data
Algorithms used for data cleaning
Sequence of transformations applied on data
Data related to system performance
Data Marts:
A subset of the data warehouse that contains only small slices of
its data
E.g., data pertaining to a single department
Dependent: sourced directly from the data warehouse
Independent: sourced from one or more data sources
Monitoring & Administration:
Data Refreshment
Disaster recovery
[Figure labels: Data Sources A, B, C; Data Warehouse; Metadata Repository; Data Marts]
Middle Tier: OLAP Server
Report writers
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
Regression: smooth by fitting the data into regression functions
[Figure: regression line y = x + 1 on axes x and y; the observed value Y1 at X1 is replaced by the fitted value Y1']
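A minimal sketch of smoothing by regression: fit a least-squares line to noisy points and replace each observed y with the fitted value. The sample points are made up so that the fitted line comes out as y = x + 1; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Made-up noisy observations scattered around y = x + 1.
xs = [0, 1, 2, 3]
ys = [1.5, 0.5, 4.5, 3.5]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```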
Data cleaning as a process
Discrepancy detection
Field overloading
Unique rules
Consecutive rules
Null rules
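The three rule types can be sketched as simple checks, assuming their usual meaning: a unique rule requires every value of an attribute to be distinct, a consecutive rule requires no gaps between the lowest and highest value, and a null rule specifies which markers stand for a missing value. The null markers, field names, and sample data are assumptions for illustration.

```python
# Assumed conventions for marking a missing value.
NULL_MARKERS = {None, "", "?", "N/A"}


def unique_rule_violations(values):
    """Return the values that occur more than once."""
    seen, duplicates = set(), set()
    for value in values:
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


def consecutive_rule_violations(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def null_rule_violations(records, field):
    """Return the indices of records whose field holds a null marker."""
    return [i for i, r in enumerate(records) if r.get(field) in NULL_MARKERS]


# Made-up sample data.
check_numbers = [101, 102, 104, 104, 106]
people = [{"occupation": "clerk"}, {"occupation": ""}, {"occupation": "?"}]

duplicates = unique_rule_violations(check_numbers)
gaps = consecutive_rule_violations(check_numbers)
nulls = null_rule_violations(people, "occupation")
```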
Min-max normalization
Min-max normalization performs a linear transformation on the original
data.
Suppose that $min_A$ and $max_A$ are the minimum and the maximum values
for attribute A. Min-max normalization maps a value $v$ of A to $v'$ in the
range $[new\_min_A, new\_max_A]$ by computing:
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
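A short sketch of min-max normalization in Python. The function name, the default target range [0, 1], and the income values are assumptions chosen for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] of the data to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("min-max normalization is undefined when all values are equal")
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


# Made-up sample incomes.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
```

The smallest value maps to the lower bound of the new range and the largest to the upper bound; every other value keeps its relative position between them.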
Data Transformation: Normalization
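Z-score normalization:
In z-score normalization, the values of attribute A are normalized based on the
mean and standard deviation of A. A value $v$ of A is normalized to $v'$
by computing:
$$v' = \frac{v - \bar{A}}{\sigma_A}$$
where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of A.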
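A short sketch of z-score normalization using Python's standard statistics module. It assumes the population standard deviation (pstdev); using the sample standard deviation instead is an equally valid choice. The function name and the sample values are made up.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("z-score normalization is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


# Made-up sample values with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = z_score_normalize(scores)
```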
in volume but yet produces the same (or almost the same) analytical
results
terabytes of data. Complex data analysis may take a very long time
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
popular form of
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
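One way to reduce a sorted list like this is to replace the individual values with bucket counts, as in an equal-width histogram. The sketch below assumes that is the intended use of the list; the bucket width of 10 and the function name are assumptions.

```python
from collections import Counter

# The sorted values listed above.
values = [
    1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
    15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
    21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30,
]


def equal_width_histogram(data, width):
    """Count positive integers per bucket; bucket k covers k*width+1 .. (k+1)*width."""
    counts = Counter((v - 1) // width for v in data)
    return {(k * width + 1, (k + 1) * width): counts[k] for k in sorted(counts)}


histogram = equal_width_histogram(values, 10)
```

With a width of 10, the 52 values collapse into three buckets (1 to 10, 11 to 20, 21 to 30), each stored as a range and a count.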
Supervised discretization
Unsupervised discretization
Top-down discretization or splitting
Bottom-up discretization or merging
Data Discretization and Concept Hierarchy Generation
A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.
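As an illustration, a one-level concept hierarchy for a numeric attribute can be written as a list of ranges with labels; mapping each value to its label discretizes the attribute. The attribute (age), the range boundaries, and the labels are hypothetical.

```python
# Hypothetical hierarchy for a numeric attribute "age": (low, high, label).
AGE_HIERARCHY = [(0, 19, "youth"), (20, 59, "middle_aged"), (60, 150, "senior")]


def to_concept(value, hierarchy):
    """Return the label of the range that contains the value."""
    for low, high, label in hierarchy:
        if low <= value <= high:
            return label
    raise ValueError(f"value {value} is outside every range in the hierarchy")


# Made-up sample ages.
ages = [15, 34, 61, 20]
concepts = [to_concept(age, AGE_HIERARCHY) for age in ages]
```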