
Data Mining:

Concepts and Techniques


— Unit 1 —
— Introduction —
Introduction

Motivation: Why data mining?


What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Classification of data mining systems
Major issues in data mining
Overview
Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes


(1,024 terabytes = 1 petabyte)
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
Disk Storage

1 Bit = Binary Digit


· 8 Bits = 1 Byte
· 1024 Bytes = 1 Kilobyte
· 1024 Kilobytes = 1 Megabyte
· 1024 Megabytes = 1 Gigabyte
· 1024 Gigabytes = 1 Terabyte
· 1024 Terabytes = 1 Petabyte
· 1024 Petabytes = 1 Exabyte
· 1024 Exabytes = 1 Zettabyte
· 1024 Zettabytes = 1 Yottabyte
· 1024 Yottabytes = 1 Brontobyte
· 1024 Brontobytes = 1 Geopbyte
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”
Data mining—automated analysis of massive data sets
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
1950s-1990s, computational science – simulation, models
1990-now, data science
The flood of data
The ability to economically store and manage data online
The Internet and computing grids that make all these
archives universally accessible
Hence the concept of data mining.
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information
systems
A Brief History of Data Mining Society

1989 IJCAI Workshop on Knowledge Discovery in Databases


1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
ACM Transactions on KDD starting in 2007

What Is Data Mining?

Data mining (knowledge discovery from data)


Extraction of interesting (non-trivial, implicit,
previously unknown, and potentially useful)
patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction
Data/pattern analysis
Data archeology
Data dredging
Information harvesting
Business intelligence, etc.
Knowledge Discovery (KDD) Process

Data mining is the core of the knowledge discovery process.

[Figure: the KDD process — Databases → Data Cleaning → Data Integration →
Data Warehouse → Selection → Task-relevant Data → Data Mining →
Pattern Evaluation → Knowledge]
1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be
combined)

3. Data selection (where data relevant to the analysis task are
retrieved from the database)

4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)

6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)

7. Knowledge presentation (where visualization and
knowledge representation techniques are used to present
the mined knowledge to the user)
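To make the seven steps concrete, here is a minimal Python/pandas sketch on made-up sales data; the source tables, column names, toy values, and thresholds are all illustrative assumptions, not part of the original slides:

import pandas as pd

# Two hypothetical sources; one record has a missing amount.
store_a = pd.DataFrame({"cust": [1, 2, 2],
                        "item": ["pc", "pc", "camera"],
                        "amount": [900.0, None, 250.0]})
store_b = pd.DataFrame({"cust": [3, 3],
                        "item": ["camera", "sdcard"],
                        "amount": [240.0, 30.0]})

cleaned = pd.concat([store_a, store_b]).dropna()    # steps 1-2: cleaning + integration
selected = cleaned[["item", "amount"]]              # step 3: task-relevant data
summary = selected.groupby("item")["amount"].agg(["count", "sum"])  # step 4: aggregation
patterns = summary[summary["count"] >= 2]           # step 5: a trivial stand-in for mining
interesting = patterns[patterns["sum"] > 100]       # step 6: interestingness threshold
print(interesting)                                  # step 7: presentation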
Architecture: Typical Data Mining System

[Figure: a typical data mining system architecture — from bottom to top:
data sources (databases, data warehouse, World Wide Web, other information
repositories) feed, via data cleaning, integration, and selection, a
database or data warehouse server; above it sit the data mining engine and
the pattern evaluation module, both consulting a knowledge base; a
graphical user interface is on top.]
Database, data warehouse, World Wide Web, or other
information repository: This is one or a set of databases,
data warehouses, spreadsheets, or other kinds of
information repositories. Data cleaning and data
integration techniques may be performed on the data.



Database or data warehouse server: The database or data
warehouse server is responsible for fetching the relevant
data, based on the user’s data mining request.



Knowledge base: This is the domain knowledge that is
used to guide the search or evaluate the interestingness of
resulting patterns.
Knowledge such as user beliefs, which can be used to
assess a pattern’s interestingness based on its
unexpectedness, may also be included.
Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata
(e.g., describing data from multiple heterogeneous
sources).



Data mining engine: This is essential to the data mining
system and ideally consists of a set of functional modules
for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster
analysis, outlier analysis, and evolution analysis.



Pattern evaluation module: This component typically
employs interestingness measures and interacts with the
data mining modules so as to focus the search toward
interesting patterns.
It may use interestingness thresholds to filter out
discovered patterns.
Alternatively, the pattern evaluation module may be
integrated with the mining module, depending on the
implementation of the data mining method used. For
efficient data mining, it is highly recommended to push
the evaluation of pattern interestingness as deep as possible
into the mining process, so as to confine the search to only
the interesting patterns.


User interface: This module communicates between users
and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task,
providing information to help focus the search, and
performing exploratory data mining based on the
intermediate data mining results.
In addition, this component allows the user to browse
database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in
different forms.



Data Mining and Business Intelligence

[Figure: the business intelligence pyramid — increasing potential to support
business decisions as you move up the layers:
Data Sources (paper, files, Web documents, scientific experiments, database
systems) — DBA;
Data Preprocessing/Integration, Data Warehouses — DBA;
Data Exploration: statistical summary, querying, and reporting;
Data Mining: information discovery — data analyst;
Data Presentation: visualization techniques — business analyst;
Decision Making — end user.]
Data Mining: Confluence of Multiple Disciplines

Data mining draws on database technology, statistics, machine
learning, visualization, pattern recognition, algorithms, and other
disciplines.
Why Not Traditional Data Analysis?

Tremendous amount of data


High-dimensionality of data
High complexity of data
New and sophisticated applications
Data Mining: Classification Schemes

General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?

a. Database-oriented data sets and applications

Relational database
Data warehouse
Transactional database
Relational Databases
A database system, also called a database management
system (DBMS), consists of a collection of interrelated
data, known as a database, and a set of software programs
to manage and access the data.
A relational database is a collection of tables, each of
which is assigned a unique name.
Each table consists of a set of attributes (columns or
fields) and usually stores a large set of tuples (records or
rows).
A semantic data model, such as an entity-relationship (ER)
data model, is often constructed for relational databases.
An ER data model represents the database as a set of
entities and their relationships.
A relational database for AllElectronics. The AllElectronics
company is described by the following relation tables:
customer, item, employee, and branch.
Relational data can be accessed by database queries
written in a relational query language, such as SQL, or
with the assistance of graphical user interfaces.

[Figure: fragments of the customer, item, employee, and branch relations
from the AllElectronics database.]
A query allows retrieval of specified subsets of the data.

Suppose that your job is to analyze the AllElectronics data.

Through the use of relational queries, you can ask things
like “Show me a list of all items that were sold in the last
quarter.”

Relational languages also include aggregate functions such
as sum, avg (average), count, max (maximum), and min
(minimum).
Data Warehouses

A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
usually residing at a single site.

Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data loading,
and periodic data refreshing.
To facilitate decision making, the data in a data
warehouse are organized around major subjects, such as
customer, item, supplier, and activity.

The data are stored to provide information from a


historical perspective (such as from the past 5–10 years) and
are typically summarized.



A data warehouse is usually modeled by a multidimensional
database structure.
Each dimension corresponds to an attribute or a set of
attributes in the schema.
Each cell stores the value of some aggregate measure, such
as count or sales amount.
The actual physical structure of a data warehouse may be a
relational data store or a multidimensional data cube.
A data cube provides a multidimensional view of data and
allows the precomputation and fast accessing of
summarized data.
A data cube for AllElectronics
A data cube for summarized sales data of AllElectronics.
The cube has three dimensions: address (with city
values Chicago, New York, Toronto, Vancouver), time (with
quarter values Q1, Q2, Q3, Q4), and
item (with item type values home entertainment, computer, phone,
security).
The aggregate value stored in each cell of the cube is sales
amount (in thousands).
For example, the total sales for the first quarter, Q1, for
items relating to security systems in Vancouver is
$400,000, as stored in cell (Vancouver, Q1, security).

[Figure: the 3-D data cube for AllElectronics sales.]
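The cube cell above can be reproduced with a small pandas group-by; the raw rows below are invented so that only the (Vancouver, Q1, security) = $400,000 total comes from the text:

import pandas as pd

sales = pd.DataFrame({
    "address": ["Vancouver", "Vancouver", "New York"],
    "time":    ["Q1", "Q1", "Q2"],
    "item":    ["security", "security", "computer"],
    "amount":  [250, 150, 500],   # sales amount in thousands of dollars
})
# Aggregating over (address, time, item) yields the cube's cells.
cube = sales.groupby(["address", "time", "item"])["amount"].sum()
print(cube.loc[("Vancouver", "Q1", "security")])   # 400, i.e., $400,000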
Data warehouse systems are well suited for on-line
analytical processing, or OLAP.

Examples of OLAP operations include drill-down and roll-up,
which allow the user to view the data at differing
degrees of summarization.

We can drill down on sales data summarized by quarter to
see the data summarized by month. Similarly, we can roll
up on sales data summarized by city to view the data
summarized by country.
What is the difference between a data warehouse and
a data mart?

A data warehouse collects information about subjects
that span an entire organization, and thus its scope is
enterprise-wide.

A data mart, on the other hand, is a department subset of
a data warehouse. It focuses on selected subjects, and
thus its scope is department-wide.
Transactional Databases
A transactional database consists of a file where each
record represents a transaction.
A transaction typically includes a unique transaction
identity number (trans ID) and a list of the items making
up the transaction (such as items purchased in a store).



Queries such as “Show me all the items purchased by Sandy Smith” or
“How many transactions include item number I3?”

Answering such queries may require a scan of the entire
transactional database.

“Which items sold well together?”

Market basket data analysis would enable you to bundle
groups of items together as a strategy for maximizing
sales.

A data mining system for transactional data can do so by
identifying frequent itemsets.
b. Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks, and multi-linked
data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World Wide Web
[Figure: a two-dimensional representation of data as a table.]

[Figure: a data cube.]

[Figure: dimension modelling.]
Data Mining Functionalities

Data mining functionalities are used to specify the kind
of patterns to be found in data mining tasks.
In general, data mining tasks can be classified into two
categories: descriptive and predictive.
Descriptive mining tasks characterize the general
properties of the data in the database.
Predictive mining tasks perform inference on the current
data in order to make predictions.


Data Mining Functionalities
i. Concept/Class Description: Characterization and
Discrimination
Data can be associated with classes or concepts.

For example, in the AllElectronics store, classes of items for sale
include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders.

It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions.


These descriptions can be derived via
(1) data characterization, by summarizing the data of the
class under study in general terms, or (2) data
discrimination, by comparison of the target class with one
or a set of comparative classes, or (3) both data
characterization and discrimination.

Data characterization is a summarization of the general
characteristics or features of a target class of data.
For example,
Data characterization: summarizing the characteristics
of customers who spend more than $1,000 a year at
AllElectronics.

Data discrimination: comparing the general features of
software products whose sales increased by 10% in the
last year with those whose sales decreased by at least 30%
during the same period.
Data Mining Functionalities
ii. Frequent patterns, association, correlation
Frequent patterns, as the name suggests, are patterns that
occur frequently in data.
Kinds of frequent patterns include itemsets, subsequences, and
substructures.
A frequent itemset: milk and bread, often purchased together.
A (frequent) sequential pattern: customers tend to purchase
first a PC, followed by a digital camera, and then a
memory card.
A (frequent) structured pattern: a substructure that occurs
frequently.
Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
Association analysis
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
A confidence, or certainty, of 50% means that if a customer buys a
computer, there is a 50% chance that she will buy software as well.

A 1% support means that 1% of all of the transactions under
analysis showed that computer and software were purchased
together.

Association rules that contain a single predicate are referred to as
single-dimensional association rules.
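A small sketch of how support and confidence are computed over a hypothetical transaction list (plain Python; the toy data reproduces the 50% confidence of the example but a different support):

transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)
support = both / n             # fraction of all transactions containing both items
confidence = both / computer   # fraction of computer buyers who also buy software
print(f"support = {support:.0%}, confidence = {confidence:.0%}")  # 25%, 50%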


Association analysis

age(X, “20...29”) ∧ income(X, “20K...29K”) ⇒ buys(X, “CD player”)
[support = 2%, confidence = 60%]

This is a multidimensional association rule.

Association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a minimum
confidence threshold.


Data Mining Functionalities

iii. Classification and prediction

Construct models (functions) that describe and
distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or
classify cars based on mileage
Predict some unknown or missing numerical values
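A minimal classification sketch, assuming scikit-learn is available; the mileage values and class labels are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

mileage = [[12], [14], [15], [30], [32], [35]]     # feature: miles per gallon
label = ["gas-guzzler"] * 3 + ["economical"] * 3   # class of each training car
model = DecisionTreeClassifier().fit(mileage, label)
print(model.predict([[28]]))                       # -> ['economical']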
iv. Cluster analysis
E.g., cluster houses to find distribution patterns
Maximizing intra-class similarity and minimizing inter-class
similarity
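A minimal cluster-analysis sketch with k-means (scikit-learn assumed); the 2-D "house" coordinates are invented:

from sklearn.cluster import KMeans

houses = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one neighbourhood
          [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]]   # another neighbourhood
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(houses)
print(km.labels_)   # e.g., [0 0 0 1 1 1]: high intra-cluster similarity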
v. Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
vi. Trend and evolution analysis
Trend and deviation: e.g., regression analysis (e.g., height &
weight, price & demand)
Sequential pattern mining: e.g., purchase of a digital camera is
frequently followed by purchase of a large SD memory card
Periodicity analysis
Similarity-based analysis
vii. Statistical analyses
Integration schemes of a data mining (DM) system with a database (DB)
or data warehouse (DW) system include

No coupling

Loose coupling

Semitight coupling

Tight coupling


No coupling
No coupling means that a DM system will not utilize any function of
a DB or DW system.
Advantages
Simple to use
Drawbacks
Without using a DB/DW system, a DM system may spend a
substantial amount of time finding, collecting, cleaning, and
transforming data
The DM system will need to use other tools to extract data, making it
difficult to integrate such a system into an information processing
environment.
Thus, no coupling represents a poor design.
Loose coupling
Loose coupling means that a DM system will use some facilities of a DB
or DW system:
fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a
file or in a designated place in a database or data warehouse.
Advantages
Loose coupling is better than no coupling.
It can fetch any portion of data stored in databases or data
warehouses by using query processing.
It gains the flexibility, efficiency, and other features provided
by such systems.
Disadvantages
Many loosely coupled mining systems are main memory-based.
It is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
Semitight coupling

Semitight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data
mining primitives can be provided in the DB/DW system.

These primitives can include sorting, indexing, aggregation, histogram
analysis, multiway join, and precomputation of some essential
statistical measures, such as sum, count, max, min, standard deviation,
and so on.

This design enhances the performance of a DM system.
Tight coupling

Tight coupling means that a DM system is smoothly integrated into the
DB/DW system.

The data mining subsystem is treated as one functional component of an
information system.

This approach is highly desirable because it facilitates efficient
implementations of data mining functions, high system performance,
and an integrated information processing environment.


Major Issues in Data Mining

Mining different kinds of knowledge in databases.

Interactive mining of knowledge at multiple levels of


abstraction.

Incorporation of background knowledge.

Data mining query languages and ad hoc data mining.

Presentation and visualization of data mining results.

Handling noisy or incomplete data

Pattern evaluation: the interestingness problem


Major Issues in Data Mining
Performance Issues
Efficiency and scalability of data mining algorithms

Parallel , distributed and incremental mining algorithms

Issues relating to the diversity of database types


Handling of relational and complex types of data.

Mining information from heterogeneous database and global


information systems



Data Warehouse
Definition
Data Warehouse
A collection of corporate information,
derived directly from operational
systems and some external data
sources.
Its specific purpose is to support
business decisions, not business
operations.
The Purpose of Data Warehousing

Realize the value of data


Data / information is an asset
Methods to realize the value, (Reporting, Analysis, etc.)

Make better decisions


Turn data into information
Create competitive advantage
Methods to support the decision-making process (DSS)
A data warehouse refers to a database that is maintained
separately from an organization’s operational database.

It allows integration of a variety of application systems.

It gives the opportunity for historical data analysis.


Subject oriented: a data warehouse is subject oriented rather
than transaction oriented, e.g., organized around customer, product,
sales.

Nonvolatile: a data warehouse is physically separate from the
operational database, so its data is not perishable.


Integrated: a data warehouse is constructed by integrating
heterogeneous sources such as relational databases, flat files, and
OLTP files.

Data cleaning and data integration techniques are used to
ensure consistency in naming conventions, encoding
structures, attribute measures, and so on.

Time variant: data is stored from a historical perspective (5–10
years).

An implicit or explicit time element is present throughout the
data warehouse.
MULTIDIMENSIONAL DATA MODEL

[Figure: a two-dimensional representation of the data as a table.]

[Figure: the same data as a data cube.]
Data Warehouse Implementation

A data warehouse is implemented using a multi-dimensional
structure called the data cube.

The data cube is a data abstraction that allows one to view
aggregated data from a number of perspectives.
OLAP Operations
OLAP operations are used for retrieving data in a simplified
manner from the data cube for analysis.
Slicing: reducing the data cube by one or more dimensions.
E.g., slice time=’Q2’: C[quarter, city, product] → C[city, product]


Dicing:
This operation selects a smaller data cube and analyzes it
from different perspectives (selection criteria).

E.g., dice time=’Q1’ or ’Q2’ and location=’Mumbai’ or ’Pune’:
C[quarter, city, product] → C[quarter, city, product], restricted to the
selected dimension values.


Drilling: moving up and down along classification
hierarchies.

Drill up (roll up): switching from a detailed to an
aggregated level within the same classification
hierarchy.

E.g., roll up: C[quarter, city, product] → C[quarter,
province, product]

Drill down: switching from an aggregated to a detailed
level, e.g., from year to quarter to month to day.
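The three operations can be mimicked on a toy cube C[quarter, city, product] with pandas; the cities, products, sales figures, and the city-to-province mapping are illustrative assumptions:

import pandas as pd

c = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Mumbai", "Mumbai", "Pune", "Delhi"],
    "product": ["pc", "pc", "phone", "pc"],
    "sales":   [100, 120, 80, 90],
})
# Slice: fix time='Q2', dropping the time dimension.
slice_q2 = c[c["quarter"] == "Q2"].groupby(["city", "product"])["sales"].sum()
# Dice: keep a smaller sub-cube over selected dimension values.
dice = c[c["quarter"].isin(["Q1", "Q2"]) & c["city"].isin(["Mumbai", "Pune"])]
# Roll up: climb the hierarchy city -> province.
province = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Delhi": "Delhi"}
rollup = (c.assign(province=c["city"].map(province))
           .groupby(["quarter", "province", "product"])["sales"].sum())
print(slice_q2, dice, rollup, sep="\n\n")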


Integration of Data Mining and Data
Warehousing:

The data warehouse provides clean, integrated data for fruitful
mining.
Data mining provides powerful tools for analysis of data stored
in data warehouses.

Data mining provides more analysis tools, e.g.,
- association,
- classification,
- clustering,
- pattern-directed analysis, and
- trend analysis.
Data mining: the extraction of hidden predictive information
from large databases.

Data might be one of the most valuable assets of your
corporation – but only if you know how to reveal the valuable
knowledge hidden in raw data.

Data mining allows you to extract diamonds of knowledge from
your historical data and predict outcomes of future situations.
The actual need for a data warehouse is:
- to store heterogeneous data for managerial decision purposes;
- to store data in various dimensions within the data warehouse,
making it easy to analyze the data and to take decisions.

A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management and
decision-making processes.
3-Tier Data Warehouse Architecture

A data warehouse adopts a three-tier architecture.

These 3 tiers are:

Bottom Tier

Middle Tier

Top Tier
Data Sources:

All the data related to a business organization is stored in
operational databases, external files, and flat files.

These sources are application oriented.

E.g., the complete data of an organization, such as training details,
customer details, sales, departments, transactions, employee details, etc.

Data is present here in different formats or host formats.

The sources contain data that is not well documented.
Bottom Tier: Data warehouse server

The data warehouse server fetches only relevant
information based on a data mining (mining
knowledge from a large amount of data) request.

E.g., customer profile information provided by
external consultants.

Data is fed into the bottom tier by back-end
tools and utilities.
Backend Tools & Utilities:
Functions performed by backend tools and utilities
are:
Data Extraction
Data Cleaning
Data Transformation
Load
Refresh

Bottom Tier Contains:

Data warehouse

Metadata Repository

Data Marts

Monitoring and Administration

Data Warehouse:
It is an optimized form of the operational database, containing
only relevant information and providing fast access to data.
Subject oriented:
E.g., data related to all the departments of an organization.
Integrated: different views of data (from sources A, B, C) are
combined into a single unified view in the warehouse.
Time-variant.
Nonvolatile.
Metadata repository:
It describes what is available in the data warehouse.
It contains:
Structure of data warehouse
Data names and definitions
Source of extracted data
Algorithm used for data cleaning purpose
Sequence of transformations applied on data
Data related to system performance
Data Marts:
A subset of the data warehouse containing only a small slice of the
data warehouse.
E.g., data pertaining to a single department.

Two types of data marts:

Dependent data marts: sourced directly from the data warehouse.
Independent data marts: sourced from one or more external data sources.
Monitoring & Administration:

Data Refreshment

Data source synchronization

Disaster recovery

Managing access control and security

Manage data growth, database performance

Controlling the number & range of queries

Limiting the size of data warehouse


Bottom Tier:

[Figure: the bottom tier — monitoring and administration, the data
warehouse server, the metadata repository, the data warehouse, and data
marts, fed from data sources A, B, and C.]
Middle Tier: OLAP Server

It presents the users multidimensional data from the
data warehouse or data marts.
It is typically implemented using two models:

ROLAP model: presents data in relational tables.
MOLAP model: presents data in array-based structures that
map directly to the data cube array structure.
Top Tier: Front-end tools
It is the front-end client layer.
Query and reporting tools
Reporting tools: production reporting tools and
report writers
Managed query tools: point-and-click creation of SQL, used,
e.g., for a customer mailing list
Analysis tools: prepare charts based on analysis
Data mining tools: mine knowledge, discover hidden
pieces of information, new correlations, useful patterns
Data Preprocessing
Data preprocessing is a data mining technique that involves
transforming raw data into an understandable format.

Real-world data is often incomplete, noisy, or inconsistent, and data
preprocessing is a proven method of resolving such issues.

Data preprocessing prepares raw data for further processing.

Data preprocessing is used in database-driven applications such as
customer relationship management and rule-based applications.


Data Preprocessing
Preprocess Steps
Data cleaning
Data integration
Data transformation
Data reduction
Why Data Preprocessing?
Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names


Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:


Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar
analytical results
[Figure: forms of data preprocessing.]
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
not register history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing values
per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
Use the most probable value to fill in the missing value: inference-based,
such as a Bayesian formula or a decision tree
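A short pandas sketch of three of the fill-in strategies above, on an invented table with a class column and one missing income value per class:

import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "B", "B"],
                   "income": [30.0, None, 50.0, None]})
df["global_const"] = df["income"].fillna(-1.0)              # a made-up "unknown" marker
df["attr_mean"] = df["income"].fillna(df["income"].mean())  # overall mean = 40.0
df["class_mean"] = (df.groupby("class")["income"]
                      .transform(lambda s: s.fillna(s.mean())))  # A -> 30.0, B -> 50.0
print(df)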
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin medians, or smooth by
bin boundaries (see the binning sketch after this list)
Cluster analysis
Clustering: detect and remove outliers
Regression
Regression: smooth by fitting the data into regression functions.
[Figure: a regression line y = x + 1 fitted through the data, mapping an
observed value X1 to a smoothed value Y1'.]
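A minimal binning sketch in plain Python, using an illustrative sorted price list and a bin depth of 3; it smooths by bin means and by bin boundaries:

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]   # equi-depth bins
# Smooth by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
# Smooth by bin boundaries: every value snaps to the nearer of min/max of its bin.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]
print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]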
Data cleaning as a process
Discrepancy detection

Use metadata

Field overloading

Unique rules

Consecutive rules

Null rules


Data Integration
Data integration:
combines data from multiple sources.
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different sources
are different
possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundant Data in Data Integration

Redundant data occur often when integrating multiple
databases:
The same attribute may have different names in different databases
One attribute may be a “derived” attribute in another table

Redundant data can often be detected by correlation analysis

Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Data Transformation

Smoothing: remove noise from data


Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization

Min-max normalization
Min-max normalization performs a linear transformation on the original
data.

Suppose that min_A and max_A are the minimum and the maximum values
for attribute A. Min-max normalization maps a value v of A to v' in the
range [new_min_A, new_max_A] by computing:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
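A minimal min-max sketch in plain Python; the income values and the [0.0, 1.0] target range are illustrative:

values = [12000, 73600, 98000]
min_a, max_a = min(values), max(values)
new_min, new_max = 0.0, 1.0
scaled = [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
          for v in values]
print([round(s, 3) for s in scaled])   # [0.0, 0.716, 1.0]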
Data Transformation: Normalization
Z-score normalization:
In z-score normalization, the values of attribute A are normalized based on
the mean and standard deviation of A. A value v of A is normalized to v'
by computing:
v' = (v - mean_A) / stddev_A

where mean_A and stddev_A are the mean and the standard deviation,
respectively, of attribute A.

This method of normalization is useful when the actual minimum and
maximum of attribute A are unknown.
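A z-score sketch using the standard library; the use of the population standard deviation is an assumption (the sample standard deviation is an equally valid choice):

from statistics import mean, pstdev

values = [20, 30, 40, 50, 60]
mu, sigma = mean(values), pstdev(values)       # mean 40, std ~14.14
z = [(v - mu) / sigma for v in values]
print([round(x, 2) for x in z])                # [-1.41, -0.71, 0.0, 0.71, 1.41]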
Data Transformation: Normalization
Normalization by decimal scaling
Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A.

The number of decimal points moved depends on the maximum
absolute value of A.

A value v of A is normalized to v' by computing: v' = v / 10^j, where j
is the smallest integer such that max(|v'|) < 1.
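A decimal-scaling sketch in plain Python that searches for the smallest j; the values are illustrative:

values = [-986, 217, 45]
j = 0
while max(abs(v) / 10 ** j for v in values) >= 1:   # grow j until max|v'| < 1
    j += 1
scaled = [v / 10 ** j for v in values]
print(j, scaled)   # 3 [-0.986, 0.217, 0.045]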
Data reduction
Obtain a reduced representation of the data set that is much smaller

in volume but yet produces the same (or almost the same) analytical

results

Why data reduction? — A database/data warehouse may store

terabytes of data. Complex data analysis may take a very long time

to run on the complete data set.

Data reduction strategies
Data cube aggregation

Attribute subset selection

Dimensionality reduction

Numerosity reduction

Discretization and concept hierarchy generation


Data cube aggregation

Aggregation operations are applied to the data in the construction
of a data cube; the resulting cube stores the data at a reduced,
summarized level.


Attribute subset selection

Irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed, using:
Stepwise forward selection
Stepwise backward elimination
Combination of forward selection and backward
elimination
Decision tree induction


Dimensionality reduction
Encoding mechanisms are used to reduce the data size.
Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it to a
numerically different vector, X', of wavelet coefficients.

Principal components analysis (PCA)
Unlike attribute subset selection, which reduces the attribute set size by
retaining a subset of the initial set of attributes, PCA “combines” the
essence of attributes by creating an alternative, smaller set of variables.
Data compression
Numerosity reduction
The data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need to store only
the model parameters instead of the actual data) or nonparametric
methods such as clustering, sampling, and the use of histograms.
Regression and Log-Linear Models
Histograms
Clustering
Sampling
Data compression
Regression and Log-Linear Models

Regression and log-linear models can be used to approximate the
given data.
In linear regression, the data are modeled to fit a straight line:
y = wx + b
where y is called a response variable and x a predictor variable.
Log-linear models approximate discrete multidimensional
probability distributions.
This allows a higher-dimensional data space to be constructed from
lower-dimensional spaces.
Log-linear models are therefore also useful for dimensionality
reduction.
Histograms
Histograms use binning to approximate data distributions and are a
popular form of data reduction.

The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
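An equal-width histogram (bucket width 10, an illustrative choice) over the price list above, in plain Python:

from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]
width = 10
hist = Counter((p - 1) // width for p in prices)   # bucket 0 is $1-10, bucket 1 is $11-20, ...
for b in sorted(hist):
    print(f"${b * width + 1}-${(b + 1) * width}: {hist[b]} items")
# -> $1-$10: 13 items, $11-$20: 25 items, $21-$30: 14 items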
Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random sample (or
subset) of the data.



Sampling
Simple random sample without replacement (SRSWOR) of size s
Simple random sample with replacement (SRSWR) of size s
Cluster sample
Stratified sample

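A sketch of the four sampling schemes in plain Python on a toy data set of 100 values; the cluster and stratum definitions are illustrative assumptions:

import random

data = list(range(1, 101))
random.seed(0)
srswor = random.sample(data, 10)                        # SRSWOR: without replacement
srswr = [random.choice(data) for _ in range(10)]        # SRSWR: with replacement
clusters = [data[i:i + 10] for i in range(0, 100, 10)]  # 10 pre-formed "pages"
cluster_sample = random.choice(clusters)                # cluster sample: a whole cluster
strata = {"low": data[:50], "high": data[50:]}          # stratified: sample each stratum
stratified = [random.sample(s, 5) for s in strata.values()]
print(srswor, srswr, cluster_sample, stratified, sep="\n")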


Data Discretization and Concept
Hierarchy Generation
Data discretization techniques can be used to reduce the number of
values for a given continuous attribute by dividing the range of the
attribute into intervals.

Interval labels can then be used to replace actual data values.

Supervised discretization
Unsupervised discretization
Top-down discretization or splitting
Bottom-up discretization or merging
Data Discretization and Concept
Hierarchy Generation
A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.

Concept hierarchies can be used to reduce the data by


collecting and replacing low-level concepts (such as numerical
values for the attribute age) with higher-level concepts (such as
youth, middle-aged, or senior).

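A minimal discretization sketch with pandas.cut, mapping the numerical attribute age onto the youth / middle-aged / senior hierarchy; the interval boundaries are illustrative assumptions:

import pandas as pd

ages = pd.Series([13, 22, 45, 67, 30])
# Intervals (0, 24], (24, 54], (54, 120] replace raw ages with concept labels.
labels = pd.cut(ages, bins=[0, 24, 54, 120],
                labels=["youth", "middle-aged", "senior"])
print(labels.tolist())   # ['youth', 'youth', 'middle-aged', 'senior', 'middle-aged']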
