
MODULE-II

DATA MINING

OUTLINE
Introduction
• What is Data Mining
• Definition, Knowledge Discovery in Databases (KDD)
• Kinds of databases
• Data mining functionalities
• Classification of data mining systems
• Data mining task primitives
OUTLINE
Data Preprocessing:
• Data cleaning
• Data integration and transformation
• Data reduction
• Data discretization and Concept hierarchy.
INTRODUCTION

Introduction
Data mining has attracted a great deal of attention in the information industry in recent
years because of the wide availability of huge amounts of data and the imminent need for
turning such data into useful information and knowledge.

The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis, to engineering design and
science exploration.
INTRODUCTION

The evolution of database technology


WHAT IS DATA MINING - DEFINITION

Data mining refers to extracting or "mining" knowledge from large amounts of data. There
are many other terms related to data mining, such as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term, "Knowledge
Discovery in Databases", or KDD.

Knowledge discovery as a process is depicted in the figure.


Knowledge Discovery in Databases
It consists of an iterative sequence of the following steps:
• Data cleaning: to remove noise or irrelevant data
• Data integration: where multiple data sources may be combined
• Data selection: where data relevant to the analysis task are retrieved from the database
• Data transformation: where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations
• Data mining: an essential process where intelligent methods are applied in order to
extract data patterns
• Pattern evaluation: to identify the truly interesting patterns representing knowledge
based on some interestingness measures
• Knowledge presentation: where visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
Architecture of a typical data mining system/Major Components
Data mining is the process of discovering interesting knowledge from large amounts of
data stored either in databases, data warehouses, or other information repositories. Based
on this view, the architecture of a typical data mining system may have the following
major components:
1. A database, data warehouse, or other information repository, which consists of the
set of databases, data warehouses, spreadsheets, or other kinds of information repositories
containing the data to be mined.
2. A database or data warehouse server which fetches the relevant data based on users‘
data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the search or to
evaluate the interestingness of resulting patterns. For example, the knowledge base may
contain metadata which describes data from multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks such as
characterization, association, classification, cluster analysis, and evolution and deviation
analysis.
5. A pattern evaluation module that works in tandem with the data mining modules by
employing interestingness measures to help focus the search toward interesting
patterns.
6. A graphical user interface that allows the user an interactive approach to the data
mining system.
KINDS OF DATA BASES

Data mining should be applicable to any kind of data repository, as well as to transient
data, such as data streams.

The scope of our examination of data repositories will include relational databases, data
warehouses, transactional databases, advanced database systems, flat files, data streams,
and the World Wide Web.

Advanced database systems include object-relational databases and specific application-
oriented databases, such as spatial databases, time-series databases, text databases, and
multimedia databases.
KINDS OF DATA BASES

Relational Databases
A relational database is a collection of tables, each of which is assigned a unique name.

Each table consists of a set of attributes (columns or fields) and usually stores a large set
of tuples (records or rows). Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values.

A semantic data model, such as an entity-relationship (ER) data model, is often
constructed for relational databases.

Relational data can be accessed by database queries written in a relational query
language, such as SQL.
KINDS OF DATA BASES

Data Warehouses

Suppose that AllElectronics is a successful international company, with branches around
the world. Each branch has its own set of databases. The president of AllElectronics has
asked you to provide an analysis of the company’s sales per item type per branch for the
third quarter. This is a difficult task, particularly since the relevant data are spread out
over several databases, physically located at numerous sites.

If AllElectronics had a data warehouse, this task would be easy. A data warehouse is a
repository of information collected from multiple sources, stored under a unified schema,
and that usually resides at a single site. Data warehouses are constructed via a process of
data cleaning, data integration, data transformation, data loading, and periodic data
refreshing.
KINDS OF DATA BASES

A data warehouse is usually modeled by a multidimensional database structure, where
each dimension corresponds to an attribute or a set of attributes in the schema, and each
cell stores the value of some aggregate measure, such as count or sales amount. The
actual physical structure of a data warehouse may be a relational data store or a
multidimensional data cube.

A data cube for AllElectronics. A data cube for summarized sales data of AllElectronics is
presented in the figure. The cube has three dimensions: address (with city values Chicago,
New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and
item (with item type values home entertainment, computer, phone, security).
KINDS OF DATA BASES

Transactional Databases
A transactional database consists of a file where each record represents a transaction. A
transaction typically includes a unique transaction identity number (trans ID) and a list of
the items making up the transaction.

As an analyst of the AllElectronics database, you may ask, “Show me all the items
purchased by Sandy Smith” or “How many transactions include item number I3?”
Answering such queries may require a scan of the entire transactional database.
KINDS OF DATA BASES

Advanced Data and Information Systems and Advanced Applications

Object-Relational Databases
Object-relational databases are constructed based on an object-relational data model.
This model extends the relational model by providing a rich data type for handling
complex objects and object orientation.
Each object has associated with it the following:
• A set of variables that describe the objects. These correspond to attributes in the entity-
relationship and relational models.
• A set of messages that the object can use to communicate with other objects, or with
the rest of the database system.
• A set of methods, where each method holds the code to implement a message. Upon
receiving a message, the method returns a value in response.
KINDS OF DATA BASES

Temporal Databases, Sequence Databases, and Time-Series Databases

A temporal database typically stores relational data that include time-related attributes.
These attributes may involve several timestamps, each having different semantics.

A sequence database stores sequences of ordered events, with or without a concrete
notion of time. Examples include customer shopping sequences, Web click streams, and
biological sequences.

A time-series database stores sequences of values or events obtained over repeated
measurements of time (e.g., hourly, daily, weekly).
KINDS OF DATA BASES

Spatial Databases and Spatiotemporal Databases


Spatial databases contain spatial-related information. Examples include geographic
(map) databases, very large-scale integration (VLSI) or computer-aided design databases,
and medical and satellite image databases.

A spatial database that stores spatial objects that change with time is called a
spatiotemporal database.
Text Databases and Multimedia Databases
Text databases are databases that contain word descriptions for objects. These word
descriptions are usually not simple keywords but rather long sentences or paragraphs,
such as product specifications, error or bug reports, warning messages, summary reports,
notes, or other documents.
Multimedia databases store image, audio, and video data.
KINDS OF DATA BASES

Heterogeneous Databases and Legacy Databases


A heterogeneous database consists of a set of interconnected, autonomous component
databases. The components communicate in order to exchange information and answer
queries. Objects in one component database may differ greatly from objects in other
component databases, making it difficult to assimilate their semantics into the overall
heterogeneous database.

A legacy database is a group of heterogeneous databases that combines different kinds of
data systems, such as relational or object-oriented databases, hierarchical databases,
network databases, spreadsheets, multimedia databases, or file systems.
KINDS OF DATA BASES

Data Streams
Many applications involve the generation and analysis of a new kind of data, called stream
data, where data flow in and out of an observation platform (or window) dynamically.
Such data streams have the following unique features: huge or possibly infinite volume,
dynamically changing, flowing in and out in a fixed order, allowing only one or a small
number of scans, and demanding fast (often real-time) response time.

Typical examples of data streams include various kinds of scientific and engineering data,
time-series data
The World Wide Web
The World Wide Web and its associated distributed information services, such as Yahoo!,
Google, America Online, and AltaVista, provide rich, worldwide, on-line information
services, where data objects are linked together to facilitate interactive access.
Data Mining Functionalities

Data Mining—What Kinds of Patterns Can Be Mined?

Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. In general, data mining tasks can be classified into two categories:
1.Descriptive
2.Predictive.

Descriptive mining tasks characterize the general properties of the data in the database.

Predictive mining tasks perform inference on the current data in order to make
predictions.
Data Mining Functionalities

Concept/Class Description: Characterization and Discrimination


Data can be associated with classes or concepts.

For example, in the AllElectronics store, classes of items for sale include computers and
printers, and concepts of customers include bigSpenders and budgetSpenders.

It can be useful to describe individual classes and concepts in summarized, concise, and
yet precise terms. Such descriptions of a class or a concept are called class/concept
descriptions.
These descriptions can be derived via
1. Data characterization
2. Data discrimination
Data Mining Functionalities

Data characterization
is a summarization of the general characteristics or features of a target class of data. The
data corresponding to the user-specified class are typically collected by a database query.

For example, to study the characteristics of software products whose sales increased by
10% in the last year, the data related to such products can be collected by executing an
SQL query

The output of data characterization can be presented in various forms. Examples include
pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs.
Data Mining Functionalities

Data discrimination
is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.

The target and contrasting classes can be specified by the user, and the corresponding data
objects retrieved through database queries.

For example, the user may like to compare the general features of software products
whose sales increased by 10% in the last year with those whose sales decreased by at least
30% during the same period. The methods used for data discrimination are similar to
those used for data characterization.

Here also we use the same pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs for output.
Data Mining Functionalities

Mining Frequent Patterns, Associations, and Correlations


Frequent patterns, as the name suggests, are patterns that occur frequently in data. There
are many kinds of frequent patterns, including itemsets, subsequences, and substructures.

A frequent itemset typically refers to a set of items that frequently appear together
in a transactional data set, such as milk and bread.

A frequently occurring subsequence, such as the pattern that customers tend to purchase
first a PC, followed by a digital camera, and then a memory card, is a (frequent)
sequential pattern.
A substructure can refer to different structural forms, such as graphs, trees, or lattices,
which may be combined with itemsets or subsequences. If a substructure occurs
frequently, it is called a (frequent) structured pattern.
Data Mining Functionalities

Eg:
Association analysis. Suppose, as a marketing manager of AllElectronics, you would like
to determine which items are frequently purchased together within the same transactions.

An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer. A confidence, or certainty, of 50% means
that if a customer buys a computer, there is a 50% chance that she will buy software as
well. A 1% support means that 1% of all of the transactions under analysis showed that
computer and software were purchased together. This association rule involves a single
attribute or predicate (i.e., buys) that repeats.

Typically, association rules are discarded as uninteresting if they do not satisfy both a
minimum support threshold and a minimum confidence threshold.
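To make support and confidence concrete, here is a minimal Python sketch (not from the original slides; the transaction contents are invented) that counts them for a toy rule of the form computer => software:

```python
# Minimal sketch: support and confidence of the rule "computer => software"
# over a toy list of transactions (itemsets). The data below are invented.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer", "software"},
    {"computer", "printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / antecedent  # fraction of computer buyers who also bought software

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

A real association-rule miner (e.g., Apriori) would enumerate many candidate itemsets and keep only those rules meeting the minimum support and confidence thresholds mentioned above.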
Data Mining Functionalities

Classification and Prediction

Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based on
the analysis of a set of training data (i.e., data objects whose class label is known).

“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae,
or neural networks.
Data Mining Functionalities

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions. Decision trees can easily be converted to classification
rules.

A neural network, when used for classification, is typically a collection of neuron-like
processing units with weighted connections between the units.

There are many other methods for constructing classification models, such as naïve
Bayesian classification, support vector machines, and k-nearest neighbor classification.
Data Mining Functionalities

Eg:
Classification and prediction. Suppose, as sales manager of AllElectronics, you would
like to classify a large set of items in the store, based on three kinds of responses to a
sales campaign: good response, mild response, and no response. You would like to derive
a model for each of these three classes based on the descriptive features of the items, such
as price, brand, place made, type, and category.
Data Mining Functionalities

Cluster Analysis

“What is cluster analysis?” Unlike classification and prediction, which analyze class-
labeled data objects, clustering analyzes data objects without consulting a known class
label.
Data Mining Functionalities

The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. That is, clusters of objects are formed
so that objects within a cluster have high similarity in comparison to one another, but are
very dissimilar to objects in other clusters.

Eg:
Cluster analysis can be performed on AllElectronics customer data in order to identify
homogeneous subpopulations of customers. These clusters may represent individual
target groups for marketing.
Data Mining Functionalities

Outlier Analysis
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection, the
rare events can be more interesting than the more regularly occurring ones. The analysis
of outlier data is referred to as outlier mining.

Outliers may be detected using statistical tests that assume a distribution or probability
model for the data, or using distance measures where objects that are a substantial
distance from any other cluster are considered outliers.

Data Mining Functionalities

Eg:
Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account number in
comparison to regular charges incurred by the same account. Outlier values may also be
detected with respect to the location and type of purchase, or the purchase frequency.
Data Mining Functionalities

Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time
related
data, distinct features of such an analysis include time-series data analysis, sequence or
periodicity pattern matching, and similarity-based data analysis.

Eg:
Evolution analysis. Suppose that you have the major stock market (time-series) data of
the last several years available from the New York Stock Exchange and you would like to
invest in shares of high-tech industrial companies. A data mining study of stock exchange
data may identify stock evolution regularities for overall stocks and for the stocks of
particular companies. Such regularities may help predict future trends in stock market
prices, contributing to your decision making regarding stock investments.
Data Mining Functionalities

Classification of Data Mining Systems


Data mining is an interdisciplinary field, the confluence of a set of disciplines, including
database systems, statistics, machine learning, visualization, and information science.

Depending on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from spatial data analysis, information
retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web
technology, economics, business, bioinformatics, or psychology.
Data Mining Functionalities

Classification according to the kinds of databases mined:


A data mining system can be classified according to the kinds of databases mined.
Database systems can be classified according to different criteria (such as data models, or
the types of data or applications involved), each of which may require its own data
mining technique. Datamining systems can therefore be classified accordingly.

Classification according to the kinds of knowledge mined:


Data mining systems can be categorized according to the kinds of knowledge they mine,
that is, based on data mining functionalities, such as characterization, discrimination,
association and correlation analysis, classification, prediction, clustering, outlier analysis,
and evolution analysis.
Data Mining Functionalities

Classification according to the kinds of techniques utilized:


Data mining systems can be categorized according to the underlying data mining
techniques employed. These techniques can be described according to the degree of user
interaction involved (e.g., autonomous systems, interactive exploratory systems, query-
driven systems) or the methods of data analysis employed (e.g., database-oriented or data
warehouse–oriented techniques, machine learning, statistics, visualization, pattern
recognition, neural networks, and so on).

Classification according to the applications adapted:


Data mining systems can also be categorized according to the applications they adapt. For
example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.
Data Mining Task Primitives

Each user will have a data mining task in mind, that is, some form of data analysis that he
or she would like to have performed. A data mining task can be specified in the form of a
data mining query, which is input to the data mining system.

A data mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during discovery
in order to direct the mining process, or examine the findings from different angles or
depths.
Data Mining Task Primitives

The set of task-relevant data to be mined: This specifies the portions of the database
or the set of data in which the user is interested. This includes the database attributes
or data warehouse dimensions of interest (referred to as the relevant attributes or
dimensions).

The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process: This knowledge
about the domain to be mined is useful for guiding the knowledge discovery process and
for evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
Data Mining Task Primitives

Pattern interestingness measure: This primitive allows users to specify functions that
are used to separate uninteresting patterns from knowledge and may be used to guide the
mining process, as well as to evaluate the discovered patterns. This allows the user to
confine the number of uninteresting patterns returned by the process, as a data mining
process may generate a large number of patterns. Interestingness measures can be
specified for such pattern characteristics as simplicity, certainty, utility and novelty.

The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes.
Data Mining Task Primitives
DATA PREPROCESSING

Data preprocessing describes processing performed on raw data to prepare it for another
processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively
processed for the purpose of the user.
DATA PREPROCESSING

Why Data Preprocessing?


Data in the real world is dirty: it can be incomplete, noisy, and inconsistent.
These data need to be preprocessed in order to help improve the quality of the data,
and the quality of the mining results.
Without quality data, there can be no quality mining results; quality decisions must be
based on quality data.
If there is much irrelevant and redundant information present, or noisy and
unreliable data, then knowledge discovery during the training phase is more difficult.

Incomplete data: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data, e.g., occupation = “ ”.
Noisy data: containing errors or outliers, e.g., Salary = -100
DATA PREPROCESSING

Inconsistent data: containing discrepancies in codes or names,
e.g., Age = “42” but Birthday = “03/07/1997”

Incomplete data may come from:
• “Not applicable” data values when collected
• Different considerations between the time when the data was collected and when it is
analyzed
• Human/hardware/software problems

Noisy data (incorrect values) may come from:
• Faulty data collection by instruments
• Human or computer error at data entry
• Errors in data transmission

Inconsistent data may come from:
• Different data sources


DATA PREPROCESSING

Major Tasks in Data Preprocessing


• Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
Integration of multiple databases, data cubes, or files
• Data transformation
Normalization and aggregation
• Data reduction
Obtains reduced representation in volume but produces the same or similar analytical
results
• Data discretization
Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
DATA PREPROCESSING
DATA PREPROCESSING- Data cleaning

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.

Missing Values:
Imagine that you need to analyze AllElectronics sales and customer data. You note that
many tuples have no recorded value for several attributes, such as customer income. How
can you go about filling in the missing values for this attribute? Let’s look at the
following
methods:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably.
DATA PREPROCESSING- Data cleaning

2. Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown” or -∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that they
form an interesting concept, since they all have a value in common—that of “Unknown.”
Hence, although this method is simple, it is not foolproof.

4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of AllElectronics customers is $56,000. Use this value to replace the
missing value for income. (Methods 4 and 5 are illustrated in the sketch after this list.)
DATA PREPROCESSING- Data cleaning

5. Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value
with the average income value for customers in the same credit risk category as that of the
given tuple.

6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
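As a rough illustration of methods 4 and 5 above, the following Python sketch fills a missing income value with the overall attribute mean and with the mean of the customer's credit-risk class; the records and values are invented.

```python
# Sketch: filling missing 'income' values. Toy records only.
customers = [
    {"income": 40000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 80000, "risk": "high"},
    {"income": None,  "risk": "high"},
]

known = [c["income"] for c in customers if c["income"] is not None]
overall_mean = sum(known) / len(known)          # method 4: attribute mean

def class_mean(risk):
    """Mean income of customers in the same credit-risk class (method 5)."""
    vals = [c["income"] for c in customers
            if c["risk"] == risk and c["income"] is not None]
    return sum(vals) / len(vals)

for c in customers:
    if c["income"] is None:
        c["income"] = class_mean(c["risk"])     # or overall_mean for method 4

print(customers)
```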
DATA PREPROCESSING- Data cleaning

Noisy Data:
“What is noise?” Noise is a random error or variance in a measured variable. Given a
numerical attribute such as, say, price, how can we “smooth” out the data to remove the
noise?
Let’s look at the following data smoothing techniques:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:


Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
DATA PREPROCESSING- Data cleaning

Smoothing by bin boundaries:


Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
1. Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are distributed into a
number of “buckets,” or bins. Because binning methods consult the neighborhood of
values, they perform local smoothing. In this example, the data for price are first sorted
and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three
values). In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each
original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians
can be employed, in which each bin value is replaced by the bin median. In smoothing by
bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
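The binning example above can be reproduced with a short Python sketch (an illustration only, not part of the original slides):

```python
# Sketch: equal-frequency binning of the sorted prices, then smoothing by
# bin means and by bin boundaries, reproducing the example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # bins of size 3

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

def nearest_boundary(v, lo, hi):
    return lo if abs(v - lo) <= abs(v - hi) else hi

by_boundaries = [[nearest_boundary(v, b[0], b[-1]) for v in b] for b in bins]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```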
DATA PREPROCESSING- Data cleaning

2. Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two attributes (or
variables), so that one attribute can be used to predict the other. Multiple linear regression
is an extension of linear regression, where more than two attributes are involved and the
data are fit to a multidimensional surface.
DATA PREPROCESSING- Data cleaning

3. Clustering: Outliers may be detected by clustering, where similar values are organized
into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be
considered outliers
DATA PREPROCESSING- Data cleaning

The data should also be examined regarding unique rules, consecutive rules, and null
rules.
A unique rule says that each value of the given attribute must be different from all other
values for that attribute.
A consecutive rule says that there can be no missing values between the lowest and
highest values for the attribute, and that all values must also be unique (e.g., as in check
numbers).
A null rule specifies the use of blanks, question marks, special characters, or other strings
that may indicate the null condition (e.g., where a value for a given attribute is not
available), and how such values should be handled
DATA PREPROCESSING- Data Integration and Transformation

Data Integration and Transformation


Data mining often requires data integration—the merging of data from multiple data
stores. The data may also need to be transformed into forms appropriate for mining

Data Integration
It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources
may include multiple databases, data cubes, or flat files.
DATA PREPROCESSING- Data Integration and Transformation

There are a number of issues to consider during data integration.


Schema integration and object matching can be tricky. How can equivalent real-world
entities from multiple data sources be matched up? This is referred to as the entity
identification problem.

For example, how can the data analyst or the computer be sure that customer id in one
database and cust number in another refer to the same attribute?

The metadata for each attribute include the name, meaning, data type, and range of values
permitted for the attribute, and null rules for handling blank, zero, or null values. Such
metadata can be used to help avoid errors in schema integration.

The metadata may also be used to help transform the data (e.g., where data codes for pay
type in one database may be “H” and “S”, and 1 and 2 in another).
DATA PREPROCESSING- Data Integration and Transformation

Redundancy is another important issue.

An attribute (such as annual revenue, for instance) may be redundant if it can be
“derived” from another attribute or set of attributes. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set.

Some redundancies can be detected by correlation analysis. Given two attributes, such
analysis can measure how strongly one attribute implies the other, based on the available
data. For numerical attributes, we can evaluate the correlation between two attributes, A
and B, by computing the correlation coefficient
DATA PREPROCESSING- Data Integration and Transformation

The correlation coefficient is computed as

r_A,B = Σ (a_i − Ā)(b_i − B̄) / (n σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective mean values of A and B, and
σ_A and σ_B are the respective standard deviations of A and B.

If the result is > 0, then A and B are positively correlated, meaning that the values of A
increase as the values of B increase. A higher value may indicate redundancy, and one of
the attributes may be removed.
If the result is 0, then A and B are independent and there is no correlation between them.
If the result is < 0, then A and B are negatively correlated, where the values of one
attribute increase as the values of the other decrease, meaning that each attribute
discourages the other.
This measure is also called Pearson's product moment coefficient.
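A minimal Python sketch of this correlation check between two numeric attributes (the attribute names and values are invented):

```python
import math

# Sketch: Pearson's product-moment correlation between two numeric attributes.
def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (std_a * std_b)

annual_revenue = [10, 20, 30, 40]
annual_profit  = [2, 4, 6, 8]            # moves in lockstep with revenue
print(correlation(annual_revenue, annual_profit))   # ~1.0 -> likely redundant
```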
DATA PREPROCESSING- Data Integration and Transformation
For categorical (discrete) data, a correlation relationship between two attributes, A
and B, can be discovered by a X2 (chi-square) test.

Suppose A has c distinct values, namely a1, a2, ..., ac. B has r distinct values, namely
b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency
table, with the c values of A making up the columns and the r values of B making up the
rows.

Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes
on value bj, that is, where (A = ai, B = bj). Each and every possible (Ai, Bj) joint event
has its own cell (or slot) in the table. The χ2 value (also known as the Pearson χ2
statistic) is computed as:

χ2 = Σ_i Σ_j (o_ij − e_ij)² / e_ij
DATA PREPROCESSING- Data Integration and Transformation

where o_ij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and
e_ij is the expected frequency of (Ai, Bj), which can be computed as

e_ij = (count(A = ai) × count(B = bj)) / N

where N is the number of data tuples, count(A = ai) is the number of tuples having value
ai for A, and count(B = bj) is the number of tuples having value bj for B.
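For two categorical attributes, the same check can be sketched in Python over a toy contingency table (the counts below are invented):

```python
# Sketch: Pearson chi-square statistic for a contingency table of two
# categorical attributes A (columns) and B (rows).
observed = [
    [250, 200],    # B = b1
    [50, 1000],    # B = b2
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = col_totals[j] * row_totals[i] / N   # expected frequency
        chi2 += (o_ij - e_ij) ** 2 / e_ij

print(f"chi-square = {chi2:.1f}")   # compare against a chi-square distribution table
```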
DATA PREPROCESSING- Data Integration and Transformation
DATA TRANSFORMATION
In data transformation, the data are transformed or consolidated into forms appropriate
for mining.
Data transformation can involve the following:
• Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.

• Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and
annual total amounts. This step is typically used in constructing a data cube for
analysis of the data at multiple granularities.

• Generalization of the data, where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example,
categorical attributes, like street, can be generalized to higher-level concepts,
DATA PREPROCESSING- Data Integration and Transformation
like city or country. Similarly, values for numerical attributes, like age, may be
mapped to higher-level concepts, like youth, middle-aged, and senior.

• Normalization, where the attribute data are scaled so as to fall within a small
specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

• Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining process.
DATA PREPROCESSING- Data Integration and Transformation
An attribute is normalized by scaling its values so that they fall within a small
specified range, such as 0.0 to 1.0.
There are many methods for data normalization.
We study three:
1. min-max normalization
2. z-score normalization
3. normalization by decimal scaling.
DATA PREPROCESSING- Data Integration and Transformation
Min-max normalization performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an
attribute A. Min-max normalization maps a value, v, of A to v' in the range
[new_minA, new_maxA] by computing

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

Suppose that the minimum and maximum values for the attribute income are
$12,000 and $98,000, respectively. We would like to map income to the range
[0.0,1.0].
By min-max normalization, a value of $73,600 for income is transformed to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716.
DATA PREPROCESSING- Data Integration and Transformation
In z-score normalization (or zero-mean normalization), the values for an attribute,
A, are normalized based on the mean and standard deviation of A. A value, v, of A is
normalized to v' by computing

v' = (v − Ā) / σ_A

where Ā and σ_A are the mean and standard deviation of A.

z-score normalization: Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization,
a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225.
DATA PREPROCESSING- Data Integration and Transformation
Normalization by decimal scaling normalizes by moving the decimal point of values
of attribute A. The number of decimal points moved depends on the maximum
absolute value of A. A value, v, of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.

Decimal scaling. Suppose that the recorded values of A range from -986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore
divide each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917
normalizes to 0.917.
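The three normalization examples above can be checked with a short Python sketch (the values come from the examples; the decimal-scaling helper is an illustrative assumption):

```python
import math

# Sketch: the three normalization methods applied to the example values above.
v = 73_600

# 1. min-max normalization to [0.0, 1.0] with min = 12,000 and max = 98,000
min_a, max_a, new_min, new_max = 12_000, 98_000, 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# 2. z-score normalization with mean = 54,000 and standard deviation = 16,000
v_zscore = (v - 54_000) / 16_000

# 3. decimal scaling for an attribute whose values range from -986 to 917
def decimal_scale(value, max_abs):
    j = math.floor(math.log10(max_abs)) + 1   # smallest j with max(|v'|) < 1
    return value / 10 ** j

print(round(v_minmax, 3))          # 0.716
print(round(v_zscore, 3))          # 1.225
print(decimal_scale(-986, 986))    # -0.986
```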
DATA PREPROCESSING- Data Reduction
Data Reduction
Imagine that you have selected data from the AllElectronics data warehouse for
analysis. The data set will likely be huge! Complex data analysis and mining on huge
amounts of data can take a long time, making such analysis impractical or infeasible.

Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the
original data.
DATA PREPROCESSING- Data Reduction
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the
data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative,
smaller data representations such as parametric models (which need store only the
model parameters instead of the actual data) or nonparametric methods such as
clustering, sampling, and the use of histograms.
5.Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. Data discretization is a
form of numerosity reduction that is very useful for the automatic generation of
concept hierarchies.
DATA PREPROCESSING- Data Reduction
Data Cube Aggregation
Imagine that you have collected the data for your analysis. These data consist of the
AllElectronics sales per quarter, for the years 2002 to 2004. You are, however,
interested in the annual sales (total per year), rather than the total per quarter. Thus the
data can be aggregated so that the resulting data summarize the total sales per year
instead of per quarter.
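A minimal Python sketch of this aggregation step (the quarterly figures are invented):

```python
from collections import defaultdict

# Sketch: rolling quarterly sales up to annual totals.
quarterly_sales = [
    (2002, "Q1", 224_000), (2002, "Q2", 408_000),
    (2002, "Q3", 350_000), (2002, "Q4", 586_000),
    (2003, "Q1", 230_000), (2003, "Q2", 410_000),
]

annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

print(dict(annual_sales))   # e.g., {2002: 1568000, 2003: 640000}
```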
DATA PREPROCESSING- Data Reduction
Attribute Subset Selection
Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions). The goal of attribute subset selection is to find a
minimum set of attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using all attributes.

Mining on a reduced set of attributes has an additional benefit. It reduces the number
of attributes appearing in the discovered patterns, helping to make the patterns easier
to understand.

“How can we find a ‘good’ subset of the original attributes?” For n attributes, there
are 2^n possible subsets.
DATA PREPROCESSING- Data Reduction
Basic heuristic methods of attribute subset selection include the following techniques

1. Stepwise forward selection: The procedure starts with an empty set of attributes as
the reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set (see the sketch after this list).

2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
DATA PREPROCESSING- Data Reduction
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step,
the procedure selects the best attribute and removes the worst from among the
remaining attributes.

4. Decision tree induction constructs a flowchart like structure where each internal
(non leaf) node denotes a test on an attribute, each branch corresponds to an outcome
of the test, and each external (leaf) node denotes a class prediction. At each node, the
algorithm chooses the “best” attribute to partition the data into individual classes.
When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data. All attributes that do not appear in the tree are
assumed to be irrelevant. The set of attributes appearing in the tree form the reduced
subset of attributes.
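A greedy sketch of stepwise forward selection in Python. The score() function is a stand-in (in practice it would be a test of statistical significance or the accuracy of a model trained on the candidate subset), and the attribute names and weights are invented:

```python
# Sketch: greedy stepwise forward selection with a placeholder scoring function.
def score(attrs):
    # hypothetical relevance of each attribute to the mining task
    relevance = {"income": 0.9, "age": 0.6, "street": 0.1, "phone": 0.05}
    return sum(relevance.get(a, 0.0) for a in attrs)

def forward_selection(all_attrs, k):
    selected, remaining = [], set(all_attrs)
    while remaining and len(selected) < k:
        # add the attribute that improves the subset score the most
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(["income", "age", "street", "phone"], k=2))
# ['income', 'age']
```

Stepwise backward elimination works the same way in reverse: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.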
DATA PREPROCESSING- Data Reduction
Dimensionality Reduction
In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data.

Lossless:
If the original data can be reconstructed from the compressed data without any loss of
information, the data reduction is called lossless.
Lossy:
If, instead, we can reconstruct only an approximation of the original data, then the
data reduction is called lossy.

Two effective methods of lossy dimensionality reduction are
1. wavelet transforms and
2. principal components analysis.
DATA PREPROCESSING- Data Reduction
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector X, transforms it to a numerically different vector, X', of
wavelet coefficients. The two vectors are of the same length.

When applying this technique to data reduction, we consider each tuple as an n-
dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made
on the tuple from n database attributes.

“How can this technique be useful for data reduction if the wavelet transformed data
are of the same length as the original data?” The usefulness lies in the fact that the
wavelet transformed data can be truncated.
DATA PREPROCESSING- Data Reduction
A compressed approximation of the data can be retained by storing only a small
fraction of the strongest of the wavelet coefficients.

For example, all wavelet coefficients larger than some user-specified threshold can be
retained. All other coefficients are set to 0.

The method is as follows:


DATA PREPROCESSING- Data Reduction
1. The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L >=n).
2. Each transform involves applying two functions. The first applies some data
smoothing, such as a sum or weighted average. The second performs a weighted
difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i, x2i+1). This results in two sets of data of length L/2. In general,
these represent a smoothed or low-frequency version of the input data and the high-
frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the
previous loop, until the resulting data sets obtained are of length 2.
5. Selected values from the data sets obtained in the above iterations are designated
the wavelet coefficients of the transformed data.
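One possible (Haar-like) realization of these steps, written as a Python sketch; the averaging/differencing pair and the truncation threshold are illustrative choices, and the recursion here is carried all the way down to a single overall average:

```python
# Sketch: pyramid Haar-like transform of a length-8 vector, followed by
# truncation of the weaker coefficients (the reduced representation).
def haar_transform(x):
    coeffs = []
    while len(x) > 1:
        smooth = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]  # averages
        detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]  # differences
        coeffs = detail + coeffs     # keep high-frequency detail, finest level last
        x = smooth
    return x + coeffs                # overall average followed by detail coefficients

data = [2, 2, 0, 2, 3, 5, 4, 4]      # length must be a power of 2
wavelet = haar_transform(data)       # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
truncated = [c if abs(c) >= 0.75 else 0 for c in wavelet]
print(truncated)
```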
DATA PREPROCESSING- Data Reduction
Principal Components Analysis
Suppose that the data to be reduced consist of tuples or data vectors described by n
attributes or dimensions.

Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L,
method), searches for k n-dimensional orthogonal vectors that can best be used to
represent the data, where k <= n. The original data are thus projected onto a much
smaller space, resulting in dimensionality reduction.
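A compact PCA sketch using NumPy (NumPy is an assumption here, not something the slides prescribe): center the data, take the SVD, and project onto the k strongest components.

```python
import numpy as np

# Sketch: project 2-D tuples onto their k = 1 strongest principal component.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

k = 1
X_centered = X - X.mean(axis=0)
# rows of Vt are the orthogonal principal component directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:k].T    # original tuples expressed in k dimensions

print(X_reduced.shape)               # (6, 1)
```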
DATA PREPROCESSING- Data Reduction
Numerosity Reduction
“Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data
representation?” .Techniques of numerosity reduction may be
1. parametric
2. nonparametric.

For parametric methods, a model is used to estimate the data, so that typically only
the data parameters need to be stored, instead of the actual data.
Log-linear models, which estimate discrete multidimensional probability
distributions, are an example.

Nonparametric methods for storing reduced representations of the data include
histograms, clustering, and sampling.
DATA PREPROCESSING- Data Reduction
Regression and Log-Linear Models
Regression and log-linear models can be used to approximate the given data. In
(simple) linear regression, the data are modeled to fit a straight line. For example, a
random variable, y (called a response variable), can be modeled as a linear function of
another random variable, x (called a predictor variable), with the equation
y = wx + b
Multiple linear regression is an extension of (simple) linear regression, which allows a
response variable, y, to be modeled as a linear function of two or more predictor
variables.
Log-linear models approximate discrete multidimensional probability distributions.
Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider
each tuple as a point in an n-dimensional space. Log-linear models can be used to
estimate the probability of each point in a multidimensional space for a set of
discretized attributes, based on a smaller subset of dimensional combinations.
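A least-squares fit of y = wx + b can be sketched in a few lines of Python (the sample points are invented); once w and b are stored, the original points can be discarded or approximated:

```python
# Sketch: ordinary least-squares fit of y = w*x + b for simple linear regression.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
w, b = fit_line(xs, ys)
print(f"y ~ {w:.2f} * x + {b:.2f}")   # two parameters replace the raw points
```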
DATA PREPROCESSING- Data Reduction
Histograms
Histograms use binning to approximate data distributions and are a popular form of
data reduction.
A histogram for an attribute, A, partitions the data distribution of A into disjoint
subsets, or buckets. If each bucket represents only a single attribute-value/frequency
pair, the buckets are called singleton buckets. Often, buckets instead represent
continuous ranges for the given attribute.
Ex
Histograms. The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5,
5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18,
18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30,
30.
DATA PREPROCESSING- Data Reduction

Figure shows a histogram for the data using singleton buckets. To further reduce
the data, it is common to have each bucket denote a continuous range of values for the
given attribute.
DATA PREPROCESSING- Data Reduction

“How are the buckets determined and the attribute values partitioned?” There are
several partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket range is uniform
(equal-width and equal-frequency bucketing are illustrated in the sketch after this list).
Equal-frequency (or equidepth): In an equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each bucket is constant (that is, each bucket
contains roughly the same number of contiguous data samples).
V-Optimal: If we consider all of the possible histograms for a given number of
buckets, the V-Optimal histogram is the one with the least variance. Histogram
variance is a weighted sum of the original values that each bucket represents, where
bucket weight is equal to the number of values in the bucket.
MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of
adjacent values. A bucket boundary is established between each pair for pairs having
the b-1 largest differences, where b is the user-specified number of buckets.
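The first two partitioning rules can be sketched in Python over the price list given earlier (an illustration only, not part of the original slides):

```python
# Sketch: equal-width vs. (roughly) equal-frequency buckets for the sorted prices.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

def equal_width_buckets(values, b):
    lo, hi = min(values), max(values)
    width = (hi - lo) / b
    return [(lo + i * width, lo + (i + 1) * width) for i in range(b)]

def equal_frequency_buckets(values, b):
    per_bucket = -(-len(values) // b)          # ceiling division
    return [values[i * per_bucket:(i + 1) * per_bucket] for i in range(b)]

print(equal_width_buckets(prices, 3))                    # three ranges of equal width
print([(b[0], b[-1]) for b in equal_frequency_buckets(prices, 3)])  # similar counts per bucket
```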
DATA PREPROCESSING- Data Reduction
Clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are “similar” to one another and
“dissimilar” to objects in other clusters. Similarity is commonly defined in terms of
how “close” the objects are in space, based on a distance function.

The “quality” of a cluster may be represented by its diameter, the maximum distance
between any two objects in the cluster. Centroid distance is an alternative measure of
cluster quality and is defined as the average distance of each cluster object from the
cluster centroid.

In data reduction, the cluster representations of the data are used to replace the actual
data.
DATA PREPROCESSING- Data Reduction
Sampling
Sampling can be used as a data reduction technique because it allows a large data set
to be represented by a much smaller random sample (or subset) of the data. Suppose
that a large data set, D, contains N tuples. Let’s look at the most common ways that
we could sample D for data reduction,
DATA PREPROCESSING- Data Reduction
Simple random sample without replacement (SRSWOR) of size s: This is created
by drawing s of the N tuples from D (s < N), where the probability of drawing any
tuple in D is 1/N, that is, all tuples are equally likely to be sampled.

Simple random sample with replacement (SRSWR) of size s: This is similar to
SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn
again.
DATA PREPROCESSING- Data Reduction
Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,”
then an SRS of s clusters can be obtained, where s < M. For example, tuples in a
database are usually retrieved a page at a time, so that each page can be considered a
cluster. A reduced data representation can be obtained by applying, say, SRSWOR to
the pages, resulting in a cluster sample of the tuples. Other clustering criteria
conveying rich semantics can also be explored. For example, in a spatial database, we
may choose to define clusters geographically based on how closely different areas are
located.
Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified
sample of D is generated by obtaining an SRS at each stratum. This helps ensure a
representative sample, especially when the data are skewed. For example, a stratified
sample may be obtained from customer data, where a stratum is created for each
customer age group. In this way, the age group having the smallest number of
customers will be sure to be represented.
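The sampling schemes above can be sketched with Python's random module; the data set, strata, and sample sizes below are invented:

```python
import random

# Sketch: SRSWOR, SRSWR, and stratified sampling over a toy set of tuples.
random.seed(42)
D = [{"id": i, "age_group": "senior" if i % 4 == 0 else "youth"}
     for i in range(1, 101)]

srswor = random.sample(D, k=10)                  # without replacement
srswr = [random.choice(D) for _ in range(10)]    # with replacement

# stratified sample: an SRS drawn from each age-group stratum
strata = {}
for t in D:
    strata.setdefault(t["age_group"], []).append(t)
stratified = [t for stratum in strata.values()
              for t in random.sample(stratum, k=min(5, len(stratum)))]

print(len(srswor), len(srswr), len(stratified))  # 10 10 10
```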
DATA PREPROCESSING- Data Reduction
Data Discretization and Concept Hierarchy Generation
Data discretization techniques can be used to reduce the number of values for a given
continuous attribute by dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values.
Discretization techniques can be categorized based on how the discretization is
performed
1.Top-down
2.Bottom-up
If the discretization process uses class information, then we say it is supervised
discretization. Otherwise, it is unsupervised.
If the process starts by first finding one or a few points (called split points or cut
points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals, it is called top-down discretization or splitting.
DATA PREPROCESSING- Data Reduction
The bottom-up discretization or merging, which starts by considering all of the
continuous values as potential split-points, removes some by merging neighborhood
values to form intervals, and then recursively applies this process to the resulting
intervals.
A concept hierarchy for a given numerical attribute defines a discretization of the
attribute. Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts (such as numerical values for the attribute age) with
higher-level concepts (such as youth, middle-aged, or senior).
Discretization and Concept Hierarchy Generation for Numerical Data
Concept hierarchies for numerical attributes can be constructed automatically based
on data discretization.
DATA PREPROCESSING- Data Reduction
Binning
Binning is a top-down splitting technique based on a specified number of bins. For
example, attribute values can be discretized by applying equal-width or equal-
frequency binning, and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively.

Histogram Analysis
Histograms partition the values for an attribute, A, into disjoint ranges called buckets.
DATA PREPROCESSING- Data Reduction
Entropy-Based Discretization
Entropy is one of the most commonly used discretization measures. Entropy-based
discretization is a supervised, top-down splitting technique. It explores class
distribution information in its calculation and determination of split-points.

To discretize a numerical attribute, A, the method selects the value of A that has the
minimum entropy as a split-point, and recursively partitions the resulting intervals to
arrive at a hierarchical discretization. Such discretization forms a concept hierarchy
for A.

Let D consist of data tuples defined by a set of attributes and a class-label attribute.
The class-label attribute provides the class information per tuple. The basic method
for
entropy-based discretization of an attribute A within the set is as follows:
DATA PREPROCESSING- Data Reduction
1. Each value of A can be considered as a potential interval boundary or split-point
(denoted split point) to partition the range of A. That is, a split-point for A can
partition the tuples in D into two subsets satisfying the conditions A <= split_point and
A > split_point, respectively, thereby creating a binary discretization.

2. Suppose we want to classify the tuples in D by partitioning on attribute A and some
split-point. Ideally, we would like this partitioning to result in an exact classification
of the tuples. For example, if we had two classes, we would hope that all of the tuples
of, say, class C1 will fall into one partition, and all of the tuples of class C2 will fall
into the other partition. However, this is unlikely. For example, the first partition may
contain many tuples of C1, but also some of C2. How much more information would
we still need for a perfect classification, after this partitioning? This amount is called
the expected information requirement for classifying a tuple in D based on
partitioning by A. It is given by
DATA PREPROCESSING- Data Reduction

Info_A(D) = (|D1| / |D|) × Entropy(D1) + (|D2| / |D|) × Entropy(D2)

where D1 and D2 correspond to the tuples in D satisfying the conditions A <= split_point
and A > split_point, respectively; |D| is the number of tuples in D, and so on. The entropy
function for a given set is calculated based on the class distribution of the tuples in the
set. For example, given m classes, C1, C2, ..., Cm, the entropy of D1 is

Entropy(D1) = − Σ_{i=1..m} p_i log2(p_i)

where p_i is the probability of class Ci in D1, determined by dividing the number of
tuples of class Ci in D1 by |D1|, the total number of tuples in D1. Therefore, when
selecting a split-point for attribute A, we want to pick the attribute value that gives the
minimum expected information requirement (i.e., min(Info_A(D))).
DATA PREPROCESSING- Data Reduction

3. The process of determining a split-point is recursively applied to each partition
obtained, until some stopping criterion is met, such as when the minimum information
requirement on all candidate split-points is less than a small threshold, ε, or when the
number of intervals is greater than a threshold, max_interval.
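Putting steps 1-3 together, a minimal Python sketch of one level of entropy-based splitting; the (value, class) tuples are invented, and only a single split is chosen rather than the full recursion:

```python
import math

# Sketch: pick the split-point of attribute A that minimizes Info_A(D).
D = [(1, "C1"), (2, "C1"), (3, "C1"), (7, "C2"), (8, "C2"), (9, "C2")]

def entropy(tuples):
    n = len(tuples)
    counts = {}
    for _value, label in tuples:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_a(split, tuples):
    d1 = [t for t in tuples if t[0] <= split]
    d2 = [t for t in tuples if t[0] > split]
    n = len(tuples)
    return len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)

candidates = sorted({v for v, _ in D})[:-1]   # each value is a potential split-point
best = min(candidates, key=lambda s: info_a(s, D))
print(best)   # 3: this split separates C1 from C2 perfectly (Info_A(D) = 0)
```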

Interval Merging by X2 Analysis

ChiMerge is a χ2-based discretization method. The discretization methods that we
have studied up to this point have all employed a top-down, splitting strategy. This
contrasts with ChiMerge, which employs a bottom-up approach by finding the best
neighboring intervals and then merging these to form larger intervals, recursively. The
method is supervised in that it uses class information.
DATA PREPROCESSING- Data Reduction
Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be
applied to discretize a numerical attribute, A, by partitioning the values of A into
clusters or groups. Clustering takes the distribution of A into consideration, as well as
the closeness of data points, and therefore is able to produce high-quality
discretization results.
DATA PREPROCESSING- Data Reduction
Concept Hierarchy Generation for Categorical Data
Categorical data are discrete data. Categorical attributes have a finite (but possibly
large) number of distinct values, with no ordering among the values. Examples
include geographic location, job category, and item type.
There are several methods for the generation of concept hierarchies for categorical
data.
Specification of a partial ordering of attributes explicitly at the schema level by users or experts :
Concept hierarchies for categorical attributes or dimensions typically involve
a group of attributes. A user or expert can easily define a concept hierarchy by
specifying a partial or total ordering of the attributes at the schema level. For example,
a relational database or a dimension location of a data warehouse may contain the
following group of attributes: street, city, province or state, and country. A hierarchy
can be defined by specifying the total ordering among these attributes at the schema
level, such as street < city < province or state < country.
DATA PREPROCESSING- Data Reduction
Specification of a portion of a hierarchy by explicit data grouping:
after specifying that province and country form a hierarchy at the schema level, a user
could define some intermediate levels manually, such as “{Alberta, Saskatchewan,
Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western
Canada”.

Specification of a set of attributes, but not of their partial ordering:


A user may specify a set of attributes forming a concept hierarchy, but omit to
explicitly state their partial ordering. The system can then try to automatically
generate the attribute ordering so as to construct a meaningful concept hierarchy.
DATA PREPROCESSING- Data Reduction
Specification of only a partial set of attributes:
Sometimes a user can be sloppy when defining a hierarchy, or have only a vague idea
about what should be included in a hierarchy. Consequently, the user may have
included only a small subset of the relevant attributes in the hierarchy specification.
For example, instead of including all of the hierarchically relevant attributes for
location, the user may have specified only street and city.

To handle such partially specified hierarchies, it is important to embed data semantics
in the database schema so that attributes with tight semantic connections can be
pinned together. In this way, the specification of one attribute may trigger a whole
group of semantically tightly linked attributes to be “dragged in” to form a complete
hierarchy.
