
Data Pre-processing

Prof. Ravi Patel


IT Department
ADIT
Why preprocessing?
• Today’s real-world databases are highly
susceptible to noisy, missing, and inconsistent
data due to their typically huge size (often several
gigabytes or more) and their likely origin from
multiple, heterogeneous sources.
• Data have quality if they satisfy the requirements
of the intended use. There are many factors
comprising data quality, including accuracy,
completeness, consistency, timeliness,
believability, and interpretability.
• Low-quality data will lead to low-quality mining
results.
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or
names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Data is dirty because…
• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected
and when it is analyzed.
– Human/hardware/software problems
• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
• Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
– Data warehouse needs consistent integration of quality
data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Data Cleaning process for missing
values
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred.
Methods:
1> Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
2> Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many missing
values.
3> Fill it in automatically with a global constant: e.g., “unknown” (note that the constant may effectively form a new class).
4> Use a measure of central tendency for the attribute (e.g., the mean or
median) to fill in the missing value
5> Use the attribute mean or median for all samples belonging to the same
class as the given tuple
6> Use the most probable value to fill in the missing value
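A minimal pandas sketch of methods 3–5 (the DataFrame, column names, and values below are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical sales data with missing customer income
df = pd.DataFrame({
    "income": [45000, None, 52000, None, 61000],
    "class":  ["low", "low", "high", "high", "high"],
})

# Method 3: fill with a global constant / sentinel value
df["income_const"] = df["income"].fillna(-1)

# Method 4: fill with a measure of central tendency (here the mean)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of all samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```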
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal
with possible outliers)
Binning
• Binning: Binning methods smooth a sorted
data value by consulting its “neighborhood,”
that is, the values around it. The sorted values
are distributed into a number of “buckets,” or
bins. Because binning methods consult the
neighborhood of values, they perform local
smoothing.
• smooth by bin means, smooth by bin median,
smooth by bin boundaries
Approach of binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately same
number of samples
– Good data scaling
– Managing categorical attributes can be a problem
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Example
• Data : 10,2,19,18,20,18,25,28,22
• Bin size : 3
• By bin mean, by bin median, and by bin boundaries
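A minimal Python sketch of the three smoothing variants applied to the data above, assuming equal-frequency bins of size 3 (ties between boundaries go to the lower boundary):

```python
data = sorted([10, 2, 19, 18, 20, 18, 25, 28, 22])   # [2, 10, 18, 18, 19, 20, 22, 25, 28]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smooth by bin means: replace every value with the mean of its bin
by_mean = [[sum(b) / len(b)] * len(b) for b in bins]

# Smooth by bin medians: replace every value with the middle value of its bin
by_median = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smooth by bin boundaries: replace every value with the closest bin boundary
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print("bins:         ", bins)           # [[2, 10, 18], [18, 19, 20], [22, 25, 28]]
print("by mean:      ", by_mean)        # [[10.0, 10.0, 10.0], [19.0, 19.0, 19.0], [25.0, 25.0, 25.0]]
print("by median:    ", by_median)      # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print("by boundaries:", by_boundaries)  # [[2, 2, 18], [18, 18, 20], [22, 22, 28]]
```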
Regression
• Regression is a data mining function that predicts a
number. Age, weight, distance, temperature, income, or
sales could all be predicted using regression techniques. For
example, a regression model could be used to predict
children's height, given their age, weight, and other factors.
• A regression task begins with a data set in which the target
values are known. For example, a regression model that
predicts children's height could be developed based on
observed data for many children over a period of time. The
data might track age, height, weight, developmental
milestones, family history, and so on. Height would be the
target, the other attributes would be the predictors, and
the data for each child would constitute a case.
linear regression
• The simplest form of regression to visualize is
linear regression with a single predictor.
• A linear regression technique can be used if
the relationship between x and y can be
approximated with a straight line.
In a linear regression scenario with a single predictor (y = θ2·x + θ1), the regression parameters (also called coefficients) are:
• The slope of the line (θ2) — how much y changes for a unit change in x, i.e., the steepness of the regression line, and
• The y-intercept (θ1) — the value of y at the point where the line crosses the y axis (x = 0).
• In the model build (training) process, a
regression algorithm estimates the value of
the target as a function of the predictors for
each case in the build data. These
relationships between predictors and target
are summarized in a model, which can then be
applied to a different data set in which the
target values are unknown.
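A small sketch of this build-then-apply process for a single predictor, using NumPy's least-squares fit (the age/height numbers are hypothetical):

```python
import numpy as np

# Hypothetical training data: x = age (years), y = height (cm), target values known
x = np.array([4, 6, 8, 10, 12, 14])
y = np.array([102, 115, 128, 139, 149, 160])

# Fit y = theta2 * x + theta1 by least squares
theta2, theta1 = np.polyfit(x, y, deg=1)   # slope, then intercept
print(f"slope = {theta2:.2f}, intercept = {theta1:.2f}")

# Apply the model to new cases where the target is unknown
new_ages = np.array([5, 9, 13])
print(theta2 * new_ages + theta1)
```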
Application
• trend analysis
• business planning
• marketing
• financial forecasting
• time series prediction
• biomedical and drug response modeling
• environmental modeling.
What is cluster
• 1. A cluster is a subset of objects which are
“similar”
• 2. A subset of objects such that the distance
between any two objects in the cluster is less
than the distance between any object in the
cluster and any object not located inside it.
• 3. A connected region of a multidimensional
space containing a relatively high density of
objects
• Clustering is a process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called
clusters.
• Help users understand the natural grouping or
structure in a data set.
• Clustering: unsupervised classification: no predefined
classes.
• Used either as a stand-alone tool to get insight into
data distribution or as a preprocessing step for other
algorithms.
• Moreover, data compression, outliers detection,
understand human concept formation.
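One simple way to use clustering for outlier detection is to treat values that fall into very small clusters as suspicious. A minimal sketch with scikit-learn's KMeans (assuming scikit-learn is available; the salary values and the "small cluster" rule are hypothetical choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D salary data (in $1000s) with two suspicious values
salaries = np.array([30, 32, 35, 31, 33, 34, 29, 250, 260]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(salaries)
labels, counts = np.unique(kmeans.labels_, return_counts=True)

# Values falling into very small clusters are flagged as possible outliers
small_clusters = labels[counts <= 2]
outliers = salaries.ravel()[np.isin(kmeans.labels_, small_clusters)]
print(outliers)   # expected: [250 260]
```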
Application Area
• Economic Science (especially market research)
• WWW: document classification; clustering weblog data to discover groups of similar access patterns
• Pattern Recognition
• Spatial Data Analysis: create thematic maps in GIS by clustering feature spaces
• Image Processing
Data Integration
• Motivation
Many databases and sources of data that need to be
integrated to work together
Almost all applications have many sources of data
• Data Integration
Is the process of integrating data from multiple sources
and probably have a single view over all these sources
And answering queries using the combined information
• Integration can be physical or virtual
• Physical: Copying the data to a warehouse
• Virtual: Keep the data only at the sources
HETEROGENEITY PROBLEMS
• Source Type Heterogeneity
Systems storing the data can be different
• Communication Heterogeneity
Some systems have web interface others do not
Some systems allow direct query language others
offer APIs
• Schema Heterogeneity
The structure of the tables storing the data can be
different (even if storing the same data)
• Data Type Heterogeneity
Storing the same data (and values) but with different data
types
E.g., Storing the phone number as String or as Number
• Value Heterogeneity
Same logical values stored in different ways E.g., ‘Prof’,
‘Prof.’, ‘Professor’
• Semantic Heterogeneity
Same values in different sources can mean different things
E.g., Column ‘Title’ in one database means ‘Job Title’ while in another database it means ‘Person Title’
Handling Redundancy in Data
Integration
Redundant data occur often when integrating multiple databases:
– Object identification: The same attribute or object
may have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table
• Redundant attributes may be able to be detected by
correlation analysis
-Careful integration of the data from multiple sources
may help reduce/avoid redundancies and inconsistencies
and improve mining speed and quality
Correlation Analysis (Numerical Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient):
rA,B = Σ(ai − Ā)(bi − B̄) / (n σA σB) = (Σ(ai bi) − n Ā B̄) / (n σA σB)
• where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• rA,B = 0: A and B are uncorrelated (no linear relationship);
• rA,B < 0: A and B are negatively correlated
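A minimal NumPy sketch of the coefficient, using two short hypothetical attributes and the population standard deviation that the formula assumes:

```python
import numpy as np

# Hypothetical attributes A and B measured over the same n tuples
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.9, 6.2, 7.8, 10.1])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / (n * A.std() * B.std())
print(round(r, 4))                                 # close to +1: strong positive correlation
print(round(float(np.corrcoef(A, B)[0, 1]), 4))    # same value from NumPy's built-in
```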
Correlation Analysis (Categorical Data)
• χ² (chi-square) test
• The larger the χ² value, the more likely the variables are related.
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
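The χ² statistic sums (observed − expected)² / expected over the cells of a contingency table. A small sketch with SciPy (assuming SciPy is available; the observed counts are hypothetical):

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts:  col_1  col_2
observed = [[250,   200],       # row_1
            [ 50,  1000]]       # row_2

# correction=False gives the plain chi-square statistic (no Yates correction)
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), round(p_value, 6), dof)
print(expected)   # counts expected under independence; compare with the observed cells
```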
Data Transformation
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
• Data transformation can involve the following tasks:
– Smoothing
– Normalization
– Attribute construction
– Aggregation
– Discretization
– Generalization(Concept hierarchy)
• Normalization
– the attribute data are scaled so as to fall within a small
specified range, such as -1.0 to 1.0, 0.0 to 1.0
• Attribute construction (or feature construction)
– new attributes are constructed and added from the given
set of attributes to help the mining process.
• Aggregation
– summary or aggregation operations are applied to the data.
– For example, the daily sales data may be aggregated so as
to compute monthly and annual total amounts.
• Discretization
– Dividing the range of a continuous attribute into intervals
– For example, values for numerical attributes, like age, may
be mapped to higher-level concepts, like youth, middle aged,
and senior.
• Generalization
– where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept hierarchies.
– For example, categorical attributes, like street, can be
generalized to higher-level concepts, like city or country.
Normalization
• An attribute is normalized by scaling its values
so that they fall within a small specified
range, such as 0.0 to 1.0.
• Normalization is particularly useful for
classification algorithms involving
– neural networks
– distance measurements such as nearest-
neighbor classification and clustering.
• Normalization methods
– Min-max normalization
– z-score normalization
– Normalization by decimal scaling
Min-max normalization
• Min-max normalization : performs a linear
transformation on the original data.
– minA and maxA are the minimum and maximum values of an
attribute, A.
Min-max normalization maps a value, v, of A to v' in the range [new_minA, new_maxA] by computing:
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
• Then $73,000 is mapped to ((73,000 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 ≈ 0.709.
Exercise: Apply min-max normalization to marks: 8, 10, 15, 20
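A minimal sketch of the exercise, mapping the marks into [0.0, 1.0]:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

marks = [8, 10, 15, 20]
print([round(v, 3) for v in min_max(marks)])   # [0.0, 0.167, 0.583, 1.0]
```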
z-score normalization
• In z-score (zero-mean) normalization, a value v of attribute A is normalized based on the mean and standard deviation of A: v' = (v − Ā) / σA.
• Note that normalization can change the original data quite a bit, especially the z-score method.
Exercise: Apply z-score normalization to marks: 8, 10, 15, 20
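A minimal sketch of the exercise using v' = (v − Ā) / σA with the population standard deviation:

```python
import statistics

def z_score(values):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

marks = [8, 10, 15, 20]
print([round(v, 3) for v in z_score(marks)])   # approx. [-1.127, -0.698, 0.376, 1.449]
```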
Decimal scaling
• Normalization by decimal scaling moves the decimal point of the values of attribute A: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
• Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
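A minimal sketch of decimal scaling for the values above plus a couple of extra hypothetical ones:

```python
import math

def decimal_scale(values):
    # smallest j such that max(|v|) / 10^j < 1
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 917, 230, -45])
print(j)        # 3
print(scaled)   # [-0.986, 0.917, 0.23, -0.045]
```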
Attribute construction
(feature construction)
– new attributes are constructed from the given
attributes and added in order to help improve the
accuracy and understanding of structure in high-
dimensional data.
- For example, given the two features height and weight, it might be advantageous to construct a feature body mass index (BMI), which is expressed as weight ÷ height².
- Attribute construction can help discover missing information about relationships between attributes.
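A minimal pandas sketch of this BMI construction (hypothetical heights in metres and weights in kilograms):

```python
import pandas as pd

# Hypothetical records with the two given features
people = pd.DataFrame({"height": [1.60, 1.75, 1.82], "weight": [55, 78, 95]})

# Construct the new attribute from the given ones: BMI = weight / height^2
people["bmi"] = people["weight"] / people["height"] ** 2
print(people)
```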
Data aggregation
• Data aggregation is a type of data and information
mining process where data is searched, gathered and
presented in a report-based, summarized format to
achieve specific business objectives or processes
and/or conduct human analysis.
• Data aggregation is any process in which information is
gathered and expressed in a summary form, for
purposes such as statistical analysis. A common
aggregation purpose is to get more information about
particular groups based on specific variables such as
age, profession, or income.
• Data aggregation may be performed manually or
through specialized software.
• Data aggregation is a component of business intelligence (BI) solutions. Data aggregation personnel or software search databases, find relevant data, and present the findings in a summarized format that is meaningful and useful for the end user or application.
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to precomputed, summarized data
thereby benefiting on-line analytical processing as well as
datamining.
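A minimal pandas sketch of the daily-to-monthly aggregation mentioned earlier (the sales figures and branch names are hypothetical):

```python
import pandas as pd

# Hypothetical daily sales data
daily = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-28"]),
    "branch": ["A", "A", "A", "B"],
    "sales":  [120.0, 80.0, 95.0, 60.0],
})

# Aggregate the daily figures into monthly totals per branch
monthly = daily.groupby([daily["date"].dt.to_period("M"), "branch"])["sales"].sum()
print(monthly)
```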
Data Reduction
• Why data reduction?
A database/data warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run when processing the complete data set.
• Data reduction : Obtain a reduced
representation of the data set that is much
smaller in volume but yet produce the same
(or almost the same) analytical results
How?
• Reducing number of attributes
• Reducing number of attribute values
• Reducing number of tuples
Data Reduction
• Data reduction
• a) Data cube aggregation
• b) Attribute subset selection
• c) Dimensional reduction .
• d) Data Sampling.
• e) Numerosity reduction
• f) Discretization and concept hierarchy
generation
attribute subset selection
• Why attribute subset selection
• Data sets for analysis may contain hundreds of
attributes, many of which may be irrelevant to
the mining task or redundant.
For example,
• if the task is to classify customers as to whether
or not they are likely to purchase a popular new
CD at AllElectronics when notified of a sale,
attributes such as the customer’s telephone
number are likely to be irrelevant, unlike
attributes such as age or music_taste.
• Using a domain expert to pick out some of the useful attributes can be a difficult and time-consuming task, especially when the behavior of the data is not well known.
• Leaving out relevant attributes or keeping irrelevant attributes may result in discovered patterns of poor quality.
• In addition, the added volume of irrelevant or
redundant attributes can slow down the mining
process.
Attribute subset selection (feature selection):
• Reduce the data set size by removing irrelevant or redundant attributes
Goal:
• select a minimum set of features (attributes) such
that the probability distribution of different
classes given the values for those features is as
close as possible to the original distribution given
the values of all features
• It reduces the number of attributes appearing in
the discovered patterns, helping to make the
patterns easier to understand.
How can we find a ‘good’ subset of the
original attributes?
For n attributes, there are 2^n possible subsets.
• An exhaustive search for the optimal subset of
attributes can be prohibitively expensive, especially as
n increase.
• Heuristic methods that explore a reduced search space
are commonly used for attribute subset selection.
• These methods are typically greedy in that, while
searching through attribute space, they always make
what looks to be the best choice at the time.
• Such greedy methods are effective in practice and may
come close to estimating an optimal solution.
Heuristic methods
• 1 Step-wise forward selection
• 2 Step-wise backward elimination
• 3 Combining forward selection and backward
elimination
• 4 Decision-tree induction
The “best” (and “worst”) attributes are typically determined using:
– tests of statistical significance, which assume that the attributes are independent of one another
– the information gain measure used in building decision trees for classification
Stepwise forward selection
• – The procedure starts with an empty set of attributes as the
reduced set.
• – First: The best single-feature is picked.
• – Next: At each subsequent iteration or step, the best of the
remaining original attributes is added to the set.
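A minimal sketch of step-wise forward selection, using cross-validated accuracy of a decision tree as the “best attribute” criterion (one possible choice; the slides mention statistical significance tests or information gain). It assumes scikit-learn and its bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))   # candidate attribute indices
selected = []                         # the reduced set, initially empty

def score(cols):
    # cross-validated accuracy using only the chosen columns
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

best_score = 0.0
while remaining:
    # pick the best of the remaining original attributes
    candidate = max(remaining, key=lambda c: score(selected + [c]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best_score:   # stop when no attribute improves the score
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = candidate_score

print("selected attribute indices:", selected)
```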
Stepwise backward elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
Combining forward selection and
backward elimination:
• – The stepwise forward selection and
backward elimination methods can be
combined
• – At each step, the procedure selects the best
attribute and removes the worst from among
the remaining attributes.
Decision tree induction:
– Decision tree algorithms, such as ID3, C4.5, and CART, were
originally intended for classification.
• Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a
test on an attribute, each branch corresponds to an
outcome of the test and each external (leaf) node denotes
a class prediction.
• At each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.
• When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
• All attributes that do not appear in the tree are assumed to
be irrelevant.
• Decision tree induction
Data Compression
• String compression: There are extensive theories
and well‐tuned algorithms
– Typically lossless
– But only limited manipulation is possible without
expansion
• Audio/video compression Typically lossy
compression, with progressive refinement
– Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
• Time sequences are not audio: typically short and vary slowly with time
Discretization
Three types of attributes:
• Nominal —values from an unordered set, e.g., color, profession
• Ordinal —values from an ordered set, e.g., military or academic
rank
• Continuous — numeric values, e.g., integer or real numbers

Discretization:
• Divide the range of a continuous attribute into intervals
• Some classification algorithms only accept categorical attributes.
• Reduce data size by discretization
• Prepare for further analysis
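A minimal pandas sketch of discretization: equal-width intervals, equal-frequency intervals, and mapping age to higher-level concepts (the age values and interval edges are hypothetical):

```python
import pandas as pd

ages = pd.Series([13, 22, 25, 33, 41, 47, 52, 61, 70])

# Equal-width intervals
equal_width = pd.cut(ages, bins=3)

# Equal-frequency intervals
equal_depth = pd.qcut(ages, q=3)

# Map the continuous attribute to higher-level concepts
concepts = pd.cut(ages, bins=[0, 29, 59, 120], labels=["youth", "middle_aged", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_depth": equal_depth, "concept": concepts}))
```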
Discretization
Typical methods:
1 Binning
2 Clustering analysis
3 Interval merging by χ² analysis
Architecture of a typical data mining system
Components (top to bottom in the figure):
• Graphical user interface
• Pattern evaluation
• Knowledge base
• Data mining engine
• Database or data warehouse server
• Data cleansing, data integration, and filtering
• Database / data warehouse
Data Mining Task Primitives
• Misconception: Data mining systems can
autonomously dig out all of the valuable
knowledge from a given large database, without
human intervention.

• If there were no user intervention, the system would uncover a large set of patterns that may even surpass the size of the database. Hence, user interaction is required.
• This user communication with the system is provided by using a set of data mining primitives.
Data Mining Task Primitives
• Each user will have a data mining task in mind, that is,
some form of data analysis that he or she would like to
have performed.
• A data mining task can be specified in the form of a
data mining query, which is input to the data mining
system. A data mining query is defined in terms of data
mining task primitives
• These primitives allow the user to interactively
communicate with the data mining system during
discovery in order to direct the mining process, or
examine the findings from different angles or depths.
Data Mining Task Primitives
Data mining primitives:
• Task-relevant data
• Kind of knowledge to be mined
• Background knowledge
• Interestingness measures
• Presentation and visualization of discovered patterns
Task Relevant Data
• This specifies the portions of the database or the set of data in which the user is interested.
• This includes:
o Database or data warehouse name
o Database tables or data warehouse cubes
o Conditions for data selection
o Relevant attributes or dimensions
o Data grouping criteria
Example:
• If a data mining task is to study associations between items
frequently purchased at AllElectronics by customers in Canada, the
task relevant data can be specified by providing the following
information:
– Name of the database or data warehouse to be used (e.g.,
AllElectronics_db)
– Names of the tables or data cubes containing relevant data (e.g.,
item, customer, purchases and items_sold)
– Conditions for selecting the relevant data (e.g., retrieve data
pertaining to purchases made in Canada for the current year)
– The relevant attributes or dimensions (e.g., name and price
from the item table and income and age from the customer table)
Kind of knowledge to be mined
• It is important to specify the knowledge to be mined, as this
determines the data mining function to be performed.
• Users can also provide pattern templates, also called metapatterns, metarules, or metaqueries.
• The data mining functions to be performed:
o Characterization
o Discrimination
o Association
o Classification/prediction
o Clustering
o Outlier analysis
o Other data mining tasks
Background knowledge
• This knowledge about the domain to be mined is
useful for guiding the knowledge discovery
process and for evaluating the patterns found
• Concept hierarchy: is a powerful form of
background knowledge.
o Four major types of concept hierarchies:
schema hierarchies
set-grouping hierarchies
operation-derived hierarchies
rule-based hierarchies
Schema hierarchies
• Schema hierarchy is the total or partial
order among attributes in the database
schema.
• May formally express existing semantic
relationships between attributes.
• Example: location hierarchy
street < city < province/state < country
Set-grouping hierarchies
• Organizes values for a given attribute into groups, sets, or ranges of values.
• A total or partial order can be defined among groups.
• Used to refine schema-defined hierarchies.
• Typically used for small sets of object relationships.
• Example: Set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20…29} ⊂ young
{40…59} ⊂ middle_aged
{60…89} ⊂ senior
Operation-derived hierarchies
• Operation-derived:
based on operations specified
operations may include
decoding of information-encoded strings
information extraction from complex data
objects
data clustering
Example: URL or email address
[email protected] gives login name < dept. < univ. < country
Rule-based hierarchies
• Rule-based:
Occurs when either whole or portion of a concept hierarchy is
defined as a set of rules and is evaluated dynamically based on
current database data and rule definition

• Example: The following rules are used to categorize items as low_profit_margin, medium_profit_margin, and high_profit_margin:
low_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) < 50)
medium_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) ≥ 50) ^ ((P1-P2) ≤ 250)
high_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) > 250)
Interestingness measure
• Used to confine the number of uninteresting patterns returned by the process.
• Based on the structure of patterns and the statistics underlying them.
• Patterns not meeting the threshold are not presented to the user.
• Objective measures of pattern interestingness:
simplicity
certainty (confidence)
utility (support)
novelty
Simplicity
• A pattern's interestingness is based on its overall simplicity for human comprehension.
• Example: Rule length is a simplicity measure; e.g., rule 1> A, B, C => D is longer (less simple) than rule 2> A => B, C or A => B.
Support
• Utility (support): usefulness of a pattern.
support(A => B) = (# tuples containing both A and B) / (total # of tuples)
• A support of 30% for the rule means that 30% of all customers purchased both a computer and software.
• Every association rule has a support and a confidence.
• “The support is the percentage of transactions that demonstrate the rule.”
• Example: Database with transactions (customer_# : item_a1, item_a2, …)
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
support {8, 12} = 2 (or 50% ~ 2 of 4 customers)
support {1, 5} = 1 (or 25% ~ 1 of 4 customers)
support {1} = 3 (or 75% ~ 3 of 4 customers)
• An itemset is called frequent if its support is equal to or greater than an agreed-upon minimal value – the support threshold.
Confidence
• Certainty (confidence): assesses the validity or trustworthiness of a pattern; confidence is a certainty measure.
• Confidence is an indication of how often the rule has been found to be true.
confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
• A confidence of 85% for the rule buys(X, “computer”) => buys(X, “software”) means that 85% of all customers who purchased a computer also bought software.
• Confidence(X => Y) = support(X, Y) / support(X)
• Association rules that satisfy both the
minimum confidence and support threshold
are referred to as strong association rules
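A minimal sketch computing support and confidence for the four example transactions shown in the Support slide above:

```python
# The four transactions from the support example (item sets per customer)
transactions = [
    {1, 3, 5},
    {1, 8, 14, 17, 12},
    {4, 6, 8, 12, 9, 104},
    {2, 1, 8},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # confidence(lhs => rhs) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({8, 12}))                  # 0.5  (2 of 4 transactions)
print(support({1}))                      # 0.75 (3 of 4 transactions)
print(round(confidence({8}, {12}), 3))   # 0.5 / 0.75 ≈ 0.667
```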
Novelty
• Novelty
Patterns contributing new information to
the given pattern set are called novel
patterns
removing redundant patterns is a strategy
for detecting novelty.
Presentation and visualization
• For data mining to be effective, data mining
systems should be able to display the
discovered patterns in multiple forms, such
as rules, tables, crosstabs (cross-tabulations),
pie or bar charts, decision trees, cubes, or
other visual representations.

• Users must be able to specify the forms of presentation to be used for displaying the discovered patterns.
