
Unit-4

Data Pre-processing
Outline
 Why preprocess data?
 Mean, median, mode & range
 Attribute types
 Data preprocessing tasks
• Data cleaning
• Data integration
• Data transformation
• Data reduction
 Data mining task primitives
Why preprocess data?
 Real-world data are generally “dirty”:
• Incomplete: Missing attribute values, lack of certain attributes of interest,
or containing only aggregate data.
o E.g. Occupation=“ ”
• Noisy: Containing errors or outliers.
o E.g. Salary=“abcxy”
• Inconsistent: Containing discrepancies in codes or names.
o E.g. “Gujarat” & “Gujrat” (common spelling mistakes)
Why is data preprocessing important?
“No quality data, no quality results”
 This is the principle of Garbage In, Garbage Out (GIGO).
 Quality decisions must be based on quality data.
 Duplicate or missing data may cause incorrect or even misleading statistics.
 Data preparation, cleaning and transformation make up the majority of the work in data mining (often as much as 90% of the effort).
 Data preprocessing prepares raw data for further processing.
Mean
x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
 The mean is the average of a dataset.
 To find the mean, calculate the sum of all the data values and then divide by the number of values.
 Example
✔ Find the mean of 12, 15, 11, 11, 7, 13
First, find the sum of the data:
12 + 15 + 11 + 11 + 7 + 13 = 69
Then divide by the number of values:
69 / 6 = 11.5 → Mean
Median
 The median is the middle number in a dataset when the data are arranged in numerical (sorted) order.
• If the count is odd, the middle number is the median.
• If the count is even, the median is the average of the two middle numbers.
Median - Odd (Cont..)
 Example
 Find the median of 12, 15, 11, 11, 7, 13, 15
In this example, the count of data is 7 (odd).
First, arrange the data in ascending order:
7, 11, 11, 12, 13, 15, 15
Partition the data into two equal halves around the middle value:
7, 11, 11, [12], 13, 15, 15
12 → Median
Median - Even (Cont..)
 Example
 Find the median of 12, 15, 11, 11, 7, 13
In this example, the count of data is 6 (even).
First, arrange the data in ascending order:
7, 11, 11, 12, 13, 15
Calculate the average of the two numbers in the middle:
7, 11, [11, 12], 13, 15
(11 + 12) / 2 = 11.5 → Median
Mode
 The mode is the number that occurs most often within a set of
numbers.
 Example
✔ Find the mode of 12, 15, 11, 11, 7, 13
11 → Mode (unimodal)
✔ Find the mode of 12, 15, 11, 11, 7, 12, 13
11, 12 → Mode (bimodal)
Mode (Cont..)
 Example
✔ Find the mode of 12, 12, 15, 11, 11, 7, 13, 7
7, 11, 12 → Mode (trimodal)
✔ Find the mode of 12, 15, 11, 10, 7, 14, 13
No mode (no value repeats)
Range
 The range of a set of data is the difference between the largest
and the smallest number in the set.
 Example
 Find the range for the data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
First, arrange the data in ascending order:
26, 30, 34, 40, 40, 42, 43, 47, 48, 50, 50, 55
 The largest number is 55 and the smallest number is 26, so subtract:
55 – 26 = 29 → Range
Standard deviation
 The standard deviation is a measure of how spread out the data are.
 Its symbol is σ (the Greek letter sigma).
 The standard deviation is the square root of the variance: σ = √(variance)


Standard deviation (Cont..)
 The variance is defined as the average of the squared differences from the mean.
 To calculate the variance, follow these steps:
1. Calculate the mean, x̄.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences and add up this column.
4. Divide by n − 1, where n is the number of items in the sample; this gives the sample variance. (For the population variance, divide by n.)
5. To get the standard deviation, take the square root of the variance.
Standard deviation - example
 The owner of an Indian restaurant is interested in how much people spend at the restaurant.
 He examines 10 randomly selected receipts for parties and writes down the following data:
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
1. Find the mean (1st step):
 Mean = 492 / 10 = 49.2
2. Write a table that subtracts the mean from each observed value (2nd step).
Standard deviation – example (Cont..)
Step 3: square each difference and add up the column.

X      X − Mean    (X − Mean)²
44     −5.2          27.04
50      0.8           0.64
38    −11.2         125.44
96     46.8        2190.24
42     −7.2          51.84
47     −2.2           4.84
40     −9.2          84.64
39    −10.2         104.04
46     −3.2          10.24
50      0.8           0.64
Total              2599.60

Step 4: divide the total by n − 1 = 9 → S² = 2599.6 / 9 ≈ 288.8 ≈ 289
Step 5: take the square root → S = √289 = 17
Standard deviation – example (Cont..)
 The standard deviation measures how far the data values lie from the mean: take the mean and move one standard deviation in either direction.
 The mean for this example is 49.2 and the standard deviation is 17.
 Now, 49.2 − 17 = 32.2 and 49.2 + 17 = 66.2.
 This means that most customers probably spend between 32.2 and 66.2.
 If all data values are the same, the variance and standard deviation are 0 (zero).
Example (Try it)
 Calculate Mean, Median, Mode, Range, Variance &
Standard deviation .
13, 18, 13, 14, 13, 16, 14, 21, 13
 Mean is 15.
 Median is 14.
 Mode is 13 (unimodal).
 Range is 8.
 Variance is 8 (dividing by n − 1; ≈ 7.1 if dividing by n).
 Standard deviation is ≈ 2.83 (≈ 2.67 if dividing by n).
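The summary statistics above can be checked with a short sketch using only Python's standard library (a minimal illustration; the variable names are ours, and statistics.multimode requires Python 3.8+):

```python
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

mean = statistics.mean(data)            # 15
median = statistics.median(data)        # 14
modes = statistics.multimode(data)      # [13]
value_range = max(data) - min(data)     # 21 - 13 = 8
variance = statistics.variance(data)    # sample variance (n - 1) = 8
std_dev = statistics.stdev(data)        # sqrt(8) ≈ 2.83

print(mean, median, modes, value_range, variance, std_dev)
```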
Attribute Types
 An attribute is a property of the object.
 It also represents different features of the object.
 E.g. Person  Name, Age, Qualification etc.
 Attribute types can be divided into four categories.
1. Nominal
2. Ordinal
3. Interval
4. Ratio
1) Nominal Attribute
 Nominal attributes are named attributes which can be separated into discrete (individual) categories which do not overlap.
 Nominal attribute values are also called distinct values.
 Example: eye colour (black, brown, blue), marital status (single, married), occupation
2) Ordinal Attribute
 For an ordinal attribute, the order of the values is important and significant, but the differences between the values are not really known.
 Example
 Rankings → 1st, 2nd, 3rd
 Ratings → star ratings (e.g. 1 to 5 stars)
 We know that a 5-star rating is better than a 2-star or 3-star rating, but we cannot quantify how much better it is.
3) Interval Attribute
 Interval attribute comes in the form of a numerical value where the difference between points is meaningful.
 Example
 Temperature  10°-20°, 30°-50°, 35°-45°
 Calendar Dates  15th – 22nd, 10th – 30th
 Interval attributes do not have a true (absolute) zero value.
4) Ratio Attribute
 A ratio attribute looks like an interval attribute, but it must have a true (absolute) zero value.
 It tells us about the order and the exact value between units or
data.
 Example
 Age Group  10-20, 30-50, 35-45 (In years)
 Mass  20-30 kg, 10-15 kg
 Because it has a true (absolute) zero, it is possible to compute ratios.
Data Preprocessing
 Data have quality if they satisfy the requirements of the intended
use.
 There are many factors comprising data quality, including
accuracy, completeness, consistency, timeliness, believability, and
interpretability.
 The data you wish to analyze by data mining techniques are
✔ incomplete (lacking attribute values or certain attributes of interest, or
containing only aggregate data);
✔ inaccurate or noisy (containing errors, or values that deviate from the
expected); and
✔ inconsistent (e.g., containing discrepancies in the department codes used
to categorize items)
Data Preprocessing
 The elements defining data quality:
✔ accuracy,
✔ completeness,
✔ consistency.
 Inaccurate, incomplete, and inconsistent data are commonplace
properties of large real-world databases and data warehouses.
 Reasons for inaccurate data (i.e., having incorrect attribute
values):
✔ The data collection instruments used may be faulty.
✔ There may have been human or computer errors occurring at data entry.
✔ Users may purposely submit incorrect data values for mandatory fields
when they do not wish to submit personal information (e.g., by choosing
the default value “January 1” displayed for birthday). This is known as
disguised missing data.
✔ Errors in data transmission can also occur.
✔ There may be technology limitations such as limited buffer size for
coordinating synchronized data transfer and consumption.
✔ Incorrect data may also result from inconsistencies in naming conventions
or data codes, or inconsistent formats for input fields (e.g., date).
✔ Duplicate tuples also require data cleaning.
 Incomplete data can occur for a number of reasons.
✔ Attributes of interest may not always be available
✔ Data may not be included simply because they were not considered
important at the time of entry.
✔ Relevant data may not be recorded due to a misunderstanding or because
of equipment malfunctions.
Data Preprocessing Tasks
 The major data preprocessing tasks are:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
1) Data Cleaning
 Real-world data tend to be incomplete, noisy, and inconsistent.
 Data cleaning (or data cleansing) routines attempt to fill in
missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
1) Fill missing values
 Ignore the tuple (record/row):
• Usually done when class label is missing.
• This method is not very effective, unless the tuple contains several
attributes with missing values.
• It is especially poor when the percentage of missing values per attribute
varies considerably
 Fill missing value manually:
• This approach is time consuming and may not be feasible given a large
data set with many missing values.
 Use a global constant to fill in the missing value:
• Replace all missing attribute values by the same constant such as a label like
“Unknown” or -∞.
1) Fill missing values (Cont..)
 Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value:
• Use the mean or median of the attribute to fill in its missing values.
 Use the attribute mean or median for all samples belonging to
the same class as the given tuple
 Use the most probable value to fill in the missing value:
• This may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
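As a rough illustration of these strategies (a sketch assuming pandas is available; the DataFrame, column names and values are hypothetical, not from the slides):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", "engineer"],
    "salary": [52000.0, 48000.0, np.nan, 61000.0],
})

# 1. Ignore the tuple: drop rows that contain missing values
dropped = df.dropna()

# 2. Use a global constant to fill in the missing value
filled_constant = df.fillna({"occupation": "Unknown"})

# 3. Use a measure of central tendency (mean or median) of the attribute
filled_mean = df.assign(salary=df["salary"].fillna(df["salary"].mean()))
filled_median = df.assign(salary=df["salary"].fillna(df["salary"].median()))

print(filled_mean)
```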
2) Noisy Data
 Noise is a random error or variance in a measured variable.
1. Binning method
2. Regression
3. Outlier analysis
1) Binning method
 Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
 The sorted values are distributed into a number of “buckets,” or
bins.
 Because binning methods consult the neighborhood of values,
they perform local smoothing
Binning methods for data smoothing include smoothing by bin means, smoothing by bin medians, and smoothing by bin boundaries.
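A minimal sketch of equal-frequency binning with smoothing by bin means (plain Python; the data values and bin size are illustrative assumptions, not taken from the slides):

```python
# Equal-frequency binning with smoothing by bin means (illustrative sketch)
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3  # three values per bin

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    # every value in the bin is replaced by the mean of its bin
    smoothed.extend([bin_mean] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```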
2) Regression:
 Data smoothing can also be done by regression, a technique that
conforms data values to a function.
 Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to
predict the other.
 Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit
to a multidimensional surface.
3) Outlier analysis:
 Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.”
 Values that fall outside of the set of clusters may be considered
outliers.
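As a hedged sketch of this idea (scikit-learn's KMeans is assumed to be installed; the data, number of clusters, and "small cluster" rule are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 1-D data in which 96 lies far from the other values
values = np.array([44, 50, 38, 96, 42, 47, 40, 39, 46, 50], dtype=float).reshape(-1, 1)

# Group similar values into clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
labels = kmeans.labels_

# Values that end up in very small clusters fall "outside" the main clusters
cluster_sizes = np.bincount(labels)
outliers = values.ravel()[cluster_sizes[labels] <= 1]
print(outliers)  # expected: [96.]
```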
Data Integration
 Data mining often requires data integration—the merging of data
from multiple data stores.
 Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
 This can help improve the accuracy and speed of the subsequent
data mining process.
Entity Identification Problem
 A data analysis task will often involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing.
 These sources may include multiple databases, data cubes, or flat
files.
 There are a number of issues to consider during data integration.
 Schema integration and object matching can be tricky.
 How can equivalent real-world entities from multiple sources be
matched up? This is referred to as the entity identification
problem.
Redundancy and Correlation Analysis
 Redundancy is another important issue in data integration.
 An attribute (such as annual revenue, for instance) may be redundant
if it can be “derived” from another attribute or set of attributes.
 Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
 Some redundancies can be detected by correlation analysis.
 Given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data.
 For nominal data, we use the Χ2 (chi-square) test.
 For numeric attributes, we use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another.
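For numeric attributes, a quick sketch with NumPy (the two attribute vectors are illustrative):

```python
import numpy as np

# Two illustrative numeric attributes measured on the same tuples
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.5, 3.9, 6.1, 7.8, 10.2])

correlation = np.corrcoef(a, b)[0, 1]  # Pearson correlation coefficient (near +1 here)
covariance = np.cov(a, b)[0, 1]        # sample covariance

print(correlation, covariance)
```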
Χ2 Correlation Test for Nominal Data
 For nominal data, a correlation relationship between two
attributes, A and B, can be discovered by a X2 (chi-square) test.
 Suppose A has c distinct values, namely a1,a2,….,ac. B has r
distinct values, namely b1,b2,…,br.
 The data tuples described by A and B can be shown as a
contingency table, with the c values of A making up the columns
and the r values of B making up the rows.
 Let (Ai ,Bj) denote the joint event that attribute A takes on value ai
and attribute B takes on value bj.
 The χ² value is computed as
χ² = Σᵢ₌₁ᶜ Σⱼ₌₁ʳ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ
 where oᵢⱼ is the observed frequency (actual count) of the joint event (Aᵢ, Bⱼ), eᵢⱼ is the expected frequency eᵢⱼ = (count(A = aᵢ) × count(B = bⱼ)) / n, and n is the number of data tuples.
 Suppose that a group of 1500 people was surveyed.
 The gender of each person was noted. Each person was polled as
to whether his or her preferred type of reading material was
fiction or nonfiction.
 Thus, we have two attributes, gender and preferred reading. The
observed frequency (or count) of each possible joint event is
summarized in the contingency table where the numbers in
parentheses are the expected frequencies.
 For example, the expected frequency for the cell (male, fiction) is count(male) × count(fiction) / 1500.
 For this 2×2 table, the degrees of freedom are (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.
 For 1 degree of freedom, the X2 value needed to reject the
hypothesis at the 0.001 significance level is 10.828.
 Since our computed value is above this, we can reject the
hypothesis that gender and preferred reading are independent
and conclude that the two attributes are (strongly) correlated for
the given group of people.
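A sketch of the same kind of test with SciPy (the observed 2×2 counts below are assumed for illustration, since the slides' contingency table is not reproduced here):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed observed counts for a gender vs. preferred-reading survey of 1500 people
# rows: fiction, non-fiction; columns: male, female (illustrative numbers)
observed = np.array([
    [250,  200],
    [ 50, 1000],
])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p_value)
# If chi2 exceeds the critical value (10.828 at the 0.001 level for 1 degree of
# freedom), we reject the hypothesis that the two attributes are independent.
```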
(Critical values such as 10.828 come from a chi-square distribution table.)
Tuple Duplication
 To detect redundancies between attributes, duplication should
also be detected at the tuple level (e.g., where there are two or
more identical tuples for a given unique data entry case)
 For example, if a purchase order database contains attributes for
the purchaser’s name and address instead of a key to this
information in a purchaser database, discrepancies can occur, such
as the same purchaser’s name appearing with different addresses
within the purchase order database.
Data Value Conflict Detection and Resolution
 Data integration also involves the detection and resolution of data
value conflicts.
 For example, for the same real-world entity, attribute values from
different sources may differ.
 This may be due to differences in representation, scaling, or
encoding.
Data Reduction
 Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
 That is, mining on the reduced data set should be more efficient
yet produce the same (or almost the same) analytical results.
Overview of Data Reduction Strategies
 Data reduction strategies include
✔ Dimensionality reduction,
✔ Numerosity reduction, and
✔ Data compression.
Dimensionality reduction
 Dimensionality reduction is the process of reducing the number of
random variables or attributes under consideration.
 Dimensionality reduction methods include wavelet transforms and
principal components analysis, which transform or project the
original data onto a smaller space.
 Attribute subset selection is a method of dimensionality reduction
in which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed.
Numerosity reduction
 Numerosity reduction techniques replace the original data
volume by alternative, smaller forms of data representation.
 These techniques may be parametric or nonparametric.
 For parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead
of the actual data.
 Regression and log-linear models are examples.
 Nonparametric methods for storing reduced representations of
the data include histograms, clustering, sampling, and data cube
aggregation.
Data Compression
 In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data.
 If the original data can be reconstructed from the compressed
data without any information loss, the data reduction is called
lossless.
 If, instead, we can reconstruct only an approximation of the
original data, then the data reduction is called lossy.
Data Transformation
 Data transformation is the process of converting data from one
form to another form.
 Data often reside in different locations and in different formats.
 Data transformation is necessary to ensure that data from one application or database are understandable to other applications and databases.
Data Transformation (Cont..)
 Data transformation strategies includes the following:
1. Smoothing
2. Attribute construction
3. Aggregation
4. Normalization
5. Discretization
6. Concept hierarchy generation for nominal data
Data Transformation (Cont..)
1. Smoothing
• It works to remove noise from the data.
• It is a form of data cleaning where users specify transformations to correct
data inconsistencies.
• Such techniques include binning, regression and clustering.
2. Attribute construction
• New attributes are constructed and added from the given set of attributes to help the mining process.
3. Aggregation
• In this, summary or aggregation operations are applied to the data.
• E.g. daily sales data are aggregated so that the sales manager can compute monthly and annual totals.
Data Transformation (Cont..)
4. Normalization
• Normalization is scaling technique or a mapping technique.
• With normalization, we can find new range from an existing range.
• There are three common techniques for normalization: min-max normalization, z-score normalization, and decimal scaling. Two of them are covered below.
1. Min-Max Normalization
o A simple normalization technique in which we fit the given data into a pre-defined boundary, typically the interval [0, 1].
2. Decimal scaling
o In this technique we move the decimal point of the attribute's values.
1) Min-max normalization
 Min-max is a technique that normalizes the data.
 It rescales the data to a new range, typically between 0 and 1.
 Example
Age
16
20
30
40
1) Min-max normalization (Cont..)
 Min : Minimum value = 16
 Max : Maximum value = 40
 V = Respective value of attributes. In our example V1= 16, V2=20,
V3=30 & V4=40.
 NewMax = 1
 NewMin = 0
Formula: v′ = ((v − min) / (max − min)) × (newMax − newMin) + newMin
1) Min-max normalization (Cont..)
Formula: v′ = ((v − min) / (max − min)) × (newMax − newMin) + newMin

For Age 16:
v′ = (16 − 16)/(40 − 16) × (1 − 0) + 0 = 0/24 × 1 = 0

For Age 20:
v′ = (20 − 16)/(40 − 16) × (1 − 0) + 0 = 4/24 × 1 ≈ 0.17
1) Min-max normalization (Cont..)
For Age 30:
v′ = (30 − 16)/(40 − 16) × (1 − 0) + 0 = 14/24 × 1 ≈ 0.58

For Age 40:
v′ = (40 − 16)/(40 − 16) × (1 − 0) + 0 = 24/24 × 1 = 1

Age    After min-max normalization
16     0
20     0.17
30     0.58
40     1
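A small sketch of min-max normalization applied to the Age example above (plain Python):

```python
# Min-max normalization of the Age values from the example above
ages = [16, 20, 30, 40]
new_min, new_max = 0.0, 1.0

old_min, old_max = min(ages), max(ages)
normalized = [
    (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    for v in ages
]
print([round(v, 2) for v in normalized])  # [0.0, 0.17, 0.58, 1.0]
```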
2) Decimal scaling
 In this technique we move the decimal point of values of the
attribute.
 This movement of decimal points totally depends on the
maximum value among all values in the attribute.
 A value vᵢ of attribute A can be normalized by the following formula:
v′ᵢ = vᵢ / 10ʲ, where j is the smallest integer such that max(|v′ᵢ|) < 1.
Decimal scaling - Example
CGPA    Formula    After decimal scaling
2       2 / 10     0.2
3       3 / 10     0.3

 Check the maximum value of the attribute CGPA: it is 3.
 Why divide by 10? Count the number of digits in the maximum value and write 1 followed by that many zeros.
 Here the maximum value 3 has only one digit, so we divide by 10 (1 followed by one zero).
Decimal scaling (Try it!)
Bonus Formula After Decimal Scaling
400 400/1000 0.4
310 310/1000 0.31

Salary Formula After Decimal Scaling


40,000 40000/100000 0.4
31,000 31000/100000 0.31
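A minimal sketch of decimal scaling covering the examples above (plain Python; the helper function name is ours):

```python
def decimal_scale(values):
    """Divide every value by 10^j, where j is the number of digits
    in the largest absolute value, so all results fall below 1."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(decimal_scale([2, 3]))           # [0.2, 0.3]
print(decimal_scale([400, 310]))       # [0.4, 0.31]
print(decimal_scale([40000, 31000]))   # [0.4, 0.31]
```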
Data Transformation (Cont..)
5. Discretization
• Discretization techniques can be categorized based on how the separation
is performed, such as whether it uses class information or which direction
it proceeds (top-down or bottom-up).
• The raw values of a numeric attribute (e.g. age) are replaced by interval labels (e.g. 0-10, 11-20, etc.) or conceptual labels (e.g. youth, adult, senior); a small sketch follows this list.
6. Concept hierarchy generation for nominal data
• In this, attributes such as address can be generalized to higher-level
concepts, like street or city or state or country.
• Many hierarchies for nominal attributes are implicit within the database
schema.
• E.g. city, country or state table in RDBMS.
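As referenced under discretization above, a brief sketch with pandas (the ages, bin edges, and labels are illustrative assumptions):

```python
import pandas as pd

# Illustrative ages to be discretized
ages = pd.Series([6, 15, 23, 37, 48, 66, 81])

# Replace raw values with interval labels
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])

# Replace raw values with conceptual labels
concepts = pd.cut(ages, bins=[0, 20, 60, 100], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```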
Data mining task primitives
 The data mining task primitives include the following:
• Task-relevant data
• Kind of knowledge to be mined
• Background knowledge
• Interestingness measurement
• Presentation for visualizing the discovered patterns