Class4 DataPreprocessing DiscriptiveAnalytics 19aug2021

1) The document discusses data preprocessing techniques which are important for cleaning and preparing raw data for analysis. 2) It describes common data preprocessing steps like data cleaning, integration, transformation and reduction which are used to handle issues like missing values, noise, inconsistencies and reduce data size. 3) Descriptive analytics techniques are also covered, including measuring the central tendency of data using the mean, median and mode to understand characteristics of numeric attributes.

19-08-2021

Data Preprocessing

Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge

[Figure: data science pipeline with the stages Data Collection, Data Preprocessing (Database, Data Cleaning and Cleansing, Feature Representation), Data Modeling (Machine Learning) and Inference]

Need for Data Preprocessing


• Real-world data tend to be incomplete, noisy and
inconsistent because of their huge size and their likely origin
from multiple heterogeneous sources
• Preprocessing is important to clean the data
• Low-quality data lead to low-quality analysis results
• If users believe the data are of low quality (dirty), they
are unlikely to trust the results of any data analytics
applied to them
• Low-quality data can confuse analytic procedures
that use machine learning techniques, resulting in unreliable
output
• Data could be
– incomplete,
– noisy and
– inconsistent
• These are common properties of large real-world databases

Data Preprocessing Techniques


• Data cleaning:
– Applied to
• identify the missing values,
• fill in missing values,
• remove noise and
• correct inconsistency in the data
• Data integration:
– Merges data from multiple sources into a coherent
data source
• Data transformation:
– Transforms the data entries to a common format
– Techniques such as normalization and standardization
are applied to transform the data to another form, which
improves the accuracy and efficiency of machine learning
(ML) algorithms that involve distance measures
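
The two transformations named above can be sketched as follows. This is a minimal illustration, assuming that "normalization" means min-max scaling to [0, 1] and "standardization" means z-score scaling; the slides do not spell out either formula, and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift values to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical attribute values, chosen only for illustration.
sample = [20, 30, 50, 90]
scaled = min_max_normalize(sample)
z_scores = standardize(sample)
```

After either transformation, attributes measured on very different scales contribute comparably to a distance measure such as Euclidean distance.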

Data Preprocessing Techniques


• Data reduction:
– Applied to obtain a reduced representation that is much
smaller in volume, yet produces almost the same analytical
results
– It can reduce the data size by
• Aggregation
• Eliminating irrelevant and redundant features (attributes)
through correlation analysis
• Reducing dimension
• These techniques are not mutually exclusive; they
may work together
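
As a rough sketch of how correlation analysis can flag a redundant attribute, the Pearson correlation coefficient between two numeric attributes can be computed directly. The data are the "Years of experience" and "Salary" columns used in the illustrations of these slides; the threshold of 0.9 and the choice of the Pearson coefficient are assumptions made here for illustration, not something the slides prescribe.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

r = pearson_r(years, salary)

# Hypothetical cut-off: attributes correlated more strongly than this
# are treated as candidates for removal.
REDUNDANCY_THRESHOLD = 0.9
is_redundant = abs(r) > REDUNDANCY_THRESHOLD
```

A coefficient close to +1 or -1 suggests that one attribute carries little information beyond the other, so dropping it may reduce the data size with little loss.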


Descriptive Data Summarization (Descriptive Analytics)
• It serves as a foundation for data preprocessing
• It helps us to study the general characteristics of data
and identify the presence of noise or outliers
• Data characteristics:
– Central tendency of data
• Centre of the data
• Measuring mean, median and mode
– Dispersion of data
• The degree to which numerical data tend to spread
• Measuring range, quartiles, interquartile range (IQR), the
five-number summary and standard deviation

Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Let x1, x2, …, xN be a set of N values in an attribute. The mean of
this set of values is given by

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

– Illustration: number of records (tuples), N = 10

Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83

Sum:             91    554
Mean (Sum/10):   9.1   55.4

– Mean is a better measure of central tendency for symmetric
(symmetrically distributed) data
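
The mean calculation above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `mean_of` is chosen here for illustration.

```python
def mean_of(values):
    """Arithmetic mean: the sum of the values divided by their count N."""
    if not values:
        raise ValueError("mean of an empty list is undefined")
    return sum(values) / len(values)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

mean_years = mean_of(years)    # 91 / 10
mean_salary = mean_of(salary)  # 554 / 10

print(mean_years)   # prints 9.1
print(mean_salary)  # prints 55.4
```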

Descriptive Analytics:
Measuring Central Tendency
• Median:
– Let x1, x2, …, xN be a set of N values in an attribute. The median is the
"middle" value when those values are listed in order
from smallest to greatest.
– Median is the value separating the higher half from the lower half of
a data sample
– For given data of N values in sorted order
• If N is odd, then the median is the middle value of the ordered list
• If N is even, then the median is the average of the middle two values
– For asymmetrically distributed (skewed) data, the median is a better
measure of the centre of the data

Illustration: Median of attribute “Years of experience” (N = 10)
Values sorted: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
Median: (8 + 9) / 2 = 8.5
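
The odd/even rule for the median can be sketched in Python as follows. The function name `median_of` is illustrative, and the input is the "Years of experience" column from the table of records.

```python
def median_of(values):
    """Median: middle value of the sorted list (odd N),
    or the average of the two middle values (even N)."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
median_years = median_of(years)
```

Sorting is done inside the function so that callers can pass the values in their original record order.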

Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
– Illustration: Mode of attribute “Years of experience” in the table of
N = 10 records, assuming that the values are discrete numerical values
– The value 3 occurs twice; every other value occurs once

Mode: 3
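
For discrete values, the mode can be found by counting occurrences. The sketch below returns every value that ties for the highest count, which is a design choice made here; the slides only discuss a single mode.

```python
from collections import Counter

def modes_of(values):
    """Return all values that occur most often, in ascending order."""
    if not values:
        raise ValueError("mode of an empty list is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
print(modes_of(years))  # prints [3]
```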


Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
• The mode of a continuous variable is the value at which
the probability density function, f(x), is at a maximum
• It is the value that is most likely to lie within the same
interval as the outcome

Illustration: number of samples, N = 61

Date       Temperature
Sept 1     25.47
Sept 2     26.19
Sept 3     25.17
Sept 4     24.30
Sept 5     24.07
Sept 6     21.21
Sept 7     23.49
Sept 8     21.79
Sept 9     25.09
Sept 10    25.39
---        ---
Oct 29     23.06
Oct 30     23.72
Oct 31     23.02

Mean: 22.85
Median: 22.89
Mode: (22.32 – 23.62]
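
One common way to estimate the mode of a continuous attribute is to bin the values and report the fullest bin. The sketch below does this with equal-width bins. It uses only the 13 temperature readings printed on the slide, not the full 61 samples, so its result is not expected to match the interval reported above; the number of bins is an assumption, and the bins here are closed on the left, unlike the half-open (a, b] interval on the slide.

```python
def modal_interval(values, bins):
    """Split [min, max] into equal-width bins and return
    (left edge, right edge, count) of the fullest bin."""
    lo, hi = min(values), max(values)
    if bins < 1 or hi == lo:
        raise ValueError("need at least one bin and a non-zero value range")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls on the last edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    best = max(range(bins), key=lambda i: counts[i])
    return lo + best * width, lo + (best + 1) * width, counts[best]

# Only the readings visible on the slide (Sept 1-10 and Oct 29-31).
temps = [25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49,
         21.79, 25.09, 25.39, 23.06, 23.72, 23.02]

left, right, count = modal_interval(temps, bins=4)
```

The result depends on the bin width, so the modal interval is an estimate rather than a unique property of the data.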

Descriptive Analytics:
Measuring Central Tendency

[Figure: distribution shapes for symmetric data, positively skewed data and negatively skewed data]


Descriptive Analytics:
Measuring Dispersion of Data
• The degree to which numerical data tend to spread
• It is also called variance (in symmetrically
distributed data)
• Common measures of data dispersion:
– Range
– The five-number summary (based on quartiles)
– The interquartile range (IQR)
– Standard deviation
• Range: The range of a finite set of values is the
difference between the maximum and minimum
values

Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the value
xn having the property that k percent of the data entries lie at or
below xn
– Example: 50th percentile
• The value below which 50% of the data entries (values) lie
– Those 50% of entries have values equal to or less than the 50th
percentile
– Example: 25th percentile
• The value below which 25% of the data entries (values) lie
– Those 25% of entries have values equal to or less than the 25th
percentile
• Middle element between the minimum and the 50th percentile
– Example: 75th percentile
• The value below which 75% of the data entries (values) lie
– Those 75% of entries have values equal to or less than the 75th
percentile
• Middle element between the maximum and the 50th percentile

Illustration: percentiles of attribute “Years of experience” (N = 10)
Values sorted: 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
25th percentile: 3
50th percentile: 8.5
75th percentile: 13

Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the
value of xn having the property that k percent of data
entries lie at or below xn
• Median is the 50th percentile (the second quartile (Q2))
• The first quartile (Q1): It is the 25th percentile
• The third quartile (Q3): It is the 75th percentile
– The quartiles including median give some indication of
centre, spread and shape of distribution
• The distance between Q1 and Q3 is a simple
measure of spread
• Interquartile range (IQR): Distance between the first
quartile (Q1) and the third quartile (Q3)
IQR = Q3 – Q1
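
The quartiles and the IQR can be sketched in Python by following the rule used in these slides: Q2 is the median, Q1 is the middle element of the lower half and Q3 is the middle element of the upper half. Other conventions exist (for example, linear interpolation between ranks) and can give slightly different values; the function name and the handling of odd N (the median is left out of both halves) are assumptions made for this sketch.

```python
from statistics import median

def quartiles_of(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    if len(values) < 2:
        raise ValueError("need at least 2 values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd N the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
q1, q2, q3 = quartiles_of(years)
iqr = q3 - q1  # interquartile range: IQR = Q3 - Q1
```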


Descriptive Analytics:
Measuring Dispersion of Data
• The five-number summary of a distribution:
– It consists of the minimum value, Q1, the median, Q3 and the maximum
value
• Box plots are a popular way of visualising a distribution

[Figure: box plot. The box spans Q1 to Q3 (the IQR) with a line at the median (Q2); the top whisker ends at the largest observation (max) and the bottom whisker at the smallest observation (min); outliers are plotted beyond the whiskers]

• The whiskers terminate at
– the smallest (minimum) or largest (maximum) observations or
– the most extreme observations occurring within 1.5 x IQR of the
respective quartiles (Q1 and Q3)
• For a normal distribution, the bounds at 1.5 x IQR beyond the
quartiles lie about 2.7σ from the mean
– This is close to 3σ from the mean, which is a standard in the normal distribution
• Lower bound: Q1 – (1.5 x IQR)
• Upper bound: Q3 + (1.5 x IQR)
• Outliers: Any data point less than the lower bound or
larger than the upper bound
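
The five-number summary and the 1.5 x IQR outlier rule can be sketched as follows. The quartiles are computed with the median-of-halves rule from the earlier slides; the function names and the extra value 60 in the usage example are hypothetical and serve only to show an outlier being flagged.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    if len(values) < 2:
        raise ValueError("need at least 2 values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd N the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def iqr_bounds(values, k=1.5):
    """Lower bound Q1 - k*IQR and upper bound Q3 + k*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers_of(values, k=1.5):
    """Values below the lower bound or above the upper bound."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
summary = five_number_summary(years)
bounds = iqr_bounds(years)
flagged = outliers_of(years + [60])  # 60 is a hypothetical extra record
```

Note that the bounds are recomputed from whatever data are passed in, so adding an extreme value can shift the quartiles slightly before the rule is applied.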

Descriptive Analytics:
Measuring Dispersion of Data
• Variance (σ2):
– Let x1, x2, …, xN be a set of N values in an attribute. The
variance (σ2) of this set of values is given by

$$\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2, \qquad \mu = \text{mean}$$

• Standard deviation (σ):
– The square root of the variance: $\sigma = \sqrt{\text{Variance}}$
• Standard deviation measures the spread about the
mean
– It is used when the mean is chosen as the measure of
centre, especially in a symmetric distribution
• The quartiles Q1 and Q3 measure the spread about the
median
– Q1 and Q3 are used when the median is chosen as the
measure of centre, especially in a skewed distribution
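
The variance and standard deviation can be computed directly from the formula with the N – 1 divisor shown on this slide. This is a minimal sketch applied to the "Years of experience" column; the function names are illustrative.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
var_years = sample_variance(years)  # 358.9 / 9
std_years = sample_std(years)
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same N – 1 divisor.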