Class 4: Data Preprocessing and Descriptive Analytics (19 Aug 2021)
Data Preprocessing
Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
Data Preprocessing
[Figure: diagram with blocks labelled "Database", "Data Cleaning and Cleansing" and "Feature Representation"]
19-08-2021
• Data integration:
• Data transformation:
• Data reduction:
Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Let x1, x2, …, xN be a set of N values in an attribute. The mean of this set of values is given by
  $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Number of records (tuples), N = 10
Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83
Sum of “Years of experience”: 91
Descriptive Analytics:
Measuring Central Tendency
• Mean:
– Mean is a better measure of central tendency for symmetric (symmetrically distributed) data
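As a minimal sketch, the mean of both attributes in the table can be computed as below. The variable and function names are illustrative and do not come from the slides.

```python
# Values copied from the slide's table (N = 10 records).
years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary_rs_1000 = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

def mean(values):
    """Arithmetic mean: (1/N) * sum of x_i."""
    return sum(values) / len(values)

mean_years = mean(years_of_experience)   # 91 / 10
mean_salary = mean(salary_rs_1000)       # 554 / 10
print(mean_years, mean_salary)
```

The sum of “Years of experience” is 91, so the mean is 91 / 10 = 9.1 years.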
Descriptive Analytics:
Measuring Central Tendency
• Median:
– Let x1, x2, …, xN be a set of N values in an attribute. The median is the "middle" number (value) when those numbers are listed in order from smallest to greatest.
– Median is the value separating the higher half from the lower half of a data sample
– For a given data of N values in sorted order
  • If N is odd, then median is the middle value of the ordered list
  • If N is even, then median is the average of the middle two values
Number of records (tuples), N = 10
Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83
Descriptive Analytics:
Measuring Central Tendency
• Median: Sort the values in “Years of experience”
Years of experience (sorted): 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
Median: (8 + 9) / 2 = 8.5
– For asymmetrically distributed (skewed) data, a better measure of centre of data is median
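The odd/even rule for the median can be sketched in a few lines of Python. The function name is illustrative; the data are the ten “Years of experience” values from the table.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
print(sorted(years_of_experience))   # [1, 3, 3, 6, 8, 9, 11, 13, 16, 21]
print(median(years_of_experience))   # 8.5, the average of 8 and 9
```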
Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
Illustration: Mode of attribute “Years of experience”
Assume that values are discrete numerical
Number of records (tuples), N = 10
Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83
Mode: 3
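A short sketch of the mode for discrete values, using Python's standard `collections.Counter`. Returning a list of all tied values is a design choice of this example, not something the slide specifies.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
print(modes(years_of_experience))   # [3], since 3 occurs twice and every other value once
```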
Descriptive Analytics:
Measuring Central Tendency
• Mode: Most frequent value in an attribute in the data
• The mode of a continuous variable is the value at which the probability density function, f(x), is at a maximum.
• An outcome is most likely to lie in the same interval as the mode, so for sampled data the mode is reported as an interval
Number of samples, N = 61
Date       Temperature
Sept 1     25.47
Sept 2     26.19
Sept 3     25.17
Sept 4     24.30
Sept 5     24.07
Sept 6     21.21
Sept 7     23.49
Sept 8     21.79
Sept 9     25.09
Sept 10    25.39
---        ---
Oct 29     23.06
Oct 30     23.72
Oct 31     23.02
Mean: 22.85
Mode: (22.32, 23.62]
Median: 22.89
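One way to obtain an interval-valued mode is to split the data into equal-width bins and report the fullest one. The sketch below is an assumption about the approach: the slide does not say how its interval was derived, the helper name `binned_mode` and the bin count are hypothetical, and only the 13 temperatures visible on the slide (out of 61) are used, so the result will not reproduce the slide's interval.

```python
import math

def binned_mode(values, num_bins):
    """Return (low, high, count) for the fullest equal-width bin (low, high]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; no bins can be formed")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Right-closed bins; the minimum itself is placed in the first bin.
        k = 0 if v == lo else min(num_bins - 1, math.ceil((v - lo) / width) - 1)
        counts[k] += 1
    best = max(range(num_bins), key=lambda k: counts[k])
    return lo + best * width, lo + (best + 1) * width, counts[best]

# Only the rows visible on the slide; the remaining samples are elided there.
visible_temperatures = [25.47, 26.19, 25.17, 24.30, 24.07, 21.21, 23.49,
                        21.79, 25.09, 25.39, 23.06, 23.72, 23.02]
print(binned_mode(visible_temperatures, num_bins=4))  # num_bins=4 is an arbitrary choice
```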
Descriptive Analytics:
Measuring Dispersion of Data
• Dispersion is the degree to which numerical data tend to spread
• In symmetrically distributed data it is usually summarised by the variance
• Common measures of data dispersion:
– Range
– The five-number summary (based on quartiles)
– The interquartile range (IQR)
– Standard deviation
• Range: The range of a finite set of values is the difference between the maximum and minimum values
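As a small sketch, the range of the two attributes used in the earlier illustrations can be computed as follows. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary_rs_1000 = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

print(value_range(years_of_experience))   # 21 - 1 = 20
print(value_range(salary_rs_1000))        # 90 - 20 = 70
```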
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
  • Let x1, x2, …, xN be a set of N values in an attribute
  • The kth percentile of a set of data in numerical order is the value of xn having the property that k percent of data entries lie at or below xn
– Example: 50th percentile
  • The value (number) below which 50% of the data entries (values) lie
  – Those 50% of entries have values equal to or less than the 50th percentile
Number of records (tuples), N = 10
Years of experience    Salary (in Rs 1000)
3                      30
8                      57
9                      64
13                     72
3                      36
6                      43
11                     59
21                     90
1                      20
16                     83
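The defining property of a percentile ("k percent of data entries lie at or below the value") can be checked directly. The helper below is a hypothetical illustration, not a percentile algorithm: it only reports what percentage of the entries lie at or below a given value.

```python
def percent_at_or_below(values, x):
    """Percentage of entries in `values` that are less than or equal to x."""
    return 100 * sum(1 for v in values if v <= x) / len(values)

years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]

# Five of the ten entries (1, 3, 3, 6, 8) are at or below 8.
print(percent_at_or_below(years_of_experience, 8))    # 50.0
# Every entry is at or below the maximum.
print(percent_at_or_below(years_of_experience, 21))   # 100.0
```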
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles: Sort the values in “Years of experience”
Years of experience (sorted): 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
50th Percentile: (8 + 9) / 2 = 8.5
Descriptive Analytics:
Measuring Dispersion of Data
Illustration: 25th percentile of attribute “Years of experience”
• Quartiles: Sort the values in “Years of experience”
Years of experience (sorted): 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– Example: 25th percentile
  • The value (number) below which 25% of the data entries (values) lie
  – Those 25% of entries have values equal to or less than the 25th percentile
  • Middle element between minimum and 50th percentile (middle of 1, 3, 3, 6, 8)
25th Percentile: 3
Descriptive Analytics:
Measuring Dispersion of Data
Illustration: 75th percentile of attribute “Years of experience”
• Quartiles: Sort the values in “Years of experience”
Years of experience (sorted): 1, 3, 3, 6, 8, 9, 11, 13, 16, 21
– Example: 75th percentile
  • The value (number) below which 75% of the data entries (values) lie
  – Those 75% of entries have values equal to or less than the 75th percentile
  • Middle element between maximum and 50th percentile (middle of 9, 11, 13, 16, 21)
75th Percentile: 13
Descriptive Analytics:
Measuring Dispersion of Data
• Quartiles:
– The kth percentile:
• Let x1, x2, …, xN be a set of N values in an attribute
• The kth percentile of a set of data in numerical order is the
value of xn having the property that k percent of data
entries lie at or below xn
• Median is the 50th percentile (the second quartile (Q2))
• The first quartile (Q1): It is the 25th percentile
• The third quartile (Q3): It is the 75th percentile
– The quartiles including median give some indication of
centre, spread and shape of distribution
• The distance between the Q1 and Q3 is a simple
measure of spread
• Interquartile range (IQR): Distance between the first quartile (Q1) and third quartile (Q3)
IQR = Q3 – Q1
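A minimal sketch of the quartiles and the IQR, following the slides' description of Q1 and Q3 as the middle elements of the lower and upper halves of the sorted data. This is only one convention: libraries that interpolate between data points can return slightly different quartiles. The function names are illustrative, and at least two values are assumed.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # values below the median position
    upper = ordered[len(ordered) - half:]    # values above the median position
    return median(lower), median(ordered), median(upper)

years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
q1, q2, q3 = quartiles(years_of_experience)
print(q1, q2, q3)   # 3 8.5 13
print(q3 - q1)      # IQR = 10
```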
Descriptive Analytics:
Measuring Dispersion of Data
• The five-number summary of distribution:
– It consists of minimum value, Q1, median, Q3 and maximum value
• Box plots are a popular way of visualising a distribution
[Figure: box plot marking the largest observation (max, top whisker), Q3, Q2/median, Q1 and the smallest observation (min, bottom whisker); the IQR is the span of the box from Q1 to Q3]
Descriptive Analytics:
Measuring Dispersion of Data
• Outlier (in a box plot): A value lying more than 1.5 × IQR above Q3 or below Q1
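The five-number summary and the 1.5 × IQR outlier rule can be sketched together. The quartiles here use the median-of-halves convention (other conventions exist), the function names are illustrative, and the extra value 40 in the last call is made up purely to show an outlier being flagged.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

def outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
print(five_number_summary(years_of_experience))   # (1, 3, 8.5, 13, 21)
print(outliers(years_of_experience))              # [] (fences are -12 and 28)
print(outliers(years_of_experience + [40]))       # [40]; 40 is a hypothetical extra value
```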
Descriptive Analytics:
Measuring Dispersion of Data
• Variance (σ²):
– Let x1, x2, …, xN be a set of N values in an attribute. The variance (σ²) of this set of values is given by
  $\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2$, where μ = mean
• Standard deviation (σ):
– The square root of variance: $\sigma = \sqrt{\sigma^2}$
• Standard deviation measures the spread about the mean
– It is used when the mean is chosen as the measure of centre, especially in symmetric distribution
• The quartiles Q1 and Q3 measure the spread about the median
– Q1 and Q3 are used when the median is chosen as the measure of centre, especially in skewed distribution
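A short sketch of the variance and standard deviation, assuming the N − 1 (sample) denominator. The function names are illustrative, and the data are the ten “Years of experience” values used in the earlier illustrations.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / (n - 1)

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(sample_variance(values))

years_of_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
print(sample_variance(years_of_experience))      # 358.9 / 9, roughly 39.88
print(standard_deviation(years_of_experience))   # roughly 6.31
```

Dividing by N instead of N − 1 would give the population variance, which is slightly smaller for the same data.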