Data science-Unit-3-Complete
Unit – 3 Syllabus
Data analysis:
Introduction,
Terminology and Concepts,
Introduction to statistics,
Central tendencies and distributions,
Variance,
Data Science Distribution properties and arithmetic,
Samples/CLT
Data Pre-process:
Data Cleaning,
Consistency checking,
Heterogeneous and missing data,
Data Transformation & Segmentation,
Machine Learning algorithms- Linear Regression, SVM, Naïve Bayes
Mode: If we want to find out the most common type of cylinder among the population of cars, we check the value that is repeated the most often. Here the cylinders come in two values, 4 and 6, and a look at the data set shows that the most recurring value is 6. Hence 6 is our mode.
Mean: the average value, i.e. the sum of all values divided by the number of values.
Median: the middle value of the data set once it has been sorted; with an even number of values, it is the average of the two middle values.
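As a minimal sketch, the three central tendencies can be computed with Python's built-in statistics module; the cylinders list below is a hypothetical stand-in for the cars data set described above.

Example Python code:
import statistics

cylinders = [4, 6, 6, 4, 6, 6, 4, 6]  # hypothetical sample of cylinder counts

print(statistics.mean(cylinders))    # arithmetic mean: 5.25
print(statistics.median(cylinders))  # middle value of the sorted data: 6
print(statistics.mode(cylinders))    # most frequent value: 6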
Percentiles:
The 25th percentile of Average_Pulse means that 25% of all the training sessions have an average pulse of at most 100 beats per minute; equivalently, 75% of the sessions have an average pulse of at least 100 beats per minute.
The 75th percentile of Average_Pulse means that 75% of all the training sessions have an average pulse of at most 111 beats per minute; equivalently, 25% have an average pulse of at least 111 beats per minute.
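As a minimal sketch, percentiles can be computed with numpy; the list below reuses the ten Average_Pulse values from the variance example later in this unit, so its quartiles differ from the 100 and 111 quoted above.

Example Python code:
import numpy as np

average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(np.percentile(average_pulse, 25))  # 25th percentile
print(np.percentile(average_pulse, 75))  # 75th percentile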
Range: a measure of how spread apart the values in a data set are.
Inter Quartile Range (IQR): a measure of variability based on dividing a data set into quartiles.
Variance: describes how much a random variable differs from its expected value. It entails computing the squares of the deviations.
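The three spread measures just defined can be sketched in a few lines of numpy on a small hypothetical sample:

Example Python code:
import numpy as np

data = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

data_range = data.max() - data.min()    # range: max minus min
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # inter quartile range
variance = np.var(data)                 # average squared deviation from the mean

print(data_range, iqr, variance)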
Variance is a number that indicates how the values are spread around the mean. Suppose we want to find the variance of Average_Pulse. In fact, if you take the square root of the variance, you get the standard deviation; or, the other way around, if you multiply the standard deviation by itself, you get the variance.

We will first use a data set with 10 observations (from the full_health_data dataset) to give an example of how we can calculate the variance:

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8

Step 1 to Calculate the Variance: Find the Mean
1. Find the mean:
(80 + 85 + 90 + 95 + 100 + 105 + 110 + 115 + 120 + 125) / 10 = 102.5
The mean is 102.5.

Step 2: For Each Value, Find the Difference From the Mean
2. Find the difference from the mean for each value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
Step 3: Find the Square of Each Difference
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25
Note:
1. We must square the values to get the total spread.
2. Squaring avoids negative values; otherwise the positive and negative differences would cancel and the total would be 0.

Step 4: The Variance is the Average of These Squared Values
4. Sum the squared values and find the average:
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25
The variance is 206.25.

Standard Deviation
A mathematical function will have difficulty predicting precise values if the observations are "spread". Variance is a common measure of data dispersion, but in most cases its values are quite large and hard to compare (since they are squared). In most analyses the standard deviation is much more meaningful than the variance.

standard deviation = √variance

• Standard deviation is a measure of uncertainty.
• A low standard deviation means that most of the numbers are close to the mean (average) value.
• A high standard deviation means that the values are spread out over a wider range.

Standard deviation is often represented by the symbol sigma: σ

Example Python code:
import numpy as np
# full_health_data is the pandas DataFrame shown above;
# np.std returns the standard deviation of each numeric column.
std = np.std(full_health_data)
print(std)
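The four arithmetic steps above can also be reproduced in plain Python to check the result; this sketch reuses the same ten Average_Pulse values.

Example Python code:
pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

mean = sum(pulse) / len(pulse)          # Step 1: 102.5
diffs = [x - mean for x in pulse]       # Step 2: differences from the mean
squared = [d ** 2 for d in diffs]       # Step 3: squared differences
variance = sum(squared) / len(squared)  # Step 4: 206.25
std = variance ** 0.5                   # standard deviation, about 14.36

print(variance, std)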
Standard deviation is the average deviation of each value in a data set from the mean.

Intuition: if the variance is high, you have larger variability in your dataset; in other words, more values are spread out around your mean value. To check whether a variable's values follow a normal distribution, you need to know its mean and standard deviation: if the distribution is indeed normal, its plot will be close to the familiar bell-shaped curve. The standard deviation represents the average distance of an observation from the mean, and the larger the standard deviation, the larger the variability of the data.
Data points outside of three standard deviations from the mean are considered outliers, as they are very unlikely to occur.
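As a minimal sketch of this three-sigma rule, the snippet below plants one extreme value in a normally distributed hypothetical sample and flags every point more than three standard deviations from the mean.

Example Python code:
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(100, 10, 1000), 200)  # one planted extreme value

mean, std = data.mean(), data.std()
outliers = data[np.abs(data - mean) > 3 * std]  # three-sigma rule
print(outliers)  # the planted value, plus any rare natural extremes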
Skewness
What is Skewness?
Skewness is a measure of the asymmetry or distortion of a symmetric distribution. It measures the deviation of the given distribution of a random variable from a symmetric normal distribution. A normal distribution is without any skewness, as it is symmetrical on both sides. Hence, a curve is regarded as skewed if it is shifted towards the right or the left.
In other words, kurtosis identifies whether the tails of a given distribution contain extreme values. Along with skewness, kurtosis is an important descriptive statistic of data distribution. However, the two concepts must not be confused with each other: skewness essentially measures the symmetry of the distribution, while kurtosis determines the heaviness of the distribution tails.

What is Excess Kurtosis?
Excess kurtosis is a metric that compares the kurtosis of a distribution against the kurtosis of a normal distribution. The kurtosis of a normal distribution equals 3. Therefore, the excess kurtosis is found using the formula:
excess kurtosis = kurtosis - 3

1. Mesokurtic: data that follows a mesokurtic distribution shows an excess kurtosis of zero or close to zero. This means that if the data follows a normal distribution, it follows a mesokurtic distribution.
2. Leptokurtic: indicates a positive excess kurtosis. The distribution shows heavy tails on either side, indicating large outliers. In finance, a leptokurtic distribution shows that the investment returns may be prone to extreme values on either side; therefore, an investment whose returns follow a leptokurtic distribution is considered risky.
3. Platykurtic: shows a negative excess kurtosis. The kurtosis reveals a distribution with flat tails, and the flat tails indicate small outliers. In the finance context, a platykurtic distribution of investment returns is desirable for investors, because there is a small probability that the investment would experience extreme returns.
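Both statistics can be computed with scipy; note that scipy.stats.kurtosis uses the Fisher definition by default, so it already returns the excess kurtosis (kurtosis - 3) described above.

Example Python code:
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 10_000)  # roughly symmetric, mesokurtic data

print(skew(sample))      # close to 0 for a normal sample
print(kurtosis(sample))  # excess kurtosis, close to 0 for a normal sample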
Coefficient of Variation
The coefficient of variation is used to get an idea of how large the standard deviation is relative to the mean:

import numpy as np
# full_health_data is the pandas DataFrame used earlier.
cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)
The correlation coefficient can never be less than -1 or higher than 1.
1 = there is a perfect linear relationship between the variables (like Average_Pulse against Calorie_Burnage)
0 = there is no linear relationship between the variables
-1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked leads to higher calorie burnage during a training session)

Example:
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='scatter')
plt.show()
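The coefficient itself can be computed with pandas' corr method; this sketch rebuilds the ten observations from the variance example, where Calorie_Burnage rises in lockstep with Average_Pulse.

Example Python code:
import pandas as pd

health_data = pd.DataFrame({
    "Average_Pulse":   [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
})

corr = health_data["Average_Pulse"].corr(health_data["Calorie_Burnage"])
print(corr)  # 1.0: a perfect positive linear relationship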
Correlation Coefficient
Example of No Linear Relationship (Correlation coefficient = 0)
Here, we have plotted Max_Pulse against Duration from the full_health_data set.

Example:
import matplotlib.pyplot as plt
# Plot Max_Pulse against Duration, as described above
# (call assumed, following the earlier scatter-plot example).
full_health_data.plot(x='Duration', y='Max_Pulse', kind='scatter')
plt.show()
Consider that there are 15 sections in the science department of a university and
each section hosts around 100 students. Our task is to calculate the average
weight of students in the science department. Sounds simple, right?
But what if the size of the data is huge? Does this approach make sense? Not really: measuring the weight of all the students would be a very tiresome and long process. So, what can we do instead? Let's look at an alternate approach.

First, draw groups of students at random from the class. We will call this a sample. We'll draw multiple samples, each consisting of 30 students.
1. Calculate the individual mean of each of these samples.
2. Calculate the mean of these sample means.
3. This value will give us the approximate mean weight of the students in the science department.
4. Additionally, the histogram of the sample mean weights of students will resemble a bell curve (or normal distribution), as the sketch below illustrates.
This, in a nutshell, is what the central limit theorem is all about.
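A minimal simulation of this procedure, with hypothetical uniformly distributed student weights, shows the sample means clustering around the population mean with a spread close to σ/√n:

Example Python code:
import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(45, 90, 1500)  # hypothetical weights of 15 x 100 students

# Draw 1000 samples of 30 students each and record each sample's mean.
sample_means = [rng.choice(population, size=30).mean() for _ in range(1000)]

print(population.mean())               # true average weight
print(np.mean(sample_means))           # mean of sample means, approximately equal
print(np.std(sample_means))            # close to population std / sqrt(30)
print(population.std() / np.sqrt(30))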
Significance of the Central Limit Theorem
The central limit theorem has both statistical significance as well as practical applications, and we'll look at both aspects to gauge where we can use them. Also, the sample mean can be used to create a range of values known as a confidence interval (that is likely to contain the population mean).
The mean of the sample means is denoted as:
μX̄ = μ
where
μX̄ = mean of the sample means
μ = population mean
The standard deviation of the sample means (the standard error) is:
σX̄ = σ/√n
where
σX̄ = standard deviation of the sample means
σ = population standard deviation
n = sample size

Statistics – Probability
Statistical concepts make your journey pleasant and lead you to success in the field of Data Science. Statistics is a powerful tool when performing the art of Machine Learning and Data Science.
4. Intentional data: users purposely submit incorrect data when they do not want to disclose personal information (e.g. everyone reporting Jan. 1 as their birthday). This is also known as disguised missing data.
A linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (X), hence the name linear regression. That is, it finds how the value of the dependent variable changes according to the value of the independent variable. The linear regression model provides a sloped straight line representing the relationship between the variables.
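As a minimal sketch with hypothetical training data, scikit-learn can fit such a sloped straight line y = a0 + a1*x:

Example Python code:
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])  # independent variable
y = np.array([3, 5, 7, 9, 11])           # dependent variable (y = 2x + 1)

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_[0])  # a0 = 1.0, a1 = 2.0
print(model.predict([[6]]))              # [13.]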
The values for the x and y variables are the training datasets used for the linear regression model representation.
Residuals: the distance between an actual value and the corresponding predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small too.

Gradient Descent: an iterative optimization method that minimises the cost function by repeatedly updating the model coefficients in the direction that reduces the error; see the sketch below.

Model Performance: the goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization.
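The sketch below shows gradient descent minimising the mean-squared-error cost for a simple line y = a0 + a1*x; the learning rate and iteration count are arbitrary illustrative choices.

Example Python code:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # true line: y = 2x + 1

a0, a1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    residuals = (a0 + a1 * x) - y          # prediction errors
    a0 -= lr * 2 * residuals.mean()        # gradient of MSE w.r.t. a0
    a1 -= lr * 2 * (residuals * x).mean()  # gradient of MSE w.r.t. a1

print(a0, a1)  # approaches 1.0 and 2.0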
You need to remember a thumb rule to identify the right hyper-plane: "Select the hyper-plane which segregates (separates or sets apart) the two classes better." In this scenario, hyper-plane "B" performs this job excellently.

Above, you can see that the margin for hyper-plane C is high as compared to both A and B; hence we name C as the right hyper-plane. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, there is a high chance of misclassification.
When we look at the hyper-plane in the original input space, it looks like a circle; a kernelised SVM handles this case, as sketched below.
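A minimal sketch of both ideas with scikit-learn: an SVC finds the maximum-margin hyper-plane, and an RBF kernel handles this circular boundary; the data here are randomly generated for illustration.

Example Python code:
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular class boundary

clf = SVC(kernel="rbf").fit(X, y)  # kernel trick: the circle becomes separable
print(clf.score(X, y))             # high accuracy despite the non-linear boundary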
None of the attributes is irrelevant, and all are assumed to contribute equally to the outcome.

Note: the assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never exactly correct, but it often works well in practice.

Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
where A and B are events and P(B) != 0. Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence. P(A) is the priori of A (the prior probability, i.e. the probability of the event before the evidence is seen); the evidence is an attribute value of an unknown instance (here, event B). P(A|B) is the a posteriori probability, i.e. the probability of event A after the evidence is seen.
P(A|B) – the probability of event A occurring, given event B has occurred
P(B|A) – the probability of event B occurring, given event A has occurred
P(A) – the probability of event A
P(B) – the probability of event B

For example, with X = (Rainy, Hot, High, False) and y = No, P(y|X) here means the probability of "not playing golf" given that the weather conditions are "rainy outlook", "temperature is hot", "high humidity" and "no wind".
Now, as the denominator remains constant for a given input, we can remove that term:

P(y | x1, x2, …, xn) ∝ P(y) ∏ P(xi | y)

Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset: we need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been demonstrated in the tables below.
Also, we need to find the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.

So now we are done with our pre-computations, and the classifier is ready! Let us test it on a new set of features (let us call it today):
today = (Sunny, Hot, Normal, False)

So, the probability of playing golf is given by:
P(Yes | today) = P(Sunny Outlook | Yes) · P(Hot Temperature | Yes) · P(Normal Humidity | Yes) · P(No Wind | Yes) · P(Yes) / P(today)
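A minimal sketch of evaluating this formula in its proportional form: P(Yes) = 9/14 comes from the text, while the conditional likelihoods below are assumed values standing in for the precomputed tables the text refers to.

Example Python code:
from math import prod

p_yes, p_no = 9 / 14, 5 / 14  # class probabilities

# Assumed likelihoods P(xi | y) for today = (Sunny, Hot, Normal, False).
likelihood_yes = [3 / 9, 2 / 9, 6 / 9, 6 / 9]
likelihood_no = [2 / 5, 2 / 5, 1 / 5, 2 / 5]

score_yes = p_yes * prod(likelihood_yes)  # proportional to P(Yes | today)
score_no = p_no * prod(likelihood_no)     # proportional to P(No | today)

total = score_yes + score_no              # normalise the two scores
print(score_yes / total, score_no / total)  # predict the class with the larger value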