
22CSC202 - Learning Materials

Foundation of Data Science - Unit V

School of Physical Sciences

Department of Mathematics

Course Materials

Course Name : Foundation of Data Science

Course Code : 22CSC202

Programme Name : Int. M.Sc. Data Science

Year : II

Semester : III

Course Coordinator : Dr. P. Sriramakrishnan



Syllabus
Unit I
Introduction, Causality and Experiments, Data Preprocessing: Data cleaning, Data reduction, Data transformation, Data discretization. Visualization and Graphing: Visualizing Categorical Distributions, Visualizing Numerical Distributions, Overlaid Graphs, plots, and summary statistics of exploratory data analysis.
Unit II
Randomness, Probability, Sampling, Sample Means and Sample Sizes.
Unit III
Introduction to Statistics, Descriptive statistics – Central tendency, dispersion, variance, covariance, kurtosis, five-point summary, Distributions, Bayes' Theorem, Error Probabilities.
Unit IV
Statistical Inference: Hypothesis Testing, P-Values, Assessing Models, Decisions and Uncertainty, Comparing Samples, A/B Testing, Causality.
Unit V
Estimation, Confidence Intervals, Inference for Regression, Classification, Graphical Models, Prediction, Updating Predictions.
Text Books:

1. Ani Adhikari and John DeNero, "Computational and Inferential Thinking: The Foundations of Data Science", e-book.

Reference Books:

1. Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, Kenneth C. Lichtendahl Jr., "Data Mining for Business Analytics: Concepts, Techniques and Applications in R", Wiley India, 2018.

2. Rachel Schutt and Cathy O'Neil, "Doing Data Science", O'Reilly, First Edition, 2013.


Unit V

1. Estimation

Estimation refers to the process by which one makes inferences about a population, based on
information obtained from a sample.

Statisticians use sample statistics to estimate population parameters. For example, sample means are
used to estimate population means; sample proportions, to estimate population proportions.

Example: when the population is large

If the population is very large – for example, if it consists of the incomes of all the households in the United States – then it might be too expensive and time-consuming to gather data from the entire population. In such situations, data scientists rely on sampling at random from the population.

This leads to a question of inference: How to make justifiable conclusions about the unknown
parameter, based on the data in the random sample? We will answer this question by using
inferential thinking.

A statistic based on a random sample can be a reasonable estimate of an unknown parameter in the
population. For example, you might want to use the median/mean annual income of sampled
households as an estimate of the median/mean annual income of all households in the U.S.

Types of Estimation

An estimate of a population parameter may be expressed in two ways:

Point estimate: A point estimate of a population parameter is a single value of a statistic. For example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p is a point estimate of the population proportion P.

Interval estimate: An interval estimate is defined by two numbers, between which a population parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ. It indicates that the population mean is greater than a but less than b.
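To make the two types concrete, here is a minimal Python sketch (the simulated income population, sample size, and seed are illustrative assumptions, not data from the text):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population": incomes of 1,000,000 households (simulated, for illustration only)
population = rng.lognormal(mean=11, sigma=0.6, size=1_000_000)

# Draw a random sample and compute a point estimate of the population mean
sample = rng.choice(population, size=500, replace=False)
point_estimate = sample.mean()

# A rough 95% interval estimate: point estimate +/- 1.96 standard errors
se = sample.std(ddof=1) / np.sqrt(len(sample))
lower, upper = point_estimate - 1.96 * se, point_estimate + 1.96 * se

print("Point estimate:", round(point_estimate, 2))
print("Interval estimate: [{:.2f}, {:.2f}]".format(lower, upper))
print("True population mean:", round(population.mean(), 2))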

2. Confidence Interval

A confidence interval is an estimate plus and minus the variation in that estimate: the range of values the estimate is expected to fall within if the test is redone, at a given level of confidence.


For example:

Significance level: α = 0.05

Confidence level = 1 − α = 0.95, or 95%

Confidence interval = X̄ ± Z_(α/2) · σ/√n

Note: see the worked example below for a confidence interval.

3. Estimating population means using confidence intervals

 A population mean is the average of a numerical population variable.

 Confidence intervals are used to estimate population means.

Types of population estimation:

I. One Sample – Known Variance

II. One Sample – Unknown Variance

III. Two Samples – Known Variance

IV. Two Samples – Unknown Variance


Example (One Sample – Unknown Variance: σ is estimated by the sample standard deviation s, so the t-distribution is used)

For example:

Population: Nobel Prize winners

Variable: Age when they received the Nobel Prize

We can take a sample and calculate the mean and the standard deviation of that sample.

The sample data is used to make an estimation of the average age of all the Nobel Prize winners.

By randomly selecting 30 Nobel Prize winners we could find that:

The mean age in the sample is 62.1

The standard deviation of age in the sample is 13.46

Solution

The following steps are used to calculate a confidence interval for population mean:

 Check the conditions

 Find the point estimate

 Decide the confidence level

 Calculate the margin of error

 Calculate the confidence interval

i. Check the conditions

The conditions for calculating a confidence interval for a mean are:

The sample is randomly selected, and the population data is approximately normally distributed.

Because σ is unknown and the sample size is small (n ≤ 30), the t-distribution (t-test) is used.

ii. Finding the Point Estimate


The point estimate is the sample mean (x̄).

The sample mean is the sum of all the values divided by the sample size.

Sample mean (x̄) = 62.1, standard deviation (s) = 13.46, sample size (n) = 30.

In our example, the mean age in the sample was 62.1.

iii. Deciding the Confidence Level

The confidence level is expressed as a percentage or a decimal number.

For example, if the confidence level is 95% or 0.95:

The remaining probability (α) is then 5%, or 1 − 0.95 = 0.05.

Degrees of freedom = n − 1 = 30 − 1 = 29.

Here, σ is estimated from a sample of size 30, so Student's t-distribution is chosen.

iv. Calculating the Margin of Error

The margin of error is the difference between the point estimate and the lower and upper bounds:

Margin of Error (E) = t_(α/2)(df) · s/√n

t_(α/2)(df) is the critical value from the t-distribution for the chosen confidence level and degrees of freedom.

s/√n is the standard error, calculated from the sample standard deviation (s) and the sample size (n).

In our example, with a sample standard deviation (s) of 13.46 and a sample size of 30:

If we choose 95% as the confidence level, then α = 0.05.

So we need to find t_(α/2)(df) = t_0.025(29).

With Python, use the SciPy Stats library's t.ppf() (percent point function) to find the t-value for α/2 = 0.025 and 29 degrees of freedom:

import scipy.stats as stats

print(abs(stats.t.ppf(0.025, 29)))

Output:

t_0.025(29) ≈ 2.045 (rounded to 2.05 in the hand calculation below)

v. Calculate the Confidence Interval

The lower and upper bounds of the confidence interval are found by subtracting and adding the margin of error (E) to the point estimate (x̄). With t ≈ 2.05, E = 2.05 × 13.46/√30 ≈ 5.0389.

The lower bound is: x̄ − E = 62.1 − 5.0389 ≈ 57.06

The upper bound is: x̄ + E = 62.1 + 5.0389 ≈ 67.14

The confidence interval is: [57.06, 67.14]

Conclusion

We can summarize the confidence interval by stating:

The 95% confidence interval for the mean age of Nobel Prize winners is between 57.06 and 67.14 years; that is, the population mean is estimated to fall between 57.06 and 67.14. (The Python output below uses the unrounded t-value 2.045, giving the slightly tighter interval [57.074, 67.126].)

Python implementation:

import scipy.stats as stats
import math
import numpy as np

# Specify sample mean (x_bar), sample standard deviation (s), sample size (n) and confidence level
x_bar = 62.1
s = 13.46
n = 30
confidence_level = 0.95

# Calculate alpha, degrees of freedom (df), the critical t-value, and the margin of error
alpha = (1 - confidence_level)
df = n - 1
standard_error = s / math.sqrt(n)
t_score = np.abs(stats.t.ppf(alpha / 2, df))  # stats.norm.ppf(alpha/2) for a z-test
margin_of_error = t_score * standard_error

# Calculate the lower and upper bound of the confidence interval
lower_bound = x_bar - margin_of_error
upper_bound = x_bar + margin_of_error

# Print the results
print("T-Score: {:.3f}".format(t_score))
print("Margin of Error: {:.3f}".format(margin_of_error))
print("Confidence Interval: [{:.3f},{:.3f}]".format(lower_bound, upper_bound))
print("The {:.1%} confidence interval for the population mean is:".format(confidence_level))
print("between {:.3f} and {:.3f}".format(lower_bound, upper_bound))

Output

T-Score: 2.045

Margin of Error: 5.026

Confidence Interval: [57.074, 67.126]

The 95.0% confidence interval for the population mean is: between 57.074 and 67.126
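The comment in the code above points to stats.norm.ppf for the z-based interval. A minimal sketch of that variant is shown below; it treats 13.46 as a known population σ purely for illustration (an assumption, since the example above estimates it from the sample):

import math
import scipy.stats as stats

x_bar, sigma, n = 62.1, 13.46, 30  # sigma assumed known here, for illustration
alpha = 0.05

z_score = abs(stats.norm.ppf(alpha / 2))  # 1.96 for a 95% interval
margin_of_error = z_score * sigma / math.sqrt(n)

print("Z-Score: {:.3f}".format(z_score))
print("Confidence Interval: [{:.3f}, {:.3f}]".format(x_bar - margin_of_error, x_bar + margin_of_error))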

The following supplementary notes on one- and two-sample estimation are adapted from STAT 324 (Dr. Abdullah Al-Shiha, King Saud University).

Chapter 9: One- and Two-Sample Estimation Problems:

9.1 Introduction:
• Suppose we have a population with some unknown parameter(s). Example: Normal(μ, σ), where μ and σ are parameters.
• We need to draw conclusions (make inferences) about the unknown parameters.
• We select samples, compute some statistics, and make inferences about the unknown parameters based on the sampling distributions of the statistics.

* Statistical Inference
(1) Estimation of the parameters (Chapter 9)
→ Point Estimation
→ Interval Estimation (Confidence Interval)
(2) Tests of hypotheses about the parameters (Chapter 10)

9.3 Classical Methods of Estimation:

Point Estimation:
A point estimate of some population parameter θ is a single value θ̂ of a statistic Θ̂. For example, the value x̄ of the statistic X̄ computed from a sample of size n is a point estimate of the population mean μ.

Interval Estimation (Confidence Interval = C.I.):
An interval estimate of some population parameter θ is an interval of the form (θ̂_L, θ̂_U), i.e., θ̂_L < θ < θ̂_U. This interval contains the true value of θ "with probability 1−α", that is, P(θ̂_L < θ < θ̂_U) = 1−α.
• (θ̂_L, θ̂_U) = θ̂_L < θ < θ̂_U is called a (1−α)100% confidence interval (C.I.) for θ.
• 1−α is called the confidence coefficient.
• θ̂_L = lower confidence limit
• θ̂_U = upper confidence limit
• Typical values: α = 0.1, 0.05, 0.025, 0.01 (0 < α < 1)

9.4 Single Sample: Estimation of the Mean (μ):

Recall:
• E(X̄) = μ_X̄ = μ
• Var(X̄) = σ²_X̄ = σ²/n
• X̄ ~ N(μ, σ/√n)
• Z = (X̄ − μ)/(σ/√n) ~ N(0,1)  (σ² is known)
• T = (X̄ − μ)/(S/√n) ~ t(n−1)  (σ² is unknown)
• We use the sampling distribution of X̄ to make inferences about μ.

Notation:
Z_a is the Z-value leaving an area of a to the right; i.e., P(Z > Z_a) = a or, equivalently, P(Z < Z_a) = 1−a.

Point Estimation of the Mean (μ):

• The sample mean X̄ = ΣXᵢ/n is a "good" point estimate for μ.

Interval Estimation (Confidence Interval) of the Mean (μ):
(i) First Case: σ² is known:
Result:
If X̄ = ΣXᵢ/n is the sample mean of a random sample of size n from a population (distribution) with mean μ and known variance σ², then a (1−α)100% confidence interval for μ is:
(X̄ − Z_(α/2)·σ/√n , X̄ + Z_(α/2)·σ/√n)
⇔ X̄ ± Z_(α/2)·σ/√n
⇔ X̄ − Z_(α/2)·σ/√n < μ < X̄ + Z_(α/2)·σ/√n
where Z_(α/2) is the Z-value leaving an area of α/2 to the right; i.e., P(Z > Z_(α/2)) = α/2 or, equivalently, P(Z < Z_(α/2)) = 1 − α/2.
Note:
We are (1−α)100% confident that μ ∈ (X̄ − Z_(α/2)·σ/√n , X̄ + Z_(α/2)·σ/√n).

Example 9.2:
The average zinc concentration recorded from a sample of zinc measurements in 36 different locations is found to be 2.6 gram/milliliter. Find a 95% and a 99% confidence interval (C.I.) for the mean zinc concentration in the river. Assume that the population standard deviation is 0.3.
Solution:
μ = the mean zinc concentration in the river (unknown parameter).
Population: μ = ??, σ = 0.3. Sample: n = 36, X̄ = 2.6.

First, a point estimate for μ is X̄ = 2.6.

(a) We want to find a 95% C.I. for μ.
95% = (1−α)100% ⇔ 0.95 = 1−α ⇔ α = 0.05 ⇔ α/2 = 0.025
Z_(α/2) = Z_0.025 = 1.96
A 95% C.I. for μ is X̄ ± Z_(α/2)·σ/√n:
2.6 − (1.96)(0.3/√36) < μ < 2.6 + (1.96)(0.3/√36)
⇔ 2.6 − 0.098 < μ < 2.6 + 0.098
⇔ 2.502 < μ < 2.698
⇔ μ ∈ (2.502 , 2.698)
We are 95% confident that μ ∈ (2.502 , 2.698).

(b) Similarly, we can find that (Homework) a 99% C.I. for μ is
2.471 < μ < 2.729
⇔ μ ∈ (2.471 , 2.729)
We are 99% confident that μ ∈ (2.471 , 2.729).
Notice that a 99% C.I. is wider than a 95% C.I.
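Example 9.2 can be checked numerically with SciPy (a sketch; stats.norm.interval returns the lower and upper bounds of the z-based interval):

import math
import scipy.stats as stats

x_bar, sigma, n = 2.6, 0.3, 36
se = sigma / math.sqrt(n)

# z-based confidence intervals at the 95% and 99% levels
for level in (0.95, 0.99):
    lower, upper = stats.norm.interval(level, loc=x_bar, scale=se)
    print("{:.0%} C.I.: ({:.3f}, {:.3f})".format(level, lower, upper))

# Expected output: (2.502, 2.698) and (2.471, 2.729)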
Note:
The error of estimation is the distance |X̄ − μ|. Since μ lies between X̄ − Z_(α/2)·σ/√n and X̄ + Z_(α/2)·σ/√n, this error is at most Z_(α/2)·σ/√n.

Theorem 9.1:
If X̄ is used as an estimate of μ, we can then be (1−α)100% confident that the error (in estimation) will not exceed Z_(α/2)·σ/√n.


Note:
Maximum error of estimation = Z_(α/2)·σ/√n, with (1−α)100% confidence.

Example:
In Example 9.2, we are 95% confident that the sample mean X̄ = 2.6 differs from the true mean μ by an amount less than Z_(α/2)·σ/√n = (1.96)(0.3/√36) = 0.098.

Note:
Let e be the maximum amount of the error, that is, e = Z_(α/2)·σ/√n. Then:
e = Z_(α/2)·σ/√n ⇔ √n = Z_(α/2)·σ/e ⇔ n = (Z_(α/2)·σ/e)²

Theorem 9.2:
If X̄ is used as an estimate of μ, we can then be (1−α)100% confident that the error (in estimation) will not exceed a specified amount e when the sample size is n = (Z_(α/2)·σ/e)².

Note:
1. All fractional values of n = (Z_(α/2)·σ/e)² are rounded up to the next whole number.
2. If σ is unknown, we could take a preliminary sample of size n ≥ 30 to provide an estimate of σ. Then, using S = √(Σ(Xᵢ − X̄)²/(n−1)) as an approximation for σ in Theorem 9.2, we could determine approximately how many observations are needed to provide the desired degree of accuracy.

Example 9.3:
How large a sample is required in Example 9.2 if we want to be 95% confident that our estimate of μ is off by less than 0.05?
Solution:
We have σ = 0.3, Z_(α/2) = 1.96, e = 0.05. Then by Theorem 9.2,
n = (Z_(α/2)·σ/e)² = (1.96 × 0.3/0.05)² = 138.3 ≈ 139
Therefore, we can be 95% confident that a random sample of size n = 139 will provide an estimate X̄ differing from μ by an amount less than e = 0.05.
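The sample-size rule of Theorem 9.2 is easy to script; below is a minimal sketch (math.ceil implements the round-up rule of Note 1):

import math
import scipy.stats as stats

sigma, e, alpha = 0.3, 0.05, 0.05

z = abs(stats.norm.ppf(alpha / 2))   # Z_(alpha/2) = 1.96
n = math.ceil((z * sigma / e) ** 2)  # round fractional n up to the next whole number

print(n)  # 139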

Interval Estimation (Confidence Interval) of the Mean (μ):

(ii) Second Case: σ² is unknown:
Recall:
• T = (X̄ − μ)/(S/√n) ~ t(n−1)
Result:
If X̄ = ΣXᵢ/n and S = √(Σ(Xᵢ − X̄)²/(n−1)) are the sample mean and the sample standard deviation of a random sample of size n from a normal population (distribution) with unknown variance σ², then a (1−α)100% confidence interval for μ is:
(X̄ − t_(α/2)·S/√n , X̄ + t_(α/2)·S/√n)
⇔ X̄ ± t_(α/2)·S/√n
⇔ X̄ − t_(α/2)·S/√n < μ < X̄ + t_(α/2)·S/√n
where t_(α/2) is the t-value with ν = n−1 degrees of freedom leaving an area of α/2 to the right; i.e., P(T > t_(α/2)) = α/2 or, equivalently, P(T < t_(α/2)) = 1 − α/2.

Example 9.4:
The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and 9.6 liters. Find a 95% C.I. for the mean of all such containers, assuming an approximately normal distribution.
Solution:
n = 7, X̄ = ΣXᵢ/n = 10.0, S = √(Σ(Xᵢ − X̄)²/(n−1)) = 0.283

First, a point estimate for μ is X̄ = 10.0.

Now, we need to find a confidence interval for μ.
95% = (1−α)100% ⇔ 0.95 = 1−α ⇔ α = 0.05 ⇔ α/2 = 0.025
t_(α/2) = t_0.025 = 2.447 (with ν = n−1 = 6 degrees of freedom)
A 95% C.I. for μ is X̄ ± t_(α/2)·S/√n:
10.0 − (2.447)(0.283/√7) < μ < 10.0 + (2.447)(0.283/√7)
⇔ 10.0 − 0.262 < μ < 10.0 + 0.262
⇔ 9.74 < μ < 10.26
⇔ μ ∈ (9.74 , 10.26)
We are 95% confident that μ ∈ (9.74 , 10.26).
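Example 9.4 can be reproduced with SciPy's built-in t-interval (a sketch; stats.t.interval and stats.sem are standard scipy.stats functions):

import numpy as np
import scipy.stats as stats

data = np.array([9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6])
x_bar = data.mean()
se = stats.sem(data)  # S/sqrt(n)

# t-based C.I. with n-1 = 6 degrees of freedom
lower, upper = stats.t.interval(0.95, df=len(data) - 1, loc=x_bar, scale=se)
print("({:.2f}, {:.2f})".format(lower, upper))  # approximately (9.74, 10.26)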

9.5 Standard Error of a Point Estimate:

• The standard error of an estimator is its standard deviation.
• We use X̄ = ΣXᵢ/n as a point estimator of μ, and we use the sampling distribution of X̄ to make a (1−α)100% C.I. for μ.
• The standard deviation of X̄, which is σ_X̄ = σ/√n, is called the standard error of X̄. We write s.e.(X̄) = σ/√n.
• Note: a (1−α)100% C.I. for μ, when σ² is known, is X̄ ± Z_(α/2)·σ/√n = X̄ ± Z_(α/2)·s.e.(X̄).
• Note: a (1−α)100% C.I. for μ, when σ² is unknown and the distribution is normal, is X̄ ± t_(α/2)·S/√n = X̄ ± t_(α/2)·ŝ.e.(X̄), with ν = n−1 degrees of freedom.

9.7 Two Samples: Estimating the Difference between Two Means (μ₁ − μ₂):

Recall: For two independent samples:
• μ_(X̄₁−X̄₂) = μ₁ − μ₂ = E(X̄₁ − X̄₂)
• σ²_(X̄₁−X̄₂) = σ₁²/n₁ + σ₂²/n₂ = Var(X̄₁ − X̄₂)
• σ_(X̄₁−X̄₂) = √(σ₁²/n₁ + σ₂²/n₂)
• Z = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂) ~ N(0,1)

Point Estimation of μ₁ − μ₂:

• X̄₁ − X̄₂ is a "good" point estimate for μ₁ − μ₂.

Confidence Interval of μ₁ − μ₂:

(i) First Case: σ₁² and σ₂² are known:
• Z = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂) ~ N(0,1)
• Result: a (1−α)100% confidence interval for μ₁ − μ₂ is:
(X̄₁ − X̄₂) − Z_(α/2)·√(σ₁²/n₁ + σ₂²/n₂) < μ₁ − μ₂ < (X̄₁ − X̄₂) + Z_(α/2)·√(σ₁²/n₁ + σ₂²/n₂)
or (X̄₁ − X̄₂) ± Z_(α/2)·√(σ₁²/n₁ + σ₂²/n₂)

(ii) Second Case: σ₁² = σ₂² = σ² is unknown:
• If σ₁² and σ₂² are unknown but σ₁² = σ₂² = σ², then the pooled estimate of σ² is
S_p² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2)
where S₁² is the variance of the 1st sample and S₂² is the variance of the 2nd sample. The degrees of freedom of S_p² is ν = n₁ + n₂ − 2.
• Result: a (1−α)100% confidence interval for μ₁ − μ₂ is:
(X̄₁ − X̄₂) − t_(α/2)·√(S_p²/n₁ + S_p²/n₂) < μ₁ − μ₂ < (X̄₁ − X̄₂) + t_(α/2)·√(S_p²/n₁ + S_p²/n₂)
or (X̄₁ − X̄₂) − t_(α/2)·S_p·√(1/n₁ + 1/n₂) < μ₁ − μ₂ < (X̄₁ − X̄₂) + t_(α/2)·S_p·√(1/n₁ + 1/n₂)
or (X̄₁ − X̄₂) ± t_(α/2)·S_p·√(1/n₁ + 1/n₂)
where t_(α/2) is the t-value with ν = n₁ + n₂ − 2 degrees of freedom.
Example 9.6: (1st Case: σ₁² and σ₂² are known)
An experiment was conducted in which two types of engines, A and B, were compared. Gas mileage in miles per gallon was measured. 50 experiments were conducted using engine type A and 75 experiments were done for engine type B. The gasoline used and other conditions were held constant. The average gas mileage for engine A was 36 miles per gallon and the average for engine B was 42 miles per gallon. Find a 96% confidence interval for μB − μA, where μA and μB are the population mean gas mileages for engines A and B, respectively. Assume that the population standard deviations are 6 and 8 for engines A and B, respectively.
Solution:
Engine A: nA = 50, X̄A = 36, σA = 6
Engine B: nB = 75, X̄B = 42, σB = 8
A point estimate for μB − μA is X̄B − X̄A = 42 − 36 = 6.

Confidence interval:
96% = (1−α)100% ⇔ 0.96 = 1−α ⇔ α = 0.04 ⇔ α/2 = 0.02
Z_(α/2) = Z_0.02 = 2.05
A 96% C.I. for μB − μA is (X̄B − X̄A) ± Z_(α/2)·√(σB²/nB + σA²/nA):
(42 − 36) ± (2.05)·√(64/75 + 36/50)
6 ± 2.571
3.43 < μB − μA < 8.57
We are 96% confident that μB − μA ∈ (3.43, 8.57).
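A quick numerical check of Example 9.6 (a sketch; note that the notes round Z_0.02 to 2.05, while SciPy gives 2.054, so the bounds differ slightly in the second decimal):

import math
import scipy.stats as stats

n_A, x_A, sigma_A = 50, 36, 6
n_B, x_B, sigma_B = 75, 42, 8
alpha = 0.04

z = abs(stats.norm.ppf(alpha / 2))                   # Z_0.02 ≈ 2.054
se = math.sqrt(sigma_B**2 / n_B + sigma_A**2 / n_A)  # sqrt(64/75 + 36/50)
diff = x_B - x_A

print("({:.2f}, {:.2f})".format(diff - z * se, diff + z * se))  # ≈ (3.42, 8.58)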

Example 9.7: (2nd Case: σ₁² = σ₂², unknown) – Reading Assignment

Example: (2nd Case: σ₁² = σ₂², unknown)

To compare the resistance of wire A with that of wire B, an experiment shows the following results based on two independent samples (original data multiplied by 1000):
Wire A: 140, 138, 143, 142, 144, 137
Wire B: 135, 140, 136, 142, 138, 140
Assuming equal variances, find a 95% confidence interval for μA − μB, where μA (μB) is the mean resistance of wire A (B).

Solution:
Wire A: nA = 6, X̄A = 140.67, SA² = 7.86690
Wire B: nB = 6, X̄B = 138.50, SB² = 7.10009
A point estimate for μA − μB is X̄A − X̄B = 140.67 − 138.50 = 2.17.

Confidence interval:
95% = (1−α)100% ⇔ 0.95 = 1−α ⇔ α = 0.05 ⇔ α/2 = 0.025
ν = df = nA + nB − 2 = 10
t_(α/2) = t_0.025 = 2.228
S_p² = [(nA − 1)SA² + (nB − 1)SB²] / (nA + nB − 2) = [(6−1)(7.86690) + (6−1)(7.10009)] / (6+6−2) = 7.4835
S_p = √7.4835 = 2.7356
A 95% C.I. for μA − μB is (X̄A − X̄B) ± t_(α/2)·S_p·√(1/nA + 1/nB):
(140.67 − 138.50) ± (2.228)(2.7356)·√(1/6 + 1/6)
2.17 ± 3.5189
−1.35 < μA − μB < 5.69
We are 95% confident that μA − μB ∈ (−1.35, 5.69).
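As a cross-check, the pooled-variance interval for the wire example can be computed directly (a sketch using NumPy and SciPy; the arithmetic follows the formulas above):

import math
import numpy as np
import scipy.stats as stats

wire_A = np.array([140, 138, 143, 142, 144, 137])
wire_B = np.array([135, 140, 136, 142, 138, 140])

n_A, n_B = len(wire_A), len(wire_B)
s2_A, s2_B = wire_A.var(ddof=1), wire_B.var(ddof=1)

# pooled variance and its degrees of freedom
df = n_A + n_B - 2
s_p = math.sqrt(((n_A - 1) * s2_A + (n_B - 1) * s2_B) / df)

t = stats.t.ppf(0.975, df)  # t_0.025(10) ≈ 2.228
diff = wire_A.mean() - wire_B.mean()
moe = t * s_p * math.sqrt(1 / n_A + 1 / n_B)

print("({:.2f}, {:.2f})".format(diff - moe, diff + moe))  # ≈ (-1.35, 5.69)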



4. Regression analysis

Regression is defined as a statistical method that helps us to analyze and understand the relationship
between two or more variables of interest.

Regression is a statistical method used in finance, investing, and other disciplines that attempts to
determine the strength and character of the relationship between one dependent variable (usually
denoted by Y) and a series of other variables (known as independent variables).

In regression, we normally have one dependent variable and one or more independent variables.
Here we try to “regress” the value of the dependent variable “Y” with the help of the independent
variables.

Applications of Regression

Regression analysis is used for prediction and forecasting, and has substantial overlap with the field of machine learning. This statistical method is used across different industries, such as:

Financial industry – Understand trends in stock prices, forecast prices, and evaluate risks in the insurance domain.

Marketing – Understand the effectiveness of marketing campaigns, and forecast pricing and sales of products.

Manufacturing – Evaluate the relationships among design variables in order to build engines with better performance.

Medicine – Evaluate different combinations of medicines, for example when developing generic medicines for diseases.

i. Linear Regression

The simplest of all regression types is Linear Regression which tries to establish relationships
between Independent and Dependent variables. Linear Regression is a predictive model used for
finding the linear relationship between a dependent variable and one or more independent variables.

Linear regression establishes the linear relationship between two variables based on a line of best fit.
Linear regression is thus graphically depicted using a straight line with the slope defining how the
change in one variable impacts a change in the other.


Examples of Independent & Dependent Variables:

First, x is Rainfall and y is Crop Yield.

Second, x is Advertising Expense and y is Sales.

Simple Linear Regression

Simple linear regression relates a dependent variable Y_Predict to a single independent variable x:

Y_Predict = b0 + b1·x

Slope (b1) = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]

Intercept (b0) = (Σy − b1·Σx) / n

Where,

Y_Predict = the predicted value of the dependent variable

x = the independent variable

b0 = the y-intercept

b1 = the slope of the independent variable

u = the regression residual or error term (in the general model Y = b0 + b1·x + u)

Example:

Note that the observed (x, y) data points fall directly on a line: the linear relationship between Fahrenheit and Celsius. As you may remember, the relationship between degrees Fahrenheit and degrees Celsius is known to be:

F = (9/5)C + 32, or Y = 32 + (9/5)X

Here,

Y = F

X = C

b0 = 32

b1 = 9/5


That is, if you know the temperature in degrees Celsius, you can use this equation to determine the
temperature in degrees Fahrenheit exactly.

Multiple Linear Regression

If more than one independent variable is related to the dependent variable, the model is called multiple linear regression.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable (e.g., how rainfall, temperature, and amount of fertilizer added affect crop growth):

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
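A minimal sketch of fitting a multiple linear regression with scikit-learn (the rainfall, temperature, and fertilizer numbers below are made up purely to illustrate the API):

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: columns = rainfall, temperature, fertilizer (made-up values)
X = np.array([[100, 20, 5],
              [120, 22, 6],
              [90, 19, 4],
              [130, 25, 7],
              [110, 21, 5]])
y = np.array([30, 36, 27, 41, 33])  # crop yield (made-up)

model = LinearRegression().fit(X, y)
print("Intercept (a):", model.intercept_)
print("Coefficients (b1, b2, b3):", model.coef_)
print("Prediction for [115, 23, 6]:", model.predict([[115, 23, 6]]))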

Python
In the example below, the x-axis represents age, and the y-axis represents speed. We have registered
the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data we collected
could be used in a linear regression:

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]

y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

Code:

import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]

y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)

plt.show()

import numpy as np

import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = n*np.sum(y*x) - (np.sum(x)*np.sum(y))
    SS_xx = n*np.sum(x*x) - (np.sum(x)*np.sum(x))

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = (np.sum(y) - b_1*np.sum(x))/n

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

if __name__ == "__main__":
    x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
    y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nIntercept - b_0 = {} \nSlope - b_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

Output:

Estimated coefficients:

Intercept - b_0 = 103.10596026490065

Slope - b_1 = -1.75128771155261
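As a sanity check, the same intercept and slope can be obtained with scipy.stats.linregress (a standard SciPy routine, shown here as an alternative rather than part of the original example):

import numpy as np
from scipy import stats

x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

result = stats.linregress(x, y)
print("Intercept:", result.intercept)  # ~103.106
print("Slope:", result.slope)          # ~-1.751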

5. Classification

Classification is a process of categorizing data or objects into predefined classes or categories based
on their features or attributes. In machine learning, classification is a type of supervised learning
technique where an algorithm is trained on a labeled dataset to predict the class or category of new,
unseen data.

5.1 Types of Classification

Classification is of two types:

Binary Classification: In binary classification, the goal is to classify the input into one of two
classes or categories. Example – On the basis of the given health conditions of a person, we have to
determine whether the person has a certain disease or not.

Multiclass Classification: In multiclass classification, the goal is to classify the input into one of several classes or categories. For example – on the basis of data about different species of flowers, we have to determine which species our observation belongs to.

5.2 Types of classification algorithms

There are various types of classifiers. Some of them are:

 Linear Classifiers: Linear models create a linear decision boundary between classes. They are simple and computationally efficient. Some of the linear classification models are as follows:

 Logistic Regression

 Support Vector Machines having kernel = 'linear'

 Single-layer Perceptron

 Non-linear Classifiers: Non-linear models create a non-linear decision boundary between classes. They can capture more complex relationships between the input features and the target variable. Some of the non-linear classification models are as follows:

 K-Nearest Neighbours

 Kernel SVM

 Naive Bayes

 Random Forests

 Multi-layer Artificial Neural Networks

5.3 Classification Process

The classification process typically involves the following steps:

1. Understanding the problem: Before getting started with classification, it is important to understand the problem you are trying to solve. What are the class labels you are trying to predict? What is the relationship between the input data and the class labels?

 Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7 independent variables, called features. This means there can be only two possible outcomes:

 The patient has the disease, which means "True".

 The patient does not have the disease, which means "False".

 This is a binary classification problem.

2. Data preparation: Once you have a good understanding of the problem, the next step is to
prepare your data. This includes collecting and preprocessing the data and splitting it into
training, validation, and test sets. In this step, the data is cleaned, preprocessed, and
transformed into a format that can be used by the classification algorithm.

 X: the independent features, in the form of an N×M matrix, where N is the number of observations and M is the number of features.

 y: an N-vector containing the target class for each of the N observations.

3. Feature Extraction: The relevant features or attributes are extracted from the data that can
be used to differentiate between the different classes.

 Suppose our input X has 7 independent features, but only 5 of them influence the label or target values, while the remaining 2 are negligibly correlated or uncorrelated; then we will use only those 5 features for model training.

4. Model Selection: There are many different models that can be used for classification,
including logistic regression, decision trees, support vector machines (SVM), or neural
networks. It is important to select a model that is appropriate for your problem, taking into
account the size and complexity of your data, and the computational resources you have
available.

5. Model Training: Once you have selected a model, the next step is to train it on your training
data. This involves adjusting the parameters of the model to minimize the error between the
predicted class labels and the actual class labels for the training data.


6. Model Evaluation: After training the model, it is important to evaluate its performance on a validation set. This will give you a good idea of how well the model is likely to perform on new, unseen data.

 Log Loss or Cross-Entropy Loss, Confusion Matrix, Precision, Recall, and AUC-
ROC curve are the quality metrics used for measuring the performance of the model.

7. Fine-tuning the model: If the model’s performance is not satisfactory, you can fine-tune it
by adjusting the parameters, or trying a different model.

8. Deploying the model: Finally, once we are satisfied with the performance of the model, we can deploy it to make predictions on test data, so that it can be used for real-world problems.

5.4. Classification model Evaluations

Here are some commonly used evaluation metrics:

Classification Accuracy: The proportion of correctly classified instances over the total number of
instances in the test set. It is a simple and intuitive metric but can be misleading in imbalanced
datasets where the majority class dominates the accuracy score.

Confusion matrix: A table that shows the number of true positives, true negatives, false positives,
and false negatives for each class, which can be used to calculate various evaluation metrics.

Precision and Recall: Precision measures the proportion of true positives over the total number of predicted positives, while recall measures the proportion of true positives over the total number of actual positives. These metrics are useful in scenarios where one class is more important than the other, or when there is a trade-off between false positives and false negatives.

F1-Score: The harmonic mean of precision and recall, calculated as 2 × (precision × recall) / (precision + recall). It is a useful metric for imbalanced datasets where both precision and recall are important.
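The sketch below shows how these metrics can be computed with scikit-learn's standard metrics API (the label vectors are made up purely for illustration):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # model predictions (illustrative)

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # TP / predicted positives
print("Recall:", recall_score(y_true, y_pred))        # TP / actual positives
print("F1-Score:", f1_score(y_true, y_pred))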

5.5. Applications of Classification Algorithm

Classification algorithms are widely used in many real-world applications across various domains,
including:

 Email spam filtering

 Credit risk assessment

 Medical diagnosis

 Image classification

 Sentiment analysis.

 Fraud detection

SVM based Classification using Python

# Importing the required libraries

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from sklearn import datasets

from sklearn import svm

from sklearn.tree import DecisionTreeClassifier

from sklearn.naive_bayes import GaussianNB

# import the iris dataset

iris = datasets.load_iris()

X = iris.data #input parameters


y = iris.target # output parameter

# splitting X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size=0.6, random_state=1)

print("X_train shape:", X_train.shape)

print("y_train shape:", y_train.shape)

print("X_test shape:", X_test.shape)

print("y_test.shape",y_test.shape)

# SUPPORT VECTOR MACHINE

svm_clf = svm.SVC(kernel='linear') # Linear Kernel

# train the model

svm_clf.fit(X_train, y_train)

# make predictions

svm_clf_pred = svm_clf.predict(X_test)

print("Obtained classification results while testing:",svm_clf_pred)

print("Actual classification results:", y_test)

# print the accuracy

print("Accuracy of Support Vector Machine: ",

accuracy_score(y_test, svm_clf_pred))

# print other performance metrics

print("Precision of Support Vector Machine: ", precision_score(y_test, svm_clf_pred,

19
Foundation of Data Science/22CSC202- Learning Materials Unit V

average='weighted'))

print("Recall of Support Vector Machine: ", recall_score(y_test, svm_clf_pred, average='weighted'))

print("F1-Score of Support Vector Machine: ", f1_score(y_test, svm_clf_pred, average='weighted'))

Output

X_train shape: (60, 4)

y_train shape: (60,)

X_test shape: (90, 4)

y_test.shape (90,)

Obtained classification results while testing: [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0
1 1 2 1 2 1 0 0 0 2 0 2 2 2 0 0]

Actual classification results: [0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2 1 2 1 2 2 0 1
0 1 2 2 0 2 2 1 2 0 0 0 1 0 0 2 2 2 2 2 1 2 1 0 2 2 0 0 2 0 2 2 1 1 2 2 0
1 1 2 1 2 1 0 0 0 2 0 1 2 2 0 0]

Accuracy of Support Vector Machine: 0.9888888888888889

Precision of Support Vector Machine: 0.9892255892255892

Recall of Support Vector Machine: 0.9888888888888889

F1-Score of Support Vector Machine: 0.988873348873349

Other Classifiers

# NAIVE BAYES CLASSIFIERS

gnb = GaussianNB()

# train the model

gnb.fit(X_train, y_train)


# make predictions

gnb_pred = gnb.predict(X_test)

# print the accuracy

print("Accuracy of Gaussian Naive Bayes: ", accuracy_score(y_test, gnb_pred))

# print other performance metrics

print("Precision of Gaussian Naive Bayes: ", precision_score(y_test, gnb_pred, average='weighted'))

print("Recall of Gaussian Naive Bayes: ", recall_score(y_test, gnb_pred, average='weighted'))

print("F1-Score of Gaussian Naive Bayes: ", f1_score(y_test, gnb_pred, average='weighted'))

# DECISION TREE CLASSIFIER

dt = DecisionTreeClassifier(random_state=0)

# train the model

dt.fit(X_train, y_train)

# make predictions

dt_pred = dt.predict(X_test)

# print the accuracy

print("Accuracy of Decision Tree Classifier: ", accuracy_score(y_test, dt_pred))

# print other performance metrics

print("Precision of Decision Tree Classifier: ", precision_score(y_test, dt_pred, average='weighted'))

print("Recall of Decision Tree Classifier: ", recall_score(y_test, dt_pred, average='weighted'))

print("F1-Score of Decision Tree Classifier: ", f1_score(y_test, dt_pred, average='weighted'))

6. Prediction

The term predictive analytics refers to the use of statistics and modeling techniques to make predictions about future outcomes and performance. Predictive analytics looks at current and historical data patterns to determine if those patterns are likely to emerge again. This allows businesses and investors to adjust where they use their resources to take advantage of possible future events.

Predictive analytics is a form of technology that makes predictions about certain unknowns in the
future. It draws on a series of techniques to make these determinations, including artificial
intelligence (AI), data mining, machine learning, modeling, and statistics. Predictive models are used
for all kinds of applications, including weather forecasts, creating video games, translating voice to
text, customer service, and investment portfolio strategies.

6.1 Applications

Forecasting

Forecasting is essential in manufacturing because it ensures the optimal utilization of resources in a supply chain. Critical spokes of the supply chain wheel, whether it is inventory management or the shop floor, require accurate forecasts for functioning.

Credit Score

Credit scoring makes extensive use of predictive analytics. When a consumer or business applies for
credit, data on the applicant's credit history and the credit record of borrowers with similar
characteristics are used to predict the risk that the applicant might fail to perform on any credit
extended.

Fraud Detection

Financial services can use predictive analytics to examine transactions, trends, and patterns. If any of
this activity appears irregular, an institution can investigate it for fraudulent activity. This may be
done by analyzing activity between bank accounts or analyzing when certain transactions occur.

Supply Chain

Supply chain analytics is used to predict and manage inventory levels and pricing strategies. Supply
chain predictive analytics use historical data and statistical models to forecast future supply chain
performance, demand, and potential disruptions. This helps businesses proactively identify and
address risks, optimize resources and processes, and improve decision-making. These steps allow
companies to forecast what materials will be on hand at any given moment and whether there will be
any shortages.


Human Resources

Human resources teams use predictive analytics to improve various processes, such as forecasting future workforce needs and skills requirements, or analyzing employee data to identify factors that contribute to high turnover rates. Predictive analytics can also analyze an employee's performance, skills, and preferences to predict their career progression and help with career development planning, in addition to supporting diversity and inclusion initiatives.

6.2. Prediction Models

There are three common techniques used in predictive analytics: Decision trees, neural networks,
and regression. Read more about each of these below.

Regression

This is the model that is used the most in statistical analysis. Use it when you want to determine
patterns in large sets of data and when there's a linear relationship between the inputs. This method
works by figuring out a formula, which represents the relationship between all the inputs found in
the dataset. For example, you can use regression to figure out how price and other key factors can
shape the performance of a security.

Neural Networks

Neural networks were developed as a form of predictive analytics by imitating the way the human
brain works. This model can deal with complex data relationships using artificial intelligence and
pattern recognition. Use it if you have several hurdles that you need to overcome like when you have
too much data on hand, when you don't have the formula you need to help you find a relationship
between the inputs and outputs in your dataset, or when you need to make predictions rather than
come up with explanations.

Decision Trees

Decision trees are the simplest models because they're easy to understand and dissect. They're also
very useful when you need to make a decision in a short period of time.

Python code for prediction using Decision Tree:

import pandas

BIKE = pandas.read_csv("day.csv")


bike = BIKE.drop(['dteday'], axis=1)

categorical_col_updated = ['season','yr','mnth','weathersit','holiday']

bike = pandas.get_dummies(bike, columns = categorical_col_updated)

#Separating the dependent and independent data variables into two data frames.

from sklearn.model_selection import train_test_split

X = bike.drop(['cnt'],axis=1)

Y = bike['cnt']

# Splitting the dataset into 80% training data and 20% testing data.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.20, random_state=0)

print("X train input:", X_train.shape)

print("Y Train input:", Y_train.shape)

print("X Test input", X_test.shape)

print("Y Test input",Y_test.shape)

Output

X train input: (584, 32)
Y Train input: (584,)
X Test input (147, 32)
Y Test input (147,)

from sklearn.tree import DecisionTreeRegressor

DT_model = DecisionTreeRegressor(max_depth=5).fit(X_train, Y_train)
DT_predict = DT_model.predict(X_test) #Predictions on Testing data
print("X_test prediction model output", DT_predict.shape)


Output:
X_test prediction model output (147,)

import sklearn
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score
print("R2 Score:", sklearn.metrics.r2_score(Y_test, DT_predict))
print("Variance:", sklearn.metrics.explained_variance_score(Y_test, DT_predict))
print("mean_absolute_error", sklearn.metrics.mean_absolute_error(Y_test, DT_predict))

Output:
R2 Score: 0.9675629781060744
Variance: 0.9684464747668272
mean_absolute_error 277.66481694154714

Prediction model using KNN

from sklearn.neighbors import KNeighborsRegressor
KNN_model = KNeighborsRegressor(n_neighbors=3).fit(X_train, Y_train)
KNN_predict = KNN_model.predict(X_test)  # predictions on testing data
print("R2 Score:", sklearn.metrics.r2_score(Y_test, KNN_predict))
print("Variance:", sklearn.metrics.explained_variance_score(Y_test, KNN_predict))
print("mean_absolute_error", sklearn.metrics.mean_absolute_error(Y_test, KNN_predict))

Output:
R2 Score: 0.9963391318723084
Variance: 0.9963911261654863
mean_absolute_error 75.14739229024944
