Introduction To Statistics
INTRODUCTION
What is statistics?
Collection of data:
It is the process of measuring, gathering and assembling the raw data for a statistical
investigation, for example through
Telephone survey
Questionnaire
Personal interview
Organization of data: is the summarization of data in a meaningful way (e.g. in table form)
Analysis of data: is the process of extracting relevant information from the summarized data
Inference of data: is the way of making interpretations, conclusions and decisions about the population
CLASSIFICATION OF STATISTICS
Depending on the use of the data, statistics can be categorized into two branches
Descriptive statistics
It uses numbers to summarize and describe the data using tables, graphs, charts, …
Example: all of the students in WCU who take this course in this term
Variable: It is an item of interest that can take on many different numerical values.
TYPES OF VARIABLES
Qualitative Variables: are non-numeric variables and can't be measured.
SCALES OF MEASUREMENT
Measurement is the assignment of numbers to objects or events in a systematic fashion
Measurement scale refers to the property of value assigned to the data based on the
properties of
-Order
-Distance
-Fixed zero
The property of order exists when an object that has more of the attribute than another
object is given a bigger number by the rule system.
TYPES OF SCALES
Nominal
Ordinal
Interval
Ratio
Ordinal Scales
The property of fixed zero is not important if the property of distance is not
satisfied.
Arithmetic operations are not applicable but relational operations are applicable.
e.g. Letter grades (A, B, C, D, F), Rating scales (Excellent, Very good, Good,
Fair, Poor), Military status (…)
Interval Scales
Interval scales are measurement systems that possess the properties of Order and
distance but not the property of fixed zero.
Example:
Primary Data
Data measured or collected by the investigator or user directly from the source through
o Telephone Interview
o Mail Questionnaires
o Door-to-Door Survey
o Personal Interview
Secondary Data
i. The purpose for which the data are collected and compatible with the present
problem.
iii. Census/sampling
Note: Data which are primary for one may be secondary for the other.
DATA PRESENTATION
After collecting and editing the data, the next important step is to organize and present
the data
i. Tabular presentation
ii. Diagrammatic
TABULAR PRESENTATION
A way of presenting data in table form
Frequency distribution: is the organization of raw data in table form using classes
and frequencies
Example: a social worker collected the following data on marital status for 25 persons
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
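Solution:
We follow the procedure to construct the frequency distribution
Class   Tally      Frequency   Percent
M       //// /     6           24   (6/25 × 100 = 24)
S       //// //    7           28
D       //// //    7           28
W       ////       5           20
(Here //// stands for a completed tally group of five.)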
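The tallying step can be cross-checked with a short script. This is only a minimal sketch using Python's standard `collections.Counter`; the variable names are illustrative and not part of the course material.

```python
from collections import Counter

# Marital status of the 25 persons in the example (M, S, D, W),
# typed row by row exactly as listed.
data = ("M S D W D "
        "S S M M M "
        "W D S M M "
        "W D D S S "
        "S W W D D").split()

counts = Counter(data)          # frequency of each category
n = len(data)                   # total number of observations
percent = {k: 100 * v / n for k, v in counts.items()}

for status in "MSDW":
    print(status, counts[status], f"{percent[status]:.0f}%")
```

The percentages are just the relative frequencies multiplied by 100, so they should add up to 100.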
Ungrouped frequency distribution
Is used to present a small set of data
Is used for quantitative data, especially discrete data.
GROUPED FREQUENCY DISTRIBUTION
Is used for large data
11, 29, 6, 33, 14, 31, 22, 27, 19, 20, 18, 17, 22, 38, 23, 21, 26, 34, 39, 27
Solution:
Step 1: Find the highest and the lowest value H=39, L=6
𝐾 = 1 + 3.32 log(n)
Step 4: Find the class width; 𝑊 = 𝑅/𝑘 = 33/6 = 5.5 ≈ 6(rounding up)
The starting point is called the lower limit of the first class. Continue to add the class
width to this lower limit to get the rest of the lower limits.
Then continue to add the class width to this upper limit to find the rest of the upper
limits, e.g. the first upper class limit = 12 - 1 = 11
11, 17, 23, 29, 35, 41 are the upper class limits.
Step 8: tally the data.
Step 9: Write the numeric values for the tallies in the frequency column.
The complete frequency distribution follows:
Class     Class          Class   Tally    Freq.   Cf (less     Cf (more     rf     rcf (less
limit     boundary       mark                     than type)   than type)          than type)
6 – 11    5.5 – 11.5     8.5     //       2       2            20           0.10   0.10
12 – 17   11.5 – 17.5    14.5    //       2       4            18           0.10   0.20
18 – 23   17.5 – 23.5    20.5    //////   7       11           16           0.35   0.55
24 – 29   23.5 – 29.5    26.5    ////     4       15           9            0.20   0.75
30 – 35   29.5 – 35.5    32.5    ///      3       18           5            0.15   0.90
36 – 41   35.5 – 41.5    38.5    //       2       20           2            0.10   1.00
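The steps above (range, number of classes, class width, class limits, tallying) can be sketched in code as follows. This is an illustrative sketch, not the course's own procedure: it assumes the rule K = 1 + 3.32 log(n) with base-10 logarithms and rounding up, and all variable names are made up for the example.

```python
import math

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]

n = len(data)
high, low = max(data), min(data)            # H and L
data_range = high - low                     # R = H - L
k = math.ceil(1 + 3.32 * math.log10(n))     # number of classes, rounded up
width = math.ceil(data_range / k)           # class width, rounded up

# Lower limits start at the smallest value; upper limits are one unit
# below the next lower limit because the data are whole numbers.
lower_limits = [low + i * width for i in range(k)]
upper_limits = [lo + width - 1 for lo in lower_limits]

# Frequency of each class, then the "less than" cumulative frequency.
freq = [sum(lo <= x <= hi for x in data)
        for lo, hi in zip(lower_limits, upper_limits)]
cum_less = []
running = 0
for f in freq:
    running += f
    cum_less.append(running)

rel_freq = [f / n for f in freq]            # relative frequencies
```

Printing `lower_limits`, `upper_limits`, `freq` and `cum_less` side by side gives the first columns of the table.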
DIAGRAMMATIC AND GRAPHIC PRESENTATION OF DATA
These are techniques for presenting data in visual displays using geometric figures and pictures.
Importance:
They have greater attraction.
GRAPHICAL PRESENTATION
CHAPTER 3
MEASURES OF CENTRAL TENDENCY
Types of measures of central tendency
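There are several different measures of central tendency, each with its advantages and
disadvantages.
The Mode
The Median
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
When the values $x_i$ occur with frequencies $f_i$:
$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_n f_n}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} x_i f_i}{n}$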
2, 7, 8, 2, 7, 3, 7
$x_i$    $f_i$    $x_i f_i$
2        2        4
3        1        3
7        3        21
8        1        8
Total    7        36
$\bar{x} = \frac{\sum_{i=1}^{n} x_i f_i}{n} = \frac{(2 \times 2) + (3 \times 1) + (7 \times 3) + (8 \times 1)}{2 + 1 + 3 + 1} = \frac{36}{7} = 5.14$
Arithmetic Mean for Grouped Data
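If data are given in the shape of a continuous frequency distribution, the mean is obtained
as follows
$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_k f_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$
Where, $x_i$ is the class mark of the $i$-th class and
$\sum_{i=1}^{k} f_i = n$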
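Class     $f_i$    $x_i$    $x_i f_i$
6 – 10    35       8        280
11 – 15   23       13       299
16 – 20   15       18       270
21 – 25   12       23       276
26 – 30   9        28       252
31 – 35   6        33       198
Total     100               1575
$\bar{x} = \frac{\sum_{i=1}^{6} x_i f_i}{\sum_{i=1}^{6} f_i} = \frac{(35 \times 8) + (23 \times 13) + (15 \times 18) + (12 \times 23) + (9 \times 28) + (6 \times 33)}{100} = \frac{1575}{100} = 15.75$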
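As a quick check of the arithmetic, the grouped mean can be computed from the class marks and frequencies in the table. The snippet below is a minimal sketch; the list names are illustrative.

```python
# Class marks (midpoints) and frequencies taken from the table above.
class_marks = [8, 13, 18, 23, 28, 33]
freqs = [35, 23, 15, 12, 9, 6]

n = sum(freqs)                                            # total frequency
total = sum(x * f for x, f in zip(class_marks, freqs))    # sum of x_i * f_i
grouped_mean = total / n

print(n, total, grouped_mean)
```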
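Combined mean ($\bar{x}_c$):
$\bar{x}_c = \frac{\bar{x}_1 n_1 + \bar{x}_2 n_2 + \dots + \bar{x}_k n_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} \bar{x}_i n_i}{\sum_{i=1}^{k} n_i}$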
Example: The average score for a class of 35 students was 70. The 20 male students in the
class averaged 73. What was the average score for the female students in the class?
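Solutions:
For females, $\bar{x}_f = ?$, $n_f = 15$. For males, $\bar{x}_m = 73$, $n_m = 20$. $\bar{x}_c = 70$
$\bar{x}_c = \frac{\bar{x}_f n_f + \bar{x}_m n_m}{n_f + n_m}$
$70 = \frac{15\,\bar{x}_f + 73 \times 20}{15 + 20}$, so $70 \times 35 = 15\,\bar{x}_f + 1460$
$990 = 15\,\bar{x}_f$
$\bar{x}_f = 66$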
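The same rearrangement of the combined-mean formula can be written as a small script. This is a sketch under the assumption that the class consists only of the male and female groups; the helper `combined_mean` is a hypothetical name introduced for the check.

```python
def combined_mean(means, sizes):
    """Combined mean of several groups from their means and sizes."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

n_total, mean_total = 35, 70     # whole class
n_m, mean_m = 20, 73             # male students
n_f = n_total - n_m              # female students: 15

# Solve mean_total * n_total = mean_f * n_f + mean_m * n_m for mean_f.
mean_f = (mean_total * n_total - mean_m * n_m) / n_f

# Plugging the answer back in should reproduce the class average.
check = combined_mean([mean_f, mean_m], [n_f, n_m])
```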
If a wrong figure has been used when calculating the mean, the correct mean can be
obtained without repeating the whole process using
Correct mean = wrong mean + (correct value - wrong value)/n
Example:
The average weight of 10 students was calculated to be 65 kg. Later it was discovered that
one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
Solution: Correct mean = 65 + (80 - 40)/10 = 65 + 4 = 69 kg
Weighted mean (WM)
Let X1, X2, …, Xn be the values of the items of a series and W1, W2, …, Wn their corresponding
weights, then the weighted mean denoted by $\bar{x}_w$ is given by
$WM = \bar{x}_w = \frac{X_1 W_1 + X_2 W_2 + \dots + X_n W_n}{W_1 + W_2 + \dots + W_n} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}$
Example:
Solution:
$\bar{x}_w = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i} = \frac{60 \times 1 + 75 \times 2 + 63 \times 1 + 59 \times 3 + 55 \times 3}{1 + 2 + 1 + 3 + 3}$
= 615/10 = 61.5
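The weighted-mean computation can be reproduced directly from the values and weights that appear in the solution. The statement of the example is not available here, so the snippet below only assumes the five value/weight pairs shown in the formula.

```python
# Values and their weights as they appear in the worked solution.
values = [60, 75, 63, 59, 55]
weights = [1, 2, 1, 3, 3]

weighted_sum = sum(x * w for x, w in zip(values, weights))
weighted_mean = weighted_sum / sum(weights)

print(weighted_sum, sum(weights), weighted_mean)
```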
Geometric mean
The geometric mean of a set of n observations is the nth root of their product.
The geometric mean of x1, x2, x3, …, xn is denoted by G.M and given by:
$GM = \sqrt[n]{x_1 \times x_2 \times x_3 \times \dots \times x_n}$
$\log(GM) = \frac{1}{n}\{\log x_1 + \log x_2 + \dots + \log x_n\} = \frac{1}{n}\sum_{i=1}^{n} \log x_i$
$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
Example: X1 = 2, X2 = 4, X3 = 6
$HM = \frac{3}{\frac{1}{2} + \frac{1}{4} + \frac{1}{6}} = \frac{3}{11/12} = 36/11 = 3.27$
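Both formulas are easy to verify numerically. The sketch below computes the geometric mean in two ways (nth root of the product, and via logarithms) and the harmonic mean for the example values 2, 4, 6; `statistics.harmonic_mean` from the standard library is used only as a cross-check.

```python
import math
import statistics

x = [2, 4, 6]
n = len(x)

# Geometric mean: nth root of the product of the observations.
gm_root = math.prod(x) ** (1 / n)

# Same quantity via logarithms: log(GM) = (1/n) * sum(log x_i).
gm_log = 10 ** (sum(math.log10(v) for v in x) / n)

# Harmonic mean: n divided by the sum of reciprocals.
hm = n / sum(1 / v for v in x)

print(round(gm_root, 2), round(hm, 2))
print(statistics.harmonic_mean(x))   # library cross-check
```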
The mode may not exist and even if it does exist, it may not be unique.
In the case of a discrete distribution, the value having the maximum frequency is the modal
value.
Examples:
i. Find the mode of 5, 3, 5, 8, 9. Mode = 5
iii. Find the mode of 4, 12, 3, 6, and 7. No mode for this data.
Mode for Grouped data
The mode of a set of numbers X1, X2, …, Xn is denoted by $\hat{x}$
$\hat{x} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$
Where
$L_{mo}$ is the lower class boundary of the modal class
$w$ is the class width
$\Delta_1 = f_{mo} - f_1$ and $f_1$ is the frequency of the class preceding the modal class
$\Delta_2 = f_{mo} - f_2$ and $f_2$ is the frequency of the class following the modal class
$f_{mo}$ is the frequency of the modal class
$\Delta_1 = f_{mo} - f_1 = 31 - 29 = 2$
$\Delta_2 = f_{mo} - f_2 = 31 - 5 = 26$
$\hat{x} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) = 45 + 10\left(\frac{2}{2 + 26}\right)$
= 45 + 0.71
= 45.71
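The grouped-mode formula translates into a short function. This is an illustrative sketch: `grouped_mode` and its parameter names are hypothetical, and the inputs are the figures used in the computation above (lower boundary 45, width 10, frequencies 29, 31 and 5).

```python
def grouped_mode(lower_boundary, width, f_modal, f_prev, f_next):
    """Mode of grouped data from the modal class and its two neighbours."""
    delta1 = f_modal - f_prev     # difference with the preceding class
    delta2 = f_modal - f_next     # difference with the following class
    return lower_boundary + width * delta1 / (delta1 + delta2)

mode = grouped_mode(lower_boundary=45, width=10,
                    f_modal=31, f_prev=29, f_next=5)
print(round(mode, 2))
```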
MEDIAN
Median is the value of the variable which divides the ordered data into two equal parts
It is the middle most value in the sense that the number of values less than the median is
equal to the number of values greater than it
If X1, X2, …, Xn are the observations arranged in ascending order, the median $\tilde{x}$ is given by
$\tilde{x} = \begin{cases} x_{\frac{n+1}{2}}, & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right), & \text{if } n \text{ is even} \end{cases}$
Example: Find the median of the following numbers.
i. 6, 5, 2, 8, 9, 4
ii. 2, 1, 8, 3, 5
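Solution
i. 2, 4, 5, 6, 8, 9
$\tilde{x} = \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right) = \frac{1}{2}(x_3 + x_4) = \frac{1}{2}(5 + 6) = 5.5$
ii. 1, 2, 3, 5, 8
$\tilde{x} = x_{\frac{n+1}{2}} = x_{\frac{5+1}{2}} = x_{\frac{6}{2}} = x_3 = 3$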
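The odd/even rule can be written out explicitly. The function below is a minimal sketch (Python's `statistics.median` does the same job); note that list indices start at 0, so the 1-based position (n+1)/2 becomes index n // 2.

```python
def median(values):
    """Median of ungrouped data using the odd/even rule."""
    xs = sorted(values)           # arrange in ascending order
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]            # the ((n+1)/2)-th value, 1-based
    # average of the (n/2)-th and (n/2 + 1)-th values, 1-based
    return (xs[mid - 1] + xs[mid]) / 2

print(median([6, 5, 2, 8, 9, 4]))
print(median([2, 1, 8, 3, 5]))
```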
Median for grouped data
The median for grouped data is given by
$\tilde{x} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - c\right)$
Where:
$L_{med}$ is the lower class boundary of the median class
w is the class width
n is the total number of observations
c is the cumulative frequency (less than type) of the class preceding the median class
$f_{med}$ is the frequency of the median class
Note: The median class is the class with the smallest cumulative frequency (less than type)
greater than or equal to $\frac{n}{2}$
QUANTILES
These are measures that depend on their positions in the distribution. Quartiles, deciles and
percentiles are collectively called quantiles
i. Quartiles
Quartiles are measures that divide the frequency distribution into four equal parts
The values of the variable corresponding to these divisions are denoted Q1, Q2, and Q3,
often called the first, the second and the third quartile respectively
To find $Q_i$ (i = 1, 2, 3) we count $\frac{iN}{4}$ of the classes beginning from the lowest class
$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{in}{4} - c\right)$, $i = 1, 2, 3$
Where,
$L_{Q_i}$ is the lower class boundary of the quartile class
c is the cumulative frequency (less than type) of the class preceding the quartile class
Remark: The quartile class (class containing $Q_i$) is the class with the smallest cumulative
frequency (less than type) greater than or equal to $\frac{in}{4}$
Deciles
Deciles are measures that divide the frequency distribution into ten equal parts
The values of the variable corresponding to these divisions are denoted by D1, D2, …, D9,
often called the first, the second, …, the ninth decile respectively.
For grouped data, we have the following formula
$D_i = L_{D_i} + \frac{w}{f_{D_i}}\left(\frac{in}{10} - c\right)$, $i = 1, 2, \dots, 9$
Where,
$L_{D_i}$ is the lower class boundary of the class containing $D_i$
$w$ is the class width
$n$ is the total number of observations
The class containing $D_i$ ($i = 1, 2, \dots, 9$) is the class with the smallest cumulative frequency (less
than type) greater than or equal to $\frac{in}{10}$
Percentiles
Percentiles are measures that divide the frequency distribution into hundred equal parts
The values of the variable corresponding to these divisions are denoted P1, P2, …, P99, often
called the first, the second, …, the ninety-ninth percentile respectively
$P_i = L_{P_i} + \frac{w}{f_{P_i}}\left(\frac{in}{100} - c\right)$, $i = 1, 2, \dots, 99$
Where,
$L_{P_i}$ is the lower class boundary of the class containing $P_i$
c is the cumulative frequency (less than type) of the class preceding the decile class
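b. $D_7$: $\frac{7 \times 493}{10} = 3451/10 = 345.1$
- thus, the class containing $D_7$ is 190 – 200
- $L_{D_7} = 190$, $w = 10$, $f_{D_7} = 107$, $c = 244$
$D_7 = L_{D_7} + \frac{w}{f_{D_7}}\left(\frac{7n}{10} - c\right) = 190 + \frac{10}{107}(345.1 - 244) = 190 + 9.45 = 199.45$
- This implies that about 70% of the observations/items are less than or equal to 199.45, but 30% of them are greater than it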
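Quartiles, deciles and percentiles for grouped data share one formula that differs only in the number of parts (4, 10 or 100). The sketch below makes that explicit; `grouped_quantile` and its parameter names are hypothetical, and the inputs are the figures quoted for the seventh decile.

```python
def grouped_quantile(i, parts, n, lower_boundary, width, freq, cum_before):
    """i-th quantile of grouped data when the distribution is cut into `parts`.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    The class containing the quantile must already be identified.
    """
    position = i * n / parts      # i*n/4, i*n/10 or i*n/100
    return lower_boundary + (width / freq) * (position - cum_before)

# Seventh decile: n = 493, class 190 - 200, f = 107, c = 244.
d7 = grouped_quantile(i=7, parts=10, n=493,
                      lower_boundary=190, width=10, freq=107, cum_before=244)
print(round(d7, 2))
```

Calling the same function with `parts=4` or `parts=100` covers quartiles and percentiles, which is why the three formulas look alike.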
In other words, it is the degree to which numerical data tend to spread about an average
value
Measures of dispersion are statistical measures which provide ways of measuring the
extent to which data are dispersed or spread out
If data are given in the shape of a continuous frequency distribution, then the range is
computed as
$R = UCL_L - LCL_F$ or $R = X_L - X_F$
Where,
$UCL_L$ is the upper class limit of the last class
$LCL_F$ is the lower class limit of the first class
Example: Compute Q.D and its C.Q.D for the previous example
Solutions:
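In the previous chapter we have obtained the values of the quartiles; here $Q_1 = 174.90$ and $Q_3 = 203.83$
$Q.D = \frac{Q_3 - Q_1}{2} = \frac{203.83 - 174.90}{2} = 14.47$
$C.Q.D = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{203.83 - 174.90}{203.83 + 174.90} = 28.93/378.73$
= 0.076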
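A two-line check of the quartile deviation and its coefficient. The quartile values are taken from the computation above (174.90 and 203.83, assumed to be the first and third quartiles); the variable names are illustrative.

```python
q1, q3 = 174.90, 203.83           # first and third quartiles

qd = (q3 - q1) / 2                # quartile deviation
cqd = (q3 - q1) / (q3 + q1)       # coefficient of quartile deviation

print(qd, round(cqd, 3))
```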
The Mean Deviation (M.D)
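The M.D of a set of items is defined as the arithmetic mean of the absolute values of the
deviations from a given average (mean $\bar{x}$, median $\tilde{x}$ or mode $\hat{x}$).
$M.D(\bar{x}) = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
For the case of a frequency distribution it is given as
$M.D(\bar{x}) = \frac{\sum_{i=1}^{k} f_i |x_i - \bar{x}|}{n}$
The mean deviations about the median and about the mode are obtained in the same way, e.g.
for the case of a frequency distribution $M.D(\tilde{x}) = \frac{\sum_{i=1}^{k} f_i |x_i - \tilde{x}|}{n}$
The coefficient of mean deviation (C.M.D) divides the mean deviation by the average about which it is taken:
$C.M.D(\bar{x}) = \frac{M.D(\bar{x})}{\bar{x}}$
$C.M.D(\tilde{x}) = \frac{M.D(\tilde{x})}{\tilde{x}}$
$C.M.D(\hat{x}) = \frac{M.D(\hat{x})}{\hat{x}}$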
Example: The following are the numbers of visits made by ten mothers to the local doctor’s
surgery: 8, 6, 5, 5, 7, 4, 5, 9, 7, 4
Find the mean deviation about the mean, median and mode, and their coefficients
Solution: $\bar{x} = 6$, $\hat{x} = 5$, $\tilde{x} = 5.5$
- Then take the absolute deviations of each observation from these averages
$x_i$             4     4     5     5     5     6     7     7     8     9     Total
$|x_i - 6|$       2     2     1     1     1     0     1     1     2     3     14
$|x_i - 5.5|$     1.5   1.5   0.5   0.5   0.5   0.5   1.5   1.5   2.5   3.5   14
$|x_i - 5|$       1     1     0     0     0     1     2     2     3     4     14
$M.D(\bar{x}) = \frac{\sum_{i=1}^{10} |x_i - 6|}{10} = \frac{14}{10} = 1.4$
$M.D(\tilde{x}) = \frac{\sum_{i=1}^{10} |x_i - 5.5|}{10} = \frac{14}{10} = 1.4$
$M.D(\hat{x}) = \frac{\sum_{i=1}^{10} |x_i - 5|}{10} = \frac{14}{10} = 1.4$
$C.M.D(\bar{x}) = \frac{M.D(\bar{x})}{\bar{x}} = \frac{1.4}{6} = 0.233$
$C.M.D(\tilde{x}) = \frac{M.D(\tilde{x})}{\tilde{x}} = \frac{1.4}{5.5} = 0.255$
$C.M.D(\hat{x}) = \frac{M.D(\hat{x})}{\hat{x}} = \frac{1.4}{5} = 0.28$
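The three mean deviations and their coefficients can be recomputed with the standard `statistics` module. This is a sketch for checking the arithmetic; the helper `mean_deviation` is a hypothetical name.

```python
import statistics

visits = [8, 6, 5, 5, 7, 4, 5, 9, 7, 4]

def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen average."""
    return sum(abs(x - center) for x in values) / len(values)

mean = statistics.mean(visits)
median = statistics.median(visits)
mode = statistics.mode(visits)

for name, center in [("mean", mean), ("median", median), ("mode", mode)]:
    md = mean_deviation(visits, center)
    # coefficient of mean deviation = M.D divided by the average used
    print(name, center, md, round(md / center, 3))
```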
THE VARIANCE
The variance is the average squared deviation from the mean
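Population variance ($\sigma^2$)
$\sigma^2 = \frac{1}{N}\sum (x_i - \mu)^2$, $i = 1, 2, 3, \dots, N$
For the case of a frequency distribution the population variance is expressed as:
$\sigma^2 = \frac{1}{N}\sum f_i (x_i - \mu)^2$, $i = 1, 2, 3, \dots, k$
Sample variance ($s^2$)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
$s = \sqrt{s^2}$
$\sigma = \sqrt{\sigma^2}$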
Coefficient of Variation (C.V)
It is the ratio of standard deviation to the mean usually expressed as percent.
$C.V = \frac{s}{\bar{x}} \times 100$
The distribution having less C.V is said to be less variable or more consistent
Examples:
Find the C.V for the following sample data: 5, 17, 12, 10
Solutions: $\bar{x} = 11$
$x_i$                  5     10    12    17    Total
$(x_i - \bar{x})^2$    36    1     1     36    74
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{74}{3} = 24.67$
$s = \sqrt{s^2} = \sqrt{24.67} = 4.97$
$C.V = \frac{s}{\bar{x}} \times 100 = \frac{4.97}{11} \times 100 = 45\%$
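The sample variance, standard deviation and coefficient of variation for this example can be verified with the standard library, whose `statistics.variance` and `statistics.stdev` use the n - 1 divisor. The variable names are illustrative.

```python
import statistics

sample = [5, 17, 12, 10]

mean = statistics.mean(sample)
var = statistics.variance(sample)    # sample variance, divisor n - 1
sd = statistics.stdev(sample)        # square root of the sample variance
cv = sd / mean * 100                 # coefficient of variation in percent

print(mean, round(var, 2), round(sd, 2), round(cv))
```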
CHAPTER 5
ELEMENTARY PROBABILITY
CHAPTER 7
SAMPLING AND SAMPLING DISTRIBUTION
Definitions
Parameter: Characteristic or measure obtained from a population.
All elements in the population have the same pre-assigned non-zero probability of
being included in the sample
In this case, sampling may be with or without replacement
Subjects are selected by using the lottery method or a table of random numbers
Stratified Random Sampling (SRS)
The population is divided into non-overlapping but exhaustive groups called
strata
Samples/subjects are chosen from each stratum
What is Inference?
It is the process of making interpretations or conclusions from sample data for
the totality of the population.
It is only the sample data that is ready for inference.
In statistics there are two ways through which inference can be made
2. Interval estimation
It is the procedure that results in the interval of values as an estimate for a
parameter
Definitions
Confidence Interval: An interval estimate with a specific level of confidence
Consistent Estimator: An estimator which gets closer to the value of the
parameter as the sample size increases
Estimator: A sample statistic which is used to estimate population parameter
- It must be unbiased, consistent, and relatively efficient
Estimate: Is one of the different possible values which an estimator can assume
Relatively Efficient Estimator: The estimator for a parameter with the
smallest variance
Unbiased Estimator: An estimator whose expected value is the value of the
parameter being estimated
Point estimation
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ is a point estimator of the population mean $\mu$
Interval estimation of the population mean
Case 1: If sample size is large or if the population is normal with known variance
$\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ is a 100(1-α)% confidence interval for $\mu$
But usually $\sigma^2$ is not known; in that case we estimate it by its point estimator $S^2$
$\bar{x} \pm z_{\alpha/2}\,s/\sqrt{n}$ is a 100(1-α)% confidence interval for $\mu$
Case 2: If sample size is small and the population variance, $\sigma^2$, is not known
$\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$ is a 100(1-α)% confidence interval for $\mu$
A 95% CI = $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
= $32 \pm 1.96 \times 4.2/\sqrt{25}$
= 32 ± 1.65
= (30.35, 33.65)
Example: A drug company is testing a new drug which is supposed to reduce blood
pressure. From the six people who are used as subjects, it is found that the average
drop in blood pressure is 2.28 points, with a standard deviation of 0.95 points. What is
the 95% confidence interval for the mean change in pressure?
Solution:
$\bar{x} = 2.28$, $s = 0.95$, $n = 6$, $1 - \alpha = 0.95$, so $\alpha = 0.05$, $\alpha/2 = 0.025$, and $t_{\alpha/2} = 2.571$ with df = 5 from the table
A 95% CI = $\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$
= $2.28 \pm 2.571 \times 0.95/\sqrt{6}$
= 2.28 ± 0.997
= (1.28, 3.28)
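Both kinds of interval have the same shape: point estimate plus or minus a critical value times the standard error. The sketch below takes the critical value as an input (read from the z or t table, as in the text) rather than computing it; `confidence_interval` is a hypothetical helper name.

```python
import math

def confidence_interval(mean, sd, n, critical):
    """Interval mean +/- critical * sd / sqrt(n).

    `critical` is the tabulated z or t value for the chosen confidence level.
    """
    margin = critical * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Drug example: small sample, t value 2.571 with 5 degrees of freedom.
t_low, t_high = confidence_interval(mean=2.28, sd=0.95, n=6, critical=2.571)

# Earlier example with the z value 1.96: mean 32, sd 4.2, n = 25.
z_low, z_high = confidence_interval(mean=32, sd=4.2, n=25, critical=1.96)

print(round(t_low, 2), round(t_high, 2))
print(round(z_low, 2), round(z_high, 2))
```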
HYPOTHESIS TESTING
Definitions:
Statistical hypothesis:
is a statement about the population whose plausibility is to be evaluated on
the basis of the sample data.
Test statistic
is a statistic whose value serves to determine whether to reject or accept the
hypothesis to be tested. It is a random variable.
Statistical test
is a test or procedure used to evaluate a statistical hypothesis and its value
depends on sample data
Types of hypothesis
Null hypothesis
Usually denoted by H0
Alternative hypothesis:
Usually denoted by H1 or Ha
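Where, $Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
$H_1: \mu < \mu_0$     $Z_{cal} < -Z_\alpha$     $Z_{cal} > -Z_\alpha$     $Z_{cal} = -Z_\alpha$
Where, $t_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
$H_1: \mu < \mu_0$     $t_{cal} < -t_\alpha$     $t_{cal} > -t_\alpha$     $t_{cal} = -t_\alpha$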
Case3: When sampling is from a non- normally distributed population or
a population whose functional form is unknown
- If the sample size is large, one can perform a test of hypothesis about the mean by
using:
$Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ if $\sigma^2$ is known
$= \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ if $\sigma^2$ is unknown
GENERAL STEPS IN HYPOTHESIS TESTING
1. Specify the null hypothesis (H0) and the alternative hypothesis (H1).
6. Making decision.
Solution:
The t-statistic is appropriate because the population variance is not known and the sample
size is also small
Example: The mean life time of a sample of 16 fluorescent light bulbs produced
by a company is computed to be 1570 hours. The population standard deviation is
120 hours. Suppose the hypothesized value for the population mean is 1600
hours. Can we conclude that the life time of light bulbs is decreasing?
Solution:
Step 5: Computations:
$Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{1570 - 1600}{120/\sqrt{16}} = \frac{-30}{30} = -1$
Step 6: Decision
Step 7: Conclusion
At the 5% level of significance, we have no evidence to say that the life time of
light bulbs is decreasing, based on the given sample data
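The computation and decision for this left-tailed test can be sketched as follows. The snippet assumes a one-sided alternative (mean life time below 1600 hours) at the 5% level, and uses `statistics.NormalDist` from the standard library to obtain the critical value instead of a printed table.

```python
import math
from statistics import NormalDist

x_bar, mu0 = 1570, 1600      # sample mean and hypothesized mean (hours)
sigma, n = 120, 16           # population standard deviation and sample size
alpha = 0.05                 # level of significance

z_cal = (x_bar - mu0) / (sigma / math.sqrt(n))

# Left-tail critical value -z_alpha, roughly -1.645 at the 5% level.
z_crit = NormalDist().inv_cdf(alpha)

# Reject H0 only if the test statistic falls below the critical value.
reject_h0 = z_cal < z_crit
print(z_cal, round(z_crit, 3), reject_h0)
```

Since the statistic does not fall in the rejection region, the sketch agrees with the conclusion stated above.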
TEST OF ASSOCIATION
The chi-square test is used to test the hypothesis of independence
of two attributes, say A and B, where A has $r$ categories and B has
$c$ categories
$\chi^2_{cal} = \sum_{i=1}^{r} \sum_{j=1}^{c} \left[\frac{(o_{ij} - e_{ij})^2}{e_{ij}}\right] \sim \chi^2$ with $df = (r-1)(c-1)$
Where, $o_{ij}$ is the number of units that belong to category $i$ of $A$ and category $j$ of $B$
$e_{ij}$ is given by
$e_{ij} = \frac{R_i \times C_j}{n}$
Where, $R_i$ is the $i$-th row total, $C_j$ is the $j$-th column total and $n$ is the total
number of observations.
Decision: we reject 𝐻0 if
Example: A geneticist took a random sample of 300 men to study whether there
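            Son
Father      Bald    Not
Bald        85      59
Not         65      91
Solution:
$H_1$: not $H_0$
$e_{11} = \frac{R_1 \times C_1}{n} = \frac{144 \times 150}{300} = 72$
$e_{12} = \frac{R_1 \times C_2}{n} = \frac{144 \times 150}{300} = 72$
$e_{21} = \frac{R_2 \times C_1}{n} = \frac{156 \times 150}{300} = 78$
$e_{22} = \frac{R_2 \times C_2}{n} = \frac{156 \times 150}{300} = 78$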
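The expected frequencies, and the chi-square statistic built from them, can be computed from the observed table alone. This is a plain-Python sketch with illustrative names; the critical value and the final decision are left out because the significance level is not stated in the surviving text.

```python
# Observed counts: rows = father (first row, second row), columns = son.
observed = [[85, 59],
            [65, 91]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: e_ij = R_i * C_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (o - e)^2 / e.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Degrees of freedom: (r - 1) * (c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected, round(chi2, 2), df)
```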
Conclusion:
CHAPTER 9
SIMPLE LINEAR REGRESSION AND CORRELATION
Linear regression and correlation study and measure the linear
relationship between two or more variables.
When only two variables are involved, the analysis is referred to as simple
correlation and simple linear regression analysis.
When there are more than two variables, the terms multiple regression and
partial correlation are used.
Regression Analysis: is a statistical technique that can be used to develop a
mathematical equation showing how variables are related
Correlation Analysis: deals with the measurement of the closeness of the
relationship which is described in the regression equation
Correlation coefficient(r) computed from the sample data
measures the strength and direction of a linear relationship between
two quantitative variables.
The correlation coefficient between variable X and Y is given by
$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}}$
The range of the correlation coefficient is from -1 to 1.
If there is a strong positive linear relationship between X and Y, the value of r
will be close to 1.
If there is a strong negative linear relationship between X and Y, the
value of r will be close to -1.
When there is no linear relationship between X and Y or only a weak
relationship, the value of r will be close to 0.
SIMPLE LINEAR REGRESSION
Simple linear regression refers to the linear relationship between two variables.
A simple regression line is the line fitted to the points plotted in the scatter
diagram, which describes the average relationship between the two variables.
Therefore, to see the type of relationship, it is advisable to prepare a scatter plot
before fitting the model.
$Y = \alpha + \beta X + \epsilon$
Where, 𝑌 = dependent variable, 𝑥 = independent variable, 𝛼 = regression
$\hat{Y} = a + bX$
Where, $a$ is a constant term or intercept and $b$ is the slope/coefficient of $X$
$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$ and $a = \bar{Y} - b\bar{X}$
Examples: The following data shows two variables; mid semester(X) score
and final exam(Y) scores of 10 students (both out of 50)
Student   Mid exam (X)   Final exam (Y)
1         31             31
2         23             29
3         41             34
4         32             35
5         29             25
6         33             35
7         28             33
8         31             42
Solution
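$n = 10$, $\bar{X} = 31.2$, $\bar{Y} = 32.9$, $\bar{X}^2 = 973.4$, $\bar{Y}^2 = 1082.4$, $\sum XY = 10331$,
$\sum X^2 = 9920$, $\sum Y^2 = 11003$
A) $r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}}$
$= \frac{10331 - 10 \times 31.2 \times 32.9}{\sqrt{(9920 - 10 \times 973.4)(11003 - 10 \times 1082.4)}}$
$= \frac{66.2}{182.5} = 0.363$
This means mid exam and final exam scores have a weak positive correlation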
B) $b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{10331 - 10 \times 31.2 \times 32.9}{9920 - 10 \times 973.4} = \frac{66.2}{186} = 0.36$
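The correlation coefficient and the slope can both be recomputed from the summary figures quoted in the solution (n, the two means and the three sums). The sketch below works from those summaries only, because the full list of ten score pairs is not reproduced here; the intercept `a` is computed from a = Y-bar - b X-bar as an illustration and is not a value quoted in the text.

```python
import math

# Summary statistics quoted in the solution.
n = 10
mean_x, mean_y = 31.2, 32.9
sum_xy, sum_x2, sum_y2 = 10331, 9920, 11003

s_xy = sum_xy - n * mean_x * mean_y      # numerator shared by r and b
s_xx = sum_x2 - n * mean_x ** 2
s_yy = sum_y2 - n * mean_y ** 2

r = s_xy / math.sqrt(s_xx * s_yy)        # correlation coefficient
b = s_xy / s_xx                          # slope of the fitted line
a = mean_y - b * mean_x                  # intercept (illustrative, see lead-in)

print(round(r, 2), round(b, 2))
```

Small differences in the third decimal place relative to the hand computation come from rounding the squared means before subtracting.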