Introduction To Statistics

Uploaded by

adugnawbeshaw
Copyright
© © All Rights Reserved
CHAPTER 1

INTRODUCTION
What is statistics?

We can define statistics in two ways:

• Plural sense (layman's definition)
   - It is an aggregate or collection of numerical facts.
   - It is a description and summary of events.
• Singular sense (formal definition)
   - Statistics is the science of collecting, organizing, presenting, analyzing and interpreting numerical data to assist in making more effective decisions or conclusions.
STAGES IN STATISTICAL INVESTIGATION
There are five stages or steps in any statistical investigation:

• Collection of data: the process of measuring, gathering and assembling the raw data for the investigation, for example through
   - Telephone survey
   - Questionnaire
   - Personal interview
• Organization of data: summarizing the data in a meaningful way (e.g. in table form).
• Presentation of data: re-organizing, classifying, compiling and summarizing the data so that they can be presented in a meaningful form.
• Analysis of data: extracting relevant information from the summarized data.
• Inference: interpreting the results and drawing conclusions or making decisions about the population based on sample data.
CLASSIFICATION OF STATISTICS
• Depending on how the data are used, statistics can be categorized into two branches: descriptive and inferential statistics.

Descriptive statistics
• It is a collection of methods used to summarize and describe data using tables, graphs, charts and calculations (mean, mode, median, ...).
• It does not involve generalizing beyond the data at hand.
Examples:
• The average weight of chemical engineering students is 55 kg.
• The annual income from milk sales of farmers in kebele A is $2500.

Inferential statistics
• It is concerned with methods for drawing conclusions about a population using information obtained from a sample.
Example:
• An analysis of 1000 families indicated that the average monthly income of households in Ethiopia is $10.
DEFINITION OF SOME TERMS
• Population: the collection of all individuals or items under consideration in the study.
   - Example: all students at WCU who take this course in this term.
• Sample: a subset (portion) of the population.
• Sampling: the process or method of selecting a sample from the population.
• Sample size: the number of elements (observations) included in the sample.
• Census: complete enumeration of the elements of the population, i.e. the collection of data from every element in the population.
• Parameter: a characteristic or measure obtained from a population.
• Statistic: a characteristic or measure obtained from a sample.
• Variable: a characteristic of interest that can take on different values.
TYPES OF VARIABLES
• Qualitative variables: non-numeric variables whose values are categories and cannot be measured numerically.
   e.g. gender, place of residence, state of birth, etc.
• Quantitative variables: numerical variables whose values can be counted or measured.
   e.g. number of students in a class, weight, height, etc.
• Quantitative variables are either
   - Discrete (can assume only certain values), or
   - Continuous (can assume any value within a specific range).
SCALES OF MEASUREMENT
• Measurement is the assignment of numbers to objects or events in a systematic fashion.
• A measurement scale is characterized by which of the following properties the assigned values possess:
   - Order
   - Distance
   - Fixed zero
• The property of order exists when an object that has more of the attribute than another object is given a bigger number by the rule system.
TYPES OF SCALES

• Four levels of measurement scales are commonly used:
   - Nominal
   - Ordinal
   - Interval
   - Ratio
• Each possesses different properties of measurement systems.

Nominal Scales
• Are measurement systems that possess none of the three properties.
• There is no ordering or ranking of the levels/categories.
• No arithmetic or relational operations can be applied.
   e.g. - Political party preference (Republican, Democrat, Other)
        - Sex (Male or Female)
        - Marital status (married, single, widowed, divorced)
• Are the lowest level of measurement.
Ordinal Scales
• Are measurement systems that possess the property of order, but not the property of distance.
• The property of fixed zero is not important if the property of distance is not satisfied.
• Ordering or ranking of the levels/categories is meaningful.
• Arithmetic operations are not applicable, but relational operations are.
   e.g. Letter grades (A, B, C, D, F), rating scales (Excellent, Very good, Good, Fair, Poor), military status (…)
Interval Scales
• Are measurement systems that possess the properties of order and distance, but not the property of fixed zero.
• The order of categories/levels is meaningful.
• However, there is no meaningful zero, so ratios are meaningless. All arithmetic operations except division are applicable.
• Relational operations are also possible.
   e.g. - IQ
        - Temperature in °F
Ratio Scales
• Are measurement systems that possess all three properties: order, distance and fixed zero.
• All arithmetic and relational operations are applicable.
• There is a true zero.
• True ratios exist between the different units of measure.
   e.g. weight, height, age, number of students, ...


CHAPTER 2

METHODS OF DATA PRESENTATION AND COLLECTION

• There are two sources of data: primary and secondary.

Primary Data
• Data measured or collected by the investigator (the user) directly from the source, for example through
   - Telephone interview
   - Mail questionnaire
   - Door-to-door survey
   - Personal interview

Secondary Data
• Data gathered from published or unpublished sources or files.
• When the source is secondary data, check:
   i. That the purpose for which the data were collected is compatible with the present problem.
   ii. That the nature and classification of the data are appropriate to the problem.
   iii. Whether the data come from a census or from a sample.
Note: Data which are primary for one user may be secondary for another.
DATA PRESENTATION
• After the data have been collected and edited, the next important step is to organize and present them.
• The presentation of data is broadly classified into the following two categories:
   i. Tabular presentation
   ii. Diagrammatic and graphic presentation
TABULAR PRESENTATION
• Tabular presentation: a way of presenting data in table form.
• Raw data: recorded information in its original collected form.
• Frequency: the number of values in a specific class of the distribution.
• Frequency distribution: the organization of raw data in table form using classes and frequencies.
• There are three basic types of frequency distributions:
   i. Categorical frequency distribution
   ii. Ungrouped frequency distribution
   iii. Grouped frequency distribution
Categorical frequency distribution
• Is used to present categorical (qualitative) data, i.e. nominal or ordinal data.
• Example: a social worker collected the following data on marital status for 25 persons (M = married, S = single, W = widowed, D = divorced):

M S D W D
S S M M M
W D S M M
W D D S S
S W W D D

• Solution: tally each class, count the tallies, and express each frequency as a percentage of the total (e.g. for M, 6/25 × 100 = 24).

Class   Tally      Frequency   Percent
M       //// /     6           24
S       //// //    7           28
D       //// //    7           28
W       ////       5           20
Total              25          100
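The tally for the marital-status example can be cross-checked mechanically. The sketch below is only an illustration: the variable names (`raw`, `counts`, `table`) are made up here, and the codes are typed in row by row from the data listed in the example.

```python
from collections import Counter

# Marital-status codes copied row by row from the example data.
raw = "M S D W D S S M M M W D S M M W D D S S S W W D D".split()

counts = Counter(raw)   # frequency of each class
n = len(raw)            # total number of observations

# (class, frequency, percent) rows in the order used in the table.
table = [(c, counts[c], 100 * counts[c] / n) for c in "MSDW"]

for cls, freq, pct in table:
    print(f"{cls}  {freq:2d}  {pct:5.1f}%")
```

Counting the codes this way is a quick guard against tally slips when the raw list is long.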
Ungrouped frequency distribution
• Is used to present a small set of data.
• Is used for quantitative data, especially discrete data.
GROUPED FREQUENCY DISTRIBUTION
• Is used for large data sets.
• Is used for quantitative data.

Example: Construct a grouped frequency distribution for the following data:

11, 29, 6, 33, 14, 31, 22, 27, 19, 20, 18, 17, 22, 38, 23, 21, 26, 34, 39, 27

Solution: steps for constructing a grouped frequency distribution
• Step 1: Find the highest and the lowest values: H = 39, L = 6.
• Step 2: Find the range: R = H − L = 39 − 6 = 33.
• Step 3: Select the number of classes using Sturges' formula, $K = 1 + 3.32\log_{10} n$. Here $K = 1 + 3.32\log_{10} 20 = 5.32 \approx 6$ (rounding up).
• Step 4: Find the class width: $W = R/K = 33/6 = 5.5 \approx 6$ (rounding up).
• Step 5: Select the starting point; let it be the minimum observation. The starting point is the lower limit of the first class. Keep adding the class width to this lower limit to get the remaining lower limits: 6, 12, 18, 24, 30, 36.
• Step 6: Find the upper limit of the first class by subtracting U (the unit of measurement, here 1) from the lower limit of the second class: 12 − 1 = 11. Keep adding the class width to this upper limit to get the remaining upper limits: 11, 17, 23, 29, 35, 41.
• Step 7: Find the class boundaries. E.g. for class 1, lower class boundary = 6 − 1/2 = 5.5 and upper class boundary = 11 + 1/2 = 11.5.
• Step 8: Tally the data.
• Step 9: Write the numeric values of the tallies in the frequency column.
• Step 10: Find the cumulative frequencies.
• Step 11: Find the relative frequencies and/or the relative cumulative frequencies.
• The complete frequency distribution follows (Cf = cumulative frequency, rf = relative frequency, rcf = relative cumulative frequency):

Class      Class          Class   Tally      Freq.   Cf (less     Cf (more     rf     rcf (less
limit      boundary       mark                       than type)   than type)          than type)
6 – 11     5.5 – 11.5     8.5     //         2       2            20           0.10   0.10
12 – 17    11.5 – 17.5    14.5    //         2       4            18           0.10   0.20
18 – 23    17.5 – 23.5    20.5    //// //    7       11           16           0.35   0.55
24 – 29    23.5 – 29.5    26.5    ////       4       15           9            0.20   0.75
30 – 35    29.5 – 35.5    32.5    ///        3       18           5            0.15   0.90
36 – 41    35.5 – 41.5    38.5    //         2       20           2            0.10   1.00
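The construction steps can be followed in code as a check on the table. This is a minimal sketch, assuming whole-number data so that the unit of measurement is 1; the dictionary keys and variable names are illustrative choices, not part of the original notes.

```python
import math

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]
unit = 1  # assumed unit of measurement U (whole numbers)

n = len(data)
low, high = min(data), max(data)
k = math.ceil(1 + 3.32 * math.log10(n))   # Sturges' formula, rounded up
width = math.ceil((high - low) / k)       # class width, rounded up

rows = []
cum = 0
for i in range(k):
    lower = low + i * width               # lower class limit
    upper = lower + width - unit          # upper class limit
    freq = sum(lower <= x <= upper for x in data)
    cum += freq
    rows.append({
        "limits": (lower, upper),
        "boundaries": (lower - unit / 2, upper + unit / 2),
        "mark": (lower + upper) / 2,
        "freq": freq,
        "cf_less": cum,                   # cumulative frequency, less than type
        "cf_more": n - cum + freq,        # cumulative frequency, more than type
        "rf": freq / n,                   # relative frequency
    })

for r in rows:
    print(r)
```

Note that `math.ceil` leaves an exact integer unchanged, so a range that divides evenly would need its width chosen by hand if a strictly larger width is wanted.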
DIAGRAMMATIC AND GRAPHIC PRESENTATION OF DATA
• These are techniques for presenting data in visual displays using geometric figures and pictures.
• Importance:
   - They have greater attraction.
   - They facilitate comparison.
   - They are easily understandable.
• Diagrams are appropriate for presenting qualitative data.
• The three most commonly used diagrammatic presentations for discrete as well as qualitative data are:
   - Pie charts
   - Pictograms
   - Bar charts
• There are different types of bar charts:
   - Simple bar chart
   - Deviation or two-way bar chart
   - Broken bar chart
   - Component or subdivided bar chart
   - Multiple bar chart
GRAPHICAL PRESENTATION

• Graphs are used for continuous data.
• The most commonly applied graphical representations are:
   - Histogram: class boundaries (horizontal axis) vs frequency (vertical axis)
   - Frequency polygon: class midpoints vs frequency
   - Cumulative frequency graph (ogive): class boundaries vs cumulative frequencies
CHAPTER 3
MEASURES OF CENTRAL TENDENCY
Types of measures of central tendency

• There are several different measures of central tendency, each with its own advantages and disadvantages:
   - The mean (arithmetic, geometric and harmonic)
   - The mode
   - The median
   - Quantiles (quartiles, deciles and percentiles)
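Arithmetic Mean (AM)
• The AM of the observations $x_1, x_2, \ldots, x_n$ is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

• If $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times, ..., $x_k$ occurs $f_k$ times, then

$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}, \qquad \sum_{i=1}^{k} f_i = n$$

Example: Obtain the mean of the following numbers:

2, 7, 8, 2, 7, 3, 7

x_i     f_i   x_i f_i
2       2     4
3       1     3
7       3     21
8       1     8
Total   7     36

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i} = \frac{(2)(2) + (3)(1) + (7)(3) + (8)(1)}{2 + 1 + 3 + 1} = \frac{36}{7} \approx 5.14$$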
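The frequency-table form of the mean can be checked with a few lines of code. This is a small sketch with illustrative names; `Fraction` is used only so that the exact value 36/7 is visible before rounding.

```python
from collections import Counter
from fractions import Fraction

values = [2, 7, 8, 2, 7, 3, 7]

freq = Counter(values)                           # maps each x_i to its f_i
total_xf = sum(x * f for x, f in freq.items())   # sum of x_i * f_i
n = sum(freq.values())                           # sum of f_i

mean = Fraction(total_xf, n)                     # exact value of the mean
print(mean, round(float(mean), 2))
```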
Arithmetic Mean for Grouped Data
• If the data are given as a continuous (grouped) frequency distribution, the mean is obtained as

$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$$

where
   - $\sum_{i=1}^{k} f_i = n$
   - $f_i$ is the frequency of the $i$th class
   - $x_i$ is the class mark of the $i$th class

Example: Calculate the mean for the following age distribution.

Class      f_i   x_i   x_i f_i
6 – 10     35    8     280
11 – 15    23    13    299
16 – 20    15    18    270
21 – 25    12    23    276
26 – 30    9     28    252
31 – 35    6     33    198
Total      100         1575

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i f_i}{\sum_{i=1}^{6} f_i} = \frac{(35)(8) + (23)(13) + (15)(18) + (12)(23) + (9)(28) + (6)(33)}{100} = \frac{1575}{100} = 15.75$$
Combined mean ($\bar{x}_c$)
• If $\bar{x}_1$ is the mean of $n_1$ observations, $\bar{x}_2$ is the mean of $n_2$ observations, ..., and $\bar{x}_k$ is the mean of $n_k$ observations, then the combined mean of all the observations is given by

$$\bar{x}_c = \frac{\bar{x}_1 n_1 + \bar{x}_2 n_2 + \cdots + \bar{x}_k n_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} \bar{x}_i n_i}{\sum_{i=1}^{k} n_i}$$

Example: The average score for a class of 35 students was 70. The 20 male students in the class averaged 73. What was the average score for the female students in the class?

Solution: For females, $\bar{x}_f = ?$ and $n_f = 15$; for males, $\bar{x}_m = 73$ and $n_m = 20$; and $\bar{x}_c = 70$.

$$\bar{x}_c = \frac{\bar{x}_f n_f + \bar{x}_m n_m}{n_f + n_m} \;\Rightarrow\; 70 = \frac{15\,\bar{x}_f + (73)(20)}{15 + 20}$$

$$70 \times 35 = 15\,\bar{x}_f + 1460 \;\Rightarrow\; 2450 - 1460 = 15\,\bar{x}_f \;\Rightarrow\; 990 = 15\,\bar{x}_f \;\Rightarrow\; \bar{x}_f = 66$$
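The same rearrangement can be written as two small helper functions. Both function names below are hypothetical and exist only for this illustration; the second simply solves the combined-mean formula for a single unknown group mean.

```python
def combined_mean(means, sizes):
    """Mean of all observations, given each group's mean and size."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)


def missing_group_mean(overall_mean, known_mean, known_n, missing_n):
    """Solve the combined-mean formula for the one unknown group mean."""
    total = overall_mean * (known_n + missing_n)   # sum of all observations
    return (total - known_mean * known_n) / missing_n


female_mean = missing_group_mean(70, 73, 20, 15)
print(female_mean)
print(combined_mean([female_mean, 73], [15, 20]))  # recovers the class mean
```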
• If a wrong figure has been used when calculating the mean, the correct mean can be obtained without repeating the whole process using

$$\text{correct mean} = \text{wrong mean} + \frac{\text{correct value} - \text{wrong value}}{n}$$

Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.

Solution: correct mean $= 65 + \frac{80 - 40}{10} = 65 + 4 = 69$ kg
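As a one-line check of the correction rule, the sketch below wraps it in a function; the function name and argument order are assumptions made for illustration.

```python
def corrected_mean(wrong_mean, n, wrong_value, correct_value):
    """Adjust a mean computed with one misrecorded observation."""
    return wrong_mean + (correct_value - wrong_value) / n


print(corrected_mean(65, 10, wrong_value=40, correct_value=80))
```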
Weighted mean (WM)
• Let $x_1, x_2, \ldots, x_n$ be the values of the items of a series and $w_1, w_2, \ldots, w_n$ their corresponding weights. The weighted mean, denoted by $\bar{x}_w$, is given by

$$WM = \bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}$$

Example: A student obtained the following percentages in an examination: English 60, Biology 75, Mathematics 63, Physics 59 and Chemistry 55. Find the student's weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.

Solution:

$$\bar{x}_w = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i} = \frac{(60)(1) + (75)(2) + (63)(1) + (59)(3) + (55)(3)}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$$
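The weighted mean of the examination example can be reproduced as follows. The dictionaries are just one convenient way to keep each subject's score and weight together; their names are illustrative.

```python
scores = {"English": 60, "Biology": 75, "Mathematics": 63,
          "Physics": 59, "Chemistry": 55}
weights = {"English": 1, "Biology": 2, "Mathematics": 1,
           "Physics": 3, "Chemistry": 3}

weighted_sum = sum(scores[s] * weights[s] for s in scores)  # sum of x_i * w_i
weighted_mean = weighted_sum / sum(weights.values())        # divide by sum of w_i
print(weighted_sum, weighted_mean)
```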
Geometric mean (GM)
• The geometric mean of a set of $n$ observations is the $n$th root of their product.
• The geometric mean of $x_1, x_2, \ldots, x_n$ is denoted by GM and given by

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$

• Taking the logarithms of both sides, we get

$$\log(GM) = \log\left((x_1 x_2 \cdots x_n)^{1/n}\right) = \frac{1}{n}\log(x_1 x_2 \cdots x_n) = \frac{1}{n}\left(\log x_1 + \log x_2 + \cdots + \log x_n\right) = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$

• The logarithm of the GM of a set of observations is the arithmetic mean of their logarithms.

Example: $x_1 = 2$, $x_2 = 4.5$ and $x_3 = 3$

$$GM = \sqrt[3]{2 \times 4.5 \times 3} = \sqrt[3]{27} = 3$$

Harmonic mean (HM)
• The harmonic mean of $x_1, x_2, \ldots, x_n$ is denoted by HM and given by

$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

Example: $x_1 = 2$, $x_2 = 4$, $x_3 = 6$

$$HM = \frac{3}{\frac{1}{2} + \frac{1}{4} + \frac{1}{6}} = \frac{3}{11/12} = \frac{36}{11} \approx 3.27$$

NB: For positive observations, HM ≤ GM ≤ AM.
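Python's standard `statistics` module provides `geometric_mean` and `harmonic_mean` (Python 3.8 or later is assumed), so both worked examples and the ordering HM ≤ GM ≤ AM can be checked directly. The data set used for the ordering check is the harmonic-mean example.

```python
import statistics

gm = statistics.geometric_mean([2, 4.5, 3])   # cube root of 27
hm = statistics.harmonic_mean([2, 4, 6])      # 3 / (1/2 + 1/4 + 1/6)

# Compare all three means on one data set of positive values.
data = [2, 4, 6]
am_data = statistics.mean(data)
gm_data = statistics.geometric_mean(data)
hm_data = statistics.harmonic_mean(data)

print(round(gm, 2), round(hm, 2))
print(hm_data <= gm_data <= am_data)
```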
MODE
• The mode is the value which occurs most frequently in a set of values.
• The mode may not exist, and even if it does exist, it may not be unique.
• In the case of a discrete distribution, the value having the maximum frequency is the modal value.

Examples:
   i. Find the mode of 5, 3, 5, 8, 9. Mode = 5.
   ii. Find the mode of 8, 9, 9, 7, 8, 2 and 5. The data are bimodal: the modes are 8 and 9.
   iii. Find the mode of 4, 12, 3, 6 and 7. There is no mode for these data.
Mode for grouped data
• For a grouped frequency distribution, the mode, denoted by $\hat{x}$, is given by

$$\hat{x} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$$

where
   - $L_{mo}$ is the lower class boundary of the modal class
   - $w$ is the class width
   - $\Delta_1 = f_{mo} - f_1$, where $f_1$ is the frequency of the class preceding the modal class
   - $\Delta_2 = f_{mo} - f_2$, where $f_2$ is the frequency of the class following the modal class
   - $f_{mo}$ is the frequency of the modal class

Note: The modal class is the class with the highest frequency.

Example: The following is the distribution of the size of certain farms selected at random from a district. Find the mode.

Size of farms   No. of farms
5 – 15          8
15 – 25         12
25 – 35         17
35 – 45         29
45 – 55         31
55 – 65         5
65 – 75         3

• The modal class is 45 – 55, with the greatest frequency of 31.
• $L_{mo} = 45$, $w = 10$, $f_1 = 29$, $f_2 = 5$, $f_{mo} = 31$
• $\Delta_1 = f_{mo} - f_1 = 31 - 29 = 2$
• $\Delta_2 = f_{mo} - f_2 = 31 - 5 = 26$

$$\hat{x} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) = 45 + 10\left(\frac{2}{2 + 26}\right) = 45 + 0.71 = 45.71$$
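The grouped-mode formula can be sketched as a function. The name `grouped_mode` is hypothetical, and the treatment of a modal class at either end of the table (a missing neighbour is given frequency 0) is an assumption of this sketch, not something stated in the notes.

```python
def grouped_mode(boundaries, freqs):
    """boundaries: (lower, upper) class boundaries; freqs: class frequencies."""
    m = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    f_mo = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0                # assumed 0 if no preceding class
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0   # assumed 0 if no following class
    lower, upper = boundaries[m]
    d1, d2 = f_mo - f_prev, f_mo - f_next
    return lower + (upper - lower) * d1 / (d1 + d2)


farm_classes = [(5, 15), (15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
farm_freqs = [8, 12, 17, 29, 31, 5, 3]
farm_mode = grouped_mode(farm_classes, farm_freqs)
print(round(farm_mode, 2))
```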
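MEDIAN

• The median is the value of the variable which divides the ordered data into two equal parts.
• It is the middle-most value, in the sense that the number of values less than the median is equal to the number of values greater than it.
• If $x_1, x_2, \ldots, x_n$ are the observations arranged in ascending order, the median, denoted by $\tilde{x}$, is given by

$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right), & \text{if } n \text{ is even} \end{cases}$$

Example: Find the median of the following numbers.
   i. 6, 5, 2, 8, 9, 4
   ii. 2, 1, 8, 3, 5

Solution:
   i. In ascending order: 2, 4, 5, 6, 8, 9 ($n = 6$ is even)

$$\tilde{x} = \frac{1}{2}\left(x_{(3)} + x_{(4)}\right) = \frac{1}{2}(5 + 6) = 5.5$$

   ii. In ascending order: 1, 2, 3, 5, 8 ($n = 5$ is odd)

$$\tilde{x} = x_{\left(\frac{5+1}{2}\right)} = x_{(3)} = 3$$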
Median for grouped data
• The median for grouped data is given by

$$\tilde{x} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - c\right)$$

where
   - $L_{med}$ is the lower class boundary of the median class
   - $w$ is the class width
   - $n$ is the total number of observations
   - $c$ is the cumulative frequency (less than type) of the class preceding the median class
   - $f_{med}$ is the frequency of the median class

Note: The median class is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{n}{2}$.

Example: Find the median of the following distribution.
QUANTILES
• Quartiles, deciles and percentiles are measures that depend on their position in the distribution; collectively they are called quantiles.

i. Quartiles
• Quartiles are measures that divide the frequency distribution into four equal parts.
• The values of the variable corresponding to these divisions are denoted by $Q_1$, $Q_2$ and $Q_3$, often called the first, the second and the third quartile respectively.
• To find $Q_i$ ($i = 1, 2, 3$) we count $\frac{in}{4}$ of the observations, beginning from the lowest class. For grouped data,

$$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{in}{4} - c\right), \quad i = 1, 2, 3$$

where
   - $L_{Q_i}$ is the lower class boundary of the quartile class
   - $w$ is the class width
   - $n$ is the total number of observations
   - $f_{Q_i}$ is the frequency of the quartile class
   - $c$ is the cumulative frequency (less than type) of the class preceding the quartile class
• Remark: The quartile class (the class containing $Q_i$) is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{in}{4}$.

ii. Deciles
• Deciles are measures that divide the frequency distribution into ten equal parts.
• The values of the variable corresponding to these divisions are denoted by $D_1, D_2, \ldots, D_9$, often called the first, the second, ..., the ninth decile respectively.
• For grouped data, we have the following formula:

$$D_i = L_{D_i} + \frac{w}{f_{D_i}}\left(\frac{in}{10} - c\right), \quad i = 1, 2, \ldots, 9$$

where
   - $L_{D_i}$ is the lower class boundary of the class containing $D_i$
   - $w$ is the class width
   - $n$ is the total number of observations
   - $f_{D_i}$ is the frequency of the decile class
   - $c$ is the cumulative frequency (less than type) of the class preceding the decile class
• The class containing $D_i$ ($i = 1, 2, \ldots, 9$) is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{in}{10}$.

iii. Percentiles
• Percentiles are measures that divide the frequency distribution into a hundred equal parts.
• The values of the variable corresponding to these divisions are denoted by $P_1, P_2, \ldots, P_{99}$, often called the first, the second, ..., the ninety-ninth percentile respectively.
• For grouped data, we have the following formula:

$$P_i = L_{P_i} + \frac{w}{f_{P_i}}\left(\frac{in}{100} - c\right), \quad i = 1, 2, \ldots, 99$$

where
   - $L_{P_i}$ is the lower class boundary of the class containing $P_i$
   - $w$ is the class width
   - $n$ is the total number of observations
   - $f_{P_i}$ is the frequency of the percentile class
   - $c$ is the cumulative frequency (less than type) of the class preceding the percentile class
• The class containing $P_i$ ($i = 1, 2, \ldots, 99$) is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{in}{100}$.
Example: For the following frequency distribution, find
   a. all quartiles
   b. the 7th decile
   c. the 90th percentile

Values       Frequency   Cum. freq. (less than type)
140 – 150    17          17
150 – 160    29          46
160 – 170    42          88
170 – 180    72          160
180 – 190    84          244
190 – 200    107         351
200 – 210    49          400
210 – 220    34          434
220 – 230    31          465
230 – 240    16          481
240 – 250    12          493

Solution:

a. $Q_1$: $\frac{1 \times n}{4} = \frac{1 \times 493}{4} = 123.25$, so the class containing $Q_1$ is 170 – 180, with $L_{Q_1} = 170$, $w = 10$, $f_{Q_1} = 72$, $c = 88$, $n = 493$.

$$Q_1 = L_{Q_1} + \frac{w}{f_{Q_1}}\left(\frac{1 \times n}{4} - c\right) = 170 + \frac{10}{72}(123.25 - 88) = 170 + 4.90 = 174.90$$

   - This implies that about 25% of the observations are less than or equal to 174.90, and about 75% of them are greater than or equal to 174.90.

$Q_2$: $\frac{2 \times n}{4} = \frac{2 \times 493}{4} = 246.5$, so the class containing $Q_2$ is 190 – 200, with $L_{Q_2} = 190$, $w = 10$, $f_{Q_2} = 107$, $c = 244$.

$$Q_2 = L_{Q_2} + \frac{w}{f_{Q_2}}\left(\frac{2 \times n}{4} - c\right) = 190 + \frac{10}{107}(246.5 - 244) = 190 + 0.23 = 190.23$$

$Q_3$: $\frac{3 \times n}{4} = \frac{3 \times 493}{4} = \frac{1479}{4} = 369.75$, so the class containing $Q_3$ is 200 – 210, with $L_{Q_3} = 200$, $w = 10$, $f_{Q_3} = 49$, $c = 351$, $n = 493$.

$$Q_3 = L_{Q_3} + \frac{w}{f_{Q_3}}\left(\frac{3 \times n}{4} - c\right) = 200 + \frac{10}{49}(369.75 - 351) = 200 + 3.83 = 203.83$$

   - This implies that about 75% of the observations are less than or equal to 203.83, and about 25% of them are greater than or equal to 203.83.

b. $D_7$: $\frac{7 \times 493}{10} = \frac{3451}{10} = 345.1$, so the class containing $D_7$ is 190 – 200, with $L_{D_7} = 190$, $w = 10$, $f_{D_7} = 107$, $c = 244$.

$$D_7 = L_{D_7} + \frac{w}{f_{D_7}}\left(\frac{7 \times n}{10} - c\right) = 190 + \frac{10}{107}(345.1 - 244) = 190 + 9.45 = 199.45$$

   - This implies that about 70% of the observations are less than or equal to 199.45, and about 30% of them are greater than or equal to 199.45.

Exercise: find $D_1$, $D_4$ and $D_{10}$.

c. $P_{90}$: $\frac{90 \times 493}{100} = \frac{44370}{100} = 443.7$, so the class containing $P_{90}$ is 220 – 230, with $L_{P_{90}} = 220$, $w = 10$, $f_{P_{90}} = 31$, $c = 434$, $n = 493$.

$$P_{90} = L_{P_{90}} + \frac{w}{f_{P_{90}}}\left(\frac{90 \times n}{100} - c\right) = 220 + \frac{10}{31}(443.7 - 434) = 220 + 3.13 = 223.13$$

   - This implies that about 90% of the observations are less than or equal to 223.13, and about 10% of them are greater than or equal to 223.13.
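Quartiles, deciles and percentiles share one formula that differs only in the number of equal parts, so a single function can cover all three. The sketch below hard-codes the example's table; `grouped_quantile` and the list names are illustrative, and equal class widths are assumed.

```python
from itertools import accumulate

lower_bounds = [140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240]
freqs = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]
width = 10  # common class width (assumed equal for every class)


def grouped_quantile(i, parts):
    """i-th quantile when the distribution is divided into `parts` equal parts."""
    n = sum(freqs)
    target = i * n / parts
    cum = list(accumulate(freqs))            # cumulative frequency, less than type
    # First class whose cumulative frequency reaches the target position.
    idx = next(j for j, cf in enumerate(cum) if cf >= target)
    c = cum[idx - 1] if idx > 0 else 0       # cumulative frequency before that class
    return lower_bounds[idx] + width / freqs[idx] * (target - c)


for label, i, parts in [("Q1", 1, 4), ("Q2", 2, 4), ("Q3", 3, 4),
                        ("D7", 7, 10), ("P90", 90, 100)]:
    print(label, round(grouped_quantile(i, parts), 2))
```

Passing `parts=4`, `10` or `100` selects quartiles, deciles or percentiles respectively.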
CHAPTER 4
MEASURES OF DISPERSION/VARIATION
• The scatter or spread of the items of a distribution is known as dispersion or variation.
• In other words, it is the degree to which numerical data tend to spread about an average value.
• Measures of dispersion are statistical measures which quantify the extent to which data are dispersed or spread out.

Types of Measures of Dispersion

• The most commonly used measures of dispersion are:
   i. Range (R) and relative range (RR)
   ii. Quartile deviation (Q.D) and coefficient of quartile deviation (C.Q.D)
   iii. Mean deviation (M.D) and coefficient of mean deviation (C.M.D)
   iv. Standard deviation (S.D) and coefficient of variation (C.V)

The Range (R)
• The range is the largest score minus the smallest score: $R = L - S$.
• It is a quick but crude measure of variability.

Range for grouped data
• If the data are given as a continuous frequency distribution, the range is computed as

$$R = UCL_L - UCL_F \quad \text{or} \quad R = X_L - X_F$$

where
   - $UCL_L$ is the upper class limit of the last class
   - $UCL_F$ is the upper class limit of the first class
   - $X_L$ is the class mark of the last class
   - $X_F$ is the class mark of the first class

Relative Range (RR)
• It is also sometimes called the coefficient of range and is given by

$$RR = \frac{L - S}{L + S} = \frac{R}{L + S}$$

Quartile Deviation (Q.D)
• It is half the difference between the third and the first quartiles of a set of items:

$$Q.D = \frac{Q_3 - Q_1}{2}$$

Coefficient of Quartile Deviation (C.Q.D)

$$C.Q.D = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{2 \times Q.D}{Q_3 + Q_1} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
Example: Compute the Q.D and the C.Q.D for the previous example.

Solution:
• In the previous chapter we obtained the values of all quartiles as $Q_1 = 174.90$, $Q_2 = 190.23$ and $Q_3 = 203.83$.

$$Q.D = \frac{Q_3 - Q_1}{2} = \frac{203.83 - 174.90}{2} = \frac{28.93}{2} \approx 14.47$$

$$C.Q.D = \frac{203.83 - 174.90}{203.83 + 174.90} = \frac{28.93}{378.73} \approx 0.076$$
The Mean Deviation (M.D)
• The M.D of a set of items is defined as the arithmetic mean of the absolute deviations from a given average.
• Depending on the type of average used, we have different mean deviations.

i. Mean deviation about the mean
• Denoted by $M.D(\bar{x})$ and given by

$$M.D(\bar{x}) = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

• For a frequency distribution with $k$ distinct values or classes, it is given by

$$M.D(\bar{x}) = \frac{\sum_{i=1}^{k} f_i\,|x_i - \bar{x}|}{n}$$

ii. Mean deviation about the median
• Denoted by $M.D(\tilde{x})$ and given by

$$M.D(\tilde{x}) = \frac{\sum_{i=1}^{n} |x_i - \tilde{x}|}{n}$$

• For a frequency distribution it is given by

$$M.D(\tilde{x}) = \frac{\sum_{i=1}^{k} f_i\,|x_i - \tilde{x}|}{n}$$

iii. Mean deviation about the mode
• Denoted by $M.D(\hat{x})$ and given by

$$M.D(\hat{x}) = \frac{\sum_{i=1}^{n} |x_i - \hat{x}|}{n}$$

• For a frequency distribution it is given by

$$M.D(\hat{x}) = \frac{\sum_{i=1}^{k} f_i\,|x_i - \hat{x}|}{n}$$

Coefficient of Mean Deviation (C.M.D)

$$C.M.D(\bar{x}) = \frac{M.D(\bar{x})}{\bar{x}}, \qquad C.M.D(\tilde{x}) = \frac{M.D(\tilde{x})}{\tilde{x}}, \qquad C.M.D(\hat{x}) = \frac{M.D(\hat{x})}{\hat{x}}$$
Example: The following are the numbers of visits made by ten mothers to the local doctor's surgery: 8, 6, 5, 5, 7, 4, 5, 9, 7, 4.
• Find the mean deviation about the mean, the median and the mode, and their coefficients.

Solution: mean $\bar{x} = 6$, mode $\hat{x} = 5$, median $\tilde{x} = 5.5$.
• Take the absolute deviations of each observation from these averages:

x_i            4     4     5     5     5     6     7     7     8     9     Total
|x_i − 6|      2     2     1     1     1     0     1     1     2     3     14
|x_i − 5.5|    1.5   1.5   0.5   0.5   0.5   0.5   1.5   1.5   2.5   3.5   14
|x_i − 5|      1     1     0     0     0     1     2     2     3     4     14

$$M.D(\bar{x}) = \frac{\sum_{i=1}^{10} |x_i - 6|}{10} = \frac{14}{10} = 1.4$$

$$M.D(\tilde{x}) = \frac{\sum_{i=1}^{10} |x_i - 5.5|}{10} = \frac{14}{10} = 1.4$$

$$M.D(\hat{x}) = \frac{\sum_{i=1}^{10} |x_i - 5|}{10} = \frac{14}{10} = 1.4$$

$$C.M.D(\bar{x}) = \frac{M.D(\bar{x})}{\bar{x}} = \frac{1.4}{6} = 0.233$$

$$C.M.D(\tilde{x}) = \frac{M.D(\tilde{x})}{\tilde{x}} = \frac{1.4}{5.5} = 0.255$$

$$C.M.D(\hat{x}) = \frac{M.D(\hat{x})}{\hat{x}} = \frac{1.4}{5} = 0.28$$
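The three mean deviations and their coefficients can be recomputed with the standard `statistics` module. The helper `mean_deviation` and the dictionary names are made up for this sketch.

```python
import statistics

visits = [8, 6, 5, 5, 7, 4, 5, 9, 7, 4]


def mean_deviation(data, center):
    """Arithmetic mean of the absolute deviations from `center`."""
    return sum(abs(x - center) for x in data) / len(data)


centers = {
    "mean": statistics.mean(visits),
    "median": statistics.median(visits),
    "mode": statistics.mode(visits),
}

# For each average: (mean deviation, coefficient of mean deviation).
results = {name: (mean_deviation(visits, c), mean_deviation(visits, c) / c)
           for name, c in centers.items()}

for name, (md, cmd) in results.items():
    print(name, centers[name], round(md, 2), round(cmd, 3))
```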
THE VARIANCE
• The variance is the average squared deviation from the mean.

Population variance ($\sigma^2$)

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

• For a frequency distribution, the population variance is expressed as

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{k} f_i\,(x_i - \mu)^2$$

Sample variance ($s^2$)

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

• For a frequency distribution, the sample variance is expressed as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i\,(x_i - \bar{x})^2$$

Standard deviation
• Sample standard deviation: $s = \sqrt{s^2}$
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Coefficient of Variation (C.V)
• It is the ratio of the standard deviation to the mean, usually expressed as a percentage:

$$C.V = \frac{s}{\bar{x}} \times 100$$
• The distribution with the smaller C.V is said to be less variable, or more consistent.

Example: Find the C.V for the following sample data: 5, 17, 12, 10.

Solution: $\bar{x} = \frac{44}{4} = 11$

x_i               5     10    12    17    Total
(x_i − x̄)²        36    1     1     36    74

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{74}{3} \approx 24.67$$

$$s = \sqrt{s^2} = \sqrt{24.67} \approx 4.97$$

$$C.V = \frac{s}{\bar{x}} \times 100 = \frac{4.97}{11} \times 100 \approx 45\%$$
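The sample variance, standard deviation and coefficient of variation can be verified with the standard `statistics` module, whose `variance` and `stdev` functions use the $n - 1$ divisor and whose `pvariance` uses the $N$ divisor. Variable names are illustrative.

```python
import statistics

sample = [5, 17, 12, 10]

mean = statistics.mean(sample)
s2 = statistics.variance(sample)          # sample variance, divisor n - 1
s = statistics.stdev(sample)              # sample standard deviation
cv = s / mean * 100                       # coefficient of variation, in percent

# For contrast: treating the same four values as a whole population (divisor N).
sigma2 = statistics.pvariance(sample)

print(mean, round(s2, 2), round(s, 2), round(cv), sigma2)
```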
CHAPTER 5
ELEMENTARY PROBABILITY

CHAPTER 7
SAMPLING AND SAMPLING DISTRIBUTION
Definitions
• Parameter: a characteristic or measure obtained from a population.
• Statistic: a characteristic or measure obtained from a sample.
• Sampling: the process or method of selecting a sample from the population.
• Sampling unit: an element of the population to be sampled.
• Sampling frame: the list of all elements in a population.
Example: In a study of the academic performance of students in some college,
   - Sampling unit = a student
   - Sampling frame = the list of students
Types of errors in a sample survey
There are two types of errors:
• Sampling error: the discrepancy between the population value and the sample value.
   - It may arise from the use of inappropriate sampling techniques.
• Non-sampling errors: errors due to procedural bias, such as
   - incorrect responses
   - measurement errors
   - errors at different stages of processing the data

Why we need sampling rather than a census
• Reduced cost
• Greater speed
• Greater accuracy
• Greater scope
• More detailed information can be obtained.
TYPES OF SAMPLING
• There are two types of sampling: random (probability) sampling and non-random (non-probability) sampling.

1. Random sampling (probability sampling)
• All elements in the population have a pre-assigned non-zero probability of being included in the sample.

Simple Random Sampling (SRS)
• Every possible sample of a specific size has an equal chance of being selected.
• All elements in the population have the same pre-assigned non-zero probability of being included in the sample.
• Sampling may be with or without replacement.
• Subjects are selected using the lottery method or a table of random numbers.
Stratified Random Sampling
• The population is divided into non-overlapping but exhaustive groups called strata.
• Subjects are chosen from each stratum.
• Elements within the same stratum should be more or less homogeneous, but elements in different strata should differ.
• It is applied if the population is heterogeneous.

Cluster Sampling (CS)
• The population is divided into non-overlapping groups called clusters.
• A sample of clusters is chosen, i.e. subjects are selected by using intact groups that are representative of the population, and all the sampling units in the selected clusters are surveyed.
• Elements within a cluster are heterogeneous (dissimilar).
• It is useful when it is difficult or costly to generate a simple random sample.
Systematic Sampling (SS)
• A complete list of all elements within the population (a sampling frame) is required.
• The procedure starts by determining the first element to be included in the sample.
• Then every $k$th item is taken from the sampling frame.
• Let $N$ = population size, $n$ = sample size and $k = \frac{N}{n}$ = sampling interval.
• Choose any number between 1 and $k$; suppose it is $j$ ($1 \le j \le k$).
• The $j$th unit is selected first, and then the $(j + k)$th, $(j + 2k)$th, ... units, until the required sample size is reached.
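The selection rule can be sketched in a few lines. The function name `systematic_sample` and the optional `rng` argument are illustrative; the sketch assumes $n \le N$ and takes the integer part of $N/n$ as the interval when $N$ is not a multiple of $n$.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Pick n units from the sampling frame at a fixed interval k."""
    N = len(frame)
    k = N // n                 # sampling interval (integer part of N / n)
    j = rng.randint(1, k)      # random start, 1 <= j <= k
    # Positions j, j + k, j + 2k, ... are 1-based, hence the "- 1".
    return [frame[j - 1 + i * k] for i in range(n)]


frame = list(range(1, 101))    # a hypothetical frame of 100 numbered units
print(systematic_sample(frame, 10, random.Random(0)))
```

Passing a seeded `random.Random` instance makes the draw repeatable, which is convenient when documenting how a sample was chosen.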


2. Non-random sampling (non-probability sampling)
• The choice of individuals for the sample is made on the basis of convenience, personal choice or interest.

Judgment Sampling
• The person taking the sample has direct or indirect control over which items are selected for the sample.

Convenience Sampling
• The decision maker selects a sample from the population in a manner that is relatively easy and convenient.

Quota Sampling
• The decision maker requires the sample to contain a certain number of items with a given characteristic.
CHAPTER 8
ESTIMATION AND HYPOTHESIS TESTING

What is Inference?
• It is the process of making interpretations or drawing conclusions about the whole population from sample data.
• Only the sample data are available for inference.
• In statistics there are two ways through which inference can be made: statistical estimation and statistical hypothesis testing.

Statistical Estimation
• This is one way of making inference about a population parameter when the investigator does not have any prior notion about the value or characteristics of the parameter.
There are two types of statistical estimation:
1. Point estimation: a procedure that results in a single value as an estimate of a parameter.
2. Interval estimation: a procedure that results in an interval of values as an estimate of a parameter.

Definitions
• Confidence interval: an interval estimate with a specific level of confidence.
• Estimator: a sample statistic which is used to estimate a population parameter. It should be unbiased, consistent and relatively efficient.
• Estimate: any of the different possible values which an estimator can assume.
• Unbiased estimator: an estimator whose expected value is the value of the parameter being estimated.
• Consistent estimator: an estimator which gets closer to the value of the parameter as the sample size increases.
• Relatively efficient estimator: the estimator of a parameter with the smallest variance.

Point estimation
• $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ is a point estimator of the population mean $\mu$.

Interval estimation of the population mean
Case 1: If the sample size is large, or if the population is normal with known variance $\sigma^2$,

$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \quad \text{is a } 100(1-\alpha)\% \text{ confidence interval for } \mu$$

But usually $\sigma^2$ is not known; in that case we estimate it by its point estimator $s^2$:

$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} \quad \text{is a } 100(1-\alpha)\% \text{ confidence interval for } \mu$$

Case 2: If the sample size is small and the population variance $\sigma^2$ is not known,

$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} \quad \text{is a } 100(1-\alpha)\% \text{ confidence interval for } \mu$$

where $t_{\alpha/2}$ is taken from the t table with $n - 1$ degrees of freedom.

Example: From a sample of size 25 drawn from a normal population, a mean of 32 was found. Given
that the population standard deviation is 4.2, find a 95% confidence interval
for the population mean.
Solution:
$\bar{x} = 32$, $\sigma = 4.2$, $1-\alpha = 0.95 \Rightarrow \alpha = 0.05$, $\alpha/2 = 0.025$, $z_{\alpha/2} = 1.96$, $n = 25$
A 95% CI = $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
= $32 \pm 1.96 \times 4.2/\sqrt{25}$
= $32 \pm 1.65$
= (30.35, 33.65)
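
The same interval can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name z_confidence_interval is hypothetical, and the critical value is taken from Python's standard-library NormalDist instead of a printed z table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    alpha = 1 - confidence
    # Upper alpha/2 critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin


# Values from the example: x-bar = 32, sigma = 4.2, n = 25.
lower, upper = z_confidence_interval(mean=32, sigma=4.2, n=25)
print(round(lower, 2), round(upper, 2))
```

Rounded to two decimals, the result agrees with the interval (30.35, 33.65) obtained by hand.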

Example: A drug company is testing a new drug which is supposed to reduce blood
pressure. From the six people who are used as subjects, it is found that the average
drop in blood pressure is 2.28 points, with a standard deviation of 0.95 points. What is
the 95% confidence interval for the mean change in pressure?
Solution:
$\bar{x} = 2.28$, $s = 0.95$, $n = 6$, $1-\alpha = 0.95 \Rightarrow \alpha = 0.05$, $\alpha/2 = 0.025$, $t_{\alpha/2} = 2.571$ with df = 5 from the table
A 95% CI = $\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$
= $2.28 \pm 2.571 \times 0.95/\sqrt{6}$
= $2.28 \pm 0.997$
= (1.28, 3.28)
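
A small-sample interval can be computed the same way. This is only a sketch under stated assumptions: the function name t_confidence_interval is hypothetical, and because the Python standard library has no t-distribution, the critical value 2.571 is passed in exactly as read from the t table.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_critical):
    """Return (lower, upper) using a t critical value read from a table."""
    margin = t_critical * s / sqrt(n)
    return mean - margin, mean + margin


# t_critical = 2.571 is the tabulated t value for df = 5 and alpha/2 = 0.025.
lower, upper = t_confidence_interval(mean=2.28, s=0.95, n=6, t_critical=2.571)
print(round(lower, 2), round(upper, 2))
```

Rounded to two decimals, this gives the interval (1.28, 3.28).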
HYPOTHESIS TESTING
Definitions:
Statistical hypothesis:
 is a statement about the population whose plausibility is to be evaluated on
the basis of the sample data.
Test statistic
 is a statistic whose value serves to determine whether to reject or accept the
hypothesis to be tested. It is a random variable.
Statistical test
 is a test or procedure used to evaluate a statistical hypothesis; its outcome
depends on the sample data
 Types of hypothesis
Null hypothesis

 It is the hypothesis to be tested

 It is the hypothesis of equality or the hypothesis of no difference

 Usually denoted by H0

Alternative hypothesis:

 It is the hypothesis available when the null hypothesis has to be rejected.

 It is the hypothesis of difference

 Usually denoted by H1 or Ha
 Types of errors in hypothesis testing
Type I error: Rejecting the null hypothesis when it is true
Type II error: Failing to reject the null hypothesis when it is false
 For a fixed sample size, the probability of a Type I error (α) and the probability of a
Type II error (β) are inversely related and therefore cannot be minimized at the same time
Hypothesis testing about the population mean, $\mu$
 Suppose the assumed or hypothesized value of $\mu$ is denoted by $\mu_0$; then one can
formulate two-sided (1) and one-sided (2 and 3) hypotheses as follows:

(1) $H_0: \mu = \mu_0$ Vs $H_1: \mu \neq \mu_0$
(2) $H_0: \mu = \mu_0$ Vs $H_1: \mu > \mu_0$
(3) $H_0: \mu = \mu_0$ Vs $H_1: \mu < \mu_0$
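Case 1: When sampling is from a normal distribution with known variance $\sigma^2$
- The relevant test statistic is given by
$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
 Summary table for decision rule ($H_0: \mu = \mu_0$)

$H_1$               Reject $H_0$ if               Accept $H_0$ if               Inconclusive if
$\mu \neq \mu_0$    $|Z_{cal}| > Z_{\alpha/2}$    $|Z_{cal}| < Z_{\alpha/2}$    $Z_{cal} = Z_{\alpha/2}$ or $Z_{cal} = -Z_{\alpha/2}$
$\mu > \mu_0$       $Z_{cal} > Z_{\alpha}$        $Z_{cal} < Z_{\alpha}$        $Z_{cal} = Z_{\alpha}$
$\mu < \mu_0$       $Z_{cal} < -Z_{\alpha}$       $Z_{cal} > -Z_{\alpha}$       $Z_{cal} = -Z_{\alpha}$

Where $Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$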

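Case 2: When sampling is from a normal distribution with unknown variance $\sigma^2$
and small sample size
- The relevant test statistic is
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ ~ t-distribution with $n - 1$ degrees of freedom

$H_1$               Reject $H_0$ if               Accept $H_0$ if               Inconclusive if
$\mu \neq \mu_0$    $|t_{cal}| > t_{\alpha/2}$    $|t_{cal}| < t_{\alpha/2}$    $t_{cal} = t_{\alpha/2}$ or $t_{cal} = -t_{\alpha/2}$
$\mu > \mu_0$       $t_{cal} > t_{\alpha}$        $t_{cal} < t_{\alpha}$        $t_{cal} = t_{\alpha}$
$\mu < \mu_0$       $t_{cal} < -t_{\alpha}$       $t_{cal} > -t_{\alpha}$       $t_{cal} = -t_{\alpha}$

Where $t_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ and the critical values are taken with $n - 1$ degrees of freedom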
Case 3: When sampling is from a non-normally distributed population or
a population whose functional form is unknown
- If the sample size is large, one can test a hypothesis about the mean by
using:

$Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ if $\sigma^2$ is known

$Z_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ if $\sigma^2$ is unknown

GENERAL STEPS IN HYPOTHESIS TESTING

1. Specify the null hypothesis (H0) and the alternative hypothesis (H1).

2. Select a significance level, α.

3. Identify the sampling distribution of the estimator.

4. Calculate a statistic analogous to the parameter specified by the null
hypothesis.

5. Identify the critical region.

6. Make a decision.

7. Summarize the result.

Example: Test the hypothesis that the average content of containers of a certain
lubricant is 10 liters if the contents of a random sample of 10 containers are 10.2, 9.7,
10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use the 0.01 level of significance and
assume that the distribution of contents is normal.

Solution:

Step 1: Specify/state the appropriate hypothesis
$H_0: \mu = 10$ Vs $H_1: \mu \neq 10$
Step 2: Select the level of significance, α = 0.01 (given)
Step 3: Select an appropriate test statistic
The t-statistic is appropriate because the population variance is not known and the sample
size is also small
Step 4: Identify the critical region
Here we have two critical regions since we have a two-tailed hypothesis
- Thus the critical region is $|t_{cal}| > t_{0.005}(9) = 3.2498$
Step 5: Computations:
$\bar{x} = 10.06$, $s = 0.25$
$t_{cal} = \frac{10.06 - 10}{0.25/\sqrt{10}} = 0.76$
Step 6: Decision
We do not reject (accept) H0, since tcal is in the acceptance region
Step 7: Conclusion
At the 1% level of significance, we have no evidence to say that the average
content of containers of the given lubricant is different from 10 liters, based on the
given sample data.
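
The computation step of this example can be reproduced from the raw data. The sketch below is illustrative only: the variable names are assumptions, and the critical value 3.2498 is the tabulated t value quoted in the solution. Because the code keeps the unrounded standard deviation (about 0.246), it gives a t value of about 0.77 instead of the 0.76 obtained with s rounded to 0.25; the decision is the same.

```python
from math import sqrt
from statistics import mean, stdev

# Sample contents (liters) from the example.
contents = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]
mu0 = 10.0           # hypothesized population mean
t_critical = 3.2498  # tabulated t value, df = 9, alpha/2 = 0.005

x_bar = mean(contents)
s = stdev(contents)  # sample standard deviation (divides by n - 1)
t_cal = (x_bar - mu0) / (s / sqrt(len(contents)))

# Two-tailed test: reject H0 only if |t_cal| exceeds the critical value.
reject_h0 = abs(t_cal) > t_critical
print(round(x_bar, 2), round(s, 2), round(t_cal, 2), reject_h0)
```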

Example: The mean lifetime of a sample of 16 fluorescent light bulbs produced
by a company is computed to be 1570 hours. The population standard deviation is
120 hours. Suppose the hypothesized value for the population mean is 1600
hours. Can we conclude that the lifetime of light bulbs is decreasing?
(Use α = 0.05 and assume the normality of the population)

Solution:
$\mu_0 = 1600$, $\bar{x} = 1570$, $\sigma = 120$, $n = 16$

Step 1: Identify the appropriate hypothesis
$H_0: \mu = 1600$ Vs $H_1: \mu < 1600$
Step 2: Select the level of significance, α = 0.05 (given)
Step 3: Select an appropriate test statistic
 The Z-statistic is appropriate because the population variance is known
Step 4: Identify the critical region
 The critical region is $Z_{cal} < -Z_{0.05} = -1.645$
 (-1.645, ∞) is the acceptance region
Step 5: Computations:
$Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{1570 - 1600}{120/\sqrt{16}} = \frac{-30}{30} = -1.0$
Step 6: Decision
Accept H0, since Zcal = -1.0 is in the acceptance region.
Step 7: Conclusion
At the 5% level of significance, we have no evidence to say that the lifetime of
light bulbs is decreasing, based on the given sample data
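
The decision rules for the Z test can be collected into one small function. This is a hedged sketch, not the method of the notes: the function name z_test and the labels "two-sided", "less" and "greater" are hypothetical, and the critical values come from the standard-library NormalDist instead of a z table.

```python
from math import sqrt
from statistics import NormalDist


def z_test(x_bar, mu0, sigma, n, alpha, alternative):
    """Return (z_cal, reject_h0) for H0: mu = mu0 with known sigma.

    alternative is "two-sided", "less" or "greater" (labels chosen for this sketch).
    """
    z_cal = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":
        reject = abs(z_cal) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "less":
        reject = z_cal < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":
        reject = z_cal > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("unknown alternative: " + alternative)
    return z_cal, reject


# Light bulb example: H0: mu = 1600 Vs H1: mu < 1600 at alpha = 0.05.
z_cal, reject_h0 = z_test(1570, 1600, 120, 16, alpha=0.05, alternative="less")
print(z_cal, reject_h0)
```

For the light bulb data the statistic is -1.0, which is greater than -1.645, so H0 is not rejected.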

TEST OF ASSOCIATION
 The chi-square test is used to test the hypothesis of independence
of two attributes, say A and B, where A has $r$ categories and B has
$c$ categories

 For instance, we may be interested in

 Whether the presence or absence of hypertension is independent of smoking
habit or not.

 Whether the size of the family is independent of the level of education
attained by the mothers.

 Whether there is association between father and son regarding boldness

 The Chi-square statistic is given by

$\chi^2_{cal} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2$ with $df = (r-1)(c-1)$

 Where $o_{ij}$ is the number of units that belong to category $i$ of $A$ and category $j$ of $B$

The expected frequency $e_{ij}$ is given by
$e_{ij} = \frac{R_i \times C_j}{n}$

Where $R_i$ is the $i$th row total, $C_j$ is the $j$th column total and $n$ is the total
number of observations.

 The null and alternative hypotheses may be stated as:

 $H_0$: There is no association between $A$ and $B$

 $H_1$: There is association between $A$ and $B$

 Decision: we reject $H_0$ if

$\chi^2_{cal} > \chi^2_{\alpha}$ with $df = (r-1)(c-1)$

at the α level of significance

Example: A geneticist took a random sample of 300 men to study whether there
is association between father and son regarding boldness (use α = 5%)

                 Son
Father      Bold    Not
Bold         85      59
Not          65      91

Solution:

$H_0$: There is no association between father and son regarding boldness Vs
$H_1$: not $H_0$

 First calculate the row and column totals

$R_1 = 144$, $R_2 = 156$, $C_1 = 150$, $C_2 = 150$, $n = 300$

 Then calculate the expected frequencies ($e_{ij}$)

$e_{11} = \frac{R_1 \times C_1}{n} = \frac{144 \times 150}{300} = 72$

$e_{12} = \frac{R_1 \times C_2}{n} = \frac{144 \times 150}{300} = 72$

$e_{21} = \frac{R_2 \times C_1}{n} = \frac{156 \times 150}{300} = 78$

$e_{22} = \frac{R_2 \times C_2}{n} = \frac{156 \times 150}{300} = 78$

$\chi^2_{cal} = \frac{(85-72)^2}{72} + \frac{(59-72)^2}{72} + \frac{(65-78)^2}{78} + \frac{(91-78)^2}{78} = 9.028$

 Obtain the tabulated value of chi-square

$\chi^2_{tab}$ (df = 1) = 3.841

Decision: we reject $H_0$ since

$\chi^2_{cal} = 9.028 > \chi^2_{tab}$ (df = 1) = 3.841

Conclusion:

At the 5% level of significance we have evidence to say there is association
between father and son regarding boldness, based on this sample data.
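
The whole calculation for this table can be sketched in a few lines. The variable names below are assumptions made for illustration, and the critical value 3.841 is the tabulated chi-square value used in the solution; nothing here relies on an external statistics library.

```python
# Observed frequencies: rows are father (Bold, Not), columns are son (Bold, Not).
observed = [[85, 59],
            [65, 91]]
chi2_critical = 3.841  # tabulated chi-square value, df = 1, alpha = 0.05

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2_cal = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n  # expected frequency
        chi2_cal += (o_ij - e_ij) ** 2 / e_ij

df = (len(observed) - 1) * (len(observed[0]) - 1)
reject_h0 = chi2_cal > chi2_critical
print(round(chi2_cal, 3), df, reject_h0)
```

The same loop works for any r x c table of counts, provided the matching tabulated critical value for df = (r-1)(c-1) is supplied.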

CHAPTER - 9
SIMPLE LINEAR REGRESSION AND CORRELATION
 Linear regression and correlation deal with studying and measuring the linear
relationship between two or more variables.
 When only two variables are involved, the analysis is referred to as simple
correlation and simple linear regression analysis.
 When there are more than two variables, the terms multiple regression and
partial correlation are used.
Regression Analysis: is a statistical technique that can be used to develop a
mathematical equation showing how variables are related
Correlation Analysis: deals with the measurement of the closeness of the
relationship which is described by the regression equation
 The correlation coefficient ($r$) computed from the sample data
measures the strength and direction of a linear relationship between
two quantitative variables.
 The correlation coefficient between variables X and Y is given by
$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{(\sum X^2 - n\bar{X}^2)(\sum Y^2 - n\bar{Y}^2)}}$
 The range of the correlation coefficient is from -1 to 1.
 If there is a strong positive linear relationship between X and Y, the value of r
will be close to 1.
 If there is a strong negative linear relationship between X and Y, the
value of r will be close to -1.
 When there is no linear relationship between X and Y or only a weak
relationship, the value of r will be close to 0.
SIMPLE LINEAR REGRESSION
 Simple linear regression refers to the linear relationship between two variables.

 We usually denote the dependent variable by $Y$ and the independent variable by $X$.

 A simple regression line is the line fitted to the points plotted in the scatter
diagram, which describes the average relationship between the two variables.

 Therefore, to see the type of relationship, it is advisable to prepare a scatter plot
before fitting the model.

 The linear model is given by

$Y = \alpha + \beta X + \epsilon$

 Where $Y$ = dependent variable, $X$ = independent variable, $\alpha$ = regression
constant, $\beta$ = regression slope, $\epsilon$ = random disturbance term which has a
normal distribution with mean 0 and variance $\sigma^2$

 The above model is estimated by

$\hat{Y} = a + bX$
Where $a$ is the constant term or intercept and $b$ is the slope/coefficient of $X$

$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$ and $a = \bar{Y} - b\bar{X}$
Example: The following data show two variables, the mid-semester exam score (X)
and the final exam score (Y), of 10 students (both out of 50)

Student    Mid exam (X)    Final exam (Y)
1          31              31
2          23              29
3          41              34
4          32              35
5          29              25
6          33              35
7          28              33
8          31              42

A. Calculate the simple correlation coefficient ($r$)

B. Fit a regression line of final exam (Y) on mid-semester exam (X)

C. Predict the final exam score (Y) of a student who scored 40 on the mid-semester exam (X)

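Solution
$n = 10$, $\bar{X} = 31.2$, $\bar{Y} = 32.9$, $\bar{X}^2 = 973.4$, $\bar{Y}^2 = 1082.4$, $\sum XY = 10331$,
$\sum X^2 = 9920$, $\sum Y^2 = 11003$

A) $r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{(\sum X^2 - n\bar{X}^2)(\sum Y^2 - n\bar{Y}^2)}}$

= $\frac{10331 - 10 \times 31.2 \times 32.9}{\sqrt{(9920 - 10 \times 973.4)(11003 - 10 \times 1082.4)}}$

= $\frac{66.2}{182.5} = 0.363$

This means mid exam and final exam scores have a weak positive correlation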
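
The correlation coefficient can be recomputed directly from the summary quantities listed in the solution. This is a minimal sketch: the function name pearson_r_from_sums is hypothetical, and the code squares the means itself instead of using the rounded values 973.4 and 1082.4, so intermediate figures differ slightly while the rounded result stays the same.

```python
from math import sqrt


def pearson_r_from_sums(n, x_bar, y_bar, sum_xy, sum_x2, sum_y2):
    """Correlation coefficient from sample size, means and raw sums."""
    numerator = sum_xy - n * x_bar * y_bar
    denominator = sqrt((sum_x2 - n * x_bar ** 2) * (sum_y2 - n * y_bar ** 2))
    return numerator / denominator


# Summary values taken from the solution above.
r = pearson_r_from_sums(n=10, x_bar=31.2, y_bar=32.9,
                        sum_xy=10331, sum_x2=9920, sum_y2=11003)
print(round(r, 3))
```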
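B) $b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{10331 - 10 \times 31.2 \times 32.9}{9920 - 10 \times 973.4} = \frac{66.2}{186} = 0.36$

$a = \bar{Y} - b\bar{X} = 32.9 - 0.36 \times 31.2 = 32.9 - 11.232 = 21.7$

∴ $\hat{Y} = a + bX \Rightarrow \hat{Y} = 21.7 + 0.36X$
C) When X = 40
$\hat{Y} = 21.7 + 0.36 \times 40 = 21.7 + 14.4 = 36.1$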