Module 9. Statistics New

This document discusses key statistical tools in data management, focusing on normal distribution and regression/correlation techniques. It outlines chapter objectives, the process of statistics, and the properties of normal distribution, including the empirical rule and examples of its application. Additionally, it emphasizes the importance of inferential statistics in making predictions about a population based on sample data.

Uploaded by

luna
Module 9

Statistics: Data Management


Introduction

This chapter covers two of the most important statistical
tools in data management. The first part discusses the normal
distribution and the empirical rule, and uses them to solve
application problems. The second part tackles regression and
correlation, two statistical techniques used to establish a
linear association between variables.
Chapter Objectives
At the end of this chapter, the students should be able to:
1. use the empirical rule to solve an application problem.
2. use the normal distribution to solve an application problem
involving probabilities.
3. determine the linear regression model.
4. use linear regression to make predictions.
Introduction: Statistics Overview
STATISTICS: the Collection, Organization, Presentation, Analysis,
and Interpretation of data, used to
- Make decisions
- Solve Problems
- Design Products and Processes
Introduction: Statistics Overview

STATISTICS is the science of learning information from data.
STATISTICS uses inductive reasoning: a SAMPLE is used to learn
about the POPULATION.
Introduction: Statistics Overview

[Diagram: STATISTICS branches into DESCRIPTIVE and INFERENTIAL,
linked by Probability; INFERENTIAL covers Estimation (Point,
Interval) and Hypothesis Testing]
The Process of Statistics
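[Diagram: Sampling Theory takes a POPULATION to a SAMPLE.
Descriptive Statistics summarizes the SAMPLE as a STATISTIC;
Inferential Statistics uses the STATISTIC to draw conclusions
about the PARAMETER of the POPULATION]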
The Process of Statistics
Problem Definition

Data Gathering

Data Analysis

Data Interpretation
The Process of Statistics
Problem Definition: It can suggest the type of data that will be
involved in the research process.
Data Gathering
Data Analysis
Data Interpretation
The Process of Statistics
Problem Definition
Data Gathering: It determines the precision with which pertinent
information will be collected (retrospective, observational,
designed experiment).
Data Analysis
Data Interpretation
The Process of Statistics
Problem Definition
Data Gathering
Data Analysis
Data Interpretation
Statistical Objective?
-describe
-identify/classify
-compare/test
-predict
-explain
Number of Variables?
-one
-two
-more…
The Process of Statistics
Problem Definition
Data Gathering
Data Analysis
Data Interpretation
Type of Variable?
-independent
-dependent
-intervening
Level of Measurement?
-nominal
-ordinal
-interval
-ratio
Section 1: Normal Distribution and
the Central Limit Theorem
Inferential statistics uses sampling distributions to
draw conclusions about a given population based on
the analysis of random samples. One of the most
important topics in sampling distributions is the central
limit theorem.
Components of Statistical Research

Design – the researcher must know the appropriate statistical methods to
carry out a plan, implement rules, and evaluate experiments properly.
Description – the researcher must know how to guide readers in
understanding the methods of a study and in analyzing its results.
Inference – the researcher must use the results of data analysis to make
good predictions and correct decisions.
In addition, the researcher must do his or her best to minimize
experimental errors in order to obtain high precision and a high degree
of reliability. This can only be achieved if the experiment is well
planned and implemented.
The Normal Distribution

The normal distribution is perhaps the most commonly used
continuous probability distribution in the entire field of statistics.
Consider an experiment that generates continuous (interval) data,
for example, selecting random students in a class and
recording their heights. With a sufficiently large
sample (say, at least 30 students), the majority of the students
typically have heights that are close to the average, while a few have
“extreme” measures (either very tall or very short).
Properties of Normal Distribution
➢ The distribution has a bell-shaped curve.
➢ The curve is symmetric about the middle value.
➢ All three central measures (mean, median, mode) coincide at the middle.
➢ The span of the “bell” is determined by the standard deviation of the
distribution; the larger the standard deviation, the wider the span (or
range) of the “bell”. (The notation 𝑁(𝜇, 𝜎) denotes a normal
distribution with mean 𝜇 and standard deviation 𝜎.)
➢ The curve is asymptotic to the horizontal axis, which means that a value
that is far from the central value has a small relative frequency.
➢ The area under the curve is 1.
The Empirical Rule for Normal Distribution

• About 68.3% of the population falls within the interval 𝜇 ± 𝜎.
• About 95.4% of the population falls within the interval 𝜇 ± 2𝜎.
• About 99.7% of the population falls within the interval 𝜇 ± 3𝜎.
Here 𝜇 is the population mean and 𝜎 is the population standard
deviation.
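The three percentages in the empirical rule can be checked numerically. The following is a minimal sketch that assumes Python's standard-library `statistics.NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist


def share_within(k: float) -> float:
    """Fraction of a normal population lying within k standard deviations of the mean."""
    standard_normal = NormalDist()  # N(0, 1); the result is the same for any mu and sigma
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within mu +/- {k} sigma: {share_within(k):.1%}")
```

Because the rule is stated in units of 𝜎, the same three fractions apply to every normal distribution, whatever its mean and standard deviation.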
Example
Suppose the heights of 40 students are normally distributed with a
mean of 136 cm and a standard deviation of 8 cm. How many students
have heights ranging from
a. 128 cm to 144 cm?
b. 120 cm to 152 cm ?
c. 112 cm to 160 cm?
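Example: Solution
Each interval is the mean plus or minus a whole number of standard
deviations, so the empirical rule applies.
a. 128 cm to 144 cm is 136 ± 8, or 𝜇 ± 𝜎: about 68.3% of 40, or roughly 27 students.
b. 120 cm to 152 cm is 136 ± 16, or 𝜇 ± 2𝜎: about 95.4% of 40, or roughly 38 students.
c. 112 cm to 160 cm is 136 ± 24, or 𝜇 ± 3𝜎: about 99.7% of 40, or roughly 40 students.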
The Standard Normal Distribution 𝑁(0,1)

While a variable X can be used to refer to a normal random
variable, we use Z to represent the standard normal variable.
[Figure: a normal variable 𝑋 transformed to the standard normal variable 𝑍]
[Figure: The Empirical Rule in 𝑁(0,1)]
Areas Under the Normal Curve
TABULAR AREA
Probability: 𝑃(𝑍 < 1.18) = 0.8810


Remark

Tabular values corresponding to z-values identified by the
row-label and column-label represent either
a) area under the curve to the left of z; or
b) cumulative probability for all values less than z
Formula

Let 𝑋 be a normally distributed variable with mean 𝜇 and
standard deviation 𝜎. Then any value 𝑥 of 𝑋 can be
transformed into a standard normal score 𝑧 using the formula

$$z = \frac{x - \mu}{\sigma}$$
Example
A statistics examination was administered to two sections, Section
ABC and Section XYZ. In Section ABC, the average score of the
students was 85 with a standard deviation of 4. In Section XYZ, the
average was 83 with standard deviation of 3. Kara and Mia, who
belong to ABC and XYZ respectively, both scored 87 in the said
examination. Who scored better in terms of their relative position in
their respective sections? Assume that test scores in the 2 sections are
normally distributed.
Solution
For Kara: $z = \dfrac{x - \mu}{\sigma} = \dfrac{87 - 85}{4} = 0.5$
For Mia: $z = \dfrac{x - \mu}{\sigma} = \dfrac{87 - 83}{3} \approx 1.33$

The standardized score of Mia is higher than the standardized
score of Kara. This means that, relative to her own section, Mia
performed better than Kara.
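The comparison can be reproduced in a few lines. This is only an illustrative sketch; the function name `z_score` is an assumption, not something defined in the module.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Standard score: how many standard deviations x lies above the mean."""
    return (x - mu) / sigma


kara = z_score(87, mu=85, sigma=4)  # Section ABC
mia = z_score(87, mu=83, sigma=3)   # Section XYZ

print(f"Kara: z = {kara:.2f}")
print(f"Mia:  z = {mia:.2f}")
print("Higher relative standing:", "Mia" if mia > kara else "Kara")
```

Both students have the same raw score, so the difference in standing comes entirely from the section means and standard deviations.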
Example
Let X be a random variable that is normally distributed with mean
𝜇 = 12 and a standard deviation 𝜎 = 2.4. Find the standard score for
𝑥 = 15.
Example
Let X be a random variable that is normally distributed with mean
𝜇 = 12 and a standard deviation 𝜎 = 2.4. Find the probability
P(𝑥 < 15).

Solution: We compute the probability of the normally distributed
variable X using the standard normal distribution. Standardizing
𝑥 = 15 gives $z = \dfrac{15 - 12}{2.4} = 1.25$, so
$P(X < 15) = P(Z < 1.25)$
Finding 𝑷(𝒛 < 𝟏.𝟐𝟓)
Method 1: Tabular Method
In the table of areas under the normal curve, locate the row labeled 1.2
and the column labeled 0.05. The tabular value there is 0.8944, so
𝑃(𝑧 < 1.25) = 0.8944.
Finding 𝑷(𝒛 < 𝟏.𝟐𝟓)
Method 2: Calculator Method (For CASIO only)
i) Set calculator to STAT Mode
Press Mode ==> Press 3
Just press AC
ii) Find probability
Press Shift ==> press 1 ==> Press 5
Note: For other versions, just look for “Distr” key
Finding 𝑷(𝒛 < 𝟏.𝟐𝟓)
Method 3: Excel Command
Click the Insert Function command.
* Using Normal Distribution choose “NORM.DIST”
* Input the values (x = 15, mean = 12, standard deviation = 2.4)
* Type “TRUE” for the logical argument “Cumulative”
* Click “OK”
Alternatively:
* Using Standard Normal Distribution choose “NORM.S.DIST”
* Input the value (z = 1.25)
* Type “TRUE” for the logical argument “Cumulative”
* Click “OK”
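The probability 𝑃(𝑧 < 1.25) can also be cross-checked outside a table, calculator, or spreadsheet. The sketch below assumes Python's standard-library `statistics.NormalDist`; it is offered as a check, not as one of the module's own methods.

```python
from statistics import NormalDist

X = NormalDist(mu=12, sigma=2.4)

# Route 1: evaluate the cumulative probability of X directly.
p_direct = X.cdf(15)

# Route 2: standardize first, then use the standard normal N(0, 1).
z = (15 - X.mean) / X.stdev
p_standard = NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"P(X < 15) = {p_direct:.4f}")
print(f"P(Z < z)  = {p_standard:.4f}")
```

The two routes agree because standardizing only rescales the horizontal axis; the area to the left of the value is unchanged.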
Example
In a certain section, the scores in Quiz 1 of MMW are known to be
normally distributed with a mean of 84 and a standard deviation of 3.5.
Determine the probability that a student in this section obtained a
score of
a) less than or equal to 90
b) 88 or better
c) between 85 and 90
Solution: Given 𝜇 = 84, 𝜎 = 3.5
For purposes of computation, let X be the random variable for the
quiz score.
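The three requested probabilities can be sketched numerically as follows. This assumes Python's standard-library `statistics.NormalDist` and treats the score as continuous, so "less than" and "less than or equal to" give the same probability; the variable names are illustrative.

```python
from statistics import NormalDist

score = NormalDist(mu=84, sigma=3.5)

p_a = score.cdf(90)                  # a) P(X <= 90): area to the left of 90
p_b = 1 - score.cdf(88)              # b) P(X >= 88): area to the right of 88
p_c = score.cdf(90) - score.cdf(85)  # c) P(85 < X < 90): area between 85 and 90

print(f"a) {p_a:.4f}")
print(f"b) {p_b:.4f}")
print(f"c) {p_c:.4f}")
```

Values read from a z-table may differ slightly in the last digit because the table requires z to be rounded to two decimal places.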
Example
A popular burger chain sells a particular soda brand using a machine
that discharges an average of 500 milliliters (ml) per cup. If the
amount of drink is normally distributed with a standard deviation of
35 ml,
a. what fraction of the cups will contain more than 550 ml?
b. how many cups will overflow if 530-ml cups will be used for
1,500 drinks?
c. below what value do you get the smallest 20% of the drinks?
Example) Given: 𝜇 = 500, 𝜎 = 35
a. what fraction of the cups will contain more than 550 ml?
With the assumption of normality, we can standardize 𝑥 = 550:
$z = \dfrac{x - \mu}{\sigma} = \dfrac{550 - 500}{35} = 1.43$
$P(X > 550) = P(Z > 1.43) = R(1.43) = 0.0764$ or 7.6%
(On the calculator, R( gives the area to the right of z.)
Example) Given: 𝜇 = 500, 𝜎 = 35
b. how many cups will overflow if 530-ml cups will be used for
1,500 drinks?
Note: Cups overflow if the discharged content exceeds 530 ml
Find: $P(X > 530)$
$z = \dfrac{x - \mu}{\sigma} = \dfrac{530 - 500}{35} = 0.86$
$P(X > 530) = P(Z > 0.86) = R(0.86) = 0.1949$
Then, (1500)(0.1949) ≈ 292 cups will overflow.
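Parts a and b can be checked without rounding z to two decimal places. The sketch below assumes Python's standard-library `statistics.NormalDist`; because z is not rounded, the results differ slightly from the table-based figures (the overflow count comes out a cup or two higher).

```python
from statistics import NormalDist

fill = NormalDist(mu=500, sigma=35)  # amount discharged per cup, in ml

# a. fraction of cups that receive more than 550 ml
p_over_550 = 1 - fill.cdf(550)

# b. expected number of overflowing 530-ml cups among 1,500 drinks
p_over_530 = 1 - fill.cdf(530)
expected_overflows = 1500 * p_over_530

print(f"P(X > 550) = {p_over_550:.4f}")
print(f"P(X > 530) = {p_over_530:.4f}")
print(f"expected overflows = {expected_overflows:.1f}")
```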
Example) Given: 𝜇 = 500, 𝜎 = 35
c. below what value do you get the smallest 20% of the drinks?
Note: This is an inverse probability problem. Only the Casio fx-991EX
(or higher versions) has the capability to do this. You may
use Excel instead.
Find $x$ such that $P(X < x) = 0.20$
* use the syntax: NORM.INV(probability, mean, standard_dev)
$x = 470.5$ ml
* or use the syntax: NORM.S.INV(probability)
$z = -0.84162$
$z = \dfrac{x - \mu}{\sigma} \iff \dfrac{x - 500}{35} = -0.84162$
$x = (-0.84162)(35) + 500 = 470.5$ ml
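The inverse lookup can be mirrored in code. This sketch assumes Python's standard-library `NormalDist.inv_cdf`, which plays the same role here as Excel's NORM.INV and NORM.S.INV.

```python
from statistics import NormalDist

mu, sigma = 500, 35

# Counterpart of NORM.INV(0.20, 500, 35): the value below which 20% of drinks fall.
x = NormalDist(mu, sigma).inv_cdf(0.20)

# Counterpart of NORM.S.INV(0.20): the z-score with 20% of the area to its left.
z = NormalDist().inv_cdf(0.20)
x_from_z = z * sigma + mu  # undo the standardization: x = z*sigma + mu

print(f"x = {x:.1f} ml")
print(f"z = {z:.5f}")
print(f"x from z = {x_from_z:.1f} ml")
```

Both routes give the same cutoff; the second simply makes the step from z back to x explicit.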
Inferential Statistics

One important role of statistics is to describe a large group of
subjects (called the population) using only a part or portion (called
a sample) of the group.
Inferential statistics serves this purpose. Given a population
of size 𝑁, we can consider a smaller group or a sample of size
𝑛 such that whatever characteristics are obtained from the
sample can be used to describe the entire population from
which the sample was drawn.
Inferential Statistics

By “characteristic” we mean the common quantities that are
computed, such as the mean, variance, and proportion.
So, if the mean $\mu$ of the population is needed,
then the sample mean $\bar{x}$ can be used by some
rules of inferential statistics. Here, $\mu$ is called a
parameter while $\bar{x}$ is called a statistic.
[Figure: Population with parameter $\mu$; Sample with statistic $\bar{x}$]
Inferential Statistics

Other parameters are the population variance ($\sigma^2$),
proportion ($p$), and correlation ($\rho$). The corresponding
statistics are the sample variance ($s^2$),
sample proportion ($\hat{p}$), and sample
correlation ($r$). Inferential statistics
is concerned with the “estimation” of
the parameters using the sample statistics.
[Figure: Population with parameters $\mu$, $\sigma^2$, $\rho$, $p$; Sample with statistics $\bar{x}$, $s^2$, $r$, $\hat{p}$]
Inferences about population mean (𝜇)

Consider finding the mean ($\mu$) of a large population. We first
form a sample of size $n \ge 30$. Then we compute the sample mean ($\bar{x}$).
We can use $\bar{x}$ to estimate $\mu$.

But what guarantees that this can be done?


The Central Limit Theorem (CLT)

If random samples of size $n$ are formed from a population
with mean $\mu$ and a standard deviation $\sigma$, then the means of
the samples tend to a normal distribution as the sample size
increases. In this case, the standard deviation of the means is
given by

$$s = \frac{\sigma}{\sqrt{n}}$$

Moreover, the mean of the sample means equals the
population mean.
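A small simulation can make the theorem concrete. The sketch below is an illustration under stated assumptions: the population is synthetic (exponentially distributed, so clearly not normal), and the sample size 40, the 2,000 repetitions, and the random seed are arbitrary choices that do not come from the module.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, deliberately skewed population (exponential distribution).
population = [random.expovariate(1.0) for _ in range(100_000)]
mu = mean(population)
sigma = pstdev(population)

n = 40  # size of each random sample
sample_means = [mean(random.sample(population, n)) for _ in range(2_000)]

print(f"population mean       = {mu:.3f}")
print(f"mean of sample means  = {mean(sample_means):.3f}")
print(f"sigma / sqrt(n)       = {sigma / n ** 0.5:.3f}")
print(f"std. dev. of means    = {pstdev(sample_means):.3f}")
```

In a run like this, the mean of the sample means should land close to the population mean, and their standard deviation close to σ/√n, even though the population itself is skewed.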
Remarks

1. CLT does not require the population to be normally
distributed. So, whether the population is normally
distributed or not, the means of the samples (with fixed
size $n$) are approximately normally distributed, provided
$n$ is sufficiently large.
2. By the Empirical Rule, any mean $\bar{x}$ computed from a
random sample is likely to be close to the
population mean. Specifically, there is about a 68% chance that it is
within $1s$ of $\mu$, a 95% chance that it is within $2s$ of $\mu$, and
a 99.7% chance that it is within $3s$ of $\mu$.
Remarks

3. Since the means of the samples are normally distributed,
any specific mean $\bar{x}$ must have a corresponding standard
value (or standard normal score). The formula in this case is

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Example
In the MMW class of Professor A, the students obtained an average of
86.2 in an online quiz, with a standard deviation of 8. Assume that the
scores are normally distributed.
a) What is the probability that a randomly selected student scored
less than or equal to 88?
b) If a random sample of 15 students is selected from the class, what
is the probability that their average is less than or equal to 88?
Ex.) Given: 𝜇 = 86.2, 𝜎 = 8
a) What is the probability that a randomly selected student scored
less than or equal to 88?

Standardize 𝑥 = 88: $z = \dfrac{x - \mu}{\sigma} = \dfrac{88 - 86.2}{8} = 0.225$

$P(X \le 88) = P(Z \le 0.225) = P(0.225) = 0.5890$


Ex.) Given: 𝜇 = 86.2, 𝜎 = 8, 𝑛 = 15
b) If a random sample of 15 students is selected from the class, what is
the probability that their average is less than or equal to 88?
Standardize $\bar{x} = 88$: $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \dfrac{88 - 86.2}{8 / \sqrt{15}} = 0.87$
$P(\bar{X} \le 88) = P(Z \le 0.87) = P(0.87) = 0.8078$
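The contrast between a single score and a sample average can be sketched as follows, assuming Python's standard-library `statistics.NormalDist`. The only change between the two parts is the standard deviation: σ for one student, σ/√n for the mean of n students. Since z is not rounded here, the second probability differs from the table-based value in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 86.2, 8, 15

# a) one randomly selected student: spread is sigma
p_single = NormalDist(mu, sigma).cdf(88)

# b) mean of a random sample of n students: spread is sigma / sqrt(n)
standard_error = sigma / sqrt(n)
p_mean = NormalDist(mu, standard_error).cdf(88)

print(f"P(X <= 88)    = {p_single:.4f}")
print(f"P(Xbar <= 88) = {p_mean:.4f}")
```

The sample mean is less variable than an individual score, so it is more likely to fall below a value that lies above 𝜇.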
Example

A manufacturing firm produces LED lamps with a mean lifetime of
900 hours and a standard deviation of 55 hours. Find the probability
that a random sample of 100 lamps will have an average lifetime of
a. more than 915 hours
b. between 895 and 905 hours.
Ex.) Given: 𝜇 = 900, 𝜎 = 55, 𝑛 = 100
a. more than 915 hours
$z = \dfrac{915 - 900}{55 / \sqrt{100}} = 2.73$
$P(\bar{X} > 915) = P(Z > 2.73) = 0.0032$
Ex.) Given: 𝜇 = 900, 𝜎 = 55, 𝑛 = 100
b. between 895 and 905 hours
$z_1 = \dfrac{895 - 900}{55 / \sqrt{100}} = -0.91$; $z_2 = \dfrac{905 - 900}{55 / \sqrt{100}} = 0.91$
$P(895 < \bar{X} < 905) = P(-0.91 < Z < 0.91)$
$= P(Z < 0.91) - P(Z < -0.91)$
$= P(0.91) - P(-0.91) = 0.6372$
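Both parts of the lamp example can be checked with the sampling distribution of the mean. This is a minimal sketch assuming Python's standard-library `statistics.NormalDist`; z is not rounded, so the last digits can differ slightly from table-based answers.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 900, 55, 100

# By the CLT, the sample mean is approximately N(mu, sigma / sqrt(n)).
mean_life = NormalDist(mu, sigma / sqrt(n))

p_a = 1 - mean_life.cdf(915)                  # a. average above 915 hours
p_b = mean_life.cdf(905) - mean_life.cdf(895)  # b. average between 895 and 905 hours

print(f"a. P(Xbar > 915)       = {p_a:.4f}")
print(f"b. P(895 < Xbar < 905) = {p_b:.4f}")
```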
Correlation and Regression

Sometimes you might wonder how two separate things could
relate to one another. For example, you might ask yourself:
Why do savings generally increase when expenditures
decrease? Or, why does your weight change when you eat
more or eat less? These questions are about the relationship
between two variables or quantities. Data that involve two
variables are called Bivariate Data.
Correlation and Regression

In univariate data, the major purpose of the analysis is to
describe the data based on the descriptive statistics computed,
such as averages, standard deviations, frequency counts, and
the like. On the other hand, in bivariate data, the purpose of
the analysis is to describe relationships. We will be
discussing the relationship in terms of strength and direction.
The statistical procedure that is used to do this is called
correlation analysis.
Correlation Analysis
Correlation analysis is one statistical technique used to study
relationships among variables. Regression analysis is used to
determine the nature of the relationship. In a two-variable linear
regression or simple linear regression, a positive relationship
occurs when the two variables increase at the same time, while
a negative relationship occurs when one variable increases and
the other variable decreases, or vice versa.
Correlation Coefficient

To determine if there exists a linear relationship between two
variables, use the correlation coefficient r, whose values range
from –1 to 1.
Useful Formulas

n is the sample size and “SS” stands for sum of the squares
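For reference, a commonly used set of computational formulas in this notation is sketched below. These are the standard textbook definitions and are given here as an assumption; they may differ in form from the expressions on the original slide.

```latex
SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad
SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad
SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}

r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}
```

Under these definitions, the sign of $SS_{xy}$ gives the direction of the relationship, and $r$ rescales it so that the result always lies between –1 and 1.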
Coefficient of Determination

The square of r is called the coefficient of
determination. It describes the proportion of the
variability in the dependent variable y that is explained
by its linear relationship with the independent variable x.
The Regression Line: 𝑦 = 𝑏𝑥 + 𝑎

The line corresponding to a given set of points is called the
least-squares line of the linear regression model. Here,
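A minimal sketch of the computation follows, assuming the usual least-squares estimates b = SS_xy / SS_xx and a = ȳ − b·x̄ together with r = SS_xy / √(SS_xx · SS_yy). The function name `least_squares` and the five data pairs are hypothetical, chosen only so the arithmetic is easy to follow; they are not the grades from the module's example.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (r, b, a) for the least-squares line y = b*x + a."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    r = ss_xy / sqrt(ss_xx * ss_yy)  # correlation coefficient
    b = ss_xy / ss_xx                # slope
    a = sum_y / n - b * sum_x / n    # intercept: y-bar minus b times x-bar
    return r, b, a


# Hypothetical data for illustration only (not taken from the module).
midterm = [70, 75, 80, 85, 90]
final = [72, 76, 79, 78, 85]

r, b, a = least_squares(midterm, final)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"regression line: y = {b:.2f}x + {a:.2f}")
print(f"predicted final grade for a midterm of 88: {b * 88 + a:.2f}")
```

With the regression line in hand, a prediction is just a substitution of x into y = bx + a; predictions are most trustworthy for x values inside the range of the observed data.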
Example
The grades of 10 senior high school students on a midterm report x
and on the final examination y are as follows:

a. Determine the correlation coefficient r.


b. Determine the linear regression line.
