ECE-069_Engineering-Data-Analysis_WM
This document and the information thereon is the property of PHINMA Education
ECE 069 - ENGINEERING DATA ANALYSIS
SYLLABUS
I. Course Description:
This course is designed for undergraduate engineering students with emphasis on problem
solving related to societal issues that engineers and scientists are called upon to solve. It introduces
different methods of data collection and the suitability of using a particular method for a given
situation. The relationship of probability to statistics is also discussed, providing students with the
tools they need to understand how "chance" plays a role in statistical analysis. Probability
distributions of random variables and their uses are also considered. The course also includes
estimation techniques for unknown parameters, hypothesis testing used in making inferences
from sample to population, inference for regression parameters, and building models for estimating
means and predicting future values of key variables under study.
II. Course Objectives:
At the end of the course, the students should be able to:
1. Apply statistical methods in the analysis of data.
2. Design experiments involving several factors.
Module 6: RANDOM VARIABLES
    Types of Random Variables: Discrete & Continuous
    Probability Distribution of Random Variables
Module 7: PLANNING AND CONDUCTING SURVEYS
    Steps in Collecting Data Through Surveys
FIRST PERIODICAL EXAM
Module 8: PLANNING AND CONDUCTING EXPERIMENTS
    Steps in Collecting Data Through Experiments
Module 9: DISCRETE PROBABILITY DISTRIBUTION: BINOMIAL DISTRIBUTION
    Definition of a Binomial Probability Distribution
    Descriptive Statistics of Binomial Distribution
Module 10: DISCRETE PROBABILITY DISTRIBUTION: POISSON DISTRIBUTION
    Definition of a Poisson Probability Distribution
    Descriptive Statistics of Poisson Distribution
QUIZ #2
Module 11: CONTINUOUS PROBABILITY DISTRIBUTION: NORMAL DISTRIBUTION
    Definition of a Normal Probability Distribution
    Descriptive Statistics of Normal Distribution
Module 12: CONTINUOUS PROBABILITY DISTRIBUTION: EXPONENTIAL DISTRIBUTION
    Definition of an Exponential Probability Distribution
    Descriptive Statistics of Exponential Distribution
SECOND PERIODICAL EXAM
Module 13: SAMPLING DISTRIBUTION AND POINT ESTIMATION
    Definition of Point Estimate and Point Estimator
    Distribution of Point Estimator
    Properties of Point Estimator
    Definition & Interpretation of Standard Error
Module 14: STATISTICAL INTERVALS
    Forms of Interval Estimation: Confidence Intervals
    Prediction Intervals
    Tolerance Intervals
    Statistical Intervals for Normal Distribution
IV. Textbook:
1. Probability (Schaum’s Outline Series) By: Seymour Lipschutz
2. Probability and Statistics (Sixth Edition) By: R. E. Walpole, R. H. Myers & S. L. Myers
3. Introduction to Probability & Statistics (10th Edition) By: Mendenhall/Beaver/Beaver
V. Grading System:
Passing score is 60%
Final Grade = (0.17 x P1) + (0.17 x P2) + (0.16 x P3) + (0.50 x FE)
P1, P2, P3 = First Periodical Grade, Second Periodical Grade, and Third Periodical Grade,
respectively
Periodical Grade = (0.50 x Class Standing) + (0.50 x Periodical Exam)
ECE 069: Engineering Data Analysis
Module #8 Student Activity Sheet
A. LESSON PREVIEW/REVIEW
1) Introduction (2 mins)
Before conducting a research experiment, researchers come up with a research design. Experimental
research design serves as an instruction manual on how the experiment is conducted. The design helps
the researcher stay on track and makes sure all bases are being properly covered to ensure the experiment's
validity. Designed Experiments achieve manufacturing cost savings by minimizing process variation and
reducing rework, scrap, and the need for inspection.
B. MAIN LESSON
In planning an experiment to collect research data, the following must already be defined:
The research problem and the research objectives. Formulate the research question or a problem
statement.
The responses and the factors. Identify the variables of interest in relation to your research problem or
objectives, and indicate which are independent and which are dependent. Make predictions or
hypotheses about the possible outcome (the dependent variable, or response) when the
independent variables (the factors) are manipulated. A specific combination of factor levels is termed a treatment.
For example, if you design an experiment to determine how quickly a cup of hot chocolate cools,
then the independent variable is time and the measured dependent variable is temperature.
The experimental research design. This is the process of planning an experiment to test the researcher’s
hypothesis. The relationship between two variables, the dependent and the independent variable, is
determined. Data collected in experimental research are usually quantitative in nature.
Control Group. The group in the experimental design that is not exposed to the treatment. The difference in
performance between the control group and the treatment group measures the effect of the treatment on the
treatment group.
Before-and-after without control design: In such a design a single test group or area is selected and
the dependent variable is measured before the introduction of the treatment. The treatment is then
introduced and the dependent variable is measured again after the treatment has been introduced. The
effect of the treatment would be equal to the level of the phenomenon after the treatment minus the
level of the phenomenon before the treatment.
The main difficulty of such a design is that, with the passage of time, considerable extraneous variation
may enter into the measured treatment effect.
After-only with control design: In this design two groups or areas (test area and control area) are
selected and the treatment is introduced into the test area only. The dependent variable is then
measured in both the areas at the same time. Treatment impact is assessed by subtracting the value
of the dependent variable in the control area from its value in the test area.
The basic assumption in such a design is that the two areas are identical with respect to their behavior
towards the phenomenon considered. If this assumption is not true, there is the possibility of extraneous
variation entering into the treatment effect. However, data can be collected in such a design without the
introduction of problems with the passage of time. In this respect the design is superior to before-and-
after without control design.
Before-and-after with control design: In this design two areas are selected and the dependent
variable is measured in both the areas for an identical time-period before the treatment. The treatment
is then introduced into the test area only, and the dependent variable is measured in both for an
identical time-period after the introduction of the treatment. The treatment effect is determined by
subtracting the change in the dependent variable in the control area from the change in the dependent
variable in test area.
This design is superior to the above two designs for the simple reason that it avoids extraneous
variation resulting both from the passage of time and from non-comparability of the test and control
areas. At times, however, due to a lack of historical data, time, or a comparable control area, we may
have to select one of the first two informal designs stated above.
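The three treatment-effect calculations described above can be sketched in a few lines of Python. This is only an illustration: the function names and the measurement values are hypothetical and do not come from the module.

```python
# Treatment effect under the three informal designs described above.
# All function names and numbers are hypothetical, for illustration only.

def before_after_without_control(before, after):
    """Effect = level after the treatment minus level before the treatment."""
    return after - before

def after_only_with_control(test_after, control_after):
    """Effect = value in the test area minus value in the control area."""
    return test_after - control_after

def before_after_with_control(test_before, test_after,
                              control_before, control_after):
    """Effect = change in the test area minus change in the control area."""
    return (test_after - test_before) - (control_after - control_before)

# Hypothetical measurements of the dependent variable
test_before, test_after = 100, 130
control_before, control_after = 100, 110

print(before_after_without_control(test_before, test_after))   # 30
print(after_only_with_control(test_after, control_after))      # 20
print(before_after_with_control(test_before, test_after,
                                control_before, control_after))  # 20
```

In this made-up data the control area also rose by 10 units without any treatment, so the first design overstates the effect (30 instead of 20). That is the extraneous variation from the passage of time that the third design removes.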
Observational Study. Here, data are collected by observing subjects as they are, without the researcher
assigning or manipulating a treatment.
Simulations: This procedure uses a mathematical, physical, or computer model to replicate a real-life
process or situation. It is frequently used when the actual situation is too expensive, dangerous, or
impractical to replicate in real life. This method is commonly used in engineering and operations research
for learning purposes and sometimes as a tool to estimate possible outcomes of real research.
2. Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking).
Read the abstract of a research paper and answer the following questions.
1. What is the research problem of the paper being presented?
2. What are the dependent (response) variable and the independent variables (factors)?
3. What is the control group and the experimental or the treatment group?
Abstract— Since the ancient times, many researches and advancements were carried to enhance the physical and
mechanical properties of concrete. Fiber reinforced concrete is one among those advancements which offers a convenient,
practical and economical method for overcoming micro cracks and similar type of deficiencies. Since concrete is weak in
tension hence some measures must be adopted to overcome this deficiency. Human hair is generally strong in tension; hence
it can be used as a fiber reinforcement material. Human hair Fiber is an alternative non-degradable matter available in
abundance and at cheap cost. It also reduces environmental problems. Also addition of human hair fibers enhances the
binding properties, micro cracking control, Imparts ductility and also increases swelling resistance. The experimental
findings in our studies would encourage future research in the direction for long term performance to extending this cost
of effective type of fibers for use in structural applications. Experiments were conducted on concrete cubes, cylinders and
beams of standard sizes with addition of various percentages of human hair fiber i.e., 0%, 0.5%, 1% and 1.5% by weight of
cement, fine & coarse aggregate and results were compared with those of plain cement concrete of M-20 grade. For each
percentage of human hair added in concrete, four cubes, three cylinders and three beams were tested for their respective
mechanical properties at curing periods of 3 , 7 and 28 days. Optimum hair fiber content was obtained as 1.5% by weight
of cement.
C. LESSON WRAP-UP
1) Activity 6: Thinking about Learning (5 mins)
You are done with the session! Let's track your progress.
Period 1 Period 2 Period 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
KEY TO CORRECTION
Activity #3
1) Usage of human hair as a fiber reinforcement material.
2) The mechanical properties of concrete are the dependent variable (the response) in the experiment.
The various percentages of human hair in the concrete are the independent variable (the factor).
3) The mechanical properties of plain cement concrete of M-20 grade form the control group. The
mechanical properties of concrete (the dependent variable, or response) with the various
percentages of human hair (the independent variable), 0%, 0.5%, 1% and 1.5% by weight, form the
experimental (treatment) group.
Productivity Tip: “Tomorrow becomes never. No matter how small the task, take the first step now! “ – T. Ferriss
A. LESSON PREVIEW/REVIEW
1) Introduction (3 mins)
Let us play a Binomial Experiment. To do this, please toss a one-peso coin 20 times and tally the
outcome for each toss whether a head or a tail appears. Fill in the table below for the result of your
statistical experiment.
Fill in the first column of what you know to answer the questions on the second column of the table below.
B. MAIN LESSON
Statistical Experiments are experiments that have three things in common. The experiments have more
than one possible outcome, each possible outcome can be specified ahead of time, and each outcome
depends on chance. An example is flipping a coin: there are two possible outcomes (more than one
outcome), the outcomes can be specified in advance as either a head or a tail, and the outcome is uncertain
(depends on chance).
Probability Distribution. A probability distribution is a table or an equation that links each
outcome of a statistical experiment with its probability of occurrence. In some cases, the probability
distribution is represented as a graph. The outcomes of the experiment are represented by a random
variable.
For example, consider flipping a coin two times. An outcome of the experiment might be the number of heads that
we see in the two flips. If we let the variable X be the number of heads that come up, then X is termed
the random variable, which could take a value of X = 1 (only one head appears, so a tail appears on the
other flip), X = 2 (two heads appear, so no tails appear), or X = 0 (no heads appear, so two tails appear).
Let the outcome be the number of heads that you see in flipping two coins, represented by the random
variable X. Note that the possible outcomes of this experiment are {HH, HT, TH, TT}. Below is the
probability distribution of the above statistical experiment.
Probability Distribution of tossing a coin 2 times
Number of heads, X       0       1        2
Outcomes                 TT      HT, TH   HH
Probability, P(X = x)    0.25    0.50     0.25
The binomial distribution describes the probability of a particular outcome in a series of experiments where
each outcome has two distinct possibilities, success or failure. The prefix bi means two. Binomial
distributions are discrete (that is, the events are separate) and can be used to model the total number
of successes in repeated trials as long as each trial is independent (the outcome of one trial does
not affect the outcome of another trial) and the probability of getting either outcome remains constant.
A binomial experiment is a series of independent and identically distributed Bernoulli trials. In a Bernoulli trial,
the experiment is random and can have only two possible outcomes: success or failure. For
example, flipping a coin is a Bernoulli trial: each trial can take only one of two values (heads
or tails), each trial has the same probability of success (the probability of flipping a head is 0.5), and the results of
one trial do not influence the results of another. The Bernoulli distribution is a special case of the binomial
distribution where the number of trials n = 1. So, repeated flipping of a coin is considered a binomial
experiment.
The random variable X which has a binomial probability distribution can be represented as
$$P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$$
where n = the total number of trials and p = the probability of success in each trial.
The mean is also termed the expected value or the average of the outcomes. Mean = np. For example,
for 100 tosses of a fair coin (p = 0.5), the mean number of heads is np = 100 × 0.5 = 50.
Median is the middle value of the sorted outcomes (in increasing or decreasing order). There is no
single formula to find the median for a binomial distribution. However, several special results have been
established: if np is an integer, then the mean, median, and mode coincide and equal np.
Standard deviation is a measure of the dispersion of the data set from its mean. Dispersion helps you to interpret
the variability of data, i.e. to know how homogeneous or heterogeneous the data are. In simple terms, it
shows how far the data spread from the mean. The greater the standard deviation, the greater the deviation of
the individual values from the mean. For a binomial experiment the standard deviation, σ, is
$$\sigma = \sqrt{n p (1-p)}$$
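The descriptive statistics above can be checked numerically. The sketch below assumes the standard binomial probability formula and uses the 20-toss fair-coin game from the introduction as its example; the function name `binomial_pmf` is a hypothetical helper, not something defined in the module.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Assumed example: 20 tosses of a fair coin, success = head
n, p = 20, 0.5

mean = n * p                      # np
sd = sqrt(n * p * (1 - p))        # sqrt(np(1 - p))

print("mean =", mean)                                   # 10.0
print("standard deviation =", round(sd, 3))             # about 2.236
print("P(X = 10) =", round(binomial_pmf(10, n, p), 4))  # about 0.1762

# The probabilities over x = 0..n should add up to 1
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
print("sum of probabilities =", round(total, 6))
```

Because np = 10 is an integer here, the mean, median, and mode all equal 10, in line with the special result quoted above.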
2. Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking).
a. From your binomial experiment game (page 1, introduction) and from the data of 5 of your classmates, fill
in the table below:
b. In the automobile spare part production of your company, 90% of parts pass final inspection (and 10% fail and
need to be fixed). What are the mean and the standard deviation of the number of parts that will pass in the next 5 inspections?
c. In tossing a die, what is the probability that a non-zero number will appear?
C. LESSON WRAP-UP
Productivity Tip:
“In life, people tend to wait for good things to come to them, and by waiting, they miss out.” - Neil Strauss
A. LESSON PREVIEW/REVIEW
1) Introduction (2 mins)
The Poisson distribution is a discrete distribution. It is named after Siméon-Denis Poisson (1781-1840), a
French mathematician, who published its essentials in a paper in 1837.
The Poisson distribution is a limiting case of the binomial distribution: as the number of trials n approaches
infinity and the probability of success p approaches zero, with np = λ held constant, the binomial
distribution approaches the Poisson distribution. It is an important probability distribution for modeling
rare events, and it is asymmetric, meaning it is always skewed toward the right.
B. MAIN LESSON
The Poisson distribution gives the probability of a number of events in an interval generated by a Poisson
process. The Poisson distribution is defined by the rate parameter, λ, which is the expected number of
events in the interval. The most probable number of events is close to λ.
Applications of the Poisson distribution can be found in many fields including:
Asymptotic Poisson model of seismic risk for large earthquakes.
Number of decays in a given time interval in a radioactive sample.
The number of photons emitted in a single laser pulse.
The number of yeast cells used when brewing Guinness beer. This example was used by William Sealy
Gosset (1876–1937).
The number of phone calls arriving at a call centre within a minute. This example was described by A.K.
Erlang (1878–1929).
Failure of a machine in one month.
Example. A particular river overflows every 25 years on average. Find the probability that there are
x = 2 overflows in a 25-year interval.
Here, λ = 1 and x = 2; hence,
$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$
$$P(X = 2) = \frac{1^{2} e^{-1}}{2!} = 0.1839$$
Example. Vehicles pass through a junction on a busy road at an average rate of 300 per hour.
a. What is the average number of vehicles passing per minute?
b. What is the probability that no vehicle will pass in a given minute?
c. What is the expected number of vehicles passing in three minutes?
Solution.
a. The average number of vehicles passing per minute is λ = 300/60 = 5.
b. Using the formula $P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$,
$$P(X = 0) = \frac{5^{0} e^{-5}}{0!} = 0.00674$$
c. Expected number of vehicles passing in three minutes = 3·λ = 3·5 = 15
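The two worked examples above can be reproduced with a short Python sketch. The helper name `poisson_pmf` is hypothetical; it simply evaluates the Poisson formula used in the solutions.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with rate lam."""
    return lam**x * exp(-lam) / factorial(x)

# River example: lambda = 1 overflow per 25-year interval, x = 2
print(round(poisson_pmf(2, 1), 4))   # about 0.1839

# Junction example: lambda = 300 / 60 = 5 vehicles per minute, x = 0
lam_per_minute = 300 / 60
print(round(poisson_pmf(0, lam_per_minute), 5))   # about 0.00674

# Expected number of vehicles in three minutes scales with the interval
print(3 * lam_per_minute)   # 15.0
```

Note that λ must always be expressed for the same interval as the question: 5 per minute for a one-minute question, 15 for a three-minute question.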
Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking)
1. Twenty cars were examined for defective surface coating. The frequency of the number of cars with a
given number of surface coating defects per car was as follows:
If a car is chosen at random, what is the probability that it has 3 or more surface coating defects?
2. If electric power failures occur according to a Poisson distribution with an average of 5 failures every
20 weeks, calculate the probability that there will not be more than one failure during a particular week.
3. A company makes electric motors. The number of defective motors follows a Poisson distribution. The probability
that an electric motor is defective is 0.01. What is the probability that a sample of 100 electric motors will contain
exactly 3 defective motors?
C. LESSON WRAP-UP
KEY TO CORRECTION
Activity #3
1)
Total number of surface coating defects = 0 + 3 + 10 + 6 + 16 + 5 + 6 = 46; hence, λ = 46 / 20 = 2.3.
You may use the complement property of probability. Here,
$$P(X \ge 3) = 1 - P(X < 3)$$
$$P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)$$
using the formula
$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$
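The complement calculation set up above can be carried through numerically. This is a sketch under the key's own value λ = 2.3; the helper `poisson_pmf` is a hypothetical name for the Poisson formula.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with rate lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 46 / 20   # average number of defects per car, from the key

# P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
p_less_than_3 = sum(poisson_pmf(x, lam) for x in range(3))

# Complement rule: P(X >= 3) = 1 - P(X < 3)
p_3_or_more = 1 - p_less_than_3

print(round(p_less_than_3, 4))   # about 0.596
print(round(p_3_or_more, 4))     # about 0.404
```

Summing three terms and subtracting from 1 avoids summing the infinitely many terms for x = 3, 4, 5, and so on.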
Productivity Tip: “Hard work keeps the wrinkles out of the mind and spirit.”– Helena Rubinstein
A. LESSON PREVIEW/REVIEW
1) Introduction (2 mins)
The normal distribution, also known as the Gaussian distribution, is the most commonly used probability
distribution. A random variable that follows a normal distribution is said to be normally
distributed. If you know that a random variable is normally distributed, you can use the known properties
of the normal distribution to calculate the probability that the variable takes certain values. Random variables
representing height and intelligence are approximately normally distributed.
B. MAIN LESSON
The random variable X which has a normal distribution can be represented as
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
A standard normal distribution has mean (µ) = 0 and standard deviation (σ) = 1. Substituting these
values into the above equation gives the pdf of the standard normal distribution:
$$f(z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^{2}}{2}}, \quad \text{where } z = \frac{x-\mu}{\sigma}$$
The number of standard deviations from the mean is called the standard score or the z-score. A positive
z-score indicates that the raw score is higher than the mean. For example, if a z-score is equal to +1, it
is 1 standard deviation above the mean. A negative z-score indicates that the raw score is below the mean.
For example, if a z-score is equal to -2, it is 2 standard deviations below the mean.
Example 1. The heights of male adults are normally distributed with a mean of 1.7 meters and a standard
deviation of 0.20 meter. What are the standard scores corresponding to heights of x1 = 1.4
meters and x2 = 1.6 meters?
The Standard Normal Probability Distribution Curve (mean, µ = 0 and standard deviation, σ = 1.0).
Note that the total area under the probability distribution curve is equal to 1.
Solution: The standard scores are z1 = (1.4 - 1.7)/0.20 = -1.5 and z2 = (1.6 - 1.7)/0.20 = -0.5.
Referring to the Areas Under the Normal Curve (Statistical Table):
a) The area between z = -1.5 and z = -0.5 is 0.24173. This is also the probability that male adults
have heights between x = 1.4 meters and x = 1.6 meters. Mathematically, this is expressed as
P(1.4 < x < 1.6) = P(-1.5 < z < -0.5) = 0.24173.
b) In a group of 400 male adults, the number having a height between x = 1.4 meters and 1.6 meters is 97 (400 x 0.24173).
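The table lookup in Example 1 can be cross-checked in Python. The sketch below uses the standard identity relating the normal cumulative distribution to the error function `math.erf` instead of a printed table; `normal_cdf` is a hypothetical helper name, and the group size of 400 is taken from part b of the solution.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 1.7, 0.20          # mean and standard deviation of heights (m)

# Standard scores z = (x - mu) / sigma
z1 = (1.4 - mu) / sigma
z2 = (1.6 - mu) / sigma
print(round(z1, 2), round(z2, 2))   # -1.5 -0.5

# Probability of a height between 1.4 m and 1.6 m
area = normal_cdf(z2) - normal_cdf(z1)
print(round(area, 5))               # about 0.24173

# Expected count in a group of 400 adults
print(round(400 * area))            # 97
```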
The normal curve below shows the probability distribution of z (in percentage) given its standard deviation.
Source: https://www.mathsisfun.com/data/standard-normal-distribution.html
Example 2: A machine produces electrical components. 99.7 % of the components have lengths between
1.176 cm and 1.224 cm. Assuming the data is normally distributed, what are the mean and the standard
deviation ?
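Solution:
At 99.7 %, z = ±2.97.
Since $z = \frac{x-\mu}{\sigma}$, it follows that $x = \mu + z\sigma$, so
1.176 = µ - 2.97σ (eqn. 1)
1.224 = µ + 2.97σ (eqn. 2)
From eqn. 1 and eqn. 2, solve for the mean, µ, and the standard deviation, σ.
These equations give the following solution: µ = 1.20 cm; σ = 0.008 cm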
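The two-equation solution of Example 2 can be verified with a few lines of Python. The sketch assumes the limits are symmetric about the mean, as the solution does, and uses the solution's z = 2.97; `normal_cdf` is a hypothetical helper built on `math.erf`.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 2.97                        # z-value used in the solution for 99.7 %
lower, upper = 1.176, 1.224     # length limits in cm

# Adding eqn. 1 and eqn. 2 gives the mean; subtracting them gives sigma
mu = (lower + upper) / 2
sigma = (upper - lower) / (2 * z)
print(round(mu, 3), round(sigma, 4))   # 1.2 and about 0.0081

# Check: fraction of a normal population within mu +/- 2.97 sigma
coverage = normal_cdf(z) - normal_cdf(-z)
print(round(coverage, 4))              # about 0.997
```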
2) Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking).
b. A company makes parts for a machine. The lengths of the parts must be within certain limits or they will
be rejected. A large number of parts were measured and the mean and standard deviation were
calculated as 3.1 and 0.005 m respectively. Assuming this data is normally distributed and 99.7 % were
accepted, what are the limits ?
C. LESSON WRAP-UP
1) Activity 6: Thinking about Learning (5 mins)
You are done with the session! Let's track your progress.
Period 1 Period 2 Period 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
FAQs
Many events are normally distributed, or very close to it. For a large sample size, N, the distribution of
the sample mean of non-normal random variables approaches the normal distribution.
KEY TO CORRECTION
Activity #3
a)
Given: µ = 255 grams; σ = 2.5 grams; x = 250 grams
Solution:
Solving for the standard score, $z = \frac{x-\mu}{\sigma} = \frac{250-255}{2.5} = -2.0$
From the table at z = -2.0, area = 0.02275. Mathematically, P(x < 250) = 0.02275.
Thus, the percentage of coffee that is underweight is 2.275 % (0.02275 x 100).
b)
Given: µ = 3.1 meter; σ = 0.005 meter; P(a < x < b) = 0.997
Solution:
At 99.7 % probability of acceptance, z = ±2.97. Since $z = \frac{x-\mu}{\sigma}$, $x = \mu + z\sigma$ (eqn. 3)
Using eqn. 3, the limits are x = 3.1 - 2.97(0.005) = 3.085 meter
and x = 3.1 + 2.97(0.005) = 3.115 meter
1) Introduction (2 mins)
When a question concerns the time you need to wait before a given event occurs, and this waiting time is
unknown, it is often appropriate to think of it as a random variable having an exponential distribution.
Further, the time you need to wait before an event occurs has an exponential distribution if the probability
that the event occurs during a certain time interval is proportional to the length of that time interval. For
example, you may ask: how long will a piece of machinery work without breaking down?
B. MAIN LESSON
The exponential distribution is the probability distribution used to model the time between two
successive events that occur independently and continuously at a constant average rate. An
exponential random variable takes small values more often than large values.
The assumption of a constant rate is very rarely satisfied in real-world scenarios. However, if the time
interval is selected in such a way that the rate is roughly constant, then you can approximate the random
variable as following an exponential distribution.
The random variable X which has an exponential distribution can be represented as
$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
where: X is a non-negative continuous random variable
λ = the rate parameter (a constant)
and the probability that X ≤ x is
$$P(X \le x) = \int_{0}^{x} \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}.$$
The mean, median and variance of the exponential random variable X:
The mean or the expected value of X is $E(X) = \frac{1}{\lambda}$; hence, λ = 1/mean,
the median is $\text{median}(X) = \frac{\ln(2)}{\lambda}$, and
the variance of X is $\text{var}(X) = \frac{1}{\lambda^{2}}$.
Sample Problem. On the average, a certain computer has a lifetime of 10 years. Suppose the life of the
computer is exponentially distributed.
b. What is the probability that a computer has a life of less than 7 years?
Let X be the random variable representing the life of the computer. Here, $\lambda = \frac{1}{10} = 0.10$.
$$P(X < 7) = \int_{0}^{7} 0.10\,e^{-0.10x}\,dx = \left[-e^{-0.10x}\right]_{0}^{7} = 1 - e^{-0.7} = 1 - 0.497 = 0.503$$
c. What is the probability that a computer has a life of more than 10 years?
P(X > 10) = 1 - P(X ≤ 10)
$$P(X \le 10) = \int_{0}^{10} 0.10\,e^{-0.10x}\,dx = \left[-e^{-0.10x}\right]_{0}^{10} = 1 - e^{-1.0} = 1 - 0.368 = 0.632$$
Hence, P(X > 10) = 1 - 0.632 = 0.368.
d. What is the probability that a computer has a life of more than 7 years but less than 10 years?
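The sample problem can be checked numerically with the closed-form cumulative probability P(X ≤ x) = 1 - e^(-λx). The sketch below also evaluates part d, which the text leaves unworked, by assuming the usual rule P(a < X < b) = P(X ≤ b) - P(X ≤ a); `exponential_cdf` is a hypothetical helper name.

```python
from math import exp

def exponential_cdf(x, lam):
    """P(X <= x) for an exponential random variable with rate lam."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

lam = 1 / 10   # rate = 1 / mean lifetime of 10 years

# b. life of less than 7 years
p_less_7 = exponential_cdf(7, lam)
print(round(p_less_7, 3))          # about 0.503

# c. life of more than 10 years, by the complement rule
p_more_10 = 1 - exponential_cdf(10, lam)
print(round(p_more_10, 3))         # about 0.368

# d. life between 7 and 10 years: difference of the two cumulative values
p_between = exponential_cdf(10, lam) - exponential_cdf(7, lam)
print(round(p_between, 3))         # about 0.129
```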
3) Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking)
Problem. Suppose that the lifetime (x) of a certain model of car battery follows an exponential distribution
with a mean life of 5 years.
a. What is the probability distribution of the life of the car battery?
b. Plot the probability density function, f(x) versus the lifetime of the car battery (x).
c. What is the probability that the life of the battery will be greater than 2 years?
d. What is the probability that the life of the battery is greater than 2 years but less than 4 years?
e. What is the var(x)?
3. A conversation follows an exponential distribution, $f(x) = \lambda e^{-\lambda x}$, with a mean time of 3 minutes.
a) Find the probability that the conversation will be more than 5 minutes.
a. $e^{-5/3}$   b. $e^{-15}$   c. $e^{-3/5}$   d. $\frac{1}{3} e^{-5/3}$
b) Find the probability that the conversation will be less than 5 minutes.
a. $1 - e^{-5/3}$   b. $1 - e^{-15}$   c. $1 - e^{-3/5}$   d. $1 - \frac{1}{3} e^{-5/3}$
C. LESSON WRAP-UP
1) Activity 6: Thinking about Learning (5 mins)
You are done with the session! Let's track your progress.
Period 1 Period 2 Period 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Answer. The amount of time (beginning now) until an earthquake occurs; the amount of time, in months,
a car battery lasts. The exponential distribution is widely used in the study of the amount of time a
product lasts (field of reliability).
KEY TO CORRECTION
Activity #3
Solution:
b. Plot of $f(x) = 0.20\,e^{-0.2x}$ versus x.
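The points for the plot in part b can be tabulated with a short Python sketch. It assumes the key's density with λ = 1/5 = 0.20 per year; the function name `battery_pdf` and the choice of sample points are hypothetical.

```python
from math import exp

LAM = 0.20   # rate parameter = 1 / mean life of 5 years

def battery_pdf(x):
    """f(x) = 0.20 * exp(-0.2 x) for x >= 0, and 0 otherwise."""
    return LAM * exp(-LAM * x) if x >= 0 else 0.0

# Tabulate f(x) at whole years from 0 to 10 for plotting by hand
table = [(x, battery_pdf(x)) for x in range(0, 11)]
for x, fx in table:
    print(f"{x:>2} years   f(x) = {fx:.4f}")
```

The curve starts at f(0) = 0.20 and decays steadily, falling to about 0.0736 at the mean life of 5 years.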
1) Introduction (2 mins)
When a parameter is being estimated, the estimate can be either a single number or it can be a range of
scores. When the estimate is a single number, the estimate is called a "point estimate"; when the estimate is
a range of scores, the estimate is called an interval estimate. Confidence intervals are used for interval
estimates.
Fill in the first column of what you know to answer the questions on the second column of the table below.
B. MAIN LESSON
ESTIMATION refers to the process by which one makes inferences about a population based on
information obtained from the sample.
STATISTIC refers to any measurable quantity calculated from the sample. A statistic could be the sample
mean, $\bar{x}$; the sample standard deviation, s; the sample variance, s²; . . .
PARAMETER refers to the descriptive measures of the population. For example, the population mean,
µ; the population standard deviation, σ; the population variance, σ²; . . .
ESTIMATOR is a quantity calculated from the sample data which is used to give information about the
unknown quantity in the population. For example, the sample mean, $\bar{x}$, is an estimator of the population
mean, µ.
ESTIMATE is the particular value of an estimator that is obtained from a particular sample of data and
used to estimate the value of the parameter. In the preceding example, if the sample mean is $\bar{x} = 3.5$,
then we may say that 3.5 is the estimate of the parameter, the population mean, µ.
Sampling Distribution
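The distribution of the point estimator (statistic) is termed the sampling distribution.
Let each of the independent random variables $X_1, X_2, \ldots, X_n$ be normally distributed with mean µ and variance σ².
The sample mean:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
The mean of the sampling distribution:
$$\mu_{\bar{X}} = \frac{\mu + \mu + \cdots + \mu}{n} = \frac{n\mu}{n} = \mu$$
The variance of the sampling distribution:
$$\sigma^{2}_{\bar{X}} = \frac{\sigma^{2} + \sigma^{2} + \cdots + \sigma^{2}}{n^{2}} = \frac{n\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n}$$
The sampling distribution of $\bar{X}$ is $\bar{X} \sim N\!\left(\mu, \frac{\sigma^{2}}{n}\right)$.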
Example. An electronic company manufactures resistors that have mean resistance of 120 ohms and a
standard deviation of 12 ohms. If distribution of the resistance is normal, find the mean, the variance and
the standard deviation of the sampling distribution for n = 25 resistors.
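The resistor example can be worked directly from the results above: the sampling distribution of the mean has mean µ, variance σ²/n, and standard deviation σ/√n. The variable names in this sketch are hypothetical.

```python
from math import sqrt

mu = 120      # population mean resistance (ohms)
sigma = 12    # population standard deviation (ohms)
n = 25        # sample size

mean_xbar = mu                 # mean of the sampling distribution
var_xbar = sigma**2 / n        # variance = sigma^2 / n
sd_xbar = sigma / sqrt(n)      # standard deviation = sigma / sqrt(n)

print(mean_xbar)               # 120
print(round(var_xbar, 2))      # 5.76
print(round(sd_xbar, 2))       # 2.4
```

Averaging 25 resistors shrinks the spread from 12 ohms for a single resistor to 2.4 ohms for the sample mean.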
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a
normal distribution as the sample size gets larger, even when the population distribution is unknown
or not normal. As a rule of thumb, the approximation is usually considered adequate for sample sizes over 30.
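1. Bias
Bias is the difference between the expected value of the point estimator and the true value of the
parameter. A good estimator has a small bias. When the bias is zero, the point estimator is said to be
unbiased.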
2. Consistency
Consistency shows how close the point estimator gets to the value of the parameter as the sample size
increases:
$$E(\hat{\theta}) \to \theta \quad \text{and} \quad \text{Var}(\hat{\theta}) \to 0 \quad \text{as } n \to \infty$$
3. Relative Efficiency
The absolute efficiency of an estimator is the ratio between the minimum variance and the actual
variance.
An unbiased estimator is called efficient if its variance coincides with the minimum variance for all
values of the population parameter.
If two competing estimators are both unbiased, the one with the smaller variance (for a given
sample size) is said to be relatively more efficient. An estimator $\hat{\theta}_1$ is said to be more efficient than
another estimator $\hat{\theta}_2$ for θ if the variance of the first is less than the variance of the second.
Standard Error
Standard error is a measure of accuracy of a statistic. This is equal to the standard deviation of the
sampling distribution of this statistic.
The standard error tells you how accurate the mean of any given sample from that population is likely to
be compared to the true population mean. When the standard error increases, i.e. the means are more
spread out, it becomes more likely that any given mean is an inaccurate representation of the true
population mean.
Example. In a certain property investment company with an international presence, workers have a
mean hourly wage of 125 pesos with a population standard deviation of 5 pesos. Given a sample size of
30, estimate and interpret the SE of the sample mean.
$$SE = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{30}} \approx 0.91$$
Interpretation. If we draw several samples of size 30 from the population, the sample means will center on
an hourly wage of 125 pesos with a standard error of about 0.91 pesos.
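The standard error in the wage example is the standard deviation of the sampling distribution of the mean, σ/√n. A minimal Python check follows; the function name `standard_error` is hypothetical, and the second call uses an assumed sample size of 120 purely to show the effect of a larger sample.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

se_30 = standard_error(5, 30)     # wage example: sigma = 5 pesos, n = 30
se_120 = standard_error(5, 120)   # assumed larger sample, for comparison

print(round(se_30, 2))    # about 0.91
print(round(se_120, 2))   # about 0.46
```

Quadrupling the sample size halves the standard error, which is why larger samples give sample means that sit closer to the true population mean.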
2) Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking).
Multiple Choice. Encircle the best answer.
1. A sampling distribution is the probability distribution of a . . .
a. sample b. sample statistic c. population d. population parameter
2. What is the best description of a point estimate?
a. any value from the sample to estimate a parameter.
b. a sample statistic used to estimate a parameter.
c. the margin of error to estimate the parameter.
d. the population mean.
3. What does the central limit theorem state?
a. If the sample size increases, then the sampling distribution must approach a normal distribution.
b. If the sample size decreases, then the sampling distribution must approach a normal distribution.
c. If the sample size increases, then the sampling distribution must approach an exponential
distribution.
d. If the sample size decreases, then the sampling distribution must approach an exponential
distribution.
4. The difference between the expected value of an estimator and the true value of the parameter is
the . . .
a. bias b. error c. contradiction d. difference
5. A random sample of 100 engineering students is asked how much they spend on a meal during
weekdays. The average amount spent is found to be P70. What is the point estimate of the population mean?
a. P100 b. P90 c. P80 d. P70
6. Which of the following statements applies to a point estimate?
a. The point estimate is a parameter.
b. The point estimate will tend to be accurate if the sample size exceeds 30 for non-normal
populations.
c. The point estimate is subject to sampling error and will almost always be different than the
population value.
d. all of the above
7. In an application to estimate the mean number of kilometers students commute to school each day, the
following are given: n = 20; x̄ = 4.33; s = 3.50
The point estimate for the true population mean is:
a. 1.638 b. 4.33 ± 1.638 c. 4.33 d. 3.50
8. s² is the point estimate for the . . .
a. population variance c. sample variance
b. population standard deviation d. sample standard deviation
9. A random sample of 340 people in Carmen showed that 66 listened to an FM radio Station A. Based on
this sample information, what is the point estimate for the proportion of people in Carmen that listen to
Station A?
a. 340 b. 0.194 c. 66 d. 0.66
10. According to the Central Limit Theorem.
a. A sampling distribution is normally distributed even if the population is not.
b. A sampling distribution can be normally distributed only if the population is normally distributed also.
c. The population mean measures the sample mean.
d. The population mean and the sample mean of the distribution are equal.
Given the information above, what is the point estimate for the population mean?
C. LESSON WRAP-UP
The point estimate of a population parameter is obtained from the sample. The smaller the bias of the point
estimator and the larger the sample size, the closer the estimate tends to be to the parameter being estimated.
KEY TO CORRECTION
Activity #3
1) b
2) b
3) a
4) a
5) d
6) c
7) c
8) a
9) b
10) a
1) Introduction (2 mins)
When a parameter is being estimated, the estimate can be either a single number or a range of
values. When the estimate is a single number, it is called a "point estimate"; when the estimate
is a range of values, it is called an "interval estimate".
B. MAIN LESSON
Confidence Intervals
Prediction Intervals
Tolerance Intervals
Confidence Interval
A confidence interval is a range of values that is likely to contain a population parameter, such as the
population mean.
Confidence intervals are the best known and most often used statistical intervals. They are used to express the
uncertainty associated with the estimate of a population parameter. The confidence level describes how the
procedure behaves under repetition: if the sampling and the interval estimation were repeated again and again,
the stated percentage of the resulting intervals (for example, 95%) would contain the true parameter.
Lower boundary of the confidence interval: $\bar{x} - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
Upper boundary of the confidence interval: $\bar{x} + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
Example 1. We have a sample of 20 observations from a Normal distribution with a standard deviation of
0.20 and a sample mean of 4.5. We want a 95% confidence level. What are the lower and upper
boundaries of the confidence interval?
From the z-score table, at the 95% confidence level (α = 0.05), the corresponding z value for α/2 is 1.96.
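The boundaries for Example 1 follow from the lower and upper boundary formulas given earlier. The short sketch below carries out the arithmetic; the function name `z_confidence_interval` is an illustrative assumption, and the z value of 1.96 is taken from the z-score table rather than computed.

```python
import math

def z_confidence_interval(mean, sigma, n, z):
    """Confidence interval for a mean with known sigma:
    mean -/+ z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

# Example 1: n = 20, sigma = 0.20, sample mean = 4.5, z = 1.96 (95% confidence)
lower, upper = z_confidence_interval(mean=4.5, sigma=0.20, n=20, z=1.96)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Under these inputs the interval is roughly 4.41 to 4.59. The same function can be reused for the light bulb activity by changing the inputs.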
Example 2. The sample mean is 25%. Calculate a confidence interval for this estimate if the margin
of error is 3.2%.
Lower boundary of the confidence interval:
$\bar{x} - MOE = 25\% - 3.2\% = 21.8\%$
Upper boundary of the confidence interval:
$\bar{x} + MOE = 25\% + 3.2\% = 28.2\%$
Prediction Intervals
A prediction interval is an estimate of an interval in which a future observation will fall, with a certain probability,
given what has already been observed. For example, with a 95% prediction interval of [10, 15], you are 95%
confident that the next new observation will fall within this range.
Tolerance Intervals
A tolerance interval covers a specified proportion of the population for a given confidence level.
For example, we are 95% confident that 85% of the batteries will have lifetimes within the interval of 100 to 120 hours.
2. Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking).
Problem Solving. The quality assurance (QA) manager of a light bulb factory needs to estimate the average
lifetime of a large shipment of bulbs made at the factory. The lifetime of these light bulbs is normally
distributed with a standard deviation of 100 hours. A random sample of 64 bulbs from the shipment results
in a sample mean lifetime of 350 hours.
Given: σ = 100 hrs.; x̄ = 350 hrs.; n = 64
(a) Find a 95% confidence interval for the mean lifetime (µ) for the entire shipment.
(b) Suppose that the standard deviation was 80 rather than 100 hours. Recalculate your confidence interval
from Part (a). Is it narrower or wider than your solution to (a)?
You may now answer the third column of table in activity 1 based on what you know now.
What I Know QUESTION What I Learned
Problem Solving. A bottling company fills thousands of 12-ounce bottles with soda drink to the same level.
A random sample of bottles was taken from the processing line, containing the following amounts of soda
drink (in ounces): 11.8; 12.1; 11.2; 12.0; 11.8; 11.7; 11.9. Assuming the content is
normally distributed with a standard deviation of 0.01, find the 95% confidence interval for the mean amount of
soda drink in the bottles.
C. LESSON WRAP-UP
1) Activity 6: Thinking about Learning (5 mins)
You are done with the session! Let's track your progress.
Period 1 Period 2 Period 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Increasing the sample size decreases the width of confidence intervals, because it decreases the
standard error.
The statement, "the 95% confidence interval for the population mean is (250, 300)," is equivalent to
the statement, "there is a 95% probability that the population mean is between 250 and 300."
FAQs
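Which is more accurate, a 95% confidence interval or a 99% confidence interval?
The 99% confidence interval is more likely to contain the population parameter than the 95% confidence
interval, but it is also wider, so it gives a less precise estimate.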
1) Introduction (2 mins)
When you conduct research, you are trying to discover something new. An improved process?
A more accessible raw material? . . . Along the way, several questions will come up, so you form
hypotheses to answer your questions. This session will guide you on how to test your
hypothesis.
Fill in the first column with what you know to answer the question in the second column of the table below.
What is a z – test?
B. MAIN LESSON
HYPOTHESIS TESTING
Hypothesis Testing is a statistical test used to determine whether the hypothesis assumed for the
sample of data stands true for the entire population or not.
Hypothesis testing is also used when you are comparing two or more groups.
The purpose of hypothesis testing is to determine whether there is enough statistical evidence in
favor of a hypothesis about a parameter.
A hypothesis should be simple and specific. There are two types of statistical hypotheses, the null
hypothesis and the alternative hypothesis. The null and alternative hypotheses are contradictory,
so you must examine the evidence to decide whether you have enough of it to
reject the null hypothesis or not.
NULL HYPOTHESIS
Denoted as H0 (also written Ho). H0 always contains a symbol with an equality in it (=, ≤, or ≥).
A statement that there is no relationship between two measured phenomena or no association
among groups.
A null hypothesis is a hypothesis that says there is no statistically significant relationship between the two
variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove.
It is a statement of no difference between sample means or proportions. It may also be a statement
of no difference between a sample mean and a population mean. In other words, the difference
equals 0.
ALTERNATIVE HYPOTHESIS
Denoted as H1.
It is a claim about the population that is contradictory to Ho.
H1 never contains a symbol with an equality in it; it uses <, >, or ≠.
It is the hypothesis that one is trying to establish, and it can be “statistically proved” by a rejection of
the null hypothesis.
Example. Write the null hypothesis and the alternative hypothesis in the following statements.
2. The mean number of cars a person owns in his lifetime is not more than 5.
Null Hypothesis, Ho: µ ≤ 5 cars
Alternative Hypothesis, H1: µ > 5 cars
3. Seventy percent of the first year engineering students have no failing grades this school
year.
Null Hypothesis, Ho: p = 0.70
Alternative Hypothesis, H1: p ≠ 0.70
This document is a property of PHINMA EDUCATION
ECE 069: Engineering Data Analysis
Module #15 Student Activity Sheet
LEVEL OF SIGNIFICANCE
Denoted by α
Measures the strength of the evidence that must be present in your sample before rejecting the null
hypothesis.
It is the probability of rejecting the null hypothesis when in fact it is true, that is, α = P(Type I error).
Usual values of α are 0.05, 0.02, or 0.01.
TEST STATISTIC
A test statistic is a random variable that is calculated from sample data and used in a hypothesis
test.
The test statistic is used to determine whether to reject the null hypothesis. It compares
your data with what is expected under the null hypothesis.
The test statistic is used to calculate the p-value.
Examples of test statistics are the z-statistic for the z-test and the t-statistic for the t-test.
P-VALUE
The probability of obtaining a sample result at least as extreme as the one observed, given
that the null hypothesis is true.
A p-value of 0.05 indicates that there is only a 5% chance of obtaining a result at least as extreme as
the one observed if the null hypothesis were actually true.
If the p-value is less than the significance level, we reject the null hypothesis.
The p-value is the area under the curve in the tail (or tails) beyond the observed value of the test statistic.
Example. If the observed (calculated) value is z = 1.51, then the cumulative area from the statistical table at z
= 1.51 is 0.93448, so the one tailed p-value of the sample is 1 − 0.93448 = 0.06552.
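The table lookup in the p-value example can be reproduced with the standard normal cumulative distribution function. The sketch below uses only Python's standard library; the helper names `standard_normal_cdf` and `p_value_from_z` and the `tail` parameter are illustrative assumptions, not notation from the module.

```python
import math

def standard_normal_cdf(z: float) -> float:
    """Cumulative area under the standard normal curve to the left of z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def p_value_from_z(z: float, tail: str = "right") -> float:
    """p-value of an observed z statistic for a right, left, or two tailed test."""
    if tail == "right":
        return 1.0 - standard_normal_cdf(z)
    if tail == "left":
        return standard_normal_cdf(z)
    if tail == "two":
        return 2.0 * (1.0 - standard_normal_cdf(abs(z)))
    raise ValueError("tail must be 'right', 'left', or 'two'")

# Observed z = 1.51, right tailed test
print(f"cumulative area = {standard_normal_cdf(1.51):.5f}")
print(f"p-value         = {p_value_from_z(1.51):.5f}")
```

With α = 0.05 this p-value is larger than α, so the null hypothesis would not be rejected in a right tailed test.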
Statistical significance plays a pivotal role in statistical hypothesis testing. It is used to determine
whether the null hypothesis should be rejected or retained. The null hypothesis is the default
assumption that nothing happened or changed.
For the null hypothesis to be rejected, an observed result has to be statistically significant, i.e. the
o If H1 contains the “<”, then conduct a left tailed test. Compare the calculated test statistic
with the critical value of the test statistic at the given α. If the calculated test statistic > the critical
value of the test statistic, then you do not reject the null hypothesis.
o If H1 contains the “≠”, then conduct a two tailed test. Compare the calculated test statistic with
the critical values of the test statistic at the given α/2. If the calculated test statistic (if negative)
< the negative critical value, or if the calculated test statistic (if positive) > the positive critical
value, then you reject the null hypothesis; otherwise, you do not reject the null hypothesis.
P-value Approach
o Compare the p-value with α. If the p-value < α, then you reject the null hypothesis; otherwise, you do
not reject it.
Step 4: Drawing a conclusion. Decide whether or not to reject the null hypothesis.
Problem. A manufacturer of electric lamps is testing a new production method that will be considered
acceptable if the lamps produced by this method result in a normal population with an average life of 2,400
hours and a standard deviation equal to 300. A sample of 100 lamps produced by this method has an average
life of 2,320 hours. Can the hypothesis of validity for the new manufacturing process be accepted with a risk
equal to or less than 5%?
Step 1. State the Null Hypothesis and the Alternative Hypothesis
Null Hypothesis: Ho: µ = 2,400 hours
Alternative Hypothesis: H1: µ ≠ 2,400 hours
Step 2. Level of Significance, α = 0.05
Step 3. Calculate the test statistic, z – statistic.
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{2{,}320 - 2{,}400}{300/\sqrt{100}} = -2.67$$
Further, at |z| = 2.67, from the statistical table, the area in both tails is equal to 0.0038 × 2 = 0.0076.
Note that this area is the two tailed p-value of the sample, and since the p-value is less than α (0.05),
we reject the null hypothesis.
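The lamp problem can be worked through in code using both the critical value approach and the p-value approach. This is a minimal sketch under the assumption of a two tailed test at α = 0.05; the critical value 1.96 comes from the z-table, and the variable names are illustrative.

```python
import math

def one_sample_z(sample_mean, mu0, sigma, n):
    """z statistic for a one-sample test with known population sigma."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

alpha = 0.05
z_critical = 1.96  # two tailed critical value at alpha = 0.05, from the z-table

# Lamp problem: hypothesized mean 2,400 h, sigma 300 h, n = 100, sample mean 2,320 h
z = one_sample_z(sample_mean=2320, mu0=2400, sigma=300, n=100)

# Two tailed p-value: 2 * (1 - Phi(|z|)), which equals erfc(|z| / sqrt(2))
p_two = math.erfc(abs(z) / math.sqrt(2.0))

reject_by_critical = abs(z) > z_critical
reject_by_p = p_two < alpha

print(f"z = {z:.2f}, two tailed p-value = {p_two:.4f}")
print("Reject H0" if reject_by_critical and reject_by_p else "Do not reject H0")
```

Both approaches lead to the same decision, which is expected because they are two views of the same comparison.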
2. Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking) .
A rental car company claims the mean time to rent a car on their website is 60 seconds with a standard
deviation of 30 seconds. A random sample of 36 customers attempted to rent a car on the website. The
sample mean time to rent was 70 seconds. Is there enough evidence that the mean time is more
than 60 seconds at the 0.05 level of significance?
What is a z – test ?
2. A test is conducted; Ho: µ = 40, H1: µ > 40. The z-test statistic is 1.5. What is the p-value of
this test?
a. 0.9332 b. 0.0668 c. 0.05 d. 0.01
3. A test is conducted; Ho: µ = 40, H1: µ < 40. The z-test statistic is −1.5. The correct decision is:
a. Reject H0 at both α = 0.05 and α = 0.01.
b. Reject H0 at α = 0.05 but do not reject Ho at α = 0.01.
c. Reject H0 at both α = 0.05 and α = 0.10.
d. Reject H0 at α = 0.05 but do not reject Ho at α = 0.10.
5. The z – test is used to test the sample mean in the following case. . .
a. sample standard deviation is known.
b. sample size is less than 30.
c. sample size is more than 30.
d. data are not normally distributed.
C. LESSON WRAP-UP
Statistics is about data and it is the interpretation of the data that we are interested in. In hypothesis testing
we are trying to interpret or draw conclusions about the population using data coming from the sample.
Further, hypothesis testing evaluates statements about a population to evaluate which statement is
supported by the sample data.
KEY TO CORRECTION
Activity #3
Conduct a one-sample z-test.
Step 4. The calculated z – statistic (2.0) is greater than the critical value of the z statistic for a right tailed
test at α = 0.05 (1.645), so there’s enough evidence to reject the null hypothesis in favor of the
alternative hypothesis, suggesting that the mean time to rent a car is more than 60 seconds.
Further, if z = 2.0, the corresponding p-value of the sample is 0.02275, and since the p-value is
less than α (0.05), we reject the null hypothesis in favor of the alternative hypothesis.
1) Introduction (2 mins)
⮚ A t-test is a statistical test used to determine if there is a significant difference between the means of
two groups.
B. MAIN LESSON
⮚ t -Test Assumptions
● The sample is a representative, randomly selected portion of the total population.
● The data are normally distributed.
● The population mean is known or hypothesized.
Types of t – test
⮚ One Sample t – test. This tests the mean of a single group against a known population mean.
⮚ Independent Sample t – test. This test compares the means of two groups of samples.
⮚ Paired Sample t – test. This test compares the means of the same group at different times.
⮚ The One Sample t - Test is commonly used to test the statistical difference between a sample mean
and a known or hypothesized value of the mean in the population.
⮚ t-statistic:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
where x̄ = sample mean, s = sample standard deviation, µ = population mean, and n = sample size.
Example. Test the hypothesis at α = 0.05 that taking a vitamin capsule makes an individual smarter. The average
IQ of an individual is 100. To test the hypothesis, 12 engineering students took the same vitamin capsule
for one year and were then given an IQ test. The results are 116, 111, 101, 120, 99, 94,
106, 115, 107, 101, 110, and 92.
Further, using the p-value approach, at t = 2.35 and degrees of freedom = 11, the p-value is between
1% and 2.5%, hence less than the level of significance, α = 0.05 (or 5%), so we reject the
null hypothesis.
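The t value quoted in the vitamin capsule example can be recomputed directly from the twelve IQ scores. The sketch below assumes a right tailed test (H1: µ > 100), and the critical value 1.796 (α = 0.05, df = 11) is read from a standard t-table rather than computed.

```python
import math
import statistics

iq_scores = [116, 111, 101, 120, 99, 94, 106, 115, 107, 101, 110, 92]
mu0 = 100  # hypothesized population mean IQ

n = len(iq_scores)
sample_mean = statistics.mean(iq_scores)
sample_sd = statistics.stdev(iq_scores)  # sample standard deviation (n - 1 in the denominator)

# One-sample t statistic: (mean - mu0) / (s / sqrt(n))
t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
df = n - 1

t_critical = 1.796  # one tailed, alpha = 0.05, df = 11, from a t-table

print(f"mean = {sample_mean}, s = {sample_sd:.3f}, t = {t_stat:.2f}, df = {df}")
print("Reject H0" if t_stat > t_critical else "Do not reject H0")
```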
⮚ This test compares the means of two groups of samples that are selected independently of each
other.
⮚ There are two types of independent sample t - test.
● Equal Variance (Pooled variance t – test) with degrees of freedom, df = n1 + n2 – 2.
● Unequal Variance (Separate variance t – test) with degrees of freedom,
$$\text{Degrees of freedom} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$
⮚ The t - statistic
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where:
x̄1 and x̄2 = mean of sample 1 and sample 2
s1² and s2² = variance of sample 1 and sample 2
n1 and n2 = size of sample 1 and sample 2
D0 = µ1 − µ2 (a number that is deduced from the statement of the situation).
Example. An experiment was performed to compare the abrasive wear of two materials. Ten pieces of
material 1 ( group 1) and ten pieces of material 2 (group 2) were tested. The test on material 1 gave an
average wear of 85 units with a sample standard deviation of 4, and the test on material 2 gave an average
wear of 81 with a sample standard deviation of 5. Can we conclude at 0.05 level of significance that abrasive
wear of material 1 is greater than that of material 2 ? Assume the populations are normally distributed and
with equal variances.
Step 4. The calculated or observed value of the t – statistic (1.98) is greater than the critical value of
the t – statistic (1.734 at α = 0.05 and degrees of freedom = 18), or we may say that the observed value
of the t – statistic is in the rejection region. Hence, we reject the null hypothesis in favor of the
alternative hypothesis. This suggests that the abrasive wear of material 1 is greater than the
abrasive wear of material 2.
Further, at t = 1.98 and degrees of freedom = 18, the p-value is between 2.5% and 5%, hence
less than the level of significance, α = 0.05 (or 5%), so we reject the null hypothesis.
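The pooled variance t statistic for the abrasive wear example can be recomputed from the summary statistics. The sketch below assumes D0 = 0 (the null hypothesis of equal means) and a right tailed test; the function name `pooled_t` is illustrative, and the critical value 1.734 (α = 0.05, df = 18) is read from a t-table. The computed value should land near the t value quoted in Step 4.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2, d0=0.0):
    """Equal-variance (pooled) two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return ((mean1 - mean2) - d0) / std_err, df

# Material 1: mean 85, s = 4, n = 10; Material 2: mean 81, s = 5, n = 10
t_stat, df = pooled_t(85, 4, 10, 81, 5, 10)

t_critical = 1.734  # one tailed, alpha = 0.05, df = 18, from a t-table

print(f"t = {t_stat:.3f}, df = {df}")
print("Reject H0" if t_stat > t_critical else "Do not reject H0")
```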
Example. Assume that we are taking diagonal measurements of billboards purchased by a company.
Group 1 of the samples includes 20 billboards, while group 2 includes 10 billboards.
Statistical Data: Group 1: mean diagonal measurement = 21.6 inches; variance = 17.1
Group 2: mean diagonal measurement = 19.4 inches; variance = 1.4
Can we conclude that the mean of group 1 is greater than the mean of group 2?
Step 1. State the Null Hypothesis and the Alternative Hypothesis
Null Hypothesis : Ho : µ1 = µ2
Alternative Hypothesis : H1 : µ1 > µ2 ( One tail test )
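Step 2. At level of significance, α = 0.05, and
$$\text{Degrees of freedom} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}} = \frac{\left(\dfrac{17.1}{20} + \dfrac{1.4}{10}\right)^2}{\dfrac{\left(17.1/20\right)^2}{20 - 1} + \dfrac{\left(1.4/10\right)^2}{10 - 1}} \approx 24$$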
Further, using the p-value approach, at t = 2.194 and degrees of freedom = 24, the p-value is between
1.0% and 2.5%, hence less than the level of significance, α = 0.05 (or 5%), so we reject the
null hypothesis.
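The degrees of freedom for the unequal variance case are tedious to compute by hand, so a short sketch may help. It assumes the separate variance (unequal variance) formula given earlier in this lesson; the function names are illustrative. The t statistic it prints uses the unrounded standard error, so it may differ slightly in the second decimal from a hand computation.

```python
import math

def welch_df(var1, n1, var2, n2):
    """Degrees of freedom for the unequal-variance (separate variance) t-test."""
    a = var1 / n1
    b = var2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

def separate_variance_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic using separate (unpooled) variances."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

# Billboard example: group 1 (n = 20, variance 17.1), group 2 (n = 10, variance 1.4)
df_exact = welch_df(17.1, 20, 1.4, 10)
df = math.floor(df_exact)  # round down to be conservative when reading the t-table
t_stat = separate_variance_t(21.6, 17.1, 20, 19.4, 1.4, 10)

print(f"df = {df_exact:.2f} (use {df}), t = {t_stat:.3f}")
```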
Paired t - Test
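⮚ A paired t-test is used when we are interested in the difference between two variables measured on the
same subject. Often the two measurements are separated by time (before and after), but the pairing may also
come from something other than time.
⮚ Compares the means of two related groups of samples.
⮚ The t – statistic with degrees of freedom df = n − 1, where D is the difference within each pair:
$$t = \frac{\sum D}{\sqrt{\dfrac{n\sum D^2 - \left(\sum D\right)^2}{n - 1}}}$$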
Compare the fuel economy of the two cars, where the cars in each pair are operated using different
types of gasoline (Type 1 gasoline and Type 2 gasoline).
$$t = \frac{\sum D}{\sqrt{\dfrac{n\sum D^2 - \left(\sum D\right)^2}{n - 1}}} = \frac{1.3}{\sqrt{\dfrac{9(0.41) - (1.3)^2}{9 - 1}}} = 2.6$$
Step 4. The calculated or observed value of the t – statistic (2.6) is greater than the critical value of the
t – statistic (1.86 at α = 0.05 and degrees of freedom = 8), or we may say that the observed value of
the t – statistic is in the rejection region. Hence, we reject the null hypothesis in favor of the
alternative hypothesis. This suggests that Type 1 gasoline is a more economical fuel than
Type 2 gasoline.
Further (using the p-value approach), at t = 2.6 and degrees of freedom = 8, the p-value is
between 1.0% and 2.5%, hence less than the level of significance, α = 0.05 (or 5%), so we
reject the null hypothesis.
2. Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking) .
The table gives the observations of the control group and the treatment group. Use a paired t-test at the 0.05
level of significance to determine if there is a significant difference between the means of the two groups.
Sample No.   Control Group   Treatment Group
1 3 20
2 3 13
3 3 13
4 12 20
5 15 29
6 16 32
7 17 23
8 19 20
9 23 25
10 24 15
11 32 30
Answer Sheet:
2. In testing the differences between the means of two independent populations, the null hypothesis
is . . .
a. H o : 1 2 1 b. H o : 1 2 0 c. H o : 1 2 0 d. H o : 1 2 1
4. A one sample t – test was conducted to test the IQ of engineering students. The observed
t – statistic in the study, with a sample of 15 at the 0.05 level of significance, is 2.0. What is the p – value
of this study?
a. between 0.025 and 0.05.
b. greater than 0.05.
c. less than 0.025.
d. none of the above
5. Two different alloys are being considered for making lead-free solder used in the wave soldering
process for printed circuit boards. A crucial characteristic of solder is its melting point, which is
known to follow a Normal distribution. A study was conducted using a random sample of 21 pieces
of solder made from each of the two alloys. In each sample, the temperature at which each of the
21 pieces melted was determined. The mean and standard deviation of the sample for Alloy 1 were
x̄1 = 218.9ºC and s1 = 2.7ºC; for Alloy 2 the results were x̄2 = 215.5ºC and s2 = 3.6ºC. If we were
to test H0: µ1 = µ2 against Ha: µ1 ≠ µ2 in this study, what would the degrees of freedom be equal to?
a. 21 b. 20 c. 40 d. 42
C. LESSON WRAP-UP
1) Activity 6: Thinking about Learning (5 mins)
You are done with the session! Let's track your progress.
Period 1 Period 2 Period 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
⮚ The t-test is a commonly used statistical test for comparing the means of two groups of samples.
⮚ You can also use the Data menu of Microsoft Excel to find the critical value of t and the observed
value of t.
⮚ Let us use Microsoft Excel to find the critical value of t and the observed t for the data in Activity 3
above.
The table gives the observations of the control group and the treatment group. Use a paired t-test
at the 0.05 level of significance to determine whether the difference between the means of the two groups is significant.
Sample No.   Control Group   Treatment Group
1 3 20
2 3 13
3 3 13
4 12 20
5 15 29
6 16 32
7 17 23
8 19 20
9 23 25
10 24 15
11 32 30
Steps:
● Enter the control group and the treatment group columns in Excel.
● Click Data, then Data Analysis, then t-Test: Paired Two Sample for Means, then press OK.
● Input the Variable 1 Range, then the Variable 2 Range, enter Alpha, click New Worksheet Ply, then OK.
KEY TO CORRECTION
Activity #3
Follow the steps in hypothesis testing.
$$t = \frac{\sum D}{\sqrt{\dfrac{n\sum D^2 - \left(\sum D\right)^2}{n - 1}}} = \frac{-73}{\sqrt{\dfrac{11(1{,}131) - (-73)^2}{11 - 1}}} = -2.73$$
Sample No.   Control Group   Treatment Group   D     D²
1            3               20                -17   289
2            3               13                -10   100
3            3               13                -10   100
4            12              20                -8    64
5            15              29                -14   196
6            16              32                -16   256
7            17              23                -6    36
8            19              20                -1    1
9            23              25                -2    4
10           24              15                9     81
11           32              30                2     4
SUM                                            -73   1,131
Step 4. The absolute value of the calculated or observed t – statistic (2.73; the sign is ignored since a two
tailed test is conducted) is greater than the critical value of the t – statistic (2.228 at α = 0.05 and
degrees of freedom = 10), or we may say that the observed value of the t – statistic is in the rejection region.
Hence, we reject the null hypothesis in favor of the alternative hypothesis. This suggests that the
control group and the treatment group do not have equal means.
Further (using the p-value approach), at t = 2.73 and degrees of freedom = 10, the area in one tail is
between 1.0% and 2.5%, so the two tailed p-value is between 2% and 5%, hence less than the level of
significance, α = 0.05 (or 5%), so we reject the null hypothesis.
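As an alternative to the Excel Data Analysis tool, the paired t statistic for the control and treatment data can be computed with a short script. The sketch below applies the paired t formula from the main lesson to the eleven pairs in the table; the two tailed critical value 2.228 (α = 0.05, df = 10) is read from a t-table, and the variable names are illustrative.

```python
import math

control = [3, 3, 3, 12, 15, 16, 17, 19, 23, 24, 32]
treatment = [20, 13, 13, 20, 29, 32, 23, 20, 25, 15, 30]

# Difference within each pair: D = control - treatment
differences = [c - t for c, t in zip(control, treatment)]

n = len(differences)
sum_d = sum(differences)
sum_d2 = sum(d * d for d in differences)

# Paired t statistic: sum(D) / sqrt((n * sum(D^2) - (sum(D))^2) / (n - 1))
t_stat = sum_d / math.sqrt((n * sum_d2 - sum_d**2) / (n - 1))
df = n - 1

t_critical = 2.228  # two tailed, alpha = 0.05, df = 10, from a t-table

print(f"sum D = {sum_d}, sum D^2 = {sum_d2}, t = {t_stat:.2f}, df = {df}")
print("Reject H0" if abs(t_stat) > t_critical else "Do not reject H0")
```

The negative sign of t only reflects the order of subtraction (control minus treatment); for a two tailed test the absolute value is compared with the critical value.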
1) Introduction (2 mins)
Simple linear regression is a statistical method for obtaining a formula to predict the scores on one
variable from the scores on a second variable. The variable we are predicting is called the criterion
variable and is referred to as Y. The variable we are basing our predictions on is called the predictor
variable and is referred to as X. When there is only one predictor variable, the prediction method is
called simple regression.
In simple linear regression, the predictions of Y when plotted as a function of X form a straight line.
Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line
is called a regression line.
B. MAIN LESSON
The residual error is $\varepsilon_i = Y_i - Y_0$, where $Y_i$ is the predicted value and $Y_0$ is the observed
value. The error term is used to account for the variability in y that cannot be explained by the linear
relationship between x and y. If ε were not present, that would mean that knowing x would provide
enough information to determine the value of y.
The β0 (the intercept of the regression line) and β1 (the coefficient of Xi, or the slope of the
regression line) are estimated by minimizing the sum of the squares of the residual errors. This
procedure is known as the Method of Least Squares.
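$$\text{minimize } \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left(Y_i - Y_0\right)^2$$
$$\beta_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \qquad \text{Equation 1}$$
$$\beta_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i\right) \qquad \text{Equation 2}$$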
We then substitute the values of β0 and β1 into the equation below to obtain the regression line
equation.
$$Y = \beta_0 + \beta_1 X \qquad \text{Equation 3}$$
The relationship between the independent and dependent variable is linear, that is, the line of
best fit through the data points is a straight line (rather than a curve)
Correlation Coefficient, r
One of the most commonly used correlation coefficients is Pearson’s correlation coefficient, r.
The correlation coefficient, r, measures the strength of the linear relationship between the response
variable and the explanatory variable.
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \qquad \text{Equation 4}$$
Coefficient of Determination, r²
The square of the correlation coefficient.
It is the proportion of variation in the response variable explained by the regression model.
The most common interpretation of the coefficient of determination is how well the regression model fits
the observed data. For example, a coefficient of determination of 60% shows that 60% of the variation in the
response variable is explained by the regression model. Generally, a higher coefficient indicates a better fit for the model.
Example 1. A study was done on the effect of ambient temperature, x, on the electric power
consumed, y, by an industrial plant. Other factors were held constant. Below are the data collected from
the experiment. Find the equation of the regression line and estimate the electric power consumption
when x = 70 °F.
Trials   y (BTU)   x (°F)
1 250 27
2 285 45
3 320 72
4 295 58
5 265 31
6 298 60
7 267 31
8 321 74
From this table, we have Σyi = 2,301; Σxi = 398; Σxi yi = 117,851; Σxi² = 22,300; and
Σyi² = 666,509. We then substitute these values into Equation 1, then into Equation 2, to solve for β1 and
β0.
$$\beta_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{8(117{,}851) - 398(2{,}301)}{8(22{,}300) - (398)^2} = 1.35$$
$$\beta_0 = \frac{1}{n}\left(\sum y_i - \beta_1 \sum x_i\right) = \frac{1}{8}\left[2{,}301 - 1.35(398)\right] = 220.5$$
Substituting the values of β0 and β1 into Equation 3, the regression line equation is
y = 220.5 + 1.35 x.
To predict the power consumption at x = 70 °F, we substitute this value into the regression line
equation: y = 220.5 + 1.35 (70) = 315 BTU.
The value of r = 0.99 indicates that there is a very strong positive linear relationship between the electric power
consumption and ambient temperature; that is, electric power consumption increases as
ambient temperature increases. Furthermore, the coefficient of determination of 0.98 (r² = 0.99²)
indicates that about 98% of the variation in electric power consumption is explained by the regression line.
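The sums, the regression coefficients, and the correlation coefficient for Example 1 can be verified with a short script that applies Equations 1 to 4 to the eight data points. This is a minimal sketch; the variable names are illustrative. Because it keeps full precision instead of rounding β1 to 1.35 first, the intercept and the prediction may differ slightly from the hand computation.

```python
import math

# Example 1 data: ambient temperature x (deg F) and power consumption y (BTU)
x = [27, 45, 72, 58, 31, 60, 31, 74]
y = [250, 285, 320, 295, 265, 298, 267, 321]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Equation 1 (slope) and Equation 2 (intercept)
beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
beta0 = (sum_y - beta1 * sum_x) / n

# Equation 4 (Pearson correlation coefficient)
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
)

# Equation 3 evaluated at x = 70 deg F
y_at_70 = beta0 + beta1 * 70

print(f"sums: x={sum_x}, y={sum_y}, xy={sum_xy}, x2={sum_x2}, y2={sum_y2}")
print(f"y = {beta0:.2f} + {beta1:.3f} x")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"predicted y at x = 70: {y_at_70:.1f}")
```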
2) Activity 3: Skill-building Activities (with answer key) (18 mins + 2 mins checking)
Given below is a data set on y and x. Let y be the response variable and x be the predictor variable.
Find the equation of the regression line and the value of the correlation coefficient, r. Interpret
your result.
x y
0 2
1 3
2 5
3 4
4 6
2. If r² = 0.99, how confident are you in using the regression line to estimate the response variable given
the predictor variable?
a. not confident c. the relationship is weak to predict
b. very confident d. the relationship cannot be predicted
3. If the correlation coefficient is 0.90, the percentage of variation in the response variable explained by
the variation in the predictor variable is . . .
a. 0.90 % b. 90% c. 81% d. 0.81%
5. Larger values of r² give us the idea that the observations are more closely grouped about the . . .
a. average value of the independent variables.
b. average value of the dependent variable
c. least squares line.
d. none of the above.
C. LESSON WRAP-UP
1) Activity 6: Thinking about Learning (5 mins)
You are done with the session! Let's track your progress.
Period 1 Period 2 Period 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Group yourselves by three. Search for a problem (with given data points) related to your profession
that uses regression analysis. Solve for the regression line and the correlation coefficient then interpret
your result.
KEY TO CORRECTION
Activity #3
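Extending the columns of the preceding table.
       x    y    xy   x²   y²
       0    2    0    0    4
       1    3    3    1    9
       2    5    10   4    25
       3    4    12   9    16
       4    6    24   16   36
SUM    10   20   49   30   90
$$\beta_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{5(49) - 10(20)}{5(30) - (10)^2} = 0.9$$
$$\beta_0 = \frac{1}{n}\left(\sum y_i - \beta_1 \sum x_i\right) = \frac{1}{5}\left[20 - 0.9(10)\right] = 2.20$$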