Basic Statistics


Statistical

Concepts in
Lean Six Sigma
ANTHONY JAMES H. VIZMANOS, PFT, RChT, SSGB

Derived from Six Sigma Green Belt course of the University System of Georgia – Kennesaw State University
Part 1
BASIC CONCEPTS OF
STATISTICS
Statistics

u Gathering of facts or data, typically numerical, that, once tabulated
or organized, can present significant information about a given subject
u Can be used to
u Describe what has happened
u Predict what may or will happen
u Make inferences about a larger population based on a smaller sample
of data
u Observation – individual piece of data collected
u Data set – collection of all observations
Descriptive vs Inferential Statistics
Descriptive
u Collecting, organizing, summarizing, and presenting data
u Calculating various properties of the data
u May use graphical tools
u Charts
u Graphs
u Tables
u Measures of central tendency
u Variation
u Percentile
Inferential (Analytical)
u Method of measuring the reliability of conclusions about a population
based on the information obtained from a sample of the population
u Taking sample data and using it to make inferences or predictions
u Hypothesis testing
u Regression
Collecting and Summarizing Data
Types of Data
u Nominal – categories or labels with no numeric meaning and no numeric order
u Ordinal – the values represent a rank or order, but the intervals between
them are not defined
u Interval – values on a scale whose points are an equal distance from
each other, but with no true zero
u Ratio – interval data with a true zero point, so ratios between values are
meaningful
u Locational – useful in a production line to capture the location where
defects occur
Other types of data
u Qualitative – non-numerical data
u Quantitative – numerical data, either continuous or discrete
u Continuous data – measured on a continuous scale
u Discrete data – takes a limited set of distinct values and is typically used in counting
Data Collection Techniques

u Surveys
u Face-to-face interviews
u Focus group discussions (FGDs)
u Emails and websites
u Customer feedback
u Test marketing, mystery shopper
u Suggestion boxes
u Automatic data capture (barcodes, videos, etc)
Types of Statistical Errors
u Statistical error – an error that occurs when collecting data
u Sampling error – an error related to the nature or size of the selected
sample
u Non-sampling error – an error related to how the data are collected
u Measurement error – differences in the scale of measurement or in the
rounding procedure
u Non-response error – arises when respondents do not answer one or more
questions in the survey
u Misinterpretation error – arises when respondents misinterpret the
questions in the survey
u Sampling bias error – arises from a conflict of interest or when some
respondents are left out of the survey
u Instrumental error – due to the limitations and/or sensitivity of the
instrument used to measure an observation
u Human error – due to human mistakes committed during data
gathering and interpretation
Types of Sampling

u Random sampling – every item must have an equal probability of being
picked/selected
u Sequential sampling
u Used in destructive and reliability testing, where items are tested in sequence
u The aim is to pass a quality test; products are tested until a
desirable result is achieved
u Stratified sampling
u Used when products or processes are mixed or non-homogeneous
u Break down the process to stratify the lot by machine, subprocess, or
single lot, then pick samples from each stratified group
Samples vs Population
Population - Parameters
u Complete set of items
u Mean: µ
u Standard Deviation: 𝜎
u Population size: N
Samples - Statistics
u Subset of the population you collected data on
u Mean: x̄
u Standard Deviation: s
u Sample size: n
Measures of Central Tendency

u Mean – the sum of the observations divided by the number of
observations – in MS Excel: =AVERAGE()
u Median – the middle value of the ordered data, dividing the bottom 50% from
the top 50% – in MS Excel: =MEDIAN()
u Mode – the value that occurs most frequently among the
observations – in MS Excel: =MODE()
Measures of Variability

u Range – difference between the maximum and minimum
observations – in MS Excel: =MAX()-MIN() (Excel has no built-in RANGE function)
u Sample Variance
u Variation of the observed data points from its mean
u Applied only to a sample set
u Also the square of the Standard Deviation
u Formula: Sample Variance $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

u In MS Excel: =VAR() or =VAR.S()
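The measures of central tendency and variability above can be sketched with Python's standard-library `statistics` module. This is only an illustration: the `hours` data set and the variable names are assumptions made for the example.

```python
import statistics

# Illustrative data set (an assumption for this sketch)
hours = [30, 20, 40, 50, 40, 30, 30, 60, 30, 30, 40, 20]

mean = statistics.mean(hours)            # sum of observations / number of observations
median = statistics.median(hours)        # middle value of the sorted data
mode = statistics.mode(hours)            # most frequently occurring value
data_range = max(hours) - min(hours)     # maximum minus minimum
sample_var = statistics.variance(hours)  # sum((x - mean)^2) / (n - 1)
sample_sd = statistics.stdev(hours)      # square root of the sample variance

print(mean, median, mode, data_range, round(sample_var, 2), round(sample_sd, 2))
```

`statistics.variance` and `statistics.stdev` use the n - 1 denominator, so they correspond to =VAR.S() and =STDEV.S() rather than the population versions.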


Part 2
GRAPHICAL METHODS IN DATA
COLLECTING AND SUMMARIZING
Tally Sheets

u Provides a quick look at grouped data


u Provides a visual idea of the distribution shape
u Bimodal distribution
Frequency Distribution Charts

u Visual representation of tally sheets


u Most common examples of frequency distribution charts:
u Bar charts
u Vertical bar charts
u Horizontal bar charts
u Stacked column
u Line charts
u Pie charts
u Histograms
Frequency Distribution Charts:
Bar Charts
u Bar charts, also known as bar graphs, are visual representations of
the numerical values for given data categories.
u Vertical bar charts: categories are listed on the x-axis while the values are
plotted on the y-axis
u Horizontal bar charts: categories are listed on the y-axis while the values
are plotted on the x-axis
Frequency Distribution Charts:
Stacked Bar Charts
u Stacked bar charts are used to simplify graphical data presentation
when in one category, there are two possible values
u For example, when presenting data for the access of Google
Classroom by students, a standard bar chart with two bars
representing mobile users vs. desktop users can be simplified by
using stacked bar charts (which also shows the total of both)
Frequency Distribution Charts:
Line Charts
u Line charts are graphical representations of data that show the
data arranged in categories using a continuous line
u This representation shows how uniform or erratic the data are
u This representation, when plotted in dated categories, also shows
movement or trends of data in a continuous manner
Frequency Distribution Charts:
Pie Charts
u Pie charts are visual representations of the frequency of data, similar
to bar charts, but in this chart the totality (100%) of the data is
shown to give a clearer comparison of each
category’s frequency vis-à-vis other categories
u This can show you the highest and lowest frequencies among the given
categories
Frequency Distribution Charts:
Histograms
u Histograms are similar to bar charts in the presentation of data, but
histograms show data in a continuous and numeric manner
u Instead of categories (as in bar charts), histograms present
data in groups, called bins, whose width or range is determined by the
data analyst or the software
u Bin ranges should always follow the number of significant figures and/or
the number of decimal places in a data set
u MS Excel may calculate an automatic bin range depending on the
data set that is encoded, but bin ranges can be adjusted to be
smaller, bigger, or user defined.
u However, in adjusting or determining bin ranges, the same rule in
significant figures/decimal places should be followed
Frequency Distribution Charts:
Histograms
u Example: The following data show the number of overtime hours
worked during the past month by each of the twelve employees in
the Shipping and Receiving Department.
Hours Worked: 30, 20, 40, 50, 40, 30, 30, 60, 30, 30, 40, 20
u The histogram automatically generated by MS Excel has bin ranges of equal
width 18 (bin min, bin max)
u The first bar represents the frequency of data that falls
within the group range from 20 to 38, which is 7
u The second bar represents the frequency of data that falls
within the group range from 38 to 56, which is 4
Frequency Distribution Charts:
Histograms
u But since in the previous data set the range min is 20, the range
max is 60, and the range width is 40, a histogram that has four bars
with bin range 10 can be generated as well (but manually, through
format data series)
u The bin ranges have equal width 10 (bin min, bin max)
u The first bar represents the frequency of data that falls
within the group range from 20 to 30, which is 7
u The second bar represents the frequency of data that falls
within the group range from 30 to 40, which is 3
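The bin counts quoted for the overtime example can be reproduced with a short Python sketch. The function name `bin_counts` is hypothetical, and the binning convention (first bin closed on both ends, later bins closed on the right) is an assumption chosen because it matches the counts shown in the slides.

```python
import math

def bin_counts(data, start, width, n_bins):
    """Count observations per bin.

    Assumed convention: the first bin is [start, start + width];
    every later bin is (lower, upper].
    """
    counts = [0] * n_bins
    for x in data:
        # A value on an upper bin edge is counted in that bin, not the next one
        idx = 0 if x == start else math.ceil((x - start) / width) - 1
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

hours = [30, 20, 40, 50, 40, 30, 30, 60, 30, 30, 40, 20]

auto_bins = bin_counts(hours, start=20, width=18, n_bins=3)    # 20-38, 38-56, 56-74
manual_bins = bin_counts(hours, start=20, width=10, n_bins=4)  # 20-30, 30-40, 40-50, 50-60
print(auto_bins, manual_bins)
```

Changing only the bin width changes the shape of the histogram, which is why the bin range should be chosen deliberately rather than left to the default.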
Scatter Diagram (Scatterplot)

u Scatter diagrams are visual representations of data useful for
continuous statistical analysis such as regression or correlation tests
u This diagram represents two numerical variables in a data set and their
relationship to each other
u Drawing an imaginary line through the dots plotted in the diagram
can help visualize how the two variables correlate with each other

*More of this lesson will be tackled in the multivariable studies (Part 4)
Run Charts

u Run charts are used to display continuous data to indicate the stability
of a process
u This is similar to control charts but without the lower control limit (LCL)
line and the upper control limit (UCL) line
u The central line (CL) is still included, to show how stable the process is around it
Stem and Leaf Approach

u Charting of stratifications of data/grouping of data


u Helps visualize better the mode of the data
Box and Whisker Approach
(Boxplots)
u A box on the number line with vertical lines indicating certain types
of information about the data set
Box and Whisker Approach
(Boxplots)
u Boxplot calculations:
u Calculate the min, Q1, Q2, Q3, IQR, LL, UL, and max
u Arrange data from lowest to highest
u Q2 is the same as the median
u Q1 is the first quartile or the median of the first half of the data set
u Q3 is the third quartile or the median of the second half of the data set
u Determine the Interquartile Range (IQR): IQR = Q3 – Q1
u Determine the lower limit (LL) and upper limit (UL) to identify any outliers in the data set
u LL = Q1 – (1.5 x IQR)
u UL = Q3 + (1.5 x IQR)
u The min is the lowest value in the data set, while the max is the highest value
in the data set
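The boxplot calculations above can be sketched in Python as follows. The function name `boxplot_values` and the sample data are assumptions for illustration; the quartiles follow the "median of each half" method described in the steps, which can differ slightly from the quartile functions in spreadsheet software.

```python
import statistics

def boxplot_values(data):
    """Return min, Q1, Q2, Q3, IQR, LL, UL, max and any outliers."""
    s = sorted(data)                      # arrange data from lowest to highest
    n = len(s)
    half = n // 2
    lower = s[:half]                      # first half of the data
    upper = s[half:] if n % 2 == 0 else s[half + 1:]  # second half (middle value skipped if n is odd)
    q1 = statistics.median(lower)
    q2 = statistics.median(s)
    q3 = statistics.median(upper)
    iqr = q3 - q1
    ll = q1 - 1.5 * iqr                   # lower limit
    ul = q3 + 1.5 * iqr                   # upper limit
    outliers = [x for x in s if x < ll or x > ul]
    return {"min": s[0], "Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
            "LL": ll, "UL": ul, "max": s[-1], "outliers": outliers}

summary = boxplot_values([30, 20, 40, 50, 40, 30, 30, 60, 30, 30, 40, 20])
print(summary)
```

For this data set the value 60 lies above the upper limit, so it would be flagged as an outlier.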
Part 3
PROBABILITY AND
DISTRIBUTIONS
Probability

u Probability – likelihood of something random occurring


u Sample space – collection of all possible outcomes
u Event – any collection of outcomes from a probability experiment
u Example: rolling a die
u Sample space = {1,2,3,4,5,6}
u Events:
u Event A: rolling 3
u Event B: rolling 1,3, or 5
u Event C: rolling ≤ 2
Rules of Probability

1. All probabilities must be between 0 and 1;
2. Probability only describes what happens in the long run;
3. The sum of the probabilities of all possible outcomes in the sample space is
always equal to 1.

Complement rule
u The probability that Event A will not occur = N
u The probability that Event A will occur = Y
u N = 1 – Y
Mutually Exclusive and
Independent Events
Mutually Exclusive
u If occurrence of any one of these events excludes the occurrence
of others, these events are called mutually exclusive or disjoint
events.
u Events A and B are mutually exclusive or disjoint if they cannot occur
simultaneously.
Independent event
u If the occurrence of one event does not change the probability of
another event occurring, the two events are said to be
independent of each other.
u Events A and B are independent if the probability of B is not affected by
Event A occurring.
Additional Rules for Probability

u Addition Rule
u Used to calculate the probability of A or B
u Written as P(A ∪ B) and can be generalized to more than two events such as
P(A ∪ B ∪ C ∪ D)

u Special addition rule


u Only if all of the events involved are mutually exclusive or disjoint
u P(A ∪ B) = P(A) + P(B)
u Example: what is the probability of rolling a 2 or a 4 on a die?
u Answer: $P(2 \cup 4) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6}$ or $\frac{1}{3}$
Additional Rules for Probability

u Addition Rule
u General addition rule
u Works on all circumstances
u P(A ∪ B) = P(A) + P(B) – P(A&B)
u Example: The assembly of a product requires an electronic board which is
supplied by two suppliers. The probability of the board from supplier A
working is 0.8, the probability of the board from supplier B working is 0.7, and
the probability of both working is 0.6. What is the probability that either
supplier A or B’s board is working?
Answer: P(A ∪ B) = 0.8 + 0.7 – 0.6 = 0.9
(Venn diagram: circles A and B overlapping in the region A&B)
Additional Rules for Probability

u Conditional Probability
u Is the probability of an event happening given that another event has
happened
u Written as P(B|A)
u Probability of Event B happening given that Event A has happened
u P(B|A) = P(A&B) / P(A)
u NOTE: write down what Events A and B are before solving the problem
u Example: in your manufacturing plant, 70% of the employees are able to
work this Saturday for overtime, 40% are able to work on Sunday, and
20% could work on both days. What is the probability that an employee
can work Saturday given that they can work on Sunday?
Additional Rules for Probability

u Conditional Probability
u Answer: this problem is looking for the probability of Saturday given that
Sunday is true.
u Let: Saturday = B, Sunday = A
u Need: Probability of Saturday and Sunday divided by that of Sunday
u Solution: P(B|A) = P(A&B) / P(A)
u P(SAT|SUN) = P(SAT & SUN) / P(SUN)
u P(B|A) = 20% / 40% = 0.2 / 0.4 = 0.5 or 50%
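The general addition rule and the conditional probability formula can be checked with exact fractions. In this minimal sketch the function names `general_addition` and `conditional` are hypothetical; the numbers come from the supplier and overtime examples.

```python
from fractions import Fraction

def general_addition(p_a, p_b, p_a_and_b):
    """P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

def conditional(p_a_and_b, p_a):
    """P(B | A) = P(A and B) / P(A)."""
    return p_a_and_b / p_a

# Supplier boards: P(A) = 0.8, P(B) = 0.7, P(A and B) = 0.6
p_either_board = general_addition(Fraction(8, 10), Fraction(7, 10), Fraction(6, 10))

# Overtime: P(SAT and SUN) = 20%, P(SUN) = 40%
p_sat_given_sun = conditional(Fraction(20, 100), Fraction(40, 100))

print(p_either_board, p_sat_given_sun)
```

Using `Fraction` avoids floating-point rounding, so the results can be compared directly with the hand calculations.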
Additional Rules for Probability

u Multiplicative Rule
u Used to calculate the probability of both A and B
u Written as P(A ∩ B) and can be generalized to more than two events such as
P(A ∩ B ∩ C ∩ D)

u Special Multiplication Rule


u P(A ∩ B) = P(A) x P(B)
u Example: what is the probability of rolling a 2 on your first roll followed by a 2
on your second roll of a die?
u Answer: $P(2 \cap 2) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$
Additional Rules for Probability

u Multiplicative Rule
u General Multiplication Rule
u P(A ∩ B) = P(A) x P(B|A)
u Example: consider a bag of marbles containing 5 red, 2 blue, and 3 white
marbles. Suppose that two marbles are drawn without replacement (no
returning of marbles once drawn). What is the probability that both marbles
drawn are blue?
u Note: the two draws are NOT independent
u In this case, since the two draws are not independent, the probability of the
second draw (B) depends on the outcome of the first draw (A).
u Answer: $P(\text{blue on first}) \times P(\text{blue on second} \mid \text{blue on first}) = \frac{2}{10} \times \frac{1}{9} = \frac{2}{90}$ or 0.022
u Note: the denominator of the second factor is 9 because the first draw left
9 marbles in the bag, not 10. The numerator is 1 since one blue marble has
already been drawn.
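The marble example can be verified in two ways: by applying the general multiplication rule, and by listing every possible ordered pair of draws. This is a sketch for illustration; the variable names are assumptions.

```python
from fractions import Fraction
from itertools import permutations

bag = ["red"] * 5 + ["blue"] * 2 + ["white"] * 3

# General multiplication rule: P(A and B) = P(A) x P(B | A)
p_rule = Fraction(2, 10) * Fraction(1, 9)

# Brute-force check: every ordered pair of two distinct marbles is equally likely
draws = list(permutations(range(len(bag)), 2))
both_blue = sum(1 for i, j in draws if bag[i] == "blue" and bag[j] == "blue")
p_enum = Fraction(both_blue, len(draws))

print(p_rule, p_enum, float(p_rule))
```

Both approaches give 2/90, which reduces to 1/45.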
Combinations and Permutations

u Permutation
u All ordered arrangements of distinct objects
u nPr – the number of ways of arranging r objects taken from a set of
n objects
u $nPr = \frac{n!}{(n - r)!}$, or as a formula in MS Excel: =PERMUT()
u In permutations, the order of the objects in each set matters
u Combination
u Used when no particular order is required in the sets
u $nCr = \frac{n!}{r!\,(n - r)!}$, or as a formula in MS Excel: =COMBIN()
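The permutation and combination formulas can be written out directly with factorials and compared against Python's built-in `math.perm` and `math.comb` (available in Python 3.8 and later). The helper names `n_p_r` and `n_c_r` and the values n = 5, r = 3 are assumptions for this example.

```python
import math

def n_p_r(n, r):
    """Permutations: n! / (n - r)!  (order matters)."""
    return math.factorial(n) // math.factorial(n - r)

def n_c_r(n, r):
    """Combinations: n! / (r! (n - r)!)  (order does not matter)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

perms = n_p_r(5, 3)   # ordered ways to pick 3 objects out of 5
combs = n_c_r(5, 3)   # unordered ways to pick 3 objects out of 5
print(perms, combs, math.perm(5, 3), math.comb(5, 3))
```

Each combination of 3 objects can be ordered in 3! = 6 ways, which is why the permutation count is six times the combination count.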
Characteristics of Probability
Distribution
Probability distribution
u Is a mathematical function that provides the probability of
occurrences of different possible outcomes in an experiment
Continuous distribution
The Binomial Distribution

u Discrete
u Two states for the random variable x
u Set number of trials, determined in advance (n)
u Constant probability of success (P)
u Rule of thumb in sampling
u Population size: N > 50
u Number of trials is less than 10% of the population size (n < 0.10N)

u Other common distributions:
u Exponential distribution (continuous)
u Poisson distribution (discrete)
u Hypergeometric distribution (discrete)
Binomial Distribution Calculations

u Binomial formula:
$P(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n - x}$
where: x = number of successes in the experiment
(number of defects in the sample)
n = number of trials (sample size)
p = probability of success
(% defects in the population when sampling)
q = 1 - p
Binomial Distribution Calculations

u Example: suppose that a sample of size 4 is randomly chosen from
a batch of 100 that is known to be 5% defective. What is the
probability that there is exactly one defective in your sample?
u Solution: $P(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n - x}$
$P(1) = \frac{4!}{1!\,(4 - 1)!}\,(0.05)^1 (1 - 0.05)^{4 - 1} = \frac{4!}{1!\,3!}\,(0.05)(0.95)^3$
$= 4\,(0.05)(0.8574) = 0.1715$ or 17.15%
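The binomial calculation can be reproduced with a few lines of Python. The function name `binomial_pmf` is hypothetical; the inputs are the ones from the example (n = 4, p = 0.05, x = 1).

```python
import math

def binomial_pmf(x, n, p):
    """P(x) = n! / (x! (n - x)!) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, x) * p**x * q**(n - x)

p_one_defect = binomial_pmf(x=1, n=4, p=0.05)
print(round(p_one_defect, 4))

# The probabilities of all possible outcomes (0 to 4 defects) add up to 1
total = sum(binomial_pmf(x, 4, 0.05) for x in range(5))
```

In MS Excel the same value should be obtainable with =BINOM.DIST(1, 4, 0.05, FALSE).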
Poisson Distribution

u Discrete probability distribution over a continuous interval
u The outcomes are discrete counts
u Example: number of defects, number of customers
u The interval is continuous
u Most typically: time, length, area, volume
u To be Poisson, a rate must be given in a problem
u Poisson formula:
$P(x) = \frac{e^{-\lambda} \lambda^x}{x!}$
where x = number of occurrences in the interval
𝜆 = rate for the problem
u It is also the mean and the variance of the distribution
u May be given directly or calculated as 𝜆 = np
Poisson Distribution

u Example: the expected number of defects per meter of cable is
0.25. What is the probability that a 10 m cable will have no more
than one (1) defect?
u Given: x = 0 or 1
𝜆 = np = 10 m x 0.25 = 2.5
u Solution: $P(x) = \frac{e^{-\lambda} \lambda^x}{x!}$
$P(0) = \frac{e^{-2.5} (2.5)^0}{0!} = \frac{e^{-2.5} (1)}{1} = 0.0821$
$P(1) = \frac{e^{-2.5} (2.5)^1}{1!} = \frac{e^{-2.5} (2.5)}{1} = 0.2052$
$P(0 \text{ or } 1) = P(0) + P(1) = 0.0821 + 0.2052 = 0.2873$ or 28.73%
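The cable example can be checked with a short Python sketch of the Poisson formula. The function name `poisson_pmf` is an assumption for illustration.

```python
import math

def poisson_pmf(x, lam):
    """P(x) = e^(-lambda) * lambda^x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 10 * 0.25                  # 10 m of cable x 0.25 defects per meter
p0 = poisson_pmf(0, lam)         # probability of no defects
p1 = poisson_pmf(1, lam)         # probability of exactly one defect
p_at_most_one = p0 + p1          # "no more than one" means 0 or 1

print(round(p0, 4), round(p1, 4), round(p_at_most_one, 4))
```

In MS Excel, =POISSON.DIST(1, 2.5, TRUE) should return the same cumulative probability.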
Calculating Probabilities in the
Normal Distribution
$P(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

Standard normal table (z-table)


u z-score – number of standard deviations away from the mean
u $z = \frac{x - \mu}{\sigma}$, where: 𝜇 = mean of the population
𝜎 = population standard deviation
u The z-table gives the area to the left of x. If the area to the right is
needed, use 1 – (table value)
u If the area between two values of x is needed, subtract the area to the left of
the lower x from the area to the left of the higher x
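Instead of a printed z-table, the areas can be looked up with `statistics.NormalDist` from Python's standard library. The numbers below (mean 100, standard deviation 15, x = 130) are made up for illustration, and `z_score` is a hypothetical helper.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mu) / sigma

std_normal = NormalDist()                  # standard normal: mean 0, stdev 1

z = z_score(130, mu=100, sigma=15)         # hypothetical values
area_left = std_normal.cdf(z)              # what a left-tail z-table reports
area_right = 1 - area_left                 # area to the right of x
area_between = std_normal.cdf(1.0) - std_normal.cdf(-1.0)  # area between z = -1 and z = +1

print(z, round(area_left, 4), round(area_right, 4), round(area_between, 4))
```

The last line reproduces the familiar result that about 68% of a normal distribution lies within one standard deviation of the mean.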
Chi-square Distribution and
Degrees of Freedom
Degrees of freedom (df)
u The amount of information your data provide that you can apply to
estimate the values of unknown population parameters
u Different distributions and hypothesis tests require a different df to be
used – based on sample size minus some value
Chi-square distribution and applications
u If w, x, y, and z are independent random variables with a standard normal
distribution, then the random variable defined as
$f = w^2 + x^2 + y^2 + z^2$
has a distribution we call Chi-square ($\chi^2$), here with 4 degrees of freedom
Chi-square Distribution and
Degrees of Freedom
Chi-square ($\chi^2$)
u Obtained from the ratio of the sample variance to the
population variance multiplied by the df: $\chi^2 = \frac{(n - 1)\, s^2}{\sigma^2}$
u Applications of Chi-square:
u Comparing multiple proportions
u Testing a single population variance
u Chi-square properties:
u There are many different Chi-square distributions, one for each df
u As df increases, the Chi-square distribution approaches normal
u It is non-negative
u It is non-symmetric
The t-distribution and its
Applications
u If x is a random variable with a standard normal distribution, and y is an
independent random variable with a $\chi^2$ distribution with k degrees of freedom,
then the random variable defined as $t = \frac{x}{\sqrt{y / k}}$ follows the
t-distribution with k degrees of freedom (where k is the df for the $\chi^2$ variable)
u Applications:
u Stdev of the population is unknown
u Hypothesis test for one population mean or comparing two means
u Properties of t-distribution:
u Different for different df
u Bell-shaped like normal curve but wider
u Distribution is symmetric around the mean
u Variance is greater than one (1)
The F-distribution and its
Applications
u The F-distribution is the ratio of two 𝜒2 distributions with df v1 and v2
respectively, where each 𝜒2 has first been divided by its df
u It is commonly used to test whether the variances of two or more
populations are equal
u Hypothesis test for two populations
u Analysis of Variance (ANOVA) for more than two populations
u Properties of F-distributions:
u Non-negative
u Different distributions for the different df
The Central Limit Theorem (CLT)

u One of the key concepts in statistics


u Enables us to move from descriptive statistics to inferential statistics

1. The mean of the sampling distribution of means is equal to the
mean of the population from which the samples were drawn
u The sample mean is a random variable
u The sample mean targets the population mean
2. The variance of the sampling distribution of means is equal to the
variance of the population from which the samples were drawn
divided by the size of the samples.
The Central Limit Theorem (CLT)

2. The variance of the sampling distribution of means is equal to the
variance of the population from which the samples were drawn
divided by the size of the samples.
u Larger samples have less variability and are more likely to be close to
the true population mean
u Variance (σ2): $\sigma_{\bar{x}}^2 = \frac{\sigma_x^2}{n}$, where n = sample size
u Standard deviation (σ): $\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$, where n = sample size
The Central Limit Theorem (CLT)

3. If the original distribution is normally distributed, the sampling
distribution of the means will also be normal
u If the original distribution was NOT normally distributed, then the
sampling distribution will still approach normality as the sample size grows
u The further the original distribution is from a bell curve, the larger the
sample will need to be to safely assume that the sampling distribution is
normally distributed
u Typically a sample size of 30 or more is sufficiently large
The Central Limit Theorem (CLT)
Applications in Statistics
Standard Error
u The variance of the sampling distribution of means is equal to the
variance of the population from which the samples were drawn divided
by the size of the samples
u The standard deviation of the means, called the standard error of the
means, is equal to $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Confidence Interval
u A 95% confidence interval on the mean means that, if sampling were repeated
many times, about 95% of the intervals constructed this way would contain
the true population mean
u Allows us to provide some confidence around the estimate
u If either the population is normally distributed or our sample size is at least 30,
the formula is: $\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}$
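A 95% confidence interval on the mean can be computed as sketched below. The function name `confidence_interval_95` and the input values (sample mean 50, population stdev 4, n = 64) are assumptions made only to illustrate the formula.

```python
import math

def confidence_interval_95(x_bar, sigma, n):
    """Return (lower, upper) = x_bar -/+ 1.96 * sigma / sqrt(n)."""
    standard_error = sigma / math.sqrt(n)   # standard error of the mean
    margin = 1.96 * standard_error
    return x_bar - margin, x_bar + margin

# Hypothetical inputs
lower, upper = confidence_interval_95(x_bar=50.0, sigma=4.0, n=64)
print(round(lower, 2), round(upper, 2))
```

Quadrupling the sample size halves the standard error, so the interval becomes half as wide.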
Part 4
MULTIVARIABLE STUDIES
Multivariable Study

u Is a tool that is used to capture process variations

u Types of data and process variations


u Cyclical – any variation between parts or pieces or batches within the
sample group
u Positional – any variation within each part or batch; multiple
measurements taken
u These types of variations can be captured at the same time
u These help to minimize variation by identifying the specific cause for
the variation
Procedures of the Multivariable
Study
1. Select the process and the characteristics you wish to investigate
2. Select the sample size (e.g. 3-5 parts) and time frequency (e.g. hourly)
3. Create a tabulation table to write down measurements
4. Record the time and values from each sample set into the table
5. Plot a graph with time along the horizontal scale and the measured
values on the vertical scale
6. Connect the observed values with lines
7. Observe and analyze the chart for variation within the sample, sample
to sample, and over time
8. Conduct additional studies to concentrate on the areas of apparent
maximum variation
9. Make process improvements and repeat the study to confirm the
results
Correlation

u Finding a relationship between two or more sets of data


u Measures the strength and direction of the relationship
u Correlation coefficient
u Is a quantitative measure of how closely the two variables are related
u Typically abbreviated by r
u Provides both the strength and direction of the relationship between the
independent and dependent variables
u Always between -1 and +1, where -1 is a perfectly negative correlation and
+1 is a perfectly positive correlation; zero means no correlation
u The closer the value of r is to either -1 or +1, the stronger the correlation
is between the two variables
Types of Correlation

Note:
Correlation does not guarantee causation
Procedures for Calculating the
Correlation Coefficient
1. Calculate the mean of all x values (x̄) and the mean of all y values
(ȳ).
2. Calculate the stdev of all x values (Sx) and the stdev of all y
values (Sy)
3. Calculate x - x̄ and y - ȳ for each pair (x,y), then multiply the two
differences for each pair
4. Get the sum by adding all of these products together
5. Divide the sum by Sx times Sy
6. Divide the result of step 5 by n-1, where n = number of (x,y) pairs

Formula: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{n\Sigma xy - \Sigma x \Sigma y}{\sqrt{[n\Sigma x^2 - (\Sigma x)^2][n\Sigma y^2 - (\Sigma y)^2]}}$
Sample Problem in Calculating the
Correlation Coefficient
Hours of Exercise (x)   Weight Loss (y)   xy     x2     y2
3                       2                 6      9      4
5                       4                 20     25     16
10                      6                 60     100    36
15                      8                 120    225    64
2                       1                 2      4      1
3                       3                 9      9      9
𝜮x=38                   𝜮y=24             𝜮xy=217  𝜮x2=372  𝜮y2=130
*n=6

Solution: $r = \frac{n\Sigma xy - \Sigma x \Sigma y}{\sqrt{[n\Sigma x^2 - (\Sigma x)^2][n\Sigma y^2 - (\Sigma y)^2]}} = \frac{6(217) - (38)(24)}{\sqrt{[6(372) - 38^2][6(130) - 24^2]}} = \frac{390}{\sqrt{(788)(204)}} \approx 0.972$
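The correlation coefficient for the exercise and weight-loss data can be recomputed with the computational formula. This is a sketch; the function name `correlation` is hypothetical, and the data are the six (x, y) pairs from the table.

```python
import math

hours_exercise = [3, 5, 10, 15, 2, 3]   # x
weight_loss = [2, 4, 6, 8, 1, 3]        # y

def correlation(xs, ys):
    """r = (n*Sxy - Sx*Sy) / sqrt([n*Sxx - Sx^2][n*Syy - Sy^2]), using raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

r = correlation(hours_exercise, weight_loss)
print(round(r, 4))
```

The unrounded value is slightly above 0.972, so small differences in the last digit between hand and software results are expected.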
Using MS Excel in Calculating the
Correlation Coefficient
u Function to get r: =CORREL(data range x, data range y)
u Graphical method in getting r
u Create a scatterplot using the data with x and y
u Right click on the dots in the plot and select add trendline
u Right click on the trendline and click on format trendline
Using MS Excel in Calculating the
Correlation Coefficient
u Graphical method in getting r, continued
u In the format trendline screen, check on the display equation on chart
and display R-squared value on chart
u The R-square will be displayed, so to get r, use the formula
=SQRT(R-square value on chart) and take the sign of r from the slope of the trendline
u Same answer as the manually calculated value of r in the previous section
Testing for Significance (t-test)

u Procedure for using the t-test to test the significance of r
1. The initial condition for this t-test is that, for each value of x, the
y-values are considered normally distributed and independent, with equal stdev
2. Decide the level of significance (ɑ = 0.05, 0.1, 0.01, etc)
3. Develop the hypotheses H0 and H1, which can be tested as
either left tailed, right tailed, or two tailed
4. The critical values are obtained from the t-table using n-2 degrees of
freedom: ±tɑ/2 for two tailed, -tɑ for left tailed, and +tɑ for right tailed.
5. The test statistic is given by: $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
Testing for Significance (t-test)

u Procedure for using the t-test to test the significance of r, continued


6. Now compare the test statistic with the critical value from the t-table.
Reject H0 if t < -tɑ (left tailed), t > +tɑ (right tailed), or |t| > tɑ/2
(two tailed); otherwise do not reject H0
7. State the conclusion in terms of the problem context
Recall this problem from the
correlation example
Hours of Exercise (x)   Weight Loss (y)   xy     x2     y2
3                       2                 6      9      4
5                       4                 20     25     16
10                      6                 60     100    36
15                      8                 120    225    64
2                       1                 2      4      1
3                       3                 9      9      9
𝜮x=38                   𝜮y=24             𝜮xy=217  𝜮x2=372  𝜮y2=130
*n=6
Sample Problem in t-test

u From r = 0.972
u H0: ρ ≤ 0, or ”exercise does not help weight loss or may have a negative
effect” (ρ is the population correlation coefficient)
u H1: ρ > 0, or “exercise hours contribute to weight loss”
u Solution:
$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{0.972}{\sqrt{\frac{1 - (0.972)^2}{6 - 2}}} = 8.273$
u tc = 2.132 (the critical value), based on df = 4 and ɑ = 0.05 for a
one-tailed (right-tailed) test
u Comparing t = 8.273 and tc = 2.132, H0 is rejected since t > tc
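The significance test for r can be sketched as follows. The function name `t_for_r` is hypothetical, and the critical value 2.132 is typed in from a t-table (df = 4, one-tailed ɑ = 0.05) rather than computed, since the Python standard library has no t-distribution.

```python
import math

def t_for_r(r, n):
    """Test statistic for a correlation coefficient: t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / math.sqrt((1 - r**2) / (n - 2))

r = 0.972      # correlation coefficient from the exercise example
n = 6          # number of (x, y) pairs
t_stat = t_for_r(r, n)

t_crit = 2.132                 # from a t-table: df = n - 2 = 4, alpha = 0.05, right-tailed
reject_h0 = t_stat > t_crit    # right-tailed decision rule

print(round(t_stat, 3), reject_h0)
```

Because the statistic is far beyond the critical value, the conclusion does not depend on small rounding differences in r.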
Linear Regression

u Used to describe a straight line that best fits a series of ordered pairs
(x,y).
u The equation for linear regression is ŷ = a + bx
where: ŷ = the predicted value of y given a value of x
x = the independent variable
a = the y intercept of the straight line
b = the slope of the straight line
Least Squares Method
u Is a mathematical procedure to identify the linear equation that best
fits a set of ordered pairs by finding values for a, the y-intercept, and b,
the slope.
u The goal of the least squares method is to minimize the total squared
error between the values of y and ŷ
u Procedures for the least squares method:
1. Create a table with your x and y values in the columns
2. Calculate xy, x2, and y2 for each pair
3. Calculate the sums of x, y, xy, x2, and y2, and the means x̄ and ȳ
4. Find the linear equation that best fits the data by determining the value of
a, the y intercept, and b, the slope, using the following equations:

$b = \frac{n\Sigma xy - (\Sigma x)(\Sigma y)}{n\Sigma x^2 - (\Sigma x)^2}$ ; $a = \bar{y} - b\bar{x}$
Sample Problem in Least Squares
Method
Month (xi)   Complaints (yi)   xi2    xiyi   yi2
1            8                 1      8      64
2            6                 4      12     36
3            10                9      30     100
4            6                 16     24     36
5            10                25     50     100
6            13                36     78     169
7            9                 49     63     81
8            11                64     88     121
Σxi = 36     Σyi = 73          Σxi2 = 204   Σxiyi = 353   Σyi2 = 707
x̄ = 4.5      ȳ = 9.125
*n = 8
Sample Problem in Least Squares
Method
u Solution:
$b = \frac{n\Sigma xy - (\Sigma x)(\Sigma y)}{n\Sigma x^2 - (\Sigma x)^2} = \frac{8(353) - (36)(73)}{8(204) - (36)^2} = 0.5833$

$a = \bar{y} - b\bar{x} = 9.125 - 0.5833(4.5) = 6.50015$

Regression equation: $\hat{y} = 6.50015 + 0.5833x$
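The least squares calculation for the complaints data can be reproduced with the sketch below. The function name `least_squares` is an assumption; the formulas are the ones given for b and a.

```python
months = [1, 2, 3, 4, 5, 6, 7, 8]            # x
complaints = [8, 6, 10, 6, 10, 13, 9, 11]    # y

def least_squares(xs, ys):
    """Return (a, b) for the fitted line y_hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)   # slope
    a = sum_y / n - b * sum_x / n                                # intercept: y_bar - b * x_bar
    return a, b

a, b = least_squares(months, complaints)
print(round(a, 4), round(b, 4))

# Predicted number of complaints for a hypothetical month 9
predicted_month_9 = a + b * 9
```

Without intermediate rounding of b the intercept comes out as 6.5, so the 6.50015 in the hand calculation is a rounding effect.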


Using MS Excel in Calculating Linear
Regression
u Graphical method in getting the linear regression equation
u Create a scatterplot using the data with x and y
u Right click on the dots in the plot and select add trendline
u Right click on the trendline and click on format trendline
Using MS Excel in Calculating Linear
Regression
u Graphical method in getting the linear regression equation,
continued
u In the format trendline screen, check on the display equation on chart
and display R-squared value on chart
Testing for Significance through
Linear Regression and t-test
1. The initial conditions are that the population follows a linear regression
equation y = β0 + β1x and that, for any given value of x, the distribution of y
values is normal and independent and has equal stdev
2. Decide the level of significance (ɑ = 0.05, 0.1, 0.01, etc)
3. Develop the hypothesis to be tested
H0: β1 = 0 (the equation cannot be used as a predictor of y values)
H1: β1 ≠ 0 (the equation is useful to predict y values)
4. Find the critical value for ±tɑ/2, use n-2 df
Testing for Significance through
Linear Regression and t-test
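5. The test statistic is given by the following formula:
$t = \frac{b_1}{S_e / \sqrt{S_{xx}}}$, where $S_e = \sqrt{\frac{\Sigma (y - \hat{y})^2}{n - 2}}$ and $S_{xx} = \Sigma x_i^2 - \frac{(\Sigma x_i)^2}{n}$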
6. Now compare the test statistic with the critical value obtained in
step 4. Reject H0 if the test statistic is > +tɑ/2 or < - tɑ/2; if not, do not
reject H0
7. State the conclusion in terms of the problem context
Sample Problem: Recall Previous
Linear Regression Example
u Given: ŷ = 6.50015 + 0.5833𝑥
H0: β1 = 0 (equation cannot be used to predict y)
H1: β1 ≠ 0 (equation is useful to predict y)
u Calculating the t-statistic, with Σ(y − ŷ)2 = 26.583:
$t = \frac{b_1}{S_e / \sqrt{S_{xx}}} = \frac{0.5833}{\sqrt{\frac{26.583}{8 - 2}} \Big/ \sqrt{204 - \frac{(36)^2}{8}}} = \frac{0.5833}{2.1049 / 6.4807} = 1.796$
u The critical t-statistic is tc = ±2.447 from the t-table, based on df = 6 and
ɑ/2 = 0.025 (ɑ = 0.05) for a two-tailed test
u Comparing t = 1.796 with tc = ±2.447, t is within ±2.447;
therefore do not reject H0: at this significance level the equation is not shown
to be a good predictor of y-values
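The slope test can be recomputed directly from the complaints data, as in the sketch below. Note that Σ(y − ŷ)2 is already a sum of squares and is not squared a second time. The variable names are assumptions, and the critical value 2.447 is typed in from a t-table (df = 6, two-tailed ɑ = 0.05).

```python
import math

months = [1, 2, 3, 4, 5, 6, 7, 8]            # x
complaints = [8, 6, 10, 6, 10, 13, 9, 11]    # y
n = len(months)

mean_x = sum(months) / n
mean_y = sum(complaints) / n
sxx = sum(x * x for x in months) - sum(months) ** 2 / n
sxy = sum(x * y for x, y in zip(months, complaints)) - sum(months) * sum(complaints) / n

b = sxy / sxx                  # slope
a = mean_y - b * mean_x        # intercept

# Sum of squared residuals, standard error of the estimate, and t statistic
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(months, complaints))
se = math.sqrt(sse / (n - 2))
t_stat = b / (se / math.sqrt(sxx))

t_crit = 2.447                       # from a t-table: df = n - 2 = 6, alpha/2 = 0.025
reject_h0 = abs(t_stat) > t_crit     # two-tailed decision rule

print(round(sse, 3), round(t_stat, 3), reject_h0)
```

With these data the statistic falls inside ±2.447, so H0 is not rejected.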
Using MS Excel in Computing for t-
test Statistics
u Formula: =T.TEST(array1, array2, tails, type)
u Where: Array1 = first data set
Array2 = second data set
Tails = Specifies the number of distribution tails. If tails = 1,
T.TEST uses the one-tailed distribution. If tails = 2, T.TEST
uses the two-tailed distribution.
Type = The kind of t-test to be performed
1: paired t-test
2: two-sample equal variance
3: two-sample unequal variance
u T.TEST returns the p-value (probability) of the test, not the t statistic;
reject H0 when this value is smaller than the chosen ɑ
Part 5
HYPOTHESIS TESTING
Hypothesis Testing: z-test

u Used to test the mean of a sample when the population stdev is
known
u Formula:
$z = \frac{\bar{x} - \mu_0}{\sigma_x / \sqrt{n}}$
Where: 𝑥̄ = sample average
𝜇0 = population mean
𝜎x = population stdev
n = sample size
Hypothesis Testing: z-test

u Conditions that must exist for this test to be meaningful:


u The population must follow a normal distribution
u The sample size is >30
u The population stdev is known
u Procedure for hypothesis z-test
1. Set your hypothesis: H0: 𝜇 = 𝜇0
H1: 𝜇 ≠ 𝜇0 OR 𝜇 < 𝜇0 OR 𝜇 > 𝜇0
2. Determine the ɑ value
3. Determine the critical values. For a 2-tailed test, use ɑ/2, which gives ±
values. For a 1-tailed test, use +zɑ for a right tailed test and -zɑ for a left
tailed test. Check whether your z-table gives areas to the left or to the right of z.
Hypothesis Testing: z-test

u Procedure for hypothesis z-test, continued:
4. Calculate the z-statistic: $z = \frac{\bar{x} - \mu_0}{\sigma_x / \sqrt{n}}$
5. If the test statistic is in the rejection region, reject H0; otherwise do not
reject H0
6. State the conclusion in terms of the problem
Sample Problem in Hypothesis z-test

u Suppose a vendor claims that the average weight of a shipment of parts
is 1.84. The customer randomly chooses 64 parts and finds that the
sample has an average of 1.88. Also, suppose that the stdev of the
population is known to be 0.03
u Solution:
Vendor claim: 1.84 H0: 𝜇 = 1.84
Sample mean: 1.88 H1: 𝜇 ≠ 1.84
n = 64
𝜎 = 0.03
Critical value using a 2-tailed test of ɑ=0.05, ɑ/2=0.025: ±1.96
$z = \frac{\bar{x} - \mu_0}{\sigma_x / \sqrt{n}} = \frac{1.88 - 1.84}{0.03 / \sqrt{64}} = 10.7$, and since 10.7 is outside ±1.96, reject H0
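The vendor example can be checked with the sketch below. The function name `z_statistic` is hypothetical; the two-tailed critical value is computed with `statistics.NormalDist` instead of being read from a z-table.

```python
import math
from statistics import NormalDist

def z_statistic(x_bar, mu0, sigma, n):
    """z = (x_bar - mu0) / (sigma / sqrt(n))."""
    return (x_bar - mu0) / (sigma / math.sqrt(n))

alpha = 0.05
z = z_statistic(x_bar=1.88, mu0=1.84, sigma=0.03, n=64)

# Two-tailed critical value: the z with area alpha/2 to its right
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(round(z, 2), round(z_crit, 2), reject_h0)
```

The computed critical value is about 1.96, the usual two-tailed cutoff for ɑ = 0.05.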
Using MS Excel in Computing for z-
test Statistics
u Formula: =Z.TEST(array, x, sigma)
u Where: Array = range of data against which to test x
x = the value to test
Sigma = The population stdev
u Z.TEST returns the one-tailed p-value of the z-test
Hypothesis Testing: t-test

u Used to make inferences about a population mean when the
population variance σ2 is unknown and the sample size is relatively
small
u The t-test statistic formula is:
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
where: 𝑥̄ = sample mean
𝑛 = sample size
𝑠 = sample stdev (not population)
𝜇0 = population mean
Hypothesis Testing: t-test

u Conditions that must exist for this test to be meaningful:


u The population must follow a normal distribution
u The sample size is ≤30
u The population stdev is unknown
u Df is n– 1
u Procedure for t-test
1. Set your hypothesis same as z-test
2. Determine the ɑ value
3. Determine the critical values. For a 2-tailed test, use ɑ/2 and will have ±
values. For a 1-tailed test, use + ɑ for a right tailed test and - ɑ for a left
tailed test. Use df=n-1. Note that tables show values to the right
Hypothesis Testing: t-test

u Procedure for t-test, continued
u Calculate the test statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
u If the test statistic is in the rejection region, reject H0. Do not reject otherwise
u State your conclusion in terms of the problem
Sample Problem in Hypothesis t-test

u Suppose a cut saw operation has been producing parts with a mean
length of 4.125. A new blade was installed and we want to know
whether the mean has decreased. We select a random sample of
20, measure the length of each part, and find that the average
length is 4.123, and the sample stdev is 0.008. Assume that the
population is normally distributed. Use a significance level of ɑ=0.10
to determine whether the mean has decreased.
u Solution:
Parts before: 4.125 H0: 𝜇 = 4.125
Parts after: 4.123 H1: 𝜇 < 4.125
n = 20
s = 0.008
ɑ = 0.10
Sample Problem in Hypothesis t-test

u Critical value: using a left tailed test with ɑ = 0.10, go to df=n-1=19 and
ɑ = 0.10 in the t-table; the value is 1.328, but since the test is
left-tailed, we use -1.328
u Solving the equation: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{4.123 - 4.125}{0.008 / \sqrt{20}} = -1.1$

u Since -1.1 is not in the rejection region (it is greater than -1.328), do not reject H0
u At the 90% confidence level, the data do not indicate that the average
length has decreased
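The cut saw example can be reproduced as sketched below. The function name `t_statistic` is hypothetical, and the critical value -1.328 is typed in from a t-table (df = 19, left-tailed ɑ = 0.10).

```python
import math

def t_statistic(x_bar, mu0, s, n):
    """t = (x_bar - mu0) / (s / sqrt(n))."""
    return (x_bar - mu0) / (s / math.sqrt(n))

t_stat = t_statistic(x_bar=4.123, mu0=4.125, s=0.008, n=20)

t_crit = -1.328                # from a t-table: df = n - 1 = 19, alpha = 0.10, left-tailed
reject_h0 = t_stat < t_crit    # left-tailed decision rule

print(round(t_stat, 3), reject_h0)
```

Since the statistic is not below the critical value, the data do not support the claim that the mean length decreased.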
Part 6
CONFIDENCE INTERVALS
Part 7
PAIRED COMPARISON TESTS
Part 8
ANALYSIS OF VARIANCE
Part 9
ANOVA POST-HOC ANALYSIS
