Data Analysis Notes

The document provides comprehensive notes on data analysis, covering topics such as defining and collecting data, organizing and visualizing variables, and numerical descriptive measures. It explains various sampling methods, types of survey errors, and ethical issues related to surveys, alongside practical Excel guides for data manipulation. Additionally, it discusses central tendency, variation, skewness, kurtosis, and methods for exploring numerical data, including quartiles and boxplots.

Uploaded by

sentimaongjamir
Copyright
© All Rights Reserved
Data Analysis Notes

B.A. Economics (Hons.) (University of Delhi)

Downloaded by Sun kissed ([email protected])

DATA ANALYSIS

Chapter-1 Defining And Collecting Data

 A population contains all the items or individuals of interest that you seek to study.
 A sample contains only a portion of a population of interest. You analyse a sample to
estimate characteristics of an entire population
 A statistic summarizes the value of a specific variable for sample data. Correspondingly, a
parameter summarizes the value of a specific variable for an entire population.
 In a non-probability sample, you select the items or individuals without knowing their
probabilities of selection.
▪ In a convenience sample, you select items that are easy, inexpensive, or convenient to
sample.
▪ In a judgment sample, you collect the opinions of preselected experts in the subject
matter.
 In a probability sample, you select items based on known probabilities.
▪ In a simple random sample, every item from a frame has the same chance of selection
as every other item, and every sample of a fixed size has the same chance of
selection as every other sample of that size.
▪ In a systematic sample, you partition the N items in the frame into n groups of k
items, where k = N/n (rounded to the nearest integer). You then choose the first item at
random from the first k items in the frame and select every kth item after it.
 e.g. n = 40, N = 800, so k = 800/40 = 20 and each group contains 20 employees. Select a
random number from the first 20 individuals and include every twentieth individual after
the first selection in the sample. If the first random number you select is 8,
subsequent selections are 028, 048, 068, 088, 108, ..., 768, and 788.
▪ In a stratified sample, you first subdivide the N items in the frame into separate
sub-populations, or strata. A stratum is defined by some common characteristic, such
as gender or year in school. You select a simple random sample within each of the
strata and combine the results from the separate simple random samples.
▪ In a cluster sample, you divide the N items in the frame into clusters that contain
several items. Clusters are often naturally occurring groups, such as counties,
election districts, city blocks, households, or sales territories. You then take a random
sample of one or more clusters and study all items in each selected cluster.
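The four probability sampling methods above can be sketched in Python. This is only an illustration: the frame of 800 employee IDs follows the systematic-sample example, while the strata, the clusters, and all variable names are made up for the sketch.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

N, n = 800, 40
frame = list(range(1, N + 1))  # hypothetical frame: employee IDs 1..800

# Simple random sample: every item has the same chance of selection.
simple = random.sample(frame, n)

# Systematic sample: random start in the first k items, then every kth item.
k = N // n
start = random.randint(1, k)
systematic = frame[start - 1::k]

# Stratified sample: simple random sample within each stratum, then combine.
strata = {"first_half": frame[:400], "second_half": frame[400:]}  # made-up strata
stratified = [x for group in strata.values() for x in random.sample(group, n // 2)]

# Cluster sample: randomly pick whole clusters and keep every item in them.
clusters = [frame[i:i + 100] for i in range(0, N, 100)]  # 8 made-up clusters
cluster_sample = [x for c in random.sample(clusters, 2) for x in c]
```

Note that the cluster sample keeps all 100 items of each chosen cluster, whereas the stratified sample draws from every stratum.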
 Data Cleaning:
 Deleting Duplicates:
▪ Select Data
▪ Go to Data > Remove Duplicates
 To Remove Blank Cells
▪ Go to ‘find and select’ -> ‘go to special’
▪ Select blanks and click OK

▪ Click delete -> delete sheet rows


 Types Of Survey Error:
▪ Coverage error occurs if certain groups of items are excluded from the frame so
that they have no chance of being selected in the sample or if items are included
from outside the frame
▪ Non-response error arises from failure to collect data on all items in the sample and
results in a non-response bias.
▪ Sampling error reflects the variation, or “chance differences,” from sample to
sample, based on the probability of particular individuals or items being selected in
the particular samples.
▪ When surveys rely on self-reported information, the mode of data collection, the
respondent to the survey, and/or the survey itself can be possible sources of
measurement error.
 Ethical issues about surveys:
▪ Coverage error -> selection bias
▪ Non-response error -> if the sponsor knowingly designs the survey so that particular
groups or individuals are less likely than others to respond.
▪ Sampling error -> findings are purposely presented without reference to sample size
and margin of error so that the sponsor can promote a viewpoint that might otherwise
be inappropriate.
▪ Measurement error -> (1) a survey sponsor chooses questions that guide the
respondent in a particular direction; (2) a respondent willfully provides false
information.
▪ Results of non-probability samples are used to form conclusions about the entire
population.
 EXCEL GUIDE:
▪ RANDBETWEEN(smallest integer, largest integer) -> generate a random integer
that can then be used to select an item from a frame.
▪ Recoding Variables:
 =IF(G2 < 3.3, "No", "Yes")
 =IF(G2>9, "outstanding", IF(G2>7, "good", IF(G2>5, "pass", "fail")))
 =VLOOKUP(G2, $I$11:$J$15, 2) (where I11 and J15 are the top-left and bottom-right cells of
the lookup table, and 2 is the number of the table column whose value is returned. With no
fourth argument, VLOOKUP does an approximate match, so the first column of the table must be
sorted in ascending order.)
 =IF(ISNUMBER(G2), "NUMERICAL", "NON-NUMERICAL")
 =ROUND(G2, 2) (where 2 = round the value to two decimal places)

Chapter-2 Organizing And Visualizing Variables

 Organizing Categorical Data:

i. A Summary Table tallies the set of individual values as frequencies or percentages for
each category. A summary table helps you see the differences among the categories by
displaying the frequency, amount, or percentage of items in a set of categories in a
separate column.
ii. A Contingency Table cross-tabulates, or tallies jointly, the data of two or more
categorical variables, allowing you to study patterns that may exist between the
variables.
 Organizing Numerical Variables
i. A Frequency Distribution tallies the values of a numerical variable into a set of
numerically ordered classes. Each class groups a mutually exclusive range of values,
called a class interval. Interval width = (highest value - lowest value)/number of classes
ii. A Bin is a range of values defined by a bin number, the upper boundary of the range.
E.g. bin numbers 4.99, 9.99, and 14.99. The first bin represents all values up to 4.99, the
second bin all values greater than 4.99 through 9.99, and the third bin all values greater
than 9.99 through 14.99.
iii. A Relative Frequency Distribution presents the relative frequency, or proportion, of
the total for each group that each class represents. A Percentage Distribution presents
the percentage of the total for each group that each class represents. Proportion = relative
frequency = (number of values in each class)/total number of values
iv. The Cumulative Percentage Distribution provides a way of presenting information
about the percentage of values that are less than a specific amount.
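As a rough Python illustration of the frequency, relative frequency, percentage, and cumulative percentage distributions described above: the data values and the helper name bin_index are made up, and the bin numbers are the ones from the bin example.

```python
from collections import Counter

values = [2.5, 4.0, 7.3, 8.8, 9.1, 12.4, 13.0, 14.2, 3.3, 6.6]  # made-up data
bins = [4.99, 9.99, 14.99]  # bin numbers = upper boundaries of each class

def bin_index(x, bins):
    """Return the index of the first bin whose upper boundary is >= x."""
    for i, upper in enumerate(bins):
        if x <= upper:
            return i
    raise ValueError(f"{x} is above the last bin")

counts = Counter(bin_index(x, bins) for x in values)
frequency = [counts[i] for i in range(len(bins))]
relative = [f / len(values) for f in frequency]
percentage = [100 * f / len(values) for f in frequency]

# Cumulative percentage: running total of the class percentages.
cumulative = []
running = 0.0
for p in percentage:
    running += p
    cumulative.append(running)
```

The last cumulative percentage should always come out at 100%, which is a quick check that no value fell outside the bins.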
 Visualizing Categorical Variables
i. A Bar Chart visualizes a categorical variable as a series of bars, with each bar
representing the tallies for a single category.
ii. A Pie Chart and a Doughnut Chart both use parts of a circle to represent the tallies of
each category of a categorical variable.
iii. In a Pareto Chart, the tallies for each category are plotted as vertical bars in descending
order, according to their frequencies, and are combined with a cumulative percentage (at
the midpoint of each category) line on the same chart.
 Visualizing Numerical Variables
i. A Stem-And-Leaf Display visualizes data by presenting the data as one or more row-
wise stems that represent a range of values. For stems with more than one leaf, the
leaves are arranged in ascending order.
E.g.: 7.42, 7.10, 7.90 → stem = 7; leaves = 4 (42 rounded), 1, 9
7 | 149 (leaves in ascending order)
ii. A Histogram visualizes data as a vertical bar chart in which each bar represents a class
interval from a frequency or percentage distribution.
iii. Percentage Polygon uses the midpoints of each class interval to represent the data of
each class and then plots the midpoints, at their respective class percentages, as points on
a line along the X axis.
iv. The Cumulative Percentage Polygon, or ogive, uses the cumulative percentage
distribution.

v. A Scatter Plot explores the possible relationship between two numerical variables by
plotting the values of one numerical variable on the horizontal, or X, axis and the values
of a second numerical variable on the vertical, or Y, axis.
vi. A Time-Series Plot plots the values of a numerical variable on the Y axis and plots the
time period associated with each numerical value on the X axis.
 Excel Guide
i. Insert ➔ PivotTable
ii. The Frequency Distribution
▪ Establish Bins
▪ =FREQUENCY (untallied data cell range, bins cell range) array function to tally
data.

Chapter 3: Numerical Descriptive Measures


 Central tendency is the extent to which the values of a numerical variable group around a
typical, or central, value. Variation measures the amount of dispersion, or scattering, away
from a central value that the values of a numerical variable show. The Shape of a variable is
the pattern of the distribution of values from the lowest value to the highest value.
 Central Tendency
i. Mean = X̄ (X bar) = sum of the values / number of values. Affected by extreme values.
ii. The Median is the middle value in an ordered array of data that has been
ranked from smallest to largest. Not affected by extreme values.
iii. The Mode is the value that appears most frequently.
iv. The Geometric Mean measures the rate of change of a variable over time. It is the nth
root of the product of n values: X_G = (X1 × X2 × ... × Xn)^(1/n).
v. The Geometric Mean Rate of Return measures the mean percentage return of an
investment per time period: R_G = [(1 + R1) × (1 + R2) × ... × (1 + Rn)]^(1/n) − 1, where Ri
is the rate of return in period i.
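A small Python sketch of these measures, using made-up numbers. The returns of −50% and +100% are chosen to show why the geometric mean rate of return is preferred: the arithmetic mean of the two returns is +25%, yet the investment ends exactly where it started.

```python
import math
import statistics

values = [10, 20, 20, 30, 120]  # made-up data with one extreme value
mean = statistics.mean(values)      # pulled upward by the extreme value 120
median = statistics.median(values)  # unaffected by the extreme value
mode = statistics.mode(values)

# Geometric mean rate of return for made-up yearly returns of -50% and +100%.
returns = [-0.50, 1.00]
growth = math.prod(1 + r for r in returns)      # overall growth factor
geo_rate = growth ** (1 / len(returns)) - 1     # mean return per period
```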

 Variation and Shape


i. The Range is the difference between the largest and smallest value.
ii. The Variance and the Standard Deviation measure the “average” scatter around the
mean.

iii. The Coefficient of Variation (CV) measures the scatter in the data relative to the mean.
CV = (S / X̄) × 100%, where S = sample standard deviation and X̄ (X bar) = sample mean.
iv. The Z Score for a value is equal to the difference between the value and the mean,
divided by the standard deviation: Z = (X - mean)/S.
 A Z score of 0 indicates that the value is the same as the mean.
 If a Z score is a positive or negative number, it indicates whether the value is above
or below the mean and by how many standard deviations.
 Z scores help identify outliers, the values that seem excessively different from most
of the rest of the values. Values that are very different from the mean will have either
very small (negative) Z scores or very large (positive) Z scores.
 As a general rule, a Z score that is less than -3.0 or greater than +3.0 indicates an
outlier value.
v. Skewness measures the extent to which the data values are not symmetrical around the
mean. The three possibilities are:
 Mean < median: negative, or left-skewed distribution
 Mean = median: symmetrical distribution (zero skewness)
 Mean > median: positive, or right-skewed distribution
 In a symmetrical distribution, the values below the mean are distributed in exactly
the same way as the values above the mean, and the skewness is zero.
 In a skewed distribution, there is an imbalance of data values below and above the
mean, and the skewness is a nonzero value (less than zero for a left-skewed
distribution, greater than zero for a right-skewed distribution).

 Panel A displays a left-skewed distribution. Most of the values are in the upper
portion of the distribution. Some extremely small values cause the long tail and
distortion to the left and cause the mean to be less than the median. Because the
skewness statistic for such a distribution will be less than zero, some use the term
negative skew to describe this distribution.
 Panel B displays a symmetrical distribution. Values are equally distributed in the
upper and lower portions of the distribution.
 Panel C displays a right-skewed distribution. Most of the values are in the lower
portion. Some extremely large values cause the long tail and distortion to the right
and cause the mean to be greater than the median. Because the skewness statistic for
such a distribution will be greater than zero, some use the term positive skew to
describe this distribution.
vi. Kurtosis measures the peakedness of the curve of the distribution—that is, how sharply
the curve rises approaching the centre of the distribution.
 A distribution that has a sharper-rising center peak than the peak of a normal
distribution has positive kurtosis, a kurtosis value that is greater than zero, and is
called leptokurtic. It has a higher concentration of values near the mean
than a normal distribution.
 A distribution that has a slower-rising (flatter) center peak than the peak of a normal
distribution has negative kurtosis, a kurtosis value that is less than zero, and is called
platykurtic. It has a lower concentration of values near the mean
than a normal distribution.
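The range, standard deviation, coefficient of variation, and Z scores from this list can be sketched in Python as follows. The sample is made up, and comparing the mean with the median is used only as a quick indicator of the direction of skew, not as the skewness statistic itself.

```python
import statistics

values = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]  # made-up sample
mean = statistics.mean(values)
s = statistics.stdev(values)             # sample standard deviation (n - 1 divisor)
value_range = max(values) - min(values)
cv = s / mean * 100                      # coefficient of variation, in percent

# Z score of each value, and any value flagged by the +/-3.0 rule of thumb.
z_scores = [(x - mean) / s for x in values]
outliers = [x for x, z in zip(values, z_scores) if abs(z) > 3.0]

# Mean > median suggests right skew; mean < median suggests left skew.
median = statistics.median(values)
skew_direction = "right" if mean > median else "left" if mean < median else "none"
```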
 Exploring Numerical Data
i. Quartiles split the values into four equal parts
 The First Quartile divides the smallest 25.0% of the values from the other 75.0%
that are larger. Q1 = (n + 1)/4 ranked value (where n = sample size)
 The Second Quartile is the median; 50.0% of the values are smaller than or equal
to the median, and 50.0% are larger than or equal to the median.
 The Third Quartile divides the smallest 75.0% of the values from the largest
25.0%. Q3 = 3(n + 1)/4 ranked value.
ii. The Interquartile Range (also called the midspread) measures the difference in the
center of a distribution between the third and first quartiles (Q3 – Q1).
(Descriptive statistics such as the median, Q1, Q3, and the interquartile range, which are not influenced by
extreme values, are called resistant measures.)
iii. The Five-Number Summary: Xsmallest Q1 Median Q3 Xlargest

iv. The Boxplot uses a five-number summary to visualize the shape of the distribution for a
variable.

 Panels A and D are symmetrical. In these distributions, the mean and median are
equal. In addition, the length of the left tail is equal to the length of the right tail, and
the median line divides the box in half.
 Panel B is left-skewed. The few small values distort the mean toward the left tail.
There is a long left tail that contains the smallest 25% of the values, demonstrating
the lack of symmetry in this data set.
 Panel C is right-skewed. There is a long right tail that contains the largest 25% of the
values, demonstrating the lack of symmetry in this data set.
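A hedged Python sketch of the quartiles, interquartile range, and five-number summary, on a made-up sample. It assumes linear interpolation between ranked values at positions (n + 1)/4, 2(n + 1)/4 and 3(n + 1)/4, which is what the "exclusive" method of statistics.quantiles does. A textbook that rounds the ranked position instead may give slightly different quartiles.

```python
import statistics

values = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]  # made-up sample

# method="exclusive" places Q1, Q2, Q3 at ranks (n+1)/4, 2(n+1)/4, 3(n+1)/4
# and interpolates linearly between neighbouring ranked values.
q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
iqr = q3 - q1
five_number_summary = (min(values), q1, q2, q3, max(values))
```

A boxplot draws exactly these five numbers: the box spans Q1 to Q3, the line inside the box is the median, and the whiskers reach to the smallest and largest values.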
 Numerical Descriptive Measures for a Population
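i. The Population Mean: μ = (sum of all N values in the population) / N.
ii. The Population Variance: σ² = (sum of the squared differences between each value and μ) / N.
The Population Standard Deviation, σ, is the square root of the population variance.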

iii. The Empirical Rule states that for population data from a symmetric mound-shaped
distribution such as the normal distribution, the following are true:
 Approximately 68% of the values are within ±1 standard deviation from the mean.
 Approximately 95% of the values are within ±2 standard deviations from the mean.
 Approximately 99.7% of the values are within ±3 standard deviations from the mean.
 e.g. mean = 2.06; standard deviation = 0.02
 mean ± SD = 2.06 ± 0.02 = (2.04, 2.08), i.e. about 68% of the data is between 2.04 and 2.08
 2.06 ± 2(0.02) = (2.02, 2.10), i.e. about 95% of the data falls between 2.02 and 2.10
 2.06 ± 3(0.02) = (2.00, 2.12), i.e. about 99.7% of the data is between 2.00 and 2.12
iv. Chebyshev’s Theorem states that for any data set, regardless of shape, the percentage of
values that are found within distances of k standard deviations from the mean must be at
least (1 − 1/k²) × 100%, for any k greater than 1. The theorem indicates at least what
percentage of the values fall within a given distance from the mean.
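The empirical-rule intervals from the example and the Chebyshev bound can be checked with a few lines of Python. The function name chebyshev_min_percent is made up for this sketch.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return (1 - 1 / k**2) * 100

mean, sd = 2.06, 0.02  # values from the empirical-rule example
intervals = {k: (round(mean - k * sd, 2), round(mean + k * sd, 2)) for k in (1, 2, 3)}

bound_2 = chebyshev_min_percent(2)  # at least 75% within 2 standard deviations
bound_3 = chebyshev_min_percent(3)  # at least about 88.89% within 3 standard deviations
```

For the same distances, Chebyshev's bounds (75% and 88.89%) are lower than the empirical rule's 95% and 99.7% because they hold for any shape of distribution, not only mound-shaped ones.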

 The Covariance And The Coefficient Of Correlation


i. The Covariance measures the strength of the linear relationship between two numerical
variables (X and Y). Because its value depends on the units of X and Y, it cannot be used
to judge the relative strength of the relationship.
ii. The Coefficient Of Correlation measures the relative strength of a linear
relationship between two numerical variables. (ρ, the Greek letter rho = population
coefficient of correlation; r = sample coefficient of correlation)

i. In Panel A , there is a perfect negative linear relationship between X and Y. Thus, the
coefficient of correlation, r, equals -1, and when X increases, Y decreases in a perfectly
predictable manner.
ii. Panel B shows a situation in which there is no relationship between X and Y. In this
case, the coefficient of correlation, r, equals 0, and as X increases, there is no tendency
for Y to increase or decrease.
iii. Panel C illustrates a perfect positive relationship where r equals +1. In this case, Y
increases in a perfectly predictable manner when X increases.
(Correlation alone cannot prove that there is a causation effect, that is, that the change in
the value of one variable caused the change in the other variable. Causation implies
correlation, but correlation alone does not imply causation.)
The coefficient of correlation indicates the linear relationship, or association, between
two numerical variables. When the coefficient of correlation gets closer to +1 or -1, the linear relationship
between the two variables is stronger. When it is near 0, little or no linear relationship exists.
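A minimal Python sketch of the sample covariance and the sample coefficient of correlation, computed from their definitions on made-up data (the same quantities that COVARIANCE.S and CORREL return in Excel):

```python
import math

x = [1, 2, 3, 4, 5]  # made-up values of the first variable
y = [2, 4, 5, 4, 5]  # made-up values of the second variable
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sample covariance: sum of cross-deviations divided by n - 1.
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

# Sample standard deviations of X and Y.
s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))

# Sample coefficient of correlation: covariance scaled by both standard deviations.
r = cov_xy / (s_x * s_y)
```

Dividing by both standard deviations removes the units, which is why r always lies between -1 and +1 while the covariance can take any value.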

 Excel Guide
i. =AVERAGE(variable cell range)
ii. =MEDIAN(variable cell range)
iii. =MODE(variable cell range)
iv. =GEOMEAN((1 + (R1)), (1 + (R2)), ..., (1 + (Rn))) − 1 to compute the geometric mean
rate of return.
v. =MIN(variable cell range)
vi. =MAX(variable cell range)
vii. =VAR.S(variable cell range) ;to compute the sample variance.
viii. =STDEV.S(variable cell range) ;to compute the sample standard deviation.
ix. =STANDARDIZE(value, mean, standard deviation) ;to compute Z scores.
x. =SKEW(variable cell range)
xi. =KURT(variable cell range)
xii. =AVERAGE(variable cell range) ;for population
xiii. =VAR.P(variable cell range) ;for population
xiv. =STDEV.P(variable cell range) ;for population
xv. =COVARIANCE.S(variable 1 cell range, variable 2 cell range)
xvi. =CORREL(variable 1 cell range, variable 2 cell range) to compute coefficient of
correlation.

Chapter 4 – Basic Probability Concept

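1. Simple probability refers to the probability of occurrence of a simple event, P(A) = X/T,
where X is the number of outcomes in which the event occurs and T is the total number of
possible outcomes.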

Chapter 5 – Binomial Distribution

The Binomial Distribution gives the probability of observing a specified number of events
of interest (successes) in a specified number of trials, or observations, when each trial has
the same probability of producing the event of interest and the trials are independent.

 Poisson Distribution

 It is a probability distribution that is used to show how many times an event is likely to
occur over a specified period. It is often used to understand independent events that
occur at a constant rate within a given interval of time.
 The Poisson distribution gives the probability of X = x events, given that lambda (λ)
events are expected: P(X = x | λ) = (e^(−λ) × λ^x) / x!

 Excel Guide
 =SUMPRODUCT(X cell range, P(X=x1) cell range) to compute the expected value.
 =SUMPRODUCT(squared differences cell range, P(X =x1) cell range) to compute
the variance.

 =BINOM.DIST(number of events of interest, sample size, probability of an event of
interest, FALSE) to compute a binomial probability.
 =POISSON.DIST(number of events of interest, the average or expected number of
events of interest, FALSE) to compute a Poisson probability.
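For comparison with the Excel functions above, here is a Python sketch of the same calculations using the standard formulas for the binomial and Poisson probabilities. The function names and all inputs are made up for illustration.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) when lam events are expected in the interval."""
    return math.exp(-lam) * lam**x / math.factorial(x)

p_binom = binomial_pmf(3, 4, 0.1)  # made-up inputs: 3 successes in 4 trials, p = 0.1
p_pois = poisson_pmf(2, 3.0)       # made-up inputs: 2 events when 3 are expected

# Expected value and variance of a discrete distribution, as with SUMPRODUCT.
xs = range(0, 5)
probs = [binomial_pmf(x, 4, 0.1) for x in xs]
expected = sum(x * p for x, p in zip(xs, probs))
variance = sum((x - expected) ** 2 * p for x, p in zip(xs, probs))
```

For a binomial distribution the last two results should agree with the shortcuts n × p and n × p × (1 − p).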

Chapter 6 – The Normal Distribution

 Normal Distribution Important Theoretical Properties


◦ Symmetrical distribution. Its mean and median are therefore equal.
◦ Bell-shaped. Values cluster around the mean.
◦ Interquartile range is roughly 1.33 standard deviations. Therefore, the middle 50% of the
values are contained within an interval that is approximately two-thirds of a standard
deviation below and two-thirds of a standard deviation above the mean.
◦ The distribution has an infinite range ( - ∞ < X < ∞ ). In practice, almost all values of a
normally distributed variable fall within a range of approximately 6 standard deviations
(the mean ± 3 standard deviations).

◦ In a quantile–quantile plot, the Z values are plotted on the X axis, and the corresponding
values of the variable are plotted on the Y axis.

 Excel Guide

◦ =NORM.DIST(X value, mean, standard deviation, TRUE) function to compute normal
probabilities
◦ =NORM.S.INV(percentage) function and the =STANDARDIZE function to compute
the Z value.
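The same normal calculations can be sketched with Python's standard library. The mean of 100 and standard deviation of 15 are made-up values; the comments name the Excel function each line corresponds to.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)   # made-up mean and standard deviation
p_below_115 = dist.cdf(115)           # like NORM.DIST(115, 100, 15, TRUE)
z = (115 - dist.mean) / dist.stdev    # like STANDARDIZE(115, 100, 15)
z_975 = NormalDist().inv_cdf(0.975)   # like NORM.S.INV(0.975)

# Interquartile range expressed in standard deviations.
iqr_in_sd = (dist.inv_cdf(0.75) - dist.inv_cdf(0.25)) / dist.stdev
```

The last value comes out at about 1.35, in line with the rule of thumb of roughly 1.33 standard deviations quoted above.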

Chapter 7 – Sampling Distributions


 The sampling distribution of the mean is the distribution of all possible sample means if
you select all possible samples of a given size.
 The sample mean is unbiased because the mean of all the possible sample means (of a
given sample size, n) is equal to the population mean.
 Although the sample means vary from sample to sample, the sample means do not vary as much
as the individual values in the population. That the sample means are less variable than the
individual values in the population follows directly from the fact that each sample mean
averages together all the values in the sample.
 The standard deviation of all possible sample means is called the standard
error of the mean: σ_X̄ = σ/√n. As the sample size increases,
the standard error of the mean decreases by a factor equal to the square root of the
sample size.
 If sampling from a population that is normally distributed with mean μ and
standard deviation σ, then regardless of the sample size, n, the sampling distribution
of the mean is normally distributed.

 Sampling from Non-normally Distributed Populations— The Central Limit Theorem


 The Central Limit Theorem: As the sample size (the number of values in each
sample) gets large enough, the sampling distribution of the mean is approximately
normally distributed. This is true regardless of the shape of the distribution of the
individual values in the population.
 For most distributions, regardless of shape of the population, the sampling distribution
of the mean is approximately normally distributed if samples of at least size 30 are
selected.
 If the distribution of the population is fairly symmetrical, the sampling distribution of
the mean is approximately normal for samples as small as size 5.
 If the population is normally distributed, the sampling distribution of the mean is
normally distributed, regardless of the sample size.
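A small simulation can illustrate the Central Limit Theorem and the standard error of the mean. This is only a sketch under made-up assumptions: the population is an artificial right-skewed (exponential) one, the sample size is 30, and 2,000 samples stand in for "all possible samples".

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Made-up, strongly right-skewed population (exponential with mean 1).
population = [random.expovariate(1.0) for _ in range(20_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Draw many samples of size n and record each sample mean.
n = 30
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

mean_of_means = statistics.mean(sample_means)  # should be close to mu (unbiasedness)
observed_se = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = sigma / n ** 0.5              # sigma / sqrt(n)
```

Even though the population is far from normal, the sample means cluster around the population mean with a spread close to σ/√n.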
 Sampling Distribution of the Proportion
 π = proportion of items in the entire population with the characteristic of interest.
 p = proportion of items in the sample with the characteristic of interest.
▪ p = X/n = Number of items having the characteristic of interest/Sample size.
 The statistic p is an unbiased estimator of the population proportion, π.

 Excel Guide
 =RAND() ;to create lists of random numbers.

Chapter 8 – Confidence Interval Estimation

 Confidence Interval Estimate for the Mean ( population standard deviation Known)
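The standard interval for this case is X̄ ± Z(α/2) × σ/√n. A minimal Python sketch follows, with a made-up function name and made-up inputs. The half-width it computes is the quantity Excel's CONFIDENCE function returns.

```python
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value, about 1.96 for 95%
    half_width = z * sigma / n ** 0.5        # sampling error of the interval
    return x_bar - half_width, x_bar + half_width

# Made-up inputs: sample mean 50, population SD 10, sample size 25.
lower, upper = z_interval(x_bar=50, sigma=10, n=25)
```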

 Confidence Interval Estimate for the Mean (population standard deviation Unknown)
◦ Student’s t distribution: t = (X̄ − μ) / (S/√n). This expression has the same form as the Z
statistic, except that S is used to estimate the unknown population SD.
▪ You find the critical values of t for the appropriate degrees of freedom from
the table of the t distribution (see Table E.3).
▪ For example, with 99 degrees of freedom, if you want 95% confidence,
you find the appropriate value of t. The 95% confidence level means that
2.5% of the values (an area of 0.025) are in each tail of the distribution.
Looking in the column for a cumulative probability of 0.975 and an
upper-tail area of 0.025 in the row corresponding to 99 degrees of
freedom gives you a critical value for t of 1.9842.
◦ Degree of Freedom = n – 1

◦ The Confidence Interval Statement

 E.g. Thus, with 95% confidence, you conclude that the mean amount of all the sales
invoices is between $104.53 and $116.01. The 95% confidence level indicates that if you
selected all possible samples of 100 (something that is never done in practice), 95% of the
intervals developed would include the population mean somewhere within the interval. The
validity of this confidence interval estimate depends on the assumption of normality for the
distribution of the amount of the sales invoices. With a sample of 100, the normality
assumption is not overly restrictive (see the Central Limit Theorem), and the use of the t
distribution is likely appropriate.
(The interpretation of the confidence interval when population SD is unknown is the same as when it is known.)

 Confidence Interval Estimate for the Proportion


 Use this interval to estimate the proportion of items in a population having a certain
characteristic of interest. The unknown population proportion is represented by the
Greek letter π. The point estimate for π is the sample proportion, p = X/n.
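The usual normal-approximation interval for a proportion is p ± Z(α/2) × √(p(1 − p)/n). A short Python sketch, with a made-up function name and made-up counts:

```python
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p = x / n                                   # sample proportion, the point estimate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * (p * (1 - p) / n) ** 0.5
    return p - half_width, p + half_width

# Made-up counts: 10 items with the characteristic of interest out of 100.
lower, upper = proportion_interval(x=10, n=100)
```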

 Determining Sample Size


 Sample Size Determination For The Mean

 Sample Size Determination for the Proportion
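The standard sample size formulas are n = Z²σ²/e² for the mean and n = Z²π(1 − π)/e² for the proportion, where e is the acceptable sampling error, and the result is always rounded up. A Python sketch with made-up function names and inputs:

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, e, confidence=0.95):
    """Sample size so the interval for the mean has sampling error at most e."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * sigma**2 / e**2)       # like ROUNDUP(..., 0)

def sample_size_proportion(pi, e, confidence=0.95):
    """Sample size for a proportion; pi = 0.5 is the most conservative guess."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * pi * (1 - pi) / e**2)

n_mean = sample_size_mean(sigma=25, e=5)           # made-up inputs
n_prop = sample_size_proportion(pi=0.5, e=0.05)    # made-up inputs
```

Rounding up rather than to the nearest integer guarantees that the sampling error does not exceed e.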

 Excel Guide
 Confidence Interval Estimate for the Mean (population SD Known)
 =NORM.S.INV(cumulative percentage) to compute the Z value for one-half of the (1 - alpha)
value.

 =CONFIDENCE(1−confidence level, population standard deviation, sample size)
function to compute the half-width of a confidence interval.

 Confidence Interval Estimate for the Mean (population SD Unknown)


 =T.INV.2T(1−confidence level, degrees of freedom) function to determine the critical value
from the t distribution.

 Confidence Interval Estimate for the Proportion


 =NORM.S.INV((1−confidence level)/2) function to compute the Z value.
 Determining Sample Size
 Sample Size Determination for the Mean
 =NORM.S.INV((1−confidence level)/2) function to compute the Z value
 =ROUNDUP(calculated sample size, 0) function to round up the computed sample size to the
next higher integer.

 Sample Size Determination for the Proportion


 =NORM.S.INV and =ROUNDUP functions discussed previously to help determine the sample
size needed for estimating the proportion.

Chapter 9 – Fundamentals Of Hypothesis Testing: One-Sample Tests

 Fundamentals of Hypothesis-Testing Methodology


◦ The hypothesis that the population parameter is equal to the company
specification is referred to as the null hypothesis. A null hypothesis is often
one of status quo and is identified by the symbol H0.
◦ The alternative hypothesis, H1, is the opposite of the null hypothesis and
represents a research claim or specific inference you would like to prove.
◦ The Critical Value of the Test Statistic
▪ Hypothesis testing uses sample data to determine whether the sample evidence is
consistent with the null hypothesis or strong enough to reject it.
▪ Even if the null hypothesis is true, the sample statistic X(bar) is likely to differ from
the value of the parameter (the population mean) because of variation due to
sampling.
▪ If the sample statistic is close to the population parameter, you have insufficient
evidence to reject the null hypothesis.
▪ If there is a large difference between the value of the sample statistic and the
hypothesized value of the population parameter, you might conclude that the null
hypothesis is false.
◦ Regions of Rejection and Non-rejection

◦ Risks in Decision Making Using Hypothesis Testing

◦ Z Test for the Mean (SD Known)

◦ Hypothesis Testing Using the Critical Value Approach


(a two-tail test in which the rejection region is divided into the two tails of the distribution)

SUMMARY:
Step 1: State the null hypothesis, H0, and the alternative hypothesis, H1.
Step 2: Choose the level of significance, α, and the sample size, n. The level of significance is based
on the relative importance of the risks of committing Type I and Type II errors in the problem.
Step 3: Determine the appropriate test statistic and sampling distribution.
Step 4: Determine the critical values that divide the rejection and non-rejection regions.
Step 5: Collect the sample data, organize the results, and compute the value of the test statistic.
Step 6: Make the statistical decision, determine whether the assumptions are valid, and state the
managerial conclusion in the context of the theory, claim, or assertion being tested. If the test statistic
falls into the nonrejection region, you do not reject the null hypothesis. If the test statistic falls into the
rejection region, you reject the null hypothesis.
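The six steps can be sketched in Python for a two-tail Z test of the mean with known population SD, where the test statistic is Z = (X̄ − μ) / (σ/√n). The function name and all inputs are made up for illustration.

```python
from statistics import NormalDist

def z_test_two_tail(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-tail Z test for the mean (population SD known), critical value approach."""
    z_stat = (x_bar - mu0) / (sigma / n ** 0.5)   # Step 5: test statistic
    upper = NormalDist().inv_cdf(1 - alpha / 2)   # Step 4: like NORM.S.INV(1 - alpha/2)
    lower = -upper                                # like NORM.S.INV(alpha/2)
    reject = z_stat < lower or z_stat > upper     # Step 6: in the rejection region?
    return z_stat, (lower, upper), reject

# Made-up inputs: H0: mu = 368, sample mean 372.5, sigma 15, n = 25.
z_stat, critical, reject = z_test_two_tail(372.5, 368, 15, 25)
```

Here the test statistic of 1.5 lies between the critical values of about −1.96 and +1.96, so the null hypothesis is not rejected.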

◦ Hypothesis Testing Using the p-Value Approach

▪ The p-value is the probability of getting a test statistic equal to or more extreme than
the sample result, given that the null hypothesis, H0, is true. The p-value is also
known as the observed level of significance.
 If the p-value is greater than or equal to alpha i.e. level of significance, do not
reject the null hypothesis.
 If the p-value is less than alpha, reject the null hypothesis.

SUMMARY:
Step 1 State the null hypothesis, H0, and the alternative hypothesis, H1.
Step 2 Choose the level of significance, α, and the sample size, n. The level of significance is based on
the relative importance of the risks of committing Type I and Type II errors in the problem.
Step 3 Determine the appropriate test statistic and the sampling distribution.
Step 4 Collect the sample data, compute the value of the test statistic, and compute the p-value.
Step 5 Make the statistical decision and state the managerial conclusion in the context of the theory, claim, or
assertion being tested. If the p-value is greater than or equal to α, do not reject the null hypothesis. If the
p-value is less than α, reject the null hypothesis.
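The p-value approach for a two-tail Z test can be sketched in Python as follows. The function name and the test statistic of 1.5 are made up for illustration.

```python
from statistics import NormalDist

def two_tail_p_value(z_stat):
    """P(|Z| >= |z_stat|) under H0, for a two-tail Z test."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

alpha = 0.05                      # chosen level of significance
p_value = two_tail_p_value(1.5)   # made-up test statistic
reject = p_value < alpha          # Step 5: reject H0 only if p-value < alpha
```

With a p-value of about 0.134, which is greater than 0.05, the null hypothesis is not rejected, the same decision the critical value approach gives for this test statistic.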

 t Test of Hypothesis for the Mean (population SD Unknown)


 If you assume that the population is normally distributed, then the sampling distribution of the mean will follow a t
distribution with n - 1 degrees of freedom and you can use the t test for the mean. If the population is not
normally distributed, you can still use the t test if the population is not too skewed and the sample size is
not too small.
 The Critical Value Approach

 p-Value Approach

 One-Tail Tests
 The Critical Value Approach

 The p-Value Approach


 Z Test of Hypothesis for the Proportion


 If the number of events of interest (X) and the number of events that are not of interest
(n - X) are each at least five, the sampling distribution of a proportion approximately
follows a normal distribution, and you can use the Z test for the proportion.
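The Z test for the proportion can be sketched as follows, assuming the standard statistic Z = (p - π) / √(π(1 - π)/n), where p = X/n is the sample proportion and π is the hypothesized proportion. The function name and the inputs (X = 60, n = 100, π = 0.5) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(x, n, pi0, alpha=0.05):
    """Two-tail Z test for a proportion under H0: pi = pi0."""
    # Check the condition stated above: X and n - X must each be at least five
    if x < 5 or n - x < 5:
        raise ValueError("normal approximation not appropriate: need X >= 5 and n - X >= 5")
    p = x / n  # sample proportion
    z_stat = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value, p_value < alpha


# Hypothetical inputs, for illustration only
z_stat, p_value, reject = z_test_proportion(x=60, n=100, pi0=0.5)
print(round(z_stat, 4), round(p_value, 4), reject)
```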

 The Critical Value Approach


 The p-Value Approach

 Excel Guide
 Fundamentals of Hypothesis-Testing Methodology
 =NORM.S.INV(level of significance/2) and =NORM.S.INV(1 - level of
significance/2) functions to compute the lower and upper critical values.
 =NORM.S.DIST(absolute value of the Z test statistic, TRUE) as part of a formula to compute
the p-value, for example =2*(1 - NORM.S.DIST(absolute value of the Z test statistic, TRUE)).
 t Test of Hypothesis for the Mean (population SD Unknown)
 =T.INV.2T(level of significance, degrees of freedom) function to compute the upper
critical value; the lower critical value is its negative.
 =T.DIST.2T(absolute value of the t test statistic, degrees of freedom) to compute the
p-value.


 One-Tail Tests
(Z test for the mean)
 =NORM.S.INV(level of significance) and =NORM.S.INV(1 - level of significance)
functions to compute the lower and upper critical values.
 =NORM.S.DIST(Z test statistic, TRUE) and =1 - NORM.S.DIST(Z test statistic, TRUE)
to compute the lower-tail and upper-tail p-values.
(t test for the mean)
 =-T.INV.2T(2 * level of significance, degrees of freedom) and =T.INV.2T(2 * level
of significance, degrees of freedom) functions to compute the lower and upper
critical values.
 Use an IF function that tests the sign of the t test statistic to determine whether
=T.DIST.RT(absolute value of the t test statistic, degrees of freedom) or =1 -
T.DIST.RT(absolute value of the t test statistic, degrees of freedom) gives the
required p-value. For a negative t test statistic, T.DIST.RT of the absolute value is the
lower-tail p-value and 1 minus it is the upper-tail p-value; for a positive t test statistic,
the two are reversed.
 Z Test of Hypothesis for the Proportion
 =NORM.S.INV(level of significance/2) and =NORM.S.INV(1 - level of significance/2)
functions to compute the lower and upper critical values.
 =NORM.S.DIST(absolute value of the Z test statistic, TRUE) as part of a formula to
compute the p-value.
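For readers working outside Excel, the one-tail NORM.S.INV and NORM.S.DIST calculations above have close equivalents in Python's standard library. The sketch below is an assumed mapping, not part of the original guide; the helper names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1


def one_tail_critical_values(alpha):
    """Counterparts of =NORM.S.INV(alpha) and =NORM.S.INV(1 - alpha)."""
    lower = std_normal.inv_cdf(alpha)
    upper = std_normal.inv_cdf(1 - alpha)
    return lower, upper


def one_tail_p_values(z_stat):
    """Counterparts of =NORM.S.DIST(Z, TRUE) and =1 - NORM.S.DIST(Z, TRUE)."""
    lower_p = std_normal.cdf(z_stat)
    upper_p = 1 - lower_p
    return lower_p, upper_p


lower_cv, upper_cv = one_tail_critical_values(0.05)
lower_p, upper_p = one_tail_p_values(-1.5)  # hypothetical Z test statistic
print(round(lower_cv, 4), round(upper_cv, 4), round(lower_p, 4), round(upper_p, 4))
```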

Chapter 10 - Two Sample Tests

 Pooled-Variance t Test for the Difference Between Two Means


◦ If you assume that the random samples are independently selected from two populations
and that the populations are normally distributed and have equal variances, you can use
a pooled-variance t test to determine whether there is a significant difference between
the means of the two populations.


◦ For a given level of significance, α, in a two-tail test, you reject the null hypothesis if the
computed tSTAT test statistic is greater than the upper-tail critical value from the t
distribution or if the computed tSTAT test statistic is less than the lower-tail critical value
from the t distribution.
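A short Python sketch of the pooled-variance t test, assuming the standard formulas: pooled variance = ((n1 - 1)S1² + (n2 - 1)S2²) / (n1 + n2 - 2) and, under the null hypothesis of no difference between the means, tSTAT = (mean1 - mean2) / √(pooled variance × (1/n1 + 1/n2)). The summary statistics are invented for illustration, and the critical value comes from a t table.

```python
from math import sqrt


def pooled_variance_t(mean1, var1, n1, mean2, var2, n2):
    """Pooled-variance t statistic for H0: mu1 - mu2 = 0."""
    df = n1 + n2 - 2  # total degrees of freedom
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t_stat = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_stat, df, pooled_var


# Hypothetical sample means, sample variances and sample sizes
t_stat, df, pooled_var = pooled_variance_t(
    mean1=50.0, var1=16.0, n1=10,
    mean2=45.0, var2=24.0, n2=10,
)

# Two-tail critical value for alpha = 0.05 and 18 degrees of freedom (t table)
upper_critical = 2.1009
reject = t_stat > upper_critical or t_stat < -upper_critical
print(round(t_stat, 4), df, pooled_var, reject)
```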
◦ E.g.


◦ EXCEL CODE:

 Confidence Interval Estimate for the Difference Between Two Means


◦ t Test for the Difference Between Two Means, Assuming Unequal Variances

 F Test for the Ratio of Two Variances


◦ Excel Code:

 Example:


 Excel Guide:
◦ Comparing the Means of Two Independent Populations
▪ Pooled-Variance t Test for the Difference Between Two Means
 =T.INV.2T(level of significance, total degrees of freedom) to compute the upper
critical value; the lower critical value is its negative.
 =T.DIST.2T(absolute value of the t test statistic, total degrees of freedom) to
compute the p-value.
▪ t Test for the Difference Between Two Means, Assuming Unequal Variances
 =T.INV.2T(level of significance, degrees of freedom) to compute the upper critical
value; the lower critical value is its negative.
 =T.DIST.2T(absolute value of the t test statistic, degrees of freedom) to compute
the p-value.
◦ F Test for the Ratio of Two Variances
 =F.INV.RT(level of significance / 2, population 1 sample degrees of freedom,
population 2 sample degrees of freedom) function to compute the upper critical value.
 =F.DIST.RT(F test statistic, population 1 sample degrees of freedom, population
2 sample degrees of freedom) function to compute the upper-tail p-value. For a two-tail
test with the larger sample variance in the numerator, double this value.
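The F test statistic itself is simple to compute outside Excel. The sketch below assumes the usual definition, FSTAT = S1² / S2² with n1 - 1 and n2 - 1 degrees of freedom, and uses two invented samples; the upper critical value is read from an F table rather than computed.

```python
from statistics import variance


def f_statistic(sample1, sample2):
    """Ratio of two sample variances with its numerator and denominator df."""
    f_stat = variance(sample1) / variance(sample2)
    return f_stat, len(sample1) - 1, len(sample2) - 1


# Hypothetical samples; sample 1 has the larger variance, so it is the numerator
sample1 = [10.0, 14.0, 18.0, 22.0, 26.0]
sample2 = [11.0, 13.0, 15.0, 17.0, 19.0]
f_stat, df1, df2 = f_statistic(sample1, sample2)

# Upper critical value for alpha / 2 = 0.025 with 4 and 4 df (F table, about 9.60)
upper_critical = 9.60
reject = f_stat > upper_critical
print(f_stat, df1, df2, reject)
```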
