Data Analysis Notes
DATA ANALYSIS
A population contains all the items or individuals of interest that you seek to study.
A sample contains only a portion of a population of interest. You analyse a sample to
estimate characteristics of an entire population.
A statistic summarizes the value of a specific variable for sample data. Correspondingly, a
parameter summarizes the value of a population for a specific variable.
In a non-probability sample, you select the items or individuals without knowing their
probabilities of selection.
▪ In a convenience sample, you select items that are easy, inexpensive, or convenient to
sample.
▪ In a judgment sample, you collect the opinions of preselected experts in the subject
matter.
In a probability sample, you select items based on known probabilities.
▪ In a simple random sample, every item from a frame has the same chance of selection
as every other item, and every sample of a fixed size has the same chance of
selection as every other sample of that size.
▪ In a systematic sample, you partition the N items in the frame into n groups of k
items, where k = N/n. You then select one item at random from the first k items and
take every kth item after it.
e.g. with n = 40 and N = 800, k = 800/40 = 20, so each group contains 20 employees.
Select a random number from the first 20 individuals and include every twentieth
individual after the first selection in the sample. If the first random number you select
is 8, subsequent selections are 028, 048, 068, 088, 108, ..., 768, and 788.
▪ In a stratified sample, you first subdivide the N items in the frame into separate
sub-populations, or strata. A stratum is defined by some common characteristic, such
as gender or year in school. You select a simple random sample within each of the
strata and combine the results from the separate simple random samples.
▪ In a cluster sample, you divide the N items in the frame into clusters that contain
several items. Clusters are often naturally occurring groups, such as counties,
election districts, city blocks, households, or sales territories. You then take a random
sample of one or more clusters and study all items in each selected cluster.
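A minimal Python sketch of the systematic sample described above. It assumes the frame is a plain list and that N is a multiple of n; the function name and the seed argument are illustrative only, not part of the notes.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick one random item from the first k items, then every k-th item after it."""
    N = len(frame)
    k = N // n                      # group size k = N/n (assumes N is a multiple of n)
    rng = random.Random(seed)
    start = rng.randrange(k)        # 0-based position of the first selection
    return [frame[i] for i in range(start, N, k)]

# Frame of 800 employees numbered 1..800, sample of 40 (so k = 20)
frame = list(range(1, 801))
sample = systematic_sample(frame, n=40, seed=1)
print(len(sample), sample[:3], sample[-1])
```

Whatever the random start is, consecutive selections are always 20 apart, which matches the 008, 028, 048, ... pattern in the example.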
Data Cleaning:
Deleting Duplicates:
▪ Select Data
▪ Go to Data > Remove Duplicates
To Remove Blank Cells
▪ Go to Home > Find & Select > Go To Special
▪ Select blanks and click OK
Organizing Categorical Variables
i. A Summary Table tallies the set of individual values as frequencies or percentages for
each category. A summary table helps you see the differences among the categories by
displaying the frequency, amount, or percentage of items in a set of categories in a
separate column.
ii. A Contingency Table cross-tabulates, or tallies jointly, the data of two or more
categorical variables, allowing you to study patterns that may exist between the
variables.
Organizing Numerical Variables
i. A Frequency Distribution tallies the values of a numerical variable into a set of
numerically ordered classes. Each class groups a mutually exclusive range of values,
called a class interval. Interval width = (highest value - lowest value)/number of classes
ii. A Bin is a range of values defined by a bin number, the upper boundary of the range.
E.g. bin numbers 4.99, 9.99, and 14.99. The first bin represents all values up to 4.99, the
second bin all values greater than 4.99 through 9.99, and the third bin all values greater
than 9.99 through 14.99.
iii. A Relative Frequency Distribution presents the relative frequency, or proportion, of
the total for each group that each class represents. A Percentage Distribution presents
the percentage of the total for each group that each class represents. Proportion = relative
frequency = (number of values in each class)/total number of values
iv. The Cumulative Percentage Distribution provides a way of presenting information
about the percentage of values that are less than a specific amount.
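The distributions above can be sketched in Python as follows. This is an illustration under assumptions: the data values are hypothetical, the bin numbers are the ones used in the text, and each bin number is treated as the inclusive upper boundary of its range.

```python
from bisect import bisect_left

def tally(values, bins):
    """Count values per bin, where each bin number is the upper boundary of its range."""
    counts = [0] * (len(bins) + 1)          # last slot holds values above the final bin
    for v in values:
        counts[bisect_left(bins, v)] += 1   # first bin whose upper boundary is >= v
    return counts

# Hypothetical data, tallied with the bin numbers 4.99, 9.99 and 14.99
values = [1.5, 3.2, 4.99, 5.0, 7.7, 9.1, 10.0, 12.4, 14.2, 14.99]
bins = [4.99, 9.99, 14.99]

counts = tally(values, bins)[:len(bins)]    # no value exceeds 14.99 in this data
total = len(values)
relative = [c / total for c in counts]      # proportion = class count / total count

cumulative_pct = []
running = 0
for c in counts:
    running += c
    cumulative_pct.append(100 * running / total)

for upper, c, r, cp in zip(bins, counts, relative, cumulative_pct):
    print(f"up to {upper:>5}: frequency={c}  proportion={r:.2f}  cumulative={cp:.0f}%")
```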
Visualizing Categorical Variables
i. A Bar Chart visualizes a categorical variable as a series of bars, with each bar
representing the tallies for a single category.
ii. A Pie Chart and a Doughnut Chart both use parts of a circle to represent the tallies of
each category of a categorical variable.
iii. In a Pareto Chart, the tallies for each category are plotted as vertical bars in descending
order, according to their frequencies, and are combined with a cumulative percentage (at
the midpoint of each category) line on the same chart.
Visualizing Numerical Variables
i. A Stem-And-Leaf Display visualizes data by presenting the data as one or more row-wise
stems that represent a range of values. For stems with more than one leaf, the
leaves are arranged in ascending order.
e.g. 7.42, 7.10, 7.90: stem = 7; leaves = 4 (0.42 rounded to one digit), 1, 9
7 | 149 (leaves in ascending order)
ii. A Histogram visualizes data as a vertical bar chart in which each bar represents a class
interval from a frequency or percentage distribution.
iii. A Percentage Polygon uses the midpoints of each class interval to represent the data of
each class and then plots the midpoints, at their respective class percentages, as points on
a line along the X axis.
iv. The Cumulative Percentage Polygon, or ogive, uses the cumulative percentage
distribution.
v. A Scatter Plot explores the possible relationship between two numerical variables by
plotting the values of one numerical variable on the horizontal, or X, axis and the values
of a second numerical variable on the vertical, or Y, axis.
vi. A Time-Series Plot plots the values of a numerical variable on the Y axis and plots the
time period associated with each numerical value on the X axis.
Excel Guide
i. Insert ➔ PivotTable
ii. The Frequency Distribution
▪ Establish Bins
▪ =FREQUENCY(untallied data cell range, bins cell range) array function to tally
data.
v. The Geometric Mean Rate of Return measures the mean percentage return of an
investment per time period.
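A short Python sketch of the geometric mean rate of return, assuming the standard formula R_G = [(1 + R1)(1 + R2)...(1 + Rn)]^(1/n) - 1 with rates given as decimals. The two-period rates are hypothetical.

```python
import math

def geometric_mean_rate_of_return(returns):
    """returns: per-period rates of return as decimals, e.g. 0.10 for +10%."""
    growth = math.prod(1 + r for r in returns)   # total growth factor over all periods
    return growth ** (1 / len(returns)) - 1

# Hypothetical two-period example: +50% followed by -50%
rates = [0.50, -0.50]
print(f"arithmetic mean: {sum(rates) / len(rates):.4f}")
print(f"geometric mean rate of return: {geometric_mean_rate_of_return(rates):.4f}")
```

The arithmetic mean of these two rates is zero, yet the investment ends below its starting value. The geometric mean rate of return is negative and reflects that loss.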
iii. The Coefficient of Variation (CV) measures the scatter in the data relative to the mean.
CV = (S / X̄) × 100%, where S = sample standard deviation and X̄ = sample mean.
iv. The Z Score for a value is equal to the difference between the value and the mean,
divided by the standard deviation: Z = (X - mean)/S.
A Z score of 0 indicates that the value is the same as the mean.
If a Z score is a positive or negative number, it indicates whether the value is above
or below the mean and by how many standard deviations.
Z scores help identify outliers, the values that seem excessively different from most
of the rest of the values. Values that are very different from the mean will have either
very small (negative) Z scores or very large (positive) Z scores.
As a general rule, a Z score that is less than -3.0 or greater than +3.0 indicates an
outlier value.
v. Skewness measures the extent to which the data values are not symmetrical around the
mean. The three possibilities are:
Mean < median: negative, or left-skewed distribution
Mean = median: symmetrical distribution (zero skewness)
Mean > median: positive, or right-skewed distribution
In a symmetrical distribution, the values below the mean are distributed in exactly
the same way as the values above the mean, and the skewness is zero.
In a skewed distribution, there is an imbalance of data values below and above the
mean, and the skewness is a nonzero value (less than zero for a left-skewed
distribution, greater than zero for a right-skewed distribution).
Panel A displays a left-skewed distribution. Most of the values are in the upper
portion of the distribution. Some extremely small values cause the long tail and
distortion to the left and cause the mean to be less than the median. Because the
skewness statistic for such a distribution will be less than zero, some use the term
negative skew to describe this distribution.
Panel B displays a symmetrical distribution. Values are equally distributed in the
upper and lower portions of the distribution.
Panel C displays a right-skewed distribution. Most of the values are in the lower
portion. Some extremely large values cause the long tail and distortion to the right
and cause the mean to be greater than the median. Because the skewness statistic for
such a distribution will be greater than zero, some use the term positive skew to
describe this distribution.
vi. Kurtosis measures the peakedness of the curve of the distribution, that is, how sharply
the curve rises approaching the center of the distribution.
A distribution that has a sharper-rising center peak than the peak of a normal
distribution has positive kurtosis, a kurtosis value that is greater than zero, and is
called leptokurtic. It has a higher concentration of values near the mean of the
distribution compared to a normal distribution.
A distribution that has a slower-rising (flatter) center peak than the peak of a normal
distribution has negative kurtosis, a kurtosis value that is less than zero, and is called
platykurtic. It has a lower concentration of values near the mean of the distribution
compared to a normal distribution.
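The measures in this list can be sketched together in Python. This is an illustration, not a definitive implementation: the skewness and kurtosis lines use the bias-adjusted sample formulas, which are assumed here to correspond to what the SKEW and KURT worksheet functions report, and the data values are hypothetical.

```python
import statistics

def describe(values):
    """Sample CV, Z scores, skewness and kurtosis (needs at least 4 values)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)               # sample standard deviation
    z = [(x - mean) / s for x in values]       # Z = (X - mean) / S
    # Bias-adjusted sample skewness and excess kurtosis (assumed formulas)
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {
        "cv_percent": s / mean * 100,          # CV = (S / mean) * 100%
        "z_scores": z,
        "outliers": [x for x, v in zip(values, z) if abs(v) > 3],  # |Z| > 3 rule
        "skewness": skew,
        "kurtosis": kurt,
    }

data = [1, 2, 3, 4, 5]   # hypothetical, perfectly symmetric values
summary = describe(data)
print(f"CV = {summary['cv_percent']:.1f}%")
print(f"skewness = {summary['skewness']:.2f}, kurtosis = {summary['kurtosis']:.2f}")
print("outliers:", summary["outliers"])
```

For symmetric data like this, the skewness is zero and the middle value has a Z score of 0 because it equals the mean.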
Exploring Numerical Data
i. Quartiles split the values into four equal parts
The First Quartile divides the smallest 25.0% of the values from the other 75.0%
that are larger. Q1 = (n + 1)/4 ranked value (where n = sample size)
The Second Quartile is the median; 50.0% of the values are smaller than or equal
to the median, and 50.0% are larger than or equal to the median.
The Third Quartile divides the smallest 75.0% of the values from the largest
25.0%. Q3 = 3(n + 1)/4 ranked value.
ii. The Interquartile Range (also called the midspread) measures the difference in the
center of a distribution between the third and first quartiles (Q3 – Q1).
(Descriptive statistics such as the median, Q1, Q3, and the interquartile range, which are not influenced by
extreme values, are called resistant measures.)
iii. The Five-Number Summary: Xsmallest Q1 Median Q3 Xlargest
Panels A and D are symmetrical. In these distributions, the mean and median are
equal. In addition, the length of the left tail is equal to the length of the right tail, and
the median line divides the box in half.
Panel B is left-skewed. The few small values distort the mean toward the left tail.
There is a long left tail that contains the smallest 25% of the values, demonstrating
the lack of symmetry in this data set.
Panel C is right-skewed. There is a long right tail that contains the largest 25% of the
values, demonstrating the lack of symmetry in this data set.
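A hedged Python sketch of the quartiles, interquartile range, and five-number summary. The "exclusive" method of statistics.quantiles places Q1 at the (n + 1)/4 ranked value and Q3 at the 3(n + 1)/4 ranked value; when a rank is not a whole number it interpolates linearly between neighbours, which is an assumption here and may differ from a textbook rounding rule. The data values are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical data: n = 7, so Q1 is the 2nd ranked value and Q3 the 6th
data = [4, 1, 7, 3, 6, 2, 5]
smallest, q1, median, q3, largest = five_number_summary(data)
iqr = q3 - q1                       # interquartile range = Q3 - Q1
print(smallest, q1, median, q3, largest, "IQR =", iqr)
```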
Numerical Descriptive Measures for a Population
i.
ii.
iii. The Empirical Rule states that for population data from a symmetric mound-shaped
distribution such as the normal distribution, the following are true:
Approximately 68% of the values are within ±1 standard deviation of the mean.
Approximately 95% of the values are within ±2 standard deviations of the mean.
Approximately 99.7% of the values are within ±3 standard deviations of the mean.
e.g. mean = 2.06; standard deviation = 0.02
2.06 ± 0.02 = (2.04, 2.08), i.e. about 68% of the data fall between 2.04 and 2.08
2.06 ± 2(0.02) = (2.02, 2.10), i.e. about 95% of the data fall between 2.02 and 2.10
2.06 ± 3(0.02) = (2.00, 2.12), i.e. about 99.7% of the data fall between 2.00 and 2.12
iv. Chebyshev’s Theorem states that for any data set, regardless of shape, the percentage of
values that are found within distances of k standard deviations from the mean must be at
least (1 − 1/k²) × 100%, for any k greater than 1. The theorem indicates at least what
percentage of the values fall within a given distance from the mean.
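The two rules can be compared with a small Python sketch. It reuses the mean of 2.06 and standard deviation of 0.02 from the example above; the function names are illustrative only.

```python
def interval(mean, sd, k):
    """Interval covering k standard deviations on either side of the mean."""
    return mean - k * sd, mean + k * sd

def chebyshev_min_percent(k):
    """Minimum percentage of values within k standard deviations (any shape, k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return (1 - 1 / k ** 2) * 100

empirical = {1: 68.0, 2: 95.0, 3: 99.7}   # approximate, mound-shaped data only
for k in (1, 2, 3):
    low, high = interval(2.06, 0.02, k)
    bound = f"at least {chebyshev_min_percent(k):.2f}%" if k > 1 else "not applicable"
    print(f"k={k}: ({low:.2f}, {high:.2f})  empirical ~{empirical[k]}%  Chebyshev {bound}")
```

Chebyshev's bound is lower than the empirical-rule percentage because it must hold for every distribution shape, not only symmetric mound-shaped ones.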
i. In Panel A , there is a perfect negative linear relationship between X and Y. Thus, the
coefficient of correlation, r, equals -1, and when X increases, Y decreases in a perfectly
predictable manner.
ii. Panel B shows a situation in which there is no relationship between X and Y. In this
case, the coefficient of correlation, r, equals 0, and as X increases, there is no tendency
for Y to increase or decrease.
iii. Panel C illustrates a perfect positive relationship where r equals +1. In this case, Y
increases in a perfectly predictable manner when X increases.
(Correlation alone cannot prove that there is a causation effect, that is, that the change in
the value of one variable caused the change in the other variable. Causation implies
correlation, but correlation alone does not imply causation.)
The coefficient of correlation indicates the linear relationship, or association, between
two numerical variables. When the coefficient of correlation gets closer to +1 or -1, the
linear relationship between the two variables is stronger. When it is near 0, little or no
linear relationship exists.
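A minimal Python sketch of the sample covariance and the coefficient of correlation, assuming the usual definitions (covariance divided by the product of the two sample standard deviations). The X and Y values are hypothetical and chosen to reproduce the two perfect relationships described for Panels A and C.

```python
import math

def sample_covariance(x, y):
    """Sample covariance with an n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of X with itself = variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

x = [1, 2, 3, 4]
print(correlation(x, [8, 6, 4, 2]))   # Y falls as X rises: perfect negative relationship
print(correlation(x, [2, 4, 6, 8]))   # Y rises with X: perfect positive relationship
```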
Excel Guide
i. =AVERAGE(variable cell range)
ii. =MEDIAN(variable cell range)
iii. =MODE(variable cell range)
iv. =GEOMEAN((1 + (R1)), (1 + (R2)), . . . (1 + (Rn))) − 1 to compute the geometric mean
rate of return.
v. =MIN(variable cell range)
vi. =MAX(variable cell range)
vii. =VAR.S(variable cell range) ;to compute the sample variance.
viii. =STDEV.S(variable cell range) ;to compute the sample standard deviation.
ix. =STANDARDIZE(value, mean, standard deviation) ;to compute Z scores.
x. =SKEW(variable cell range)
xi. =KURT(variable cell range)
xii. =AVERAGE(variable cell range) ;for population
xiii. =VAR.P(variable cell range) ;for population
xiv. =STDEV.P(variable cell range) ;for population
xv. =COVARIANCE.S(variable 1 cell range, variable 2 cell range)
xvi. =CORREL(variable 1 cell range, variable 2 cell range) to compute coefficient of
correlation.
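1. Simple probability refers to the probability of occurrence of a simple event, P(A) = X/T,
where X is the number of outcomes in which the event occurs and T is the total number of
possible outcomes.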
The Binomial Distribution models the number of events of interest (successes) in a fixed
number of trials, or observations, when each trial has the same probability of attaining one
particular value. It determines the probability of observing a specified number of
successful outcomes in a specified number of trials.
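A short Python sketch of the binomial probability, assuming the standard formula P(X = x) = C(n, x) p^x (1 − p)^(n − x) for n independent trials with success probability p. The values n = 4 and p = 0.1 are hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials, each with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical example: 4 trials, each with a 0.1 probability of success
n, p = 4, 0.1
for x in range(n + 1):
    print(f"P(X = {x}) = {binomial_pmf(x, n, p):.4f}")
```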
Poisson Distribution
It is a probability distribution that is used to show how many times an event is likely to
occur over a specified period. It is often used to understand independent events that
occur at a constant rate within a given interval of time.
The Poisson distribution is used for computing the probability of X = x events, given that
lambda events are expected.
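A short Python sketch of the Poisson probability, assuming the standard formula P(X = x) = e^(−lambda) lambda^x / x!. The expected count of 3 events per interval is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x events when lam events are expected in the interval."""
    return exp(-lam) * lam ** x / factorial(x)

# Hypothetical example: 3 events expected per interval
lam = 3
for x in range(6):
    print(f"P(X = {x}) = {poisson_pmf(x, lam):.4f}")
```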
Excel Guide
=SUMPRODUCT(X cell range, P(X = xi) cell range) to compute the expected value.
=SUMPRODUCT(squared differences cell range, P(X = xi) cell range) to compute
the variance.
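The two SUMPRODUCT formulas above correspond to the following Python sketch, which assumes the usual definitions of the expected value and variance of a discrete variable. The outcomes and probabilities are hypothetical.

```python
def expected_value(xs, ps):
    """Sum of each outcome times its probability."""
    return sum(x * p for x, p in zip(xs, ps))

def variance(xs, ps):
    """Sum of each squared difference from the expected value times its probability."""
    mu = expected_value(xs, ps)
    return sum((x - mu) ** 2 * p for x, p in zip(xs, ps))

# Hypothetical discrete distribution: outcomes and their probabilities
outcomes = [0, 1, 2]
probabilities = [0.25, 0.5, 0.25]
print("expected value:", expected_value(outcomes, probabilities))
print("variance:", variance(outcomes, probabilities))
```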
◦ In a quantile–quantile plot, the Z values are plotted on the X axis, and the corresponding
values of the variable are plotted on the Y axis.
Excel Guide
=RAND() ;to create lists of random numbers.
Confidence Interval Estimate for the Mean (population standard deviation Known)
Confidence Interval Estimate for the Mean (population standard deviation Unknown)
◦ Student’s t distribution. This expression has the same form as the Z statistic,
except that S is used to estimate the unknown population SD.
▪ You find the critical values of t for the appropriate degrees of freedom from
the table of the t distribution (see Table E.3).
▪ For example, with 99 degrees of freedom, if you want 95% confidence,
you find the appropriate value of t. The 95% confidence level means that
2.5% of the values (an area of 0.025) are in each tail of the distribution.
Looking in the column for a cumulative probability of 0.975 and an
upper-tail area of 0.025 in the row corresponding to 99 degrees of
freedom gives you a critical value for t of 1.9842.
◦ Degrees of freedom = n – 1
E.g. with 95% confidence, you conclude that the mean amount of all the sales
invoices is between $104.53 and $116.01. The 95% confidence level indicates that if you
selected all possible samples of 100 (something that is never done in practice), 95% of the
intervals developed would include the population mean somewhere within the interval. The
validity of this confidence interval estimate depends on the assumption of normality for the
distribution of the amount of the sales invoices. With a sample of 100, the normality
assumption is not overly restrictive, and the use of the t distribution is likely appropriate.
(The interpretation of the confidence interval when population SD is unknown is the same as when it is known )
Excel Guide
Confidence Interval Estimate for the Mean (population SD Known)
=NORM.S.INV(cumulative percentage) to compute the Z value for one-half of the (1 - alpha)
value.
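A hedged Python sketch of the confidence interval for the mean when the population standard deviation is known, assuming the standard form X̄ ± Z × sigma / sqrt(n). NormalDist().inv_cdf plays the role that NORM.S.INV plays in the worksheet. The sample mean, sigma, and n are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper critical value of the standard normal
    half_width = z * sigma / sqrt(n)          # sampling error
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical inputs: sample mean 50, sigma 10, n = 100
low, high = z_confidence_interval(50, 10, 100, confidence=0.95)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```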
SUMMARY (critical value approach):
Step 1: State the null hypothesis, H0, and the alternative hypothesis, H1.
Step 2: Choose the level of significance, α, and the sample size, n. The level of significance is based
on the relative importance of the risks of committing Type I and Type II errors in the problem.
Step 3: Determine the appropriate test statistic and sampling distribution.
Step 4: Determine the critical values that divide the rejection and non-rejection regions.
Step 5: Collect the sample data, organize the results, and compute the value of the test statistic.
Step 6: Make the statistical decision, determine whether the assumptions are valid, and state the
managerial conclusion in the context of the theory, claim, or assertion being tested. If the test statistic
falls into the non-rejection region, you do not reject the null hypothesis. If the test statistic falls into the
rejection region, you reject the null hypothesis.
▪ The p-value is the probability of getting a test statistic equal to or more extreme than
the sample result, given that the null hypothesis, H0, is true. The p-value is also
known as the observed level of significance.
If the p-value is greater than or equal to alpha i.e. level of significance, do not
reject the null hypothesis.
If the p-value is less than alpha, reject the null hypothesis.
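The p-value decision rule can be sketched in Python for a two-tail Z test of the mean. This assumes the population standard deviation is known and the usual test statistic Z = (X̄ − mu0) / (sigma / sqrt(n)). All input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_tail_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (Z test statistic, two-tail p-value, whether H0 is rejected)."""
    z_stat = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a statistic at least this extreme in either tail, given H0 is true
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value, p_value < alpha

# Hypothetical inputs: H0 states the mean is 100, sigma = 10, n = 25
for sample_mean in (103, 105):
    z_stat, p_value, reject = two_tail_z_test(sample_mean, 100, 10, 25)
    decision = "reject H0" if reject else "do not reject H0"
    print(f"mean={sample_mean}: Z={z_stat:.2f}, p-value={p_value:.4f} -> {decision}")
```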
SUMMARY (p-value approach):
Step 1: State the null hypothesis, H0, and the alternative hypothesis, H1.
Step 2: Choose the level of significance, α, and the sample size, n. The level of significance is based
on the relative importance of the risks of committing Type I and Type II errors in the problem.
Step 3: Determine the appropriate test statistic and the sampling distribution.
Step 4: Collect the sample data, compute the value of the test statistic, and compute the p-value.
Step 5: Make the statistical decision and state the managerial conclusion in the context of the theory,
claim, or assertion being tested. If the p-value is greater than or equal to α, do not reject the null
hypothesis. If the p-value is less than α, reject the null hypothesis.
p-Value Approach
One-Tail Tests
The Critical Value Approach
Excel Guide
Fundamentals of Hypothesis-Testing Methodology
=NORM.S.INV(level of significance/2) and =NORM.S.INV(1 – level of
significance/2) functions to compute the lower and upper critical values.
=NORM.S.DIST(absolute value of the Z test statistic, True) as part of a formula,
=2*(1 – NORM.S.DIST(absolute value of the Z test statistic, True)), to compute the
two-tail p-value.
t Test of Hypothesis for the Mean (population SD Unknown)
=T.INV.2T(level of significance, degrees of freedom) function to compute the lower and
upper critical values and use
=T.DIST.2T(absolute value of the t test statistic, degrees of freedom) to compute the p-
value
One-Tail Tests
(Z test for the mean)
=NORM.S.INV(level of significance) and =NORM.S.INV(1 – level of significance)
functions to compute the lower and upper critical values.
=NORM.S.DIST(Z test statistic, True) and 1 – NORM.S.DIST(Z test statistic, True)
to compute the lower-tail and upper-tail p-values.
(t test for the mean)
=–T.INV.2T(2 * level of significance, degrees of freedom) and =T.INV.2T(2 * level
of significance, degrees of freedom) functions to compute the lower and upper
critical values.
Use an IF function that tests the t test statistic to determine how
=T.DIST.RT(absolute value of the t test statistic, degrees of freedom) or =1 –
T.DIST.RT(absolute value of the t test statistic, degrees of freedom) are used to
compute the lower-tail and upper-tail p-values.
Test of Hypothesis for the Proportion
=NORM.S.INV(level of significance/2) and =NORM.S.INV(1 – level of significance/2)
functions to compute the lower and upper critical values.
=NORM.S.DIST(absolute value of the Z test statistic, True) as part of a formula to
compute the p-value.
◦ For a given level of significance, α, in a two-tail test, you reject the null hypothesis if the
computed tSTAT test statistic is greater than the upper-tail critical value from the t
distribution or if the computed tSTAT test statistic is less than the lower-tail critical value
from the t distribution.
Excel Guide:
◦ Comparing the Means of Two Independent Populations
▪ Pooled-Variance t Test for the Difference Between Two Means
=T.INV.2T(level of significance, total degrees of freedom) to compute the lower and
upper critical values.
=T.DIST.2T(absolute value of the t test statistic, total degrees of freedom) to
compute the p-value.
▪ t Test for the Difference Between Two Means, Assuming Unequal Variances
=T.INV.2T(level of significance, degrees of freedom) to compute the lower and upper
critical values.
=T.DIST.2T(absolute value of the t test statistic, degrees of freedom) to compute
the p-value.
◦ F Test for the Ratio of Two Variances
=F.INV.RT(level of significance / 2, population 1 sample degrees of freedom,
population 2 sample degrees of freedom) function to compute the upper critical value.
=F.DIST.RT(F test statistic, population 1 sample degrees of freedom, population
2 sample degrees of freedom) function to compute the p-values.
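The worksheet functions above need a test statistic and degrees of freedom as inputs. The Python sketch below shows how those inputs might be computed, assuming the standard textbook formulas and a null hypothesis of no difference between the two population means. The function names and all sample values are hypothetical.

```python
from math import sqrt

def pooled_variance_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic and total degrees of freedom, assuming equal population variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t_stat = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_stat, df

def separate_variance_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic and approximate degrees of freedom, assuming unequal variances."""
    a, b = var1 / n1, var2 / n2
    t_stat = (mean1 - mean2) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))  # may be fractional
    return t_stat, df

def f_ratio(var1, n1, var2, n2):
    """F statistic for the ratio of two sample variances, with both degrees of freedom."""
    return var1 / var2, n1 - 1, n2 - 1

# Hypothetical samples: means 10 and 8, both variances 4, both sizes 10
print(pooled_variance_t(10, 4, 10, 8, 4, 10))
print(separate_variance_t(10, 4, 10, 8, 4, 10))
print(f_ratio(4, 10, 4, 10))
```

With equal sample sizes and equal sample variances, as in this example, the two t tests give the same statistic and the same degrees of freedom. They differ once the variances or sample sizes diverge.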