Lesson 6 Advanced Statistics
DISCUSSION:
A measure of central tendency is a value used to represent the typical or “average” value in a
data set.
Mean – the sum of all data values divided by the number of values in the data set. The
mean of a sample data set is denoted by $\bar{x}$ and the mean of a population data set by the
Greek letter $\mu$.
$$\bar{x} = \frac{\sum x}{n} \qquad \mu = \frac{\sum x}{N}$$
Exercise: Find the mean of the following data set:
Median – the value which separates the largest 50% of data values from the lowest 50%.
To calculate the median, place data values in number order. If n is odd, the middle value
is the median. If n is even, the mean of the two middle values is the median.
Exercise: Find the median value for the set of quiz scores.
Find the median if the low score of 1 is dropped.
Mode – the data value (or values) which appears the largest number of times in the set.
If no data value is repeated, we say that there is no mode.
One drawback of the mean is that it is heavily influenced by a few very high or very low
data values. In these cases it is more common to use the median.
The mode has the advantage that it can be used to measure data sets even if they contain
only qualitative data. A disadvantage is that a data set may not have a mode.
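The three measures can be checked quickly in code. The following is a minimal sketch using Python's standard statistics module; the quiz scores are made-up values for illustration and are not the data from the exercises above.

```python
import statistics

# Hypothetical quiz scores (illustrative only, not from the exercises above)
scores = [1, 6, 7, 7, 8, 9, 10]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
modes = statistics.multimode(scores)      # every value tied for most frequent

print(mean_score, median_score, modes)

# Dropping the low score of 1 shifts the mean more than the median
trimmed = [s for s in scores if s != 1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)
print(trimmed_mean, trimmed_median)
```

Here multimode returns a list, so a data set with several modes reports all of them.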
Weighted Means
A weighted mean is used when we want some data values in a set to factor more often into the
calculation of the mean than others.
UNIVERSITY OF CAGAYAN VALLEY
(Formerly Cagayan Colleges Tuguegarao)
Tuguegarao City, Cagayan, Philippines
SCHOOL OF LIBERAL ARTS AND TEACHER EDUCATION
In this case, we attach a numerical weight to each value and calculate the mean as follows:
$$\bar{x} = \frac{\sum (x \cdot w)}{\sum w}$$
Note: This is equivalent to counting each data value the number of times given by its weight.
Examples:
Grade point average. We assign the letter grades the number values A=4, B=3, C=2,
D=1, F=0, and then each grade value is counted into the GPA according to the number of
credits earned with that grade.
Course grade. The final grade in this course is calculated according to the following
scale: Homework counts for 15%, 3 exams count 20% each, and the final exam is worth
25%. We can weight the score for each component of the final grade with its percentage
to calculate the final grade.
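As a rough sketch of the course grade example, the weighted mean can be computed as below. The component scores are hypothetical; only the weights (15%, three exams at 20% each, and 25%) come from the text, and the helper name weighted_mean is chosen here for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

# Hypothetical component scores: homework, exam 1, exam 2, exam 3, final exam
component_scores = [90, 80, 70, 85, 75]
# Weights from the course grade example, as percentages
component_weights = [15, 20, 20, 20, 25]

final_grade = weighted_mean(component_scores, component_weights)
print(final_grade)
```

The same function works for a grade point average if the values are the grade points (A=4, B=3, and so on) and the weights are the credits earned with each grade.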
Summary
The Mean is used in computing other statistics (such as the variance) and does not exist for open
ended grouped frequency distributions (1). It is often not appropriate for skewed distributions
such as salary information.
The Median is the middle value and is good for skewed distributions because it is resistant to
extreme values.
The Mode is used to describe the most typical case. The mode can be used with nominal data
whereas the others can't. The mode may or may not exist and there may be more than one value
for the mode (2).
The Midrange, the mean of the lowest and highest values, is not used very often. It is a very
rough estimate of the average and is greatly affected by extreme values (even more so than the
mean).
Now that we understand the three different ways to calculate the center of the
distribution, the natural question is which should be used to describe the center of a particular
distribution. For normal distributions, it turns out that the mean, median and mode are all the
same, so it does not really matter which you use. However, for a normal distribution, we usually
use the mean to represent the center of the distribution.
For distributions that are close to normal, the mean is still the appropriate value because it
will best represent the center of the distribution. Because most of the distributions in the social
sciences are relatively normal, the mean is by far the most common measure of the center of a
distribution.
For skewed distributions, however, the mean is not the best indicator of the center of the
distribution. Instead, for a skewed distribution, either the median or the mode better represents
the center of the distribution. Typically, the mode is used to indicate the center of strongly
skewed distributions. The median is appropriate when the skew is less severe.
Range
The range represents one method for describing the dispersion of the data. It is calculated by
subtracting the smallest value from the largest value in the data set. In general, the range
provides some useful information, but it tells us only about the two most extreme scores in the
data set. We would prefer to consider all the values in the data set when determining how
spread out the scores are. Therefore, the range is infrequently used to describe the dispersion of
data sets.
The range is the simplest measure of variation to find. It is simply the highest value minus the
lowest value.
Since the range uses only the largest and smallest values, it is greatly affected by extreme
values; that is, it is not resistant to outliers.
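A small sketch of this sensitivity, using a made-up data set (the values are assumptions for illustration only):

```python
# Hypothetical data set (illustrative only)
data = [12, 15, 15, 18, 20]
data_range = max(data) - min(data)  # highest value minus lowest value

# Adding a single extreme value changes the range dramatically
with_outlier = data + [95]
outlier_range = max(with_outlier) - min(with_outlier)

print(data_range, outlier_range)
```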
Variance
Much more commonly, the dispersion of a set of data is represented using the variance
because this descriptive statistic considers all the scores in the data set. The variance examines
how far, on average, each score is away from the mean.
The variance is symbolized $s^2$. There are two methods for calculating the variance. The
first method uses the conceptual/definitional formula, which clearly demonstrates the underlying
logic of calculating the variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{N}$$
With this formula, however, there are many more calculations and a much greater chance of
making a simple arithmetic error and ending up with an incorrect variance. Because of this, we do
not normally use the conceptual formula on data sets. Working through it once should help
clarify the conceptual process of identifying the variance. After this example, though, we will
only use the second formula. The second formula is the calculation formula and should be used
whenever the variance needs to be calculated for a set of data.
We will therefore always use the calculation formula to compute the variance. This
formula requires fewer calculations, is faster, and is less influenced by rounding. While the
benefits of the calculation formula are significant, they come with a price: with the calculation
formula it is not as clear why each step is necessary. Be assured, however, that both formulas
will produce exactly the same variance values. The calculation formula is:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$$
In this formula, $\sum x^2$ is the sum of the squared scores, $(\sum x)^2$ is the squared sum
of the scores, and $N$ represents the total number of scores in the data set. There are four steps
for calculating the variance using the calculation formula.
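The claim that the two formulas agree can be checked with a short sketch. It assumes both formulas divide by N, which is what the numbers in the worked example in this lesson imply; the function names and the data set are hypothetical.

```python
def variance_definitional(scores):
    """Mean of the squared deviations from the mean (divides by N)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / n

def variance_calculation(scores):
    """(sum of x^2 - (sum of x)^2 / N) / N, using only two column sums."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    return (sum_x2 - sum_x ** 2 / n) / n

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
print(variance_definitional(sample), variance_calculation(sample))
```

The calculation version needs only the sum of the scores and the sum of the squared scores, which is why a two-column table is enough when working by hand.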
One would expect the sample variance to simply be the population variance with the
population mean replaced by the sample mean. However, one of the major uses of a statistic is to
estimate the corresponding population parameter, and a variance computed this way tends to
underestimate the population variance. To counteract this, the sum of the squares of the
deviations is divided by one less than the sample size.
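The effect of the two divisors can be seen with Python's standard statistics module, where pvariance divides by the number of values and variance divides by one less than that. This is only an illustrative sketch; the data set is made up.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

# Sum of squared deviations divided by n
divide_by_n = statistics.pvariance(sample)
# Sum of squared deviations divided by n - 1
divide_by_n_minus_1 = statistics.variance(sample)

print(divide_by_n, divide_by_n_minus_1)
```

Dividing by the smaller number always gives the larger result, and the gap shrinks as the sample size grows.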
Let's try an example of calculating the variance and standard deviation using the formula.
If you were interested in determining the depression level of the people you interact with each
day, you could have each of the eight people complete the Goldberg Depression Inventory
(GDI). This measure yields scores that range between 0 (not depressed) and +7 (extremely
depressed). Using the scores of our eight participants, we can calculate the variance and standard
deviation of our sample.
First, we should organize the data into a table. In this step, we should also note that N = 8.
x      x^2
7      49
6      36
5      25
4      16
4      16
3      9
1      1
0      0
The next step is to sum the score ($x$) and squared score ($x^2$) columns. The sum of the
scores is 30 and the sum of the squared scores is 154. These are shown at the bottom of the
following table:
       x      x^2
       7      49
       6      36
       5      25
       4      16
       4      16
       3      9
       1      1
       0      0
Sum    30     154
Step 4: Calculate the Variance
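The next step is to use the calculation formula to compute the variance. To do this, we
will need the three values we have calculated: $N = 8$, $\sum x = 30$, and $\sum x^2 = 154$. The
resulting variance, 5.19, is shown in the following calculation:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} = \frac{154 - \frac{30^2}{8}}{8} = \frac{154 - 112.5}{8} = \frac{41.5}{8} \approx 5.19$$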
This tells us that the average squared distance of a score from the mean is 5.19 squared
units. However, in most cases the squared units make the variance difficult to understand.
Therefore, we usually calculate the standard deviation, which is a measure of dispersion
expressed in the original units of measurement.
The final step is to take the square root of the variance to calculate the standard deviation,
2.28, as shown in the calculation below:
$$s = \sqrt{s^2} = \sqrt{5.19} \approx 2.28$$
The primary reason for calculating the standard deviation is that our measure of dispersion is
now expressed in the same units of measurement as the original data, whereas the variance is
expressed in squared units. Therefore, the standard deviation is commonly reported as the
measure of dispersion for a data set.
So, now we know that, based on the GDI, scores in our data set typically fall about 2.28
units away from the mean score of 3.75. This is a fairly spread out data set because the possible
range of scores is 0 to +7.
The standard deviation indicates the spread of scores away from the mean. If SD is large,
it means the scores have a wide scatter away from the mean. It indicates that there is a wide
variation of scores among the group, suggesting heterogeneity of group composition.
If the SD is small, it indicates that there is a narrow spread of scores around the mean. It
means there is little scatter of scores from the mean, suggesting that group members are
homogeneous, that is, they have very similar abilities.
Excel makes calculating statistics much easier than ever before. It takes only a few keystrokes
and clicks to get just about any type of statistical measurement or graph from a data set.
Excel is preloaded with statistical functions that can help you find the mean, median, mode,
variance and many more statistical measurements. Aside from Excel's functions, the program
also allows users to install a Data Analysis ToolPak add-in that is used to perform
many types of calculations at once. This tutorial shows an Excel user how to use the Data
Analysis tool to find descriptive statistics and explains the results.
If you have never used the Data Analysis ToolPak, it is probably inactive in your copy of Excel.
You can check whether you have it by first clicking on the Data tab. Next, look for the
Analysis group on the far-right side of your screen. If the Data Analysis option does not exist,
use the following steps to activate this add-in.
1. Click on the File tab, followed by clicking on Options. Next, click on “Add-Ins.”
2. Next, in the Manage box, select “Excel Add-ins” and click on the “Go” button.
3. Lastly, check the “Analysis ToolPak” box and click “OK.”
4. You should now be ready to use the Data Analysis ToolPak from the Data tab in the
Analysis group.
If you are following along with this example in an Excel worksheet, type this data set into Excel
vertically, one value per cell.
Click on “Data Analysis” in the Data tab and then click on Descriptive Statistics in the dialog
box. Click the OK button.
Next, the range of the data needs to be typed in the input range section of the dialog box. Choose
the output range option and choose a cell for the output to display by typing that cell location in
the blank field. Lastly, click in the Summary statistics checkbox and click OK to display the
results.
The Results
The results print in two columns. The first column names each descriptive statistic and the
second column shows the result for that statistic. In the following sections I will describe
what these descriptive statistics represent.
WEIGHTED AVERAGE
$$\text{Weighted Average} = \frac{20 + 40 + 40 + 90 + 90 + 90}{6} = \frac{370}{6} \approx 61.67$$
3. We can use the SUMPRODUCT function in Excel to calculate the number above the fraction line (370).
Note: the SUMPRODUCT function performs this calculation: (20 * 1) + (40 * 2) + (90 * 3) = 370.
4. We can use the SUM function in Excel to calculate the number below the fraction line (6).
5. Use the functions at step 3 and step 4 to calculate the weighted average of these scores in Excel.
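For comparison, the same calculation can be sketched outside Excel. The scores (20, 40, 90) and weights (1, 2, 3) are taken from the note in step 3; the variable names are chosen here for illustration.

```python
scores = [20, 40, 90]
weights = [1, 2, 3]

# Numerator: the calculation SUMPRODUCT performs, (20 * 1) + (40 * 2) + (90 * 3)
numerator = sum(s * w for s, w in zip(scores, weights))
# Denominator: the calculation SUM performs over the weights
denominator = sum(weights)

weighted_average = numerator / denominator
print(numerator, denominator, round(weighted_average, 2))
```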
EXERCISE 5
1. 87, 23, 22, 35, 25, 12, 24, 55, 34, 62, 88, 80, 79, 60, 62
a) Find the mean, median and mode using EXCEL.
b) Find the range, variance and standard deviation using EXCEL.
3. Sarah has a supermarket and she earns a profit of P7,000 from groceries, P12,000
from vegetables, P5,000 from dairy products and P3,000 from fruits.
She wants to predict her profit for the next month. She assigns weights of 8 to groceries,
5 to vegetables, 8 to dairy products and 6 to fruits.