
Descriptive Statistics

The document discusses descriptive statistics and methods for organizing and presenting quantitative data, including frequency distribution tables, histograms, stem-and-leaf plots, and line graphs. Numerical data can be grouped into classes and frequencies tallied. Graphs provide visual representations of patterns in data.

Uploaded by

Jet jet Gonzales
Copyright
© © All Rights Reserved

3 DESCRIPTIVE STATISTICS

By the end of the learning experience, students must be able to:


1. Recognize the different methods of data presentation
2. Organize data by constructing a frequency distribution table.
3. Identify the most appropriate method of data presentation for a given set of data.
4. Identify and compute appropriate numerical descriptive measures.
5. Interpret these numerical descriptive measures.
6. Construct and interpret a boxplot.
7. Make use of Microsoft Excel in computing numerical descriptive measures.

Figure 3.1 When you have large amounts of data, you will need to organize it in a way that makes
sense. These ballots from an election are rolled together with similar ballots to keep them
organized. (credit: William Greeson)

Once you have collected data, what will you do with it? Data can be described and
presented in many different formats. They can be presented in three forms: (1) textual method;
(2) tabular method; and (3) graphical method.

A textual presentation of data is an expository form describing a set of information. This
is a useful manner of presenting limited amounts of information.

Example 3.1
A total of 22.4 million children aged 5-17 years old in 9.6 million households were estimated
from the 1995 National Survey of Working Children (NSWC). Sixteen percent (16%) or 3.6 million
children were reported engaged in economic activities anytime in 1995. Boys were more likely to
work than girls with a national sex ratio of children of 187…
Tabular Presentation

In the tabular presentation, information is entered into the appropriate row and/or
column categories. A summary table helps you see the differences among the categories by
displaying the frequency, amount, or percentage of items in a set of categories in a separate
column.

Example 3.2 Categorical Data

Table 3.1 shows a summary table that tallies the responses of congressmen on the renewal of
ABS-CBN franchise (hypothetical data).

Table 3.1 Congressmen opinion on the renewal of ABS-CBN franchise.

Opinion Frequency

In favor 11

Not in favor 70

From the table above, you can conclude that more than ¾ of the congressmen tallied (70 of 81,
or about 86%) do not favor the renewal of the ABS-CBN franchise.

Numerical data are organized by creating ordered arrays or distributions. One way of doing it is
the frequency distribution table (FDT). A frequency distribution summarizes numerical values
by tallying them into a set of numerically ordered classes. Classes are groups that represent a
range of values, called a class interval. Each value can be in only one class and every value must
be contained in one of the classes. The steps in constructing a frequency distribution table are the
following:

1. Arrange the data in an ordered array. An ordered array arranges the values of a numerical
variable in rank order, from the smallest value to the largest value. An ordered array helps
you get a better sense of the range of values in your data and is particularly useful when
you have more than a few values.
2. Determine the range, 𝑅, of the data set, where 𝑅 is defined as:
𝑅 = highest observed value – lowest observed value
3. Determine the number of classes, 𝑘, using the formula 𝑘 = √𝑁, where 𝑁 is the number of
observations in the data set. Round off 𝑘 to the nearest integer.
4. Determine the class size, 𝑐, by the formula 𝑐 = 𝑅/𝑘. Round up c to the nearest integer.
5. Construct the classes as follows. Each class is an interval of the values defined by its lower
and upper class limits.
The lower limit (LL) of the lowest class conventionally takes the lowest value. The lower
limit of the succeeding class is obtained by simply adding c to the lower limit of the preceding
class. Example: 𝐿𝐿2 = 𝐿𝐿1 + 𝑐
The upper limit (UL) of the lowest class is obtained by subtracting one unit of measure from
the lower limit of the next class. Example: 𝑈𝐿1 = 𝐿𝐿2 − one unit of measure. (Note: the unit of
measure for whole numbers, say 45, is 1; for numbers like 45.6 and 24.3, the unit of measure is 0.1.)
The upper limit of the succeeding class is obtained by simply adding c to the upper limit of the
preceding class. Example: 𝑈𝐿2 = 𝑈𝐿1 + 𝑐
6. Tally the observations to determine the class frequencies.
Therefore, the quantitative FDT consists of at least two columns: the first column
defining the classes and the second column showing the frequency of each class. However,
additional columns may be added to show additional information. These are:
a. True Class Boundaries (TCB) – a more precise expression of the class limits. They
remove the discontinuity between classes, the upper class boundary of a particular class
being the lower class boundary of the next higher interval.
𝐿𝑇𝐶𝐵 = 𝐿𝐿 − (0.5)(𝑢𝑛𝑖𝑡 𝑜𝑓 𝑚𝑒𝑎𝑠𝑢𝑟𝑒)
𝑈𝑇𝐶𝐵 = 𝑈𝐿 + (0.5)(𝑢𝑛𝑖𝑡 𝑜𝑓 𝑚𝑒𝑎𝑠𝑢𝑟𝑒)
b. Class Mark (CM) – the midpoint of the class limits or class boundaries. It is considered to
be the representative value of the class interval.
$CM = \frac{LL + UL}{2}$ or $CM = \frac{LTCB + UTCB}{2}$
c. Relative Frequency (RF) – the frequency of a class expressed as a proportion of the
total number of observations.
$RF = \frac{\text{frequency of each class}}{\text{total frequency}}$
d. Cumulative Frequency (CF) – the accumulated frequency of a class. There are two
kinds of cumulative frequencies: the “less than” CF and the “greater than” CF.
The <CF of a given class is the number of observations less than or equal to the upper
limit of the class. The >CF of a given class is the number of observations greater than or
equal to the lower limit of the class.

Example 3.3
The following are the total costs ($) for four tickets, two beers, four soft drinks, four hot dogs,
two game programs, two baseball caps, and parking for one vehicle at each of the 30 Major League
Baseball parks during the 2009 season: 164, 326, 224, 180, 205, 162, 141, 170, 411, 187,
185, 165, 151, 166, 114, 158, 305, 145, 161, 170, 210, 222, 146, 259, 220, 135, 215, 172, 223, 216.
Construct an FDT.
Source: Data extracted from teammarketing.com, April 1, 2009

Solution:
Step 1. Arrange the data in an ordered array.
114, 135, 141, 145, 146, 151, 158, 161, 162, 164, 165, 166, 170, 170, 172, 180, 185, 187, 205,
210, 215, 216, 220, 222, 223, 224, 259, 305, 326, 411
Step 2. Determine the range.
𝑅 = highest observed value – lowest observed value
𝑅 = 411 − 114 = 297
Step 3. Determine the number of classes, 𝑘.
$k = \sqrt{30} \approx 5.48$, rounded off to 5
Step 4. Determine the class size, 𝑐.
$c = \frac{R}{k} = \frac{297}{5} = 59.4$, rounded up to 60

Table 3.2 Frequency distribution of the total cost on the 30 Major League Baseball parks during
the 2009 season.

Class Interval   Frequency   TCB             CM      Relative Frequency (%)   <CF   >CF
114 - 173        15          113.5 – 173.5   143.5   50.00                    15    30
174 - 233        11          173.5 – 233.5   203.5   36.67                    26    15
234 - 293        1           233.5 – 293.5   263.5   3.33                     27    4
294 - 353        2           293.5 – 353.5   323.5   6.67                     29    3
354 - 413        1           353.5 – 413.5   383.5   3.33                     30    1
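
The construction steps above can be sketched in Python. This is only an illustrative sketch, assuming whole-number data (unit of measure 1); the function name `build_fdt` and the dictionary keys are made up for this example and are not part of the text.

```python
import math

costs = [164, 326, 224, 180, 205, 162, 141, 170, 411, 187,
         185, 165, 151, 166, 114, 158, 305, 145, 161, 170,
         210, 222, 146, 259, 220, 135, 215, 172, 223, 216]

def build_fdt(data, unit=1):
    """Build a frequency distribution table following Steps 1 to 6 (whole-number data assumed)."""
    ordered = sorted(data)                 # Step 1: ordered array
    n = len(ordered)
    r = ordered[-1] - ordered[0]           # Step 2: range
    k = round(math.sqrt(n))                # Step 3: number of classes
    c = math.ceil(r / k)                   # Step 4: class size, rounded up
    rows, ll, below = [], ordered[0], 0
    # Step 5: keep adding classes of size c until the largest value is covered
    while ll <= ordered[-1]:
        ul = ll + c - unit
        freq = sum(ll <= x <= ul for x in ordered)   # Step 6: tally
        rows.append({
            "limits": (ll, ul),
            "freq": freq,
            "tcb": (ll - unit / 2, ul + unit / 2),
            "cm": (ll + ul) / 2,
            "rf_pct": round(100 * freq / n, 2),
            "lt_cf": below + freq,         # observations <= upper limit
            "gt_cf": n - below,            # observations >= lower limit
        })
        below += freq
        ll += c
    return rows

for row in build_fdt(costs):
    print(row)
```

For the baseball data this reproduces the five classes of Table 3.2. The loop continues until the largest value is covered, which is a small safeguard added here for the case where 𝑘 classes of size 𝑐 fall just short of the maximum.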


Graphical Method
The graphical method allows for the creativity of the researcher; it is a tool that helps you
learn about the shape or distribution of a sample or a population. A graph can be a more effective
way of presenting data than a mass of numbers because we can see where data clusters and
where there are only a few data values. Some of the types of graphs that are used to summarize
and organize data are the dot plot, the bar graph, the histogram, the stem-and-leaf plot, the
frequency polygon (a type of broken line graph), the pie chart, and the box plot.

Stem-and-Leaf Graphs (Stemplots)

The stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis.
It is a good choice when the data sets are small. To create the plot, divide each observation of
data into a stem and a leaf. The leaf consists of a final significant digit. The stemplot is a quick
way to graph data and gives an exact picture of the data. You want to look for an overall pattern
and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is
sometimes called an extreme value. When you graph an outlier, it will appear not to fit the
pattern of the graph.

Example 3.4
For Susan Dean's spring pre-calculus class, scores for the first exam were as follows (smallest to
largest):
33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94;
94; 94; 94;96; 100

The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the
31 scores, or approximately 26% (8/31), were in the 90s or 100, a fairly high number of As.
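
Since the plot itself is not reproduced here, the following Python sketch shows one way to build a stemplot for these scores. It assumes the stem is every digit except the last and the leaf is the last digit; the function name `stemplot` is illustrative only.

```python
from collections import defaultdict

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

def stemplot(values):
    """Map each stem (all digits but the last) to its sorted list of leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    # Keep empty stems too, so that gaps in the data remain visible
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

plot = stemplot(scores)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(map(str, leaves))}")
```

Counting the leaves on stems 9 and 10 gives the eight scores in the 90s or 100 mentioned above.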

Line graph

A line graph is useful for displaying specific data values. The x-axis (horizontal axis) consists of
data values and the y-axis (vertical axis) consists of frequency points. The frequency points are
connected using line segments.

Example 3.5
In a survey, 40 mothers were asked how many times per week a teenager must be
reminded to do his or her chores. The results are shown in Table 3.3 and in Figure 3.2.
Table 3.3 Frequency distribution of the number of times a teenager is reminded of his or her
chores per week.

Number of times teenager is reminded Frequency

0 2

1 5

2 8

3 14

4 7

5 4

Figure 3.2 Line graph of the number of times a teenager is reminded of his or her chores per week.

Bar Graphs
Bar graphs consist of bars that are separated from each other. The bars can be rectangles
or they can be rectangular boxes (used in three-dimensional plots), and they can be vertical or
horizontal.

Example 3.6
By the end of 2011, Facebook had over 146 million users in the United States. Table 3.4
shows three age groups, the number of users in each age group, and the proportion (%) of users
in each age group. Construct a bar graph using this data.

Table 3.4 Facebook users in the US.


Age groups   Number of Facebook users   Proportion (%) of Facebook users
13–25        65,082,280                 45%
26–44        53,300,200                 36%
45–64        27,885,100                 19%

Figure 3.3 Bar graph of the Facebook users in the US.

Pie Chart
A pie chart (or a circle graph) is a circular chart divided into sectors, illustrating relative
magnitudes or frequencies. In a pie chart, the arc length of each sector (and consequently its
central angle and area), is proportional to the quantity it represents. Together, the sectors create
a full disk. It is named for its resemblance to a pie which has been sliced.

Figure 3.4 Pie chart of the levels of risk of bond mutual funds.

Reviewing Figure 3.4, you see that a little more than one-third of the
funds are average risk, about one-third are above average risk, and fewer than one-third are
below-average risk.

Histograms, Frequency Polygons, and Ogive


A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a
vertical axis. The horizontal axis is labeled with what the data represents (for instance, distance
from your home to school). The vertical axis is labeled either frequency or relative frequency (or
percent frequency or probability). The graph will have the same shape with either label. The
histogram (like the stemplot) can give you the shape of the data, the center, and the spread of
the data.
Steps:
1. Label the X axis
a. It is best to label the X axis with the real limits, beginning with the lowest real limit
b. Remember also to provide a label that tells the units
2. Label the Y axis
a. Begin at 0 if you can and label up to the largest frequency
b. Remember to also put the label "Frequency" next to the numeric labels
3. Next we will plot the first bar
a. The height of the bar is equal to the frequency in the interval
b. The width of the bar is equal to the class width or class size
4. Then we finish the histogram by plotting the rest of the bars

Hint:
• Remember that the graph is unreadable without proper labels.
• If you leave off just one of the labels the entire graph is wrong

Example 3.7: Consider the example on frequency distribution in Table 3.2

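[Figure: histogram with Frequency on the vertical axis and Total Cost ($) on the horizontal axis. The bars span the class boundaries 113.5, 173.5, 233.5, 293.5, 353.5, and 413.5, with heights 15, 11, 1, 2, and 1.]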

Figure 3.5 Frequency histogram of the total cost on the 30 Major League Baseball parks during
the 2009 season.

Frequency Polygon

A frequency polygon is almost identical to a histogram and is used to compare sets of data or
to display a cumulative frequency distribution. It uses a line graph to represent quantitative data.

Steps:
1. To create a frequency polygon we use only the frequency and the midpoint
2. We need to create two extra class intervals
a. An extra interval at the top with frequency = 0
This allows us to bring the graph back down to the X axis.
b. An extra interval at the bottom with frequency = 0
This allows us to start the graph at the X axis.
3. Label the X axis
a. It is best to label the X axis with the midpoints, beginning with the midpoint of the
extra low interval that you created.
b. Remember also to provide a label that tells the units
4. Label the Y axis
a. Begin at 0 if you can and label up to the largest frequency
b. Remember to also put the label "Frequency" next to the numeric labels

Example 3.8: Consider the example on frequency distribution in Table 3.2

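[Figure: frequency polygon with Frequency (0 to 16) on the vertical axis and Total Cost ($) on the horizontal axis, plotted at the class marks 83.5, 143.5, 203.5, 263.5, 323.5, 383.5, and 443.5.]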

Figure 3.6 Frequency polygon of the total cost on the 30 Major League Baseball parks during
the 2009 season.
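
The points of this frequency polygon can be listed with a short Python sketch. It takes the class marks and frequencies from Table 3.2 and adds the two extra zero-frequency intervals described in the steps; the function name `polygon_points` is made up for this illustration.

```python
class_marks = [143.5, 203.5, 263.5, 323.5, 383.5]   # CM column of Table 3.2
frequencies = [15, 11, 1, 2, 1]                      # Frequency column of Table 3.2
class_size = 60

def polygon_points(marks, freqs, size):
    """Return (class mark, frequency) pairs, padded with one empty class at each end."""
    xs = [marks[0] - size] + list(marks) + [marks[-1] + size]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))

points = polygon_points(class_marks, frequencies, class_size)
print(points)
```

The first and last points, (83.5, 0) and (443.5, 0), are the ones that bring the graph down to the X axis.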

Ogive

An ogive is used to graph cumulative frequency. It is constructed by placing a point
corresponding to the upper end of each class at a height equal to the cumulative frequency of the
class. These points then are connected. An ogive can also show the relative cumulative frequency
distribution on the right side axis.
• “less than” ogive - shows how many items in the distribution have a value less than or
equal to the upper limit of each class
• “greater than” ogive - shows how many items in the distribution have a value greater
than or equal to the lower limit of each class
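
As a rough sketch, the points of both ogives for Table 3.2 can be computed in Python as follows. Plotting the points at the true class boundaries (rather than at the class limits) is an assumption made here; the variable names are illustrative.

```python
from itertools import accumulate

boundaries = [113.5, 173.5, 233.5, 293.5, 353.5, 413.5]   # true class boundaries
frequencies = [15, 11, 1, 2, 1]

# "Less than" CF: running total of the class frequencies
less_than_cf = list(accumulate(frequencies))
total = less_than_cf[-1]

# "Greater than" CF: total minus everything that lies below each class
greater_than_cf = [total - before for before in [0] + less_than_cf[:-1]]

# Pair each CF with the boundary where it is plotted
less_than_ogive = list(zip(boundaries[1:], less_than_cf))
greater_than_ogive = list(zip(boundaries[:-1], greater_than_cf))

print(less_than_ogive)
print(greater_than_ogive)
```

The two lists match the <CF and >CF columns of Table 3.2.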

Numerical Descriptive Measure

There are many ways of describing a given set of data. A good number of descriptive
measures exist in Statistics whose use depends largely on the nature of the data and the intended
purpose of the description. More commonly used descriptive measures are the following:
measure of location, measure of dispersion, measure of skewness and kurtosis.

Measure of Location
A measure of location is a value within a set of data that describes its location or position
relative to the entire set of data. Measures of location include the following: central tendency –
mean, median and mode; and quantiles – percentile, decile and quartile.

1. Measure of Central Tendency – An average, or measure of central tendency, is a single value
about which the set of observations tends to cluster. Averages provide a summary and a basis for
comparison. The most common measures of central tendency are the arithmetic mean,
median, and mode.
a. Arithmetic Mean, or simply the mean, is defined as the sum of all observations divided
by the total number of observations, denoted by μ.
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Where: $x_i$ = the ith observed value of the variable X
N = total number of observations

Some properties of the mean:


• Applicable only to quantitative variables.
• All observations contribute to the mean.
• Easily affected by extreme values.
• Amenable to further mathematical manipulations.
• Total deviation of observations from μ is zero.
$\sum_{i=1}^{N} (x_i - \mu) = 0$

Example 3.9: Nutritional data about a sample of seven breakfast cereals includes the
number of calories per serving. Compute the mean number of calories in these breakfast
cereals.

Cereal                                        Calories
Kellogg’s All Bran                            80
Kellogg Corn Flakes                           100
Wheaties                                      100
Nature’s Path Organic Multigrain Flakes       110
Kellogg Rice Krispies                         130
Post Shredded Wheat Vanilla Almond            190
Kellogg Mini Wheats                           200

SOLUTION The mean number of calories is 130, computed as follows:

$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{80+100+100+110+130+190+200}{7} = \frac{910}{7} = 130$
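
The same computation can be checked with a few lines of Python. This is only a sketch; the helper name `mean` is chosen for this illustration. It also verifies the property that the deviations from the mean sum to zero.

```python
calories = [80, 100, 100, 110, 130, 190, 200]

def mean(values):
    """Arithmetic mean: sum of all observations divided by their count."""
    return sum(values) / len(values)

mu = mean(calories)                                # 910 / 7 = 130.0
total_deviation = sum(x - mu for x in calories)    # deviations cancel out: 0.0

print(mu, total_deviation)
```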

b. Median is the middle value of an array, denoted by Md. (An array is an arrangement
of data from lowest to highest.)

$Md = \frac{N+1}{2}$ ranked value

You compute the median by following one of two rules:


• Rule 1 If the data set contains an odd number of values, the median is the
measurement associated with the middle-ranked value.
• Rule 2 If the data set contains an even number of values, the median is the
average of the measurements associated with the two middle-ranked values.
Some properties of the median:
1. A positional value, hence unaffected by extreme values
2. The sum of absolute deviations of the observations from a point 𝑎, ∑|𝑥𝑖 − 𝑎|, is
minimum if 𝑎 = Md
3. Not amenable to further computations

Example 3.10:
Nutritional data about a sample of seven breakfast cereals includes the number of
calories per serving in Example 3.9. Compute the median number of
calories in breakfast cereals.

SOLUTION Because the result of dividing 𝑁 + 1 by 2 is (7 + 1)/2 = 4 for this sample of
seven, using Rule 1, the median is the measurement associated with the fourth-ranked
value. The number of calories per serving data are ranked from the smallest to the
largest:

Ranked Values: 80 100 100 110 130 190 200

Ranks: 1 2 3 4 5 6 7

Median = 4th observation = 110

The median number of calories is 110. Half the breakfast cereals have equal to or
less than 110 calories per serving, and half the breakfast cereals have equal to or
more than 110 calories.
c. Mode, denoted by Mo, is the value occurring most frequently in the given data set.

Example 3.11. Consider the data in Example 3.10 and compute the mode.

Mo = 100
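
A short Python sketch of the median rules and the mode for the cereal data follows. The function `median_by_rank` is written here for illustration; `multimode` comes from Python's standard `statistics` module and returns every value tied for the highest count.

```python
from statistics import multimode

calories = [80, 100, 100, 110, 130, 190, 200]

def median_by_rank(values):
    """Rule 1 (odd N): middle-ranked value. Rule 2 (even N): average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the (N + 1)/2-th ranked value
    return (ordered[mid - 1] + ordered[mid]) / 2

md = median_by_rank(calories)    # 4th ranked value
modes = multimode(calories)      # list of the most frequent value(s)

print(md, modes)
```

Unlike Excel's MODE.SNGL, `multimode` would list all modes if the data had more than one.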
2. Quantiles – numbers, which divide the array into 𝑛 equal parts
a. Percentiles – numbers, which divide the array of the data into 100 equal parts
- Best applied to large data sets
b. Deciles – numbers, which divide the array of the data into 10 equal parts
c. Quartiles - numbers, which divide the array of the data into 4 equal parts

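In general: Quantile = np ranked value

Where n = no. of observations
p = proportion

Remarks: Rule 1: If np is a whole number (say k), the quantile is the average of the kth and
(k+1)th ranked values, $\frac{x_k + x_{k+1}}{2}$.
Rule 2: If np is not a whole number, round it up to the next integer and take that ranked value.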

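The two quantile rules above can be sketched in Python as follows. The function name `quantile` and the restriction to 0 < p < 1 are assumptions made for this illustration, not part of the text.

```python
import math

def quantile(values, p):
    """Quantile by the np rule: average x_k and x_(k+1) if np is whole, else round np up."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(values)
    position = len(ordered) * p
    k = round(position)
    if math.isclose(position, k):                  # Rule 1: np is a whole number
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(position) - 1]        # Rule 2: round np up

calories = [80, 100, 100, 110, 130, 190, 200]
print(quantile(calories, 0.25))      # np = 1.75 -> 2nd ranked value -> 100
print(quantile(calories, 0.50))      # np = 3.5  -> 4th ranked value -> 110
print(quantile([1, 2, 3, 4], 0.50))  # np = 2    -> (2 + 3) / 2     -> 2.5
```

`math.isclose` is used instead of an exact comparison so that products such as 10 × 0.3, which are not stored exactly in floating point, are still treated as whole numbers.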
Measure of Dispersion
A measure of dispersion, or measure of variation, is a quantitative measure that describes the
extent to which data are dispersed or spread out. The commonly used measures are:

1. Range is the difference between the maximum (Max) and the minimum (Min) values of
the data set. It is a quick but rough measure of variability.

𝑅𝑎𝑛𝑔𝑒 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛

Properties:
• simple to calculate
• a rough estimate that considers only the min and max of a data set
• easily affected by extreme values.

2. Interquartile Range (IQR) is the difference between the 3rd quartile (Q3) and the first
quartile (Q1) values.

𝐼𝑄𝑅 = 𝑄3 − 𝑄1

3. Semi-Interquartile Range (SIR) is half of the IQR. It tells the average deviation of
observations from the median

𝑆𝐼𝑅 = 1/2(𝑄3 − 𝑄1 )
4. Mean Absolute Deviation (MAD) is the mean of the absolute differences of the
observations from the mean

$MAD = \frac{\sum_{i=1}^{N} |x_i - \mu|}{N}$

Where: N = total no. of observations
$x_i$ = ith observed value
μ = mean from ungrouped data

Properties:
• nonnegative
• not amenable to further computations
5. Variance (σ²) is the mean of the squared deviations of the observations from the mean.

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$

- This measure will be relatively large for highly variable data and relatively
small for less variable data.

Properties:
• nonnegative
• larger variance means higher dispersion
• all data points contribute to the computation of the variance
• amenable to further computation (many applications)
• unit of measurement is the square of the observation’s unit

6. Standard Deviation (σ) is the positive square root of the variance.

- σ is large if the data are widely spread about the mean.
- σ is small if the data are close to the mean.

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
7. Coefficient of Variation (CV) is the ratio of the standard deviation to the mean expressed
in percent.

$CV = \left(\frac{\sigma}{\mu}\right) \times 100\%$

Properties:
• unitless quantity expressed in %
• compares the dispersion of 2 or more populations measured in the same or
different units; the higher the CV the more variable is the data set relative to its
mean
• a relative measure of dispersion
Example 3.12. Nutritional data about a sample of seven breakfast cereals includes the number
of calories per serving in Example 3.9. Compute the range, IQR, SIR, Variance
and Standard Deviation.
SOLUTION: Ranked from smallest to largest, the calories for the seven cereals are
80 100 100 110 130 190 200

a. 𝑅𝑎𝑛𝑔𝑒 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛 = 200 − 80 = 120


b. Interquartile Range (IQR)
Solving for Q1 and Q3:
𝑄1 = 7(0.25) = 1.75, rounded up to the 2nd ranked value
𝑄1 = 100

𝑄3 = 7(0.75) = 5.25, rounded up to the 6th ranked value
𝑄3 = 190

𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 190 − 100 = 90

c. Semi-Interquartile Range
𝑆𝐼𝑅 = 1/2(𝑄3 − 𝑄1 ) = 1/2(90) = 45

d. Variance (σ²)
Computing for the variance:

𝜇 = 130

Step 1: Compute the deviation of each observation from the mean.
Step 2: Square each deviation.

Calories (X)   X − μ   (X − μ)²
80             −50     2500
100            −30     900
100            −30     900
110            −20     400
130            0       0
190            60      3600
200            70      4900
Sum            0       13,200

Step 3: Sum the squared deviations. The sum is 13,200.
Step 4: Divide the sum in Step 3 by N:

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(80-130)^2 + (100-130)^2 + \cdots + (200-130)^2}{7} = \frac{13200}{7} = 1885.71$

e. Standard Deviation

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} = \sqrt{1885.71} \approx 43.42$

On average, the calories per serving deviate from the mean of 130 by about 43.42.
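
The range, variance, and standard deviation above can be verified with a short Python sketch using the population formulas (dividing by N). The MAD and the shortcut form of the variance are included as extra checks that the example itself does not compute; the variable names are illustrative.

```python
import math

calories = [80, 100, 100, 110, 130, 190, 200]
n = len(calories)
mu = sum(calories) / n

data_range = max(calories) - min(calories)               # 200 - 80 = 120
mad = sum(abs(x - mu) for x in calories) / n             # 260 / 7, about 37.14
variance = sum((x - mu) ** 2 for x in calories) / n      # 13200 / 7, about 1885.71
variance_shortcut = sum(x * x for x in calories) / n - mu ** 2   # same value, shortcut form
std_dev = math.sqrt(variance)                            # about 43.42

print(data_range, round(mad, 2), round(variance, 2), round(std_dev, 2))
```

Both forms of the variance formula give the same result, up to floating-point rounding.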
Example 3.13. Comparing Two Coefficients of Variation When the Two Variables Have
Different Units of Measurement
Which varies more from cereal to cereal, the number of calories or the amount of sugar
(in grams)?

SOLUTION Because calories and the amount of sugar have different units of
measurement; you need to compare the relative variability in the two measurements.
For calories, using the mean from Example 3.9 and the standard deviation from Example 3.12,
the coefficient of variation is

$CV_{calorie} = \left(\frac{43.42}{130}\right) 100\% = 33.4\%$

For the amount of sugar in grams, the values for the seven cereals are

6 2 4 4 4 11 10

For these data, 𝜇 = 5.8571 and 𝜎 = 3.1364.

$CV_{sugar} = \left(\frac{3.1364}{5.8571}\right) 100\% = 53.55\%$

Thus, relative to the mean, the amount of sugar is much more variable than the calories.
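
The comparison can be reproduced with a small Python sketch that uses the population standard deviation, as in the formulas above. The function name `coefficient_of_variation` is chosen for this illustration.

```python
import math

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, expressed in percent."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return sigma / mu * 100

calories = [80, 100, 100, 110, 130, 190, 200]
sugar = [6, 2, 4, 4, 4, 11, 10]

cv_calories = coefficient_of_variation(calories)   # about 33.40
cv_sugar = coefficient_of_variation(sugar)         # about 53.55

print(round(cv_calories, 2), round(cv_sugar, 2))
```

Because the CV is unitless, the two results can be compared directly even though one variable is measured in calories and the other in grams.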

Shape
Shape is the pattern of the distribution of data values throughout the entire range of all
the values. A distribution is either symmetrical or skewed. In a symmetrical distribution, the
values below the mean are distributed in exactly the same way as the values above the mean. In
this case, the low and high values balance each other out. In a skewed distribution, the values
are not symmetrical around the mean. This skewness results in an imbalance of low values or
high values. Shape also can influence the relationship of the mean to the median. In most cases:
• Mean < median: negative, or left-skewed
• Mean = median: symmetric, or zero skewness
• Mean > median: positive, or right-skewed
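
As a rough illustration of this rule of thumb, the Python sketch below compares the mean and the median of a data set. The function name `skew_direction` is made up for this example, and the rule holds only "in most cases", so the result is an indication rather than a proof of the shape.

```python
from statistics import mean, median

def skew_direction(values):
    """Apply the mean-versus-median rule of thumb for the shape of a distribution."""
    m, md = mean(values), median(values)
    if m < md:
        return "left-skewed"
    if m > md:
        return "right-skewed"
    return "symmetric"

calories = [80, 100, 100, 110, 130, 190, 200]
print(skew_direction(calories))   # mean 130 > median 110 -> right-skewed
```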

Figure 3.7 depicts three data sets, each with a different shape.

Figure 3.7 A comparison of three data sets that differ in shape.

The data in Panel A are negative, or left-skewed. In this panel, most of the values are in
the upper portion of the distribution. A long tail and distortion to the left is caused by some
extremely small values. These extremely small values pull the mean downward so that the mean
is less than the median.
The data in Panel B are symmetrical. Each half of the curve is a mirror image of the other
half of the curve. The low and high values on the scale balance, and the mean equals the median.
The data in Panel C are positive, or right-skewed. In this panel, most of the values are in
the lower portion of the distribution. A long tail on the right is caused by some extremely large
values. These extremely large values pull the mean upward so that the mean is greater than the
median.
Skewness and kurtosis are two shape-related statistics. The skewness statistic
measures the extent to which a set of data is not symmetric. The kurtosis statistic measures the
relative concentration of values in the center of the distribution of a data set, as compared with
the tails.
A symmetric distribution has a skewness value of zero. A right-skewed distribution has a
positive skewness value, and a left-skewed distribution has a negative skewness value. A bell-
shaped distribution has a kurtosis value of zero. A distribution that is flatter than a bell-shaped
distribution has a negative kurtosis value. A distribution with a sharper peak (one that has a
higher concentration of values in the center of the distribution than a bell-shaped distribution)
has a positive kurtosis value.

The Boxplot
A boxplot provides a graphical representation of the data based on the five-number
summary: the smallest value, 𝑄1, the median, 𝑄3, and the largest value. To further analyze the
sample of 10 times to get ready in the morning, you can construct a boxplot, as displayed in
Figure 3.8.

Figure 3.8 Boxplot for the getting ready time.

The vertical line drawn within the box represents the median. The vertical line at the left
side of the box represents the location of 𝑄1 , and the vertical line at the right side of the box
represents the location of 𝑄3 . Thus, the box contains the middle 50% of the values. The lower
25% of the data are represented by a line connecting the left side of the box to the location of the
smallest value, 𝑥𝑠𝑚𝑎𝑙𝑙𝑒𝑠𝑡 . Similarly, the upper 25% of the data are represented by a line
connecting the right side of the box to 𝑥𝑙𝑎𝑟𝑔𝑒𝑠𝑡 .
The boxplot of the getting-ready times in Figure 3.8 indicates slight right-skewness
because the distance between the median and the highest value is slightly greater than the
distance between the lowest value and the median. Also, the right tail is slightly longer than the
left tail.
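
The five numbers behind a boxplot can be computed with a short Python sketch. The getting-ready times are not listed in this chapter, so the cereal calories from Example 3.9 are used as a stand-in; the quartiles follow the np rule from the section on quantiles, and the function names are illustrative.

```python
import math

def ranked_quantile(ordered, p):
    """np rule: average x_k and x_(k+1) if np is whole, otherwise round np up."""
    position = len(ordered) * p
    k = round(position)
    if math.isclose(position, k):
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(position) - 1]

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    ordered = sorted(values)
    return (ordered[0],
            ranked_quantile(ordered, 0.25),
            ranked_quantile(ordered, 0.50),
            ranked_quantile(ordered, 0.75),
            ordered[-1])

calories = [80, 100, 100, 110, 130, 190, 200]   # stand-in data, not the getting-ready times
smallest, q1, med, q3, largest = five_number_summary(calories)

# Compare the distances on either side of the median, as in the discussion above
right_distance = largest - med
left_distance = med - smallest
print((smallest, q1, med, q3, largest), right_distance > left_distance)
```

A larger distance from the median to the largest value than from the smallest value to the median points to right-skewness, which is the same comparison used to read Figure 3.8.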

EXCEL GUIDE

Using Analysis ToolPak

The Analysis ToolPak is an Excel add-in program that provides data analysis tools for
financial, statistical and engineering data analysis.
To load the Analysis ToolPak add-in, execute the following steps.

1. On the File tab, click Options.


2. Under Add-ins, select Analysis ToolPak and click on the Go button.
3. Check Analysis ToolPak and click on OK.

4. On the Data tab, in the Analysis group, you can now click on Data Analysis.

5. For example, select Histogram and click OK to create a Histogram in Excel.


Central Tendency (The Mean, Median, Mode)

Finding the Mean


In-Depth Excel Enter the scores in one of the columns on the Excel spreadsheet (see the example
below). After the data have been entered, place the cursor where you wish to have the mean
(average) appear and click the mouse button. Select Insert Function (fx) from
the FORMULAS tab. A dialog box will appear. Select AVERAGE from the Statistical category
and click OK. (Note: If you want the Median, select MEDIAN. If you want the Mode,
select MODE.SNGL. Excel only provides one mode. If a data set had more than one mode, Excel
would only display one of them.)

Enter the cell range for your list of numbers in the Number 1 box. For example, if your data were
in column A from row 1 to 13, you would enter A1:A13. Instead of typing the range, you can also
move the cursor to the beginning of the set of scores you wish to use and click and drag the cursor
across them. Once you have entered the range for your list, click on OK at the bottom of the
dialog box. The mean (average) for the list will appear in the cell you selected.
Analysis ToolPak
1. Enter the scores in one of the columns on the Excel spreadsheet. Select Data Analysis
from the DATA tab.

2. Select Descriptive Statistics and click OK to create a list that includes measures of
central tendency.

VARIATION and SHAPE


The Range, Variance, Standard Deviation, Coefficient of Variation, and Shape

In-Depth Excel Enter the scores in one of the columns on the Excel spreadsheet (see the example
below). After the data have been entered, place the cursor where you wish to have the result
appear and click the mouse button. Select Insert Function (fx) from
the FORMULAS tab. A dialog box will appear. Select VAR.P (population variance) or VAR.S
(sample variance) from the Statistical category and click OK. (Note: If you want the Standard
Deviation, select STDEV.P (population standard deviation) or STDEV.S (sample standard
deviation). If you want skewness, select SKEW. If you want kurtosis, select KURT.) To compute
the range, select MAX (maximum value) and MIN (minimum value) and take the difference
between them. To compute the Coefficient of Variation, divide the standard deviation by the
mean and multiply it by 100%.

Analysis ToolPak Use the data in Example 3.9. Use Descriptive Statistics to create a list that
contains measures of variation and shape along with central tendency.
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Descriptive Statistics from the Analysis Tools list
and then click OK. In the Descriptive Statistics dialog box (shown below):

3. Enter A1:A8 as the Input Range.


4. Click New Worksheet Ply, check Summary statistics, and then click OK.

In the new worksheet:


To add the coefficient of variation to this worksheet, first enter Coefficient of variation in
cell A16. Then, enter the formula =B7/B3 *100 in cell B16.
