Data Representation and Analysis Notes, Math
WHAT IS DATA?
Data is a collection of facts (such as numbers, words, measurements, observations or even just descriptions of things) from
which conclusions may be drawn.
WHY IS DATA IMPORTANT?
Collecting data is an essential first step in statistical data analysis. Data can be collected from existing sources or through
observation, surveys, or experiments.
TYPES OF DATA
Qualitative data deals with characteristics and descriptors that can't be easily
measured, but can be observed subjectively—such as smells, tastes, textures,
attractiveness, and colour. These observations fall into separate distinct categories.
E.g. colour of eyes: blue, green, brown, etc.; exam result: pass or fail; socio-economic
status: low, middle or high.
Quantitative data deals with numbers and things you can measure objectively: e.g.
dimensions (such as height, width, and length), temperature, humidity, prices, area
and volume. These numerical responses may be discrete or continuous.
Discrete data is a count that can't be made more precise. Typically it involves integers. For instance, the number of children
(or adults, or pets) in your family is discrete data, because you are counting whole, indivisible entities: you can't have 2.5 kids,
or 1.3 pets.
Continuous data, on the other hand, could be divided and reduced to finer and finer levels. For example, you can measure the
height of your kids at progressively more precise scales—meters, centimetres, millimetres, and beyond—so height is
continuous data.
REPRESENTING DATA WITH DIAGRAMS
You know the saying, “A picture is worth a thousand words”? After collecting and organizing data, the next step is to display it
in a manner that makes it easy to read, highlighting similarities, disparities, trends, and other relationships, or the lack thereof, in
the data set. Visual representations make the collected data easier to understand. In selecting how best to
present your data, think about the purpose and what you want to present, then decide which variables you want to include and
whether they should be expressed as frequencies, percentages, or categories. We now focus on two diagrams: the stem-and-leaf
diagram and the box-and-whisker diagram.
STEM-AND-LEAF DIAGRAMS
A stem-and-leaf diagram, also called a stem-and-leaf plot, is a diagram that quickly organizes and
summarizes data while maintaining the individual data points. In such a diagram, the "stem" is a
column of the unique values that remain after removing the last digit of each data point. The final digits ("leaves")
are then placed in a row next to the appropriate stem and sorted in numerical order.
In general, stems may have as many digits as needed, but each leaf should contain only a single
digit. This diagram was invented by John Tukey. Look at the example below:
Elements of a good stem and leaf plot
• shows the first digits of the number (thousands, hundreds or tens) as the stem and shows the last digit (ones) as the leaf.
• usually uses whole numbers. Anything that has a decimal point is rounded to the nearest whole number. For example, test
results, speeds, heights, weights, etc.
• looks like a bar graph when it is turned on its side.
• shows how the data are spread, that is, highest number, lowest number, most common number and outliers (a number
that lies outside the main group of numbers).
Once you have decided that a stem and leaf plot is the best way to show your data, draw it as follows:
1. On the left hand side of the page, write down the thousands, hundreds or tens (all digits but the last one). These will be
your stems.
2. Draw a line to the right of these stems.
3. On the other side of the line, write down the ones (the last digit of a number). These will be your leaves.
For example, if the observed value is 25, then the stem is 2 and the leaf is the 5. If the observed value is 369, then the stem is 36
and the leaf is 9. Where observations are accurate to one or more decimal places, such as 23.7, the stem is 23 and the leaf is 7. If
the range of values is too great, the number 23.7 can be rounded up to 24 to limit the number of stems.
In stem and leaf plots, tally marks are not required because the actual data are used.
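The drawing steps above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `stem_and_leaf` and the sample values are made up for illustration, and it assumes non-negative whole numbers (decimal values would first be rounded, as described above).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of an ordered stem-and-leaf plot for whole numbers."""
    rows = defaultdict(list)
    for value in sorted(values):
        # Stem = all digits but the last one, leaf = the last digit.
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest so that gaps stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Illustrative values only (not taken from the notes).
print("\n".join(stem_and_leaf([25, 23, 31, 6, 12, 25])))
```

Because the values are sorted before they are split, the leaves in each row come out in ascending order, which gives an ordered plot directly.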
Example 1 - Making a stem and leaf plot
Each morning, a teacher quizzed his class with 20 geography questions. The class marked them together and everyone kept a
record of their personal scores. As the year passed, each student tried to improve his or her quiz marks. Every day, Elliot
recorded his quiz marks on a stem and leaf plot. This is what his marks looked like plotted out:
Table 1. Elliot's scores on the basic facts quiz last year
Stem Leaf
0 365
1 014356568979
2 0000
Analyse Elliot's stem and leaf plot. What is his most common score on the geography quizzes? What is his highest score? His
lowest score? Rotate the stem and leaf plot onto its side so that it looks like a bar graph. Are most of Elliot's scores in the
10s, 20s or under 10? It is difficult to know from the plot whether Elliot has improved or not because we do not know the
order of those scores.
A teacher asked 10 of her students how many books they had read in the last 12 months. Their answers were as follows:
Tip: The number 6 can be written as 06, which means that it has a stem of 0 and a leaf of 6.
The stem and leaf plot should look like this:
1 29052
2 351
Usually, a stem and leaf plot is ordered, which simply means that the leaves are arranged in ascending order from left to right.
Also, there is no need to separate the leaves (digits) with punctuation marks (commas or periods) since each leaf is always a
single digit. Using the data from Table 2, we made the ordered stem and leaf plot shown below:
Stem Leaf
0 67
1 02259
2 135
Example 3 – Splitting stems using decimal values
The weights (to the nearest tenth of a kilogram) of 30 students were measured and recorded as follows:
59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7, 61.6, 56.3, 61.9,
65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4, 58.4, 60.8, 60.2, 62.7, 60.0, 59.3,
Answer: In this case, the stems will be the whole number values and the leaves will be the decimal values. The data range
from 56.3 to 65.7.
Table 8. Weights of 30 students
Stem Leaf
56 3
57
58 449
59 00238
60 0245789
61 124456799
62 1237
63
64
65 7
Outliers
An outlier is an extreme value of the data. It is an observation value that is significantly different from the rest of the data. There
may be more than one outlier in a set of data.
Sometimes, outliers are significant pieces of information and should not be ignored. Other times, they occur because of an error
or misinformation and should be ignored.
In the previous example, 56.3 and 65.7 could be considered outliers, since these two values are quite different from the other
values.
By ignoring these two outliers, the previous example's stem and leaf plot could be redrawn as below:
Table 9. Weights of 30 students except for outliers
Stem Leaf
58 449
59 00238
60 0245789
61 124456799
62 1237
What is a Back-to-Back Stem plot?
On a normal plot, the stem is on the left and all the leaves are on the right. There is a vertical line separating the two. On a back
to back plot, the stem remains the same. But to add another set of data points, we begin adding leaves to the LEFT side.
Just like on a typical plot, the smallest leaves are placed closest to the stem, and larger leaves are further away. The stem now
serves a double purpose. It anchors both sets of data points, keeping them separate but it still organizes both.
In a back to back stem and leaf plot, you can compare two sets of data, and still be able to find the statistical measurements of
each set. It also retains the same pros and cons of a normal plot. In the picture above, you can see that the stems work for each
side of the plot, yet the data is separate. We added the points for Team B, but we started building at the centre of the plot. It is
pointed out on the last line of Team B’s points that the smaller leaves are closest to the stem and the larger leaves are farther
away. In this picture you can see that Team B scored 92, 92, 92, 93, 96, and 96. Note the locations of the “2” leaf
and the “6” leaf.
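A back-to-back plot can also be sketched in code. In the sketch below, the Team B scores are the ones quoted above, while the Team A scores and the function name `back_to_back` are hypothetical, added only so that the plot has two sides. It assumes non-negative whole numbers and at least one value overall.

```python
def back_to_back(left_values, right_values):
    """Return rows of a back-to-back stem-and-leaf plot (whole numbers only)."""
    stems = sorted({v // 10 for v in list(left_values) + list(right_values)})
    rows = []
    for stem in stems:
        # Left side: sort descending so the smallest leaf ends up next to the stem.
        left = sorted((v % 10 for v in left_values if v // 10 == stem), reverse=True)
        # Right side: leaves ascend as they move away from the stem.
        right = sorted(v % 10 for v in right_values if v // 10 == stem)
        rows.append((stem, "".join(map(str, left)), "".join(map(str, right))))
    width = max(len(left) for _, left, _ in rows)
    # Right-align the left-hand leaves so they butt up against the stem column.
    return [f"{left:>{width}} | {stem} | {right}".rstrip() for stem, left, right in rows]

team_b = [92, 92, 92, 93, 96, 96]   # scores quoted in the notes (left side)
team_a = [88, 91, 94, 95, 97]       # hypothetical scores, for illustration only
print("\n".join(back_to_back(team_b, team_a)))
```

On the left-hand side the leaves read 663222 from the outside in, so the “2” leaves sit next to the stem and the “6” leaves are farthest away.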
Using stem and leaf plots as graphs
A stem and leaf plot is a simple kind of graph that is made out of the numbers themselves. It is a means of displaying the main features of a
distribution. If a stem and leaf plot is turned on its side, it will resemble a bar graph or histogram and provide similar visual information.
The results of 41 students' math tests (with a best possible score of 70) are recorded below:
31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35,
51, 63, 42
Stem Leaf
0 4
1 89
2 346
3 1245579
4 012345589
5 00011234455677
6 02357
Since there are 41 observations, the distribution centre (the median value) will occur at the 21st observation. Counting 21 observations up
from the smallest, the centre is 48. (Note that the same value would have been obtained if 21 observations were counted down from the
highest observation.)
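The counting argument can be checked with a short Python sketch. It uses the 41 test scores listed above; the variable names are arbitrary, and `statistics.median` from the standard library is used only as a cross-check.

```python
import statistics

scores = [31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54,
          26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4,
          54, 57, 39, 52, 45, 35, 51, 63, 42]

ordered = sorted(scores)
# With an odd number of observations the median sits at position (n + 1) / 2.
middle_position = (len(ordered) + 1) // 2
median = ordered[middle_position - 1]    # lists are indexed from 0

print(len(ordered), middle_position, median)   # 41 21 48
assert median == statistics.median(scores)
```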
1. The main advantage of a stem-and-leaf diagram is that the data are grouped and all the original data are shown, too.
2. Easy to construct
3. Shows range, minimum & maximum, gaps & clusters, and outliers easily
BOX-AND-WHISKER PLOTS
Box plots provide a visual representation of a five-number summary of data, consisting of the median (the middle value of the
ordered data), the upper and lower quartiles (the values that mark off the highest quarter and the lowest quarter of the data,
respectively) and the largest and smallest values (the extremes). Box plots are particularly useful for comparing distributions
of the results from several experimental conditions. A box and whisker plot is a good way to summarize large amounts of
data. It is usually drawn alongside a number line, as shown:
Example
The oldest person in Mathsminster is 90. The youngest person is 15.
The median age of the residents is 44, the lower quartile is 25, and the upper quartile is 67.
Represent this information with a box-and-whisker plot.
Solution
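One rough way to picture the answer is to draw the five-number summary as a line of text. The sketch below is not the original figure: the function name `text_box_plot` and the symbols chosen for the whiskers (`-`), the box (`=`) and the median (`M`) are assumptions made for illustration.

```python
def text_box_plot(minimum, q1, median, q3, maximum, width=76):
    """Draw a five-number summary as one line of text (a rough sketch)."""
    span = maximum - minimum

    def column(value):
        # Map a data value to a character position on the line.
        return round((value - minimum) * (width - 1) / span)

    cells = [" "] * width
    for i in range(column(minimum), column(maximum) + 1):
        cells[i] = "-"                  # whiskers
    for i in range(column(q1), column(q3) + 1):
        cells[i] = "="                  # box
    cells[column(minimum)] = "|"        # smallest value
    cells[column(maximum)] = "|"        # largest value
    cells[column(q1)] = "["             # lower quartile
    cells[column(q3)] = "]"             # upper quartile
    cells[column(median)] = "M"         # median
    return "".join(cells)

# Mathsminster ages: youngest 15, lower quartile 25, median 44, upper quartile 67, oldest 90.
plot = text_box_plot(15, 25, 44, 67, 90)
print(plot)
```

With the default width of 76 characters, each character stands for one year of age, so the box runs from column 10 (age 25) to column 52 (age 67).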
ANALYZING DIAGRAMS
1. Features of distributions
When you assess the overall pattern of any distribution (which is the pattern formed by all values of a particular variable), look
for these features:
• number of peaks
• general shape (skewed or symmetric)
• centre
• spread
Number of peaks
Line graphs are useful because they readily reveal some characteristics of the data.
The first characteristic that can be readily seen from a line graph is the number of high points or peaks the distribution has.
While most distributions that occur in statistical data have only one main peak (unimodal), other distributions may have two
peaks (bimodal) or more than two peaks (multimodal).
Examples of unimodal, bimodal and multimodal line graphs are shown below:
General shape
A perfectly symmetric curve is one in which both sides of the distribution would exactly match the other if the figure were
folded over its central point. An example is shown below:
A symmetric, unimodal, bell-shaped distribution—a relatively common occurrence—is called a normal distribution. In a
normal distribution, mean, median and mode are identical in value.
A distribution is said to be skewed to the right, or positively skewed, when most of the data are concentrated on the left of the
distribution. Distributions with positive skews are more common than distributions with negative skews.
Income provides one example of a positively skewed distribution. Most people make under $40,000 a year, but some make
quite a bit more, with a smaller number making many millions of dollars a year. Therefore, the positive (right) tail on the line
graph for income extends out quite a long way, whereas the negative (left) tail stops at zero. The right tail clearly extends
farther from the distribution's centre than the left tail, as shown below:
A distribution is said to be skewed to the left, or negatively skewed, if most of the data are concentrated on the right of the
distribution. The left tail clearly extends farther from the distribution's centre than the right tail, as shown below:
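A small numerical sketch can make skew more concrete. The incomes below are hypothetical figures invented for illustration, and the comparison of mean and median is a common rule of thumb rather than something taken from these notes: in a right-skewed sample the long tail usually pulls the mean above the median.

```python
import statistics

# Hypothetical annual incomes in thousands of dollars (illustrative only).
incomes = [22, 25, 28, 30, 31, 33, 35, 38, 40, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# The single very large income drags the mean far to the right of the median.
print(mean_income, median_income)
```

Here one extreme value lifts the mean to about 53 while the median stays at 32, which matches the picture of most data sitting on the left with a long right tail.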
Locating the centre (median) of a distribution can be done by counting half the observations up from the smallest. Obviously,
this method is impracticable for very large sets of data. A stem and leaf plot makes this easy, however, because the data are
arranged in ascending order. The mean is another measure of central tendency.
The amount of distribution spread and any large deviations from the general pattern (outliers) can be quickly spotted on a graph.
2. Measures of Central Tendency from Raw Data
3. Measures of Dispersion from Raw Data
While measures of central tendency indicate what value of a variable is (in one sense or other) “average” or
“central” or “typical” in a set of data, measures of dispersion (or variability or spread) indicate (in one sense or
other) the extent to which the observed values are “spread out” around that centre — how “far apart” observed
values typically are from each other and therefore from some average value (in particular, the mean). Thus:
• if all cases have identical observed values (and thereby are also identical to [any] average value), dispersion is
zero;
• if most cases have observed values that are quite “close together” (and thereby are also quite “close” to the
average value), dispersion is low (but greater than zero); and
• if many cases have observed values that are quite “far away” from many others (or from the average value),
dispersion is high.
A measure of dispersion provides a summary statistic that indicates the magnitude of such dispersion and, like a
measure of central tendency, is a univariate statistic.
Because dispersion is concerned with how “close together” or “far apart” observed values are (i.e., with the
magnitude of the intervals between them), measures of dispersion are defined only for interval (or ratio)
variables or, in any case, variables we are willing to treat as interval.
There is one exception: a very crude measure of dispersion called the variation ratio, which is defined for ordinal
and even nominal variables.
There are two principal types of measures of dispersion: range measures and deviation measures. Range
measures are based on the distance between pairs of (relatively) “extreme” values observed in the data.
They are conceptually connected with the median as a measure of central tendency.
The (“total” or “simple”) range is the maximum (highest) value observed in the data [the value of the case at the
100th percentile] minus the minimum (lowest) value observed in the data [the value of the case at the 0th
percentile].
That is, it is the “distance” or “interval” between the values of the two most extreme cases,
e.g., the range of the 41 math test scores listed earlier is 67 − 4 = 63.
The problem with the [total] range as a measure of dispersion is that it depends on the values of just two cases,
which by definition have (possibly extraordinarily) atypical values.
In particular, the range makes no distinction between a polarized distribution in which almost all observed values
are close to either the minimum or maximum values and a distribution in which almost all observed values are
bunched together but there are a few extreme outliers.
Recall the Ideological Dispersion bar graphs.
Also the range is undefined for theoretical distributions that are “open-ended,” like the normal distribution (that
we will take up in the next topic) or the upper end of an income distribution type of curve (as in previous slides).
Therefore other variants of the range measure that do not reach entirely out to the extremes of the frequency
distribution are often used instead of the total range.
The interquartile range is the value of the case that stands at the 75th percentile of the distribution minus the
value of the case that stands at the 25th percentile.
The first quartile is the median observed value among all cases that lie below the overall median and the third
quartile is the median observed value among all cases that lie above the overall median.
In these terms, the interquartile range is third quartile minus the first quartile.
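The range and interquartile range described above can be computed with a short sketch. It follows the "median of each half" definition given here, using the 41 math test scores from earlier in these notes. The function name `quartiles_by_halves` is made up, and other quartile conventions used by software can give slightly different values.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median of each half of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]         # cases below the overall median
    upper = ordered[n - half:]     # cases above the overall median
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

scores = [31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54,
          26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4,
          54, 57, 39, 52, 45, 35, 51, 63, 42]

q1, median, q3 = quartiles_by_halves(scores)
total_range = max(scores) - min(scores)
iqr = q3 - q1
print(total_range, q1, median, q3, iqr)
```

For these scores the total range (63) is stretched by the single low score of 4, while the interquartile range (19.5) describes only the middle half of the data.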
Deviation measures are based on average deviations from some average value.
Since dispersion measures pertain to interval variables, we can calculate means, and deviation measures are
typically based on the mean deviation from the mean value.
Thus the (mean and) standard deviation measures are conceptually connected with the mean as a measure of
central tendency.
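In symbols, the deviation measures can be written as follows. The notation is an assumption (it does not appear in these notes): the observed values are x_1, ..., x_n and their mean is \bar{x}. The standard deviation is shown in its population form, dividing by n; the sample form divides by n − 1 instead.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\text{mean deviation} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
\qquad
\text{standard deviation} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}
```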
PERCENTILES
Percentiles are like quartiles, except that percentiles divide the set of data into 100 equal parts while quartiles divide the set
of data into 4 equal parts. Percentiles measure position from the bottom.
Percentiles are most often used for determining the relative standing of an individual in a population or the rank position of
the individual. Some of the most popular uses for percentiles are connected with test scores and graduation standings.
Percentile ranks are an easy way to convey an individual's standing at graduation relative to other graduates.
About Percentile Ranks:
• percentile rank is a number between 0 and 100 indicating the percent of cases falling at or below that score.
• percentile ranks are usually written to the nearest whole percent: 74.5% ≈ 75%, i.e. the 75th percentile
• scores are divided into 100 equally sized groups
• scores are arranged in rank order from lowest to highest
• there is no 0th percentile rank - the lowest score is at the first percentile
• there is no 100th percentile - the highest score is at the 99th percentile.
• you cannot perform the same mathematical operations on percentiles that you can on raw scores. You cannot, for
example, compute the mean of percentile scores, as the results may be misleading.
Definition 1: A percentile is a measure that tells us what percent of the total frequency scored at or below that
measure. A percentile rank is the percentage of scores that fall at or below a given score.
Formula:
To find the percentile rank of a score, x, out of a set of n scores, where x is included:
$$\text{percentile rank} = \frac{B + 0.5E}{n} \times 100$$
Where B = number of scores below x
E = number of scores equal to x
n = number of scores
See this formula in more detail in the Examples section.
Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's
percentile rank would be:
$$\frac{125 + 0.5(1)}{150} \times 100 \approx 83.7 \approx 84$$
Jason's standing in the class at the 84th percentile is as high as or higher than 84% of the graduates. Good job, Jason!
1. The math test scores were: 50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99. Find the percentile rank
for a score of 84 on this test.
Since there are 2 values equal to 84, assign one to the group "above 84" and the other to the group "below 84".
50, 65, 70, 72, 72, 78, 80, 82, 84, | 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99
$$\frac{8 + 0.5(2)}{20} \times 100 = \frac{9}{20} \times 100 = 45$$
The score of 84 is at the 45th percentile for this test.
2. The math test scores were: 50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99. Find the percentile rank
for a score of 86 on this test.
Since there is only one value equal to 86, it will be counted as "half" of a data value for the group "above 86" as well as the group
"below 86".
50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 8|6, 88, 88, 90, 94, 96, 98, 98, 99
$$\frac{11 + 0.5(1)}{20} \times 100 = \frac{11.5}{20} \times 100 = 57.5 \approx 58$$
The score of 86 is at the 58th percentile for this test.
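The percentile rank calculation in these examples can be sketched in Python. The function names are made up for illustration. The sketch counts the scores below x and half of the scores equal to x, and it rounds halves upward so that 57.5 is reported as the 58th percentile, as in the example above.

```python
import math

def percentile_rank(scores, x):
    """Percentile rank of x: scores below x plus half of the scores equal to x."""
    below = sum(1 for s in scores if s < x)
    equal = sum(1 for s in scores if s == x)
    return 100 * (below + 0.5 * equal) / len(scores)

def to_whole_percent(rank):
    """Round to the nearest whole percent, with halves rounded up."""
    return math.floor(rank + 0.5)

test_scores = [50, 65, 70, 72, 72, 78, 80, 82, 84, 84,
               85, 86, 88, 88, 90, 94, 96, 98, 98, 99]

print(percentile_rank(test_scores, 84))                     # 45.0
print(to_whole_percent(percentile_rank(test_scores, 86)))   # 58
```

Python's built-in `round` is avoided here because it rounds halves to the nearest even number, which would not match the rounding convention described in these notes.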
3. Quartiles can be thought of as percentile measures. Remember that quartiles break the data set into 4 equal parts. If 100% is
broken into four equal parts, we have subdivisions at 25%, 50%, and 75%, creating the first, second and third quartiles.
For the table below, find the intervals in which the first, second and third quartiles lie.
Test Scores   Frequency   Cumulative Frequency
76-80         3           3
81-85         7           10
86-90         6           16
91-95         4           20
If there are a total of 20 scores, the first quartile will be located (25% · 20 = 5) five values up from the bottom. This puts the
first quartile in the interval 81-85.
In a similar fashion, the second quartile will be located (50% · 20 = 10) ten values up from the bottom in the interval 81-85.
The third quartile will be located (75% · 20 = 15) fifteen values up from the bottom in the interval 86-90.
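The same lookup can be sketched in code by accumulating the frequencies until the required position is reached. The function name `interval_containing` is hypothetical; the table values are the ones from this example.

```python
def interval_containing(fraction, table):
    """Return the label of the interval in which the given fraction of the data is reached."""
    total = sum(freq for _, freq in table)
    position = fraction * total      # e.g. 25% of 20 scores = 5 values up from the bottom
    cumulative = 0
    for label, freq in table:
        cumulative += freq           # running (cumulative) frequency
        if cumulative >= position:
            return label
    raise ValueError("the table has no data")

table = [("76-80", 3), ("81-85", 7), ("86-90", 6), ("91-95", 4)]

for name, fraction in [("Q1", 0.25), ("Q2", 0.50), ("Q3", 0.75)]:
    print(name, interval_containing(fraction, table))
```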
Practice with Percentiles and Quartiles
1. The Final Exam test scores were: 62, 66, 71, 75, 75, 78, 81, 83, 84, 85,
85, 87, 89, 89, 91, 92, 93, 94, 95, 99. Find the percentile rank for a score
of 85 on this test.
Choose:
25th percentile
50th percentile
75th percentile
85th percentile
2. The heights of students in inches in Block 3 math class are 55, 59, 59, 60,
61, 63, 64, 64, 65, 68, 68, 69, 72, 74. Find the percentile rank for a height
of 61 inches.
Choose:
28th percentile
29th percentile
30th percentile
32nd percentile
Interval Frequency
69 - 76 1
77 - 84 4
85 - 92 4
93 - 100 1
4. The following data represents the heights ( in inches) of 14 students in Mrs. Schultzkie's
math class: 65, 63, 68, 59, 74, 59, 68, 61, 64, 60, 69, 72, 55, 64.
Interval Frequency
55-58
59-62
63-66
67-70
71-74
STANDARD DEVIATION
Standard deviation (often abbreviated as "Std Dev" or "SD") provides an indication of how far the individual
responses to a question vary or "deviate" from the mean. SD tells the researcher how spread out the responses are: are they
concentrated around the mean, or scattered far and wide? Did all of your respondents rate your product in the middle of your scale,
or did some love it and some hate it?
How to Interpret Standard Deviation in a Statistical Data Set
Standard deviation can be difficult to interpret as a single number on its own. Basically, a small standard deviation means that
the values in a statistical data set are close to the mean of the data set, on average, and a large standard deviation means that the
values in the data set are farther away from the mean, on average.
The standard deviation measures how concentrated the data are around the mean; the more concentrated, the smaller
the standard deviation.
A small standard deviation can be a goal in certain situations where the results are restricted, for example, in product
manufacturing and quality control. A particular type of car part that has to be 2 centimetres in diameter to fit properly had better not
have a very big standard deviation during the manufacturing process. A big standard deviation in this case would mean that lots of parts
end up in the trash because they don’t fit right; either that or the cars will have problems down the road.
But in situations where you just observe and record data, a large standard deviation isn’t necessarily a bad thing; it just reflects a
large amount of variation in the group that is being studied. For example, if you look at salaries for everyone in a certain company,
including everyone from the student intern to the CEO, the standard deviation may be very large. On the other hand, if you narrow the
group down by looking only at the student interns, the standard deviation is smaller, because the individuals within this group have
salaries that are less variable. The second data set isn’t better, it’s just less variable.
Here are some properties that can help you when interpreting a standard deviation:
• The standard deviation can never be a negative number, due to the way it’s calculated and the fact that it measures a distance
(distances are never negative numbers).
• The smallest possible value for the standard deviation is 0, and that happens only in contrived situations where every single
number in the data set is exactly the same (no deviation).
• The standard deviation is affected by outliers (extremely low or extremely high numbers in the data set). That’s because the
standard deviation is based on the distance from the mean. And remember, the mean is also affected by outliers.
• The standard deviation has the same units as the original data.
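These properties can be checked on a few small data sets. The numbers below are hypothetical and chosen only for illustration. The sketch uses `statistics.pstdev` from the Python standard library, which divides by n (the population form); `statistics.stdev` divides by n − 1 and would give slightly larger values.

```python
import statistics

# Hypothetical data sets, for illustration only.
all_same = [5, 5, 5, 5]             # every value identical
close_together = [4, 5, 5, 6]       # values near the mean
with_outlier = [4, 5, 5, 6, 40]     # same values plus one extreme value

sd_same = statistics.pstdev(all_same)
sd_close = statistics.pstdev(close_together)
sd_outlier = statistics.pstdev(with_outlier)

# Zero when nothing varies, small when values cluster, inflated by an outlier.
print(sd_same, round(sd_close, 2), round(sd_outlier, 2))
```

Adding the single value 40 moves the mean from 5 to 12 and raises the standard deviation from under 1 to about 14, which shows how sensitive both measures are to outliers.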