Basic Stat - Chapter 2 Visual Description of Data
Sources of Data
• Primary data
– data measured or collected by the investigator or the user directly
from the source
– the data you collect are unique to you and your research and,
until you publish, no one else has access to them
– The primary sources of data are the objects or persons from
which we collect the figures used as first-hand information.
• Secondary data
– second-hand data or information that was gathered by someone
else
– The secondary sources are either published or unpublished
materials or records.
– Some of the sources of secondary data are
Methods of Data Collection
• There are three major methods of data collection:
1) Observation or measurement.
2) Interview with questionnaires.
a. Face-to-face interview.
b. Telephone interview.
c. Self-administered questionnaires returned by mail (mailed
questionnaire).
3) The use of documentary sources.
Observation or measurement (direct personal observation)
• In this case data are obtained through direct observation or
measurement. This requires training and monitoring of the measurer
to ensure that a standard procedure is used.
• It provides accurate information but is expensive and inconvenient.
• Examples: laboratory tests, clinical measurements, physical
examinations, etc.
• Interview with questionnaires: Here one drafts a detailed
questionnaire. The questionnaires can either be mailed to the
respondents for filling in and returning, or be given to
enumerators who go around and fill them in after obtaining the
desired information.
– Diagrams, and
– Graphs
Tabular presentation of data
Frequency Distribution
• When data have been collected, they are of little use until they have
been organized and represented in a form that helps us understand the
information contained.
• We’ll discuss how raw data are converted to frequency distributions and
visual displays that provide us with a “big picture” of the information
collected.
• By so organizing the data, we can better identify trends, patterns, and
other characteristics that would not be apparent during a simple shuffle
through a pile of questionnaires or other data collection forms.
• Such summarization also helps us compare data that have been
collected at different points in time, by different researchers, or from
different sources.
• It can be very difficult to reach conclusions unless we simplify
the mass of numbers contained in the original data.
• Variables are either quantitative or qualitative.
• In turn, the appropriate methods for representing the data will
depend on whether the variable is quantitative or qualitative.
• The frequency distribution, histogram, stem-and-leaf display,
dot plot, and scatter diagram techniques of this chapter are
applicable to quantitative data, while the contingency table is
used primarily for counts involving qualitative data.
Categorical Frequency Distribution
A B C D
Class Tally Frequency Percent
• Example: The smoking status and gender of a sample of 20
health workers at Jimma Hospital (1986 E.C.) are given below. Construct
a categorical frequency distribution.
Observation      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Gender           M  F  M  M  F  F  F  M  M  M  F  F  F  F  M  F  M  F  M  M
Smoking status   Y  N  N  Y  N  N  Y  N  N  N  N  N  N  Y  Y  Y  N  N  Y  Y
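The tallying in this example can be checked by machine. The following is a minimal Python sketch rather than part of the original exercise: the function name `categorical_frequency` is made up for illustration, and rounding percentages to one decimal place is an assumption. It also cross-classifies smoking status by gender, which gives a small contingency table.

```python
from collections import Counter

# Data copied from the table above (20 health workers).
gender = "M F M M F F F M M M F F F F M F M F M M".split()
smoking = "Y N N Y N N Y N N N N N N Y Y Y N N Y Y".split()


def categorical_frequency(values):
    """Return {class: (frequency, percent)} for a list of category labels."""
    counts = Counter(values)
    n = len(values)
    return {cls: (f, round(100 * f / n, 1)) for cls, f in counts.items()}


smoking_table = categorical_frequency(smoking)
gender_table = categorical_frequency(gender)

# Cross-classify smoking status by gender (a small contingency table).
cross_table = Counter(zip(gender, smoking))

print("Class  Frequency  Percent")
for cls, (f, pct) in sorted(smoking_table.items()):
    print(f"{cls:<6} {f:<10} {pct}")
for (g, s), f in sorted(cross_table.items()):
    print(f"gender={g} smoking={s}: {f}")
```

The Class, Frequency, and Percent columns correspond to columns A, C, and D of the table layout; the Tally column is simply the hand-counting step that `Counter` replaces.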
Ungrouped Frequency Distribution
30 25 23 41 39 27 41 24 32 29 29 35 31 36 33 36 42
35 37 41
Age (x)        23 24 25 27 29 30 31 32 33 35 36 37 39 41  42
Tally          /  /  /  /  // /  /  /  /  // // /  /  /// /
Frequency (f)  1  1  1  1  2  1  1  1  1  2  2  1  1  3   1
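The same ungrouped distribution can be produced programmatically. This is a small sketch using Python's standard `collections.Counter`; the variable names are illustrative only.

```python
from collections import Counter

# The 20 ages listed above.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

# Count how often each distinct value occurs, then list the values in order.
frequency = Counter(ages)
ungrouped = sorted(frequency.items())

print("Age  Tally  Frequency")
for age, f in ungrouped:
    print(f"{age:<4} {'/' * f:<6} {f}")

# The frequencies must add up to the number of observations.
total = sum(f for _, f in ungrouped)
print("Total", total)
```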
THE FREQUENCY DISTRIBUTION AND THE HISTOGRAM
• Raw data have not been manipulated or treated in any way beyond
their original collection. As such, they will not be arranged or organized
in any meaningful manner.
• When the data are quantitative, two of the ways we can address this
problem are the frequency distribution and the histogram.
• The frequency distribution is a table that divides the data values into
classes and shows the number of observed values that fall into each
class.
• By converting data to a frequency distribution, we gain a perspective
that helps us see the forest instead of the individual trees.
• A more visual representation, the histogram describes a frequency
distribution by using a series of adjacent rectangles, each of which has a
length that is proportional to the frequency of the observations within
the range of values it represents.
• In either case, we have summarized the raw data in a condensed form
that can be readily understood and easily interpreted.
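To make the idea concrete, here is a hedged sketch that groups data into classes and prints a rough text histogram. It reuses the 20 ages from the ungrouped example purely for illustration. The choice of five classes of width 5 starting at 20 is an assumption, and the function name `frequency_distribution` is hypothetical.

```python
def frequency_distribution(values, lower, width, n_classes):
    """Count the values in each class [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        i = int((v - lower) // width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts


# Illustrative data: the 20 ages from the ungrouped example.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

# Assumed classes: 20-24, 25-29, 30-34, 35-39, 40-44.
lower, width, n_classes = 20, 5, 5
counts = frequency_distribution(ages, lower, width, n_classes)

# Text histogram: the length of each bar is proportional to the class frequency.
for i, f in enumerate(counts):
    start = lower + i * width
    print(f"{start}-{start + width - 1}: {'#' * f} ({f})")
```

A different starting point or class width would give a different, equally valid table, so the class choices here are only one option.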
The Frequency Distribution
• We’ll discuss the frequency distribution in the context of a
research study that involves both safety and fuel-efficiency
implications. Data are the speeds (miles per hour) of 105 vehicles
observed along a section of highway where both accidents and
fuel-inefficient speeds have been a problem.
EXAMPLE
Raw Data and Frequency Distribution
• Part A of Table 2.1 lists the raw data consisting of measured
speeds (mph) of 105 vehicles along a section of highway. There
was a wide variety of speeds.
• If we want to learn more from this information by visually
summarizing it, one of the ways is to construct a frequency
distribution like the one shown in part B of the table.
TABLE 2.1: Raw data and frequency distribution for observed speeds of 105 vehicles.
Key Terms
• In generating the frequency distribution in part B of Table 2.1,
several judgmental decisions were involved, but there is no single
“correct” frequency distribution for a given set of data.
• There are a number of guidelines for constructing a frequency
distribution. Before discussing these rules of thumb and their
application, we’ll first define a few key terms upon which they
rely:
• Class Each category of the frequency distribution.
– Too few classes are undesirable because information is lost. On the
other hand, if too many classes are used, the objective of summarization is
not met.
– A commonly followed rule of thumb is 5 ≤ number of classes ≤ 15.
– Sturges' formula suggests $K = 1 + 3.322 \log_{10} n$, where
• K = number of class intervals.
• n = number of values in the data.
– But this should not be regarded as the final answer.
• Frequency The number of data values falling within each class.
• Class limits The boundaries for each class. These determine
which data values are assigned to that class.
• Class interval The width of each class. This is the difference
between the lower limit of the class and the lower limit of the
next higher class.
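– When a frequency distribution is to have equally wide classes, the
approximate width of each class is
$\text{class width} \approx \dfrac{\text{largest value} - \text{smallest value}}{\text{number of classes}}$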
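As a worked illustration of these key terms, the sketch below evaluates Sturges' formula for the 105 vehicle speeds and for the 20 ages from the ungrouped example. It assumes the approximate class width is the range of the data divided by the number of classes, which is the usual rule. Rounding the number of classes up is a choice made here and is not prescribed by the text. The function names are hypothetical.

```python
import math


def sturges_classes(n):
    """Sturges' suggestion for the number of classes: K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


def approximate_class_width(values, n_classes):
    """Assumed rule: range of the data divided by the number of classes."""
    return (max(values) - min(values)) / n_classes


# 105 vehicle speeds: only n is needed for Sturges' formula.
k_speeds = sturges_classes(105)
k_rounded = math.ceil(k_speeds)  # rounding up is a choice made in this sketch
print(f"n = 105: K = {k_speeds:.2f}, use about {k_rounded} classes")

# The 20 ages from the ungrouped frequency distribution example.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]
k_ages = math.ceil(sturges_classes(len(ages)))
width_ages = approximate_class_width(ages, k_ages)
print(f"n = 20: {k_ages} classes, approximate width {width_ages:.2f}")
```

In practice the computed width would usually be rounded to a convenient number such as 5, since the formula is only a starting point.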
• The figure to the left of the divider ( | ) is the stem, and the digits
to the right are referred to as leaves.
• By using the digits in the data values, we have identified five
different categories (30s, 40s, 50s, 60s, and 70s) and can see that
there are three data values in the 30s, two in the 40s, one in the
60s, and one in the 70s.
• Like the frequency distribution, the stem-and-leaf display allows
us to quickly see how the data are arranged.
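A stem-and-leaf display is straightforward to build in code. The data behind the display discussed above are not listed here, so this sketch reuses the 20 ages from the ungrouped frequency distribution example. The function name `stem_and_leaf` is illustrative, and the sketch assumes non-negative whole numbers with the tens digit as the stem.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem), with units digits as leaves."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)


# Illustrative data: the 20 ages from the ungrouped example.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

display = stem_and_leaf(ages)
for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```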
The Dotplot
• The dotplot displays each data value as a dot and allows us to
readily see the shape of the distribution as well as the high and
low values.
Pie Chart
• The pie chart is a circular display divided into sections based on
either the number of observations within or the relative values of
the segments.
• If the pie chart is not computer generated, it can be constructed by
using the principle that a circle contains 360 degrees.
• The angle used for each piece of the pie can be calculated as
follows:
$\text{Angle of sector} = \dfrac{\text{Component part}}{\text{Total}} \times 360^\circ$
• These angles are marked in the circle by means of a protractor to
show the different components.
• The sectors are usually arranged anticlockwise.
• Example: The following table gives the quarterly profit (in millions
of dollars) of a sportswear company over the four quarters of a
year.
Quarter       Profit ($000,000)
1st quarter   100
2nd quarter   300
3rd quarter   500
4th quarter   600
Total         1500
Quarter       Profit ($000,000)   Angle of sector (degrees)   Percent (%)
1st quarter   100                 24                          7
2nd quarter   300                 72                          20
3rd quarter   500                 120                         33
4th quarter   600                 144                         40
Total         1500                360                         100
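The angles and percentages in this table follow directly from the sector-angle rule (component part divided by total, times 360 degrees). A short sketch that reproduces them is given below; the variable names are illustrative, and percentages are rounded to whole numbers as in the table.

```python
# Quarterly profit in millions of dollars, from the table above.
profits = {
    "1st quarter": 100,
    "2nd quarter": 300,
    "3rd quarter": 500,
    "4th quarter": 600,
}
total = sum(profits.values())

# Angle of sector = component part / total * 360 degrees.
angles = {q: 360 * p / total for q, p in profits.items()}
# Percent share of each quarter, rounded to a whole number.
percents = {q: round(100 * p / total) for q, p in profits.items()}

for q in profits:
    print(f"{q}: {profits[q]:>4}  {angles[q]:>5.0f} degrees  {percents[q]:>3}%")
```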
[Figure: pie chart of the quarterly profit shares]
The Bar Chart
• Like the histogram, the bar chart represents frequencies
according to the relative lengths of a set of rectangles, but it
differs in two respects from the histogram:
(1) the histogram is used in representing quantitative data, while
the bar chart represents qualitative data; and
(2) adjacent rectangles in the histogram share a common
side, while those in the bar chart have a gap between them.
• Bar charts can be
– Simple bar chart,
– Multiple bar charts,
– Stratified or stacked bar chart
– Deviation bar chart
Simple Bar Chart
Multiple Bar Chart
[Figure: bar chart comparing Ball, T-shirt, and Shoe for companies X, Y, and Z (horizontal axis: Company)]
Deviation Bar Chart
• Used when the data contain both positive and negative values,
such as data on net profit, net expense, percent change, etc.
• Suppose we have the following data on the net profit (percent)
of some commodities.
Coffee 125
[Figure: deviation bar chart of net profit for Soap, Sugar, and Coffee]
The Line Graph
• The line graph is capable of simultaneously showing values of
two quantitative variables (y, or vertical axis, and x, or
horizontal axis); it consists of linear segments connecting points
observed or measured for each variable.
• When x represents time, the result is a time series view of the y
variable.
• Even more information can be presented if two or more y
variables are graphed together.
The Pictogram
• Using symbols instead of a bar, the pictogram can describe
frequencies or other values of interest.
• The following figure is an example of this method; it was used
by Panamco to describe soft drink sales in Central America over
a 3-year period.
• In the diagram, each truck represents about 12.5 million cases of
soft drink products.
• When setting up a pictogram, the choice of symbols is up to you.
This is an important consideration because the right (or wrong)
symbols can lend nonverbal or emotional content to the display.
• For example, a drawing of a sad child with her arm in a cast
(each symbol representing 10,000 abused children) could help to
emphasize the emotional and social costs of child abuse.
• In the pictogram, the symbols represent frequencies or other
values of interest. This chart shows how soft drink sales
(millions of cases) in Central America increased from 1996
through 1998.
THE SCATTER DIAGRAM
• There are times when we would like to find out whether there is a
relationship between two quantitative variables—for
example, whether sales is related to advertising, whether starting
salary is related to undergraduate grade point average, or whether
the price of a stock is related to the company’s profit per share.
• To examine whether a relationship exists, we can begin with a
graphical device known as the scatter diagram, or scatterplot.
• Think of the scatter diagram as a sort of two-dimensional dotplot.
Each point in the diagram represents a pair of known or observed
values of two variables, generally referred to as y and x, with y
represented along the vertical axis and x represented along the
horizontal axis.
• The two variables are referred to as the dependent (y) and
independent (x) variables, since a typical purpose for this type of
analysis is to estimate or predict what y will be for a given value of
x.
• Once we have drawn a scatter diagram, we can “fit” a line to it in such a
way that the line is a reasonable approximation to the points in the
diagram.
• In viewing the “best-fit” line and the nature of the scatter diagram, we
can tell more about whether the variables are related and, if so, in what
way.
1. A direct (positive) linear relationship between the variables, as shown in
part (a) of Figure 2.6. The best-fit line is linear and has a positive
slope, with both y and x increasing together.
2. An inverse (negative) linear relationship between the variables, as shown
in part (b) of Figure 2.6. The best-fit line is linear and has a negative
slope, with y decreasing as x increases.
3. A curvilinear relationship between the variables, as shown in part (c) of
Figure 2.6. The best-fit line is a curve. As with a linear relationship, a
curvilinear relationship can be either direct (positive) or inverse
(negative).
4. No relationship between the variables, as shown in part (d) of Figure 2.6.
The best-fit line is horizontal, with a slope of zero and, when we view
the scatter diagram, knowing the value of x is of no help whatsoever in
predicting the value of y.
• Example: Ten students sat both a Math and a Stat exam. Their
scores are:
Student   1   2   3   4   5   6   7   8   9  10
Math     56  24  67  70  71  42  48  32  52  80
Stat     65  38  71  72  73  51  56  42  57  82
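One way to judge which of the four relationship types these scores show is to fit a line numerically. The method of fitting is not specified in this example, so the sketch below assumes an ordinary least-squares line with Math as x and Stat as y, and also reports the correlation coefficient r. The function name `least_squares_line` is hypothetical.

```python
import math

# Scores of the ten students, from the table above.
math_scores = [56, 24, 67, 70, 71, 42, 48, 32, 52, 80]
stat_scores = [65, 38, 71, 72, 73, 51, 56, 42, 57, 82]


def least_squares_line(x, y):
    """Return (slope, intercept, r) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = least_squares_line(math_scores, stat_scores)
print(f"Stat = {intercept:.2f} + {slope:.2f} * Math (r = {r:.3f})")
```

The positive slope and an r close to 1 point to a direct (positive) linear relationship: students who scored higher in Math also tended to score higher in Stat.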