
Basic Stat - Chapter 2 Visual Description of Data

Uploaded by

yosefkulidante
Copyright
© © All Rights Reserved

Methods of Data Collection and Presentation

Sources of Data
• Primary data
– data measured or collected by the investigator or the user directly
from the source
– the data you collect is unique to you and your research and,
until you publish, no one else has access to it
– The primary sources of data are objects or persons from
which we collect the figures used for first hand information.
• Secondary data
– second-hand data or information that was
gathered by someone else
– The secondary sources are either published or unpublished
materials or records.
– Some sources of secondary data are: [figure not reproduced]
Methods of Data Collection

• Planning for data collection requires us to
– Identify the source and elements of the data
– Decide whether to take a sample or a census
– If sampling is preferred, decide on the sample size, selection method, etc.
– Decide on the measurement procedure
– Set up the necessary organizational structure
– Collect data using appropriate techniques
Methods of Data Collection
• There are three major methods of data collection.
1) Observational or measurement.
2) Interview with questionnaires.
a. Face to face interview.
b. Telephone interview.
c. Self administered questionnaires returned by mail (mailed
questionnaire).
3) The use of documentary sources
Observation or measurement (direct personal observation)
• In this case data are obtained through direct observation or
measurement. This requires training and monitoring of the measurer
to ensure the use of standard procedures.
• It provides accurate information but is expensive and inconvenient.
• Examples: laboratory tests, clinical measurements, physical
examination, etc.
• Interview with questionnaires: Here one drafts a detailed
questionnaire. The questionnaire can either be mailed to the
respondents for filling in and returning, or be put in the charge of
enumerators who go around and fill it in after obtaining the
desired information.
• Questionnaires: written documents that instruct the reader
or listener to answer the questions written on them.
• Respondents (interviewees): the individuals who answer the
questions on the questionnaire.
• Interviewers: the individuals who record the responses given
by the respondents.
a. Face-to-face interviews (questionnaires in the charge of enumerators)
• The interviewer knows exactly who is responding to the questionnaire.
Advantages
• The interviewer can help the respondent if he/she has difficulty in understanding
the questions. The difficulty could be due to language, concentration or limited
intellectual capacity.
• There is more flexibility in presenting the items; they can range from closed to
open.
• Skip patterns can be used. A skip pattern means skipping a question or a group
of questions that is not applicable.
Disadvantages
• It is costly in terms of time and money.
• Attributes of the interviewer may affect the responses, due to:
a) the bias of the interviewer, and
b) his/her social or ethnic characteristics.
• An untrained interviewer may distort the meaning of the questions.
b. Telephone interviews
Advantages
• It is less expensive in time and money than face-to-face
interviews.
• The interviewer is able to help the respondent if he/she doesn’t
understand the question (as with the face-to-face interview).
• Broadly representative samples can be obtained among those who have
telephone lines.
Disadvantages
• Groups that do not have telephones are under-represented.
• The intended respondent may be substituted by another person.
• Telephone numbers that are not listed in the directory cannot be reached.
c. Self-administered questionnaires returned by mail
(mailed questionnaire)
• Here the questionnaire is mailed to the respondents to be filled in.
This is sometimes known as self-enumeration.
Advantages
• It is the cheapest method.
• There is no need for trained interviewers.
• There is no interviewer bias.
Disadvantages
• Low response rate.
• Incomplete questionnaires due to omissions or invalid responses.
• No assurance that the questionnaire was answered by the right
person.
• Intense follow-up is needed to get a high response rate.
3. The use of documentary sources
• Extracting information from existing sources (e.g. hospital records) is
much less expensive than the other two methods. It can be an important
source of data.
Advantages of secondary data
– Secondary data may help to clarify or redefine the definition of the
problem as part of the exploratory research process.
– Provides a larger database than primary data
– Saves time
– Does not involve collection of data
Disadvantages of secondary data
• It is difficult to get the information needed when records are compiled in
an unstandardized manner.
• Lack of availability
• Inaccurate data
• Lack of relevance
• Insufficient data
Methods of Data Presentation

• The major objectives of data presentation are
– To present data in a visual and more understandable form
– To make the data attractive to the reader
– To facilitate quick comparisons using measures of location and dispersion
– To enable the reader to determine the shape and nature of the distribution, to
make statistical inferences, and to facilitate further statistical analysis
• There are three methods of data presentation
– Tables,
– Diagrams, and
– Graphs
Tabular presentation of data

– Tables are important for summarizing a large volume of data in a
more understandable way.
– Tables can be
• Simple (one-way table): a table that presents one
characteristic, for example age distribution.
• Two-way table: it presents two characteristics in columns
and rows, for example age versus sex.
• Higher-order table: a table that presents three or more
characteristics in one table.
Frequency Distribution

– It is the organization of raw data in table form, using classes
and frequencies.
– Frequency is the number of values in a specific class of the
distribution.
– There are three basic types of frequency distributions
• Categorical frequency distribution
• Ungrouped frequency distribution
• Grouped frequency distribution
• When data have been collected, they are of little use until they have
been organized and represented in a form that helps us understand the
information contained.
• We’ll discuss how raw data are converted to frequency distributions and
visual displays that provide us with a “big picture” of the information
collected.
• By so organizing the data, we can better identify trends, patterns, and
other characteristics that would not be apparent during a simple shuffle
through a pile of questionnaires or other data collection forms.
• Such summarization also helps us compare data that have been
collected at different points in time, by different researchers, or from
different sources.
• It can be very difficult to reach conclusions unless we simplify
the mass of numbers contained in the original data.
• Variables are either quantitative or qualitative.
• In turn, the appropriate methods for representing the data will
depend on whether the variable is quantitative or qualitative.
• The frequency distribution, histogram, stem-and-leaf display,
dot plot, and scatter diagram techniques of this chapter are
applicable to quantitative data, while the contingency table is
used primarily for counts involving qualitative data.
Categorical Frequency Distribution

– The categorical frequency distribution is used for data that
can be placed in specific categories, such as nominal or ordinal
level data.
– The major components of a categorical frequency distribution
are class, tally and frequency (or proportion).
• Percentages can also be used.
– Form of a categorical frequency distribution

A        B        C            D
Class    Tally    Frequency    Percent
• Example: Data on smoking status by gender for a sample of 20
health workers in Jimma Hospital, 1986 E.C., are given below. Construct
a categorical frequency distribution.

Observation      1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
Gender           M  F  M  M  F  F  F  M  M  M  F  F  F  F  M  F  M  F  M  M
Smoking status   Y  N  N  Y  N  N  Y  N  N  N  N  N  N  Y  Y  Y  N  N  Y  Y

Characteristics    Tally             Frequency
Gender
  Male             //// ////         10
  Female           //// ////         10
Smoking status
  No               //// //// //      12
  Yes              //// ///          8
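The tallying in this example can be checked with a short script. This is only a sketch: the function name `categorical_frequency` is made up for illustration, and the percent column is an addition that the slide's table does not show.

```python
from collections import Counter

# Data copied from the Jimma Hospital example (20 health workers).
gender = "M F M M F F F M M M F F F F M F M F M M".split()
smoking = "Y N N Y N N Y N N N N N N Y Y Y N N Y Y".split()


def categorical_frequency(values):
    """Return {category: (frequency, percent)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


gender_table = categorical_frequency(gender)
smoking_table = categorical_frequency(smoking)

for name, table in [("Gender", gender_table), ("Smoking status", smoking_table)]:
    print(name)
    for category, (freq, pct) in sorted(table.items()):
        # One slash per observation stands in for the hand-drawn tally marks.
        print(f"  {category:<3} {'/' * freq:<14} {freq:>3} {pct:6.1f}%")
```

The counts agree with the table above: 10 males, 10 females, 12 non-smokers and 8 smokers.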
Ungrouped Frequency Distribution

– It is a distribution that uses individual data values along with
their frequencies.
– It is often constructed for a small set of data on a discrete variable
(when data are numerical), and when the range of the data is
small.
– An ungrouped frequency distribution becomes unwieldy for a large
mass of data; in that case we use a grouped frequency distribution.
– The major components of this type of frequency distribution
are class, tally, frequency, and cumulative frequency (less
than/more than).
Example: The ages in years of 20 women who attended health
education at Jimma Health Center in 1986 are given as follows.
Construct an ungrouped frequency distribution.

30 25 23 41 39 27 41 24 32 29 29 35 31 36 33 36 42
35 37 41

Age (xj)        23   24   25   27   29   30   31   32   33   35   36   37   39   41   42
Tally           /    /    /    /    //   /    /    /    /    //   //   /    /    ///  /
Frequency (f)   1    1    1    1    2    1    1    1    1    2    2    1    1    3    1
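As a rough check of this example, the sketch below rebuilds the ungrouped distribution and adds the less-than and more-than cumulative frequencies listed among its components. The variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

# Ages of the 20 women in the Jimma Health Center example.
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

counts = Counter(ages)
values = sorted(counts)                      # distinct ages, ascending
freqs = [counts[v] for v in values]

# Less-than type: observations at or below each value.
less_than_cf = list(accumulate(freqs))
# More-than type: observations at or above each value.
more_than_cf = list(accumulate(freqs[::-1]))[::-1]

print("Age  f  <=cf  >=cf")
for v, f, lcf, mcf in zip(values, freqs, less_than_cf, more_than_cf):
    print(f"{v:>3} {f:>2} {lcf:>5} {mcf:>5}")
```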
THE FREQUENCY DISTRIBUTION
AND THE HISTOGRAM
• Raw data have not been manipulated or treated in any way beyond
their original collection. As such, they will not be arranged or organized
in any meaningful manner.
• When the data are quantitative, two of the ways we can address this
problem are the frequency distribution and the histogram.
• The frequency distribution is a table that divides the data values into
classes and shows the number of observed values that fall into each
class.
• By converting data to a frequency distribution, we gain a perspective
that helps us see the forest instead of the individual trees.
• A more visual representation, the histogram describes a frequency
distribution by using a series of adjacent rectangles, each of which has a
length that is proportional to the frequency of the observations within
the range of values it represents.
• In either case, we have summarized the raw data in a condensed form
that can be readily understood and easily interpreted.
The Frequency Distribution
• We’ll discuss the frequency distribution in the context of a
research study that involves both safety and fuel-efficiency
implications. Data are the speeds (miles per hour) of 105 vehicles
observed along a section of highway where both accidents and
fuel-inefficient speeds have been a problem.
EXAMPLE
Raw Data and Frequency Distribution
• Part A of Table 2.1 lists the raw data consisting of measured
speeds (mph) of 105 vehicles along a section of highway. There
was a wide variety of speeds.
• If we want to learn more from this information by visually
summarizing it, one of the ways is to construct a frequency
distribution like the one shown in part B of the table.
TABLE 2.1: Raw data and frequency distribution for observed speeds of 105 vehicles.
Key Terms
• In generating the frequency distribution in part B of Table 2.1,
several judgmental decisions were involved, but there is no single
“correct” frequency distribution for a given set of data.
• There are a number of guidelines for constructing a frequency
distribution. Before discussing these rules of thumb and their
application, we’ll first define a few key terms upon which they
rely:
• Class Each category of the frequency distribution.
– Too few class intervals are undesirable because information is lost. On the
other hand, if too many intervals are used, the objective of summarization is
not met.
– A commonly followed rule of thumb is to use between 5 and 15 classes.
– Sturges’ formula suggests the number of classes: $K = 1 + 3.322 \log_{10} n$.
• K = number of class intervals.
• n = number of values in the data.
– The result should not be regarded as the final answer; it is only a guide.
• Frequency The number of data values falling within each class.
• Class limits The boundaries for each class. These determine
which data values are assigned to that class.
• Class interval The width of each class. This is the difference
between the lower limit of the class and the lower limit of the
next higher class.
– When a frequency distribution is to have equally wide classes, the
approximate width of each class is
$w \approx \dfrac{\text{largest value} - \text{smallest value}}{\text{number of classes}}$
• Class mark The midpoint of each class. This is midway
between the upper and lower class limits.
• Unit of measurement (U)
– This is the smallest possible difference between successive
values. E.g. 1, 0.1, 0.01 …
• Class boundary(CB)
– Separate one class in a grouped frequency distribution from
the other.
– The boundary has one more decimal place than the raw data.
– There is no gap between the upper boundaries of one class
and the lower boundaries of the succeeding class.
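Sturges’ formula and the approximate class width can be sketched in a few lines. The function names are hypothetical, and rounding up is the convention these slides use in their construction steps; other texts round differently.

```python
import math


def sturges_classes(n):
    """Suggested number of classes, K = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(smallest, largest, k, unit=1):
    """Approximate class width, range / K, rounded up to a multiple of `unit`."""
    raw = (largest - smallest) / k
    return math.ceil(raw / unit) * unit


# 20 ages ranging from 23 to 42, as in the Jimma example of this chapter.
k = sturges_classes(20)
w = class_width(23, 42, k)
print(k, w)
```

For the 105 vehicle speeds the same rule suggests 8 classes, which falls inside the 5 to 15 rule of thumb.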
Guidelines for the Frequency Distribution
• In constructing a frequency distribution for a given set of data, the
following guidelines should be observed:
1. The set of classes must be mutually exclusive (i.e., a given data value
can fall into only one class). There should be no overlap between classes,
and limits such as the following would be inappropriate:
– Not allowed, since a value of 60 could fit into either class: 55–60 and 60–65
– Not allowed, since there’s an overlap between the classes: 50–under 55 and
53–under 58
2. The set of classes must be exhaustive (i.e., include all possible data
values). No data values should fall outside the range covered by the
frequency distribution.
3. If possible, the classes should have equal widths. Unequal class widths
make it difficult to interpret both frequency distributions and their
graphical presentations.
4. Selecting the number of classes to use is a subjective process. If
we have too few classes, important characteristics of the data
may be buried within the small number of categories. If there are
too many classes, many categories will contain either zero or a
small number of values. In general, about 5 to 15 classes will be
suitable.
5. Whenever possible, class widths should be round numbers (e.g.,
5, 10, 25, 50, 100). For the highway speed data, selecting a width
of 2.3 mph for each class would enhance neither the visual
attractiveness nor the information value of the frequency
distribution.
6. If possible, avoid using open-end classes. These are classes with
either no lower limit or no upper limit—e.g., 85 mph or more.
Such classes may not always be avoidable, however, since some
data may include just a few values that are either very high or
very low compared to the others.
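The first three guidelines lend themselves to a mechanical check. The sketch below assumes classes written as (lower, upper) pairs in the “lower–under upper” style; `check_classes` is a hypothetical helper, and exhaustiveness is tested only against the observed data passed in, not against every possible value.

```python
def check_classes(classes, data):
    """Check a list of (lower, upper) classes, read as 'lower-under upper'."""
    ordered = sorted(classes)
    # Mutually exclusive: each class starts at or after the previous one ends.
    exclusive = all(prev[1] <= cur[0] for prev, cur in zip(ordered, ordered[1:]))
    # Exhaustive (for this data): every value falls into at least one class.
    exhaustive = all(any(lo <= x < hi for lo, hi in classes) for x in data)
    # Equal widths: all classes span the same distance.
    equal_widths = len({hi - lo for lo, hi in classes}) == 1
    return {
        "mutually_exclusive": exclusive,
        "exhaustive": exhaustive,
        "equal_widths": equal_widths,
    }


# The overlapping classes from guideline 1, with one made-up speed.
print(check_classes([(50, 55), (53, 58)], [52]))
# Well-formed classes with made-up speeds.
print(check_classes([(45, 50), (50, 55), (55, 60)], [47, 50, 59]))
```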
Relative and Cumulative Frequency Distributions
• Relative Frequency Distribution. Another useful approach to data
expression is the relative frequency distribution, which describes the
proportion or percentage of data values that fall within each category.
• The relative frequency distribution for the speed data is shown in Table
2.2; for example, of the 105 motorists, 15 of them (14.3%) were in the
55–under 60 class.
• Relative frequencies can be useful in comparing two groups of unequal
size, since the actual frequencies would tend to be greater for each class
within the larger group than for a class in the smaller one.
• For example, if a frequency distribution of incomes for 100 physicians is
compared with a frequency distribution for 500 business executives, more
executives than physicians would be likely to fall into a given class.
Relative frequency distributions would convert the groups to the same
size: 100 percentage points each.
• Relative frequencies will play an important role in our discussion of
probabilities in Chapter 4.
• Cumulative Frequency Distribution. Another approach to the
frequency distribution is to list the number of observations that
are within or below each of the classes. This is known as a
cumulative frequency distribution.
• When cumulative frequencies are divided by the total number of
observations, the result is a cumulative relative frequency
distribution.
• The “Cumulative Relative Frequency (%)” column in Table 2.2
shows the cumulative relative frequencies for the speed data in
Table 2.1.
• Examining this column, we can readily see that 62.85% of the
motorists had a speed less than 70 mph.
• Cumulative percentages can also operate in the other direction
(i.e., “greater than or within”). Based on Table 2.2, we can
determine that 90.48% of the 105 motorists had a speed of at least
55 mph.
TABLE 2.2: Frequencies, relative frequencies, cumulative frequencies, and
cumulative relative frequencies for the speed data of Table 2.1.
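Since the full speed table is not reproduced here, the sketch below illustrates the same calculations on the grouped age frequencies from the Jimma example later in this chapter. The variable names are illustrative, and the last line only checks the one relative frequency quoted for the speed data.

```python
from itertools import accumulate

# Grouped age data from the Jimma example (not the speed data).
classes = ["23-26", "27-30", "31-34", "35-38", "39-42"]
freqs = [3, 4, 3, 5, 5]

total = sum(freqs)
rel_pct = [100 * f / total for f in freqs]     # relative frequency (%)
cum = list(accumulate(freqs))                  # "within or below" each class
cum_pct = [100 * c / total for c in cum]       # cumulative relative frequency (%)
# "Greater than or within": observations in this class or any higher class.
at_least = [total - c + f for c, f in zip(cum, freqs)]

for row in zip(classes, freqs, rel_pct, cum, cum_pct, at_least):
    print("{:<6} {:>2} {:>6.1f} {:>3} {:>6.1f} {:>3}".format(*row))

# The speed example quotes 15 of 105 motorists as 14.3%.
speed_share = round(100 * 15 / 105, 1)
```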
Steps to construct a grouped frequency distribution
• Find the smallest (S) and largest (L) values in the data.
• Compute the range, R = L − S.
• Determine the number of classes, K, using Sturges’ rule; round up.
• Determine the class width, w = R/K; round up.
• Take the smallest value as the lower class limit of the first class, and add the
class width repeatedly to get the lower limits of the following classes.
• To get the upper class limit of the first class, subtract the unit of measurement
from the lower limit of the second class; add the class width repeatedly to get the
remaining upper class limits.
• Subtract half of the unit of measurement from each lower class limit to get the
lower class boundary, and add half of the unit of measurement to each upper class
limit to get the upper class boundary.
• Tally the data.
• Find the cumulative frequencies.
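Example: The ages in years of 20 women who attended health
education at Jimma Health Center in 1986 are given as follows.
Construct a grouped frequency distribution.
30 25 23 41 39 27 41 24 32 29 29 35 31 36 33 36 42
35 37 41
n = 20
K = 1 + 3.322(log 20) = 1 + 3.322(1.3010) ≈ 5.32, rounded up to K = 6
w = (42 − 23)/6 ≈ 3.17, rounded up to w = 4
With w = 4, the five classes from 23–26 to 39–42 already cover all the data, so the
sixth class (43–46) would be empty and is omitted.
The grouped frequency table using Sturges’ formula
(cf = less-than cumulative frequency, Mcf = more-than cumulative frequency)
Class   Frequency (f)   Class boundaries   cf   Class mark   Relative frequency   Mcf
23-26   3               22.5-26.5          3    24.5         0.15                 20
27-30   4               26.5-30.5          7    28.5         0.20                 17
31-34   3               30.5-34.5          10   32.5         0.15                 13
35-38   5               34.5-38.5          15   36.5         0.25                 10
39-42   5               38.5-42.5          20   40.5         0.25                 5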
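The construction steps can be followed mechanically for the age data. This is a sketch under two assumptions: the unit of measurement is one year, and the class width of 4 is read off the classes in the table rather than recomputed.

```python
ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

unit = 1    # unit of measurement: ages are recorded in whole years
width = 4   # class width implied by the classes 23-26, 27-30, ...

rows = []
lower = min(ages)                  # first lower class limit = smallest value
while lower <= max(ages):
    upper = lower + width - unit   # upper class limit
    rows.append({
        "limits": (lower, upper),
        "boundaries": (lower - unit / 2, upper + unit / 2),
        "mark": (lower + upper) / 2,
        "f": sum(lower <= a <= upper for a in ages),   # tally
    })
    lower += width

n = len(ages)
running = 0
for row in rows:
    running += row["f"]
    row["cf"] = running                    # less-than cumulative frequency
    row["mcf"] = n - running + row["f"]    # more-than cumulative frequency
    row["rf"] = row["f"] / n               # relative frequency

for row in rows:
    print(row)
```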
The Histogram
• The histogram describes a frequency distribution by using a series
of adjacent rectangles, each of which has a length proportional to
either the frequency or the relative frequency of the class it
represents.
• The histogram in part (a) of Figure 2.1 is based on the speed-
measurement data summarized in Table 2.1.
• The lower class limits (e.g., 45 mph, 50 mph, 55 mph, and so on)
have been used in constructing the horizontal axis of the histogram.
• The tallest rectangle in part (a) of Figure 2.1 is associated with the
60–under 65 class of Table 2.1, identifying this as the class having
the greatest number of observations.
• The relative heights of the rectangles visually demonstrate how the
frequencies tend to drop off as we proceed from the 60–under 65
class to the 65–under 70 class and higher.
The Frequency Polygon
• Closely related to the histogram, the frequency polygon consists
of line segments connecting the points formed by the
intersections of the class marks with the class frequencies.
• Relative frequencies or percentages may also be used in
constructing the figure.
• Empty classes are included at each end so the curve will
intersect the horizontal axis.
• For the speed-measurement data in Table 2.1, these are the 40–
under 45 and 90–under 95 classes. (Note: Had this been a
distribution for which the first nonempty class was “0 but under
5,” the empty class at the left would have been “−5 but under
0.”)
• The frequency polygon for the speed-measurement data is
shown in part (b) of Figure 2.1.
• Compared to the histogram, the frequency polygon is more
realistic in that the number of observations increases or decreases
more gradually across the various classes.
• The two endpoints make the diagram more complete by allowing
the frequencies to taper off to zero at both ends.
• Related to the frequency polygon is the ogive, a graphical display
providing cumulative values for frequencies, relative frequencies,
or percentages.
• These values can be either “greater than” or “less than.”
• The ogive diagram in part (c) of Figure 2.1 shows the percentage
of observations that are less than the upper limit of each class.
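Because Figure 2.1 and the speed table are not reproduced here, the sketch below computes the points one would plot for a frequency polygon and a less-than ogive, using the grouped age data from the Jimma example instead. The variable names are illustrative, and no plotting library is assumed.

```python
from itertools import accumulate

# Grouped age data from the Jimma example: class marks, frequencies, width 4.
marks = [24.5, 28.5, 32.5, 36.5, 40.5]
freqs = [3, 4, 3, 5, 5]
width = 4

# Frequency polygon: class marks against frequencies, with an empty class
# added at each end so the line returns to the horizontal axis.
polygon_x = [marks[0] - width] + marks + [marks[-1] + width]
polygon_y = [0] + freqs + [0]

# Less-than ogive: cumulative percentage at each upper class boundary,
# starting from 0% at the lower boundary of the first class.
total = sum(freqs)
boundaries = [marks[0] - width / 2] + [m + width / 2 for m in marks]
cum_pct = [0.0] + [100 * c / total for c in accumulate(freqs)]
ogive = list(zip(boundaries, cum_pct))

print(list(zip(polygon_x, polygon_y)))
print(ogive)
```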
THE STEM-AND-LEAF DISPLAY
AND THE DOTPLOT
The Stem-and-Leaf Display
• The stem-and-leaf display, a variant of the frequency
distribution, uses a subset of the original digits as class
descriptors.
• The technique is best explained through a few examples. The
raw data are the numbers of Congressional bills vetoed during
the administrations of seven U.S. presidents, from Johnson to
Clinton.
• In stem-and-leaf terms, we could describe these data as follows:

• The figure to the left of the divider ( | ) is the stem, and the digits
to the right are referred to as leaves.
• By using the digits in the data values, we have identified five
different categories (30s, 40s, 50s, 60s, and 70s) and can see that
there are three data values in the 30s, two in the 40s, one in the
60s, and one in the 70s.
• Like the frequency distribution, the stem-and-leaf display allows
us to quickly see how the data are arranged.
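A stem-and-leaf display is easy to generate for two-digit data. Since the veto counts themselves are not listed in this text, the sketch below uses the ages from the Jimma example instead; `stem_and_leaf` is a hypothetical helper that takes the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

display = stem_and_leaf(ages)
# Print every stem in the range, so a stem with no leaves still shows up.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(d) for d in display.get(stem, []))
    print(f"{stem} | {leaves}")
```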
The Dotplot
• The dotplot displays each data value as a dot and allows us to
readily see the shape of the distribution as well as the high and
low values.
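A text-only dotplot can be sketched as one row per possible value with one dot per observation. The example reuses the ages from the Jimma example; `dotplot_lines` is a made-up name, and the rows are printed sideways rather than along a horizontal axis.

```python
from collections import Counter


def dotplot_lines(values):
    """One text row per integer value in the data range, one dot per observation."""
    counts = Counter(values)
    return [f"{v:>3} | " + "." * counts[v]
            for v in range(min(values), max(values) + 1)]


ages = [30, 25, 23, 41, 39, 27, 41, 24, 32, 29,
        29, 35, 31, 36, 33, 36, 42, 35, 37, 41]

lines = dotplot_lines(ages)
print("\n".join(lines))
```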
Pie Chart
• The pie chart is a circular display divided into sections based on
either the number of observations within or the relative values of
the segments.
• If the pie chart is not computer generated, it can be constructed by
using the principle that a circle contains 360 degrees.
• The angle used for each piece of the pie can be calculated as
follows:
$\text{Angle of sector} = \dfrac{\text{Component part}}{\text{Total}} \times 360^\circ$
• These angles are marked in the circle by means of a protractor to
show the different components.
• The sectors are usually arranged anticlockwise.
• Example: The following table gives a sportswear company’s quarterly
profit (in millions of dollars) in the four quarters of a year.

Quarter        Profit ($ millions)
1st quarter    100
2nd quarter    300
3rd quarter    500
4th quarter    600
Total          1500

– Construct a pie chart
Quarter        Profit ($ millions)   Angle of sector (degrees)   Percent (%)
1st quarter    100                   24                          7
2nd quarter    300                   72                          20
3rd quarter    500                   120                         33
4th quarter    600                   144                         40
Total          1500                  360                         100

[Figure: pie chart of quarterly profit; 1st quarter 7%, 2nd quarter 20%, 3rd quarter 33%, 4th quarter 40%]
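The sector angles and percentages in this example follow directly from the angle formula. A minimal sketch, with illustrative variable names:

```python
# Quarterly profit in millions of dollars, from the example table.
profits = {
    "1st quarter": 100,
    "2nd quarter": 300,
    "3rd quarter": 500,
    "4th quarter": 600,
}

total = sum(profits.values())
# Angle of sector = component part / total * 360 degrees.
angles = {q: 360 * p / total for q, p in profits.items()}
# Percentages rounded to whole numbers, as in the table.
percents = {q: round(100 * p / total) for q, p in profits.items()}

for q in profits:
    print(f"{q:<12} {profits[q]:>5} {angles[q]:>6.0f} {percents[q]:>4}%")
```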
The Bar Chart
• Like the histogram, the bar chart represents frequencies
according to the relative lengths of a set of rectangles, but it
differs in two respects from the histogram:
(1) the histogram is used in representing quantitative data, while
the bar chart represents qualitative data; and
(2) adjacent rectangles in the histogram share a common
side, while those in the bar chart have a gap between them.
• Bar charts can be
– Simple bar chart,
– Multiple bar charts,
– Stratified or stacked bar chart
– Deviation bar chart
Simple Bar Chart

• Used to represent data involving only one variable classified on a
spatial, quantitative or temporal basis
• Make bars of equal width but variable length
• Example (Sports Wear company quarterly sales)
Multiple Bar Chart

• Used when two or more interrelated series of data are depicted in one bar
diagram
• Make bars of equal width but variable length
• Example: Suppose we have export and import figures (in millions)
for a company working on minerals for a few years.
[Figure: multiple bar chart of Export and Import for 2010, 2011 and 2012; vertical axis from 0 to 70]
Stratified/Stacked Bar Chart

• Used to represent data in which the total magnitude is
divided into different parts or components.
• First make simple bars for each class taking the total
magnitude in that class, and then divide these simple bars
into parts in the ratio of the various components.
• Shows the variation in the different components within each
class as well as between different classes.
• A stratified bar diagram is also known as a component bar
chart.
• Example: The table below shows the profit of three companies ($
millions) from sales of different items in the 1st quarter of the year. Draw
a stratified/stacked bar chart.
Company   Shoe   T-shirt   Ball   Total
X         30     50        40     120
Y         33     16        27     76
Z         37     13        37     87
[Figure: stacked bar chart of sales in $ millions (vertical axis, 0 to 140) by company (X, Y, Z), with segments for Shoe, T-shirt and Ball]
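To draw a stacked bar, each component has to start where the previous one ends. The sketch below computes those start and end heights for the table in this example; `stacked_segments` is a hypothetical helper, and the stacking order (Shoe, then T-shirt, then Ball) is an assumption.

```python
items = ["Shoe", "T-shirt", "Ball"]
sales = {"X": [30, 50, 40], "Y": [33, 16, 27], "Z": [37, 13, 37]}


def stacked_segments(components):
    """Return (bottom, top) heights for components stacked in the given order."""
    segments, bottom = [], 0
    for value in components:
        segments.append((bottom, bottom + value))
        bottom += value
    return segments


bars = {company: stacked_segments(values) for company, values in sales.items()}
for company, segments in bars.items():
    total = segments[-1][1]   # top of the last segment is the bar's total
    print(company, dict(zip(items, segments)), "total:", total)
```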
Deviation Bar Chart

• Used when the data contain both positive and negative values,
such as data on net profit, net expense, percent change, etc.
• Suppose we have the following data relating to the net profit (percent)
of three commodities.

Commodity   Net profit
Soap        80
Sugar       -95
Coffee      125

[Figure: deviation bar chart of net profit for Soap, Sugar and Coffee; vertical axis from -150 to 150]
The Line Graph
• The line graph is capable of simultaneously showing values of
two quantitative variables (y, or vertical axis, and x, or
horizontal axis); it consists of linear segments connecting points
observed or measured for each variable.
• When x represents time, the result is a time series view of the y
variable.
• Even more information can be presented if two or more y
variables are graphed together.
The Pictogram
• Using symbols instead of a bar, the pictogram can describe
frequencies or other values of interest.
• The following figure is an example of this method; it was used
by Panamco to describe soft drink sales in Central America over
a 3-year period.
• In the diagram, each truck represents about 12.5 million cases of
soft drink products.
• When setting up a pictogram, the choice of symbols is up to you.
This is an important consideration because the right (or wrong)
symbols can lend nonverbal or emotional content to the display.
• For example, a drawing of a sad child with her arm in a cast
(each symbol representing 10,000 abused children) could help to
emphasize the emotional and social costs of child abuse.
• In the pictogram, the symbols represent frequencies or other
values of interest. This chart shows how soft drink sales
(millions of cases) in Central America increased from 1996
through 1998.
THE SCATTER DIAGRAM
• There are times when we would like to find out whether there is a
relationship between two quantitative variables—for
example, whether sales is related to advertising, whether starting
salary is related to undergraduate grade point average, or whether
the price of a stock is related to the company’s profit per share.
• To examine whether a relationship exists, we can begin with a
graphical device known as the scatter diagram, or scatterplot.
• Think of the scatter diagram as a sort of two-dimensional dotplot.
Each point in the diagram represents a pair of known or observed
values of two variables, generally referred to as y and x, with y
represented along the vertical axis and x represented along the
horizontal axis.
• The two variables are referred to as the dependent (y) and
independent (x) variables, since a typical purpose for this type of
analysis is to estimate or predict what y will be for a given value of
x.
• Once we have drawn a scatter diagram, we can “fit” a line to it in such a
way that the line is a reasonable approximation to the points in the
diagram.
• In viewing the “best-fit” line and the nature of the scatter diagram, we
can tell more about whether the variables are related and, if so, in what
way.
1. A direct (positive) linear relationship between the variables, as shown in
part (a) of Figure 2.6. The best-fit line is linear and has a positive
slope, with both y and x increasing together.
2. An inverse (negative) linear relationship between the variables, as shown
in part (b) of Figure 2.6. The best-fit line is linear and has a negative
slope, with y decreasing as x increases.
3. A curvilinear relationship between the variables, as shown in part (c) of
Figure 2.6. The best-fit line is a curve. As with a linear relationship, a
curvilinear relationship can be either direct (positive) or inverse
(negative).
4. No relationship between the variables, as shown in part (d) of Figure 2.6.
The best-fit line is horizontal, with a slope of zero and, when we view
the scatter diagram, knowing the value of x is of no help whatsoever in
predicting the value of y.
• Example: Ten students sat both a Math and a Stat exam; here are
their scores:

Student   1   2   3   4   5   6   7   8   9   10
Math      56  24  67  70  71  42  48  32  52  80
Stat      65  38  71  72  73  51  56  42  57  82
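The slides do not say how the best-fit line is obtained. One common choice, assumed here, is ordinary least squares, with the Math score as x and the Stat score as y. The function name `least_squares` is illustrative.

```python
math_scores = [56, 24, 67, 70, 71, 42, 48, 32, 52, 80]   # x
stat_scores = [65, 38, 71, 72, 73, 51, 56, 42, 57, 82]   # y


def least_squares(x, y):
    """Slope and intercept of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares(math_scores, stat_scores)
print(f"Stat = {intercept:.2f} + {slope:.3f} * Math")
```

The slope comes out positive, so these scores show a direct (positive) relationship: students with higher Math scores tend to have higher Stat scores.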
