2 Types of Data

Uploaded by

Krishna Dembla
Copyright
© © All Rights Reserved

Classification and Types of Data

Introduction
• In previous sessions we pointed out that statistics is divided into two basic areas: descriptive
statistics and inferential statistics
• Business firms/Managers frequently have access to large masses of potentially useful data.
• Once the data are organized and summarized, they can be used to support decisions.
• Example: Consider firms that have access to the databases created by the use of
debit cards. Each database contains the personal information supplied by the customer when he
or she applied for the debit card.
• This information includes age, gender, residence, and the cardholder’s income.
• In addition, each time the card is used the database grows to include a history of the timing,
price, and brand of each product purchased.
• Using the appropriate statistical technique, managers can determine which segments of the
market are buying their company’s brands.
• Specialized marketing campaigns, including telemarketing, can be developed.
• Both descriptive and inferential statistics would likely be employed in the analysis
Introduction
• Descriptive statistics involves arranging, summarizing, and presenting a set of data in such a way
that useful information is produced.
• Its methods make use of graphical techniques and numerical descriptive measures (such as
averages) to summarize and present the data, allowing managers to make decisions based on the
information generated.
• According to a Wharton Business School study, top managers reach a consensus 25% more
quickly when responding to a presentation in which graphics are used.
• Recall that a population is the entire set of observations under study, whereas a sample is a
subset of a population
• Descriptive methods can be applied to both a set of data constituting a population and a set of
data constituting a sample
• A critical part of learning statistics is understanding not only how to draw graphs and calculate
statistics (manually or by computer) but also when to use each technique that we cover.
• The two most important factors that determine the appropriate method to use are (1) the type
of data and (2) the information that is needed
Types of Data

• The objective of statistics is to extract information from data.


• There are different types of data and information.
• To help explain this important principle, we need to define some
terms.
Variable
• A variable is some characteristic of a population or sample.
• For example, the mark on a statistics exam is a characteristic of
statistics exams that is certainly of interest to the students.
• Not all students achieve the same mark.
• The marks will vary from student to student, thus the name variable.
• The price of a stock is another variable.
• The prices of most stocks vary daily.
• We usually represent the name of the variable using uppercase letters
such as X, Y , and Z.
Values

• The values of the variable are the possible observations of the variable.
• The values of statistics exam marks are the integers between 0 and
100 (assuming the exam is marked out of 100).
• The values of a stock price are real numbers that are usually
measured in dollars and cents (sometimes in fractions of a cent).
• The values range from 0 to hundreds of dollars.
Data
• Data are the observed values of a variable.
• For example, suppose that we observe the following midterm test marks
of 10 students:
67 74 71 83 93 55 48 82 68 62
• These are the data from which we will extract the information we seek.
• Incidentally, data is the plural of datum.
• The mark of one student is a datum.
• When most people think of data, they think of sets of numbers.
• However, there are three types of data: interval, nominal, and ordinal
Interval Data/Quantitative/Numerical
• Interval data are real numbers, such as heights, weights, incomes,
and distances.
• We also refer to this type of data as quantitative or numerical.
Nominal Data/Qualitative/Categorical
• The values of nominal data are categories.
• For example, responses to questions about marital status produce nominal data.
• The values of this variable are single, married, divorced, and widowed.
• Notice that the values are not numbers but instead are words that describe the categories.
• We often record nominal data by arbitrarily assigning a number to each category.
• For example, we could record marital status using the following codes:
Single = 1, married = 2, divorced = 3, widowed = 4
• However, any other numbering system is valid provided that each category has a
different number assigned to it.
• Here is another coding system that is just as valid as the previous one.
Single = 7, married = 4, divorced = 13, widowed = 1
• Nominal data are also called qualitative or categorical.
Ordinal Data
• The third type of data is ordinal.
• Ordinal data appear to be nominal, but the difference is that the order of their values
has meaning.
• For example, at the end of a course you are asked to evaluate the course and the anchor.
• The variables are the ratings of various aspects of the course and the anchor. The values
are:
poor, fair, good, very good, and excellent
• The difference between nominal and ordinal types of data is that the order of the values
of the latter has meaning: here, each successive value indicates a higher rating.
• Consequently, when assigning codes to the values, we should maintain the order of the
values. For example, we can record the students’ evaluations as
Poor = 1, Fair = 2, Good = 3, Very good = 4, Excellent = 5
• We can use any set of codes that are in order. It is not the magnitude of the values that
is important; it is their order.
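A minimal sketch of why only the order of ordinal codes matters. The second coding and the seven responses are hypothetical and chosen purely for illustration. Ranking the responses gives the same middle rating under both codings, while the average of the codes changes with the coding.

```python
# Two codings of the same ordinal scale. Both preserve the order
# poor < fair < good < very good < excellent.
codes_a = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
# Hypothetical alternative coding (assumption, not from the slides).
codes_b = {"poor": 6, "fair": 18, "good": 23, "very good": 45, "excellent": 88}

# Hypothetical evaluations from seven students (illustration only).
responses = ["good", "excellent", "fair", "good", "very good", "poor", "good"]


def middle_rating(responses, codes):
    """Rank the responses by their codes and return the middle one."""
    ranked = sorted(responses, key=lambda r: codes[r])
    return ranked[len(ranked) // 2]


def mean_code(responses, codes):
    """Average of the numeric codes (depends on the coding chosen)."""
    return sum(codes[r] for r in responses) / len(responses)


median_a = middle_rating(responses, codes_a)
median_b = middle_rating(responses, codes_b)
mean_a = mean_code(responses, codes_a)
mean_b = mean_code(responses, codes_b)

print(median_a, median_b)  # same rating under both codings
print(mean_a, mean_b)      # different numbers under each coding
```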
Calculations for Interval Data

• All calculations are permitted on interval data.


• We often describe a set of interval data by calculating the average.
• For example, the average of the 10 marks listed in the previous example
(67 74 71 83 93 55 48 82 68 62) is 70.3.
• There are many other important statistics we discuss in upcoming
units
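The average quoted above can be checked in a few lines. This is only a sketch using Python's standard library; the variable names are our own.

```python
from statistics import mean

# The 10 midterm marks from the earlier example (interval data).
marks = [67, 74, 71, 83, 93, 55, 48, 82, 68, 62]

# The sum is 703, so the average is 703 / 10.
average = mean(marks)
print(average)
```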
Calculations for Nominal Data
• Because the codes of nominal data are completely arbitrary, we cannot perform
any calculations on these codes.
• To understand why, consider a survey that asks people to report their marital
status.
• Suppose that the first 10 people surveyed gave the following responses:
Single, Married, Married, Married, Widowed,
Single, Married, Married, Single, Divorced
• Using the codes
Single = 1, Married = 2, Divorced = 3, Widowed = 4
• we would record these responses as
1 2 2 2 4 1 2 2 1 3
• The average of these numerical codes is 2.0.
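To see why this average carries no meaning, the sketch below recodes the same 10 responses with the second coding system shown earlier (Single = 7, Married = 4, Divorced = 13, Widowed = 1). The responses are unchanged, yet the "average" moves from 2.0 to 5.5. The function name is our own.

```python
# The first 10 survey responses, as listed above.
responses = [
    "Single", "Married", "Married", "Married", "Widowed",
    "Single", "Married", "Married", "Single", "Divorced",
]

# Two equally valid codings of the same nominal categories.
codes_1 = {"Single": 1, "Married": 2, "Divorced": 3, "Widowed": 4}
codes_2 = {"Single": 7, "Married": 4, "Divorced": 13, "Widowed": 1}


def average_code(responses, codes):
    """Average of the numeric codes assigned to the responses."""
    return sum(codes[r] for r in responses) / len(responses)


avg_1 = average_code(responses, codes_1)
avg_2 = average_code(responses, codes_2)
print(avg_1, avg_2)  # the result depends only on the arbitrary coding
```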
Calculations for Nominal Data
• This example illustrates a fundamental truth about nominal data:
Calculations based on the codes used to store this type of data are
meaningless.
• All that we are permitted to do with nominal data is count or
compute the percentages of the occurrences of each category.
• Thus, we would describe a set of 14 observations (the 10 coded responses above plus
four more) by counting the number of each marital status category and reporting the frequency.
1 2 2 2 4 1 2 2 1 3 4 4 4 3
Category    Code    Frequency
Single      1       3
Married     2       5
Divorced    3       2
Widowed     4       4
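The counting step can be sketched as follows, using the 14 coded observations above. The relative frequencies (proportions) are added as a convenience. The variable names are our own.

```python
from collections import Counter

# The 14 coded marital status observations.
codes = [1, 2, 2, 2, 4, 1, 2, 2, 1, 3, 4, 4, 4, 3]
labels = {1: "Single", 2: "Married", 3: "Divorced", 4: "Widowed"}

# Counting occurrences is the only calculation permitted on nominal data.
frequency = Counter(codes)
n = len(codes)
relative = {code: frequency[code] / n for code in sorted(frequency)}

for code in sorted(frequency):
    print(f"{labels[code]:<10} {code}  {frequency[code]}  {relative[code]:.3f}")
```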
Hierarchy of Data
• The data types can be placed in order of the permissible calculations.
• At the top of the list, we place the interval data type because virtually all
computations are allowed.
• The nominal data type is at the bottom because no calculations other than
determining frequencies are permitted. (We are permitted to perform
calculations using the frequencies of codes, but this differs from
performing calculations on the codes themselves.)
• In between interval and nominal data lies the ordinal data type.
• Permissible calculations are ones that rank the data.
• Higher-level data types may be treated as lower-level ones.
• For example, in universities and colleges, we convert the marks in a
course, which are interval, to letter grades, which are ordinal.
Hierarchy of Data
• Some graduate courses feature only a pass or fail designation.
• In this case, the interval data are converted to nominal.
• It is important to point out that when we convert higher-level data
to lower-level data, we lose information.
• For example, a mark of 89 on an accounting course exam gives far more
information about the performance of that student than does a letter grade
of B, which might be the letter grade for marks between 80 and 90.
• As a result, we do not convert data unless it is necessary to do so.
• It is also important to note that we cannot treat lower-level data
types as higher level types.
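A minimal sketch of these conversions. Only the B band (marks between 80 and 90) comes from the example above; the other letter cutoffs and the pass mark of 50 are assumptions made for illustration.

```python
def to_letter(mark):
    """Convert an interval mark to an ordinal letter grade.

    Cutoffs other than the B band are hypothetical.
    """
    if mark >= 90:
        return "A"
    if mark >= 80:
        return "B"
    if mark >= 70:
        return "C"
    if mark >= 60:
        return "D"
    return "F"


def to_pass_fail(mark, pass_mark=50):
    """Convert an interval mark to a nominal pass/fail designation.

    The pass mark of 50 is an assumption.
    """
    return "Pass" if mark >= pass_mark else "Fail"


# Two different marks collapse into the same letter grade, so the
# difference between them cannot be recovered from the grade alone.
print(to_letter(89), to_letter(81))
print(to_pass_fail(89), to_pass_fail(48))
```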
Describing a Set of Nominal Data
• As we discussed, the only allowable calculation on nominal data is to
count the frequency or compute the percentage that each value of
the variable represents.
• We can summarize the data in a table, which presents the categories
and their counts, called a frequency distribution.
• A relative frequency distribution lists the categories and the
proportion with which each occurs.
• We can use graphical techniques to present a picture of the data.
• There are two graphical methods we can use: the bar chart and the
pie chart.
Example: Work Status in the GSS 2014 Survey
• A major problem with the official unemployment rate is that it excludes people who have given
up trying to find a job even though they would like to find employment.
• In an effort to track the numbers of Americans in the various categories of work status the GSS
(general social survey) asked “Last week were you working full time, part time, going to school,
keeping house, or what?”
• The responses are as follows:
1. Working full time
2. Working part time
3. Temporarily not working
4. Unemployed, laid off
5. Retired
6. School
7. Keeping house
8. Other
Example: Work Status in the GSS 2014 Survey
• The responses were recorded using the codes 1,2,3,4,5,6,7, and 8, respectively. The first 150 observations are listed here.
• The name of the variable is WRKSTAT and is stored in Column X.
• Construct a frequency and relative frequency distribution for these data and graphically summarize the data by producing a bar
chart and a pie chart.
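1 7 1 7 2 1 2 8 1 1 1 1 7 1 1
1 8 7 1 5 1 1 5 4 5 1 1 5 1 1
4 5 7 3 8 1 8 1 4 8 1 1 2 5 1
2 1 6 1 1 7 1 1 7 2 1 1 5 3 1
5 6 1 1 1 1 1 1 7 4 1 7 1 8 5
1 2 2 1 1 2 1 7 7 3 1 1 5 1 3
9 2 1 1 1 5 1 1 4 3 5 7 5 1 1
1 1 1 1 1 1 1 1 1 1 5 1 1 5 7
8 1 1 2 1 1 1 7 1 1 1 1 1 1 1
1 1 1 1 1 7 1 8 1 4 1 1 5 1 1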
Frequency and Relative Frequency: Solution
• Scan the data.
• If we had listed all 2,536 observations, you would be even less likely to discover anything
useful about the data.
• To extract useful information requires the application of a statistical or graphical
technique.
• To choose the appropriate technique we must first identify the type of data.
• In this example the data are nominal because the numbers represent categories.
• The only calculation permitted on nominal data is to count the number of occurrences
of each category (Frequency)
• Hence, we count the number of 1s, 2s, 3s, 4s, 5s, 6s, 7s, and 8s.
• The list of the categories and their counts constitutes the frequency distribution.
• The relative frequency distribution is produced by converting the frequencies into
proportions
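The counting procedure can be sketched on the 150 listed observations. This is an illustration only: it covers the listed subset, not the full 2,536 observations, so its proportions are not the ones reported for the whole survey. Each row of the listing is packed into a string of single-digit codes. One value in the listing (a 9) falls outside the eight listed codes; the sketch labels it "Unlisted code", which is our own label.

```python
from collections import Counter

# The 150 listed WRKSTAT observations, one string of digits per row.
rows = [
    "171721281111711",
    "187151154511511",
    "457381814811251",
    "216117117211531",
    "561111117417185",
    "122112177311513",
    "921115114357511",
    "111111111151157",
    "811211171111111",
    "111117181411511",
]

labels = {
    1: "Working full time", 2: "Working part time",
    3: "Temporarily not working", 4: "Unemployed, laid off",
    5: "Retired", 6: "School", 7: "Keeping house", 8: "Other",
}

observations = [int(digit) for row in rows for digit in row]
n = len(observations)

# Frequency distribution: count the occurrences of each code.
frequency = Counter(observations)
# Relative frequency distribution: convert the counts into proportions.
relative = {code: count / n for code, count in sorted(frequency.items())}

for code, proportion in relative.items():
    name = labels.get(code, "Unlisted code")
    print(f"{code}  {name:<24} {frequency[code]:>4}  {proportion:.3f}")
```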
Frequency and Relative Frequency: Solution
• The frequency and relative frequency distributions are combined in the table below.
• Two individuals refused to answer, resulting in a total of 2,536 observations.
• The Excel output and instructions that follow show how to produce the frequency and
relative frequency distributions and, specifically, how to get the results shown in the table.
Frequency and Relative Frequency: Interpret

• Only 48.5% of respondents were working full time, 18.1% were retired,
10.4% were keeping house, 10.8% were working part time, and the remaining
12.2% were in one of the other four categories.
Bar and Pie Charts
• The information contained in the data is summarized well in the table.
• However, graphical techniques generally catch a reader’s eye more quickly
than does a table of numbers.
• Two graphical techniques can be used to display the results shown in the
table.
• A bar chart is often used to display frequencies
• A pie chart graphically shows relative frequencies
• The bar chart is created by drawing a rectangle representing each category.
• The height of the rectangle represents the frequency. The width of its base is
arbitrary.
Bar Chart
Pie Chart
• If we wish to emphasize the relative frequencies instead of drawing the
bar chart, we draw a pie chart.
• A pie chart is simply a circle subdivided into slices that represent the
categories.
• It is drawn so that the size of each slice is proportional to the percentage
corresponding to that category.
• For example, because the entire circle is composed of 360 degrees, a
category that contains 25% of the observations is represented by a slice of
the pie that contains 25% of 360 degrees, which is equal to 90 degrees.
• The number of degrees for each category in the work status example is
shown in Table 2.2.
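The slice-size rule can be sketched as below. The percentages are the ones quoted in the interpretation of the work status example, with the four smallest categories combined into one group; that grouping label and the function name are our own.

```python
def slice_degrees(percentage):
    """Degrees of a pie slice for a category holding `percentage` percent."""
    return percentage / 100 * 360


# A category with 25% of the observations gets a 90-degree slice.
quarter = slice_degrees(25)

# Percentages quoted in the interpretation of the work status example.
percentages = {
    "Working full time": 48.5,
    "Retired": 18.1,
    "Keeping house": 10.4,
    "Working part time": 10.8,
    "Other four categories combined": 12.2,
}

degrees = {name: slice_degrees(p) for name, p in percentages.items()}
for name, d in degrees.items():
    print(f"{name:<32} {d:6.1f} degrees")
```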
Pie Chart
Example 2.2: Energy Consumption in the United States in 2015
• Table 2.3 lists the total energy consumption of the United States
from all sources in 2015.
• To make it easier to see the details, the table measures the
energy in quadrillions of British thermal units (BTUs).
Xm02-02
• Use an appropriate graphical technique to depict these figures.
Solution

• We are interested in describing the proportion of total energy
consumption for each source.
• Thus, the appropriate technique is the pie chart.
• The next step is to determine the proportions and sizes of the pie
slices from which the pie chart is drawn.
• The following pie chart was created by Excel.
Interpret

• The United States depends heavily on petroleum, coal, and natural gas.
• More than 80% of national energy use is based on these sources.
• The renewable energy sources amount to less than 9%, of which about a third is
hydroelectric and probably cannot be expanded much further.
• Wind and solar barely appear in the chart.
• See Exercises 2.13 to 2.18 for more information on the subject.
Example 2.3: Beer Consumption (Top 20 Countries)
• Table 2.4 lists the per capita
beer consumption for each
of the top 20 countries
around the world.
• Graphically present these
numbers.

Xm02-03
Solution
• In this example, we are primarily
interested in the numbers.
• There is no use in presenting
proportions here.
• The following is the Excel version of
the bar chart.

Interpret:
The Czech Republic, Ireland, and Germany head the list. Both the United States and the United
Kingdom rank far lower. Surprised?
Describing Ordinal Data
• There are no specific graphical techniques for ordinal data.
• Consequently, when we wish to describe a set of ordinal data, we will
treat the data as if they were nominal.
• The only criterion is that the bars in bar charts should be arranged in
ascending (or descending) order of the ordinal values.
• In pie charts, the wedges are typically arranged clockwise in
ascending or descending order.
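A minimal sketch of a bar chart for ordinal data, drawn as text so that it needs no plotting library. The ten ratings are hypothetical. The point is that the bars follow the ordinal order of the categories rather than alphabetical order or order of frequency.

```python
from collections import Counter

# Ordinal categories in ascending order.
ORDER = ["Poor", "Fair", "Good", "Very good", "Excellent"]

# Hypothetical course evaluations (illustration only).
ratings = [
    "Good", "Excellent", "Fair", "Good", "Very good",
    "Poor", "Good", "Very good", "Excellent", "Good",
]

# Treat the data as nominal: count each category.
frequency = Counter(ratings)

# One bar per category, arranged in ascending ordinal order.
# The length of each bar represents the frequency.
lines = [
    f"{label:<10} {'#' * frequency[label]} ({frequency[label]})"
    for label in ORDER
]
print("\n".join(lines))
```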
Describing Ordinal Data
Xr02-13
Exercise 2.13:
• When will the world run out of oil?
• One way to judge is to determine the oil reserves
of the countries around the world.
• The next table displays the known reserves of
the top 15 countries.
• Graphically describe the figures.

• The total reserves of oil in the world are 1,689,078,618,100 barrels.
• The total reserves of the top 15 countries listed in Exercise 2.13 are
1,563,350,000,000 barrels.
• Use a graphical technique that emphasizes the percentage breakdown of the
top 15 countries.
Describing Ordinal Data
Exercise 2.15
• The following table lists the
average oil consumption per
day of the top 20 oil-consuming
nations.
• Use a graphical technique to
display this information.

• Practice the exercises given in the textbook.
Describing the Relationship between Two Nominal Variables and
Comparing Two or More Nominal Data Sets

• So far, we have presented graphical and tabular techniques used to summarize
a single set of nominal data.
• Techniques applied to single sets of data are called univariate.
• There are many situations where we wish to depict the relationship
between variables; in such cases, bivariate methods are required
• A cross-classification table (also called a cross-tabulation table) is
used to describe the relationship between two nominal variables
Tabular Method of Describing the Relationship
between Two Nominal Variables

• To describe the relationship between two nominal variables, we must
remember that we are permitted only to determine the frequency of
the values.
• As a first step, we need to produce a cross-classification table that
lists the frequency of each combination of the values of the two
variables.
Example 2.4: Newspaper Readership Survey
• A major North American city has four competing
newspapers: the Globe and Mail (G&M), Post, Star, and
Sun.
• To help design advertising campaigns, the advertising
managers of the newspapers need to know which
segments of the newspaper market are reading their
papers.
• A survey was conducted to analyze the relationship
between newspapers read and occupation.
• A sample of newspaper readers was asked to report
which newspaper they read—Globe and Mail (1), Post (2),
Star (3), and Sun (4)—and indicate whether they were
blue-collar workers (1), white-collar workers (2), or
professionals (3).
• Some of the data are listed here.
• Determine whether the two nominal variables are related.
Solution
• By counting the number of times
each of the 12 combinations occurs,
we produced Table 2.5.
• If occupation and newspaper are
related, there will be differences in
the newspapers read among the
occupations. An easy way to see this
is to convert the frequencies in each
row (or column) to relative
frequencies in each row (or column).
That is, compute the row (or column)
totals and divide each frequency by
its row (or column) total, as shown in
Table 2.6. Totals may not equal 1
because of rounding.
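The two steps above (count each combination, then divide by the row total) can be sketched as follows. The 12 survey records are hypothetical and are not the data behind Tables 2.5 and 2.6; only the codes for newspapers and occupations come from the example.

```python
from collections import Counter

NEWSPAPERS = {1: "G&M", 2: "Post", 3: "Star", 4: "Sun"}
OCCUPATIONS = {1: "Blue collar", 2: "White collar", 3: "Professional"}

# Hypothetical (occupation, newspaper) pairs, for illustration only.
survey = [
    (1, 4), (1, 3), (1, 4), (1, 2),
    (2, 1), (2, 2), (2, 1), (2, 3),
    (3, 1), (3, 1), (3, 2), (3, 4),
]

# Cross-classification table: frequency of each of the 12 combinations.
counts = Counter(survey)
row_totals = Counter(occupation for occupation, _ in survey)

# Row relative frequencies: each count divided by its row total.
row_relative = {
    occ: {paper: counts[(occ, paper)] / row_totals[occ] for paper in NEWSPAPERS}
    for occ in OCCUPATIONS
}

print(f"{'':<14}" + "".join(f"{name:>7}" for name in NEWSPAPERS.values()))
for occ, name in OCCUPATIONS.items():
    cells = "".join(f"{row_relative[occ][p]:>7.2f}" for p in NEWSPAPERS)
    print(f"{name:<14}{cells}")
```

If the rows of relative frequencies look alike, the two variables appear unrelated; rows that differ noticeably suggest a relationship.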
Instructions
• The data must be stored in (at least) three
columns as we have done in Xm02-04. Put the
cursor somewhere in the data range.
• 1. Click Insert and PivotTable.
• 2. Make sure that the Table/Range is correct.
Click OK.
• 3. In the PivotTable Fields Click Occupation,
right click and choose Add to Row Labels.
• 4. Click Newspaper, right click and choose Add
to Column Labels.
• 5. Place the cursor in the table and right click
Summarize Values by and click Count.
• 6. To convert to row percentages, right-click any
number, click Show Values As and click % of
rows. We then formatted the data into
decimals.
Interpret
• Notice that the relative
frequencies in the second
and third rows are similar
and that there are large
differences between row 1
and rows 2 and 3.
• This tells us that blue-collar
workers tend to read
different newspapers from
both white-collar workers
and professionals and that
white-collar workers and
professionals are quite
similar in their newspaper
choices.
Graphing the Relationship
between Two Nominal
Variables

• We have chosen to draw three bar charts,
one for each occupation depicting the
four newspapers.
• We’ll use Excel for this purpose. The
manually drawn charts are identical.

If the two variables are unrelated, then the patterns exhibited in the bar charts should be
approximately the same. If some relationship exists, then some bar charts will differ from
others. The graphs tell us the same story as did the table. The shapes of the bar charts for
occupations 2 and 3 (white collar and professional) are very similar. Both differ considerably
from the bar chart for occupation 1 (blue collar).
